[00:03:39] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4043.ulsfo.wmnet with OS trixie [00:03:56] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4047.ulsfo.wmnet with OS trixie [00:04:23] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4047.ulsfo.wmnet with OS trixie [00:10:48] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4040.ulsfo.wmnet with OS trixie [00:11:43] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp4043.ulsfo.wmnet with OS trixie [00:11:53] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4042.ulsfo.wmnet with reason: host reimage [00:13:07] (03PS3) 10RLazarus: mw-*: Update to Envoy 1.35.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251187 (https://phabricator.wikimedia.org/T419637) [00:14:25] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp4040.ulsfo.wmnet [reason: trixie reimaging] [00:15:31] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4042.ulsfo.wmnet with reason: host reimage [00:16:02] (03CR) 10RLazarus: [C:03+2] mw-*: Update to Envoy 1.35.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251187 (https://phabricator.wikimedia.org/T419637) (owner: 10RLazarus) [00:18:19] (03Merged) 10jenkins-bot: mw-*: Update to Envoy 1.35.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251187 (https://phabricator.wikimedia.org/T419637) (owner: 10RLazarus) [00:19:33] scapping the envoy update to mediawikis everywhere [00:21:40] !log rzl@deploy2002 Started scap sync-world: https://gerrit.wikimedia.org/r/1251187 T419637 [00:21:44] T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637 [00:22:38] !log rzl@deploy2002 rzl: https://gerrit.wikimedia.org/r/1251187 T419637 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [00:23:50] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4047.ulsfo.wmnet with reason: host reimage [00:23:56] !log rzl@deploy2002 rzl: Continuing with sync [00:27:24] !log rzl@deploy2002 Finished scap sync-world: https://gerrit.wikimedia.org/r/1251187 T419637 (duration: 07m 12s) [00:27:28] T419637: Upgrade Envoy to v1.35.9 - https://phabricator.wikimedia.org/T419637 [00:30:14] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4047.ulsfo.wmnet with reason: host reimage [00:31:35] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#11704947 (10tstarling) >>! In T353891#11704396, @bd808 wrote: > https://lists.wikimedia.org/postorius/lists/mediawiki-... [00:31:53] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4043.ulsfo.wmnet with reason: host reimage [00:39:10] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4043.ulsfo.wmnet with reason: host reimage [00:39:13] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1251199 [00:39:13] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1251199 (owner: 10TrainBranchBot) [00:41:45] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4042.ulsfo.wmnet with OS trixie [00:42:01] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp4042.ulsfo.wmnet [reason: trixie reimaging] [00:45:11] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp4044.ulsfo.wmnet [reason: trixie reimaging] [00:45:47] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host 
cp4044.ulsfo.wmnet with OS trixie [00:51:34] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1251199 (owner: 10TrainBranchBot) [00:55:08] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4047.ulsfo.wmnet with OS trixie [00:59:48] (03PS1) 10Gerrit Patch Uploader: ptwiki: Enable block action for the abuse filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) [00:59:50] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [01:05:03] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4043.ulsfo.wmnet with OS trixie [01:06:35] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4044.ulsfo.wmnet with reason: host reimage [01:08:59] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp4043.ulsfo.wmnet [reason: trixie reimaging] [01:09:14] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1251201 [01:09:14] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1251201 (owner: 10TrainBranchBot) [01:09:51] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4044.ulsfo.wmnet with reason: host reimage [01:14:18] (03PS3) 10Dzahn: jenkins: add ci::httpd profile to role [puppet] - 10https://gerrit.wikimedia.org/r/1250752 (https://phabricator.wikimedia.org/T418521) [01:17:11] (03PS4) 10Dzahn: jenkins: add ci::httpd profile to role [puppet] - 10https://gerrit.wikimedia.org/r/1250752 (https://phabricator.wikimedia.org/T418521) [01:17:55] (03CR) 10Dzahn: [C:03+2] "unblocked 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1250752" [puppet] - 10https://gerrit.wikimedia.org/r/1250755 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [01:18:14] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1250752/8263/contint2003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1250752 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [01:18:39] (03CR) 10Dzahn: [V:03+1 C:03+2] jenkins: add ci::httpd profile to role [puppet] - 10https://gerrit.wikimedia.org/r/1250752 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [01:22:16] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4047.* [01:22:40] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on contint1003.wikimedia.org with reason: setup [01:23:23] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on contint2003.wikimedia.org with reason: setup [01:24:27] (03CR) 10Dzahn: [V:03+1 C:03+2] "noop on contint prod - setup in progress on new jenkins" [puppet] - 10https://gerrit.wikimedia.org/r/1250752 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [01:26:39] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on contint2003.wikimedia.org with reason: T418521 [01:26:43] T418521: setup 2 contint machines for jenkins - https://phabricator.wikimedia.org/T418521 [01:26:53] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on contint1003.wikimedia.org with reason: T418521 [01:27:50] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1251201 (owner: 10TrainBranchBot) [01:37:03] (03PS1) 10RLazarus: mw-videoscaler: Update to Envoy 1.35.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251203 (https://phabricator.wikimedia.org/T419637) [01:37:20] !log 
contint1003/contint2003 - every time(?) we set up machines with puppet using our httpd module and PHP, and puppet runs for the first time, we run into the same old issue with "Exec[ensure_present_mod_php" failing and "Considering conflict mpm_worker for mpm_prefork". The fix is: 'sudo a2dismod mpm_event' and run puppet again. T418521 [01:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:37:24] T418521: setup 2 contint machines for jenkins - https://phabricator.wikimedia.org/T418521 [01:38:18] (03PS1) 10RLazarus: kubernetes: Set default Envoy version to 1.35.9 [puppet] - 10https://gerrit.wikimedia.org/r/1251204 (https://phabricator.wikimedia.org/T419637) [01:41:27] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4044.ulsfo.wmnet with OS trixie [01:43:26] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11704980 (10RobH) Tech is running late, their dispatcher called me to let me know. 
They were set to be onsite at 7AM, but it will now be closer to 10:30AM / 19:30 Pacific [01:45:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:48:30] (03PS2) 10RLazarus: mw-parsoid: Delete values-canary.yaml and values-migration.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251192 (https://phabricator.wikimedia.org/T386246) [01:48:38] (03PS3) 10Dzahn: jenkins: add proxy_jenkins profile to role [puppet] - 10https://gerrit.wikimedia.org/r/1250748 (https://phabricator.wikimedia.org/T418521) [01:48:56] (03CR) 10Scott French: mw-videoscaler: Update to Envoy 1.35.9 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251203 (https://phabricator.wikimedia.org/T419637) (owner: 10RLazarus) [01:50:10] (03PS2) 10RLazarus: mw-videoscaler: Update to Envoy 1.35.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251203 (https://phabricator.wikimedia.org/T419637) [01:50:19] (03CR) 10Scott French: [C:03+1] kubernetes: Set default Envoy version to 1.35.9 [puppet] - 10https://gerrit.wikimedia.org/r/1251204 (https://phabricator.wikimedia.org/T419637) (owner: 10RLazarus) [01:50:49] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1250748/8264/contint1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1250748 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [01:51:17] (03CR) 10RLazarus: mw-videoscaler: Update to Envoy 1.35.9 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251203 (https://phabricator.wikimedia.org/T419637) (owner: 10RLazarus) [01:53:40] (03CR) 10Scott French: [C:03+1] mw-videoscaler: Update to Envoy 1.35.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251203 (https://phabricator.wikimedia.org/T419637) (owner: 
10RLazarus) [01:54:52] (03CR) 10Dzahn: [V:03+1 C:03+1] "merging this will enable the /ci URL on the domain we host on this and proxy to jenkins. jenkins.discovery.wmnet/ci/ ?" [puppet] - 10https://gerrit.wikimedia.org/r/1250748 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [01:56:24] (03PS4) 10Dzahn: jenkins: add proxy_jenkins profile to role [puppet] - 10https://gerrit.wikimedia.org/r/1250748 (https://phabricator.wikimedia.org/T418521) [02:00:47] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [02:01:27] (03CR) 10Dzahn: "what this adds can be seen in puppet repo (puppet:///modules/contint/apache/proxy_jenkins) or existing contint1002/2002 with "cat /etc/ap" [puppet] - 10https://gerrit.wikimedia.org/r/1250748 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [02:04:59] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [02:08:39] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:06] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 08m 18s) [02:14:59] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [02:15:42] (03PS1) 10Dzahn: jenkins: add envoy and config for jenkins.discovery.wmnet (WIP) [puppet] - 
10https://gerrit.wikimedia.org/r/1251205 (https://phabricator.wikimedia.org/T418521) [02:23:29] 06SRE, 10envoy, 06ServiceOps new, 10ServiceOps-Services-Oids: Upgrade Envoy to v1.35.7 - https://phabricator.wikimedia.org/T410975#11704988 (10RLazarus) 05In progress→03Resolved Resolving; the remaining hosts will go straight to 1.35.9 in T419637 instead. >>! In T410975#11681081, @MLechvien-WMF wr... [02:33:39] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:51:55] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11705018 (10RobH) Tech is onsite and performing the hw power distro board swap on cp5022 [02:57:03] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp4044.ulsfo.wmnet [reason: trixie reimaging] [02:59:59] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1018.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:07:23] (03CR) 10ArielGlenn: "Tiny typo, otherwise looks good." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1251148 (owner: 10Daniel Kinzler) [03:16:14] (03PS1) 10Dzahn: jenkins: enable the jenkins service if using new role [puppet] - 10https://gerrit.wikimedia.org/r/1251208 (https://phabricator.wikimedia.org/T418521) [03:16:49] (03CR) 10CI reject: [V:04-1] jenkins: enable the jenkins service if using new role [puppet] - 10https://gerrit.wikimedia.org/r/1251208 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [03:23:11] (03PS2) 10Dzahn: jenkins: enable the jenkins service if using new role [puppet] - 10https://gerrit.wikimedia.org/r/1251208 (https://phabricator.wikimedia.org/T418521) [03:28:47] (03PS3) 10Dzahn: jenkins: enable the jenkins service if using new role [puppet] - 10https://gerrit.wikimedia.org/r/1251208 (https://phabricator.wikimedia.org/T418521) [03:32:03] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1251208/8267/" [puppet] - 10https://gerrit.wikimedia.org/r/1251208 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [03:55:16] (03CR) 10ArielGlenn: "Except for Claime's changes related to the spec endpoint, it makes sense to me and looks right. I did not attempt to test this however." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1251100 (owner: 10Clément Goubert) [04:26:19] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:27:51] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for SCardenas (WMF) - https://phabricator.wikimedia.org/T419932 (10Scardenasmolinar) 03NEW [05:26:19] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:45:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:48:13] (03CR) 10Dreamrimmer: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) (owner: 10Codename Noreste) [05:57:14] (03CR) 10Dreamrimmer: idwiki: Remove unused user groups on Indonesian Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) (owner: 10Codename Noreste) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260313T0600) [06:04:59] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:14:58] (03CR) 10Anzx: [C:03+1] "recheck" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [06:14:59] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [06:15:45] (03CR) 10CI reject: [V:04-1] ptwiki: Enable block action for the abuse filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [06:17:28] (03CR) 10Anzx: ptwiki: Enable block action for the abuse filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [06:38:11] (03CR) 10Anzx: ptwiki: Enable block action for the abuse filter (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [06:59:59] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1018.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260313T0700) [07:19:51] (03CR) 10Arnaudb: [C:03+2] mailman: update the web frontend firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1251009 (https://phabricator.wikimedia.org/T286066) (owner: 10Arnaudb) [07:23:51] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Put lists.wikimedia.org web interface behind CDN - https://phabricator.wikimedia.org/T286066#11705274 (10ABran-WMF) 05In progress→03Resolved mailman-web is now fully behind CDN: ` ~ $ curl -s https://lists1004.wikimedi... 
[07:54:17] (03PS1) 10Elukey: kserve: Remove caBundle occurrences [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251220 (https://phabricator.wikimedia.org/T419040) [07:55:50] !log installing 6.12.74 on Trixie hosts [07:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:41] !log installing Linux 6.12.74 on Trixie hosts [07:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:47] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11705361 (10ABran-WMF) 05Stalled→03In progress [08:37:33] !log elukey@cumin1003 START - Cookbook sre.kafka.change-confluent-distro-version Change Confluent distribution for Kafka A:kafka-test-eqiad cluster: Change Confluent distribution. [08:41:35] elukey@cumin1003 change-confluent-distro-version (PID 3160738) is awaiting input [08:41:57] (03CR) 10Elukey: [C:03+2] role::kafka::test::broker: move to Confluent Kafka 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/1249940 (https://phabricator.wikimedia.org/T417035) (owner: 10Elukey) [08:46:56] (03CR) 10Silvan Heintze: [C:03+1] "LGTM, tried it out locally - works nicely 👍" [dumps] - 10https://gerrit.wikimedia.org/r/1251169 (https://phabricator.wikimedia.org/T401296) (owner: 10WMDE-leszek) [08:52:25] (03PS10) 10Jcrespo: mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) [08:52:47] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [09:01:18] !log elukey@cumin1003 END (FAIL) - Cookbook sre.kafka.change-confluent-distro-version (exit_code=99) Change Confluent distribution for Kafka A:kafka-test-eqiad cluster: Change Confluent distribution. 
[09:02:32] (03CR) 10Elukey: [C:03+2] confluent: update kafka.sh with kafka-leader-election [puppet] - 10https://gerrit.wikimedia.org/r/1248496 (https://phabricator.wikimedia.org/T416670) (owner: 10Elukey) [09:05:07] (03PS1) 10Daniel Kinzler: rest gateway: allow 250k req/h for CG-NATs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251268 [09:05:21] (03PS1) 10Kevin Bazira: ml-services: add policy-violation-gpt-staging isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251269 (https://phabricator.wikimedia.org/T418350) [09:22:19] (03CR) 10Clément Goubert: rest-gateway: exclude action API `action=cspreport` from rate limiting (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251100 (owner: 10Clément Goubert) [09:25:49] (03PS11) 10Jcrespo: mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) [09:25:58] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [09:27:41] (03CR) 10Clément Goubert: rest-gateway: exclude action API `action=cspreport` from rate limiting (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251100 (owner: 10Clément Goubert) [09:28:21] (03CR) 10Dpogorzelski: [C:03+1] kserve: Remove caBundle occurrences [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251220 (https://phabricator.wikimedia.org/T419040) (owner: 10Elukey) [09:28:55] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host tools-k8s-ctrl1001.eqiad.wmnet [09:29:34] (03CR) 10Elukey: [C:03+2] kserve: Remove caBundle occurrences [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251220 (https://phabricator.wikimedia.org/T419040) (owner: 10Elukey) [09:30:06] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host tools-k8s-ctrl1002.eqiad.wmnet [09:32:44] !log installing Linux 
6.1.164 on Bookworm hosts [09:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:51] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:34:03] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:34:17] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tools-k8s-ctrl1001.eqiad.wmnet [09:34:59] FIRING: [3x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:35:36] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tools-k8s-ctrl1002.eqiad.wmnet [09:35:39] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host tools-k8s-worker1001.eqiad.wmnet [09:36:03] (03CR) 10Clément Goubert: [C:03+1] rest gateway: allow 250k req/h for CG-NATs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251268 (owner: 10Daniel Kinzler) [09:39:06] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:39:13] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:39:20] (03PS1) 10Bartosz Wójtowicz: ml-services: Add CoPE-A-9B experimental deployment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251272 (https://phabricator.wikimedia.org/T418832) [09:40:57] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tools-k8s-worker1001.eqiad.wmnet [09:41:00] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host tools-k8s-worker1002.eqiad.wmnet [09:42:39] (03PS2) 10Bartosz Wójtowicz: ml-services: Add CoPE-A-9B experimental deployment. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1251272 (https://phabricator.wikimedia.org/T418832) [09:44:12] (03CR) 10Effie Mouzeli: [C:03+2] mw-parsoid: Delete values-canary.yaml and values-migration.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251192 (https://phabricator.wikimedia.org/T386246) (owner: 10RLazarus) [09:45:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:46:09] (03Merged) 10jenkins-bot: mw-parsoid: Delete values-canary.yaml and values-migration.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251192 (https://phabricator.wikimedia.org/T386246) (owner: 10RLazarus) [09:46:38] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tools-k8s-worker1002.eqiad.wmnet [09:46:41] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host tools-k8s-worker1003.eqiad.wmnet [09:47:20] (03PS12) 10Jcrespo: mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) [09:49:47] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1251061 (owner: 10L10n-bot) [09:50:37] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [09:50:39] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [09:50:48] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [09:51:55] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tools-k8s-worker1003.eqiad.wmnet [09:51:59] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host tools-k8s-worker1004.eqiad.wmnet [09:57:24] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tools-k8s-worker1004.eqiad.wmnet [09:57:27] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host tools-k8s-worker1005.eqiad.wmnet [09:57:49] (03PS13) 10Jcrespo: mediabackups: Initial puppetization of Versity S3 gateway for replacing minio [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) [09:57:56] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251113 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo) [09:57:57] (03CR) 10Effie Mouzeli: memcached: add memcached restart/reboot cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) (owner: 10Effie Mouzeli) [09:58:51] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1019.eqiad.wmnet with OS bookworm [10:00:46] (03PS24) 10Effie Mouzeli: memcached: add memcached restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) [10:01:27] !log jelto@cumin1003 conftool 
action : set/pooled=no; selector: name=tcp-proxy7001.magru.wmnet [10:01:48] (03CR) 10Blake: mw-web: upsize for single-DC serving (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251045 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake) [10:02:44] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tools-k8s-worker1005.eqiad.wmnet [10:02:48] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host tools-k8s-worker1006.eqiad.wmnet [10:03:52] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host tcp-proxy7001.magru.wmnet [10:04:22] (03PS2) 10Blake: mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251045 (https://phabricator.wikimedia.org/T413974) [10:04:59] FIRING: CoreRouterInterfaceDown: Core router interface down - pfw1-codfw:reth2 (fasw1-f5 2x25G) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=pfw1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:05:52] (03PS3) 10Blake: mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251045 (https://phabricator.wikimedia.org/T413974) [10:06:34] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) (owner: 10Effie Mouzeli) [10:07:01] (03CR) 10Harroyo-wmf: hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250575 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [10:07:48] (03CR) 10Blake: mw-web: upsize for single-DC serving (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251045 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake) [10:07:56] !log 
jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy7001.magru.wmnet [10:08:14] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tools-k8s-worker1006.eqiad.wmnet [10:08:17] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host tools-k8s-worker1007.eqiad.wmnet [10:09:50] !log jelto@cumin1003 conftool action : set/pooled=yes; selector: name=tcp-proxy7001.magru.wmnet [10:12:02] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host tcp-proxy7002.magru.wmnet [10:12:14] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11705635 (10MatthewVernon) @Reedy you did the 1.43 backports (at least according to gerrit), can you have a look at this, please? I c... [10:13:42] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tools-k8s-worker1007.eqiad.wmnet [10:13:46] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host tools-k8s-worker1008.eqiad.wmnet [10:14:18] (03CR) 10Kamila Součková: [C:03+1] rest gateway: allow 250k req/h for CG-NATs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251268 (owner: 10Daniel Kinzler) [10:15:53] !log jayme@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw [10:16:31] !log jayme@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-eqiad [10:16:33] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy7002.magru.wmnet [10:18:19] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host tcp-proxy6002.drmrs.wmnet [10:18:23] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host zuul2002.codfw.wmnet [10:18:37] !log arnaudb@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single 
(exit_code=99) for host zuul2002.codfw.wmnet [10:18:59] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host zuul2002.codfw.wmnet [10:19:03] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tools-k8s-worker1008.eqiad.wmnet [10:22:48] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host zuul2002.codfw.wmnet [10:23:00] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy6002.drmrs.wmnet [10:24:49] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudweb2002-dev.wikimedia.org [10:26:05] (03CR) 10Effie Mouzeli: [C:03+2] memcached: add memcached restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) (owner: 10Effie Mouzeli) [10:26:26] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host tcp-proxy6001.drmrs.wmnet [10:27:35] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [10:27:38] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host zuul1002.eqiad.wmnet [10:27:40] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [10:27:55] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1019.eqiad.wmnet with OS bookworm [10:28:16] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1019.eqiad.wmnet with OS bookworm [10:28:24] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [10:28:29] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [10:29:28] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [10:29:36] !log brouberol@deploy2002 
helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [10:29:45] (03PS2) 10Brouberol: kafka-mirrormaker: allow multiple releases to be installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250631 (https://phabricator.wikimedia.org/T417407) [10:31:04] (03Merged) 10jenkins-bot: memcached: add memcached restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1211089 (https://phabricator.wikimedia.org/T408925) (owner: 10Effie Mouzeli) [10:31:07] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy6001.drmrs.wmnet [10:31:35] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host zuul1002.eqiad.wmnet [10:31:40] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb2002-dev.wikimedia.org [10:32:54] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host tcp-proxy5002.eqsin.wmnet [10:33:20] PROBLEM - Host db1258 #page is DOWN: PING CRITICAL - Packet loss = 100% [10:33:20] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host zuul2001.codfw.wmnet [10:34:27] Here [10:34:52] Oh it's not my shift yet :D [10:35:30] I'm sorta here [10:35:35] (at a hackathon) [10:35:41] let me ack the page [10:35:48] we should depool the server [10:37:19] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depool', diff saved to https://phabricator.wikimedia.org/P89852 and previous config saved to /var/cache/conftool/dbconfig/20260313-103719-ladsgroup.json [10:37:21] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host zuul2001.codfw.wmnet [10:37:24] (03PS1) 10Majavah: apt: Add keyfile for debian-debug/backports [puppet] - 10https://gerrit.wikimedia.org/r/1251275 [10:37:26] I'm ooto today but I can take a look [10:38:08] <_joe_> there's people oncall, they can look [10:38:24] I go back to the hackathon [10:38:27] please use 
-sre, this is too noisy [10:38:28] <_joe_> it's a single server down :) [10:38:39] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:39:19] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8268/co" [puppet] - 10https://gerrit.wikimedia.org/r/1251275 (owner: 10Majavah) [10:39:59] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy5002.eqsin.wmnet [10:40:13] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host tcp-proxy5001.eqsin.wmnet [10:43:12] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11705753 (10TheDJ) [10:44:33] (03PS1) 10Kgraessle: Fix broken survey links on PersonalDashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251276 (https://phabricator.wikimedia.org/T419950) [10:45:13] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy5001.eqsin.wmnet [10:45:38] (03PS2) 10Kgraessle: Fix broken survey links on PersonalDashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251276 (https://phabricator.wikimedia.org/T419950) [10:45:51] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host tcp-proxy4002.ulsfo.wmnet [10:46:01] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1019.eqiad.wmnet with reason: host reimage [10:47:40] RECOVERY - Host db1258 #page is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [10:48:37] (03CR) 10Jcrespo: "Unsure if debug and backports are intended to be there, I commented on 
ticket we may want to disable them by default, moritz to say." [puppet] - 10https://gerrit.wikimedia.org/r/1251275 (owner: 10Majavah) [10:49:20] PROBLEM - Host db1258 #page is DOWN: PING CRITICAL - Packet loss = 100% [10:50:18] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1019.eqiad.wmnet with reason: host reimage [10:50:19] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy4002.ulsfo.wmnet [10:50:41] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host tcp-proxy4001.ulsfo.wmnet [10:50:46] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1018.eqiad.wmnet with OS bookworm [10:50:49] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-eqiad [10:52:17] !log jayme@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-master-eqiad [10:52:38] (03PS1) 10Tiziano Fogli: slothslos: add inject-labels plugin to the Sloth command line [puppet] - 10https://gerrit.wikimedia.org/r/1251279 (https://phabricator.wikimedia.org/T414579) [10:52:40] RECOVERY - Host db1258 #page is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [10:54:00] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudbackup1001-dev.eqiad.wmnet [10:55:05] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-codfw [10:55:19] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy4001.ulsfo.wmnet [10:56:13] !log jayme@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-master-codfw [10:56:49] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host tcp-proxy3002.esams.wmnet [10:57:37] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
cloudbackup1001-dev.eqiad.wmnet [10:57:41] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudbackup1002-dev.eqiad.wmnet [10:59:15] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 22:00:00 on db1258.eqiad.wmnet with reason: depooled, likely to flap over the weekend [10:59:59] FIRING: KubernetesCalicoDown: dse-k8s-worker1018.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1018.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260313T0700) [11:00:05] jelto, arnoldokoth, mutante, and arnaudb: #bothumor My software never has bugs. It just develops random features. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260313T1100). 
[11:01:28] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy3002.esams.wmnet [11:01:28] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1002-dev.eqiad.wmnet [11:01:32] !log filippo@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1008-dev.eqiad.wmnet [11:01:44] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host tcp-proxy3001.esams.wmnet [11:02:44] (CR) Effie Mouzeli: [C:+1] mcrounter: Run spec tests on bookworm [puppet] - https://gerrit.wikimedia.org/r/1248768 (owner: Muehlenhoff) [11:05:39] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1018.eqiad.wmnet with reason: host reimage [11:06:25] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy3001.esams.wmnet [11:07:03] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host tcp-proxy2002.codfw.wmnet [11:08:17] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1008-dev.eqiad.wmnet [11:08:22] (PS1) Cathal Mooney: cmooney: remove temp.
ssh key [homer/public] - https://gerrit.wikimedia.org/r/1251284 [11:08:32] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host zuul1001.eqiad.wmnet [11:08:59] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [11:09:06] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-master-eqiad [11:09:22] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [11:09:22] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1019.eqiad.wmnet with OS bookworm [11:09:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1018.eqiad.wmnet with reason: host reimage [11:10:48] (PS1) Vgutierrez: cache::haproxy: Ensure that lua files get deployed before starting [puppet] - https://gerrit.wikimedia.org/r/1251285 [11:10:59] (PS1) Cathal Mooney: data.yaml: remove temp.
ssh key for user cmooney [puppet] - https://gerrit.wikimedia.org/r/1251286 [11:11:00] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:11:08] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:11:08] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:11:30] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy2002.codfw.wmnet [11:11:37] (PS2) Vgutierrez: cache::haproxy: Ensure that lua files get deployed before starting [puppet] - https://gerrit.wikimedia.org/r/1251285 [11:11:59] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host tcp-proxy2001.codfw.wmnet [11:12:10] FIRING: [2x] BFDdown: BFD session down between cr2-drmrs and 2620:0:860:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:12:25] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host zuul1001.eqiad.wmnet [11:12:53] ops-eqiad, Data-Persistence, DBA, DC-Ops, Sustainability (Incident Followup): db1258 went down at 10:43Z - https://phabricator.wikimedia.org/T419958#11705964 (jcrespo) [11:12:58] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-master-codfw [11:12:59] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1251285 (owner: Vgutierrez) [11:13:00] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:13:08] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:13:08] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:13:39] FIRING: CoreBGPDown: Core BGP session down between cr2-drmrs and cr2-eqdfw (2620:0:860:fe0a::1) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr2-drmrs:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:13:40] (03PS1) 10Daniel Kinzler: rest-gateway: do not limit pre-flight requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251289 (https://phabricator.wikimedia.org/T418969) [11:14:01] 10ops-eqiad, 06Data-Persistence, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1258 went down at 10:43Z - https://phabricator.wikimedia.org/T419958#11705972 (10jcrespo) [11:14:09] 10ops-eqiad, 06Data-Persistence, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1258 went down at 10:43Z - https://phabricator.wikimedia.org/T419958#11705975 (10jcrespo) [11:15:27] (03CR) 10Effie Mouzeli: role::mediawiki::memcached::wikifunctions: add new role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1251059 (https://phabricator.wikimedia.org/T419831) (owner: 10Effie Mouzeli) [11:16:00] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:16:08] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 5/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:16:08] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:16:20] (03PS2) 10Effie Mouzeli: 
role::mediawiki::memcached::wikifunctions: add new role [puppet] - 10https://gerrit.wikimedia.org/r/1251059 (https://phabricator.wikimedia.org/T419831) [11:16:22] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host contint1003.wikimedia.org [11:16:25] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy2001.codfw.wmnet [11:16:51] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host tcp-proxy1002.eqiad.wmnet [11:17:10] FIRING: [4x] BFDdown: BFD session down between cr2-drmrs and 2620:0:860:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:18:00] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:18:03] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: do not limit pre-flight requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251289 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [11:18:08] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:18:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr2-drmrs and cr2-eqdfw (208.80.153.204) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:18:41] 10ops-eqiad, 06Data-Persistence, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1258 went down at 10:43Z - https://phabricator.wikimedia.org/T419958#11705982 (10jcrespo) p:05Triage→03Medium [11:19:40] (03PS3) 10Vgutierrez: cache::haproxy: Ensure that lua files get deployed before starting [puppet] - 10https://gerrit.wikimedia.org/r/1251285 [11:20:08] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:20:08] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:20:08] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:20:14] 10ops-eqiad, 06Data-Persistence, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1258 connection went down at 10:43Z - https://phabricator.wikimedia.org/T419958#11705994 (10jcrespo) [11:21:22] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy1002.eqiad.wmnet [11:21:49] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host tcp-proxy1001.eqiad.wmnet [11:21:59] !log arnaudb@cumin1003 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host contint1003.wikimedia.org [11:22:10] RESOLVED: [3x] BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:23:08] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:23:08] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:23:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr2-drmrs and cr2-eqdfw (208.80.153.204) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:26:06] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy1001.eqiad.wmnet [11:26:08] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:26:08] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:26:41] (03CR) 10Effie Mouzeli: role::mediawiki::memcached::wikifunctions: add new role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251059 (https://phabricator.wikimedia.org/T419831) (owner: 10Effie Mouzeli) [11:27:12] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [11:27:53] (03PS3) 10Effie Mouzeli: role::mediawiki::memcached::wikifunctions: add new role [puppet] - 10https://gerrit.wikimedia.org/r/1251059 (https://phabricator.wikimedia.org/T419831) [11:28:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-drmrs and cr2-eqdfw (208.80.153.204) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr2-drmrs:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:28:44] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-backup1003.eqiad.wmnet [11:29:06] (03PS4) 10Effie Mouzeli: role::mediawiki::memcached::wikifunctions: add new role [puppet] - 10https://gerrit.wikimedia.org/r/1251059 (https://phabricator.wikimedia.org/T419831) [11:30:08] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:30:08] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:30:16] btullis@cumin1003 reimage (PID 3193047) is awaiting input [11:30:20] !log jynus@cumin1003 START - 
Cookbook sre.hosts.reboot-single for host ms-backup2003.codfw.wmnet [11:32:19] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [11:32:19] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1018.eqiad.wmnet with OS bookworm [11:32:25] FIRING: [4x] BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:32:40] RESOLVED: [4x] BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:33:50] (03PS1) 10Clément Goubert: rest-gateway: More action API ratelimit exclusion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251291 [11:34:08] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:34:08] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:34:18] (03PS1) 10Btullis: Revert "Temporarily puto dse-k8s-worker101[8-9] into insetup mode" [puppet] - 10https://gerrit.wikimedia.org/r/1251292 [11:34:43] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-backup1003.eqiad.wmnet [11:36:07] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-backup2003.codfw.wmnet [11:36:08] (03PS1) 10Btullis: Update HaproxyKafkaNoMessages for team-data-engineering [alerts] - 10https://gerrit.wikimedia.org/r/1251293 (https://phabricator.wikimedia.org/T419829) [11:36:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-drmrs 
(2620:0:860:fe0a::2) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_drmrs&var-bgp_neighbor=cr2-drmrs - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:37:08] (03CR) 10Btullis: [C:03+2] Revert "Temporarily puto dse-k8s-worker101[8-9] into insetup mode" [puppet] - 10https://gerrit.wikimedia.org/r/1251292 (owner: 10Btullis) [11:37:08] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:37:10] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:37:24] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-backup1004.eqiad.wmnet [11:38:18] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251285 (owner: 10Vgutierrez) [11:38:57] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: db1258 connection went down at 10:43Z - https://phabricator.wikimedia.org/T419958#11706061 (10jcrespo) [11:39:08] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:39:08] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:39:44] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: db1258 connection went down at 10:43Z - https://phabricator.wikimedia.org/T419958#11706062 (10jcrespo) [11:41:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-drmrs (2620:0:860:fe0a::2) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_drmrs&var-bgp_neighbor=cr2-drmrs - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:42:08] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:42:08] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:42:25] FIRING: [3x] BFDdown: BFD session down between cr2-drmrs and 208.80.153.204 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:43:08] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-backup1004.eqiad.wmnet [11:43:12] (03PS2) 10Codename Noreste: idwiki: Remove unused user groups on Indonesian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251193 (https://phabricator.wikimedia.org/T419105) [11:43:26] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-backup2004.codfw.wmnet [11:44:02] (03CR) 10Blake: "Done" [cookbooks] - 10https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) (owner: 10Blake) [11:44:20] (03CR) 10Daniel Kinzler: rest-gateway: More action API ratelimit exclusion (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251291 (owner: 10Clément Goubert) [11:44:52] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar: Add --min-uptime too cookbooks - https://phabricator.wikimedia.org/T419967#11706077 (10MoritzMuehlenhoff) [11:45:13] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1251286 (owner: 10Cathal Mooney) [11:45:41] (03CR) 10Cathal Mooney: [C:03+2] data.yaml: remove temp. 
ssh key for user cmooney [puppet] - 10https://gerrit.wikimedia.org/r/1251286 (owner: 10Cathal Mooney) [11:46:08] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:46:08] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:46:48] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1251059 (https://phabricator.wikimedia.org/T419831) (owner: 10Effie Mouzeli) [11:47:23] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: do not limit pre-flight requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251289 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [11:47:25] RESOLVED: [2x] BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:48:09] (03PS5) 10Effie Mouzeli: role::mediawiki::memcached::wikifunctions: add new role [puppet] - 10https://gerrit.wikimedia.org/r/1251059 (https://phabricator.wikimedia.org/T419831) [11:48:13] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1251059 (https://phabricator.wikimedia.org/T419831) (owner: 10Effie Mouzeli) [11:48:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-drmrs and cr2-eqdfw (208.80.153.204) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:49:08] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:49:08] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:49:12] !log 
jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-backup2004.codfw.wmnet [11:49:52] (03Merged) 10jenkins-bot: rest-gateway: do not limit pre-flight requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251289 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [11:50:27] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup1016.eqiad.wmnet [11:51:14] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:51:19] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:51:35] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: db1258 connection went down at 10:43Z - https://phabricator.wikimedia.org/T419958#11706089 (10Jclark-ctr) a:03Jclark-ctr [11:53:10] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:53:10] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:53:12] (03PS2) 10Majavah: apt: Add keyfile for debian-debug/backports [puppet] - 10https://gerrit.wikimedia.org/r/1251275 (https://phabricator.wikimedia.org/T419957) [11:53:39] RESOLVED: [3x] CoreBGPDown: Core BGP session down between cr2-drmrs and cr2-eqdfw (208.80.153.204) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:53:51] (03PS1) 10Clément Goubert: api-gateway: Bump Chart.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251295 [11:54:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and 2620:0:860:fe0a::2 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:54:11] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1018.eqiad.wmnet [11:54:20] (03CR) 10Daniel Kinzler: [C:03+2] api-gateway: Bump Chart.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251295 (owner: 10Clément Goubert) [11:54:37] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Add CoPE-A-9B experimental deployment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251272 (https://phabricator.wikimedia.org/T418832) (owner: 10Bartosz Wójtowicz) [11:54:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-drmrs (2620:0:860:fe0a::2) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_drmrs&var-bgp_neighbor=cr2-drmrs - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:54:45] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1019.eqiad.wmnet [11:55:27] (03PS2) 10Daniel Kinzler: rest gateway: make no-limit policy bypass rate limits. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251148 [11:55:52] (03PS2) 10Gerrit Patch Uploader: ptwiki: Enable block action for the abuse filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) [11:55:53] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [11:56:04] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Add CoPE-A-9B experimental deployment. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1251272 (https://phabricator.wikimedia.org/T418832) (owner: 10Bartosz Wójtowicz) [11:56:07] (03CR) 10Daniel Kinzler: rest-gateway: exclude action API `action=cspreport` from rate limiting (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251100 (owner: 10Clément Goubert) [11:56:37] (03Merged) 10jenkins-bot: api-gateway: Bump Chart.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251295 (owner: 10Clément Goubert) [11:58:17] (03Merged) 10jenkins-bot: ml-services: Add CoPE-A-9B experimental deployment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251272 (https://phabricator.wikimedia.org/T418832) (owner: 10Bartosz Wójtowicz) [11:58:47] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: db1258 connection went down at 10:43Z - https://phabricator.wikimedia.org/T419958#11706097 (10Jclark-ctr) I have replaced the cable and the SFP-T. This could be a good candidate to start with swapping over to 10G, since this server already has a... 
[11:58:54] RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-drmrs (2620:0:860:fe0a::2) - group Confed_drmrs - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_drmrs&var-bgp_neighbor=cr2-drmrs - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:59:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:59:14] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:59:36] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:59:37] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: db1258 connection went down at 10:43Z - https://phabricator.wikimedia.org/T419958#11706098 (10Jclark-ctr) 05Open→03Resolved @MatthewVernon please open new ticket if you would like to look at upgrading to 10g on this server but it has bee... 
[11:59:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1018.eqiad.wmnet [12:01:05] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1019.eqiad.wmnet [12:01:19] !log jynus@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host backup1016.eqiad.wmnet [12:02:20] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:02:47] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:03:49] !log aokoth@cumin1003 START - Cookbook sre.hosts.reboot-single for host vrts2002.codfw.wmnet [12:06:08] (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251298 (https://phabricator.wikimedia.org/T360794) [12:07:13] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:07:28] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:10:19] !log aokoth@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts2002.codfw.wmnet [12:10:24] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: reboot [12:10:49] !log aokoth@cumin1003 START - Cookbook sre.hosts.reboot-single for host aphlict2001.codfw.wmnet [12:11:41] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: allow 250k req/h for CG-NATs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251268 (owner: 10Daniel Kinzler) [12:13:16] !log aokoth@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aphlict2001.codfw.wmnet [12:13:24] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: db1258 connection went down at 10:43Z - https://phabricator.wikimedia.org/T419958#11706142 (10jcrespo) ^ @Jclark-ctr Matthew (and Effie) were only the people on call. 
This should be directed to the owners of the service, the DBAs: @Ladsgroup... [12:13:54] (03Merged) 10jenkins-bot: rest gateway: allow 250k req/h for CG-NATs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251268 (owner: 10Daniel Kinzler) [12:14:20] !log aokoth@cumin1003 START - Cookbook sre.hosts.reboot-single for host doc1004.eqiad.wmnet [12:15:03] (03PS3) 10Clément Goubert: rest gateway: make no-limit policy bypass rate limits. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251148 (owner: 10Daniel Kinzler) [12:15:08] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:15:41] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:15:45] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup2015.codfw.wmnet [12:17:11] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:17:32] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:18:09] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup1017.eqiad.wmnet [12:18:17] !log aokoth@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host doc1004.eqiad.wmnet [12:19:25] (03PS6) 10Clément Goubert: rest-gateway: exclude action API `action=cspreport` from rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251100 [12:19:38] (03PS7) 10Clément Goubert: rest-gateway: exclude action API `action=cspreport` from rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251100 [12:20:27] (03PS2) 10Clément Goubert: rest-gateway: More action API ratelimit exclusion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251291 [12:24:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast1004.wikimedia.org [12:26:15] (03CR) 10Clément Goubert: [C:03+1] rest gateway: make no-limit policy bypass rate limits. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1251148 (owner: 10Daniel Kinzler) [12:26:33] (03PS4) 10Clément Goubert: rest gateway: make no-limit policy bypass rate limits. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251148 (owner: 10Daniel Kinzler) [12:26:56] (03PS3) 10Clément Goubert: rest-gateway: More action API ratelimit exclusion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251291 [12:27:03] !log jynus@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host backup2015.codfw.wmnet [12:28:14] (03CR) 10AKhatun: [C:03+1] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251298 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [12:28:19] (03PS8) 10Clément Goubert: rest-gateway: exclude action API `action=cspreport` from rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251100 [12:28:23] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup2016.codfw.wmnet [12:28:28] (03PS4) 10Clément Goubert: rest-gateway: More action API ratelimit exclusion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251291 [12:28:50] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: make no-limit policy bypass rate limits. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251148 (owner: 10Daniel Kinzler) [12:29:14] !log jynus@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host backup1017.eqiad.wmnet [12:29:31] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup1018.eqiad.wmnet [12:30:25] ^My reboots are nominally failing because they don't recover all alerts, but they are succeeding at being rebooted (hosts are not fully set up yet) [12:31:10] PROBLEM - Host pki1002 is DOWN: PING CRITICAL - Packet loss = 100% [12:31:21] (03Merged) 10jenkins-bot: rest gateway: make no-limit policy bypass rate limits. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1251148 (owner: 10Daniel Kinzler) [12:31:52] !log jelto@cumin1003 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner [12:32:08] FIRING: [11x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_cloud_wmnet_ca_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:33:43] (03PS9) 10Clément Goubert: rest-gateway: exclude action API `action=cspreport` from rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251100 [12:33:52] (03PS5) 10Clément Goubert: rest-gateway: More action API ratelimit exclusion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251291 [12:34:25] (03CR) 10Daniel Kinzler: [C:03+1] rest-gateway: exclude action API `action=cspreport` from rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251100 (owner: 10Clément Goubert) [12:35:20] (03PS6) 10Clément Goubert: rest-gateway: More action API ratelimit exclusion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251291 [12:36:00] (03CR) 10Clément Goubert: rest-gateway: More action API ratelimit exclusion (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251291 (owner: 10Clément Goubert) [12:37:08] FIRING: [44x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:37:37] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: exclude action API `action=cspreport` from rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251100 (owner: 10Clément Goubert) [12:37:46] (03CR) 10Daniel Kinzler: [C:03+1] rest-gateway: More action API 
ratelimit exclusion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251291 (owner: 10Clément Goubert) [12:38:39] FIRING: JobUnavailable: Reduced availability for job cfssl in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:39:19] !log jynus@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host backup2016.codfw.wmnet [12:39:53] (03Merged) 10jenkins-bot: rest-gateway: exclude action API `action=cspreport` from rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251100 (owner: 10Clément Goubert) [12:40:27] !log jynus@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host backup1018.eqiad.wmnet [12:40:45] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: More action API ratelimit exclusion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251291 (owner: 10Clément Goubert) [12:42:35] !log bwojtowicz@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[12:42:47] (03Merged) 10jenkins-bot: rest-gateway: More action API ratelimit exclusion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251291 (owner: 10Clément Goubert) [12:43:18] PROBLEM - Host phab1005 is DOWN: PING CRITICAL - Packet loss = 100% [12:43:26] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup1019.eqiad.wmnet [12:43:36] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup2017.codfw.wmnet [12:44:02] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - Status - issue on wdqs1025:9290 - https://phabricator.wikimedia.org/T419664#11706309 (10Jclark-ctr) 05Open→03Resolved replaced Failed Power supply With Dell Rma [12:44:08] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:44:43] !log rebooted phab1005 - waiting for it to come back [12:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:51] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:46:44] RECOVERY - Host phab1005 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [12:47:41] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:48:17] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:50:11] very weird, pki1002 is down, I cannot ssh to it, checking [12:50:32] !log powercycle pki1002 [12:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:37] elukey: see -sre [12:50:43] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:51:16] moritzm: :( [12:53:17] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:53:40] RECOVERY - Host pki1002 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [12:53:56] elukey: powercycle seems to have helped, but maybe we should actually drop it from wikikube-staging and reprovision it to UEFI? 
[12:54:02] I need an emergency deploy for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1251276 -- context is https://phabricator.wikimedia.org/T419950, are SRE ok with a deployment? (cc: thcipriani brennen effie, Emperor). I have someone to deploy (me). [12:54:09] the specific error from SEL is BIOS specific... [12:54:10] !log jynus@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host backup1019.eqiad.wmnet [12:54:45] !log jynus@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host backup2017.codfw.wmnet [12:54:45] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup1020.eqiad.wmnet [12:54:45] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup2018.codfw.wmnet [12:55:40] moritzm: it should already be UEFI, but Dell calls BIOS-like options like that. Not sure if it is an indication of an actual legacy failure [12:57:04] katherine_g: o/ what is the current impact to external users? It is not super clear from the task. The question is - would it be ok to deploy on Monday instead? If not, why? 
[12:57:08] RESOLVED: [44x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_aux_front_proxy_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:58:07] elukey: ah, you're totally right [12:58:08] katherine_g: elukey it would be great if we'd move the conversation to either -sre or -serviceops [12:58:25] FIRING: [23x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:58:39] RESOLVED: JobUnavailable: Reduced availability for job cfssl in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:59:49] (03CR) 10Milimetric: [C:03+1] Add stream config for attribution research (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250249 (https://phabricator.wikimedia.org/T417050) (owner: 10TChin) [13:00:36] effie: seems fine in here too for visibility, but ok for me. I think it should be just a quick sync before allowing it [13:03:33] we do tell devs to ask in -operations :D [13:04:02] otherwise same question as e.lukey katherine_g, what's the user impact? 
[13:04:12] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:05:10] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:05:15] !log jynus@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host backup1020.eqiad.wmnet [13:05:22] there are 2 conversations going on in parallel and bot traffic, it is not easy to keep up [13:05:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:30] !log jynus@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host backup2018.codfw.wmnet [13:05:35] but it could be just me [13:05:53] elukey: claime: just updated the ticket with screenshots, basically we went live with Personal Dashboard last night and are encouraging users to take a survey that doesn't work- this change would add the survey that's working [13:08:59] (03PS1) 10Aude: Set wgReadingListsBetaDefaultForNewAccountsAfter for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251309 (https://phabricator.wikimedia.org/T419163) [13:11:31] !log jelto@cumin1003 END (PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling reboot on A:gitlab-runner [13:12:29] (03PS2) 10Aude: Set wgReadingListsBetaDefaultForNewAccountsAfter for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251309 (https://phabricator.wikimedia.org/T419163) [13:12:41] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup2019.codfw.wmnet [13:13:13] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org [13:13:39] !log jynus@cumin1003 START - 
Cookbook sre.hosts.reboot-single for host backup2020.codfw.wmnet [13:14:49] katherine_g: one follow up question - I see that the survey is out for 4 pilot wikis, are those high traffic ones? Moreover, are you expecting a lot of users to take the survey over the weekend? [13:18:52] my reading of this issue is that it doesn't impact a lot of users, and it can probably be deployed on Monday within a proper deployment window. [13:18:56] (03CR) 10Muehlenhoff: [C:03+2] Rebuild against latest package versions in Bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1251091 (owner: 10Muehlenhoff) [13:19:05] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org [13:19:21] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org [13:20:25] elukey: medium-sized wikis: id.wiki, tr.wiki, th.wiki, simple.wiki. We can wait until monday- thanks [13:22:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251276 (https://phabricator.wikimedia.org/T419950) (owner: 10Kgraessle) [13:23:58] !log jynus@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host backup2019.codfw.wmnet [13:24:30] (03PS2) 10Majavah: P:acme_chief: cloud: require package for config file [puppet] - 10https://gerrit.wikimedia.org/r/977044 [13:24:38] !log jynus@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host backup2020.codfw.wmnet [13:24:44] (03CR) 10Majavah: P:acme_chief: cloud: require package for config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977044 (owner: 10Majavah) [13:25:15] (03CR) 10CI reject: [V:04-1] P:acme_chief: cloud: require package for config file [puppet] - 10https://gerrit.wikimedia.org/r/977044 (owner: 10Majavah) [13:26:00] !log 
jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org [13:26:16] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab2003.wikimedia.org [13:26:37] (03PS3) 10Majavah: P:acme_chief: cloud: require package for config file [puppet] - 10https://gerrit.wikimedia.org/r/977044 [13:28:01] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251298 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [13:30:10] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251298 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [13:30:45] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host gerrit2002.wikimedia.org [13:32:25] (03PS1) 10Muehlenhoff: Record LDAP access for aputhin [puppet] - 10https://gerrit.wikimedia.org/r/1251316 [13:32:28] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2003.wikimedia.org [13:33:17] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host etherpad2002.codfw.wmnet [13:33:40] (03CR) 10A smart kitten: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [13:33:51] katherine_g: thanks a lot for explaining! 
[13:33:54] really appreciated [13:34:09] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for aputhin [puppet] - 10https://gerrit.wikimedia.org/r/1251316 (owner: 10Muehlenhoff) [13:34:11] elukey: np [13:34:59] FIRING: [2x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:35:00] (03CR) 10CI reject: [V:04-1] ptwiki: Enable block action for the abuse filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [13:35:41] (03PS1) 10Gerrit Patch Uploader: ptwiki: Enable block action for the abuse filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251318 (https://phabricator.wikimedia.org/T419312) [13:35:43] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251318 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [13:35:47] The helm release bad status will be fixed on monday :) [13:36:34] (03CR) 10A smart kitten: ptwiki: Enable block action for the abuse filter (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [13:36:43] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gerrit2002.wikimedia.org [13:37:16] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host etherpad2002.codfw.wmnet [13:42:23] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host etherpad1004.eqiad.wmnet [13:42:46] !log aokoth@cumin1003 START - Cookbook sre.hosts.reboot-single for host lists2001.wikimedia.org [13:44:13] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host 
vrts1004.eqiad.wmnet [13:45:07] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11706596 (10cmooney) @papaul please tell them to keep the case low as they have not yet fixed it [13:45:46] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:45:57] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [13:46:20] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host etherpad1004.eqiad.wmnet [13:48:07] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1004.eqiad.wmnet [13:49:05] !log aokoth@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lists2001.wikimedia.org [13:51:31] (03PS3) 10Gerrit Patch Uploader: ptwiki: Enable block action for the abuse filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) [13:51:32] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [13:53:05] !log arnaudb@cumin1003 START - Cookbook sre.hosts.reboot-single for host gerrit1003.wikimedia.org [13:55:46] (03CR) 10A smart kitten: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [13:56:53] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 3 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11706659 (10ABran-WMF) [13:58:25] RESOLVED: [23x] SystemdUnitFailed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service on pki1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:59:18] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gerrit1003.wikimedia.org [14:01:20] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudbackup1003.eqiad.wmnet [14:03:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11706688 (10Jclark-ctr) [14:04:26] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11706714 (10Jclark-ctr) [14:04:39] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11706718 (10Jclark-ctr) [14:04:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11706719 (10Jclark-ctr) [14:06:21] (03Abandoned) 10Majavah: ptwiki: Enable block action for the abuse filter 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251318 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [14:09:40] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1003.eqiad.wmnet [14:13:15] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudbackup1004.eqiad.wmnet [14:14:17] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudbackup2003.codfw.wmnet [14:19:44] (03PS2) 10Btullis: Update HaproxyKafkaNoMessages for team-data-engineering [alerts] - 10https://gerrit.wikimedia.org/r/1251293 (https://phabricator.wikimedia.org/T419829) [14:20:14] (03PS3) 10Btullis: Update HaproxyKafkaNoMessages for team-data-engineering [alerts] - 10https://gerrit.wikimedia.org/r/1251293 (https://phabricator.wikimedia.org/T419829) [14:22:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet [14:22:16] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1004.eqiad.wmnet [14:22:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [14:23:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cuminunpriv1001.eqiad.wmnet [14:24:49] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:25:18] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich-next: apply [14:25:29] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [14:25:42] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup2003.codfw.wmnet [14:27:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cuminunpriv1001.eqiad.wmnet [14:27:12] (03PS3) 10Brouberol: kafka-mirrormaker: allow 
multiple releases to be installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250631 (https://phabricator.wikimedia.org/T417407) [14:27:28] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudbackup2004.codfw.wmnet [14:28:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [14:28:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2002.codfw.wmnet [14:28:33] (03CR) 10Brouberol: kafka-mirrormaker: allow multiple releases to be installed in the same namespace (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250631 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [14:29:03] (03PS4) 10Brouberol: kafka-mirrormaker: allow multiple releases to be installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250631 (https://phabricator.wikimedia.org/T417407) [14:29:35] (03PS5) 10Brouberol: kafka-mirrormaker: allow multiple releases to be installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250631 (https://phabricator.wikimedia.org/T417407) [14:29:37] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt - jclark@cumin1003" [14:29:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt - jclark@cumin1003" [14:29:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:29:59] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:30:17] (03PS6) 10Brouberol: kafka-mirrormaker: allow multiple releases to be installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250631 (https://phabricator.wikimedia.org/T417407) [14:31:05] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:31:35] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1373.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:31:37] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1033.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:31:40] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1251285 (owner: 10Vgutierrez) [14:31:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2003.codfw.wmnet [14:32:05] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1034.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:32:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2003.codfw.wmnet [14:32:26] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1035.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:32:35] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wdqs1035.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:33:25] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet [14:33:33] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1035.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:33:39] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:33:52] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:35:46] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1013.eqiad.wmnet with reason: Rebooting clouddb1013 T419960 [14:35:55] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1020 [14:36:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2003.codfw.wmnet [14:36:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [14:37:01] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1020 [14:37:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11706985 (10Jclark-ctr) @elukey i am unable to provision these it is a brand new Supermicro Model. 
keeps failing [14:37:21] (03CR) 10Anzx: [C:03+1] ptwiki: Enable block action for the abuse filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [14:37:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11706986 (10Jclark-ctr) [14:38:42] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11706992 (10Jclark-ctr) [14:38:46] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2005-dev.codfw.wmnet [14:39:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11706995 (10Jclark-ctr) [14:39:43] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1021 [14:39:51] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup2004.codfw.wmnet [14:40:02] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1021 [14:40:13] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1022 [14:40:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1022 [14:40:26] !log jclark@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1023 [14:40:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1023 [14:40:49] (03PS2) 10Herron: slothslos: add inject-labels plugin to the Sloth command line [puppet] - 10https://gerrit.wikimedia.org/r/1251279 
(https://phabricator.wikimedia.org/T414579) (owner: 10Tiziano Fogli) [14:42:38] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1033.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:43:10] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1034.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:43:14] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup1015.eqiad.wmnet [14:44:14] (03CR) 10Herron: [C:03+2] slothslos: add inject-labels plugin to the Sloth command line [puppet] - 10https://gerrit.wikimedia.org/r/1251279 (https://phabricator.wikimedia.org/T414579) (owner: 10Tiziano Fogli) [14:44:23] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1373.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:45:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [14:45:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2003.codfw.wmnet [14:46:21] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol2005-dev.codfw.wmnet [14:46:22] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1035.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:48:39] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup1015.eqiad.wmnet [14:50:36] (03PS1) 10Daniel Kinzler: rest-gateway: add IPs to list of CGNAT addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251374 [14:52:22] (03PS5) 10Elukey: sre.hosts.provision: allow no-pxe settings for NIC on Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1249973 (https://phabricator.wikimedia.org/T400626) [14:52:57] (03CR) 10Elukey: sre.hosts.provision: allow no-pxe settings for 
NIC on Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1249973 (https://phabricator.wikimedia.org/T400626) (owner: 10Elukey) [14:55:57] (03CR) 10Btullis: wikidata-platform: wdqs-queryhammer chart (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [14:57:10] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=s1 [14:57:16] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=s3 [14:57:51] (03CR) 10Kamila Součková: [C:03+1] ratelimit-media: Initial service deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1226814 (https://phabricator.wikimedia.org/T414439) (owner: 10Clément Goubert) [14:58:48] !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1013.eqiad.wmnet [14:58:49] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1013.eqiad.wmnet [14:59:03] (03PS2) 10Trueg: wikidata-platform: wdqs-queryhammer chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) [15:00:35] (03CR) 10Trueg: wikidata-platform: wdqs-queryhammer chart (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [15:00:47] (03CR) 10CI reject: [V:04-1] wikidata-platform: wdqs-queryhammer chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [15:02:51] (03PS3) 10Trueg: wikidata-platform: wdqs-queryhammer chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) [15:03:14] (03CR) 10Trueg: wikidata-platform: wdqs-queryhammer chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 
(https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [15:04:37] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: add IPs to list of CGNAT addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251374 (owner: 10Daniel Kinzler) [15:07:41] (03CR) 10Kamila Součková: [C:03+1] rest-gateway: add IPs to list of CGNAT addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251374 (owner: 10Daniel Kinzler) [15:07:59] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2006-dev.codfw.wmnet [15:08:20] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudidp2001-dev.codfw.wmnet [15:12:23] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudidp2001-dev.codfw.wmnet [15:14:55] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: add IPs to list of CGNAT addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251374 (owner: 10Daniel Kinzler) [15:16:35] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol2006-dev.codfw.wmnet [15:16:48] (03PS1) 10Elukey: sre.hosts.provision: Allow more optional BIOS values for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1251424 (https://phabricator.wikimedia.org/T414216) [15:17:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:17:33] (03Merged) 10jenkins-bot: rest-gateway: add IPs to list of CGNAT addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251374 (owner: 10Daniel Kinzler) [15:18:59] (03CR) 10JMeybohm: [C:03+1] kafka-mirrormaker: allow multiple releases to be installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250631 
(https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [15:19:02] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:19:04] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:19:50] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:21:44] (03CR) 10Elukey: [C:03+1] kafka-mirrormaker: allow multiple releases to be installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250631 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [15:22:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:22:11] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [15:22:36] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [15:22:46] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [15:22:54] elukey@cumin1003 provision (PID 3230066) is awaiting input [15:23:12] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:23:30] (03PS1) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) [15:24:59] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status 
[15:25:15] (03PS2) 10Elukey: sre.hosts.provision: Allow more optional BIOS values for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1251424 (https://phabricator.wikimedia.org/T414216) [15:25:23] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:25:59] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:26:08] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1020.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:26:21] (03PS2) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) [15:27:23] (03CR) 10Elukey: "Needs more work, SOL_COM2ConsoleRedirection is also failing for some reason, these hosts are very new." [cookbooks] - 10https://gerrit.wikimedia.org/r/1251424 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey) [15:27:25] FIRING: [3x] BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:27:37] (03CR) 10Brouberol: [C:03+2] kafka-mirrormaker: allow multiple releases to be installed in the same namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250631 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [15:27:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:27:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [15:27:46] (03PS3) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) [15:28:15] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2005-dev.codfw.wmnet [15:28:26] (03PS4) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) [15:28:50] (03PS5) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) [15:31:39] (03CR) 10Gmodena: wikidata-platform: wdqs-queryhammer chart (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [15:32:09] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudgw2002-dev - https://phabricator.wikimedia.org/T419738#11707313 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:32:19] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudgw2002-dev - https://phabricator.wikimedia.org/T419738#11707318 (10Jhancock.wm) that did it. 
ty [15:33:27] (03PS6) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) [15:34:38] (03PS7) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) [15:34:55] (03CR) 10Trueg: wikidata-platform: wdqs-queryhammer chart (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [15:35:33] !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.check-ipip [15:35:33] !log vgutierrez@cumin1003 END (FAIL) - Cookbook sre.loadbalancer.check-ipip (exit_code=99) [15:35:58] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol2005-dev.codfw.wmnet [15:36:08] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2010-dev.codfw.wmnet [15:36:16] (03PS8) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) [15:36:27] !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.check-ipip [15:36:27] !log vgutierrez@cumin1003 END (FAIL) - Cookbook sre.loadbalancer.check-ipip (exit_code=99) [15:36:58] (03PS9) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) [15:37:16] !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.check-ipip [15:37:16] !log vgutierrez@cumin1003 END (FAIL) - Cookbook sre.loadbalancer.check-ipip (exit_code=99) [15:37:39] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [15:38:35] (03PS10) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 
(https://phabricator.wikimedia.org/T419873) [15:38:55] !log vgutierrez@cumin1003 START - Cookbook sre.loadbalancer.check-ipip [15:38:56] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.loadbalancer.check-ipip (exit_code=0) [15:39:27] (03PS4) 10Trueg: wikidata-platform: wdqs-queryhammer chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) [15:40:42] (03CR) 10Vgutierrez: "tested with test-cookbook using the following args `--dc ulsfo --query "P{ncredir4004.ulsfo.wmnet}" ncredir ncredir-https`:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [15:42:22] (03PS11) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) [15:43:02] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet [15:43:49] (03PS3) 10Jelto: profile::reboot::unattended: add class to mark hosts for unattended reboots [puppet] - 10https://gerrit.wikimedia.org/r/1251406 [15:43:49] (03CR) 10Jelto: [V:03+1] "I would like to reboot all of the passive and insetup hosts with a single cookbook run. Do you think this approach makes sense?" [puppet] - 10https://gerrit.wikimedia.org/r/1251406 (owner: 10Jelto) [15:46:04] (03PS1) 10Trueg: wikidata-platform: wdqs-queryhammer helmfile deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) [15:46:16] (03CR) 10JHathaway: [C:03+1] "I like this better, one suggestion." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1249973 (https://phabricator.wikimedia.org/T400626) (owner: 10Elukey) [15:47:29] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:48:01] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:48:55] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [15:50:40] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 195.200.68.151 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:52:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:52:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-magru and Telxius (2001:1498:1:966:1::251) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [15:54:16] (03PS12) 10Vgutierrez: sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) [15:54:51] PROBLEM - ensure kvm processes are running on cloudvirt1076 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:55:12] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "should be okay to deploy (see also [#wikimedia-tech logs 
today](https://wm-bot.wmcloud.org/browser/index.php?start=03%2F13%2F2026&end=03%2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [15:55:51] RECOVERY - ensure kvm processes are running on cloudvirt1076 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:00:04] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4048.ulsfo.wmnet with OS trixie [16:00:04] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:00:34] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [16:01:43] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: Provide check-ipip cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1251442 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [16:04:21] (03PS6) 10Elukey: sre.hosts.provision: allow no-pxe settings for NIC on Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1249973 (https://phabricator.wikimedia.org/T400626) [16:04:21] (03PS3) 10Elukey: sre.hosts.provision: Allow more optional BIOS values for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1251424 (https://phabricator.wikimedia.org/T414216) [16:04:34] (03CR) 10Elukey: sre.hosts.provision: allow no-pxe settings for NIC on Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1249973 (https://phabricator.wikimedia.org/T400626) (owner: 10Elukey) [16:05:39] (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251456 (https://phabricator.wikimedia.org/T408918) [16:07:39] (03CR) 10JHathaway: [C:03+1] sre.hosts.provision: allow no-pxe settings for NIC on Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1249973 
(https://phabricator.wikimedia.org/T400626) (owner: 10Elukey) [16:08:39] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:27] (03CR) 10AKhatun: [C:03+1] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251456 (https://phabricator.wikimedia.org/T408918) (owner: 10JavierMonton) [16:10:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet [16:11:06] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251456 (https://phabricator.wikimedia.org/T408918) (owner: 10JavierMonton) [16:13:24] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251456 (https://phabricator.wikimedia.org/T408918) (owner: 10JavierMonton) [16:14:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet [16:15:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor-dev2001.codfw.wmnet [16:16:08] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudweb1004.wikimedia.org [16:18:05] (03CR) 10Lerickson: wikidata-platform: wdqs-queryhammer chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [16:18:46] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4048.ulsfo.wmnet with OS trixie [16:19:10] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4048.ulsfo.wmnet with OS trixie [16:19:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
debmonitor-dev2001.codfw.wmnet [16:20:05] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1014.eqiad.wmnet [16:20:50] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1014.eqiad.wmnet with reason: Rebooting clouddb1014 T419960 [16:21:59] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb1004.wikimedia.org [16:22:12] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudweb1003.wikimedia.org [16:23:22] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11707567 (10herron) [16:25:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet [16:27:35] Note: I'm going to be making some mw-experimental changes. [16:28:03] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb1003.wikimedia.org [16:29:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet [16:29:38] (03CR) 10Trueg: wikidata-platform: wdqs-queryhammer chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [16:30:39] (03PS5) 10Trueg: wikidata-platform: wdqs-queryhammer chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) [16:33:39] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:40] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1014.eqiad.wmnet [16:34:50] !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for 
clouddb1014.eqiad.wmnet [16:34:51] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1014.eqiad.wmnet [16:35:19] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1015.eqiad.wmnet with reason: Rebooting clouddb1015 T419960 [16:36:30] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1015.eqiad.wmnet [16:39:49] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4048.ulsfo.wmnet with reason: host reimage [16:40:18] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4049.ulsfo.wmnet with OS trixie [16:40:34] (03PS1) 10Kamila Součková: shellbox: Setup shellbox-icu72 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) [16:42:27] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: allow no-pxe settings for NIC on Supermicro (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1249973 (https://phabricator.wikimedia.org/T400626) (owner: 10Elukey) [16:44:08] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4048.ulsfo.wmnet with reason: host reimage [16:50:23] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet [16:50:31] !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1015.eqiad.wmnet [16:50:32] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1015.eqiad.wmnet [16:51:32] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Rebooting clouddb1016 T419960 [16:51:46] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet [16:54:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 16 UTC morning backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251200 (https://phabricator.wikimedia.org/T419312) (owner: 10Gerrit Patch Uploader) [16:54:52] (03PS4) 10Elukey: sre.hosts.provision: Allow more optional BIOS values for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1251424 (https://phabricator.wikimedia.org/T414216) [16:55:57] (03CR) 10Elukey: "First change, provision still doesn't work on new dse-k8s hosts because more BIOS keys need to be reviewed, but this change seems self-con" [cookbooks] - 10https://gerrit.wikimedia.org/r/1251424 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey) [16:56:35] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4049.ulsfo.wmnet with OS trixie [16:57:00] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4049.ulsfo.wmnet with OS trixie [17:00:34] stashbot was out for a bit, some messages by fnegri and brett didn’t get logged as a result [17:00:35] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [17:00:52] (03PS1) 10AKhatun: stream: deploy edit-type stream to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251480 (https://phabricator.wikimedia.org/T351225) [17:01:50] dhinus: ^ [17:01:58] (had to look up the IRC name first, sorry ^^) [17:02:44] (03PS1) 10Muehlenhoff: thumbor: Update service image to latest rebuild [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251482 [17:03:13] Lucas_WMDE: thanks, I'm just doing some boring reboots but I'll re-log those anyway [17:03:18] ok, thanks! [17:04:31] (03PS3) 10Clément Goubert: rest-gateway: Log all 429 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251479 [17:04:50] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1251275 (https://phabricator.wikimedia.org/T419957) (owner: 10Majavah) [17:04:54] (03PS4) 10Clément Goubert: rest-gateway: Log 20% of 429 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251479 [17:06:22] (03CR) 10BCornwall: [C:03+1] wmnet: add sophroid svc IPs [dns] - 10https://gerrit.wikimedia.org/r/1248617 (https://phabricator.wikimedia.org/T418748) (owner: 10Jasmine) [17:06:58] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [17:07:27] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4048.ulsfo.wmnet with OS trixie [17:07:34] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [17:07:35] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet [17:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:45] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [17:08:29] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [17:08:47] !log (relogging failed sal) START - Cookbook sre.hosts.remove-downtime for clouddb1015.eqiad.wmnet [17:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:28] !log (relogging failed sal) END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1015.eqiad.wmnet [17:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:26] !log (relogging failed sal) DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1016.eqiad.wmnet with reason: Rebooting clouddb1016 T419960 [17:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:42] !log (relogging failed sal) conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet [17:10:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:03] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4048.* [17:11:52] (03CR) 10Kamila Součková: [C:03+1] rest-gateway: Log 20% of 429 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251479 (owner: 10Clément Goubert) [17:11:57] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1016.eqiad.wmnet [17:12:05] !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1016.eqiad.wmnet [17:12:06] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1016.eqiad.wmnet [17:12:23] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Log 20% of 429 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251479 (owner: 10Clément Goubert) [17:15:27] (03Merged) 10jenkins-bot: rest-gateway: Log 20% of 429 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251479 (owner: 10Clément Goubert) [17:16:48] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4049.ulsfo.wmnet with reason: host reimage [17:16:51] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [17:17:02] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [17:17:22] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4050.ulsfo.wmnet with OS trixie [17:20:10] (03PS2) 10Jforrester: Replace direct BagOStuff with WANObjectCache [extensions/WikiLambda] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251487 (https://phabricator.wikimedia.org/T419666) [17:20:32] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4049.ulsfo.wmnet with reason: host reimage [17:20:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikiLambda] (wmf/1.46.0-wmf.19) - 
10https://gerrit.wikimedia.org/r/1251487 (https://phabricator.wikimedia.org/T419666) (owner: 10Jforrester) [17:24:13] gerrit seems to be having Problems [17:24:24] ^ [17:24:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:25:48] (03CR) 10CI reject: [V:04-1] Replace direct BagOStuff with WANObjectCache [extensions/WikiLambda] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251487 (https://phabricator.wikimedia.org/T419666) (owner: 10Jforrester) [17:26:22] (03PS2) 10Muehlenhoff: thumbor-plugins: Stop using pkg_resources [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1243135 [17:26:44] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [17:26:50] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [17:26:58] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [17:27:02] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [17:27:07] All done at our end. 
[17:29:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:30:00] (03CR) 10Aaron Schulz: "*no need" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251154 (https://phabricator.wikimedia.org/T419053) (owner: 10Aaron Schulz) [17:33:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:34:33] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [17:34:59] FIRING: [2x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:35:07] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [17:35:47] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4050.ulsfo.wmnet with OS trixie [17:36:17] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4050.ulsfo.wmnet with OS trixie [17:36:40] (03CR) 10Jforrester: "recheck" [extensions/WikiLambda] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1251487 (https://phabricator.wikimedia.org/T419666) (owner: 10Jforrester) [17:37:43] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [17:37:57] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [17:38:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:41:59] (03CR) 10BCornwall: [C:03+2] cache::haproxy: Ensure that lua files get deployed before starting [puppet] - 10https://gerrit.wikimedia.org/r/1251285 (owner: 10Vgutierrez) [17:42:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:46:30] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4049.ulsfo.wmnet with OS trixie [17:47:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:48:58] 10ops-codfw, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup2005 power supplies fried or overvoltage - https://phabricator.wikimedia.org/T419970#11708090 (10Jhancock.wm) it's definitely having some issues. -power cables reseated with no results. -replaced the BP1 and that error went... [17:49:18] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4049.* [17:56:42] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4050.ulsfo.wmnet with reason: host reimage [17:58:11] PROBLEM - Host db1253 #page is DOWN: PING CRITICAL - Packet loss = 100% [17:58:44] !ack [17:58:45] Could not ack the alert. Please check the parameters. [17:58:57] !ack [17:58:58] Could not ack the alert. Please check the parameters. 
[17:59:01] !incidents [17:59:01] 7752 (UNACKED) Host db1253 (paged) [17:59:01] 7750 (RESOLVED) This is a test (please ignore) [17:59:01] 7749 (RESOLVED) This is a test (please ignore) [17:59:02] 7748 (RESOLVED) This is a test incident (please ignore) [17:59:02] 7747 (RESOLVED) Host db1258 (paged) [17:59:04] !ack 7752 [17:59:05] 7752 (ACKED) Host db1253 (paged) [17:59:22] here [17:59:37] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4050.ulsfo.wmnet with reason: host reimage [17:59:41] That's a replica [17:59:55] depool? [18:00:19] <_joe_> isn't that the server that was already depooled this morning? [18:00:36] +1, I can check why it went down over serial [18:00:38] No it's another one [18:00:44] claime: are you depooling? [18:00:44] 1253 != 1258 [18:00:47] elukey: yes [18:01:37] raise RemoteExecutionError(ret, "Cumin execution failed", worker.get_results()) [18:01:39] awesome [18:02:29] so getsel doesn't have anything for today, the host is down though [18:02:34] going to powercycle it [18:03:22] !log powercycle db1253 - host not reachable via ssh, no events logged in racadm getsel, no console com2 available (blank screen) [18:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:42] (03PS1) 10Btullis: Disable the x1 section on an-redacteddb1001 until we can populate it [puppet] - 10https://gerrit.wikimedia.org/r/1251494 (https://phabricator.wikimedia.org/T407485) [18:04:06] Should I depool it manually with dbctl? [18:04:37] what did you use ? [18:04:44] https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting#Depooling_a_replica [18:04:49] sre.mysql.depool [18:05:09] ah okok, let's use dbctl [18:05:12] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4051.ulsfo.wmnet with OS trixie [18:05:23] because the cookbook assumes the host is up (I think) [18:05:34] Just this host? 
[18:05:51] https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting#Manual_depooling < is confusing [18:05:52] I'd say https://wikitech.wikimedia.org/wiki/Dbctl#Completely_depool_a_host = yes [18:06:08] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [18:06:28] RECOVERY - Host db1253 #page is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [18:06:28] PROBLEM - MariaDB read only s7 on db1253 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [18:06:37] PROBLEM - mysqld processes on db1253 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [18:06:39] PROBLEM - MariaDB Event Scheduler s7 on db1253 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [18:06:41] !log cgoubert@cumin1003 dbctl commit (dc=all): 'Depool db1253', diff saved to https://phabricator.wikimedia.org/P89856 and previous config saved to /var/cache/conftool/dbconfig/20260313-180640-cgoubert.json [18:06:53] it's depooled [18:06:53] perfect, the host is up now [18:07:01] PROBLEM - MariaDB Events s7 on db1253 is CRITICAL: CRITICAL - Failed to query events: ERROR 2002 (HY000): Cant connect to local server through socket /run/mysqld/mysqld.sock (2) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [18:07:02] PROBLEM - MariaDB Replica SQL: s7 #page on db1253 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:07:03] PROBLEM - MariaDB Replica IO: s7 #page on db1253 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:07:04] PROBLEM - MariaDB Replica Lag: s7 #page on db1253 is CRITICAL: CRITICAL slave_sql_lag could not connect 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:07:12] !ack [18:07:12] Could not ack the alert. Please check the parameters. [18:07:15] !incidents [18:07:16] 7754 (UNACKED) db1253 (paged)/MariaDB Replica SQL: s7 (paged) [18:07:16] 7755 (UNACKED) db1253 (paged)/MariaDB Replica IO: s7 (paged) [18:07:16] 7756 (UNACKED) db1253 (paged)/MariaDB Replica Lag: s7 (paged) [18:07:16] 7752 (RESOLVED) Host db1253 (paged) [18:07:16] 7750 (RESOLVED) This is a test (please ignore) [18:07:17] 7749 (RESOLVED) This is a test (please ignore) [18:07:17] 7748 (RESOLVED) This is a test incident (please ignore) [18:07:17] 7747 (RESOLVED) Host db1258 (paged) [18:07:43] !ack 7754 7755 7756 [18:07:43] Could not ack the alert. Please check the parameters. [18:07:45] !ack 7754 [18:07:45] 7754 (ACKED) db1253 (paged)/MariaDB Replica SQL: s7 (paged) [18:07:45] !ack 7755 [18:07:45] 7755 (ACKED) db1253 (paged)/MariaDB Replica IO: s7 (paged) [18:07:46] !ack 7756 [18:07:46] 7756 (ACKED) db1253 (paged)/MariaDB Replica Lag: s7 (paged) [18:07:49] Maybe downtime it now x) [18:08:20] yep :) I think we need to do it via puppet no? 
I am not sure if the regular cookbook is enough [18:08:20] lemme check [18:08:24] Creating DBA task [18:09:37] https://phabricator.wikimedia.org/T420041 [18:10:01] !log elukey@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1253.eqiad.wmnet with reason: Host went down and paged, depooled [18:10:02] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt - jclark@cumin1003" [18:10:04] I am executing the cookbook [18:10:07] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt - jclark@cumin1003" [18:10:07] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:10:15] I don't recall if anything needs to be done on puppet [18:10:42] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1374.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:12:09] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11708155 (10Papaul) @cmooney yes can do [18:12:40] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1373.eqiad.wmnet with OS bookworm [18:12:51] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11708161 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1373.eqiad.wmnet with OS bookworm [18:13:01] claime: seems fine now, wdyt? 
[18:13:25] elukey: yeah seems fine [18:15:10] (03CR) 10Btullis: wikidata-platform: wdqs-queryhammer chart (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [18:15:50] (03PS5) 10Eevans: services: add linked-artifacts service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) [18:17:13] I am very ignorant about how to start mariadb on it, it is probably /opt/wmf-mariadb1011/bin/ and then restart the replica [18:19:02] no ok mariadb.service is there now [18:20:09] https://wikitech.wikimedia.org/wiki/MariaDB/Rebooting_a_host#If_the_server_or_the_instance_crashed [18:20:20] going to check in a bit, bbiab [18:21:15] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4051.ulsfo.wmnet with OS trixie [18:21:36] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4051.ulsfo.wmnet with OS trixie [18:21:43] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1374.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [18:22:17] (03CR) 10JHathaway: "looks good, one question" [cookbooks] - 10https://gerrit.wikimedia.org/r/1251424 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey) [18:22:33] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1374.eqiad.wmnet with OS bookworm [18:22:38] (03CR) 10Btullis: wikidata-platform: wdqs-queryhammer chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251095 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [18:22:39] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11708171 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1374.eqiad.wmnet with OS bookworm [18:24:09] (03CR) 10Btullis: "I
noticed that you didn't specify the `.Values.mm.source` and `.Values.mm.target` and you can see the effect of it in the configmap render" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251453 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [18:24:16] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4050.ulsfo.wmnet with OS trixie [18:24:21] !log brett@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp4050.ulsfo.wmnet [18:30:49] (03CR) 10Elukey: sre.hosts.provision: Allow more optional BIOS values for Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1251424 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey) [18:34:11] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS trixie [18:34:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:34:59] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:34:59] (03CR) 10JHathaway: sre.hosts.provision: Allow more optional BIOS values for Supermicro (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1251424 (https://phabricator.wikimedia.org/T414216) (owner: 10Elukey) [18:35:12] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cp4050.ulsfo.wmnet with reason: firmware updates [18:36:38] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on 
wikikube-worker1374.eqiad.wmnet with reason: host reimage [18:39:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:40:06] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1033.eqiad.wmnet with OS trixie [18:40:07] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1034.eqiad.wmnet with OS trixie [18:40:17] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1035.eqiad.wmnet with OS trixie [18:41:39] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4051.ulsfo.wmnet with reason: host reimage [18:43:15] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1374.eqiad.wmnet with reason: host reimage [18:47:11] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4051.ulsfo.wmnet with reason: host reimage [18:48:31] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11708220 (10herron) [18:48:47] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11708221 (10Jclark-ctr) wikikube-worker1373 is having issues http booting Will be on hold till nokia switch gets updated in C4 [18:49:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11708223 (10Jclark-ctr) wdqs1033 is having issues http booting Will have to hold on that one till nokia switch in C4 gets updated [18:53:34] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:54:34] RECOVERY - OSPF status on 
cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:54:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11708236 (10Jclark-ctr) [18:55:08] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1034.eqiad.wmnet with reason: host reimage [18:55:21] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1035.eqiad.wmnet with reason: host reimage [18:57:51] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS trixie [18:58:08] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS trixie [18:58:53] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1034.eqiad.wmnet with reason: host reimage [19:00:26] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:00:41] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:00:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1374.eqiad.wmnet with OS bookworm [19:00:50] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11708267 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1374.eqiad.wmnet with OS bookworm completed: - wikikube-worker1374 (... 
[19:01:39] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11708271 (10Jclark-ctr) [19:02:54] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1035.eqiad.wmnet with reason: host reimage [19:03:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11708274 (10Jclark-ctr) Accidentally put Procurement ticket for cookbook so ops-monitoring-bot posted in on Procurement [19:04:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11708278 (10Jclark-ctr) [19:07:01] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4050.ulsfo.wmnet [19:07:07] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11708313 (10herron) [19:11:17] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4051.ulsfo.wmnet with OS trixie [19:13:32] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:14:47] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4050.ulsfo.wmnet [19:15:57] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4050.ulsfo.wmnet [19:16:14] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4051.* [19:16:37] jclark@cumin1003 reimage (PID 3253296) is awaiting input [19:18:37] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:18:57] !log jclark@cumin1003 END 
(FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:18:58] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:18:59] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1034.eqiad.wmnet with OS trixie [19:19:01] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1035.eqiad.wmnet with OS trixie [19:23:55] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4050.ulsfo.wmnet [19:24:02] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [19:29:30] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [19:40:21] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudservices2004-dev.codfw.wmnet [19:40:58] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4050.* [19:43:37] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:44:10] FIRING: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:44:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:46:35] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:46:44] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2004-dev.codfw.wmnet [19:46:49] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudservices2005-dev.codfw.wmnet [19:49:10] RESOLVED: [2x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:49:35] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:49:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:50:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2005-dev (172.20.5.9) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:52:35] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:53:10] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2005-dev.codfw.wmnet [19:54:22] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS trixie [19:54:42] !log andrew@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host cloudservices2004-dev.codfw.wmnet [19:54:54] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:55:26] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4052.* [19:57:35] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:59:25] FIRING: [4x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.9 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:59:54] FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:00:35] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:01:04] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2004-dev.codfw.wmnet [20:01:08] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudservices2004-dev.codfw.wmnet [20:01:09] FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:03:35] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:04:25] RESOLVED: [4x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.9 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:04:54] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:05:40] FIRING: [4x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.9 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:06:09] FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:07:28] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2004-dev.codfw.wmnet [20:07:35] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:09:25] RESOLVED: [4x] BFDdown: BFD session down between cloudsw1-b1-codfw and 172.20.5.9 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cloudsw1-b1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:09:54] RESOLVED: [4x] CoreBGPDown: Core BGP session 
down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:13:07] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053 (10AWesterinen) 03NEW [20:17:03] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11708565 (10AWesterinen) [20:20:23] (03PS1) 10Majavah: P:kafka::broker::monitoring: Fix legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1251539 (https://phabricator.wikimedia.org/T420034) [20:22:09] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8273/console" [puppet] - 10https://gerrit.wikimedia.org/r/1251539 (https://phabricator.wikimedia.org/T420034) (owner: 10Majavah) [20:25:10] (03PS1) 10Majavah: confluent: kafka::broker: Fix legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1251540 (https://phabricator.wikimedia.org/T420034) [20:29:55] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8274/console" [puppet] - 10https://gerrit.wikimedia.org/r/1251540 (https://phabricator.wikimedia.org/T420034) (owner: 10Majavah) [21:14:42] (03PS1) 10SBassett: Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1251550 (https://phabricator.wikimedia.org/T419502) [21:15:24] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251552 [21:17:14] (03CR) 10SBassett: [C:04-1] "Hold for Monday 2026-03-16 mid-day infra deployment window. Will need an on-call SRE." 
[puppet] - 10https://gerrit.wikimedia.org/r/1251550 (https://phabricator.wikimedia.org/T419502) (owner: 10SBassett) [21:34:59] FIRING: [2x] HelmReleaseBadStatus: Helm release kserve/kserve on k8s-mlserve@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:34:30] (03CR) 10SomeRandomDeveloper: [C:04-1] Allow-list some additional domains to the currently enforcing CSP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1251550 (https://phabricator.wikimedia.org/T419502) (owner: 10SBassett) [22:34:59] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:52:50] !log taavi@deploy2002 ~ $ mwscript CentralAuth:attachAccount.php --wiki=metawiki --userlist backfiller.txt # unify unified Special:CentralAuth/MediaWikiAccountBackfiller on meta [22:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:54] (03PS1) 10Dreamy Jazz: Uninstall AbuseFilter from closed wikis with no AbuseFilter logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251582 (https://phabricator.wikimedia.org/T420052) [23:25:09] (03PS2) 10Dreamy Jazz: Uninstall AbuseFilter from closed wikis with no AbuseFilter logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251582 (https://phabricator.wikimedia.org/T420063) [23:37:38] (03PS1) 10Dreamy Jazz: Uninstall GlobalBlocking from closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251589 (https://phabricator.wikimedia.org/T420062) [23:38:30] (03CR) 10CI reject: [V:04-1] Uninstall GlobalBlocking from closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251589 (https://phabricator.wikimedia.org/T420062) (owner: 
10Dreamy Jazz) [23:41:22] (03PS2) 10Dreamy Jazz: Uninstall GlobalBlocking from closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251589 (https://phabricator.wikimedia.org/T420062) [23:41:54] (03PS3) 10Dreamy Jazz: Uninstall GlobalBlocking from closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251589 (https://phabricator.wikimedia.org/T420062) [23:43:47] (03PS4) 10Dreamy Jazz: Uninstall GlobalBlocking from closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251589 (https://phabricator.wikimedia.org/T420062)