[00:05:16] PROBLEM - Host wikikube-worker1036 is DOWN: PING CRITICAL - Packet loss = 33%, RTA = 2193.41 ms [00:05:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:05:34] RECOVERY - Host wikikube-worker1036 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [00:10:12] FIRING: CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:20:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:30:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:39:47] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1254392 [00:39:47] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1254392 (owner: 10TrainBranchBot) [00:40:09] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11721374 (10Ladsgroup) >>! In T414805#11675775, @Ladsgroup wrote: >>>! 
In T414805#11668230, @Ladsgroup wrote: >> Top "file formats" f... [00:53:57] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1254392 (owner: 10TrainBranchBot) [01:00:47] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:09:42] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1254418 [01:09:42] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1254418 (owner: 10TrainBranchBot) [01:10:16] !log denisse@deploy2002 Started deploy [librenms/librenms@d152b36]: Upgrade LibreNMS to 25.11.0 [01:10:24] !log denisse@deploy2002 Finished deploy [librenms/librenms@d152b36]: Upgrade LibreNMS to 25.11.0 (duration: 00m 08s) [01:25:25] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1254418 (owner: 10TrainBranchBot) [01:38:12] !log denisse@deploy2002 Started deploy [librenms/librenms@9bdfb73]: Upgrade LibreNMS to 26.3.1 [01:38:31] !log denisse@deploy2002 Finished deploy [librenms/librenms@9bdfb73]: Upgrade LibreNMS to 26.3.1 (duration: 00m 19s) [01:50:34] (03PS1) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254445 (https://phabricator.wikimedia.org/T420424) [01:52:54] (03PS1) 10DDesouza: miscweb(design-blog): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254446 (https://phabricator.wikimedia.org/T344471) [01:55:09] (03PS1) 10DDesouza: Undeploy participant recruitment survey on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254448 
(https://phabricator.wikimedia.org/T419275) [01:55:17] (03CR) 10CI reject: [V:04-1] Undeploy participant recruitment survey on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254448 (https://phabricator.wikimedia.org/T419275) (owner: 10DDesouza) [01:55:56] (03PS1) 10DDesouza: Undeploy participant recruitment survey on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254450 (https://phabricator.wikimedia.org/T419275) [01:56:01] (03CR) 10DDesouza: [C:03+2] miscweb(design-blog): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254446 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [01:56:05] (03CR) 10DDesouza: [C:03+2] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254445 (https://phabricator.wikimedia.org/T420424) (owner: 10DDesouza) [01:57:38] (03PS1) 10DDesouza: Undeploy participant recruitment survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254452 (https://phabricator.wikimedia.org/T419778) [01:58:25] (03PS2) 10DDesouza: Undeploy participant recruitment survey on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254450 (https://phabricator.wikimedia.org/T419275) [01:58:34] (03Merged) 10jenkins-bot: miscweb(design-blog): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254446 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [01:58:36] (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254445 (https://phabricator.wikimedia.org/T420424) (owner: 10DDesouza) [02:00:48] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [02:03:36] (03PS2) 10DDesouza: Undeploy participant recruitment survey on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254448 (https://phabricator.wikimedia.org/T419275) [02:04:38] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply 
[02:05:01] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [02:05:02] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [02:05:31] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [02:05:33] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [02:06:07] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [02:07:01] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [02:07:14] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [02:07:15] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [02:07:30] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [02:07:32] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [02:07:52] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [02:08:35] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 07m 47s) [02:08:40] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:33:40] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:46:04] (03PS3) 10RLazarus: function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254338 (https://phabricator.wikimedia.org/T367880) [02:53:07] (03PS1) 10Krinkle: labs: Remove redundant wgSkipSkins override [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1254463 [03:01:09] (03PS1) 10MusikAnimal: CM5: add more aggressive warnings about CM5 deprecation [extensions/CodeMirror] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254468 (https://phabricator.wikimedia.org/T373720) [03:06:54] (03CR) 10Bhsd: [C:03+1] CM5: add more aggressive warnings about CM5 deprecation [extensions/CodeMirror] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254468 (https://phabricator.wikimedia.org/T373720) (owner: 10MusikAnimal) [03:07:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CodeMirror] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254468 (https://phabricator.wikimedia.org/T373720) (owner: 10MusikAnimal) [03:08:57] (03Merged) 10jenkins-bot: CM5: add more aggressive warnings about CM5 deprecation [extensions/CodeMirror] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254468 (https://phabricator.wikimedia.org/T373720) (owner: 10MusikAnimal) [03:09:46] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1254468|CM5: add more aggressive warnings about CM5 deprecation (T373720)]] [03:09:50] T373720: Deprecate use of CodeMirror 5 - https://phabricator.wikimedia.org/T373720 [03:11:49] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1254468|CM5: add more aggressive warnings about CM5 deprecation (T373720)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[03:18:12] !log musikanimal@deploy2002 musikanimal: Continuing with sync [03:22:09] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254468|CM5: add more aggressive warnings about CM5 deprecation (T373720)]] (duration: 12m 22s) [03:22:12] T373720: Deprecate use of CodeMirror 5 - https://phabricator.wikimedia.org/T373720 [04:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:10:12] FIRING: CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:13:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:30:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:00:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:00:47] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T0600) [06:37:40] Deploying MinT/machinetranslation. Let's see how it goes! [06:38:20] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [06:54:41] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [06:59:30] (03CR) 10Arnaudb: [C:03+2] gerrit: add a ttl on ProxyPass to jetty [puppet] - 10https://gerrit.wikimedia.org/r/1254128 (https://phabricator.wikimedia.org/T420189) (owner: 10Arnaudb) [07:00:04] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:04:18] (03PS1) 10Arnaudb: trafficserver: Enable connection re-use for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1254746 (https://phabricator.wikimedia.org/T420189) [07:04:57] FIRING: [2x] CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:05:37] (03CR) 10Arnaudb: [C:03+2] trafficserver: Enable connection re-use for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1254746 (https://phabricator.wikimedia.org/T420189) (owner: 10Arnaudb) [07:06:13] (03PS1) 10KartikMistry: machinetranslation: reduce replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254751 (https://phabricator.wikimedia.org/T411058) [07:06:13] (03PS1) 10Arnaudb: Revert "trafficserver: Enable connection re-use for gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/1254750 [07:06:30] (03CR) 10Arnaudb: [V:03+2] Revert "trafficserver: Enable connection re-use for gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/1254750 (owner: 10Arnaudb) [07:07:54] (03CR) 10Arnaudb: [V:03+2 C:03+2] Revert "trafficserver: Enable connection re-use for gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/1254750 (owner: 10Arnaudb) [07:10:33] (03CR) 10KartikMistry: [C:03+2] machinetranslation: reduce replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254751 (https://phabricator.wikimedia.org/T411058) (owner: 10KartikMistry) [07:10:37] (03CR) 10Daniel Kinzler: rest-gateway rate limiting: add CORS headers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [07:12:40] (03Merged) 10jenkins-bot: machinetranslation: reduce replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254751 (https://phabricator.wikimedia.org/T411058) (owner: 10KartikMistry) [07:16:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:16:36] (03CR) 10Ayounsi: [C:03+2] "cool, thanks. Not strictly needed but best for the sake of completeness." [homer/public] - 10https://gerrit.wikimedia.org/r/1254293 (https://phabricator.wikimedia.org/T420361) (owner: 10Ssingh) [07:16:50] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [07:17:13] Another attempt ^ [07:18:43] (03Merged) 10jenkins-bot: definitions/static.net: add IPv6 addresses for nameservers [homer/public] - 10https://gerrit.wikimedia.org/r/1254293 (https://phabricator.wikimedia.org/T420361) (owner: 10Ssingh) [07:21:48] (03PS9) 10Daniel Kinzler: rest-gateway: per-route jwt overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) [07:21:57] (03CR) 10CI reject: [V:04-1] rest-gateway: per-route jwt overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler) [07:22:00] (03CR) 10Daniel Kinzler: rest-gateway: per-route jwt overrides (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler) [07:22:22] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [07:25:45] (03PS14) 10Daniel Kinzler: rest-gateway rate limiting: add CORS headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) [07:26:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:27:23] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-wmde-users for Ben.buchenau - https://phabricator.wikimedia.org/T419878#11721743 (10ayounsi) [07:29:31] 06SRE, 10MinT, 10Prod-Kubernetes, 06ServiceOps new, and 4 others: Can't deploy machinetranslation due to exceeding resource quotas - https://phabricator.wikimedia.org/T411058#11721747 (10KartikMistry) @RLazarus After reducing `replicas`, I was able to deploy MinT in codfw. How to delete failing older pods... [07:29:37] (03CR) 10Arnaudb: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1254237 (https://phabricator.wikimedia.org/T419878) (owner: 10Ayounsi) [07:30:28] (03CR) 10Ayounsi: [C:03+2] Add benbuchenau to analytics-wmde-users [puppet] - 10https://gerrit.wikimedia.org/r/1254237 (https://phabricator.wikimedia.org/T419878) (owner: 10Ayounsi) [07:31:49] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-wmde-users for Ben.buchenau - https://phabricator.wikimedia.org/T419878#11721754 (10ayounsi) 05Open→03Resolved Change merged, should be live in ~30min. Please re-open if any issue. [07:35:37] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 32934 [07:41:32] ayounsi@cumin1003 peering (PID 4043258) is awaiting input [07:42:23] !log btullis@cumin1003 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop analytics cluster [07:45:30] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 32934 [07:45:48] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11721774 (10JMeybohm) >>! In T419967#11720994, @Ajuanca wrote: > What's task `T419960` about? I don't enough privilegies to access it. Yes, I think a parameter with explici... [07:49:59] (03CR) 10JMeybohm: [C:03+1] "LGTM, thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1254338 (https://phabricator.wikimedia.org/T367880) (owner: 10RLazarus) [07:52:05] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11721782 (10Wellverywell) Oh, ok, thank you! So is this a bug in file descriptions on Commons? (Well, for another image, 480px produces an actual 480px image -- so what is the bug in... [07:58:40] 06SRE, 10MinT, 10Prod-Kubernetes, 06ServiceOps new, and 3 others: Can't deploy machinetranslation due to exceeding resource quotas - https://phabricator.wikimedia.org/T411058#11721786 (10JMeybohm) >>! In T411058#11721747, @KartikMistry wrote: > @RLazarus After reducing `replicas`, I was able to deploy MinT... [08:00:05] andre and brennen: gettimeofday() says it's time for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T0800) [08:02:26] PROBLEM - Host cloudgw1003 is DOWN: PING CRITICAL - Packet loss = 100% [08:02:43] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [08:03:36] the cloud host down alerts are expected, part of T417393 [08:04:24] PROBLEM - Host wikikube-worker1157 is DOWN: PING CRITICAL - Packet loss = 100% [08:04:48] RECOVERY - Host wikikube-worker1157 is UP: PING WARNING - Packet loss = 0%, RTA = 629.26 ms [08:05:31] (03PS10) 10Daniel Kinzler: rest-gateway: per-route jwt overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) [08:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:07:54] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [08:08:55] 06SRE, 10MinT, 
10Prod-Kubernetes, 06ServiceOps new, and 3 others: Can't deploy machinetranslation due to exceeding resource quotas - https://phabricator.wikimedia.org/T411058#11721800 (10KartikMistry) >>! In T411058#11721786, @JMeybohm wrote: >>>! In T411058#11721747, @KartikMistry wrote: >> @RLazarus Afte... [08:11:33] !log codfw/eqiad: Deployed MinT (T411058) [08:13:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:14:52] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:15:00] PROBLEM - Host cloudlb1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:15:51] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254820 (https://phabricator.wikimedia.org/T413811) [08:15:54] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by aklapper@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254820 (https://phabricator.wikimedia.org/T413811) (owner: 10TrainBranchBot) [08:16:39] FIRING: CoreBGPDown: ... [08:16:39] Core BGP session down between cloudsw1-c8-eqiad and cloudlb1001 (2a02:ec80:a000:201::2) - group cloud_host6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cloudsw1-c8-eqiad:9804&var-bgp_group=cloud_host6&var-bgp_neighbor=cloudlb1001 - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:17:02] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254820 (https://phabricator.wikimedia.org/T413811) (owner: 10TrainBranchBot) [08:21:12] andre: you are running the train arent you? 
:) [08:21:26] hashar: yes, sorry, should have communicated [08:21:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudlb1001 (172.20.1.2) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:21:39] no no it was to be expected, I am merely triple checking! [08:21:54] I'll restart the CI Jenkins for a plugin update once you are done and everything is stable [08:22:04] :) [08:22:22] yay [08:22:54] !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.20 refs T413811 [08:22:59] T413811: 1.46.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T413811 [08:27:44] hashar: looks stable enough to me, go ahead [08:28:40] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:28:58] andre: thanks!
[08:29:05] dr [08:29:29] !log Restarting CI Jenkins for plugin upgrade # T420347 [08:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:33] T420347: Quibble -c commands cause Jenkins Collapsible Section plugin to erase console output (Wikibase job in Jenkins do not include the full log) - https://phabricator.wikimedia.org/T420347 [08:31:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudlb1001 (172.20.1.2) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:33:40] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:38:40] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:41:39] FIRING: TransitBGPDown: Transit BGP session down between cr1-esams and Deutsche Telekom (80.249.209.211) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr1-esams:9804&var-bgp_group=Transit4&var-bgp_neighbor=Deutsche+Telekom - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [08:43:40] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:44:31] (03CR) 10Vgutierrez: [V:03+1] "PCC 
SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8297/co" [puppet] - 10https://gerrit.wikimedia.org/r/1253506 (https://phabricator.wikimedia.org/T418971) (owner: 10Ssingh) [08:45:03] 06SRE, 10SRE-Access-Requests: Requesting access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11721853 (10WMDE-leszek) Hello, I approve this request on WMDE's end. Thank you! [08:45:10] (03PS3) 10Slyngshede: service.yaml: update IPs for ulsfo-lb (text/upload/gerrit/ncredir) [puppet] - 10https://gerrit.wikimedia.org/r/1253506 (https://phabricator.wikimedia.org/T418971) (owner: 10Ssingh) [08:46:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-esams and Deutsche Telekom (2001:7f8:1::a500:3320:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [08:46:42] !log klausman@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:ml-cache-eqiad [08:46:51] (03PS1) 10Slyngshede: WMCS cloudgw: update IPs for ulsfo-lb (text/upload) [puppet] - 10https://gerrit.wikimedia.org/r/1254830 (https://phabricator.wikimedia.org/T418971) [08:47:07] (03CR) 10Vgutierrez: [C:03+1] service.yaml: update IPs for ulsfo-lb (text/upload/gerrit/ncredir) [puppet] - 10https://gerrit.wikimedia.org/r/1253506 (https://phabricator.wikimedia.org/T418971) (owner: 10Ssingh) [08:47:56] (03CR) 10Vgutierrez: [C:03+1] geo-resources: update IP addresses for ulsfo services [dns] - 10https://gerrit.wikimedia.org/r/1253503 (https://phabricator.wikimedia.org/T418971) (owner: 10Ssingh) [08:47:58] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2070.codfw.wmnet [08:47:58] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1071.eqiad.wmnet [08:50:07] FIRING: ProbeDown: Service ml-cache1001-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#ml-cache1001-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:50:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1023.eqiad.wmnet [08:51:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet [08:52:44] aux-k8s-etcd1003, dse-k8s-etcd1001, kubestagemaster1005 will go down for a Ganeti reboot [08:53:30] PROBLEM - Host dse-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:53:54] PROBLEM - Host kubestagemaster1005 is DOWN: PING CRITICAL - Packet loss = 100% [08:54:10] FIRING: BFDdown: BFD session down between cr1-esams and 2001:7f8:1::a500:3320:1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [08:54:12] PROBLEM - Host aux-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [08:55:07] RESOLVED: [2x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:55:26] RECOVERY - Host aux-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [08:55:36] RECOVERY - Host kubestagemaster1005 is UP: PING OK - Packet loss = 0%, RTA = 1.92 ms [08:55:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2070.codfw.wmnet [08:55:54] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2071.codfw.wmnet [08:56:06] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1071.eqiad.wmnet [08:56:09] !log mvernon@cumin1003 
START - Cookbook sre.hosts.reboot-single for host ms-be1072.eqiad.wmnet [08:56:18] RECOVERY - Host dse-k8s-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [08:56:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-esams and Deutsche Telekom (2001:7f8:1::a500:3320:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [08:57:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet [08:58:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet [08:58:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1023.eqiad.wmnet [08:59:10] FIRING: [2x] BFDdown: BFD session down between cr1-esams and 2001:7f8:1::a500:3320:1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:00:10] FIRING: [3x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:00:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet [09:00:47] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:00:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:01:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet [09:02:07] !log slyngshede@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool ulsfo [reason: no reason specified, T418971] [09:02:11] T418971: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971 [09:02:15] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool ulsfo [reason: no reason specified, T418971] [09:02:47] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2071.codfw.wmnet [09:02:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2072.codfw.wmnet [09:03:26] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1072.eqiad.wmnet [09:03:29] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1073.eqiad.wmnet [09:04:00] 06SRE, 10SRE-swift-storage: upload.wikimedia.org serves .ogg audio files with content-type `application/ogg` instead of `audio/ogg`. 
- https://phabricator.wikimedia.org/T420422#11721895 (10Arendpieter) [09:04:10] RESOLVED: [2x] BFDdown: BFD session down between cr1-esams and 2001:7f8:1::a500:3320:1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:05:10] RESOLVED: [4x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:06:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet [09:07:16] (03PS2) 10Ayounsi: ulsfo: add new LVS service IP range [homer/public] - 10https://gerrit.wikimedia.org/r/1247994 (https://phabricator.wikimedia.org/T418971) [09:08:19] (03CR) 10Vgutierrez: [C:03+1] ulsfo: add new LVS service IP range [homer/public] - 10https://gerrit.wikimedia.org/r/1247994 (https://phabricator.wikimedia.org/T418971) (owner: 10Ayounsi) [09:08:25] (03CR) 10Ayounsi: ulsfo: add new LVS service IP range (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1247994 (https://phabricator.wikimedia.org/T418971) (owner: 10Ayounsi) [09:08:33] !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 23 hosts with reason: Update ULSFO LVS service IPs [09:08:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2072.codfw.wmnet [09:08:44] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2073.codfw.wmnet [09:08:50] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11721899 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence 
(ID=b34081b8-989e-49aa-91c7-56b4548775e2) set by slyngshede@cumin1003 for 4:00:... [09:09:42] (03CR) 10Cathal Mooney: [C:03+1] "Matches other sites and what I had in https://phabricator.wikimedia.org/T408892#11330727 so LGTM." [homer/public] - 10https://gerrit.wikimedia.org/r/1247994 (https://phabricator.wikimedia.org/T418971) (owner: 10Ayounsi) [09:10:10] FIRING: [6x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:10:22] (03CR) 10Ayounsi: [C:03+2] ulsfo: add new LVS service IP range [homer/public] - 10https://gerrit.wikimedia.org/r/1247994 (https://phabricator.wikimedia.org/T418971) (owner: 10Ayounsi) [09:10:54] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1073.eqiad.wmnet [09:10:58] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1074.eqiad.wmnet [09:11:45] (03Merged) 10jenkins-bot: ulsfo: add new LVS service IP range [homer/public] - 10https://gerrit.wikimedia.org/r/1247994 (https://phabricator.wikimedia.org/T418971) (owner: 10Ayounsi) [09:12:37] (03PS1) 10Daniel Kinzler: rest-gateway: update readme [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254848 [09:12:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1024.eqiad.wmnet [09:12:45] (03CR) 10JMeybohm: [C:03+2] kubestagemaster: Enable ipip_encapsulation and mh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1242289 (https://phabricator.wikimedia.org/T352956) (owner: 10JMeybohm) [09:12:45] !log klausman@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:ml-cache-eqiad [09:12:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1024.eqiad.wmnet [09:13:11] !log 
klausman@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:ml-cache-codfw [09:14:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1025.eqiad.wmnet [09:14:24] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: wikikube-staging-master-codfw@codfw [09:15:10] RESOLVED: [5x] ProbeDown: Service ganeti1024:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:15:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2073.codfw.wmnet [09:15:42] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2074.codfw.wmnet [09:15:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet [09:16:41] (03CR) 10Slyngshede: [C:03+2] service.yaml: update IPs for ulsfo-lb (text/upload/gerrit/ncredir) [puppet] - 10https://gerrit.wikimedia.org/r/1253506 (https://phabricator.wikimedia.org/T418971) (owner: 10Ssingh) [09:17:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet [09:17:16] 06SRE, 10SRE-swift-storage: upload.wikimedia.org serves .ogg audio files with content-type `application/ogg` instead of `audio/ogg`. - https://phabricator.wikimedia.org/T420422#11721907 (10Arendpieter) The response appears to be coming from a Swift-backed object where the original object metadata is preserved... 
[09:17:18] (03PS1) 10Mszwarc: Enable autodemotion for 2FA-less CN admins and WMF T&S [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254851 (https://phabricator.wikimedia.org/T418580) [09:18:45] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1074.eqiad.wmnet [09:18:49] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1075.eqiad.wmnet [09:19:08] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [09:21:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet [09:22:39] FIRING: TransitBGPDown: Transit BGP session down between cr1-esams and Deutsche Telekom (2001:7f8:1::a500:3320:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr1-esams:9804&var-bgp_group=Transit6&var-bgp_neighbor=Deutsche+Telekom - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:22:57] FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:23:10] !ack [09:23:10] Could not ack the alert. Please check the parameters. 
[09:23:14] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2074.codfw.wmnet [09:23:18] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2075.codfw.wmnet [09:23:46] !incidents [09:23:46] 7770 (UNACKED) [2x] ProbeDown sre (198.35.26.112 ip4 probes/service ulsfo) [09:23:46] 7768 (RESOLVED) NELHigh sre (thanos-rule@main tcp.timed_out) [09:23:46] 7767 (RESOLVED) [2x] ProbeDown sre (dse-k8s-ctrl2001:6443 probes/custom codfw) [09:23:55] !ack 7770 [09:23:55] 7770 (ACKED) [2x] ProbeDown sre (198.35.26.112 ip4 probes/service ulsfo) [09:24:04] hmm !ack all has been changed? [09:24:20] vgutierrez: anything we can help? [09:24:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1025.eqiad.wmnet [09:24:28] nope, it's related to the ulsfo maintenance [09:24:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1025.eqiad.wmnet [09:24:32] all good and expected [09:24:43] grand [09:24:57] 06SRE: upload.wikimedia.org serves .ogg audio files with content-type `application/ogg` instead of `audio/ogg`. 
- https://phabricator.wikimedia.org/T420422#11721927 (10Arendpieter) [09:26:20] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:26:20] RECOVERY - Host cloudgw1003 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [09:26:20] RECOVERY - Host cloudlb1001 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [09:26:26] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1075.eqiad.wmnet [09:26:29] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1076.eqiad.wmnet [09:27:57] jayme@cumin1003 migrate-service-ipip (PID 4054043) is awaiting input [09:27:57] RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:30:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2075.codfw.wmnet [09:30:50] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2076.codfw.wmnet [09:30:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:31:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudlb1001 (172.20.1.2) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:32:45] RESOLVED: TransitBGPDown: Transit BGP session down between cr1-esams and Deutsche Telekom (2001:7f8:1::a500:3320:1) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr1-esams:9804&var-bgp_group=Transit6&var-bgp_neighbor=Deutsche+Telekom - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:34:39] FIRING: TransitBGPDown: Transit BGP session down between cr1-esams and Deutsche Telekom (2001:7f8:1::a500:3320:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr1-esams:9804&var-bgp_group=Transit6&var-bgp_neighbor=Deutsche+Telekom - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:35:30] FIRING: LibericaStaleConfig: Liberica instance lvs4010 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=ulsfo&var-instance=lvs4010 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [09:35:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1026.eqiad.wmnet [09:35:46] liberica alert is ulsfo maintenance, all good [09:35:53] !log jayme@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [09:35:53] !log jayme@cumin1003 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: wikikube-staging-master-codfw@codfw [09:36:19] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1076.eqiad.wmnet [09:36:22] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1077.eqiad.wmnet [09:37:05] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.migrate-service-ipip for 
alias: wikikube-staging-master-eqiad@eqiad [09:37:35] !log klausman@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:ml-cache-codfw [09:37:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2076.codfw.wmnet [09:38:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2077.codfw.wmnet [09:39:44] jmm@cumin2002 drain-node (PID 3676199) is awaiting input [09:39:52] !log jayme@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [09:40:30] FIRING: [3x] LibericaStaleConfig: Liberica instance lvs4008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [09:40:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet [09:40:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb2003.codfw.wmnet [09:40:35] !log slyngshede@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading A:lvs-secondary-ulsfo and A:liberica (T418971) [09:40:41] T418971: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971 [09:40:43] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading A:lvs-secondary-ulsfo and A:liberica (T418971) [09:40:50] !log jayme@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [09:40:50] !log jayme@cumin1003 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: wikikube-staging-master-eqiad@eqiad [09:42:15] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1001.eqiad.wmnet [09:43:38] !log mvernon@cumin1003 
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1077.eqiad.wmnet [09:43:42] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1078.eqiad.wmnet [09:44:13] !log switched wikikube staging apiservers to IPIP and maglev in eqiad and codfw - T352956 [09:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:16] T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956 [09:44:33] !log slyngshede@cumin1003 START - Cookbook sre.loadbalancer.upgrade restart A:lvs-secondary-ulsfo and A:liberica [09:44:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb2003.codfw.wmnet [09:44:37] 06SRE, 10SRE-Access-Requests: Requesting access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11721959 (10Aklapper) @Sarmbruster: Please also [link your LDAP account to your Phabricator account](https://phabricator.wikimedia.org/settings/panel/external/), so your 'LDAP User' ac... 
[09:44:39] RESOLVED: TransitBGPDown: Transit BGP session down between cr1-esams and Deutsche Telekom (2001:7f8:1::a500:3320:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr1-esams:9804&var-bgp_group=Transit6&var-bgp_neighbor=Deutsche+Telekom - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:44:49] !log slyngshede@cumin1003 START - Cookbook sre.loadbalancer.admin depooling P{lvs4010.ulsfo.wmnet} and A:liberica [09:45:01] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs4010.ulsfo.wmnet} and A:liberica [09:45:10] !log slyngshede@cumin1003 START - Cookbook sre.loadbalancer.admin pooling P{lvs4010.ulsfo.wmnet} and A:liberica [09:45:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2077.codfw.wmnet [09:45:20] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2078.codfw.wmnet [09:45:30] FIRING: [3x] LibericaStaleConfig: Liberica instance lvs4008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [09:45:32] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P{lvs4010.ulsfo.wmnet} and A:liberica [09:45:35] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restart A:lvs-secondary-ulsfo and A:liberica [09:45:42] !log installing postgresql-15 security updates [09:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:06] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1001.eqiad.wmnet [09:46:16] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1002.eqiad.wmnet [09:46:33] !log jmm@cumin2002 
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet [09:46:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1026.eqiad.wmnet [09:46:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1027.eqiad.wmnet [09:46:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb1003.eqiad.wmnet [09:48:03] (03PS1) 10Kevin Bazira: ml-services: update gpt isvc image to one that removes fuse_rope_kvcache config to solve P89877 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254856 (https://phabricator.wikimedia.org/T418350) [09:48:37] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1002.eqiad.wmnet [09:48:43] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1003.eqiad.wmnet [09:49:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet [09:50:27] (03CR) 10Ozge: [C:03+1] ml-services: update gpt isvc image to one that removes fuse_rope_kvcache config to solve P89877 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254856 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [09:50:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb1003.eqiad.wmnet [09:51:02] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1078.eqiad.wmnet [09:51:03] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1003.eqiad.wmnet [09:51:05] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1079.eqiad.wmnet [09:51:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2078.codfw.wmnet [09:51:43] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2079.codfw.wmnet [09:52:11] 
(03CR) 10Kevin Bazira: [C:03+2] ml-services: update gpt isvc image to one that removes fuse_rope_kvcache config to solve P89877 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254856 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [09:52:14] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1001.eqiad.wmnet [09:54:18] (03Merged) 10jenkins-bot: ml-services: update gpt isvc image to one that removes fuse_rope_kvcache config to solve P89877 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254856 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [09:55:30] RESOLVED: [3x] LibericaStaleConfig: Liberica instance lvs4008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig [09:55:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet [09:56:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1027.eqiad.wmnet [09:56:47] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[09:57:05] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1001.eqiad.wmnet [09:57:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254216 (https://phabricator.wikimedia.org/T341599) (owner: 10Sergio Gimeno) [09:58:18] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1002.eqiad.wmnet [09:59:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2079.codfw.wmnet [09:59:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2080.codfw.wmnet [09:59:53] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1079.eqiad.wmnet [09:59:57] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1080.eqiad.wmnet [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1000) [10:01:32] !log slyngshede@cumin1003 START - Cookbook sre.hosts.remove-downtime for 23 hosts [10:01:40] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1002.eqiad.wmnet [10:01:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet [10:01:45] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 23 hosts [10:01:53] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2002.codfw.wmnet [10:03:41] !log slyngshede@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool ulsfo [reason: no reason specified, no task ID specified] [10:03:48] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.dns.admin (exit_code=99) DNS admin: pool ulsfo [reason: no reason 
specified, no task ID specified] [10:04:05] !log slyngshede@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool ulsfo [reason: no reason specified, T418971] [10:04:09] T418971: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971 [10:04:10] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool ulsfo [reason: no reason specified, T418971] [10:04:58] RECOVERY - Host ml-serve2001 is UP: PING OK - Packet loss = 0%, RTA = 30.46 ms [10:05:18] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2002.codfw.wmnet [10:05:30] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2001.codfw.wmnet [10:05:32] RESOLVED: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:06:21] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1080.eqiad.wmnet [10:06:25] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1081.eqiad.wmnet [10:06:30] jmm@cumin2002 drain-node (PID 3681874) is awaiting input [10:06:33] (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254863 (https://phabricator.wikimedia.org/T360794) [10:06:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet [10:07:24] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2080.codfw.wmnet [10:07:41] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11722088 (10SLyngshede-WMF) 
05Open→03Resolved @Papaul Done :-) [10:09:10] (03CR) 10Vgutierrez: [C:03+2] geo-resources: update IP addresses for ulsfo services [dns] - 10https://gerrit.wikimedia.org/r/1253503 (https://phabricator.wikimedia.org/T418971) (owner: 10Ssingh) [10:09:38] !log vgutierrez@dns1004 START - running authdns-update [10:10:30] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2001.codfw.wmnet [10:10:38] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2002.codfw.wmnet [10:11:12] (03CR) 10Vgutierrez: [C:03+2] WMCS cloudgw: update IPs for ulsfo-lb (text/upload) [puppet] - 10https://gerrit.wikimedia.org/r/1254830 (https://phabricator.wikimedia.org/T418971) (owner: 10Slyngshede) [10:11:22] !log vgutierrez@dns1004 END - running authdns-update [10:11:41] (03CR) 10DCausse: [C:03+1] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254863 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [10:12:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet [10:13:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1028.eqiad.wmnet [10:13:53] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1081.eqiad.wmnet [10:14:04] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2002.codfw.wmnet [10:14:10] !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2001.codfw.wmnet [10:15:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet [10:16:25] (03PS1) 10Ayounsi: ulsfo: remove old LVS service IPs and range [homer/public] - 10https://gerrit.wikimedia.org/r/1254864 (https://phabricator.wikimedia.org/T418971) [10:16:45] !log jmm@cumin2002 START - 
Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet [10:17:36] !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2001.codfw.wmnet [10:17:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2081.codfw.wmnet [10:17:57] !log vgutierrez@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool ulsfo [reason: no reason specified, no task ID specified] [10:17:59] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool ulsfo [reason: no reason specified, no task ID specified] [10:18:09] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1082.eqiad.wmnet [10:19:04] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-staging2003.codfw.wmnet [10:22:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet [10:23:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet [10:23:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet [10:23:50] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2003.codfw.wmnet [10:24:19] !log btullis@cumin1003 START - Cookbook sre.hadoop.reboot-workers for Hadoop test cluster [10:25:04] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2081.codfw.wmnet [10:25:08] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2082.codfw.wmnet [10:25:24] (03PS1) 10Kgraessle: Deploy Extension:PersonalDashboard to English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254865 (https://phabricator.wikimedia.org/T418367) [10:25:37] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1082.eqiad.wmnet [10:25:40] !log 
mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1083.eqiad.wmnet [10:25:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1030.eqiad.wmnet [10:26:02] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-staging2002.codfw.wmnet [10:29:58] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254863 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [10:30:15] (03CR) 10Muehlenhoff: [C:03+2] Remove ncredir4001/4002 [puppet] - 10https://gerrit.wikimedia.org/r/1253538 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [10:30:49] (03CR) 10Vgutierrez: [C:03+1] ulsfo: remove old LVS service IPs and range [homer/public] - 10https://gerrit.wikimedia.org/r/1254864 (https://phabricator.wikimedia.org/T418971) (owner: 10Ayounsi) [10:31:10] (03PS2) 10Muehlenhoff: Remove support for old Elastic releases [puppet] - 10https://gerrit.wikimedia.org/r/1247917 (https://phabricator.wikimedia.org/T388607) [10:31:19] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2002.codfw.wmnet [10:31:49] (03CR) 10Kamila Součková: [C:03+1] rest-gateway rate limit: add BYPASS and DENY policy and class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250598 (owner: 10Daniel Kinzler) [10:32:04] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet [10:32:06] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254863 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton) [10:32:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1030.eqiad.wmnet [10:32:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1030.eqiad.wmnet 
[10:32:31] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2082.codfw.wmnet [10:32:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1031.eqiad.wmnet [10:32:35] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2083.codfw.wmnet [10:32:36] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1083.eqiad.wmnet [10:32:39] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1084.eqiad.wmnet [10:33:11] (03CR) 10Blake: "Sounds good, I'll make a note to deploy this on Monday." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251045 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake) [10:34:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1031.eqiad.wmnet [10:34:46] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:34:54] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:37:29] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2001.codfw.wmnet [10:39:42] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1084.eqiad.wmnet [10:39:45] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1085.eqiad.wmnet [10:39:47] !log fabfur@cumin1003 START - Cookbook sre.dns.netbox [10:40:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1031.eqiad.wmnet [10:40:48] (03CR) 10Kamila Součková: [C:03+1] "typo inline, otherwise LGTM, though I didn't nitpick the tests code" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [10:40:49] !log 
mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2083.codfw.wmnet [10:40:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1031.eqiad.wmnet [10:40:53] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2084.codfw.wmnet [10:40:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet [10:43:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet [10:44:57] (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254867 [10:45:21] (03CR) 10Muehlenhoff: [C:03+2] Install systemd-timesyncd universally [puppet] - 10https://gerrit.wikimedia.org/r/1243756 (owner: 10Muehlenhoff) [10:45:54] (03PS1) 10Ayounsi: Add bvibber to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1254868 (https://phabricator.wikimedia.org/T420406) [10:46:31] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1085.eqiad.wmnet [10:46:35] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1086.eqiad.wmnet [10:47:13] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11722329 (10ayounsi) @bvibber you can read and sign the L3 at the end of https://phabricator.wikimedia.org/L3 I don't see your email in the signat... 
[10:47:51] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11722330 (10ayounsi) [10:48:33] (03CR) 10Muehlenhoff: [C:03+2] Remove profile to build Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1243828 (owner: 10Muehlenhoff) [10:49:04] fabfur@cumin1003 netbox (PID 4067230) is awaiting input [10:49:49] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2084.codfw.wmnet [10:49:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1032.eqiad.wmnet [10:49:54] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2085.codfw.wmnet [10:49:59] (03Abandoned) 10Jgiannelos: Remove duplicate definition of site.v1.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1234927 (https://phabricator.wikimedia.org/T415877) (owner: 10Jgiannelos) [10:50:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet [10:50:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet [10:50:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1033.eqiad.wmnet [10:53:17] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1086.eqiad.wmnet [10:53:20] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1087.eqiad.wmnet [10:53:42] (03PS1) 10Vgutierrez: Remove deprecated 198.35.26.240/28 include [dns] - 10https://gerrit.wikimedia.org/r/1254869 [10:54:36] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: per-route jwt overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler) [10:54:39] (03CR) 10CI reject: [V:04-1] Remove deprecated 198.35.26.240/28 include [dns] - 
10https://gerrit.wikimedia.org/r/1254869 (owner: 10Vgutierrez) [10:55:10] (03CR) 10Ayounsi: [C:03+1] Remove deprecated 198.35.26.240/28 include [dns] - 10https://gerrit.wikimedia.org/r/1254869 (owner: 10Vgutierrez) [10:56:23] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [10:56:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2085.codfw.wmnet [10:56:44] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2086.codfw.wmnet [10:56:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1033.eqiad.wmnet [10:57:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1033.eqiad.wmnet [10:57:10] !log fabfur@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:57:21] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1087.eqiad.wmnet [10:57:25] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1088.eqiad.wmnet [10:58:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet [10:58:32] (03PS2) 10Vgutierrez: Refresh 198.35.26.0 includes [dns] - 10https://gerrit.wikimedia.org/r/1254869 [10:59:16] !log btullis@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-jumbo-eqiad [10:59:27] (03CR) 10CI reject: [V:04-1] Refresh 198.35.26.0 includes [dns] - 10https://gerrit.wikimedia.org/r/1254869 (owner: 10Vgutierrez) [10:59:48] (03CR) 10Ayounsi: [C:03+1] Refresh 198.35.26.0 includes [dns] - 10https://gerrit.wikimedia.org/r/1254869 (owner: 10Vgutierrez) [11:00:04] (03PS9) 10Blake: sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase [cookbooks] - 10https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) [11:00:05] mvolz: #bothumor 
I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1100). [11:00:06] !log vgutierrez@cumin1003 START - Cookbook sre.dns.netbox [11:02:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1034.eqiad.wmnet [11:03:03] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:03:17] (03CR) 10Vgutierrez: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1254869 (owner: 10Vgutierrez) [11:03:19] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1253655 (owner: 10Herron) [11:03:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2086.codfw.wmnet [11:03:47] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2087.codfw.wmnet [11:04:17] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1088.eqiad.wmnet [11:04:18] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1026.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:04:21] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1089.eqiad.wmnet [11:05:13] FIRING: [2x] CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:05:17] !log btullis@cumin1003 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop test cluster [11:05:49] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1026.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:05:50] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11722408 (10MoritzMuehlenhoff) [11:06:44] (03PS3) 10Muehlenhoff: 
thumbor-plugins: Stop using pkg_resources [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1243135 [11:06:55] (03PS3) 10Vgutierrez: Refresh ulsfo includes [dns] - 10https://gerrit.wikimedia.org/r/1254869 [11:07:26] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [11:08:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1034.eqiad.wmnet [11:08:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1034.eqiad.wmnet [11:10:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2087.codfw.wmnet [11:10:33] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2088.codfw.wmnet [11:10:51] (03CR) 10Ayounsi: [C:03+1] Refresh ulsfo includes [dns] - 10https://gerrit.wikimedia.org/r/1254869 (owner: 10Vgutierrez) [11:11:07] (03CR) 10Vgutierrez: [C:03+2] Refresh ulsfo includes [dns] - 10https://gerrit.wikimedia.org/r/1254869 (owner: 10Vgutierrez) [11:11:17] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1089.eqiad.wmnet [11:11:21] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1090.eqiad.wmnet [11:11:30] !log vgutierrez@dns1004 START - running authdns-update [11:12:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1035.eqiad.wmnet [11:13:03] !log vgutierrez@dns1004 END - running authdns-update [11:14:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet [11:15:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1035.eqiad.wmnet [11:16:50] (03CR) 10Blake: sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) (owner: 
10Blake) [11:17:25] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2088.codfw.wmnet [11:17:30] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2089.codfw.wmnet [11:18:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet [11:18:24] !log vgutierrez@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool ulsfo [reason: no reason specified, no task ID specified] [11:18:28] !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool ulsfo [reason: no reason specified, no task ID specified] [11:18:33] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1090.eqiad.wmnet [11:18:36] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1091.eqiad.wmnet [11:20:02] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1015 [11:20:16] !log btullis@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host dse-k8s-worker1015 [11:22:38] 06SRE, 10MinT, 10Prod-Kubernetes, 06ServiceOps new, and 3 others: Can't deploy machinetranslation due to exceeding resource quotas - https://phabricator.wikimedia.org/T411058#11722430 (10Clement_Goubert) >>! In T411058#11721747, @KartikMistry wrote: > @RLazarus After reducing `replicas`, I was able to depl... 
[11:22:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1035.eqiad.wmnet [11:22:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1035.eqiad.wmnet [11:23:40] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1091.eqiad.wmnet [11:23:41] (03PS15) 10Daniel Kinzler: rest-gateway rate limiting: add CORS headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) [11:23:44] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1092.eqiad.wmnet [11:23:53] (03PS11) 10Daniel Kinzler: rest-gateway: per-route jwt overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) [11:24:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2089.codfw.wmnet [11:24:15] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2090.codfw.wmnet [11:25:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1036.eqiad.wmnet [11:26:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet [11:27:16] jouncebot: nowandnext [11:27:16] For the next 0 hour(s) and 32 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1100) [11:27:16] In 1 hour(s) and 32 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1300) [11:27:38] (03PS1) 10Mszwarc: Tweak configuration of external link aggregate usage analysis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254876 (https://phabricator.wikimedia.org/T419837) [11:28:34] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [11:28:58] !log jgiannelos@deploy2002 
helmfile [staging] DONE helmfile.d/services/mobileapps: apply [11:29:07] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [11:29:28] (03PS1) 10Filippo Giunchedi: rabbitmq: set pause_minority for cluster_partition_handling [puppet] - 10https://gerrit.wikimedia.org/r/1254877 (https://phabricator.wikimedia.org/T418444) [11:29:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1036.eqiad.wmnet [11:29:50] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [11:30:02] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [11:30:10] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [11:30:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet [11:30:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:30:46] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [11:30:53] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1347.eqiad.wmnet [11:30:56] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2090.codfw.wmnet [11:30:58] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1092.eqiad.wmnet [11:31:00] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2091.codfw.wmnet [11:31:29] (03CR) 10Filippo Giunchedi: "Deployment plan:" [puppet] - 10https://gerrit.wikimedia.org/r/1254877 
(https://phabricator.wikimedia.org/T418444) (owner: 10Filippo Giunchedi) [11:31:31] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [11:34:53] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1020.eqiad.wmnet with OS bookworm [11:35:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 13Patch-For-Review: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11722471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host dse-k8s-... [11:35:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11722473 (10Jclark-ctr) a:03Jclark-ctr [11:35:37] (03PS1) 10Btullis: Switch some of the dse-k8s-worker hosts to/from insetup [puppet] - 10https://gerrit.wikimedia.org/r/1254880 (https://phabricator.wikimedia.org/T418582) [11:35:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:36:11] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1347.eqiad.wmnet [11:37:25] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: hw troubleshooting: Comm Error: Backplane 0 for wikikube-worker1307.eqiad.wmnet - https://phabricator.wikimedia.org/T420389#11722493 (10Clement_Goubert) 05Open→03In progress a:05VRiley-WMF→03Clement_Goubert Yep looking good.... 
[11:37:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet [11:37:36] (03PS1) 10Ladsgroup: DjvuHandler: Make it follow thumb steps [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254881 (https://phabricator.wikimedia.org/T402792) [11:37:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1036.eqiad.wmnet [11:37:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1036.eqiad.wmnet [11:37:46] (03PS1) 10Ladsgroup: Make it follow thumb steps [extensions/PagedTiffHandler] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254882 (https://phabricator.wikimedia.org/T402792) [11:37:53] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2091.codfw.wmnet [11:37:53] (03CR) 10Ladsgroup: [C:03+2] DjvuHandler: Make it follow thumb steps [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254881 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup) [11:37:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1037.eqiad.wmnet [11:37:57] (03CR) 10Ladsgroup: [C:03+2] Make it follow thumb steps [extensions/PagedTiffHandler] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254882 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup) [11:38:08] (03PS1) 10Ladsgroup: Make it follow thumb steps [extensions/PagedTiffHandler] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254883 (https://phabricator.wikimedia.org/T402792) [11:38:16] (03PS1) 10Ladsgroup: DjvuHandler: Make it follow thumb steps [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254884 (https://phabricator.wikimedia.org/T402792) [11:38:30] (03CR) 10Ladsgroup: [C:03+2] DjvuHandler: Make it follow thumb steps [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254884 
(https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup) [11:38:35] (03CR) 10Ladsgroup: [C:03+2] Make it follow thumb steps [extensions/PagedTiffHandler] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254883 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup) [11:39:01] (03CR) 10Btullis: [C:03+2] Switch some of the dse-k8s-worker hosts to/from insetup [puppet] - 10https://gerrit.wikimedia.org/r/1254880 (https://phabricator.wikimedia.org/T418582) (owner: 10Btullis) [11:39:35] !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox [11:39:58] FIRING: [2x] CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:40:22] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway rate limit: add BYPASS and DENY policy and class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250598 (owner: 10Daniel Kinzler) [11:40:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet [11:40:35] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway rate limiting: add CORS headers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [11:41:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet [11:42:16] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [11:42:30] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:42:39] (03Merged) 10jenkins-bot: rest-gateway rate limit: add BYPASS and DENY policy and class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250598 (owner: 10Daniel Kinzler) [11:42:41] (03Merged) 10jenkins-bot: rest-gateway rate limiting: add CORS headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 
(https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler) [11:44:32] !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox [11:44:43] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11722551 (10MatthewVernon) >>! In T414805#11721374, @Ladsgroup wrote: >>>! In T414805#11682308, @MatthewVernon wrote: >> @Ladsgroup t... [11:45:07] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [11:46:05] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1027.eqiad.wmnet with OS bookworm [11:46:05] (03Abandoned) 10Sergio Gimeno: [Growth] Remove get-started notification variant delays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176254 (owner: 10Sergio Gimeno) [11:46:18] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1015.eqiad.wmnet with reason: host reimage [11:47:13] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:47:18] (03PS1) 10Elukey: dse-k8s-services: update the base Airflow image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254887 (https://phabricator.wikimedia.org/T402512) [11:47:28] !log sudo homer lsw1-e5-eqiad* commit 'wikikube-worker1307 to active' [11:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:57] (03CR) 10Elukey: "Tested in my airflow dev environment, all good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1254887 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [11:48:04] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1307.eqiad.wmnet [11:48:05] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1307.eqiad.wmnet [11:48:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1037.eqiad.wmnet [11:48:22] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:48:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1037.eqiad.wmnet [11:48:37] 10ops-eqiad, 06DC-Ops, 06ServiceOps new, 10ServiceOps-Upgrades-Hardware: hw troubleshooting: Comm Error: Backplane 0 for wikikube-worker1307.eqiad.wmnet - https://phabricator.wikimedia.org/T420389#11722560 (10Clement_Goubert) 05In progress→03Resolved Host back Active and repooled, resolving. [11:48:41] (03PS1) 10Harroyo-wmf: Reapply "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254889 (https://phabricator.wikimedia.org/T419125) [11:48:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1038.eqiad.wmnet [11:49:04] 06SRE, 10MinT, 10Prod-Kubernetes, 06ServiceOps new, and 3 others: Can't deploy machinetranslation due to exceeding resource quotas - https://phabricator.wikimedia.org/T411058#11722566 (10KartikMistry) Pods look fine so far: ` kartik@deploy2002:/srv/deployment-charts/helmfile.d/services/machinetranslation... 
[11:49:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1015.eqiad.wmnet with reason: host reimage [11:49:40] (03PS1) 10SomeRandomDeveloper: Revert "SpecialPreferences: Use Language Select Widget in language field" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254890 (https://phabricator.wikimedia.org/T419895) [11:49:40] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Updating for dse-k8s-worker1012 - btullis@cumin1003" [11:50:18] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Updating for dse-k8s-worker1012 - btullis@cumin1003" [11:50:38] (03PS1) 10SomeRandomDeveloper: Revert "SpecialPreferences: Use Language Select Widget in language field" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254891 (https://phabricator.wikimedia.org/T419895) [11:50:55] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:51:09] (03CR) 10CI reject: [V:04-1] DjvuHandler: Make it follow thumb steps [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254881 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup) [11:51:13] (03PS1) 10Sergio Gimeno: loggedOutWarning: dont set the schema for experiment events [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254894 (https://phabricator.wikimedia.org/T420451) [11:51:30] (03PS1) 10Sergio Gimeno: loggedOutWarning: dont set the schema for experiment events [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254895 (https://phabricator.wikimedia.org/T420451) [11:51:34] kubestagemaster1003 will go down for a Ganeti reboot [11:51:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1038.eqiad.wmnet [11:51:54] (03CR) 10Ladsgroup: [C:03+2] "..." 
[core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254881 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup) [11:52:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254895 (https://phabricator.wikimedia.org/T420451) (owner: 10Sergio Gimeno) [11:52:05] (03Merged) 10jenkins-bot: DjvuHandler: Make it follow thumb steps [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254881 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup) [11:52:12] (03Merged) 10jenkins-bot: Make it follow thumb steps [extensions/PagedTiffHandler] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254882 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup) [11:52:19] (03Merged) 10jenkins-bot: DjvuHandler: Make it follow thumb steps [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254884 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup) [11:52:27] (03Merged) 10jenkins-bot: Make it follow thumb steps [extensions/PagedTiffHandler] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254883 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup) [11:52:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254894 (https://phabricator.wikimedia.org/T420451) (owner: 10Sergio Gimeno) [11:53:58] PROBLEM - Host kubestagemaster1003 is DOWN: PING CRITICAL - Packet loss = 100% [11:54:18] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1027.eqiad.wmnet with OS bookworm [11:54:51] !log btullis@cumin1003 START - Cookbook 
sre.hosts.provision for host dse-k8s-worker1027.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:54:58] RESOLVED: [2x] CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:55:18] 06SRE, 10SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458 (10MPostoronca-WMF) 03NEW [11:55:39] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1254883|Make it follow thumb steps (T402792 T414805)]], [[gerrit:1254884|DjvuHandler: Make it follow thumb steps (T402792 T414805 T416620 T418178)]], [[gerrit:1254882|Make it follow thumb steps (T402792 T414805)]], [[gerrit:1254881|DjvuHandler: Make it follow thumb steps (T402792 T414805 T416620 T418178)]] [11:55:49] T402792: Consider rate limiting non-standard thumbnail sizes - https://phabricator.wikimedia.org/T402792 [11:55:49] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [11:55:49] T416620: Make ProofreadPage follow thumb steps - https://phabricator.wikimedia.org/T416620 [11:55:49] T418178: imageinfo API requests for DJVU files don't follow thumbnail steps, allows upscaling - https://phabricator.wikimedia.org/T418178 [11:55:56] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1027.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:56:36] !log cgoubert@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-main-codfw [11:56:43] !log blake@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[1328-1372].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [11:57:09] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T420459 
(10katiamusiolekwmde) 03NEW [11:57:28] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1027.eqiad.wmnet with OS bookworm [11:57:42] (03CR) 10Urbanecm: [C:03+1] "LGTM. DBAs acknowledged this and are okay with the experiment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254216 (https://phabricator.wikimedia.org/T341599) (owner: 10Sergio Gimeno) [11:57:48] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1254883|Make it follow thumb steps (T402792 T414805)]], [[gerrit:1254884|DjvuHandler: Make it follow thumb steps (T402792 T414805 T416620 T418178)]], [[gerrit:1254882|Make it follow thumb steps (T402792 T414805)]], [[gerrit:1254881|DjvuHandler: Make it follow thumb steps (T402792 T414805 T416620 T418178)]] synced to the testservers (see https://wikitech.wikimedia. [11:57:48] org/wiki/Mwdebug). Changes can now be verified there. [11:57:58] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1012.eqiad.wmnet [11:58:20] FIRING: [3x] ProbeDown: Service ganeti1037:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:58:31] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [11:59:42] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1026.eqiad.wmnet with reason: host reimage [12:00:01] (03PS1) 10Ayounsi: Add suecarmol shell + add to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1254896 (https://phabricator.wikimedia.org/T419932) [12:00:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1038.eqiad.wmnet [12:00:25] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for SCardenas (WMF) - https://phabricator.wikimedia.org/T419932#11722662 (10ayounsi) [12:00:29] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1038.eqiad.wmnet [12:00:32] FIRING: KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:00:34] RECOVERY - Host kubestagemaster1003 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [12:00:35] FIRING: [4x] ProbeDown: Service ganeti1037:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:01:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1039.eqiad.wmnet [12:01:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1039.eqiad.wmnet [12:01:42] (03PS2) 10Ayounsi: Add bvibber to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1254868 (https://phabricator.wikimedia.org/T420406) [12:02:09] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-eqiad [12:02:11] (03CR) 10MVernon: [C:03+1] Add suecarmol shell + add to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1254896 (https://phabricator.wikimedia.org/T419932) (owner: 10Ayounsi) [12:02:25] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1026.eqiad.wmnet with reason: host reimage [12:02:27] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254883|Make it follow thumb steps (T402792 T414805)]], [[gerrit:1254884|DjvuHandler: Make it follow thumb steps (T402792 T414805 T416620 T418178)]], [[gerrit:1254882|Make it follow thumb steps (T402792 T414805)]], 
[[gerrit:1254881|DjvuHandler: Make it follow thumb steps (T402792 T414805 T416620 T418178)]] (duration: 06m 48s) [12:02:35] T402792: Consider rate limiting non-standard thumbnail sizes - https://phabricator.wikimedia.org/T402792 [12:02:35] T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 [12:02:36] T416620: Make ProofreadPage follow thumb steps - https://phabricator.wikimedia.org/T416620 [12:02:36] T418178: imageinfo API requests for DJVU files don't follow thumbnail steps, allows upscaling - https://phabricator.wikimedia.org/T418178 [12:03:00] (03CR) 10Ayounsi: [C:03+2] Add suecarmol shell + add to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1254896 (https://phabricator.wikimedia.org/T419932) (owner: 10Ayounsi) [12:03:13] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:03:20] RESOLVED: [4x] ProbeDown: Service ganeti1037:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:04:16] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:05:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254851 (https://phabricator.wikimedia.org/T418580) (owner: 10Mszwarc) [12:05:32] RESOLVED: KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:05:48] !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-eqiad [12:05:58] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [12:06:38] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for SCardenas (WMF) - https://phabricator.wikimedia.org/T419932#11722683 (10ayounsi) 05Open→03Resolved Change is merged, you should be good to go in the next ~30min. Please re-open if any issues. [12:06:39] (03Merged) 10jenkins-bot: Enable autodemotion for 2FA-less CN admins and WMF T&S [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254851 (https://phabricator.wikimedia.org/T418580) (owner: 10Mszwarc) [12:06:59] (03CR) 10Ayounsi: [C:03+2] ulsfo: remove old LVS service IPs and range [homer/public] - 10https://gerrit.wikimedia.org/r/1254864 (https://phabricator.wikimedia.org/T418971) (owner: 10Ayounsi) [12:07:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1039.eqiad.wmnet [12:07:08] (03CR) 10Btullis: [C:03+1] "Looks good to me. Thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1254887 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [12:07:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1039.eqiad.wmnet [12:07:09] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1254851|Enable autodemotion for 2FA-less CN admins and WMF T&S (T418580)]] [12:07:18] T418580: Deploy 2FA requirement using $wgRestrictedGroups to Wikimedia production, instead of OATHAuth's custom config - https://phabricator.wikimedia.org/T418580 [12:08:25] (03Merged) 10jenkins-bot: ulsfo: remove old LVS service IPs and range [homer/public] - 10https://gerrit.wikimedia.org/r/1254864 (https://phabricator.wikimedia.org/T418971) (owner: 10Ayounsi) [12:09:03] btullis@cumin1003 reimage (PID 4075940) is awaiting input [12:09:15] !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1254851|Enable autodemotion for 2FA-less CN admins and WMF T&S (T418580)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:09:36] !log mszwarc@deploy2002 mszwarc: Continuing with sync [12:10:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-eqiad [12:10:24] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:10:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1040.eqiad.wmnet [12:10:58] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:12:52] (03CR) 10AOkoth: [C:03+2] miscweb: add wmf-navigator values - empty httpd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253489 (https://phabricator.wikimedia.org/T414405) (owner: 10AOkoth) [12:13:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host serpens.wikimedia.org [12:13:30] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254851|Enable autodemotion for 2FA-less CN admins and WMF T&S (T418580)]] (duration: 06m 21s) [12:13:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:14:26] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:14:32] T418580: Deploy 2FA requirement using $wgRestrictedGroups to Wikimedia production, instead of OATHAuth's custom config - https://phabricator.wikimedia.org/T418580 [12:14:50] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, 
wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB [12:15:25] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1254868 (https://phabricator.wikimedia.org/T420406) (owner: 10Ayounsi) [12:15:26] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:15:35] jclark@cumin1003 reimage (PID 4076099) is awaiting input [12:15:50] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:16:11] (03PS12) 10Daniel Kinzler: rest-gateway: per-route jwt overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) [12:16:16] (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: per-route jwt overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler) [12:16:23] jmm@cumin2002 drain-node (PID 3707861) is awaiting input [12:16:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host serpens.wikimedia.org [12:17:15] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T420459#11722698 (10WMDE-leszek) I approve this request on WMDE's behalf. Thank you! 
[12:17:45] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11722699 (10WMDE-leszek) [12:18:29] (03Merged) 10jenkins-bot: rest-gateway: per-route jwt overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler) [12:19:16] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [12:19:58] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:20:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1040.eqiad.wmnet [12:21:33] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:22:21] btullis@cumin1003 reimage (PID 4076647) is awaiting input [12:22:36] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [12:22:36] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm [12:23:26] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:23:50] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1019.eqiad.wmnet 
are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:24:28] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update for dse-k8s-worker1015 - btullis@cumin1003" [12:24:48] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update for dse-k8s-worker1015 - btullis@cumin1003" [12:25:00] !log btullis@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [12:25:00] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm [12:25:22] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1027.eqiad.wmnet with OS bookworm [12:25:26] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:25:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1040.eqiad.wmnet [12:25:38] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1020.eqiad.wmnet with reason: host reimage [12:25:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1040.eqiad.wmnet [12:25:50] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:27:10] (03PS1) 10Muehlenhoff: installserver::dhcp: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1254904 [12:27:40] (03CR) 10CI reject: [V:04-1] installserver::dhcp: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1254904 (owner: 10Muehlenhoff) [12:27:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node 
ganeti1041.eqiad.wmnet [12:28:14] (03PS1) 10Btullis: Put dse-k8s-worker10[15,26] into service [puppet] - 10https://gerrit.wikimedia.org/r/1254905 (https://phabricator.wikimedia.org/T418582) [12:28:36] (03PS2) 10Muehlenhoff: installserver::dhcp: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1254904 [12:29:31] (03CR) 10Btullis: [C:03+2] Put dse-k8s-worker10[15,26] into service [puppet] - 10https://gerrit.wikimedia.org/r/1254905 (https://phabricator.wikimedia.org/T418582) (owner: 10Btullis) [12:30:01] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1020.eqiad.wmnet with reason: host reimage [12:31:07] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:32:48] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:32:53] jmm@cumin2002 drain-node (PID 3712455) is awaiting input [12:33:48] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [12:35:21] (03PS1) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) [12:35:44] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{wikikube-worker[1328-1372].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [12:35:52] (03CR) 10CI reject: [V:04-1] mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [12:36:55] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:37:49] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:37:50] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [12:38:04] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254904 
(owner: 10Muehlenhoff) [12:38:39] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:39:03] (03CR) 10Elukey: [C:03+2] dse-k8s-services: update the base Airflow image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254887 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey) [12:39:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:39:22] (03PS1) 10Btullis: Update linux-base when installing backported kernel on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1254908 (https://phabricator.wikimedia.org/T418582) [12:39:29] (03PS1) 10Ayounsi: ulsfo routed ganeti: add public range [puppet] - 10https://gerrit.wikimedia.org/r/1254909 (https://phabricator.wikimedia.org/T418993) [12:41:10] (03CR) 10Muehlenhoff: [C:03+1] "Looks good. Note that we'll also need to update the 6.12 backport soon, but the kernel update isn't signed yet (that is a step which needs" [puppet] - 10https://gerrit.wikimedia.org/r/1254908 (https://phabricator.wikimedia.org/T418582) (owner: 10Btullis) [12:41:28] (03CR) 10CI reject: [V:04-1] Update linux-base when installing backported kernel on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1254908 (https://phabricator.wikimedia.org/T418582) (owner: 10Btullis) [12:42:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-jumbo-eqiad [12:42:29] (03PS2) 10Btullis: Update linux-base when installing backported kernel on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1254908 (https://phabricator.wikimedia.org/T418582) [12:43:09] (03CR) 10Btullis: "Got it. Thanks. I will be on the lookout for it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1254908 (https://phabricator.wikimedia.org/T418582) (owner: 10Btullis) [12:43:27] (03PS2) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) [12:43:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1041.eqiad.wmnet [12:43:58] (03CR) 10Btullis: [C:03+1] "Looks good to me. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1243820 (owner: 10Muehlenhoff) [12:44:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:44:13] kubestagemaster1004, dse-k8s-etcd1002 will go down for a Ganeti reboot [12:44:42] ayounsi@cumin1003 netbox (PID 4090814) is awaiting input [12:45:03] PROBLEM - Host dse-k8s-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [12:45:29] PROBLEM - Host kubestagemaster1004 is DOWN: PING CRITICAL - Packet loss = 100% [12:45:34] (03CR) 10CI reject: [V:04-1] mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [12:46:52] (03CR) 10Btullis: [C:03+1] "Nice. Thanks for this. I have checked and I'm certain that it's not used anywhere." 
[puppet] - 10https://gerrit.wikimedia.org/r/1242407 (owner: 10Muehlenhoff) [12:46:52] (03PS1) 10Ayounsi: public1-virtual-ulsfo: add missing v6 PTR [dns] - 10https://gerrit.wikimedia.org/r/1254910 (https://phabricator.wikimedia.org/T418993) [12:47:36] (03CR) 10Btullis: [C:03+2] Update linux-base when installing backported kernel on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1254908 (https://phabricator.wikimedia.org/T418582) (owner: 10Btullis) [12:47:39] PROBLEM - Host mc1039 is DOWN: PING CRITICAL - Packet loss = 100% [12:47:41] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-main-codfw [12:48:13] (03PS3) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) [12:48:37] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [12:49:08] (03PS1) 10Jforrester: Restore quotation-marks in ext.wikilambda.app messages [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254911 [12:49:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1041.eqiad.wmnet [12:49:29] (03PS2) 10Jforrester: Restore quotation-marks in ext.wikilambda.app messages [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254911 (https://phabricator.wikimedia.org/T420456) [12:49:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1041.eqiad.wmnet [12:50:02] FIRING: [2x] ProbeDown: Service kubestagemaster1004:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:50:19] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1027.eqiad.wmnet with OS bookworm [12:50:31] RECOVERY - Host kubestagemaster1004 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [12:50:32] FIRING: KubernetesCalicoDown: kubestagemaster1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1004.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:50:40] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [12:50:53] RECOVERY - Host dse-k8s-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 2.94 ms [12:51:02] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1015.eqiad.wmnet [12:51:38] (03CR) 10Cathal Mooney: [C:03+1] public1-virtual-ulsfo: add missing v6 PTR (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1254910 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [12:52:23] (03PS2) 10Ayounsi: public1-virtual-ulsfo: add missing v6 PTR [dns] - 10https://gerrit.wikimedia.org/r/1254910 (https://phabricator.wikimedia.org/T418993) [12:52:44] (03CR) 10Muehlenhoff: [C:03+2] matomo: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1243820 (owner: 10Muehlenhoff) [12:52:45] (03CR) 10Cathal Mooney: [C:03+1] public1-virtual-ulsfo: add missing v6 PTR [dns] - 10https://gerrit.wikimedia.org/r/1254910 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [12:53:30] (03CR) 10Ayounsi: [C:03+2] public1-virtual-ulsfo: add missing v6 PTR [dns] - 10https://gerrit.wikimedia.org/r/1254910 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [12:53:42] !log ayounsi@cumin1003 END (FAIL) - Cookbook 
sre.dns.netbox (exit_code=99) [12:53:44] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1012.eqiad.wmnet [12:53:45] jclark@cumin1003 reimage (PID 4076099) is awaiting input [12:54:07] !log ayounsi@dns1004 START - running authdns-update [12:54:31] RECOVERY - Host mc1039 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [12:54:36] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1012.eqiad.wmnet [12:54:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestage1004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage1004 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:55:01] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:55:02] RESOLVED: [2x] ProbeDown: Service kubestagemaster1004:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:55:05] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[12:55:09] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254909 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [12:55:32] RESOLVED: KubernetesCalicoDown: kubestagemaster1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1004.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:55:38] !log ayounsi@dns1004 END - running authdns-update [12:55:42] (03PS1) 10Jcrespo: mediabackup: Deploy new ms-backup hosts on both dcs [puppet] - 10https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464) [12:56:57] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1015.eqiad.wmnet [12:57:11] (03PS1) 10Muehlenhoff: Add install4004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1254915 (https://phabricator.wikimedia.org/T418993) [12:57:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1042.eqiad.wmnet [12:57:52] (03CR) 10Kosta Harlan: [C:03+1] Tweak configuration of external link aggregate usage analysis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254876 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc) [12:57:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254891 (https://phabricator.wikimedia.org/T419895) (owner: 10SomeRandomDeveloper) [12:58:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254890 
(https://phabricator.wikimedia.org/T419895) (owner: 10SomeRandomDeveloper) [12:58:53] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [12:58:54] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1020.eqiad.wmnet with OS bookworm [12:59:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 13Patch-For-Review: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11722891 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host dse-k8s-work... [12:59:48] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1254909 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1300). [13:00:05] Sergi0 and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] (03CR) 10Ayounsi: [C:03+2] ulsfo routed ganeti: add public range [puppet] - 10https://gerrit.wikimedia.org/r/1254909 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [13:00:17] o/ [13:00:23] I can’t deploy, I’m in a meeting, sorry [13:00:32] I can self-deploy [13:00:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1012.eqiad.wmnet [13:00:34] hey. 
i can't deploy myself, i'd appreciate if someone could ship it [13:00:50] (03PS1) 10Mszwarc: Normalize external domain names in click analysis [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254916 (https://phabricator.wikimedia.org/T419837) [13:00:53] @MatmaRex can do [13:01:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248095 (owner: 10Bartosz Dziewoński) [13:01:28] might also add this no-op change while we're here ^ [13:01:37] (03CR) 10Ayounsi: [C:03+1] "nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1254904 (owner: 10Muehlenhoff) [13:02:19] (03CR) 10Ayounsi: [C:03+1] Add install4004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1254915 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [13:02:20] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [13:02:29] (03PS2) 10Mszwarc: Normalize external domain names in click analysis [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254917 (https://phabricator.wikimedia.org/T419837) [13:02:55] ack [13:03:13] I'll do first wmf19/20 then config [13:03:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:03:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254876 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc) [13:04:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC afternoon 
backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254917 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc) [13:04:13] ml-etcd1001 will go down for a Ganeti reboot [13:04:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1042.eqiad.wmnet [13:04:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254916 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc) [13:04:36] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1027.eqiad.wmnet with reason: host reimage [13:04:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage1004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage1004 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:04:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254894 (https://phabricator.wikimedia.org/T420451) (owner: 10Sergio Gimeno) [13:04:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254895 (https://phabricator.wikimedia.org/T420451) (owner: 10Sergio Gimeno) [13:04:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254891 (https://phabricator.wikimedia.org/T419895) (owner: 10SomeRandomDeveloper) [13:04:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
sgimeno@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254890 (https://phabricator.wikimedia.org/T419895) (owner: 10SomeRandomDeveloper) [13:05:00] FYI: I scheduled a few patches, I can self-deploy them when you both are done. Just ping me :) [13:05:33] @Msz2001 ack [13:05:51] (03PS4) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) [13:06:16] PROBLEM - Host ml-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:06:19] (03CR) 10Muehlenhoff: [C:03+2] installserver::dhcp: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1254904 (owner: 10Muehlenhoff) [13:06:19] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt - jclark@cumin1003" [13:06:21] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [13:06:54] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1016 [13:07:20] !log jclark@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt - jclark@cumin1003" [13:07:21] !log jclark@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:08:25] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1016 [13:08:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1027.eqiad.wmnet with reason: host reimage [13:09:44] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2092.codfw.wmnet [13:09:44] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host ganeti1042.eqiad.wmnet [13:09:52] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1093.eqiad.wmnet [13:10:06] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [13:10:17] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:10:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1042.eqiad.wmnet [13:10:29] (03Merged) 10jenkins-bot: loggedOutWarning: dont set the schema for experiment events [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254894 (https://phabricator.wikimedia.org/T420451) (owner: 10Sergio Gimeno) [13:10:32] RECOVERY - Host ml-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [13:10:55] (03Merged) 10jenkins-bot: loggedOutWarning: dont set the schema for experiment events [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254895 (https://phabricator.wikimedia.org/T420451) (owner: 10Sergio Gimeno) [13:11:45] (03PS5) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) [13:11:54] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [13:12:07] (03PS1) 10Muehlenhoff: firewall::dhcp: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1254919 [13:12:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1043.eqiad.wmnet [13:13:06] (03PS2) 10Jcrespo: mediabackup: Deploy new ms-backup hosts on both dcs [puppet] - 10https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464) [13:14:23] (03PS1) 10Daniel Kinzler: rest gateway: merge authed-other into authed-bot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254921 
(https://phabricator.wikimedia.org/T420467) [13:15:13] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt - jclark@cumin1003" [13:15:42] (03PS6) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) [13:15:44] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [13:15:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt - jclark@cumin1003" [13:15:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:16:00] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1093.eqiad.wmnet [13:16:03] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1094.eqiad.wmnet [13:16:07] (03CR) 10Muehlenhoff: [C:03+2] firewall::dhcp: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1254919 (owner: 10Muehlenhoff) [13:16:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1043.eqiad.wmnet [13:16:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2092.codfw.wmnet [13:16:26] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2093.codfw.wmnet [13:20:12] (03Merged) 10jenkins-bot: Revert "SpecialPreferences: Use Language Select Widget in language field" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254891 (https://phabricator.wikimedia.org/T419895) (owner: 10SomeRandomDeveloper) [13:20:20] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[13:21:30] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1026.eqiad.wmnet [13:21:50] (03Merged) 10jenkins-bot: Revert "SpecialPreferences: Use Language Select Widget in language field" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254890 (https://phabricator.wikimedia.org/T419895) (owner: 10SomeRandomDeveloper) [13:21:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1043.eqiad.wmnet [13:21:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1043.eqiad.wmnet [13:22:11] (03PS7) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) [13:22:11] (03PS3) 10Jcrespo: mediabackup: Deploy new ms-backup hosts on both dcs [puppet] - 10https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464) [13:22:25] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1254894|loggedOutWarning: dont set the schema for experiment events (T420451)]], [[gerrit:1254895|loggedOutWarning: dont set the schema for experiment events (T420451)]], [[gerrit:1254891|Revert "SpecialPreferences: Use Language Select Widget in language field" (T419895)]], [[gerrit:1254890|Revert "SpecialPreferences: Use Language Select Widget in lang [13:22:25] uage field" (T419895)]] [13:22:32] T420451: '.experiment.coordinator' should be equal to one of the allowed values - https://phabricator.wikimedia.org/T420451 [13:22:32] T419895: UnexpectedValueException: Default '"sh-latn"' is invalid for preference variant of user [user] - https://phabricator.wikimedia.org/T419895 [13:23:20] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2093.codfw.wmnet [13:23:24] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host 
ms-be2094.codfw.wmnet [13:23:42] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:24:32] !log sgimeno@deploy2002 somerandomdeveloper, sgimeno: Backport for [[gerrit:1254894|loggedOutWarning: dont set the schema for experiment events (T420451)]], [[gerrit:1254895|loggedOutWarning: dont set the schema for experiment events (T420451)]], [[gerrit:1254891|Revert "SpecialPreferences: Use Language Select Widget in language field" (T419895)]], [[gerrit:1254890|Revert "SpecialPreferences: Use Language Select Widget in [13:24:32] language field" (T419895)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:24:55] !log sgimeno@deploy2002 somerandomdeveloper, sgimeno: Continuing with sync [13:24:55] Seems to be fixed for me, no error anymore at https://sh.wikipedia.org/wiki/Posebno:Postavke when using mwdebug [13:25:10] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1094.eqiad.wmnet [13:25:14] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1095.eqiad.wmnet [13:26:01] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [13:27:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1044.eqiad.wmnet [13:27:27] (03PS8) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) [13:27:59] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1026.eqiad.wmnet [13:28:03] (03PS4) 10Jcrespo: mediabackup: Deploy new ms-backup hosts on both dcs [puppet] - 
10https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464) [13:28:11] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [13:28:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [13:28:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1027.eqiad.wmnet with OS bookworm [13:28:48] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254894|loggedOutWarning: dont set the schema for experiment events (T420451)]], [[gerrit:1254895|loggedOutWarning: dont set the schema for experiment events (T420451)]], [[gerrit:1254891|Revert "SpecialPreferences: Use Language Select Widget in language field" (T419895)]], [[gerrit:1254890|Revert "SpecialPreferences: Use Language Select Widget in lan [13:28:48] guage field" (T419895)]] (duration: 06m 23s) [13:28:50] 06SRE, 10SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11722986 (10Aklapper) [13:28:54] T420451: '.experiment.coordinator' should be equal to one of the allowed values - https://phabricator.wikimedia.org/T420451 [13:28:54] T419895: UnexpectedValueException: Default '"sh-latn"' is invalid for preference variant of user [user] - https://phabricator.wikimedia.org/T419895 [13:29:43] 06SRE, 10SRE-Access-Requests: Requesting access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11722996 (10ayounsi) [13:29:55] going with config changes now 1248095 and 1254216 [13:30:05] 06SRE, 10SRE-Access-Requests: Requesting access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11723000 (10ayounsi) @KFrancis can you organize the NDA for this request ? 
Thanks [13:30:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248095 (owner: 10Bartosz Dziewoński) [13:30:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254216 (https://phabricator.wikimedia.org/T341599) (owner: 10Sergio Gimeno) [13:30:31] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2094.codfw.wmnet [13:30:36] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2095.codfw.wmnet [13:30:40] To speed things up, I'll +2 my patches, so that CI starts to process them [13:31:06] (03CR) 10Mszwarc: [C:03+2] "Ahead of deployment" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254916 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc) [13:31:16] (03CR) 10Mszwarc: [C:03+2] "Ahead of deployment" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254917 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc) [13:31:17] (03Merged) 10jenkins-bot: filebackend: Remove outdated comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248095 (owner: 10Bartosz Dziewoński) [13:31:20] (03Merged) 10jenkins-bot: GrowthExperiments: increase edit and thanks query limit II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254216 (https://phabricator.wikimedia.org/T341599) (owner: 10Sergio Gimeno) [13:31:50] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1248095|filebackend: Remove outdated comment]], [[gerrit:1254216|GrowthExperiments: increase edit and thanks query limit II (T341599)]] [13:31:52] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1095.eqiad.wmnet [13:31:54] T341599: Impact Module: improvements for former newcomers - 
https://phabricator.wikimedia.org/T341599 [13:31:56] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1096.eqiad.wmnet [13:32:25] (03CR) 10Herron: [C:03+2] systemd::timer::job: add ExecCondition support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1253655 (owner: 10Herron) [13:33:56] !log sgimeno@deploy2002 matmarex, sgimeno: Backport for [[gerrit:1248095|filebackend: Remove outdated comment]], [[gerrit:1254216|GrowthExperiments: increase edit and thanks query limit II (T341599)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:34:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1044.eqiad.wmnet [13:34:29] 06SRE, 10SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11723038 (10ayounsi) @OKryva-WMF do you approve this request ? @thcipriani do you approve this request ? @MPostoronca-WMF could you generate a ed25519 key instead? [13:34:53] 06SRE, 10SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11723049 (10ayounsi) [13:36:40] !log sgimeno@deploy2002 matmarex, sgimeno: Continuing with sync [13:36:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2095.codfw.wmnet [13:36:58] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:36:59] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11723054 (10ayounsi) @KFrancis could you organize the NDA signature for this request ? 
Thanks [13:36:59] (03Merged) 10jenkins-bot: Normalize external domain names in click analysis [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254916 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc) [13:37:01] (03Merged) 10jenkins-bot: Normalize external domain names in click analysis [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254917 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc) [13:37:01] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2096.codfw.wmnet [13:37:13] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11723057 (10ayounsi) [13:39:18] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1096.eqiad.wmnet [13:39:23] !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1097.eqiad.wmnet [13:39:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1044.eqiad.wmnet [13:40:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1044.eqiad.wmnet [13:40:36] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248095|filebackend: Remove outdated comment]], [[gerrit:1254216|GrowthExperiments: increase edit and thanks query limit II (T341599)]] (duration: 08m 47s) [13:40:40] T341599: Impact Module: improvements for former newcomers - https://phabricator.wikimedia.org/T341599 [13:40:54] @Msz2001 all yours [13:41:01] ack, deploying [13:41:02] (03CR) 10Ssingh: [C:03+2] hcaptcha: Enable nginx caching for secure-api.js [puppet] - 10https://gerrit.wikimedia.org/r/1249929 (https://phabricator.wikimedia.org/T418865) (owner: 10Kosta Harlan) [13:41:06] (03CR) 10Muehlenhoff: [C:03+2] Add install4004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1254915 
(https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff) [13:41:30] sukhe: I'll merge your patch along, ok? [13:41:38] please do [13:41:40] thanks [13:41:42] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1254916|Normalize external domain names in click analysis (T419837)]], [[gerrit:1254917|Normalize external domain names in click analysis (T419837)]] [13:41:46] T419837: Temporary measurement of outbound citation link clicks - https://phabricator.wikimedia.org/T419837 [13:41:52] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1373.eqiad.wmnet with OS bookworm [13:41:54] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1033.eqiad.wmnet with OS trixie [13:42:00] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11723068 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1373.eqiad.wmnet with OS bookworm [13:43:23] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2096.codfw.wmnet [13:43:44] !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1254916|Normalize external domain names in click analysis (T419837)]], [[gerrit:1254917|Normalize external domain names in click analysis (T419837)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:44:03] (03PS9) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) [13:44:27] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [13:45:13] !log mszwarc@deploy2002 mszwarc: Continuing with sync [13:45:31] (03CR) 10Mszwarc: [C:03+2] "Ahead of deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254876 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc) [13:46:09] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1097.eqiad.wmnet [13:46:24] (03Merged) 10jenkins-bot: Tweak configuration of external link aggregate usage analysis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254876 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc) [13:47:39] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254925 [13:47:39] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254926 [13:47:40] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254927 [13:49:05] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254916|Normalize external domain names in click analysis (T419837)]], [[gerrit:1254917|Normalize external domain names in click analysis (T419837)]] (duration: 07m 23s) [13:49:10] T419837: Temporary measurement of outbound citation link clicks - https://phabricator.wikimedia.org/T419837 [13:49:44] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1254876|Tweak configuration of external link aggregate usage analysis (T419837)]] [13:50:20] !log sukhe@cumin1003 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox [13:50:21] !log 
sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns1004.wikimedia.org [13:50:21] (03PS10) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) [13:50:27] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [13:50:33] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2026-03-10-214300 to 2026-03-16-124858 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254928 (https://phabricator.wikimedia.org/T399344) [13:50:41] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2026-03-12-210521 to 2026-03-18-023444 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254929 (https://phabricator.wikimedia.org/T419092) [13:51:17] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:51:52] !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1254876|Tweak configuration of external link aggregate usage analysis (T419837)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:52:27] !log mszwarc@deploy2002 mszwarc: Continuing with sync [13:52:31] (03PS5) 10Jcrespo: mediabackup: Deploy new ms-backup hosts on both dcs [puppet] - 10https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464) [13:53:17] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1373.eqiad.wmnet with reason: host reimage [13:54:24] PROBLEM - Host 2620:0:861:1:208:80:154:6 is DOWN: CRITICAL - Host Unreachable (2620:0:861:1:208:80:154:6) [13:54:44] RECOVERY - Host 2620:0:861:1:208:80:154:6 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [13:55:10] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:55:22] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1033.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:55:24] this should have been downtimed [13:55:27] the DNS host is depooled [13:55:45] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1033.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:56:25] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254876|Tweak configuration of external link aggregate usage analysis (T419837)]] (duration: 06m 41s) [13:56:29] T419837: Temporary measurement of outbound citation link clicks - https://phabricator.wikimedia.org/T419837 [13:56:52] Finished deployments [13:57:06] !log UTC afternoon backport+config window done [13:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:11] thanks for deploying sergi0 [13:59:12] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1373.eqiad.wmnet with reason: host reimage [14:00:05] Deploy window Wikifunctions Services UTC Afternoon 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1400) [14:00:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:00:13] Perfect timing. [14:00:36] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2026-03-10-214300 to 2026-03-16-124858 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254928 (https://phabricator.wikimedia.org/T399344) (owner: 10Jforrester) [14:01:21] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:01:40] !log klausman@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad [14:02:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1045.eqiad.wmnet [14:02:17] (03PS1) 10Kevin Bazira: ml-services: update gpt isvc image to one that supports configurable max_num_batched_tokens flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254933 (https://phabricator.wikimedia.org/T418350) [14:02:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254911 (https://phabricator.wikimedia.org/T420456) (owner: 10Jforrester) [14:02:39] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2026-03-10-214300 to 2026-03-16-124858 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254928 (https://phabricator.wikimedia.org/T399344) (owner: 10Jforrester) [14:02:40] (03PS2) 10Kgraessle: Deploy Extension:PersonalDashboard to English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254865 (https://phabricator.wikimedia.org/T418367) [14:04:04] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns1004.wikimedia.org [14:04:32] !log apine@deploy2002 helmfile [staging] START 
helmfile.d/services/wikifunctions: apply [14:05:17] !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:05:30] !log set graceful-shutdown on EdgeUno transit sessions [14:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:49] !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:07:47] !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:08:03] !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:08:48] !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:10:04] jmm@cumin2002 drain-node (PID 3731205) is awaiting input [14:10:28] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade orchestrator from 2026-03-12-210521 to 2026-03-18-023444 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254929 (https://phabricator.wikimedia.org/T419092) (owner: 10Jforrester) [14:10:52] (03Merged) 10jenkins-bot: Restore quotation-marks in ext.wikilambda.app messages [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254911 (https://phabricator.wikimedia.org/T420456) (owner: 10Jforrester) [14:10:53] (03CR) 10Ozge: [C:03+1] ml-services: update gpt isvc image to one that supports configurable max_num_batched_tokens flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254933 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [14:11:23] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1254911|Restore quotation-marks in ext.wikilambda.app messages (T420456)]] [14:11:28] T420456: In the default collapsed view, all strings appear as ⧼quotation-marks⧽ - https://phabricator.wikimedia.org/T420456 [14:11:33] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update gpt isvc image to one that supports configurable max_num_batched_tokens flag [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1254933 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [14:12:52] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-03-12-210521 to 2026-03-18-023444 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254929 (https://phabricator.wikimedia.org/T419092) (owner: 10Jforrester) [14:13:19] !log otto@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [14:13:27] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1254911|Restore quotation-marks in ext.wikilambda.app messages (T420456)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:13:36] !log otto@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [14:13:39] !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:14:00] (03CR) 10Vgutierrez: [C:04-1] "this alone is not enough:" [puppet] - 10https://gerrit.wikimedia.org/r/1242499 (https://phabricator.wikimedia.org/T419887) (owner: 10Cwhite) [14:14:03] !log jforrester@deploy2002 jforrester: Continuing with sync [14:14:03] !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:14:05] (03Merged) 10jenkins-bot: ml-services: update gpt isvc image to one that supports configurable max_num_batched_tokens flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254933 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [14:14:33] !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:15:05] !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:15:16] !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:16:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254889 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [14:16:22] (03CR) 10Herron: [C:03+2] "proceeding with this after discussion on irc" [puppet] - 10https://gerrit.wikimedia.org/r/1253576 (https://phabricator.wikimedia.org/T196336) (owner: 10Herron) [14:16:32] (03PS5) 10Herron: icinga: add monthly restart [puppet] - 10https://gerrit.wikimedia.org/r/1253576 (https://phabricator.wikimedia.org/T196336) [14:16:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install4004.wikimedia.org [14:16:35] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:16:37] !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:16:41] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:16:59] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [14:17:00] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1373.eqiad.wmnet with OS bookworm [14:17:03] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:17:08] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11723269 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1373.eqiad.wmnet with OS bookworm completed: - wikikube-worker1373 (... 
[14:17:16] (03CR) 10CI reject: [V:04-1] icinga: add monthly restart [puppet] - 10https://gerrit.wikimedia.org/r/1253576 (https://phabricator.wikimedia.org/T196336) (owner: 10Herron) [14:17:55] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254911|Restore quotation-marks in ext.wikilambda.app messages (T420456)]] (duration: 06m 32s) [14:17:59] T420456: In the default collapsed view, all strings appear as ⧼quotation-marks⧽ - https://phabricator.wikimedia.org/T420456 [14:18:41] (03PS6) 10Herron: icinga: add monthly restart [puppet] - 10https://gerrit.wikimedia.org/r/1253576 (https://phabricator.wikimedia.org/T196336) [14:19:04] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org [14:20:37] (03CR) 10Herron: [C:03+2] icinga: add monthly restart [puppet] - 10https://gerrit.wikimedia.org/r/1253576 (https://phabricator.wikimedia.org/T196336) (owner: 10Herron) [14:20:58] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install4004.wikimedia.org - jmm@cumin2002" [14:21:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install4004.wikimedia.org - jmm@cumin2002" [14:21:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:21:05] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install4004.wikimedia.org on all recursors [14:21:25] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) install4004.wikimedia.org on all recursors [14:21:42] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11723291 (10Jclark-ctr) [14:21:48] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11723293 (10Jclark-ctr) 
05Open→03Resolved [14:24:10] FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:24:25] jmm@cumin2002 makevm (PID 3732975) is awaiting input [14:24:57] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install4004.wikimedia.org on all recursors [14:25:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install4004.wikimedia.org on all recursors [14:25:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1045.eqiad.wmnet [14:25:32] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install4004.wikimedia.org - jmm@cumin2002" [14:25:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install4004.wikimedia.org - jmm@cumin2002" [14:28:39] jmm@cumin2002 makevm (PID 3732975) is awaiting input [14:29:02] (03PS1) 10Giuseppe Lavagetto: Equivalence of functions of inline patterns and patterns [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1254936 [14:29:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:29:17] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Equivalence of functions of inline patterns and patterns [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1254936 (owner: 10Giuseppe Lavagetto) [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1400) [14:30:05] Deploy window Test Kitchen Experiment Deployment Window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1430) [14:30:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1045.eqiad.wmnet [14:30:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1045.eqiad.wmnet [14:31:29] !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "inline pattern and pattern equivalence - oblivian@cumin1003" [14:31:32] !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: inline pattern and pattern equivalence - oblivian@cumin1003 [14:32:25] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: inline pattern and pattern equivalence - oblivian@cumin1003 [14:32:27] !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "inline pattern and pattern equivalence - oblivian@cumin1003" [14:32:44] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org [14:33:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1046.eqiad.wmnet [14:34:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install4004.wikimedia.org with OS bookworm [14:36:02] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts durum4001.ulsfo.wmnet [14:38:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1046.eqiad.wmnet [14:40:00] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:40:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [14:40:42] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:41:11] (03PS1) 10Ottomata: mw-page-html-content-change-enrich - increase taskmanager replicas to 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254938 (https://phabricator.wikimedia.org/T360794) [14:41:47] (03PS2) 10Ottomata: mw-page-html-content-change-enrich - increase taskmanager replicas to 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254938 (https://phabricator.wikimedia.org/T360794) [14:43:23] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11723367 (10MoritzMuehlenhoff) [14:43:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1046.eqiad.wmnet [14:44:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1046.eqiad.wmnet [14:44:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1047.eqiad.wmnet [14:44:25] FIRING: [8x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:44:40] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum4001.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [14:44:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum4001.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [14:44:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:44:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum4001.ulsfo.wmnet [14:45:05] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11723386 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `durum4001.ulsfo.wmnet` - durum4001.ulsfo.wmnet (**PASS... [14:45:12] !log cgoubert@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-main-eqiad [14:45:44] (03CR) 10Ottomata: [C:03+2] mw-page-html-content-change-enrich - increase taskmanager replicas to 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254938 (https://phabricator.wikimedia.org/T360794) (owner: 10Ottomata) [14:46:04] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts durum4002.ulsfo.wmnet [14:46:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1047.eqiad.wmnet [14:47:15] (03CR) 10Slyngshede: [C:03+2] geo-maps: update Meta geo mapping [dns] - 10https://gerrit.wikimedia.org/r/1254092 (owner: 10Slyngshede) [14:47:33] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: cloudcephmon2007-dev service implementation - https://phabricator.wikimedia.org/T420282#11723394 (10Andrew) p:05Triage→03Medium [14:47:44] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org [14:47:49] (03Merged) 10jenkins-bot: mw-page-html-content-change-enrich - increase taskmanager replicas to 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254938 (https://phabricator.wikimedia.org/T360794) (owner: 10Ottomata) [14:48:04] !log slyngshede@dns1004 START - running 
authdns-update [14:48:39] !log otto@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [14:48:58] !log otto@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [14:49:49] !log slyngshede@dns1004 END - running authdns-update [14:50:52] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:52:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1047.eqiad.wmnet [14:52:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1047.eqiad.wmnet [14:53:54] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3074.esams.wmnet [reason: trixie reimaging] [14:54:17] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3074.esams.wmnet with OS trixie [14:54:25] FIRING: [12x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.77 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:54:25] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3075.esams.wmnet [reason: trixie reimaging] [14:54:53] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3075.esams.wmnet with OS trixie [14:56:43] jmm@cumin2002 decommission (PID 3739626) is awaiting input [14:56:46] 06SRE, 10SRE-Access-Requests: Requesting access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11723455 (10Sarmbruster) >>! In T420410#11721959, @Aklapper wrote: > @Sarmbruster: Please also [link your LDAP account to your Phabricator account](https://phabricator.wikimedia.org/se... 
[14:57:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1049.eqiad.wmnet [14:57:45] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum4002.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [14:58:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum4002.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [14:58:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:58:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum4002.ulsfo.wmnet [14:58:37] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11723461 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `durum4002.ulsfo.wmnet` - durum4002.ulsfo.wmnet (**PASS... [14:58:44] (03PS2) 10Arnaudb: gerrit: bump MaxRequestWorkers [puppet] - 10https://gerrit.wikimedia.org/r/1254940 (https://phabricator.wikimedia.org/T420189) [14:59:25] FIRING: [12x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.77 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:59:44] (03PS3) 10Arnaudb: gerrit: bump MaxRequestWorkers [puppet] - 10https://gerrit.wikimedia.org/r/1254940 (https://phabricator.wikimedia.org/T420189) [15:01:23] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11723484 (10Jgreen) These have all been updated to the frack management password. 
[15:01:24] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org [15:01:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1049.eqiad.wmnet [15:02:12] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1033.eqiad.wmnet with OS trixie [15:03:02] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11723498 (10MoritzMuehlenhoff) [15:03:41] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:03:56] !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-eqiad [15:04:04] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1328.eqiad.wmnet [15:04:45] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1329.eqiad.wmnet [15:05:08] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1330.eqiad.wmnet [15:05:14] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1331.eqiad.wmnet [15:06:15] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1332.eqiad.wmnet [15:06:18] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1333.eqiad.wmnet [15:06:55] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1334.eqiad.wmnet [15:07:01] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1335.eqiad.wmnet 
[15:07:07] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11723523 (10Jgreen) Note to self: since these are Supermicro, the default management user is "ADMIN" [15:07:14] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1336.eqiad.wmnet [15:07:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1049.eqiad.wmnet [15:07:25] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1337.eqiad.wmnet [15:07:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1049.eqiad.wmnet [15:08:51] (03CR) 10Cwhite: "Hmm, IIUC, this hard connection would make our dns servers a dependency for serving all of wikimediastatus.net. Keeping at least the www " [puppet] - 10https://gerrit.wikimedia.org/r/1242499 (https://phabricator.wikimedia.org/T419887) (owner: 10Cwhite) [15:09:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1050.eqiad.wmnet [15:09:07] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1328.eqiad.wmnet [15:09:45] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1338.eqiad.wmnet [15:09:56] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1329.eqiad.wmnet [15:10:06] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1339.eqiad.wmnet [15:10:13] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1330.eqiad.wmnet [15:10:18] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1331.eqiad.wmnet [15:10:22] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1340.eqiad.wmnet [15:10:27] !log 
blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1341.eqiad.wmnet [15:11:09] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Post reimage - btullis@cumin1003" [15:11:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Post reimage - btullis@cumin1003" [15:11:22] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1333.eqiad.wmnet [15:11:40] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1332.eqiad.wmnet [15:11:50] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1342.eqiad.wmnet [15:11:55] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1343.eqiad.wmnet [15:12:02] 10ops-codfw, 06SRE, 06DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320#11723563 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:12:05] (03PS1) 10Btullis: Add dse-k8s-worker1027 into service [puppet] - 10https://gerrit.wikimedia.org/r/1254952 (https://phabricator.wikimedia.org/T414787) [15:12:05] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1335.eqiad.wmnet [15:12:13] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1334.eqiad.wmnet [15:12:21] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1344.eqiad.wmnet [15:12:28] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1345.eqiad.wmnet [15:12:29] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1337.eqiad.wmnet [15:12:31] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1336.eqiad.wmnet [15:12:40] 
!log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1346.eqiad.wmnet [15:12:46] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1348.eqiad.wmnet [15:12:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1050.eqiad.wmnet [15:13:22] (03PS1) 10Ayounsi: network/data.yaml: add ulsfo routed ganeti public [puppet] - 10https://gerrit.wikimedia.org/r/1254953 (https://phabricator.wikimedia.org/T418993) [15:13:51] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254953 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [15:14:39] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1016.eqiad.wmnet with OS bookworm [15:14:49] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1338.eqiad.wmnet [15:15:08] !log imported jenkins 2.541.3 for bullseye/bookworm/trixie [15:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:11] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1339.eqiad.wmnet [15:15:16] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1340.eqiad.wmnet [15:15:29] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1254953 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [15:15:32] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1341.eqiad.wmnet [15:15:52] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1349.eqiad.wmnet [15:16:24] (03CR) 10Btullis: [C:03+2] Add dse-k8s-worker1027 into service [puppet] - 10https://gerrit.wikimedia.org/r/1254952 (https://phabricator.wikimedia.org/T414787) (owner: 10Btullis) [15:16:24] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org [15:16:40] (03CR) 10Ayounsi: [C:03+2] network/data.yaml: add ulsfo routed ganeti public [puppet] - 10https://gerrit.wikimedia.org/r/1254953 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi) [15:16:55] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1342.eqiad.wmnet [15:17:01] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1343.eqiad.wmnet [15:17:32] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1345.eqiad.wmnet [15:17:33] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1344.eqiad.wmnet [15:17:45] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1346.eqiad.wmnet [15:17:51] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1348.eqiad.wmnet [15:18:07] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3074.esams.wmnet with reason: host reimage [15:18:18] (03CR) 10Clément Goubert: [C:03+1] rest gateway: merge authed-other into authed-bot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254921 (https://phabricator.wikimedia.org/T420467) (owner: 
10Daniel Kinzler) [15:18:57] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3075.esams.wmnet with reason: host reimage [15:20:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1050.eqiad.wmnet [15:20:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1050.eqiad.wmnet [15:20:46] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1349.eqiad.wmnet [15:20:56] (03PS1) 10Ladsgroup: Remove VP8 from transcoding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254955 (https://phabricator.wikimedia.org/T413031) [15:21:58] (03CR) 10CI reject: [V:04-1] Remove VP8 from transcoding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254955 (https://phabricator.wikimedia.org/T413031) (owner: 10Ladsgroup) [15:22:38] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1360.eqiad.wmnet [15:22:43] (03PS1) 10C. 
Scott Ananian: Limit legacy postprocessing cache to pages where DT does apply [extensions/DiscussionTools] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254956 (https://phabricator.wikimedia.org/T376183) [15:22:44] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1361.eqiad.wmnet [15:22:50] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1362.eqiad.wmnet [15:22:56] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1363.eqiad.wmnet [15:23:03] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1364.eqiad.wmnet [15:23:08] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1365.eqiad.wmnet [15:23:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1051.eqiad.wmnet [15:23:14] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1366.eqiad.wmnet [15:23:20] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1367.eqiad.wmnet [15:23:27] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1368.eqiad.wmnet [15:23:32] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1369.eqiad.wmnet [15:24:03] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host install4004.wikimedia.org with OS bookworm [15:24:03] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host install4004.wikimedia.org [15:24:07] ACKNOWLEDGEMENT - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 4134e5f01ac0575de459f204e1ba3c23cd5bfb2a, dns.git is f38df3b8f8408e4f3e4d008d1744ad43c7d241aa) Sukhbir Singh ACK https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run 
[15:24:15] !log sukhe@dns1004 START - running authdns-update [15:24:25] FIRING: [12x] BFDdown: BFD session down between cr1-codfw and 208.80.153.48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:24:40] FIRING: [12x] BFDdown: BFD session down between cr1-codfw and 208.80.153.48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:25:02] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdm) failed in thanos-be2008 - https://phabricator.wikimedia.org/T419817#11723608 (10Jhancock.wm) it's a good thing we didn't wait for dell to send us a new drive. their portal says shipped but the drive still hasn't been delivered to codfw. [15:25:42] !log sukhe@dns1004 END - running authdns-update [15:25:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install4004.wikimedia.org with OS bookworm [15:25:52] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3074.esams.wmnet with reason: host reimage [15:26:05] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11723610 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host install4004.wikimedia.org wi... 
[15:26:09] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1017 [15:27:32] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1360.eqiad.wmnet [15:27:39] (03PS2) 10Jforrester: Remove VP8 from transcoding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254955 (https://phabricator.wikimedia.org/T413031) (owner: 10Ladsgroup) [15:27:43] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1017 [15:27:48] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1361.eqiad.wmnet [15:27:55] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1362.eqiad.wmnet [15:27:56] (03PS3) 10Jforrester: Remove VP8 from transcoding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254955 (https://phabricator.wikimedia.org/T413031) (owner: 10Ladsgroup) [15:28:02] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1363.eqiad.wmnet [15:28:07] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1364.eqiad.wmnet [15:28:11] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1365.eqiad.wmnet [15:28:12] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1370.eqiad.wmnet [15:28:18] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1371.eqiad.wmnet [15:28:19] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1366.eqiad.wmnet [15:28:24] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1367.eqiad.wmnet [15:28:26] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single 
for host wikikube-worker1372.eqiad.wmnet [15:28:31] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1368.eqiad.wmnet [15:28:36] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1369.eqiad.wmnet [15:29:03] jmm@cumin2002 drain-node (PID 3747404) is awaiting input [15:29:22] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1016.eqiad.wmnet with reason: host reimage [15:30:05] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org [15:30:25] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:31:05] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1027.eqiad.wmnet [15:32:16] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11723627 (10RobH) Remote hands cleaned the patch cable and reseated the optic along with photos to show the work. This is now returned to #netops purview for moni... 
[15:32:29] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11723630 (10RobH) {F73035080} {F73035081} {F73035082} [15:33:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:33:16] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1370.eqiad.wmnet [15:33:23] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1371.eqiad.wmnet [15:33:36] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1372.eqiad.wmnet [15:34:05] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3075.esams.wmnet with reason: host reimage [15:34:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1051.eqiad.wmnet [15:34:39] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdm) failed in thanos-be2008 - https://phabricator.wikimedia.org/T419817#11723631 (10MatthewVernon) I remain grateful that we have spare disks available, so thanks again :) [15:35:06] PROBLEM - Host dse-k8s-worker1016 is DOWN: PING CRITICAL - Packet loss = 100% [15:35:25] FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:35:31] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - 
https://phabricator.wikimedia.org/T418163#11723635 (10herron) [15:35:38] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-main-eqiad [15:36:43] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1016.eqiad.wmnet with reason: host reimage [15:37:19] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1027.eqiad.wmnet [15:37:29] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host deploy1003.eqiad.wmnet [15:37:50] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update for dse-k8s-worker1016 - btullis@cumin1003" [15:38:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:38:13] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update for dse-k8s-worker1016 - btullis@cumin1003" [15:39:47] 10ops-codfw, 06DC-Ops, 10decommission-hardware, 06Traffic: Decommission codfw cp hosts cp2027-cp2040 - https://phabricator.wikimedia.org/T419753#11723690 (10BCornwall) [15:39:52] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbproxy1022.eqiad.wmnet with reason: kernel update [15:40:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1051.eqiad.wmnet [15:40:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1051.eqiad.wmnet [15:41:19] (03CR) 10Dzahn: [C:03+1] "lgtm. 
it seems we could even go as high as 2048 with our amount of RAM" [puppet] - 10https://gerrit.wikimedia.org/r/1254940 (https://phabricator.wikimedia.org/T420189) (owner: 10Arnaudb) [15:41:34] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3078.esams.wmnet with OS trixie [15:41:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/DiscussionTools] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254956 (https://phabricator.wikimedia.org/T376183) (owner: 10C. Scott Ananian) [15:41:46] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3079.esams.wmnet with OS trixie [15:41:58] btullis@cumin1003 reimage (PID 4173938) is awaiting input [15:42:21] (03PS1) 10Btullis: Put dse-k8s-worker101[67] into insetup mode [puppet] - 10https://gerrit.wikimedia.org/r/1254959 (https://phabricator.wikimedia.org/T414787) [15:42:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1052.eqiad.wmnet [15:42:36] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup1014.eqiad.wmnet [15:43:25] (03CR) 10Btullis: [C:03+2] Put dse-k8s-worker101[67] into insetup mode [puppet] - 10https://gerrit.wikimedia.org/r/1254959 (https://phabricator.wikimedia.org/T414787) (owner: 10Btullis) [15:45:05] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns2005.wikimedia.org [15:45:28] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1017.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:45:53] (03CR) 10David Caro: "LGTM, though double check with Andrew first, he did a lot of tweaking might have experience with this setting" [puppet] - 10https://gerrit.wikimedia.org/r/1254877 (https://phabricator.wikimedia.org/T418444) (owner: 10Filippo Giunchedi) [15:46:44] !log 
brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_codfw and not P{cp2041.codfw.wmnet} and A:cp [15:46:47] ml-etcd1003 will go down for a Ganeti reboot [15:46:48] jmm@cumin2002 drain-node (PID 3754269) is awaiting input [15:46:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1052.eqiad.wmnet [15:47:32] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy1003.eqiad.wmnet [15:48:08] PROBLEM - Host ml-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [15:48:41] !log klausman@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:ml-serve-worker-eqiad [15:48:53] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup1014.eqiad.wmnet [15:49:24] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_codfw and not P{cp2042.codfw.wmnet} and A:cp [15:49:40] FIRING: [12x] BFDdown: BFD session down between cr1-codfw and 208.80.153.74 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:49:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:50:00] FIRING: [2x] JobUnavailable: Reduced availability for job trafficserver-upload in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:50:33] RECOVERY - Host ml-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [15:51:31] PROBLEM - bacula sd process on backup1012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (bacula), command name bacula-sd 
https://wikitech.wikimedia.org/wiki/Bacula [15:51:34] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3074.esams.wmnet with OS trixie [15:51:43] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup1012.eqiad.wmnet [15:52:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1052.eqiad.wmnet [15:52:49] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1010.eqiad.wmnet [15:52:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1052.eqiad.wmnet [15:53:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1053.eqiad.wmnet [15:53:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job trafficserver-upload in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:54:00] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for dbproxy1022.eqiad.wmnet [15:54:01] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup1008.eqiad.wmnet [15:54:01] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dbproxy1022.eqiad.wmnet [15:54:25] FIRING: [12x] BFDdown: BFD session down between cr1-codfw and 208.80.153.74 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:54:39] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1015.eqiad.wmnet [15:54:41] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy1023.eqiad.wmnet with reason: kernel update [15:54:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:54:58] PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [15:55:05] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1017.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:55:32] RECOVERY - bacula sd process on backup1012 is OK: PROCS OK: 1 process with UID = 110 (bacula), command name bacula-sd https://wikitech.wikimedia.org/wiki/Bacula [15:55:45] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2043.codfw.wmnet [15:56:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1053.eqiad.wmnet [15:56:37] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm [15:57:03] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3074.esams.wmnet [reason: trixie reimaging] [15:57:09] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup1008.eqiad.wmnet [15:57:15] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1010.eqiad.wmnet [15:57:40] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3078.esams.wmnet [reason: trixie reimaging] [15:58:00] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3078.esams.wmnet [reason: trixie reimaging] [15:58:00] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2044.codfw.wmnet [15:58:05] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup1012.eqiad.wmnet [15:58:59] !log cdobbins@cumin2002 conftool action : set/pooled=no; 
selector: name=cp3076.esams.wmnet [reason: trixie reimaging] [15:59:17] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns2005.wikimedia.org [15:59:45] (03Abandoned) 10Bking: dumps: Update cirrus index dumps path to point to new dumps [puppet] - 10https://gerrit.wikimedia.org/r/1210636 (owner: 10DCausse) [16:00:08] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3076.esams.wmnet with OS trixie [16:00:09] (03PS1) 10DLynch: Editcheck: fix tagging not happening for non-default checks [extensions/VisualEditor] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254965 [16:00:19] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3075.esams.wmnet with OS trixie [16:00:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/VisualEditor] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254965 (owner: 10DLynch) [16:00:30] (03PS2) 10Scott French: mw-(api-ext|jobrunner): Use envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254962 (https://phabricator.wikimedia.org/T364245) [16:00:45] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1015.eqiad.wmnet [16:00:55] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup1003.eqiad.wmnet [16:01:59] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy1028.eqiad.wmnet with reason: kernel update [16:02:43] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm [16:03:41] FIRING: [3x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:04:37] !log klausman@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad [16:04:38] !log klausman@cumin1003 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-serve-worker-eqiad [16:04:58] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [16:05:40] (03CR) 10Clément Goubert: [C:03+1] mw-(api-ext|jobrunner): Use envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254962 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [16:06:00] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3078.esams.wmnet with reason: host reimage [16:06:01] !log klausman@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad [16:06:03] (03CR) 10JMeybohm: [C:03+1] mw-(api-ext|jobrunner): Use envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254962 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [16:06:08] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3079.esams.wmnet with reason: host reimage [16:07:03] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup1003.eqiad.wmnet [16:07:56] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup1009.eqiad.wmnet [16:08:03] btullis@cumin1003 reimage (PID 4173938) is awaiting input [16:08:22] (03PS1) 10Kevin Bazira: ml-services: update gpt isvc image to one that supports concurrent request handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254967 (https://phabricator.wikimedia.org/T418350) [16:08:40] FIRING: [5x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:08:53] !log klausman@cumin1003 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-serve-worker-eqiad [16:09:36] !log klausman@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad [16:09:37] !log klausman@cumin1003 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-serve-worker-eqiad [16:09:57] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3078.esams.wmnet with reason: host reimage [16:11:12] !log powercycling ganeti1053 (stuck on reboot) [16:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:25] 10ops-magru: Inbound errors on interface cr1-magru:xe-0/1/1 (Transport: cr2-eqiad:xe-1/0/1:3 (Telxius, CRT-008508) {#70089}) - https://phabricator.wikimedia.org/T413409#11723858 (10RobH) Remote Hands Directions: I can write up the directions for them to pull the patch and clean it, and also reseat the optic in t... 
[16:11:56] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm [16:12:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [16:12:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1016.eqiad.wmnet with OS bookworm [16:12:53] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy1029.eqiad.wmnet with reason: kernel update [16:13:41] FIRING: [3x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:13:42] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1011.eqiad.wmnet [16:13:48] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3079.esams.wmnet with reason: host reimage [16:14:11] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup1009.eqiad.wmnet [16:14:15] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup1013.eqiad.wmnet [16:14:17] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org [16:16:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1053.eqiad.wmnet [16:16:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1053.eqiad.wmnet [16:16:46] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host install4004.wikimedia.org with OS bookworm [16:16:57] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations, 
13Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11723901 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host install4004.wikimedia.org with O... [16:18:46] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm [16:18:48] (03PS11) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) [16:18:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1054.eqiad.wmnet [16:19:25] FIRING: [12x] BFDdown: BFD session down between cr1-codfw and 208.80.153.107 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:19:38] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1011.eqiad.wmnet [16:20:35] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup1013.eqiad.wmnet [16:22:11] !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1012.eqiad.wmnet [16:24:11] (03PS1) 10Muehlenhoff: Add library hints for alsa-lib [puppet] - 10https://gerrit.wikimedia.org/r/1254971 [16:24:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1054.eqiad.wmnet [16:24:23] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3076.esams.wmnet with reason: host reimage [16:24:25] FIRING: [12x] BFDdown: BFD session down between cr1-codfw and 208.80.153.107 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:24:39] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy2005.codfw.wmnet with reason: kernel update 
[16:27:43] (03CR) 10Muehlenhoff: [C:03+2] Add library hints for alsa-lib [puppet] - 10https://gerrit.wikimedia.org/r/1254971 (owner: 10Muehlenhoff) [16:27:57] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:28:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1054.eqiad.wmnet [16:28:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1054.eqiad.wmnet [16:29:06] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup2003.codfw.wmnet [16:29:20] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3076.esams.wmnet with reason: host reimage [16:29:23] !log failover Ganeti master in eqiad to ganeti1046 [16:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:06] sukhe@cumin1003 roll-reboot (PID 4103685) is awaiting input [16:32:19] PROBLEM - ganeti-wconfd running on ganeti1048 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [16:32:41] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org [16:32:49] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup2008.codfw.wmnet [16:33:25] (03CR) 10Ozge: [C:03+1] ml-services: update gpt isvc image to one that supports concurrent request handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254967 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [16:33:39] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2045.codfw.wmnet [16:33:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:33:52] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update gpt isvc image to one that supports concurrent request handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254967 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [16:34:09] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [16:34:52] !log installing alsa-lib security updates [16:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:56] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3078.esams.wmnet with OS trixie [16:36:11] (03Merged) 10jenkins-bot: ml-services: update gpt isvc image to one that supports concurrent request handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254967 (https://phabricator.wikimedia.org/T418350) (owner: 10Kevin Bazira) [16:36:48] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2046.codfw.wmnet [16:37:32] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup2009.codfw.wmnet [16:38:55] !log installing PHP 8.2 security updates [16:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:03] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup2008.codfw.wmnet [16:39:56] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3079.esams.wmnet with OS trixie [16:40:21] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup2012.codfw.wmnet [16:40:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [16:41:32] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy2007.codfw.wmnet with reason: kernel update [16:42:59] PROBLEM - Host ml-serve1012 is DOWN: PING CRITICAL - Packet loss = 100% [16:43:16] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-cluster [16:43:16] !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1347.eqiad.wmnet with OS trixie [16:43:16] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99) [16:43:45] !log jayme@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1347 [16:43:48] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup2009.codfw.wmnet [16:44:05] RECOVERY - Host ml-serve1012 is UP: PING OK - Packet loss = 0%, RTA = 5.20 ms [16:44:09] !log jayme@cumin1003 START - Cookbook sre.dns.netbox [16:44:12] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup2013.codfw.wmnet [16:45:01] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet [16:46:07] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup2003.codfw.wmnet [16:46:13] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3075.esams.wmnet [reason: trixie reimaging] [16:46:53] 10ops-magru: Inbound errors on interface cr1-magru:xe-0/1/1 (Transport: cr2-eqiad:xe-1/0/1:3 (Telxius, CRT-008508) {#70089}) - https://phabricator.wikimedia.org/T413409#11724144 (10RobH) This also looks like its no longer throwing errors, but I've done nothing: https://grafana.wikimedia.org/d/5p97dAASz/queue-an... 
[16:47:01] !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup2014.codfw.wmnet [16:47:02] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup2012.codfw.wmnet [16:47:02] !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir and A:ncredir [16:47:24] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1012.eqiad.wmnet [16:47:41] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org [16:47:55] !log jayme@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1347 - jayme@cumin1003" [16:47:59] !log jayme@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1347 - jayme@cumin1003" [16:47:59] !log jayme@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:48:00] !log jayme@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1347.eqiad.wmnet 199.48.64.10.in-addr.arpa 9.9.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:48:03] !log jayme@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1347.eqiad.wmnet 199.48.64.10.in-addr.arpa 9.9.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:48:03] !log jayme@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1347 [16:48:59] 06SRE, 10SRE-Access-Requests: Requesting access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11724159 (10KFrancis) Hi all, the NDA has been sent for signatures. I'll confirm when it's complete. Thanks! 
[16:49:16] !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=97) rolling reboot on A:ncredir and A:ncredir [16:49:23] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir2001.* [16:50:23] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup2013.codfw.wmnet [16:51:11] !log klausman@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ml-serve1013.eqiad.wmnet with reason: Reboot for security update [16:51:41] !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir-magru and A:ncredir [16:51:50] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdm) failed in thanos-be2008 - https://phabricator.wikimedia.org/T419817#11724171 (10Jhancock.wm) 05Open→03Resolved you're welcome! I'm gonna close this just do i don't mess up my own SLA waiting for the drive. [16:51:59] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:52:08] !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy2008.codfw.wmnet with reason: kernel update [16:52:17] !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir-eqsin and A:ncredir [16:53:06] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup2014.codfw.wmnet [16:53:17] (03PS2) 10Harroyo-wmf: Reapply "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254889 (https://phabricator.wikimedia.org/T419125) [16:55:00] FIRING: JobUnavailable: Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:55:01] PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [16:55:29] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3076.esams.wmnet with OS trixie [16:55:35] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for ncredir2001.codfw.wmnet [16:55:36] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ncredir2001.codfw.wmnet [16:55:45] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir2001.* [16:56:35] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 8 hosts with reason: upgrade [16:57:54] (03PS3) 10Harroyo-wmf: Reapply "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254889 (https://phabricator.wikimedia.org/T419125) [16:58:36] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir2002.* [16:58:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:58:59] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:59:26] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host ncredir2002.codfw.wmnet [17:00:05] swfrench-wmf: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki infrastructure (UTC late) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1700). 
[17:00:27] o/ [17:00:59] (03CR) 10Scott French: [C:03+2] mw-(api-ext|jobrunner): Use envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254962 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [17:01:14] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3077.esams.wmnet [reason: trixie reimaging] [17:01:42] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3077.esams.wmnet with OS trixie [17:01:44] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3076.esams.wmnet [reason: trixie reimaging] [17:02:38] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3078.esams.wmnet [reason: trixie reimaging] [17:02:53] PROBLEM - haproxy failover on dbproxy1028 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [17:02:56] !log jayme@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1347 [17:02:56] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1347 [17:03:03] (03Merged) 10jenkins-bot: mw-(api-ext|jobrunner): Use envoy drain configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254962 (https://phabricator.wikimedia.org/T364245) (owner: 10Scott French) [17:03:04] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3078.esams.wmnet with OS trixie [17:04:19] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir-magru and A:ncredir [17:05:13] !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir-ulsfo and A:ncredir [17:05:19] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir-eqsin and A:ncredir [17:05:38] !log brett@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host ncredir2002.codfw.wmnet [17:05:54] !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir-drmrs and A:ncredir [17:06:19] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir2002.* [17:06:30] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [17:07:20] !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir-esams and A:ncredir [17:07:22] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11724272 (10RobH) > Support, > > The link came back up after your cleaning and re-seating the optic and patch cable, but the errors have resumed after the circuit... [17:07:40] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [17:07:49] !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir-eqiad and A:ncredir [17:08:12] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3078.* [17:08:14] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org [17:08:15] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3079.* [17:08:20] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:08:49] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:09:13] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp3078.* [17:09:13] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm [17:10:44] 06SRE, 06Traffic: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097#11724294 (10Raine) 
[17:11:20] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:12:28] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2047.codfw.wmnet [17:12:47] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:13:02] PROBLEM - haproxy failover on dbproxy1029 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [17:14:40] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5031.eqsin.wmnet with OS trixie [17:14:54] !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1347.eqiad.wmnet with reason: host reimage [17:15:34] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5032.eqsin.wmnet with OS trixie [17:15:37] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2048.codfw.wmnet [17:15:43] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir-ulsfo and A:ncredir [17:16:09] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir-eqiad and A:ncredir [17:18:29] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir-drmrs and A:ncredir [17:19:26] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1347.eqiad.wmnet with reason: host reimage [17:20:04] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir-esams and A:ncredir [17:20:51] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:21:18] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:21:26] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:21:54] !log 
swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:23:14] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns3004.wikimedia.org [17:23:38] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1258: Ready [17:25:05] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:25:36] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:26:31] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3077.esams.wmnet with reason: host reimage [17:27:25] PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:27:49] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3078.esams.wmnet with reason: host reimage [17:27:57] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:28:52] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:29:38] !log rearmed keyholder on deploy1003 [17:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:50] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:30:05] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3077.esams.wmnet with reason: host reimage [17:30:52] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:31:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [17:32:11] !log swfrench@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/mw-api-ext: apply [17:32:14] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5031.eqsin.wmnet with OS trixie [17:32:30] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5032.eqsin.wmnet with OS trixie [17:32:32] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5031.eqsin.wmnet with OS trixie [17:32:49] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5032.eqsin.wmnet with OS trixie [17:33:53] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3078.esams.wmnet with reason: host reimage [17:34:25] RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:35:59] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1347.eqiad.wmnet with OS trixie [17:38:13] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backupmon1001.eqiad.wmnet with reason: upgrade [17:38:57] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:39:02] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500 (10AnnieKim_WMDE) 03NEW [17:39:23] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:40:02] (03PS17) 10Bking: WIP: dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - 10https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) [17:40:37] (03PS18) 10Bking: dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - 10https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) [17:40:37] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting 
dns3004.wikimedia.org [17:42:21] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:43:30] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:46:16] (03PS6) 10Jcrespo: mediabackup: Deploy new ms-backup hosts on both dcs [puppet] - 10https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464) [17:46:21] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [17:49:06] (03PS19) 10Bking: dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - 10https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) [17:49:37] (03CR) 10CI reject: [V:04-1] dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - 10https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: 10Bking) [17:51:14] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2049.codfw.wmnet [17:52:27] (03PS20) 10Bking: dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - 10https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) [17:54:08] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2050.codfw.wmnet [17:55:37] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org [17:56:25] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3077.esams.wmnet with OS trixie [17:59:56] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3078.esams.wmnet with OS trixie [18:00:05] andre and brennen: OwO what's this, a deployment window?? MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1800). nyaa~ [18:00:13] nah. 
[18:00:39] (03PS1) 10BCornwall: Add sre.cdn.roll-restart-reboot-proxoid [cookbooks] - 10https://gerrit.wikimedia.org/r/1254997 [18:01:15] haha [18:01:40] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm [18:02:27] (03PS12) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) [18:02:27] (03PS1) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028) [18:02:31] (03PS1) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) [18:03:40] (03CR) 10CI reject: [V:04-1] mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [18:03:41] FIRING: JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:03:43] 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure Security, and 2 others: Unexpected media growth led to low disk resources on several media backup hosts - https://phabricator.wikimedia.org/T410028#11724749 (10jcrespo) [18:04:12] 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure Security, and 2 others: Unexpected media growth led to low disk resources on several media backup hosts - https://phabricator.wikimedia.org/T410028#11724753 (10jcrespo) [18:04:25] FIRING: [12x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:04:28] 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure Security, and 2 others: Unexpected media growth led to low disk resources on several media backup hosts - https://phabricator.wikimedia.org/T410028#11724754 (10jcrespo) p:05Triage→03High [18:04:35] (03PS7) 10Jcrespo: mediabackup: Deploy new ms-backup hosts on both dcs [puppet] - 10https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464) [18:05:00] RESOLVED: JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:07:53] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5032.eqsin.wmnet with reason: host reimage [18:08:56] (03PS2) 10BCornwall: Add sre.cdn.roll-restart-reboot-proxoid [cookbooks] - 10https://gerrit.wikimedia.org/r/1254997 [18:09:07] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1258: Ready [18:09:25] FIRING: [12x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:10:17] (03PS3) 10BCornwall: Add sre.cdn.roll-restart-reboot-tcp-proxy [cookbooks] - 10https://gerrit.wikimedia.org/r/1254997 [18:12:28] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org [18:12:53] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5032.eqsin.wmnet with reason: host reimage [18:13:02] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3078.esams.wmnet [reason: trixie reimaging] [18:13:08] !log cdobbins@cumin2002 conftool action : set/pooled=yes; 
selector: name=cp3077.esams.wmnet [reason: trixie reimaging] [18:13:15] (03PS1) 10Jcrespo: mediabackups: Open s3 storage port on storage hosts from working hosts [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) [18:13:50] (03PS2) 10Jcrespo: mediabackups: Open s3 storage port on storage hosts from working hosts [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) [18:14:04] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [18:14:43] (03PS2) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) [18:15:24] (03PS3) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) [18:15:25] (03CR) 10CI reject: [V:04-1] mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [18:15:30] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [18:16:07] (03PS2) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028) [18:16:09] (03CR) 10CI reject: [V:04-1] mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [18:16:11] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [18:16:12] !log 
cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp5017.eqsin.wmnet [reason: trixie reimaging] [18:16:18] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [18:16:29] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp5017.eqsin.wmnet [reason: trixie reimaging] [18:16:38] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp5018.eqsin.wmnet [reason: trixie reimaging] [18:17:06] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5018.eqsin.wmnet with OS trixie [18:17:17] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp5017.eqsin.wmnet [reason: trixie reimaging] [18:17:19] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5031.eqsin.wmnet with reason: host reimage [18:18:03] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5017.eqsin.wmnet with OS trixie [18:18:18] (03PS4) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) [18:18:26] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [18:18:41] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:18:52] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11724854 (10herron) [18:20:26] (03PS3) 10Jcrespo: mediabackups: Open s3 
storage port on storage hosts from working hosts [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) [18:20:34] (03PS5) 10Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) [18:20:44] (03PS4) 10Jcrespo: mediabackups: Open s3 storage port on storage hosts from working hosts [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) [18:21:16] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464) (owner: 10Jcrespo) [18:21:34] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [18:21:44] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [18:23:41] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:24:54] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5031.eqsin.wmnet with reason: host reimage [18:27:17] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo) [18:27:28] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns4004.wikimedia.org [18:29:58] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2051.codfw.wmnet [18:32:56] !log brett@cumin2002 
cookbooks.sre.cdn.roll-reboot finished rebooting cp2052.codfw.wmnet [18:34:25] FIRING: [12x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:35:00] FIRING: JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:36:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11724940 (10Jclark-ctr) @BTullis Performed firmware update on backplane seems to of cleare... [18:38:41] RESOLVED: JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:39:40] FIRING: [12x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:42:36] (03PS4) 10BCornwall: Add sre.cdn.roll-restart-reboot-tcp-proxy [cookbooks] - 10https://gerrit.wikimedia.org/r/1254997 [18:43:25] (03CR) 10BCornwall: [V:03+1] "`" [cookbooks] - 10https://gerrit.wikimedia.org/r/1254997 (owner: 10BCornwall) [18:44:25] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5032.eqsin.wmnet with OS trixie [18:45:24] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5032.* [18:45:25] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host install4004.wikimedia.org with OS bookworm [18:45:37] 06SRE, 
06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11724984 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ayounsi@cumin1003 for host install4004.wikimedia.org with OS bookworm [18:46:18] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns4004.wikimedia.org [18:46:35] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5030.eqsin.wmnet with OS trixie [18:47:51] (03PS21) 10Bking: dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - 10https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) [18:47:57] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5018.eqsin.wmnet with reason: host reimage [18:48:53] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: 10Bking) [18:49:49] (03CR) 10CI reject: [V:04-1] dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - 10https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: 10Bking) [18:50:11] PROBLEM - Host cloudrabbit2001-dev is DOWN: PING CRITICAL - Packet loss = 100% [18:51:41] RECOVERY - Host cloudrabbit2001-dev is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms [18:54:05] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5018.eqsin.wmnet with reason: host reimage [18:55:54] (03PS1) 10Ahmon Dancy: scap.cfg.erb: [eqiad1.wikimedia.cloud] remove php_parsoid from mw_web_clusters [puppet] - 10https://gerrit.wikimedia.org/r/1255012 (https://phabricator.wikimedia.org/T420509) [18:56:01] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5031.eqsin.wmnet with OS trixie [18:56:04] FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via 
mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [18:56:17] (03PS22) 10Bking: dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - 10https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) [18:56:23] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5031.* [18:57:09] PROBLEM - Host cloudrabbit2002-dev is DOWN: PING CRITICAL - Packet loss = 100% [18:57:32] (03CR) 10Jforrester: [C:03+1] scap.cfg.erb: [eqiad1.wikimedia.cloud] remove php_parsoid from mw_web_clusters [puppet] - 10https://gerrit.wikimedia.org/r/1255012 (https://phabricator.wikimedia.org/T420509) (owner: 10Ahmon Dancy) [18:59:28] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: 10Bking) [18:59:41] RECOVERY - Host cloudrabbit2002-dev is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms [19:00:20] (03CR) 10BCornwall: [V:03+1] "Not sure if this is desired as there's already https://wikitech.wikimedia.org/wiki/Gerrit/tcp-proxy#Service_restarts_and_depooling" [cookbooks] - 10https://gerrit.wikimedia.org/r/1254997 (owner: 10BCornwall) [19:01:04] RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins [19:01:18] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org [19:02:14] !log brett@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp5029.eqsin.wmnet [19:02:37] (03PS1) 10Jdlrobson: Guard for JS null deref on empty Parsoid sections [extensions/MobileFrontend] 
(wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255013 (https://phabricator.wikimedia.org/T419721) [19:06:03] brett@cumin2002 upgrade-firmware (PID 3816247) is awaiting input [19:06:11] PROBLEM - Host cloudrabbit2003-dev is DOWN: PING CRITICAL - Packet loss = 100% [19:06:34] (03CR) 10Dzahn: [C:03+2] scap.cfg.erb: [eqiad1.wikimedia.cloud] remove php_parsoid from mw_web_clusters [puppet] - 10https://gerrit.wikimedia.org/r/1255012 (https://phabricator.wikimedia.org/T420509) (owner: 10Ahmon Dancy) [19:06:55] (03CR) 10Ssingh: [C:03+1] "Happy to check the DNS hosts explicitly after this change, since they are more critical than the Wikidough ones." [homer/public] - 10https://gerrit.wikimedia.org/r/1254185 (https://phabricator.wikimedia.org/T420342) (owner: 10Ayounsi) [19:07:41] RECOVERY - Host cloudrabbit2003-dev is UP: PING OK - Packet loss = 0%, RTA = 30.44 ms [19:08:38] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on install4004.wikimedia.org with reason: host reimage [19:08:47] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5029.eqsin.wmnet with OS trixie [19:08:52] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2053.codfw.wmnet [19:08:53] !log brett@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp5029.eqsin.wmnet [19:09:25] FIRING: [12x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:11:41] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2054.codfw.wmnet [19:13:43] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on install4004.wikimedia.org with reason: host reimage [19:13:47] (03PS4) 10BCornwall: trafficserver: Update single_backend site comments [puppet] - 10https://gerrit.wikimedia.org/r/1254254 [19:13:57] !log 
brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5030.eqsin.wmnet with reason: host reimage [19:14:10] (03CR) 10BCornwall: trafficserver: Update single_backend site comments (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1254254 (owner: 10BCornwall) [19:14:25] FIRING: [12x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:15:19] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11725109 (10Ladsgroup) I'm about to make this a lot less gradual. On the ground that we have thumb steps now plus I really don't want to spend all of 2026 (and even 2027) babysitting... [19:17:39] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5030.eqsin.wmnet with reason: host reimage [19:17:50] (03CR) 10Xcollazo: [C:03+1] "LGTM!" 
[dumps] - 10https://gerrit.wikimedia.org/r/1251169 (https://phabricator.wikimedia.org/T401296) (owner: 10WMDE-leszek) [19:18:07] (03CR) 10Ssingh: [C:03+1] trafficserver: Update single_backend site comments [puppet] - 10https://gerrit.wikimedia.org/r/1254254 (owner: 10BCornwall) [19:18:27] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org [19:19:11] (03PS1) 10Ottomata: mw-page-edit-type-enrich-next - increase taskmanager replicas while we backfill [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255017 (https://phabricator.wikimedia.org/T351225) [19:20:49] (03CR) 10Ottomata: [C:03+2] mw-page-edit-type-enrich-next - increase taskmanager replicas while we backfill [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255017 (https://phabricator.wikimedia.org/T351225) (owner: 10Ottomata) [19:22:44] (03Merged) 10jenkins-bot: mw-page-edit-type-enrich-next - increase taskmanager replicas while we backfill [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255017 (https://phabricator.wikimedia.org/T351225) (owner: 10Ottomata) [19:23:43] !log otto@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [19:23:48] (03CR) 10BCornwall: [C:03+2] trafficserver: Update single_backend site comments [puppet] - 10https://gerrit.wikimedia.org/r/1254254 (owner: 10BCornwall) [19:23:57] !log otto@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [19:25:11] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11725166 (10Ladsgroup) [19:26:06] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5018.eqsin.wmnet with OS trixie [19:26:11] FYI, I'm going to be testing something briefly in mw-debug (codfw) [19:27:24] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [19:27:42] !log 
swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [19:28:04] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp5018.eqsin.wmnet [reason: trixie reimaging] [19:28:19] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp5020.eqsin.wmnet [reason: trixie reimaging] [19:29:02] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5020.eqsin.wmnet with OS trixie [19:30:08] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host install4004.wikimedia.org with OS bookworm [19:30:20] 06SRE, 06collaboration-services, 10Ganeti, 06Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11725172 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ayounsi@cumin1003 for host install4004.wikimedia.org with OS bookworm complet... [19:33:27] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns5004.wikimedia.org [19:34:42] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11725174 (10RobH) > Comentário gerado em Smart Hands: Good afternoon, > > We carried out the replacement of the fiber optic patch cable. A 10‑meter patch cable ava... 
[19:35:04] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [19:35:17] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [19:35:24] all done [19:35:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:35:29] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11725175 (10Ladsgroup) [19:39:25] FIRING: [12x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:39:30] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5017.eqsin.wmnet with OS trixie [19:39:47] jouncebot: nowandnext [19:39:47] For the next 0 hour(s) and 20 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1800) [19:39:47] In 0 hour(s) and 20 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T2000) [19:40:09] (03Abandoned) 10Ebernhardson: semanticsearch: Increase heap by 1gb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1249382 (https://phabricator.wikimedia.org/T414623) (owner: 10Ebernhardson) [19:41:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [19:42:42] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5029.eqsin.wmnet with reason: host reimage [19:46:04] 10ops-magru: Inbound errors on interface 
cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11725225 (10RobH) Errors returned, Arzhel redrained the link, update sent to ticket: > Support, > > Thank you for swapping out fiber 70152 with 260301, but it... [19:48:15] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5030.eqsin.wmnet with OS trixie [19:49:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1252684 (https://phabricator.wikimedia.org/T420142) (owner: 10Pppery) [19:49:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/MobileFrontend] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255013 (https://phabricator.wikimedia.org/T419721) (owner: 10Jdlrobson) [19:49:25] FIRING: [12x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [19:49:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1250095 (https://phabricator.wikimedia.org/T418066) (owner: 10Pppery) [19:49:39] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2055.codfw.wmnet [19:49:46] !log reedy@deploy2002 Synchronized private/PrivateSettings.php: Set $wgOATHSecretKey T404363 (duration: 05m 51s) [19:49:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1242542 (https://phabricator.wikimedia.org/T414048) (owner: 10Pppery) [19:49:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11725240 (10VRiley-WMF) No problem. Let us know when this can be closed. Thank you @BTullis [19:49:52] T404363: Set OATHSecretKey value within Wikimedia production and migrate older 2fa data within oathauth_devices - https://phabricator.wikimedia.org/T404363 [19:49:56] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5029.eqsin.wmnet with reason: host reimage [19:50:10] !log running `mwscript extensions/OATHAuth/maintenance/UpdateSecretsToEncryptedFormat.php --wiki=metawiki` T404363 [19:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:21] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2056.codfw.wmnet [19:50:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [19:50:44] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns5004.wikimedia.org [19:51:30] !log running `foreachwikiindblist private.dblist extensions/OATHAuth/maintenance/UpdateSecretsToEncryptedFormat.php` T404363 [19:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:43] !log running `foreachwikiindblist fishbowl.dblist extensions/OATHAuth/maintenance/UpdateSecretsToEncryptedFormat.php` T404363 [19:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:52] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11725245 (10Ladsgroup) [19:56:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has 
failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T2000). [20:00:05] hector-arroyo, cscott, Kemayo, Pppery, and jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] here [20:00:16] o/ [20:01:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:04:08] Well, I'm going to go ahead and backport mine. Anyone else want theirs rolled in? [20:04:28] I am not free for next 30m but can help with mine and others in second half. https://gerrit.wikimedia.org/r/c/1255013/ is likely to be a deploy blocker if I don't backport it. 
[20:05:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254865 (https://phabricator.wikimedia.org/T418367) (owner: 10Kgraessle) [20:05:18] !log herron@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-logging-codfw [20:05:29] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5030.* [20:05:32] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5017.eqsin.wmnet with OS trixie [20:05:44] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org [20:05:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254965 (owner: 10DLynch) [20:05:59] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5028.eqsin.wmnet with OS trixie [20:05:59] No takers, so I have gone ahead with just my patch. 
[20:07:40] (03Merged) 10jenkins-bot: Editcheck: fix tagging not happening for non-default checks [extensions/VisualEditor] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254965 (owner: 10DLynch) [20:08:14] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1254965|Editcheck: fix tagging not happening for non-default checks]] [20:08:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:09:23] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:09:34] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1033.eqiad.wmnet with OS trixie [20:09:40] FIRING: [10x] BFDdown: BFD session down between asw1-b12-drmrs and 185.15.58.5 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:10:23] !log kemayo@deploy2002 kemayo: Backport for [[gerrit:1254965|Editcheck: fix tagging not happening for non-default checks]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:10:54] !log kemayo@deploy2002 kemayo: Continuing with sync [20:11:07] o/ [20:11:20] Kemayo: sorry i was slow. but going ahead was the right thing! 
[20:12:03] 🎉 [20:13:12] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11725317 (10herron) [20:14:09] if hector-arroyo isn't here i'll go next i guess [20:14:24] I'm here [20:14:29] but go ahead [20:14:42] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254965|Editcheck: fix tagging not happening for non-default checks]] (duration: 06m 28s) [20:14:49] Mine's done, the floor is open. [20:15:20] ok, i'm jumping in; maybe hector-arroyo and Pppery can combine their config patches in the next slot [20:15:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [extensions/DiscussionTools] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254956 (https://phabricator.wikimedia.org/T376183) (owner: 10C. Scott Ananian) [20:15:25] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 7 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:18:31] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org [20:19:25] FIRING: [10x] BFDdown: BFD session down between asw1-b12-drmrs and 185.15.58.5 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:19:41] (03Merged) 10jenkins-bot: Limit legacy postprocessing cache to pages where DT does apply [extensions/DiscussionTools] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254956 (https://phabricator.wikimedia.org/T376183) (owner: 10C. 
Scott Ananian) [20:20:31] FIRING: ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:20:39] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1033.eqiad.wmnet with reason: host reimage [20:21:18] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11725330 (10herron) [20:21:36] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5029.eqsin.wmnet with OS trixie [20:22:25] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5029.* [20:22:58] cscott: let me know when you are done. I can do the remaining deploys (if their owners show!) [20:24:12] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5027.eqsin.wmnet with OS trixie [20:24:20] !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1254956|Limit legacy postprocessing cache to pages where DT does apply (T376183)]] [20:24:24] T376183: Use postprocessing cache for Discussion Tools - https://phabricator.wikimedia.org/T376183 [20:25:24] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1033.eqiad.wmnet with reason: host reimage [20:25:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:26:20] !log cscott@deploy2002 cscott: Backport for [[gerrit:1254956|Limit legacy postprocessing cache to pages where DT does apply (T376183)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:28:29] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2057.codfw.wmnet [20:28:29] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_codfw and not P{cp2041.codfw.wmnet} and A:cp [20:28:45] !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2058.codfw.wmnet [20:28:45] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_codfw and not P{cp2042.codfw.wmnet} and A:cp [20:29:58] Gerrit seems down? [20:30:04] Same for me [20:30:23] It has been giving some 502s... The bot reported it too above [20:30:32] for me it is working, but was super slow just a few seconds ago [20:30:43] (Maybe it was the reboot of the CDN above)? [20:30:45] it was down for like 10-15 mins for me but now it's working again [20:31:06] but it kept giving me 502s since this morning [20:33:31] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns6002.wikimedia.org [20:33:58] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11725380 (10herron) [20:34:21] !log cscott@deploy2002 cscott: Continuing with sync [20:35:15] !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx-out2001.wikimedia.org with reason: T419960 [20:35:31] RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:36:38] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [20:36:38] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [20:37:24] PROBLEM - BFD status on 
asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:37:42] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5017.eqsin.wmnet with reason: host reimage [20:38:13] !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254956|Limit legacy postprocessing cache to pages where DT does apply (T376183)]] (duration: 13m 54s) [20:38:17] T376183: Use postprocessing cache for Discussion Tools - https://phabricator.wikimedia.org/T376183 [20:38:41] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:39:18] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11725406 (10herron) [20:39:25] FIRING: [10x] BFDdown: BFD session down between asw1-b13-drmrs and 185.15.58.37 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:41:09] ok over to whoever is next [20:41:16] hector-arroyo? 
[20:42:08] ok [20:42:21] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:42:25] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 7 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:42:40] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [20:42:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1033.eqiad.wmnet with OS trixie [20:42:45] !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx-out1001.wikimedia.org with reason: T419960 [20:43:35] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns6002.wikimedia.org [20:43:47] newbie question: should I just click on "deploy change" on https://schedule-deployment.toolforge.org/window/1773864000? when I do so, I get an error ("access denied due to lack of permissions") [20:43:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11725412 (10Jclark-ctr) 05Open→03Resolved [20:44:25] FIRING: [10x] BFDdown: BFD session down between asw1-b13-drmrs and 185.15.58.37 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [20:44:29] hector-arroyo: i have a deploy blocker which is also MobileFrontend. Would it be okay to do them together? 
[20:44:38] (I can also do the deploys if that's helpful) [20:44:39] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5017.eqsin.wmnet with reason: host reimage [20:45:03] sure [20:45:09] mine is just a config change [20:45:15] ok want me to deploy them? [20:45:21] yes, please [20:45:25] ok starting now [20:45:28] thx [20:45:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255013 (https://phabricator.wikimedia.org/T419721) (owner: 10Jdlrobson) [20:45:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254889 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [20:46:31] Hey [20:46:32] What's the problem with the CI now? [20:46:51] (03Merged) 10jenkins-bot: Reapply "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254889 (https://phabricator.wikimedia.org/T419125) (owner: 10Harroyo-wmf) [20:48:41] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:48:47] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5028.eqsin.wmnet with OS trixie [20:49:09] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5028.eqsin.wmnet with OS trixie [20:49:32] Neriah: CI issues are #wikimedia-releng territory [20:50:03] !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx-in2001.wikimedia.org with reason: T419960 [20:50:27] !log cdobbins@cumin2002 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host cp5020.eqsin.wmnet with OS trixie [20:51:28] !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx-in1001.wikimedia.org with reason: T419960 [20:51:54] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5027.eqsin.wmnet with reason: host reimage [20:52:38] (03PS1) 10Jforrester: Revert "OrchestratorRequest: Switch evaluations to v2 endpoint" [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255034 (https://phabricator.wikimedia.org/T418887) [20:52:45] !log herron@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-logging-codfw [20:56:29] (03PS1) 10CDanis: Add fundraising-data-uploader role user [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) [20:56:50] hector-arroyo: almost ready for testing on debug [20:56:54] (03Merged) 10jenkins-bot: Guard for JS null deref on empty Parsoid sections [extensions/MobileFrontend] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255013 (https://phabricator.wikimedia.org/T419721) (owner: 10Jdlrobson) [20:57:02] (03CR) 10CI reject: [V:04-1] Add fundraising-data-uploader role user [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [20:57:30] !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1255013|Guard for JS null deref on empty Parsoid sections (T419721)]], [[gerrit:1254889|Reapply "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" (T419125)]] [20:57:36] T419721: Various client errors relating to MobileFrontend section collapsing - https://phabricator.wikimedia.org/T419721 [20:57:36] T419125: hCaptcha: Update mediawiki-config to enforce checks for API edits coming from the MobileFrontend - https://phabricator.wikimedia.org/T419125 [20:58:01] !log brett@cumin2002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5027.eqsin.wmnet with reason: host reimage [20:58:16] 10SRE-SLO, 13Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11725460 (10herron) [20:58:35] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org [20:59:14] !log herron@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-logging-eqiad [20:59:36] !log jdlrobson@deploy2002 jdlrobson, harroyo-wmf: Backport for [[gerrit:1255013|Guard for JS null deref on empty Parsoid sections (T419721)]], [[gerrit:1254889|Reapply "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" (T419125)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:59:57] hector-arroyo: please test and give me green light to sync! [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T2100) [21:00:14] (03PS2) 10CDanis: Add fundraising-data-uploader role user [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) [21:00:33] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [21:00:59] (03PS1) 10Ssingh: P:dns::auth: update check for authdns_update_run [puppet] - 10https://gerrit.wikimedia.org/r/1255038 [21:02:14] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8298/co" [puppet] - 10https://gerrit.wikimedia.org/r/1255038 (owner: 10Ssingh) [21:02:14] (03PS1) 10Jforrester: Activate Abstract Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255039 [21:02:20] PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 2 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:03:39] hector-arroyo: all good? We are overrunning our window now so I'd like to wrap it up. [21:03:59] If they are not around, it should be safe to sync [21:04:11] We're around. [21:04:13] I don't see my changes working in https://test.wikipedia.org/wiki/Test [21:04:16] The idea was to test the broken functionality this enables on testwiki to find where it was broken [21:04:16] And waiting to create the wiki. [21:04:25] FIRING: [10x] BFDdown: BFD session down between asw1-b3-magru and 195.200.68.4 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:04:36] should i continue to sync anyway and then you can debug further? [21:04:46] yes [21:04:56] !log jdlrobson@deploy2002 jdlrobson, harroyo-wmf: Continuing with sync [21:05:27] good luck hector-arroyo ! [21:05:29] (03PS7) 10Jforrester: Create Abstract Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247650 (https://phabricator.wikimedia.org/T411725) [21:07:20] RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:07:26] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5028.eqsin.wmnet with OS trixie [21:07:46] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5028.eqsin.wmnet with OS trixie [21:08:50] !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255013|Guard for JS null deref on empty Parsoid sections (T419721)]], [[gerrit:1254889|Reapply "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" (T419125)]] (duration: 11m 20s) [21:08:55] T419721: Various client errors relating to MobileFrontend section collapsing - https://phabricator.wikimedia.org/T419721 [21:08:55] T419125: hCaptcha: Update mediawiki-config to enforce checks for API edits coming from the 
MobileFrontend - https://phabricator.wikimedia.org/T419125 [21:09:01] Jdlrobson: Do you have more to deploy or can I start? [21:09:25] FIRING: [10x] BFDdown: BFD session down between asw1-b3-magru and 195.200.68.4 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:11:58] I'm going to take that as a yes. [21:12:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255034 (https://phabricator.wikimedia.org/T418887) (owner: 10Jforrester) [21:12:32] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5020.eqsin.wmnet with OS trixie [21:14:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255034 (https://phabricator.wikimedia.org/T418887) (owner: 10Jforrester) [21:14:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247650 (https://phabricator.wikimedia.org/T411725) (owner: 10Jforrester) [21:15:15] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org [21:15:27] (03Merged) 10jenkins-bot: Create Abstract Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1247650 (https://phabricator.wikimedia.org/T411725) (owner: 10Jforrester) [21:15:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [21:16:45] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5017.eqsin.wmnet with OS trixie [21:17:22] (03Merged) 10jenkins-bot: Revert "OrchestratorRequest: Switch evaluations to v2 endpoint" 
[extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255034 (https://phabricator.wikimedia.org/T418887) (owner: 10Jforrester) [21:17:54] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1255034|Revert "OrchestratorRequest: Switch evaluations to v2 endpoint" (T418887)]], [[gerrit:1247650|Create Abstract Wikipedia (T411725 T411726)]] [21:18:01] T418887: Collect and decide on whether and how to fix community-experienced changes with the v2 orchestrator - https://phabricator.wikimedia.org/T418887 [21:18:02] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [21:18:02] T411725: Set up Wikimedia production config to allow abstract.wikipedia.org to be a special wiki - https://phabricator.wikimedia.org/T411725 [21:18:02] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [21:18:02] T411726: Set up initial wiki settings for Abstract Wikipedia - https://phabricator.wikimedia.org/T411726 [21:19:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11725554 (10VRiley-WMF) Had to reset the iDrac, but it should be good to go. @Jgreen [21:20:08] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1255034|Revert "OrchestratorRequest: Switch evaluations to v2 endpoint" (T418887)]], [[gerrit:1247650|Create Abstract Wikipedia (T411725 T411726)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:20:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11725558 (10VRiley-WMF) a:05VRiley-WMF→03Jgreen [21:20:45] !log jforrester@deploy2002 jforrester: Continuing with sync [21:23:52] (03CR) 10BCornwall: [C:03+1] P:dns::auth: update check for authdns_update_run [puppet] - 10https://gerrit.wikimedia.org/r/1255038 (owner: 10Ssingh) [21:24:38] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255034|Revert "OrchestratorRequest: Switch evaluations to v2 endpoint" (T418887)]], [[gerrit:1247650|Create Abstract Wikipedia (T411725 T411726)]] (duration: 06m 44s) [21:24:48] T418887: Collect and decide on whether and how to fix community-experienced changes with the v2 orchestrator - https://phabricator.wikimedia.org/T418887 [21:24:48] T411725: Set up Wikimedia production config to allow abstract.wikipedia.org to be a special wiki - https://phabricator.wikimedia.org/T411725 [21:24:49] T411726: Set up initial wiki settings for Abstract Wikipedia - https://phabricator.wikimedia.org/T411726 [21:25:37] (03PS2) 10Daniel Kinzler: rest-gateway: update readme [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254848 [21:26:12] !log jforrester@deploy2002 mwscript-k8s job started: extensions/WikimediaMaintenance/maintenance/addWiki.php --wiki=abstractwiki # T411723 addWiki.php run [21:26:16] T411723: Set up abstract.wikipedia.org as a new wiki - https://phabricator.wikimedia.org/T411723 [21:27:02] !log jforrester@deploy2002 mwscript-k8s job started: extensions/WikimediaMaintenance/maintenance/addWiki.php --wiki=abstractwiki # T411723 addWiki.php run [21:28:44] Well that's unfortunate. 
[21:29:33] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5027.eqsin.wmnet with OS trixie [21:30:15] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org [21:30:27] (03CR) 10Hashar: [C:04-1] "This is quite arbitrary and it has some issues:" [puppet] - 10https://gerrit.wikimedia.org/r/1254940 (https://phabricator.wikimedia.org/T420189) (owner: 10Arnaudb) [21:30:40] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5028.eqsin.wmnet with OS trixie [21:31:04] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5028.eqsin.wmnet with OS trixie [21:33:09] (03PS4) 10JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - 10https://gerrit.wikimedia.org/r/1142675 [21:34:28] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:34:40] FIRING: [10x] BFDdown: BFD session down between asw1-b4-magru and 195.200.68.37 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:39:33] (03CR) 10Kamila Součková: [C:03+1] rest gateway: merge authed-other into authed-bot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254921 (https://phabricator.wikimedia.org/T420467) (owner: 10Daniel Kinzler) [21:40:02] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5020.eqsin.wmnet with reason: host reimage [21:40:39] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142675 (owner: 10JHathaway) [21:41:08] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11725650 (10RobH) The optic was swapped, but the errors resumed. Arzhel got me setup with an EdgeUno portal account so I can view the two circuits and opened case... 
[21:41:26] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11725652 (10RobH) a:05ayounsi→03RobH [21:41:27] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:41:27] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5027.* [21:44:04] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5020.eqsin.wmnet with reason: host reimage [21:44:25] FIRING: [10x] BFDdown: BFD session down between asw1-b4-magru and 195.200.68.37 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:45:50] (03PS4) 10Kamila Součková: shellbox: Setup shellbox-icu72 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) [21:49:06] !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org [21:49:07] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox [21:49:20] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11725676 (10TheDJ) I was testing File:High_quality_skull.stl locally via instantcommons. And i'm not sure why, but it seems my setup... 
[21:49:25] (CR) Kamila Součková: shellbox: Setup shellbox-icu72 (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: Kamila Součková) [21:51:27] !log herron@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-logging-eqiad [21:53:27] (PS5) Kamila Součková: shellbox: Setup shellbox-icu72 [deployment-charts] - https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) [21:56:01] (CR) CI reject: [V:-1] shellbox: Setup shellbox-icu72 [deployment-charts] - https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: Kamila Součková) [22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T2200) [22:03:21] (CR) Kamila Součková: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: Kamila Součková) [22:04:54] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5028.eqsin.wmnet with reason: host reimage [22:08:39] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5028.eqsin.wmnet with reason: host reimage [22:16:38] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5020.eqsin.wmnet with OS trixie [22:25:50] PROBLEM - Host logging-hd2004 is DOWN: PING CRITICAL - Packet loss = 100% [22:27:10] RECOVERY - Host logging-hd2004 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [22:40:03] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5028.eqsin.wmnet with OS trixie [22:40:15] SRE-swift-storage, Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11725876 (Ladsgroup) [22:47:24] (CR) Dzahn: 
"https://puppet-compiler.wmflabs.org/output/1254331/8300/" [puppet] - https://gerrit.wikimedia.org/r/1254331 (https://phabricator.wikimedia.org/T420246) (owner: Dzahn) [22:48:36] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11725889 (Ladsgroup) STL is basically the only file handler left that is not following thumb steps yet (everything else from T41480... [23:01:57] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5028.* [23:02:03] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5020.* [23:04:50] (PS23) Ryan Kemper: dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: Bking) [23:06:02] (PS24) Ryan Kemper: dse-k8s: Auto-set OpenSearch pod readahead values [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: Bking) [23:08:12] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5017.* [23:08:20] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: Bking) [23:15:57] (CR) Jforrester: "This is blocked by T420531." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1255039 (owner: Jforrester) [23:21:04] (PS1) Sportzpikachu: Allow `ws://localhost:*` and `wss://localhost:*` in CSP [puppet] - https://gerrit.wikimedia.org/r/1255066 (https://phabricator.wikimedia.org/T420539) [23:23:01] !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:cassandra-dev [23:23:08] (PS2) Sportzpikachu: Allow `ws://localhost:*` and `wss://localhost:*` in CSP [puppet] - https://gerrit.wikimedia.org/r/1255066 (https://phabricator.wikimedia.org/T420539) [23:28:43] (PS25) Ryan Kemper: dse-k8s: Auto-set OpenSearch pod readahead values [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: Bking) [23:35:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:35:42] (CR) Dzahn: "Here is the reason why the compiler output can be so confusing (as in "why does it create timers on BOTH sides"?):" [puppet] - https://gerrit.wikimedia.org/r/1254331 (https://phabricator.wikimedia.org/T420246) (owner: Dzahn) [23:35:47] (CR) Dzahn: [C:+2] releases: remove "unless" condition around rsync data copy [puppet] - https://gerrit.wikimedia.org/r/1254331 (https://phabricator.wikimedia.org/T420246) (owner: Dzahn) [23:49:54] !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:cassandra-dev [23:52:25] (CR) Scardenasmolinar: [C:+1] Deploy Extension:PersonalDashboard to English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1254865 (https://phabricator.wikimedia.org/T418367) (owner: Kgraessle) [23:57:45] !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host kafkamon1003.eqiad.wmnet [23:58:44] !log 
releases2003 - kill 782 (stunnel4) - systemctl start stunnel4 - fix T420246 T420388 T420411 [23:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:51] T420246: SystemdUnitFailed - rsync releases2003 - https://phabricator.wikimedia.org/T420246 [23:58:51] T420388: SystemdUnitFailed - https://phabricator.wikimedia.org/T420388 [23:58:52] T420411: PuppetFailure - https://phabricator.wikimedia.org/T420411