[00:05:16] <icinga-wm>	 PROBLEM - Host wikikube-worker1036 is DOWN: PING CRITICAL - Packet loss = 33%, RTA = 2193.41 ms
[00:05:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:05:34] <icinga-wm>	 RECOVERY - Host wikikube-worker1036 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[00:10:12] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[00:20:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:30:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:39:47] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1254392
[00:39:47] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1254392 (owner: TrainBranchBot)
[00:40:09] <wikibugs>	 SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11721374 (Ladsgroup) >>! In T414805#11675775, @Ladsgroup wrote: >>>! In T414805#11668230, @Ladsgroup wrote: >> Top "file formats" f...
[00:53:57] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1254392 (owner: TrainBranchBot)
[01:00:47] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[01:09:42] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1254418
[01:09:42] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1254418 (owner: TrainBranchBot)
[01:10:16] <logmsgbot>	 !log denisse@deploy2002 Started deploy [librenms/librenms@d152b36]: Upgrade LibreNMS to 25.11.0
[01:10:24] <logmsgbot>	 !log denisse@deploy2002 Finished deploy [librenms/librenms@d152b36]: Upgrade LibreNMS to 25.11.0 (duration: 00m 08s)
[01:25:25] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1254418 (owner: TrainBranchBot)
[01:38:12] <logmsgbot>	 !log denisse@deploy2002 Started deploy [librenms/librenms@9bdfb73]: Upgrade LibreNMS to 26.3.1
[01:38:31] <logmsgbot>	 !log denisse@deploy2002 Finished deploy [librenms/librenms@9bdfb73]: Upgrade LibreNMS to 26.3.1 (duration: 00m 19s)
[01:50:34] <wikibugs>	 (PS1) DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1254445 (https://phabricator.wikimedia.org/T420424)
[01:52:54] <wikibugs>	 (PS1) DDesouza: miscweb(design-blog): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1254446 (https://phabricator.wikimedia.org/T344471)
[01:55:09] <wikibugs>	 (PS1) DDesouza: Undeploy participant recruitment survey on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1254448 (https://phabricator.wikimedia.org/T419275)
[01:55:17] <wikibugs>	 (CR) CI reject: [V:-1] Undeploy participant recruitment survey on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1254448 (https://phabricator.wikimedia.org/T419275) (owner: DDesouza)
[01:55:56] <wikibugs>	 (PS1) DDesouza: Undeploy participant recruitment survey on trwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1254450 (https://phabricator.wikimedia.org/T419275)
[01:56:01] <wikibugs>	 (CR) DDesouza: [C:+2] miscweb(design-blog): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1254446 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza)
[01:56:05] <wikibugs>	 (CR) DDesouza: [C:+2] miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1254445 (https://phabricator.wikimedia.org/T420424) (owner: DDesouza)
[01:57:38] <wikibugs>	 (PS1) DDesouza: Undeploy participant recruitment survey on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1254452 (https://phabricator.wikimedia.org/T419778)
[01:58:25] <wikibugs>	 (PS2) DDesouza: Undeploy participant recruitment survey on trwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1254450 (https://phabricator.wikimedia.org/T419275)
[01:58:34] <wikibugs>	 (Merged) jenkins-bot: miscweb(design-blog): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1254446 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza)
[01:58:36] <wikibugs>	 (Merged) jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1254445 (https://phabricator.wikimedia.org/T420424) (owner: DDesouza)
[02:00:48] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:03:36] <wikibugs>	 (PS2) DDesouza: Undeploy participant recruitment survey on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1254448 (https://phabricator.wikimedia.org/T419275)
[02:04:38] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[02:05:01] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[02:05:02] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[02:05:31] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[02:05:33] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[02:06:07] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[02:07:01] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[02:07:14] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[02:07:15] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[02:07:30] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[02:07:32] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[02:07:52] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[02:08:35] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 07m 47s)
[02:08:40] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:33:40] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:46:04] <wikibugs>	 (PS3) RLazarus: function-{evaluator,orchestrator}: set AppArmor profile in pod SecurityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1254338 (https://phabricator.wikimedia.org/T367880)
[02:53:07] <wikibugs>	 (PS1) Krinkle: labs: Remove redundant wgSkipSkins override [mediawiki-config] - https://gerrit.wikimedia.org/r/1254463
[03:01:09] <wikibugs>	 (PS1) MusikAnimal: CM5: add more aggressive warnings about CM5 deprecation [extensions/CodeMirror] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254468 (https://phabricator.wikimedia.org/T373720)
[03:06:54] <wikibugs>	 (CR) Bhsd: [C:+1] CM5: add more aggressive warnings about CM5 deprecation [extensions/CodeMirror] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254468 (https://phabricator.wikimedia.org/T373720) (owner: MusikAnimal)
[03:07:25] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CodeMirror] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254468 (https://phabricator.wikimedia.org/T373720) (owner: MusikAnimal)
[03:08:57] <wikibugs>	 (Merged) jenkins-bot: CM5: add more aggressive warnings about CM5 deprecation [extensions/CodeMirror] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254468 (https://phabricator.wikimedia.org/T373720) (owner: MusikAnimal)
[03:09:46] <logmsgbot>	 !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1254468|CM5: add more aggressive warnings about CM5 deprecation (T373720)]]
[03:09:50] <stashbot>	 T373720: Deprecate use of CodeMirror 5 - https://phabricator.wikimedia.org/T373720
[03:11:49] <logmsgbot>	 !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1254468|CM5: add more aggressive warnings about CM5 deprecation (T373720)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[03:18:12] <logmsgbot>	 !log musikanimal@deploy2002 musikanimal: Continuing with sync
[03:22:09] <logmsgbot>	 !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254468|CM5: add more aggressive warnings about CM5 deprecation (T373720)]] (duration: 12m 22s)
[03:22:12] <stashbot>	 T373720: Deprecate use of CodeMirror 5 - https://phabricator.wikimedia.org/T373720
[04:05:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:10:12] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[04:13:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:30:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:00:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1015:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:00:47] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T0600)
[06:37:40] <kart_>	 Deploying MinT/machinetranslation. Let's see how it goes!
[06:38:20] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[06:54:41] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[06:59:30] <wikibugs>	 (CR) Arnaudb: [C:+2] gerrit: add a ttl on ProxyPass to jetty [puppet] - https://gerrit.wikimedia.org/r/1254128 (https://phabricator.wikimedia.org/T420189) (owner: Arnaudb)
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:04:18] <wikibugs>	 (PS1) Arnaudb: trafficserver: Enable connection re-use for gerrit [puppet] - https://gerrit.wikimedia.org/r/1254746 (https://phabricator.wikimedia.org/T420189)
[07:04:57] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:05:37] <wikibugs>	 (CR) Arnaudb: [C:+2] trafficserver: Enable connection re-use for gerrit [puppet] - https://gerrit.wikimedia.org/r/1254746 (https://phabricator.wikimedia.org/T420189) (owner: Arnaudb)
[07:06:13] <wikibugs>	 (PS1) KartikMistry: machinetranslation: reduce replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1254751 (https://phabricator.wikimedia.org/T411058)
[07:06:13] <wikibugs>	 (PS1) Arnaudb: Revert "trafficserver: Enable connection re-use for gerrit" [puppet] - https://gerrit.wikimedia.org/r/1254750
[07:06:30] <wikibugs>	 (CR) Arnaudb: [V:+2] Revert "trafficserver: Enable connection re-use for gerrit" [puppet] - https://gerrit.wikimedia.org/r/1254750 (owner: Arnaudb)
[07:07:54] <wikibugs>	 (CR) Arnaudb: [V:+2 C:+2] Revert "trafficserver: Enable connection re-use for gerrit" [puppet] - https://gerrit.wikimedia.org/r/1254750 (owner: Arnaudb)
[07:10:33] <wikibugs>	 (CR) KartikMistry: [C:+2] machinetranslation: reduce replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1254751 (https://phabricator.wikimedia.org/T411058) (owner: KartikMistry)
[07:10:37] <wikibugs>	 (CR) Daniel Kinzler: rest-gateway rate limiting: add CORS headers (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: Daniel Kinzler)
[07:12:40] <wikibugs>	 (Merged) jenkins-bot: machinetranslation: reduce replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1254751 (https://phabricator.wikimedia.org/T411058) (owner: KartikMistry)
[07:16:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:16:36] <wikibugs>	 (CR) Ayounsi: [C:+2] "cool, thanks. Not strictly needed but best for the sake of completeness." [homer/public] - https://gerrit.wikimedia.org/r/1254293 (https://phabricator.wikimedia.org/T420361) (owner: Ssingh)
[07:16:50] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[07:17:13] <kart_>	 Another attempt ^
[07:18:43] <wikibugs>	 (Merged) jenkins-bot: definitions/static.net: add IPv6 addresses for nameservers [homer/public] - https://gerrit.wikimedia.org/r/1254293 (https://phabricator.wikimedia.org/T420361) (owner: Ssingh)
[07:21:48] <wikibugs>	 (PS9) Daniel Kinzler: rest-gateway: per-route jwt overrides [deployment-charts] - https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130)
[07:21:57] <wikibugs>	 (CR) CI reject: [V:-1] rest-gateway: per-route jwt overrides [deployment-charts] - https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: Daniel Kinzler)
[07:22:00] <wikibugs>	 (CR) Daniel Kinzler: rest-gateway: per-route jwt overrides (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: Daniel Kinzler)
[07:22:22] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[07:25:45] <wikibugs>	 (PS14) Daniel Kinzler: rest-gateway rate limiting: add CORS headers [deployment-charts] - https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969)
[07:26:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:27:23] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-wmde-users for Ben.buchenau - https://phabricator.wikimedia.org/T419878#11721743 (ayounsi)
[07:29:31] <wikibugs>	 SRE, MinT, Prod-Kubernetes, ServiceOps new, and 4 others: Can't deploy machinetranslation due to exceeding resource quotas - https://phabricator.wikimedia.org/T411058#11721747 (KartikMistry) @RLazarus After reducing `replicas`, I was able to deploy MinT in codfw. How to delete failing older pods...
[07:29:37] <wikibugs>	 (CR) Arnaudb: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1254237 (https://phabricator.wikimedia.org/T419878) (owner: Ayounsi)
[07:30:28] <wikibugs>	 (CR) Ayounsi: [C:+2] Add benbuchenau to analytics-wmde-users [puppet] - https://gerrit.wikimedia.org/r/1254237 (https://phabricator.wikimedia.org/T419878) (owner: Ayounsi)
[07:31:49] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-wmde-users for Ben.buchenau - https://phabricator.wikimedia.org/T419878#11721754 (ayounsi) Open→Resolved Change merged, should be live in ~30min. Please re-open if any issue.
[07:35:37] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 32934
[07:41:32] <logmsgbot>	 ayounsi@cumin1003 peering (PID 4043258) is awaiting input
[07:42:23] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop analytics cluster
[07:45:30] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 32934
[07:45:48] <wikibugs>	 SRE-tools, Infrastructure-Foundations, serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11721774 (JMeybohm) >>! In T419967#11720994, @Ajuanca wrote: > What's task `T419960` about? I don't enough privilegies to access it. Yes, I think a parameter with explici...
[07:49:59] <wikibugs>	 (CR) JMeybohm: [C:+1] "LGTM, thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/1254338 (https://phabricator.wikimedia.org/T367880) (owner: RLazarus)
[07:52:05] <wikibugs>	 SRE-swift-storage, Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11721782 (Wellverywell) Oh, ok, thank you! So is this a bug in file descriptions on Commons? (Well, for another image, 480px produces an actual 480px image -- so what is the bug in...
[07:58:40] <wikibugs>	 SRE, MinT, Prod-Kubernetes, ServiceOps new, and 3 others: Can't deploy machinetranslation due to exceeding resource quotas - https://phabricator.wikimedia.org/T411058#11721786 (JMeybohm) >>! In T411058#11721747, @KartikMistry wrote: > @RLazarus After reducing `replicas`, I was able to deploy MinT...
[08:00:05] <jouncebot>	 andre and brennen: gettimeofday() says it's time for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T0800)
[08:02:26] <icinga-wm>	 PROBLEM - Host cloudgw1003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:02:43] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[08:03:36] <godog>	 the cloud host down alerts are expected, part of T417393
[08:04:24] <icinga-wm>	 PROBLEM - Host wikikube-worker1157 is DOWN: PING CRITICAL - Packet loss = 100%
[08:04:48] <icinga-wm>	 RECOVERY - Host wikikube-worker1157 is UP: PING WARNING - Packet loss = 0%, RTA = 629.26 ms
[08:05:31] <wikibugs_>	 (PS10) Daniel Kinzler: rest-gateway: per-route jwt overrides [deployment-charts] - https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130)
[08:05:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:07:54] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[08:08:55] <wikibugs>	 SRE, MinT, Prod-Kubernetes, ServiceOps new, and 3 others: Can't deploy machinetranslation due to exceeding resource quotas - https://phabricator.wikimedia.org/T411058#11721800 (KartikMistry) >>! In T411058#11721786, @JMeybohm wrote: >>>! In T411058#11721747, @KartikMistry wrote: >> @RLazarus Afte...
[08:11:33] <kart_>	 !log codfw/eqiad: Deployed MinT (T411058)
[08:13:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:14:52] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:15:00] <icinga-wm>	 PROBLEM - Host cloudlb1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:15:51] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.46.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1254820 (https://phabricator.wikimedia.org/T413811)
[08:15:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by aklapper@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254820 (https://phabricator.wikimedia.org/T413811) (owner: TrainBranchBot)
[08:16:39] <jinxer-wm>	 FIRING: CoreBGPDown: ...
[08:16:39] <jinxer-wm>	 Core BGP session down between cloudsw1-c8-eqiad and cloudlb1001 (2a02:ec80:a000:201::2) - group cloud_host6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cloudsw1-c8-eqiad:9804&var-bgp_group=cloud_host6&var-bgp_neighbor=cloudlb1001 - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[08:17:02] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.46.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/1254820 (https://phabricator.wikimedia.org/T413811) (owner: TrainBranchBot)
[08:21:12] <hashar>	 andre: you are running the train, aren't you? :)
[08:21:26] <andre>	 hashar: yes, sorry, should have communicated
[08:21:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudlb1001 (172.20.1.2) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[08:21:39] <hashar>	 no no it was to be expected, I am merely triple checking!
[08:21:54] <hashar>	 I'll restart the CI Jenkins for a plugin update once you are done and everything is stable
[08:22:04] <hashar>	 :)
[08:22:22] <andre>	 yay
[08:22:54] <logmsgbot>	 !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.20  refs T413811
[08:22:59] <stashbot>	 T413811: 1.46.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T413811
[08:27:44] <andre>	 hashar: looks stable enough to me, go ahead
[08:28:40] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:28:58] <hashar>	 andre: merci!
[08:29:05] <andre>	 dr
[08:29:29] <hashar>	 !log Restarting CI Jenkins for plugin upgrade # T420347
[08:29:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:33] <stashbot>	 T420347: Quibble -c commands cause Jenkins Collapsible Section plugin to erase console output (Wikibase job in Jenkins do not include the full log) - https://phabricator.wikimedia.org/T420347
[08:31:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudlb1001 (172.20.1.2) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[08:33:40] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:38:40] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:41:39] <jinxer-wm>	 FIRING: TransitBGPDown: Transit BGP session down between cr1-esams and Deutsche Telekom (80.249.209.211) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr1-esams:9804&var-bgp_group=Transit4&var-bgp_neighbor=Deutsche+Telekom - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[08:43:40] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:44:31] <wikibugs>	 (CR) Vgutierrez: [V:+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8297/co" [puppet] - https://gerrit.wikimedia.org/r/1253506 (https://phabricator.wikimedia.org/T418971) (owner: Ssingh)
[08:45:03] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11721853 (WMDE-leszek) Hello, I approve this request on WMDE's end. Thank you!
[08:45:10] <wikibugs>	 (PS3) Slyngshede: service.yaml: update IPs for ulsfo-lb (text/upload/gerrit/ncredir) [puppet] - https://gerrit.wikimedia.org/r/1253506 (https://phabricator.wikimedia.org/T418971) (owner: Ssingh)
[08:46:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-esams and Deutsche Telekom (2001:7f8:1::a500:3320:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[08:46:42] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:ml-cache-eqiad
[08:46:51] <wikibugs>	 (PS1) Slyngshede: WMCS cloudgw: update IPs for ulsfo-lb (text/upload) [puppet] - https://gerrit.wikimedia.org/r/1254830 (https://phabricator.wikimedia.org/T418971)
[08:47:07] <wikibugs>	 (CR) Vgutierrez: [C:+1] service.yaml: update IPs for ulsfo-lb (text/upload/gerrit/ncredir) [puppet] - https://gerrit.wikimedia.org/r/1253506 (https://phabricator.wikimedia.org/T418971) (owner: Ssingh)
[08:47:56] <wikibugs>	 (CR) Vgutierrez: [C:+1] geo-resources: update IP addresses for ulsfo services [dns] - https://gerrit.wikimedia.org/r/1253503 (https://phabricator.wikimedia.org/T418971) (owner: Ssingh)
[08:47:58] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2070.codfw.wmnet
[08:47:58] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1071.eqiad.wmnet
[08:50:07] <jinxer-wm>	 FIRING: ProbeDown: Service ml-cache1001-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#ml-cache1001-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:50:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1023.eqiad.wmnet
[08:51:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet
[08:52:44] <moritzm>	 aux-k8s-etcd1003, dse-k8s-etcd1001, kubestagemaster1005 will go down for a Ganeti reboot
[08:53:30] <icinga-wm>	 PROBLEM - Host dse-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:53:54] <icinga-wm>	 PROBLEM - Host kubestagemaster1005 is DOWN: PING CRITICAL - Packet loss = 100%
[08:54:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-esams and 2001:7f8:1::a500:3320:1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[08:54:12] <icinga-wm>	 PROBLEM - Host aux-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:55:07] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:55:26] <icinga-wm>	 RECOVERY - Host aux-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms
[08:55:36] <icinga-wm>	 RECOVERY - Host kubestagemaster1005 is UP: PING OK - Packet loss = 0%, RTA = 1.92 ms
[08:55:50] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2070.codfw.wmnet
[08:55:54] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2071.codfw.wmnet
[08:56:06] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1071.eqiad.wmnet
[08:56:09] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1072.eqiad.wmnet
[08:56:18] <icinga-wm>	 RECOVERY - Host dse-k8s-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[08:56:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-esams and Deutsche Telekom (2001:7f8:1::a500:3320:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[08:57:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet
[08:58:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet
[08:58:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1023.eqiad.wmnet
[08:59:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-esams and 2001:7f8:1::a500:3320:1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:00:10] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:00:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1024.eqiad.wmnet
[09:00:47] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:00:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:01:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet
[09:02:07] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool ulsfo [reason: no reason specified, T418971]
[09:02:11] <stashbot>	 T418971: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971
[09:02:15] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool ulsfo [reason: no reason specified, T418971]
[09:02:47] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2071.codfw.wmnet
[09:02:51] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2072.codfw.wmnet
[09:03:26] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1072.eqiad.wmnet
[09:03:29] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1073.eqiad.wmnet
[09:04:00] <wikibugs>	 SRE, SRE-swift-storage: upload.wikimedia.org serves .ogg audio files with content-type `application/ogg` instead of `audio/ogg`. - https://phabricator.wikimedia.org/T420422#11721895 (Arendpieter)
[09:04:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr1-esams and 2001:7f8:1::a500:3320:1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:05:10] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:06:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1024.eqiad.wmnet
[09:07:16] <wikibugs>	 (PS2) Ayounsi: ulsfo: add new LVS service IP range [homer/public] - https://gerrit.wikimedia.org/r/1247994 (https://phabricator.wikimedia.org/T418971)
[09:08:19] <wikibugs>	 (CR) Vgutierrez: [C:+1] ulsfo: add new LVS service IP range [homer/public] - https://gerrit.wikimedia.org/r/1247994 (https://phabricator.wikimedia.org/T418971) (owner: Ayounsi)
[09:08:25] <wikibugs>	 (CR) Ayounsi: ulsfo: add new LVS service IP range (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1247994 (https://phabricator.wikimedia.org/T418971) (owner: Ayounsi)
[09:08:33] <logmsgbot>	 !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 23 hosts with reason: Update ULSFO LVS service IPs
[09:08:40] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2072.codfw.wmnet
[09:08:44] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2073.codfw.wmnet
[09:08:50] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11721899 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b34081b8-989e-49aa-91c7-56b4548775e2) set by slyngshede@cumin1003 for 4:00:...
[09:09:42] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "Matches other sites and what I had in https://phabricator.wikimedia.org/T408892#11330727 so LGTM." [homer/public] - https://gerrit.wikimedia.org/r/1247994 (https://phabricator.wikimedia.org/T418971) (owner: Ayounsi)
[09:10:10] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:10:22] <wikibugs>	 (CR) Ayounsi: [C:+2] ulsfo: add new LVS service IP range [homer/public] - https://gerrit.wikimedia.org/r/1247994 (https://phabricator.wikimedia.org/T418971) (owner: Ayounsi)
[09:10:54] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1073.eqiad.wmnet
[09:10:58] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1074.eqiad.wmnet
[09:11:45] <wikibugs>	 (Merged) jenkins-bot: ulsfo: add new LVS service IP range [homer/public] - https://gerrit.wikimedia.org/r/1247994 (https://phabricator.wikimedia.org/T418971) (owner: Ayounsi)
[09:12:37] <wikibugs>	 (PS1) Daniel Kinzler: rest-gateway: update readme [deployment-charts] - https://gerrit.wikimedia.org/r/1254848
[09:12:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1024.eqiad.wmnet
[09:12:45] <wikibugs>	 (CR) JMeybohm: [C:+2] kubestagemaster: Enable ipip_encapsulation and mh scheduler [puppet] - https://gerrit.wikimedia.org/r/1242289 (https://phabricator.wikimedia.org/T352956) (owner: JMeybohm)
[09:12:45] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:ml-cache-eqiad
[09:12:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1024.eqiad.wmnet
[09:13:11] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:ml-cache-codfw
[09:14:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1025.eqiad.wmnet
[09:14:24] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: wikikube-staging-master-codfw@codfw
[09:15:10] <jinxer-wm>	 RESOLVED: [5x] ProbeDown: Service ganeti1024:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:15:38] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2073.codfw.wmnet
[09:15:42] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2074.codfw.wmnet
[09:15:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet
[09:16:41] <wikibugs>	 (CR) Slyngshede: [C:+2] service.yaml: update IPs for ulsfo-lb (text/upload/gerrit/ncredir) [puppet] - https://gerrit.wikimedia.org/r/1253506 (https://phabricator.wikimedia.org/T418971) (owner: Ssingh)
[09:17:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet
[09:17:16] <wikibugs>	 SRE, SRE-swift-storage: upload.wikimedia.org serves .ogg audio files with content-type `application/ogg` instead of `audio/ogg`. - https://phabricator.wikimedia.org/T420422#11721907 (Arendpieter) The response appears to be coming from a Swift-backed object where the original object metadata is preserved...
[09:17:18] <wikibugs>	 (PS1) Mszwarc: Enable autodemotion for 2FA-less CN admins and WMF T&S [mediawiki-config] - https://gerrit.wikimedia.org/r/1254851 (https://phabricator.wikimedia.org/T418580)
[09:18:45] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1074.eqiad.wmnet
[09:18:49] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1075.eqiad.wmnet
[09:19:08] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[09:21:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet
[09:22:39] <jinxer-wm>	 FIRING: TransitBGPDown: Transit BGP session down between cr1-esams and Deutsche Telekom (2001:7f8:1::a500:3320:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr1-esams:9804&var-bgp_group=Transit6&var-bgp_neighbor=Deutsche+Telekom - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[09:22:57] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:23:10] <vgutierrez>	 !ack
[09:23:10] <sirenbot>	 Could not ack the alert. Please check the parameters.
[09:23:14] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2074.codfw.wmnet
[09:23:18] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2075.codfw.wmnet
[09:23:46] <effie>	 !incidents 
[09:23:46] <sirenbot>	 7770 (UNACKED)  [2x] ProbeDown sre (198.35.26.112 ip4 probes/service ulsfo)
[09:23:46] <sirenbot>	 7768 (RESOLVED)  NELHigh sre (thanos-rule@main tcp.timed_out)
[09:23:46] <sirenbot>	 7767 (RESOLVED)  [2x] ProbeDown sre (dse-k8s-ctrl2001:6443 probes/custom codfw)
[09:23:55] <slyngs>	 !ack 7770
[09:23:55] <sirenbot>	 7770 (ACKED)  [2x] ProbeDown sre (198.35.26.112 ip4 probes/service ulsfo)
[09:24:04] <vgutierrez>	 hmm !ack all has been changed?
[09:24:20] <effie>	 vgutierrez: anything we can help?
[09:24:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1025.eqiad.wmnet
[09:24:28] <vgutierrez>	 nope, it's related to the ulsfo maintenance
[09:24:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1025.eqiad.wmnet
[09:24:32] <vgutierrez>	 all good and expected
[09:24:43] <effie>	 grand
[09:24:57] <wikibugs>	 SRE: upload.wikimedia.org serves .ogg audio files with content-type `application/ogg` instead of `audio/ogg`. - https://phabricator.wikimedia.org/T420422#11721927 (Arendpieter)
[09:26:20] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:26:20] <icinga-wm>	 RECOVERY - Host cloudgw1003 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[09:26:20] <icinga-wm>	 RECOVERY - Host cloudlb1001 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[09:26:26] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1075.eqiad.wmnet
[09:26:29] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1076.eqiad.wmnet
[09:27:57] <logmsgbot>	 jayme@cumin1003 migrate-service-ipip (PID 4054043) is awaiting input
[09:27:57] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:30:46] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2075.codfw.wmnet
[09:30:50] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2076.codfw.wmnet
[09:30:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:31:39] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cloudsw1-c8-eqiad and cloudlb1001 (172.20.1.2) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:32:45] <jinxer-wm>	 RESOLVED: TransitBGPDown: Transit BGP session down between cr1-esams and Deutsche Telekom (2001:7f8:1::a500:3320:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr1-esams:9804&var-bgp_group=Transit6&var-bgp_neighbor=Deutsche+Telekom - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[09:34:39] <jinxer-wm>	 FIRING: TransitBGPDown: Transit BGP session down between cr1-esams and Deutsche Telekom (2001:7f8:1::a500:3320:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr1-esams:9804&var-bgp_group=Transit6&var-bgp_neighbor=Deutsche+Telekom - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[09:35:30] <jinxer-wm>	 FIRING: LibericaStaleConfig: Liberica instance lvs4010 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig - https://grafana.wikimedia.org/d/fa4de97a-7114-48c7-a91a-f56089ef554f/liberica?orgId=1&viewPanel=10&var-site=ulsfo&var-instance=lvs4010 - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[09:35:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1026.eqiad.wmnet
[09:35:46] <vgutierrez>	 liberica alert is ulsfo maintenance, all good
[09:35:53] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[09:35:53] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: wikikube-staging-master-codfw@codfw
[09:36:19] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1076.eqiad.wmnet
[09:36:22] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1077.eqiad.wmnet
[09:37:05] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.loadbalancer.migrate-service-ipip for alias: wikikube-staging-master-eqiad@eqiad
[09:37:35] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:ml-cache-codfw
[09:37:59] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2076.codfw.wmnet
[09:38:03] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2077.codfw.wmnet
[09:39:44] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3676199) is awaiting input
[09:39:52] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[09:40:30] <jinxer-wm>	 FIRING: [3x] LibericaStaleConfig: Liberica instance lvs4008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig  - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[09:40:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet
[09:40:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb2003.codfw.wmnet
[09:40:35] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.loadbalancer.admin config_reloading A:lvs-secondary-ulsfo and A:liberica (T418971)
[09:40:41] <stashbot>	 T418971: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971
[09:40:43] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading A:lvs-secondary-ulsfo and A:liberica (T418971)
[09:40:50] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[09:40:50] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for alias: wikikube-staging-master-eqiad@eqiad
[09:42:15] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1001.eqiad.wmnet
[09:43:38] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1077.eqiad.wmnet
[09:43:42] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1078.eqiad.wmnet
[09:44:13] <jayme>	 !log switched wikikube staging apiservers to IPIP and maglev in eqiad and codfw - T352956
[09:44:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:16] <stashbot>	 T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956
[09:44:33] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.loadbalancer.upgrade restart A:lvs-secondary-ulsfo and A:liberica
[09:44:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb2003.codfw.wmnet
[09:44:37] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11721959 (Aklapper) @Sarmbruster: Please also [link your LDAP account to your Phabricator account](https://phabricator.wikimedia.org/settings/panel/external/), so your 'LDAP User' ac...
[09:44:39] <jinxer-wm>	 RESOLVED: TransitBGPDown: Transit BGP session down between cr1-esams and Deutsche Telekom (2001:7f8:1::a500:3320:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr1-esams:9804&var-bgp_group=Transit6&var-bgp_neighbor=Deutsche+Telekom - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[09:44:49] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.loadbalancer.admin depooling P{lvs4010.ulsfo.wmnet} and A:liberica
[09:45:01] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs4010.ulsfo.wmnet} and A:liberica
[09:45:10] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.loadbalancer.admin pooling P{lvs4010.ulsfo.wmnet} and A:liberica
[09:45:16] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2077.codfw.wmnet
[09:45:20] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2078.codfw.wmnet
[09:45:30] <jinxer-wm>	 FIRING: [3x] LibericaStaleConfig: Liberica instance lvs4008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig  - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[09:45:32] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P{lvs4010.ulsfo.wmnet} and A:liberica
[09:45:35] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restart A:lvs-secondary-ulsfo and A:liberica
[09:45:42] <moritzm>	 !log installing postgresql-15 security updates
[09:45:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:06] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1001.eqiad.wmnet
[09:46:16] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1002.eqiad.wmnet
[09:46:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet
[09:46:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1026.eqiad.wmnet
[09:46:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1027.eqiad.wmnet
[09:46:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb1003.eqiad.wmnet
[09:48:03] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update gpt isvc image to one that removes fuse_rope_kvcache config to solve P89877 [deployment-charts] - https://gerrit.wikimedia.org/r/1254856 (https://phabricator.wikimedia.org/T418350)
[09:48:37] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1002.eqiad.wmnet
[09:48:43] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1003.eqiad.wmnet
[09:49:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet
[09:50:27] <wikibugs>	 (CR) Ozge: [C:+1] ml-services: update gpt isvc image to one that removes fuse_rope_kvcache config to solve P89877 [deployment-charts] - https://gerrit.wikimedia.org/r/1254856 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[09:50:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb1003.eqiad.wmnet
[09:51:02] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1078.eqiad.wmnet
[09:51:03] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1003.eqiad.wmnet
[09:51:05] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1079.eqiad.wmnet
[09:51:39] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2078.codfw.wmnet
[09:51:43] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2079.codfw.wmnet
[09:52:11] <wikibugs>	 (CR) Kevin Bazira: [C:+2] ml-services: update gpt isvc image to one that removes fuse_rope_kvcache config to solve P89877 [deployment-charts] - https://gerrit.wikimedia.org/r/1254856 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[09:52:14] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1001.eqiad.wmnet
[09:54:18] <wikibugs>	 (Merged) jenkins-bot: ml-services: update gpt isvc image to one that removes fuse_rope_kvcache config to solve P89877 [deployment-charts] - https://gerrit.wikimedia.org/r/1254856 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[09:55:30] <jinxer-wm>	 RESOLVED: [3x] LibericaStaleConfig: Liberica instance lvs4008 is running a stale configuration - https://wikitech.wikimedia.org/wiki/Liberica#LibericaStaleConfig  - https://alerts.wikimedia.org/?q=alertname%3DLibericaStaleConfig
[09:55:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet
[09:56:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1027.eqiad.wmnet
[09:56:47] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:57:05] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1001.eqiad.wmnet
[09:57:19] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254216 (https://phabricator.wikimedia.org/T341599) (owner: Sergio Gimeno)
[09:58:18] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1002.eqiad.wmnet
[09:59:46] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2079.codfw.wmnet
[09:59:51] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2080.codfw.wmnet
[09:59:53] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1079.eqiad.wmnet
[09:59:57] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1080.eqiad.wmnet
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1000)
[10:01:32] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.hosts.remove-downtime for 23 hosts
[10:01:40] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1002.eqiad.wmnet
[10:01:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1028.eqiad.wmnet
[10:01:45] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 23 hosts
[10:01:53] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2002.codfw.wmnet
[10:03:41] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool ulsfo [reason: no reason specified, no task ID specified]
[10:03:48] <logmsgbot>	 !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.dns.admin (exit_code=99) DNS admin: pool ulsfo [reason: no reason specified, no task ID specified]
[10:04:05] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool ulsfo [reason: no reason specified, T418971]
[10:04:09] <stashbot>	 T418971: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971
[10:04:10] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool ulsfo [reason: no reason specified, T418971]
[10:04:58] <icinga-wm>	 RECOVERY - Host ml-serve2001 is UP: PING OK - Packet loss = 0%, RTA = 30.46 ms
[10:05:18] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2002.codfw.wmnet
[10:05:30] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2001.codfw.wmnet
[10:05:32] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:06:21] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1080.eqiad.wmnet
[10:06:25] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1081.eqiad.wmnet
[10:06:30] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3681874) is awaiting input
[10:06:33] <wikibugs>	 (PS1) JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1254863 (https://phabricator.wikimedia.org/T360794)
[10:06:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1028.eqiad.wmnet
[10:07:24] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2080.codfw.wmnet
[10:07:41] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11722088 (SLyngshede-WMF) Open→Resolved @Papaul Done :-)
[10:09:10] <wikibugs>	 (CR) Vgutierrez: [C:+2] geo-resources: update IP addresses for ulsfo services [dns] - https://gerrit.wikimedia.org/r/1253503 (https://phabricator.wikimedia.org/T418971) (owner: Ssingh)
[10:09:38] <logmsgbot>	 !log vgutierrez@dns1004 START - running authdns-update
[10:10:30] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2001.codfw.wmnet
[10:10:38] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2002.codfw.wmnet
[10:11:12] <wikibugs>	 (CR) Vgutierrez: [C:+2] WMCS cloudgw: update IPs for ulsfo-lb (text/upload) [puppet] - https://gerrit.wikimedia.org/r/1254830 (https://phabricator.wikimedia.org/T418971) (owner: Slyngshede)
[10:11:22] <logmsgbot>	 !log vgutierrez@dns1004 END - running authdns-update
[10:11:41] <wikibugs>	 (CR) DCausse: [C:+1] stream: mw-page-html-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1254863 (https://phabricator.wikimedia.org/T360794) (owner: JavierMonton)
[10:12:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1028.eqiad.wmnet
[10:13:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1028.eqiad.wmnet
[10:13:53] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1081.eqiad.wmnet
[10:14:04] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2002.codfw.wmnet
[10:14:10] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2001.codfw.wmnet
[10:15:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1029.eqiad.wmnet
[10:16:25] <wikibugs>	 (PS1) Ayounsi: ulsfo: remove old LVS service IPs and range [homer/public] - https://gerrit.wikimedia.org/r/1254864 (https://phabricator.wikimedia.org/T418971)
[10:16:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1029.eqiad.wmnet
[10:17:36] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2001.codfw.wmnet
[10:17:51] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2081.codfw.wmnet
[10:17:57] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool ulsfo [reason: no reason specified, no task ID specified]
[10:17:59] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool ulsfo [reason: no reason specified, no task ID specified]
[10:18:09] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1082.eqiad.wmnet
[10:19:04] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-staging2003.codfw.wmnet
[10:22:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1029.eqiad.wmnet
[10:23:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1029.eqiad.wmnet
[10:23:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1030.eqiad.wmnet
[10:23:50] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2003.codfw.wmnet
[10:24:19] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hadoop.reboot-workers for Hadoop test cluster
[10:25:04] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2081.codfw.wmnet
[10:25:08] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2082.codfw.wmnet
[10:25:24] <wikibugs>	 (03PS1) 10Kgraessle: Deploy Extension:PersonalDashboard to English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254865 (https://phabricator.wikimedia.org/T418367)
[10:25:37] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1082.eqiad.wmnet
[10:25:40] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1083.eqiad.wmnet
[10:25:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1030.eqiad.wmnet
[10:26:02] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-staging2002.codfw.wmnet
[10:29:58] <wikibugs>	 (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254863 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton)
[10:30:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove ncredir4001/4002 [puppet] - 10https://gerrit.wikimedia.org/r/1253538 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff)
[10:30:49] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] ulsfo: remove old LVS service IPs and range [homer/public] - 10https://gerrit.wikimedia.org/r/1254864 (https://phabricator.wikimedia.org/T418971) (owner: 10Ayounsi)
[10:31:10] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove support for old Elastic releases [puppet] - 10https://gerrit.wikimedia.org/r/1247917 (https://phabricator.wikimedia.org/T388607)
[10:31:19] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2002.codfw.wmnet
[10:31:49] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] rest-gateway rate limit: add BYPASS and DENY policy and class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250598 (owner: 10Daniel Kinzler)
[10:32:04] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet
[10:32:06] <wikibugs>	 (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254863 (https://phabricator.wikimedia.org/T360794) (owner: 10JavierMonton)
[10:32:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1030.eqiad.wmnet
[10:32:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1030.eqiad.wmnet
[10:32:31] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2082.codfw.wmnet
[10:32:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1031.eqiad.wmnet
[10:32:35] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2083.codfw.wmnet
[10:32:36] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1083.eqiad.wmnet
[10:32:39] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1084.eqiad.wmnet
[10:33:11] <wikibugs>	 (03CR) 10Blake: "Sounds good, I'll make a note to deploy this on Monday." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1251045 (https://phabricator.wikimedia.org/T413974) (owner: 10Blake)
[10:34:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1031.eqiad.wmnet
[10:34:46] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:34:54] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:37:29] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2001.codfw.wmnet
[10:39:42] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1084.eqiad.wmnet
[10:39:45] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1085.eqiad.wmnet
[10:39:47] <logmsgbot>	 !log fabfur@cumin1003 START - Cookbook sre.dns.netbox
[10:40:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1031.eqiad.wmnet
[10:40:48] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "typo inline, otherwise LGTM, though I didn't nitpick the tests code" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler)
[10:40:49] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2083.codfw.wmnet
[10:40:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1031.eqiad.wmnet
[10:40:53] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2084.codfw.wmnet
[10:40:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet
[10:43:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet
[10:44:57] <wikibugs>	 (03PS1) 10Muehlenhoff: proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254867
[10:45:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Install systemd-timesyncd universally [puppet] - 10https://gerrit.wikimedia.org/r/1243756 (owner: 10Muehlenhoff)
[10:45:54] <wikibugs>	 (03PS1) 10Ayounsi: Add bvibber to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1254868 (https://phabricator.wikimedia.org/T420406)
[10:46:31] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1085.eqiad.wmnet
[10:46:35] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1086.eqiad.wmnet
[10:47:13] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11722329 (10ayounsi) @bvibber you can read and sign the L3 at the end of https://phabricator.wikimedia.org/L3 I don't see your email in the signat...
[10:47:51] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users level 3 for bvibber - https://phabricator.wikimedia.org/T420406#11722330 (10ayounsi)
[10:48:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove profile to build Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1243828 (owner: 10Muehlenhoff)
[10:49:04] <logmsgbot>	 fabfur@cumin1003 netbox (PID 4067230) is awaiting input
[10:49:49] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2084.codfw.wmnet
[10:49:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1032.eqiad.wmnet
[10:49:54] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2085.codfw.wmnet
[10:49:59] <wikibugs>	 (03Abandoned) 10Jgiannelos: Remove duplicate definition of site.v1.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1234927 (https://phabricator.wikimedia.org/T415877) (owner: 10Jgiannelos)
[10:50:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet
[10:50:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet
[10:50:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1033.eqiad.wmnet
[10:53:17] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1086.eqiad.wmnet
[10:53:20] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1087.eqiad.wmnet
[10:53:42] <wikibugs>	 (03PS1) 10Vgutierrez: Remove deprecated 198.35.26.240/28 include [dns] - 10https://gerrit.wikimedia.org/r/1254869
[10:54:36] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] rest-gateway: per-route jwt overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler)
[10:54:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Remove deprecated 198.35.26.240/28 include [dns] - 10https://gerrit.wikimedia.org/r/1254869 (owner: 10Vgutierrez)
[10:55:10] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Remove deprecated 198.35.26.240/28 include [dns] - 10https://gerrit.wikimedia.org/r/1254869 (owner: 10Vgutierrez)
[10:56:23] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm
[10:56:40] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2085.codfw.wmnet
[10:56:44] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2086.codfw.wmnet
[10:56:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1033.eqiad.wmnet
[10:57:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1033.eqiad.wmnet
[10:57:10] <logmsgbot>	 !log fabfur@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[10:57:21] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1087.eqiad.wmnet
[10:57:25] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1088.eqiad.wmnet
[10:58:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet
[10:58:32] <wikibugs>	 (03PS2) 10Vgutierrez: Refresh 198.35.26.0 includes [dns] - 10https://gerrit.wikimedia.org/r/1254869
[10:59:16] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-jumbo-eqiad
[10:59:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Refresh 198.35.26.0 includes [dns] - 10https://gerrit.wikimedia.org/r/1254869 (owner: 10Vgutierrez)
[10:59:48] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Refresh 198.35.26.0 includes [dns] - 10https://gerrit.wikimedia.org/r/1254869 (owner: 10Vgutierrez)
[11:00:04] <wikibugs>	 (03PS9) 10Blake: sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase [cookbooks] - 10https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032)
[11:00:05] <jouncebot>	 mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1100).
[11:00:06] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.dns.netbox
[11:02:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1034.eqiad.wmnet
[11:03:03] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:03:17] <wikibugs>	 (03CR) 10Vgutierrez: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1254869 (owner: 10Vgutierrez)
[11:03:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1253655 (owner: 10Herron)
[11:03:43] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2086.codfw.wmnet
[11:03:47] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2087.codfw.wmnet
[11:04:17] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1088.eqiad.wmnet
[11:04:18] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1026.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:04:21] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1089.eqiad.wmnet
[11:05:13] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[11:05:17] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop test cluster
[11:05:49] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1026.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:05:50] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11722408 (10MoritzMuehlenhoff)
[11:06:44] <wikibugs>	 (03PS3) 10Muehlenhoff: thumbor-plugins: Stop using pkg_resources [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1243135
[11:06:55] <wikibugs>	 (03PS3) 10Vgutierrez: Refresh ulsfo includes [dns] - 10https://gerrit.wikimedia.org/r/1254869
[11:07:26] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm
[11:08:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1034.eqiad.wmnet
[11:08:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1034.eqiad.wmnet
[11:10:29] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2087.codfw.wmnet
[11:10:33] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2088.codfw.wmnet
[11:10:51] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Refresh ulsfo includes [dns] - 10https://gerrit.wikimedia.org/r/1254869 (owner: 10Vgutierrez)
[11:11:07] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] Refresh ulsfo includes [dns] - 10https://gerrit.wikimedia.org/r/1254869 (owner: 10Vgutierrez)
[11:11:17] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1089.eqiad.wmnet
[11:11:21] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1090.eqiad.wmnet
[11:11:30] <logmsgbot>	 !log vgutierrez@dns1004 START - running authdns-update
[11:12:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1035.eqiad.wmnet
[11:13:03] <logmsgbot>	 !log vgutierrez@dns1004 END - running authdns-update
[11:14:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
[11:15:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1035.eqiad.wmnet
[11:16:50] <wikibugs>	 (03CR) 10Blake: sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) (owner: 10Blake)
[11:17:25] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2088.codfw.wmnet
[11:17:30] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2089.codfw.wmnet
[11:18:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
[11:18:24] <logmsgbot>	 !log vgutierrez@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool ulsfo [reason: no reason specified, no task ID specified]
[11:18:28] <logmsgbot>	 !log vgutierrez@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool ulsfo [reason: no reason specified, no task ID specified]
[11:18:33] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1090.eqiad.wmnet
[11:18:36] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1091.eqiad.wmnet
[11:20:02] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1015
[11:20:16] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host dse-k8s-worker1015
[11:22:38] <wikibugs>	 06SRE, 10MinT, 10Prod-Kubernetes, 06ServiceOps new, and 3 others: Can't deploy machinetranslation due to exceeding resource quotas - https://phabricator.wikimedia.org/T411058#11722430 (10Clement_Goubert) >>! In T411058#11721747, @KartikMistry wrote: > @RLazarus After reducing `replicas`, I was able to depl...
[11:22:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1035.eqiad.wmnet
[11:22:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1035.eqiad.wmnet
[11:23:40] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1091.eqiad.wmnet
[11:23:41] <wikibugs>	 (03PS15) 10Daniel Kinzler: rest-gateway rate limiting: add CORS headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969)
[11:23:44] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1092.eqiad.wmnet
[11:23:53] <wikibugs>	 (03PS11) 10Daniel Kinzler: rest-gateway: per-route jwt overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130)
[11:24:11] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2089.codfw.wmnet
[11:24:15] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2090.codfw.wmnet
[11:25:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1036.eqiad.wmnet
[11:26:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet
[11:27:16] <Amir1>	 jouncebot: nowandnext
[11:27:16] <jouncebot>	 For the next 0 hour(s) and 32 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1100)
[11:27:16] <jouncebot>	 In 1 hour(s) and 32 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1300)
[11:27:38] <wikibugs>	 (03PS1) 10Mszwarc: Tweak configuration of external link aggregate usage analysis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254876 (https://phabricator.wikimedia.org/T419837)
[11:28:34] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[11:28:58] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[11:29:07] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[11:29:28] <wikibugs>	 (03PS1) 10Filippo Giunchedi: rabbitmq: set pause_minority for cluster_partition_handling [puppet] - 10https://gerrit.wikimedia.org/r/1254877 (https://phabricator.wikimedia.org/T418444)
[11:29:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1036.eqiad.wmnet
[11:29:50] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[11:30:02] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[11:30:10] <logmsgbot>	 !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm
[11:30:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet
[11:30:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[11:30:46] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[11:30:53] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1347.eqiad.wmnet
[11:30:56] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2090.codfw.wmnet
[11:30:58] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1092.eqiad.wmnet
[11:31:00] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2091.codfw.wmnet
[11:31:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Deployment plan:" [puppet] - 10https://gerrit.wikimedia.org/r/1254877 (https://phabricator.wikimedia.org/T418444) (owner: 10Filippo Giunchedi)
[11:31:31] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm
[11:34:53] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1020.eqiad.wmnet with OS bookworm
[11:35:05] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 13Patch-For-Review: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11722471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host dse-k8s-...
[11:35:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11722473 (Jclark-ctr) a:Jclark-ctr
[11:35:37] <wikibugs>	 (03PS1) 10Btullis: Switch some of the dse-k8s-worker hosts to/from insetup [puppet] - 10https://gerrit.wikimedia.org/r/1254880 (https://phabricator.wikimedia.org/T418582)
[11:35:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95133216 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[11:36:11] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1347.eqiad.wmnet
[11:37:25] <wikibugs>	 ops-eqiad, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: hw troubleshooting:  Comm Error: Backplane 0 for wikikube-worker1307.eqiad.wmnet - https://phabricator.wikimedia.org/T420389#11722493 (Clement_Goubert) Open→In progress a:VRiley-WMF→Clement_Goubert Yep looking good....
[11:37:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet
[11:37:36] <wikibugs>	 (03PS1) 10Ladsgroup: DjvuHandler: Make it follow thumb steps [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254881 (https://phabricator.wikimedia.org/T402792)
[11:37:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1036.eqiad.wmnet
[11:37:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1036.eqiad.wmnet
[11:37:46] <wikibugs>	 (03PS1) 10Ladsgroup: Make it follow thumb steps [extensions/PagedTiffHandler] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254882 (https://phabricator.wikimedia.org/T402792)
[11:37:53] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2091.codfw.wmnet
[11:37:53] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] DjvuHandler: Make it follow thumb steps [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254881 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup)
[11:37:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1037.eqiad.wmnet
[11:37:57] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Make it follow thumb steps [extensions/PagedTiffHandler] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254882 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup)
[11:38:08] <wikibugs>	 (03PS1) 10Ladsgroup: Make it follow thumb steps [extensions/PagedTiffHandler] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254883 (https://phabricator.wikimedia.org/T402792)
[11:38:16] <wikibugs>	 (03PS1) 10Ladsgroup: DjvuHandler: Make it follow thumb steps [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254884 (https://phabricator.wikimedia.org/T402792)
[11:38:30] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] DjvuHandler: Make it follow thumb steps [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254884 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup)
[11:38:35] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Make it follow thumb steps [extensions/PagedTiffHandler] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254883 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup)
[11:39:01] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Switch some of the dse-k8s-worker hosts to/from insetup [puppet] - 10https://gerrit.wikimedia.org/r/1254880 (https://phabricator.wikimedia.org/T418582) (owner: 10Btullis)
[11:39:35] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox
[11:39:58] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[11:40:22] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway rate limit: add BYPASS and DENY policy and class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250598 (owner: 10Daniel Kinzler)
[11:40:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1037.eqiad.wmnet
[11:40:35] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway rate limiting: add CORS headers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler)
[11:41:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet
[11:42:16] <logmsgbot>	 !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm
[11:42:30] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:42:39] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway rate limit: add BYPASS and DENY policy and class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250598 (owner: 10Daniel Kinzler)
[11:42:41] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway rate limiting: add CORS headers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248461 (https://phabricator.wikimedia.org/T418969) (owner: 10Daniel Kinzler)
[11:44:32] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox
[11:44:43] <wikibugs>	 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11722551 (10MatthewVernon) >>! In T414805#11721374, @Ladsgroup wrote: >>>! In T414805#11682308, @MatthewVernon wrote: >> @Ladsgroup t...
[11:45:07] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm
[11:46:05] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1027.eqiad.wmnet with OS bookworm
[11:46:05] <wikibugs>	 (03Abandoned) 10Sergio Gimeno: [Growth] Remove get-started notification variant delays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176254 (owner: 10Sergio Gimeno)
[11:46:18] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1015.eqiad.wmnet with reason: host reimage
[11:47:13] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:47:18] <wikibugs>	 (03PS1) 10Elukey: dse-k8s-services: update the base Airflow image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254887 (https://phabricator.wikimedia.org/T402512)
[11:47:28] <claime>	 !log sudo homer lsw1-e5-eqiad* commit 'wikikube-worker1307 to active'
[11:47:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:57] <wikibugs>	 (03CR) 10Elukey: "Tested in my airflow dev environment, all good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254887 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey)
[11:48:04] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1307.eqiad.wmnet
[11:48:05] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1307.eqiad.wmnet
[11:48:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1037.eqiad.wmnet
[11:48:22] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[11:48:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1037.eqiad.wmnet
[11:48:37] <wikibugs>	 ops-eqiad, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: hw troubleshooting:  Comm Error: Backplane 0 for wikikube-worker1307.eqiad.wmnet - https://phabricator.wikimedia.org/T420389#11722560 (Clement_Goubert) In progress→Resolved Host back Active and repooled, resolving.
[11:48:41] <wikibugs>	 (03PS1) 10Harroyo-wmf: Reapply "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254889 (https://phabricator.wikimedia.org/T419125)
[11:48:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1038.eqiad.wmnet
[11:49:04] <wikibugs>	 06SRE, 10MinT, 10Prod-Kubernetes, 06ServiceOps new, and 3 others: Can't deploy machinetranslation due to exceeding resource quotas - https://phabricator.wikimedia.org/T411058#11722566 (10KartikMistry) Pods look fine so far:   ` kartik@deploy2002:/srv/deployment-charts/helmfile.d/services/machinetranslation...
[11:49:32] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1015.eqiad.wmnet with reason: host reimage
[11:49:40] <wikibugs>	 (03PS1) 10SomeRandomDeveloper: Revert "SpecialPreferences: Use Language Select Widget in language field" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254890 (https://phabricator.wikimedia.org/T419895)
[11:49:40] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Updating for dse-k8s-worker1012 - btullis@cumin1003"
[11:50:18] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Updating for dse-k8s-worker1012 - btullis@cumin1003"
[11:50:38] <wikibugs>	 (03PS1) 10SomeRandomDeveloper: Revert "SpecialPreferences: Use Language Select Widget in language field" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254891 (https://phabricator.wikimedia.org/T419895)
[11:50:55] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[11:51:09] <wikibugs>	 (03CR) 10CI reject: [V:04-1] DjvuHandler: Make it follow thumb steps [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254881 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup)
[11:51:13] <wikibugs>	 (03PS1) 10Sergio Gimeno: loggedOutWarning: dont set the schema for experiment events [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254894 (https://phabricator.wikimedia.org/T420451)
[11:51:30] <wikibugs>	 (03PS1) 10Sergio Gimeno: loggedOutWarning: dont set the schema for experiment events [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254895 (https://phabricator.wikimedia.org/T420451)
[11:51:34] <moritzm>	 kubestagemaster1003 will go down for a Ganeti reboot
[11:51:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1038.eqiad.wmnet
[11:51:54] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] "..." [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254881 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup)
[11:52:05] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254895 (https://phabricator.wikimedia.org/T420451) (owner: 10Sergio Gimeno)
[11:52:05] <wikibugs>	 (03Merged) 10jenkins-bot: DjvuHandler: Make it follow thumb steps [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254881 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup)
[11:52:12] <wikibugs>	 (03Merged) 10jenkins-bot: Make it follow thumb steps [extensions/PagedTiffHandler] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254882 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup)
[11:52:19] <wikibugs>	 (03Merged) 10jenkins-bot: DjvuHandler: Make it follow thumb steps [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254884 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup)
[11:52:27] <wikibugs>	 (03Merged) 10jenkins-bot: Make it follow thumb steps [extensions/PagedTiffHandler] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254883 (https://phabricator.wikimedia.org/T402792) (owner: 10Ladsgroup)
[11:52:27] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254894 (https://phabricator.wikimedia.org/T420451) (owner: 10Sergio Gimeno)
[11:53:58] <icinga-wm>	 PROBLEM - Host kubestagemaster1003 is DOWN: PING CRITICAL - Packet loss = 100%
[11:54:18] <logmsgbot>	 !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1027.eqiad.wmnet with OS bookworm
[11:54:51] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1027.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:54:58] <jinxer-wm>	 RESOLVED: [2x] CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[11:55:18] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458 (10MPostoronca-WMF) 03NEW
[11:55:39] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1254883|Make it follow thumb steps (T402792 T414805)]], [[gerrit:1254884|DjvuHandler: Make it follow thumb steps (T402792 T414805 T416620 T418178)]], [[gerrit:1254882|Make it follow thumb steps (T402792 T414805)]], [[gerrit:1254881|DjvuHandler: Make it follow thumb steps (T402792 T414805 T416620 T418178)]]
[11:55:49] <stashbot>	 T402792: Consider rate limiting non-standard thumbnail sizes - https://phabricator.wikimedia.org/T402792
[11:55:49] <stashbot>	 T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805
[11:55:49] <stashbot>	 T416620: Make ProofreadPage follow thumb steps - https://phabricator.wikimedia.org/T416620
[11:55:49] <stashbot>	 T418178: imageinfo API requests for DJVU files don't follow thumbnail steps, allows upscaling - https://phabricator.wikimedia.org/T418178
[11:55:56] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1027.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[11:56:36] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-main-codfw
[11:56:43] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[1328-1372].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[11:57:09] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to <ENTER RESOURCE NAME> for <ENTER YOUR USERNAME> - https://phabricator.wikimedia.org/T420459 (10katiamusiolekwmde) 03NEW
[11:57:28] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1027.eqiad.wmnet with OS bookworm
[11:57:42] <wikibugs>	 (03CR) 10Urbanecm: [C:03+1] "LGTM. DBAs acknowledged this and are okay with the experiment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254216 (https://phabricator.wikimedia.org/T341599) (owner: 10Sergio Gimeno)
[11:57:48] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1254883|Make it follow thumb steps (T402792 T414805)]], [[gerrit:1254884|DjvuHandler: Make it follow thumb steps (T402792 T414805 T416620 T418178)]], [[gerrit:1254882|Make it follow thumb steps (T402792 T414805)]], [[gerrit:1254881|DjvuHandler: Make it follow thumb steps (T402792 T414805 T416620 T418178)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[11:57:58] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1012.eqiad.wmnet
[11:58:20] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service ganeti1037:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:58:31] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[11:59:42] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1026.eqiad.wmnet with reason: host reimage
[12:00:01] <wikibugs>	 (03PS1) 10Ayounsi: Add suecarmol shell + add to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1254896 (https://phabricator.wikimedia.org/T419932)
[12:00:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1038.eqiad.wmnet
[12:00:25] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for SCardenas (WMF) - https://phabricator.wikimedia.org/T419932#11722662 (10ayounsi)
[12:00:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1038.eqiad.wmnet
[12:00:32] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:00:34] <icinga-wm>	 RECOVERY - Host kubestagemaster1003 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms
[12:00:35] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service ganeti1037:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:01:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1039.eqiad.wmnet
[12:01:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1039.eqiad.wmnet
[12:01:42] <wikibugs>	 (03PS2) 10Ayounsi: Add bvibber to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1254868 (https://phabricator.wikimedia.org/T420406)
[12:02:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-eqiad
[12:02:11] <wikibugs>	 (03CR) 10MVernon: [C:03+1] Add suecarmol shell + add to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1254896 (https://phabricator.wikimedia.org/T419932) (owner: 10Ayounsi)
[12:02:25] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1026.eqiad.wmnet with reason: host reimage
[12:02:27] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254883|Make it follow thumb steps (T402792 T414805)]], [[gerrit:1254884|DjvuHandler: Make it follow thumb steps (T402792 T414805 T416620 T418178)]], [[gerrit:1254882|Make it follow thumb steps (T402792 T414805)]], [[gerrit:1254881|DjvuHandler: Make it follow thumb steps (T402792 T414805 T416620 T418178)]] (duration: 06m 48s)
[12:02:35] <stashbot>	 T402792: Consider rate limiting non-standard thumbnail sizes - https://phabricator.wikimedia.org/T402792
[12:02:35] <stashbot>	 T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805
[12:02:36] <stashbot>	 T416620: Make ProofreadPage follow thumb steps - https://phabricator.wikimedia.org/T416620
[12:02:36] <stashbot>	 T418178: imageinfo API requests for DJVU files don't follow thumbnail steps, allows upscaling - https://phabricator.wikimedia.org/T418178
[12:03:00] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Add suecarmol shell + add to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1254896 (https://phabricator.wikimedia.org/T419932) (owner: 10Ayounsi)
[12:03:13] <logmsgbot>	 !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[12:03:20] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service ganeti1037:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:04:16] <logmsgbot>	 !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[12:05:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszwarc@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254851 (https://phabricator.wikimedia.org/T418580) (owner: 10Mszwarc)
[12:05:32] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: kubestagemaster1003.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1003.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:05:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:05:48] <logmsgbot>	 !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-eqiad
[12:05:58] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003"
[12:06:38] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for SCardenas (WMF) - https://phabricator.wikimedia.org/T419932#11722683 (10ayounsi) 05Open→03Resolved Change is merged, you should be good to go in the next ~30min. Please re-open if any issues.
[12:06:39] <wikibugs>	 (03Merged) 10jenkins-bot: Enable autodemotion for 2FA-less CN admins and WMF T&S [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254851 (https://phabricator.wikimedia.org/T418580) (owner: 10Mszwarc)
[12:06:59] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] ulsfo: remove old LVS service IPs and range [homer/public] - 10https://gerrit.wikimedia.org/r/1254864 (https://phabricator.wikimedia.org/T418971) (owner: 10Ayounsi)
[12:07:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1039.eqiad.wmnet
[12:07:08] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good to me. Thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254887 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey)
[12:07:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1039.eqiad.wmnet
[12:07:09] <logmsgbot>	 !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1254851|Enable autodemotion for 2FA-less CN admins and WMF T&S (T418580)]]
[12:07:18] <stashbot>	 T418580: Deploy 2FA requirement using $wgRestrictedGroups to Wikimedia production, instead of OATHAuth's custom config - https://phabricator.wikimedia.org/T418580
[12:08:25] <wikibugs>	 (03Merged) 10jenkins-bot: ulsfo: remove old LVS service IPs and range [homer/public] - 10https://gerrit.wikimedia.org/r/1254864 (https://phabricator.wikimedia.org/T418971) (owner: 10Ayounsi)
[12:09:03] <logmsgbot>	 btullis@cumin1003 reimage (PID 4075940) is awaiting input
[12:09:15] <logmsgbot>	 !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1254851|Enable autodemotion for 2FA-less CN admins and WMF T&S (T418580)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[12:09:36] <logmsgbot>	 !log mszwarc@deploy2002 mszwarc: Continuing with sync
[12:10:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-eqiad
[12:10:24] <logmsgbot>	 !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[12:10:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1040.eqiad.wmnet
[12:10:58] <logmsgbot>	 !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[12:12:52] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] miscweb: add wmf-navigator values - empty httpd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1253489 (https://phabricator.wikimedia.org/T414405) (owner: 10AOkoth)
[12:13:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host serpens.wikimedia.org
[12:13:30] <logmsgbot>	 !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254851|Enable autodemotion for 2FA-less CN admins and WMF T&S (T418580)]] (duration: 06m 21s)
[12:13:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:14:26] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:14:32] <stashbot>	 T418580: Deploy 2FA requirement using $wgRestrictedGroups to Wikimedia production, instead of OATHAuth's custom config - https://phabricator.wikimedia.org/T418580
[12:14:50] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1021.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyB
[12:15:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1254868 (https://phabricator.wikimedia.org/T420406) (owner: 10Ayounsi)
[12:15:26] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:15:35] <logmsgbot>	 jclark@cumin1003 reimage (PID 4076099) is awaiting input
[12:15:50] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:16:11] <wikibugs>	 (03PS12) 10Daniel Kinzler: rest-gateway: per-route jwt overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130)
[12:16:16] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest-gateway: per-route jwt overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler)
[12:16:23] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3707861) is awaiting input
[12:16:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host serpens.wikimedia.org
[12:17:15] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to <ENTER RESOURCE NAME> for <ENTER YOUR USERNAME> - https://phabricator.wikimedia.org/T420459#11722698 (10WMDE-leszek) I approve this request on WMDE's behalf. Thank you!
[12:17:45] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11722699 (10WMDE-leszek)
[12:18:29] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: per-route jwt overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1248477 (https://phabricator.wikimedia.org/T419130) (owner: 10Daniel Kinzler)
[12:19:16] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003"
[12:19:58] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[12:20:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1040.eqiad.wmnet
[12:21:33] <logmsgbot>	 !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[12:22:21] <logmsgbot>	 btullis@cumin1003 reimage (PID 4076647) is awaiting input
[12:22:36] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003"
[12:22:36] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1015.eqiad.wmnet with OS bookworm
[12:23:26] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:23:50] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1018.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet, wdqs1017.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:24:28] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update for dse-k8s-worker1015 - btullis@cumin1003"
[12:24:48] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update for dse-k8s-worker1015 - btullis@cumin1003"
[12:25:00] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003"
[12:25:00] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1026.eqiad.wmnet with OS bookworm
[12:25:22] <logmsgbot>	 !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1027.eqiad.wmnet with OS bookworm
[12:25:26] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:25:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1040.eqiad.wmnet
[12:25:38] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1020.eqiad.wmnet with reason: host reimage
[12:25:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1040.eqiad.wmnet
[12:25:50] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:27:10] <wikibugs>	 (03PS1) 10Muehlenhoff: installserver::dhcp: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1254904
[12:27:40] <wikibugs>	 (03CR) 10CI reject: [V:04-1] installserver::dhcp: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1254904 (owner: 10Muehlenhoff)
[12:27:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1041.eqiad.wmnet
[12:28:14] <wikibugs>	 (03PS1) 10Btullis: Put dse-k8s-worker10[15,26] into service [puppet] - 10https://gerrit.wikimedia.org/r/1254905 (https://phabricator.wikimedia.org/T418582)
[12:28:36] <wikibugs>	 (03PS2) 10Muehlenhoff: installserver::dhcp: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1254904
[12:29:31] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Put dse-k8s-worker10[15,26] into service [puppet] - 10https://gerrit.wikimedia.org/r/1254905 (https://phabricator.wikimedia.org/T418582) (owner: 10Btullis)
[12:30:01] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1020.eqiad.wmnet with reason: host reimage
[12:31:07] <logmsgbot>	 !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[12:32:48] <logmsgbot>	 !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[12:32:53] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3712455) is awaiting input
[12:33:48] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[12:35:21] <wikibugs>	 (03PS1) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028)
[12:35:44] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{wikikube-worker[1328-1372].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[12:35:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo)
[12:36:55] <logmsgbot>	 !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[12:37:49] <logmsgbot>	 !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[12:37:50] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[12:38:04] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254904 (owner: 10Muehlenhoff)
[12:38:39] <logmsgbot>	 !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[12:39:03] <wikibugs>	 (03CR) 10Elukey: [C:03+2] dse-k8s-services: update the base Airflow image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254887 (https://phabricator.wikimedia.org/T402512) (owner: 10Elukey)
[12:39:03] <jinxer-wm>	 FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[12:39:22] <wikibugs>	 (03PS1) 10Btullis: Update linux-base when installing backported kernel on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1254908 (https://phabricator.wikimedia.org/T418582)
[12:39:29] <wikibugs>	 (03PS1) 10Ayounsi: ulsfo routed ganeti: add public range [puppet] - 10https://gerrit.wikimedia.org/r/1254909 (https://phabricator.wikimedia.org/T418993)
[12:41:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good. Note that we'll also need to update the 6.12 backport soon, but the kernel update isn't signed yet (that is a step which needs" [puppet] - 10https://gerrit.wikimedia.org/r/1254908 (https://phabricator.wikimedia.org/T418582) (owner: 10Btullis)
[12:41:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Update linux-base when installing backported kernel on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1254908 (https://phabricator.wikimedia.org/T418582) (owner: 10Btullis)
[12:42:14] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-jumbo-eqiad
[12:42:29] <wikibugs>	 (03PS2) 10Btullis: Update linux-base when installing backported kernel on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1254908 (https://phabricator.wikimedia.org/T418582)
[12:43:09] <wikibugs>	 (03CR) 10Btullis: "Got it. Thanks. I will be on the lookout for it." [puppet] - 10https://gerrit.wikimedia.org/r/1254908 (https://phabricator.wikimedia.org/T418582) (owner: 10Btullis)
[12:43:27] <wikibugs>	 (03PS2) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028)
[12:43:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1041.eqiad.wmnet
[12:43:58] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good to me. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1243820 (owner: 10Muehlenhoff)
[12:44:03] <jinxer-wm>	 RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[12:44:13] <moritzm>	 kubestagemaster1004, dse-k8s-etcd1002 will go down for a Ganeti reboot
[12:44:42] <logmsgbot>	 ayounsi@cumin1003 netbox (PID 4090814) is awaiting input
[12:45:03] <icinga-wm>	 PROBLEM - Host dse-k8s-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[12:45:29] <icinga-wm>	 PROBLEM - Host kubestagemaster1004 is DOWN: PING CRITICAL - Packet loss = 100%
[12:45:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo)
[12:46:52] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Nice. Thanks for this. I have checked and I'm certain that it's not used anywhere." [puppet] - 10https://gerrit.wikimedia.org/r/1242407 (owner: 10Muehlenhoff)
[12:46:52] <wikibugs>	 (03PS1) 10Ayounsi: public1-virtual-ulsfo: add missing v6 PTR [dns] - 10https://gerrit.wikimedia.org/r/1254910 (https://phabricator.wikimedia.org/T418993)
[12:47:36] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Update linux-base when installing backported kernel on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1254908 (https://phabricator.wikimedia.org/T418582) (owner: 10Btullis)
[12:47:39] <icinga-wm>	 PROBLEM - Host mc1039 is DOWN: PING CRITICAL - Packet loss = 100%
[12:47:41] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-main-codfw
[12:48:13] <wikibugs>	 (03PS3) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028)
[12:48:37] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo)
[12:49:08] <wikibugs>	 (03PS1) 10Jforrester: Restore quotation-marks in ext.wikilambda.app messages [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254911
[12:49:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1041.eqiad.wmnet
[12:49:29] <wikibugs>	 (03PS2) 10Jforrester: Restore quotation-marks in ext.wikilambda.app messages [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254911 (https://phabricator.wikimedia.org/T420456)
[12:49:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1041.eqiad.wmnet
[12:50:02] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service kubestagemaster1004:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:50:19] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1027.eqiad.wmnet with OS bookworm
[12:50:31] <icinga-wm>	 RECOVERY - Host kubestagemaster1004 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms
[12:50:32] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubestagemaster1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1004.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:50:40] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[12:50:53] <icinga-wm>	 RECOVERY - Host dse-k8s-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 2.94 ms
[12:51:02] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1015.eqiad.wmnet
[12:51:38] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] public1-virtual-ulsfo: add missing v6 PTR (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1254910 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi)
[12:52:23] <wikibugs>	 (03PS2) 10Ayounsi: public1-virtual-ulsfo: add missing v6 PTR [dns] - 10https://gerrit.wikimedia.org/r/1254910 (https://phabricator.wikimedia.org/T418993)
[12:52:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] matomo: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1243820 (owner: 10Muehlenhoff)
[12:52:45] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] public1-virtual-ulsfo: add missing v6 PTR [dns] - 10https://gerrit.wikimedia.org/r/1254910 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi)
[12:53:30] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] public1-virtual-ulsfo: add missing v6 PTR [dns] - 10https://gerrit.wikimedia.org/r/1254910 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi)
[12:53:42] <logmsgbot>	 !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[12:53:44] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1012.eqiad.wmnet
[12:53:45] <logmsgbot>	 jclark@cumin1003 reimage (PID 4076099) is awaiting input
[12:54:07] <logmsgbot>	 !log ayounsi@dns1004 START - running authdns-update
[12:54:31] <icinga-wm>	 RECOVERY - Host mc1039 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[12:54:36] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1012.eqiad.wmnet
[12:54:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on kubestage1004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage1004 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:55:01] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:55:02] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service kubestagemaster1004:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:55:05] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:55:09] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254909 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi)
[12:55:32] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: kubestagemaster1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-staging&var-instance=kubestagemaster1004.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:55:38] <logmsgbot>	 !log ayounsi@dns1004 END - running authdns-update
[12:55:42] <wikibugs>	 (03PS1) 10Jcrespo: mediabackup: Deploy new ms-backup hosts on both dcs [puppet] - 10https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464)
[12:56:57] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1015.eqiad.wmnet
[12:57:11] <wikibugs>	 (03PS1) 10Muehlenhoff: Add install4004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1254915 (https://phabricator.wikimedia.org/T418993)
[12:57:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1042.eqiad.wmnet
[12:57:52] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] Tweak configuration of external link aggregate usage analysis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254876 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc)
[12:57:57] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254891 (https://phabricator.wikimedia.org/T419895) (owner: 10SomeRandomDeveloper)
[12:58:03] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254890 (https://phabricator.wikimedia.org/T419895) (owner: 10SomeRandomDeveloper)
[12:58:53] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[12:58:54] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1020.eqiad.wmnet with OS bookworm
[12:59:08] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 13Patch-For-Review: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#11722891 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host dse-k8s-work...
[12:59:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1254909 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi)
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1300).
[13:00:05] <jouncebot>	 Sergi0 and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] ulsfo routed ganeti: add public range [puppet] - 10https://gerrit.wikimedia.org/r/1254909 (https://phabricator.wikimedia.org/T418993) (owner: 10Ayounsi)
[13:00:17] <sergi0>	 o/
[13:00:23] <Lucas_WMDE>	 I can’t deploy, I’m in a meeting, sorry
[13:00:32] <sergi0>	 I can self-deploy
[13:00:32] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1012.eqiad.wmnet
[13:00:34] <MatmaRex>	 hey. i can't deploy myself, i'd appreciate if someone could ship it
[13:00:50] <wikibugs>	 (03PS1) 10Mszwarc: Normalize external domain names in click analysis [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254916 (https://phabricator.wikimedia.org/T419837)
[13:00:53] <sergi0>	 @MatmaRex can do
[13:01:20] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248095 (owner: 10Bartosz Dziewoński)
[13:01:28] <MatmaRex>	 might also add this no-op change while we're here ^
[13:01:37] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1254904 (owner: 10Muehlenhoff)
[13:02:19] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Add install4004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1254915 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff)
[13:02:20] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[13:02:29] <wikibugs>	 (03PS2) 10Mszwarc: Normalize external domain names in click analysis [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254917 (https://phabricator.wikimedia.org/T419837)
[13:02:55] <sergi0>	 ack
[13:03:13] <sergi0>	 I'll do first wmf19/20 then config
[13:03:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:03:55] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254876 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc)
[13:04:11] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254917 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc)
[13:04:13] <moritzm>	 ml-etcd1001 will go down for a Ganeti reboot
[13:04:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1042.eqiad.wmnet
[13:04:21] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254916 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc)
[13:04:36] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1027.eqiad.wmnet with reason: host reimage
[13:04:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage1004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage1004 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:04:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254894 (https://phabricator.wikimedia.org/T420451) (owner: 10Sergio Gimeno)
[13:04:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254895 (https://phabricator.wikimedia.org/T420451) (owner: 10Sergio Gimeno)
[13:04:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254891 (https://phabricator.wikimedia.org/T419895) (owner: 10SomeRandomDeveloper)
[13:04:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254890 (https://phabricator.wikimedia.org/T419895) (owner: 10SomeRandomDeveloper)
[13:05:00] <Msz2001>	 FYI: I scheduled a few patches, I can self-deploy them when you both are done. Just ping me :)
[13:05:33] <sergi0>	 @Msz2001 ack
[13:05:51] <wikibugs>	 (03PS4) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028)
[13:06:16] <icinga-wm>	 PROBLEM - Host ml-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:06:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] installserver::dhcp: Migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1254904 (owner: 10Muehlenhoff)
[13:06:19] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt  - jclark@cumin1003"
[13:06:21] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo)
[13:06:54] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1016
[13:07:20] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt  - jclark@cumin1003"
[13:07:21] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[13:08:25] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1016
[13:08:30] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1027.eqiad.wmnet with reason: host reimage
[13:09:44] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2092.codfw.wmnet
[13:09:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1042.eqiad.wmnet
[13:09:52] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1093.eqiad.wmnet
[13:10:06] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[13:10:17] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:10:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1042.eqiad.wmnet
[13:10:29] <wikibugs>	 (03Merged) 10jenkins-bot: loggedOutWarning: dont set the schema for experiment events [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254894 (https://phabricator.wikimedia.org/T420451) (owner: 10Sergio Gimeno)
[13:10:32] <icinga-wm>	 RECOVERY - Host ml-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[13:10:55] <wikibugs>	 (03Merged) 10jenkins-bot: loggedOutWarning: dont set the schema for experiment events [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254895 (https://phabricator.wikimedia.org/T420451) (owner: 10Sergio Gimeno)
[13:11:45] <wikibugs>	 (03PS5) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028)
[13:11:54] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo)
[13:12:07] <wikibugs>	 (03PS1) 10Muehlenhoff: firewall::dhcp: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1254919
[13:12:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1043.eqiad.wmnet
[13:13:06] <wikibugs>	 (03PS2) 10Jcrespo: mediabackup: Deploy new ms-backup hosts on both dcs [puppet] - 10https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464)
[13:14:23] <wikibugs>	 (03PS1) 10Daniel Kinzler: rest gateway: merge authed-other into authed-bot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254921 (https://phabricator.wikimedia.org/T420467)
[13:15:13] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt  - jclark@cumin1003"
[13:15:42] <wikibugs>	 (03PS6) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028)
[13:15:44] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo)
[13:15:46] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update network and mgmt  - jclark@cumin1003"
[13:15:46] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:16:00] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1093.eqiad.wmnet
[13:16:03] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1094.eqiad.wmnet
[13:16:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] firewall::dhcp: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/1254919 (owner: 10Muehlenhoff)
[13:16:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1043.eqiad.wmnet
[13:16:22] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2092.codfw.wmnet
[13:16:26] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2093.codfw.wmnet
[13:20:12] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "SpecialPreferences: Use Language Select Widget in language field" [core] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254891 (https://phabricator.wikimedia.org/T419895) (owner: 10SomeRandomDeveloper)
[13:20:20] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:21:30] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1026.eqiad.wmnet
[13:21:50] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "SpecialPreferences: Use Language Select Widget in language field" [core] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254890 (https://phabricator.wikimedia.org/T419895) (owner: 10SomeRandomDeveloper)
[13:21:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1043.eqiad.wmnet
[13:21:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1043.eqiad.wmnet
[13:22:11] <wikibugs>	 (03PS7) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028)
[13:22:11] <wikibugs>	 (03PS3) 10Jcrespo: mediabackup: Deploy new ms-backup hosts on both dcs [puppet] - 10https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464)
[13:22:25] <logmsgbot>	 !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1254894|loggedOutWarning: dont set the schema for experiment events (T420451)]], [[gerrit:1254895|loggedOutWarning: dont set the schema for experiment events (T420451)]], [[gerrit:1254891|Revert "SpecialPreferences: Use Language Select Widget in language field" (T419895)]], [[gerrit:1254890|Revert "SpecialPreferences: Use Language Select Widget in language field" (T419895)]]
[13:22:32] <stashbot>	 T420451: '.experiment.coordinator' should be equal to one of the allowed values - https://phabricator.wikimedia.org/T420451
[13:22:32] <stashbot>	 T419895: UnexpectedValueException: Default '"sh-latn"' is invalid for preference variant of user [user] - https://phabricator.wikimedia.org/T419895
[13:23:20] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2093.codfw.wmnet
[13:23:24] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2094.codfw.wmnet
[13:23:42] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:24:32] <logmsgbot>	 !log sgimeno@deploy2002 somerandomdeveloper, sgimeno: Backport for [[gerrit:1254894|loggedOutWarning: dont set the schema for experiment events (T420451)]], [[gerrit:1254895|loggedOutWarning: dont set the schema for experiment events (T420451)]], [[gerrit:1254891|Revert "SpecialPreferences: Use Language Select Widget in language field" (T419895)]], [[gerrit:1254890|Revert "SpecialPreferences: Use Language Select Widget in language field" (T419895)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:24:55] <logmsgbot>	 !log sgimeno@deploy2002 somerandomdeveloper, sgimeno: Continuing with sync
[13:24:55] <SomeRandomDev>	 Seems to be fixed for me, no error anymore at https://sh.wikipedia.org/wiki/Posebno:Postavke when using mwdebug
[13:25:10] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1094.eqiad.wmnet
[13:25:14] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1095.eqiad.wmnet
[13:26:01] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003"
[13:27:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1044.eqiad.wmnet
[13:27:27] <wikibugs>	 (03PS8) 10Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028)
[13:27:59] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1026.eqiad.wmnet
[13:28:03] <wikibugs>	 (03PS4) 10Jcrespo: mediabackup: Deploy new ms-backup hosts on both dcs [puppet] - 10https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464)
[13:28:11] <wikibugs>	 (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: 10Jcrespo)
[13:28:14] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003"
[13:28:14] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1027.eqiad.wmnet with OS bookworm
[13:28:48] <logmsgbot>	 !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254894|loggedOutWarning: dont set the schema for experiment events (T420451)]], [[gerrit:1254895|loggedOutWarning: dont set the schema for experiment events (T420451)]], [[gerrit:1254891|Revert "SpecialPreferences: Use Language Select Widget in language field" (T419895)]], [[gerrit:1254890|Revert "SpecialPreferences: Use Language Select Widget in language field" (T419895)]] (duration: 06m 23s)
[13:28:50] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11722986 (10Aklapper)
[13:28:54] <stashbot>	 T420451: '.experiment.coordinator' should be equal to one of the allowed values - https://phabricator.wikimedia.org/T420451
[13:28:54] <stashbot>	 T419895: UnexpectedValueException: Default '"sh-latn"' is invalid for preference variant of user [user] - https://phabricator.wikimedia.org/T419895
[13:29:43] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11722996 (10ayounsi)
[13:29:55] <sergi0>	 going with config changes now 1248095 and 1254216
[13:30:05] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11723000 (10ayounsi) @KFrancis can you organize the NDA for this request ? Thanks
[13:30:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248095 (owner: 10Bartosz Dziewoński)
[13:30:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254216 (https://phabricator.wikimedia.org/T341599) (owner: 10Sergio Gimeno)
[13:30:31] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2094.codfw.wmnet
[13:30:36] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2095.codfw.wmnet
[13:30:40] <Msz2001>	 To speed things up, I'll +2 my patches, so that CI starts to process them
[13:31:06] <wikibugs>	 (03CR) 10Mszwarc: [C:03+2] "Ahead of deployment" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254916 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc)
[13:31:16] <wikibugs>	 (03CR) 10Mszwarc: [C:03+2] "Ahead of deployment" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254917 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc)
[13:31:17] <wikibugs>	 (03Merged) 10jenkins-bot: filebackend: Remove outdated comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1248095 (owner: 10Bartosz Dziewoński)
[13:31:20] <wikibugs>	 (03Merged) 10jenkins-bot: GrowthExperiments: increase edit and thanks query limit II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1254216 (https://phabricator.wikimedia.org/T341599) (owner: 10Sergio Gimeno)
[13:31:50] <logmsgbot>	 !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1248095|filebackend: Remove outdated comment]], [[gerrit:1254216|GrowthExperiments: increase edit and thanks query limit II (T341599)]]
[13:31:52] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1095.eqiad.wmnet
[13:31:54] <stashbot>	 T341599: Impact Module: improvements for former newcomers - https://phabricator.wikimedia.org/T341599
[13:31:56] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1096.eqiad.wmnet
[13:32:25] <wikibugs>	 (03CR) 10Herron: [C:03+2] systemd::timer::job: add ExecCondition support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1253655 (owner: 10Herron)
[13:33:56] <logmsgbot>	 !log sgimeno@deploy2002 matmarex, sgimeno: Backport for [[gerrit:1248095|filebackend: Remove outdated comment]], [[gerrit:1254216|GrowthExperiments: increase edit and thanks query limit II (T341599)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:34:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1044.eqiad.wmnet
[13:34:29] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11723038 (10ayounsi) @OKryva-WMF do you approve this request ? @thcipriani do you approve this request ? @MPostoronca-WMF could you generate a ed25519 key instead?
[13:34:53] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11723049 (10ayounsi)
[13:36:40] <logmsgbot>	 !log sgimeno@deploy2002 matmarex, sgimeno: Continuing with sync
[13:36:57] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2095.codfw.wmnet
[13:36:58] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1016.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[13:36:59] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11723054 (10ayounsi) @KFrancis could you organize the NDA signature for this request ? Thanks
[13:36:59] <wikibugs>	 (03Merged) 10jenkins-bot: Normalize external domain names in click analysis [extensions/WikimediaEvents] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1254916 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc)
[13:37:01] <wikibugs>	 (03Merged) 10jenkins-bot: Normalize external domain names in click analysis [extensions/WikimediaEvents] (wmf/1.46.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1254917 (https://phabricator.wikimedia.org/T419837) (owner: 10Mszwarc)
[13:37:01] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2096.codfw.wmnet
[13:37:13] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11723057 (10ayounsi)
[13:39:18] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1096.eqiad.wmnet
[13:39:23] <logmsgbot>	 !log mvernon@cumin1003 START - Cookbook sre.hosts.reboot-single for host ms-be1097.eqiad.wmnet
[13:39:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1044.eqiad.wmnet
[13:40:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1044.eqiad.wmnet
[13:40:36] <logmsgbot>	 !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1248095|filebackend: Remove outdated comment]], [[gerrit:1254216|GrowthExperiments: increase edit and thanks query limit II (T341599)]] (duration: 08m 47s)
[13:40:40] <stashbot>	 T341599: Impact Module: improvements for former newcomers - https://phabricator.wikimedia.org/T341599
[13:40:54] <sergi0>	 @Msz2001 all yours
[13:41:01] <Msz2001>	 ack, deploying
[13:41:02] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] hcaptcha: Enable nginx caching for secure-api.js [puppet] - 10https://gerrit.wikimedia.org/r/1249929 (https://phabricator.wikimedia.org/T418865) (owner: 10Kosta Harlan)
[13:41:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add install4004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1254915 (https://phabricator.wikimedia.org/T418993) (owner: 10Muehlenhoff)
[13:41:30] <moritzm>	 sukhe: I'll merge your patch along, ok?
[13:41:38] <sukhe>	 please do
[13:41:40] <sukhe>	 thanks
[13:41:42] <logmsgbot>	 !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1254916|Normalize external domain names in click analysis (T419837)]], [[gerrit:1254917|Normalize external domain names in click analysis (T419837)]]
[13:41:46] <stashbot>	 T419837: Temporary measurement of outbound citation link clicks - https://phabricator.wikimedia.org/T419837
[13:41:52] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1373.eqiad.wmnet with OS bookworm
[13:41:54] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1033.eqiad.wmnet with OS trixie
[13:42:00] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11723068 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1373.eqiad.wmnet with OS bookworm
[13:43:23] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2096.codfw.wmnet
[13:43:44] <logmsgbot>	 !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1254916|Normalize external domain names in click analysis (T419837)]], [[gerrit:1254917|Normalize external domain names in click analysis (T419837)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:44:03] <wikibugs>	 (PS9) Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028)
[13:44:27] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[13:45:13] <logmsgbot>	 !log mszwarc@deploy2002 mszwarc: Continuing with sync
[13:45:31] <wikibugs>	 (CR) Mszwarc: [C:+2] "Ahead of deployment" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254876 (https://phabricator.wikimedia.org/T419837) (owner: Mszwarc)
[13:46:09] <logmsgbot>	 !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1097.eqiad.wmnet
[13:46:24] <wikibugs>	 (Merged) jenkins-bot: Tweak configuration of external link aggregate usage analysis [mediawiki-config] - https://gerrit.wikimedia.org/r/1254876 (https://phabricator.wikimedia.org/T419837) (owner: Mszwarc)
[13:47:39] <wikibugs>	 (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1254925
[13:47:39] <wikibugs>	 (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1254926
[13:47:40] <wikibugs>	 (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1254927
[13:49:05] <logmsgbot>	 !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254916|Normalize external domain names in click analysis (T419837)]], [[gerrit:1254917|Normalize external domain names in click analysis (T419837)]] (duration: 07m 23s)
[13:49:10] <stashbot>	 T419837: Temporary measurement of outbound citation link clicks - https://phabricator.wikimedia.org/T419837
[13:49:44] <logmsgbot>	 !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1254876|Tweak configuration of external link aggregate usage analysis (T419837)]]
[13:50:20] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox
[13:50:21] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns1004.wikimedia.org
[13:50:21] <wikibugs>	 (PS10) Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028)
[13:50:27] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[13:50:33] <wikibugs>	 (PS1) Jforrester: wikifunctions: Upgrade evaluators from 2026-03-10-214300 to 2026-03-16-124858 [deployment-charts] - https://gerrit.wikimedia.org/r/1254928 (https://phabricator.wikimedia.org/T399344)
[13:50:41] <wikibugs>	 (PS1) Jforrester: wikifunctions: Upgrade orchestrator from 2026-03-12-210521 to 2026-03-18-023444 [deployment-charts] - https://gerrit.wikimedia.org/r/1254929 (https://phabricator.wikimedia.org/T419092)
[13:51:17] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:51:52] <logmsgbot>	 !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1254876|Tweak configuration of external link aggregate usage analysis (T419837)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:52:27] <logmsgbot>	 !log mszwarc@deploy2002 mszwarc: Continuing with sync
[13:52:31] <wikibugs>	 (PS5) Jcrespo: mediabackup: Deploy new ms-backup hosts on both dcs [puppet] - https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464)
[13:53:17] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1373.eqiad.wmnet with reason: host reimage
[13:54:24] <icinga-wm>	 PROBLEM - Host 2620:0:861:1:208:80:154:6 is DOWN: CRITICAL - Host Unreachable (2620:0:861:1:208:80:154:6)
[13:54:44] <icinga-wm>	 RECOVERY - Host 2620:0:861:1:208:80:154:6 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[13:55:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:55:22] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1033.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:55:24] <sukhe>	 this should have been downtimed
[13:55:27] <sukhe>	 the DNS host is depooled
[13:55:45] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1033.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[13:56:25] <logmsgbot>	 !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254876|Tweak configuration of external link aggregate usage analysis (T419837)]] (duration: 06m 41s)
[13:56:29] <stashbot>	 T419837: Temporary measurement of outbound citation link clicks - https://phabricator.wikimedia.org/T419837
[13:56:52] <Msz2001>	 Finished deployments
[13:57:06] <Msz2001>	 !log UTC afternoon backport+config window done
[13:57:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:11] <MatmaRex>	 thanks for deploying sergi0
[13:59:12] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1373.eqiad.wmnet with reason: host reimage
[14:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1400)
[14:00:10] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:00:13] <James_F>	 Perfect timing.
[14:00:36] <wikibugs>	 (CR) Jforrester: [C:+2] wikifunctions: Upgrade evaluators from 2026-03-10-214300 to 2026-03-16-124858 [deployment-charts] - https://gerrit.wikimedia.org/r/1254928 (https://phabricator.wikimedia.org/T399344) (owner: Jforrester)
[14:01:21] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:01:40] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad
[14:02:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1045.eqiad.wmnet
[14:02:17] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update gpt isvc image to one that supports configurable max_num_batched_tokens flag [deployment-charts] - https://gerrit.wikimedia.org/r/1254933 (https://phabricator.wikimedia.org/T418350)
[14:02:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254911 (https://phabricator.wikimedia.org/T420456) (owner: Jforrester)
[14:02:39] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Upgrade evaluators from 2026-03-10-214300 to 2026-03-16-124858 [deployment-charts] - https://gerrit.wikimedia.org/r/1254928 (https://phabricator.wikimedia.org/T399344) (owner: Jforrester)
[14:02:40] <wikibugs>	 (PS2) Kgraessle: Deploy Extension:PersonalDashboard to English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1254865 (https://phabricator.wikimedia.org/T418367)
[14:04:04] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns1004.wikimedia.org
[14:04:32] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:05:17] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:05:30] <XioNoX>	 !log set graceful-shutdown on EdgeUno transit sessions
[14:05:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:49] <logmsgbot>	 !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:07:47] <logmsgbot>	 !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:08:03] <logmsgbot>	 !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:08:48] <logmsgbot>	 !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:10:04] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3731205) is awaiting input
[14:10:28] <wikibugs>	 (CR) Cory Massaro: [C:+2] wikifunctions: Upgrade orchestrator from 2026-03-12-210521 to 2026-03-18-023444 [deployment-charts] - https://gerrit.wikimedia.org/r/1254929 (https://phabricator.wikimedia.org/T419092) (owner: Jforrester)
[14:10:52] <wikibugs>	 (Merged) jenkins-bot: Restore quotation-marks in ext.wikilambda.app messages [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254911 (https://phabricator.wikimedia.org/T420456) (owner: Jforrester)
[14:10:53] <wikibugs>	 (CR) Ozge: [C:+1] ml-services: update gpt isvc image to one that supports configurable max_num_batched_tokens flag [deployment-charts] - https://gerrit.wikimedia.org/r/1254933 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[14:11:23] <logmsgbot>	 !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1254911|Restore quotation-marks in ext.wikilambda.app messages (T420456)]]
[14:11:28] <stashbot>	 T420456: In the default collapsed view, all strings appear as ⧼quotation-marks⧽ - https://phabricator.wikimedia.org/T420456
[14:11:33] <wikibugs>	 (CR) Kevin Bazira: [C:+2] ml-services: update gpt isvc image to one that supports configurable max_num_batched_tokens flag [deployment-charts] - https://gerrit.wikimedia.org/r/1254933 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[14:12:52] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2026-03-12-210521 to 2026-03-18-023444 [deployment-charts] - https://gerrit.wikimedia.org/r/1254929 (https://phabricator.wikimedia.org/T419092) (owner: Jforrester)
[14:13:19] <logmsgbot>	 !log otto@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[14:13:27] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1254911|Restore quotation-marks in ext.wikilambda.app messages (T420456)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:13:36] <logmsgbot>	 !log otto@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[14:13:39] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:14:00] <wikibugs>	 (CR) Vgutierrez: [C:-1] "this alone is not enough:" [puppet] - https://gerrit.wikimedia.org/r/1242499 (https://phabricator.wikimedia.org/T419887) (owner: Cwhite)
[14:14:03] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Continuing with sync
[14:14:03] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:14:05] <wikibugs>	 (Merged) jenkins-bot: ml-services: update gpt isvc image to one that supports configurable max_num_batched_tokens flag [deployment-charts] - https://gerrit.wikimedia.org/r/1254933 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[14:14:33] <logmsgbot>	 !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:15:05] <logmsgbot>	 !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:15:16] <logmsgbot>	 !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:16:10] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254889 (https://phabricator.wikimedia.org/T419125) (owner: Harroyo-wmf)
[14:16:22] <wikibugs>	 (CR) Herron: [C:+2] "proceeding with this after discussion on irc" [puppet] - https://gerrit.wikimedia.org/r/1253576 (https://phabricator.wikimedia.org/T196336) (owner: Herron)
[14:16:32] <wikibugs>	 (PS5) Herron: icinga: add monthly restart [puppet] - https://gerrit.wikimedia.org/r/1253576 (https://phabricator.wikimedia.org/T196336)
[14:16:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host install4004.wikimedia.org
[14:16:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[14:16:37] <logmsgbot>	 !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:16:41] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[14:16:59] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[14:17:00] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1373.eqiad.wmnet with OS bookworm
[14:17:03] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[14:17:08] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11723269 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1373.eqiad.wmnet with OS bookworm completed: - wikikube-worker1373 (...
[14:17:16] <wikibugs>	 (CR) CI reject: [V:-1] icinga: add monthly restart [puppet] - https://gerrit.wikimedia.org/r/1253576 (https://phabricator.wikimedia.org/T196336) (owner: Herron)
[14:17:55] <logmsgbot>	 !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254911|Restore quotation-marks in ext.wikilambda.app messages (T420456)]] (duration: 06m 32s)
[14:17:59] <stashbot>	 T420456: In the default collapsed view, all strings appear as ⧼quotation-marks⧽ - https://phabricator.wikimedia.org/T420456
[14:18:41] <wikibugs>	 (PS6) Herron: icinga: add monthly restart [puppet] - https://gerrit.wikimedia.org/r/1253576 (https://phabricator.wikimedia.org/T196336)
[14:19:04] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org
[14:20:37] <wikibugs>	 (CR) Herron: [C:+2] icinga: add monthly restart [puppet] - https://gerrit.wikimedia.org/r/1253576 (https://phabricator.wikimedia.org/T196336) (owner: Herron)
[14:20:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install4004.wikimedia.org - jmm@cumin2002"
[14:21:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM install4004.wikimedia.org - jmm@cumin2002"
[14:21:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:21:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install4004.wikimedia.org on all recursors
[14:21:25] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.wipe-cache (exit_code=99) install4004.wikimedia.org on all recursors
[14:21:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11723291 (Jclark-ctr)
[14:21:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install wikikube-worker137[3-4] - https://phabricator.wikimedia.org/T416390#11723293 (Jclark-ctr) Open→Resolved
[14:24:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:24:25] <logmsgbot>	 jmm@cumin2002 makevm (PID 3732975) is awaiting input
[14:24:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache install4004.wikimedia.org on all recursors
[14:25:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) install4004.wikimedia.org on all recursors
[14:25:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1045.eqiad.wmnet
[14:25:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install4004.wikimedia.org - jmm@cumin2002"
[14:25:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM install4004.wikimedia.org - jmm@cumin2002"
[14:28:39] <logmsgbot>	 jmm@cumin2002 makevm (PID 3732975) is awaiting input
[14:29:02] <wikibugs>	 (PS1) Giuseppe Lavagetto: Equivalence of functions of inline patterns and patterns [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1254936
[14:29:10] <jinxer-wm>	 RESOLVED: [4x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:29:17] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+2 C:+2] Equivalence of functions of inline patterns and patterns [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1254936 (owner: Giuseppe Lavagetto)
[14:30:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1400)
[14:30:05] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1430)
[14:30:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1045.eqiad.wmnet
[14:30:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1045.eqiad.wmnet
[14:31:29] <logmsgbot>	 !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "inline pattern and pattern equivalence - oblivian@cumin1003"
[14:31:32] <logmsgbot>	 !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: inline pattern and pattern equivalence - oblivian@cumin1003
[14:32:25] <logmsgbot>	 !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: inline pattern and pattern equivalence - oblivian@cumin1003
[14:32:27] <logmsgbot>	 !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "inline pattern and pattern equivalence - oblivian@cumin1003"
[14:32:44] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org
[14:33:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1046.eqiad.wmnet
[14:34:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install4004.wikimedia.org with OS bookworm
[14:36:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts durum4001.ulsfo.wmnet
[14:38:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1046.eqiad.wmnet
[14:40:00] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[14:40:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[14:40:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[14:41:11] <wikibugs>	 (PS1) Ottomata: mw-page-html-content-change-enrich - increase taskmanager replicas to 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1254938 (https://phabricator.wikimedia.org/T360794)
[14:41:47] <wikibugs>	 (PS2) Ottomata: mw-page-html-content-change-enrich - increase taskmanager replicas to 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1254938 (https://phabricator.wikimedia.org/T360794)
[14:43:23] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Trixie 13.4 point update - https://phabricator.wikimedia.org/T420240#11723367 (MoritzMuehlenhoff)
[14:43:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1046.eqiad.wmnet
[14:44:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1046.eqiad.wmnet
[14:44:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1047.eqiad.wmnet
[14:44:25] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:44:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum4001.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[14:44:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum4001.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[14:44:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:44:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum4001.ulsfo.wmnet
[14:45:05] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11723386 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `durum4001.ulsfo.wmnet` - durum4001.ulsfo.wmnet (**PASS...
[14:45:12] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-main-eqiad
[14:45:44] <wikibugs>	 (CR) Ottomata: [C:+2] mw-page-html-content-change-enrich - increase taskmanager replicas to 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1254938 (https://phabricator.wikimedia.org/T360794) (owner: Ottomata)
[14:46:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts durum4002.ulsfo.wmnet
[14:46:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1047.eqiad.wmnet
[14:47:15] <wikibugs>	 (CR) Slyngshede: [C:+2] geo-maps: update Meta geo mapping [dns] - https://gerrit.wikimedia.org/r/1254092 (owner: Slyngshede)
[14:47:33] <wikibugs>	 ops-codfw, SRE, cloud-services-team, DC-Ops: cloudcephmon2007-dev service implementation - https://phabricator.wikimedia.org/T420282#11723394 (Andrew) p:Triage→Medium
[14:47:44] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org
[14:47:49] <wikibugs>	 (Merged) jenkins-bot: mw-page-html-content-change-enrich - increase taskmanager replicas to 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1254938 (https://phabricator.wikimedia.org/T360794) (owner: Ottomata)
[14:48:04] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[14:48:39] <logmsgbot>	 !log otto@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[14:48:58] <logmsgbot>	 !log otto@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[14:49:49] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[14:50:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[14:52:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1047.eqiad.wmnet
[14:52:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1047.eqiad.wmnet
[14:53:54] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3074.esams.wmnet [reason: trixie reimaging]
[14:54:17] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3074.esams.wmnet with OS trixie
[14:54:25] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.77 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:54:25] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3075.esams.wmnet [reason: trixie reimaging]
[14:54:53] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3075.esams.wmnet with OS trixie
[14:56:43] <logmsgbot>	 jmm@cumin2002 decommission (PID 3739626) is awaiting input
[14:56:46] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11723455 (Sarmbruster) >>! In T420410#11721959, @Aklapper wrote: > @Sarmbruster: Please also [link your LDAP account to your Phabricator account](https://phabricator.wikimedia.org/se...
[14:57:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1049.eqiad.wmnet
[14:57:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum4002.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[14:58:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum4002.ulsfo.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[14:58:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:58:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum4002.ulsfo.wmnet
[14:58:37] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11723461 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `durum4002.ulsfo.wmnet` - durum4002.ulsfo.wmnet (**PASS...
[14:58:44] <wikibugs>	 (PS2) Arnaudb: gerrit: bump MaxRequestWorkers [puppet] - https://gerrit.wikimedia.org/r/1254940 (https://phabricator.wikimedia.org/T420189)
[14:59:25] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr1-eqiad and 208.80.154.77 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:59:44] <wikibugs>	 (PS3) Arnaudb: gerrit: bump MaxRequestWorkers [puppet] - https://gerrit.wikimedia.org/r/1254940 (https://phabricator.wikimedia.org/T420189)
[15:01:23] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11723484 (Jgreen) These have all been updated to the frack management password.
[15:01:24] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org
[15:01:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1049.eqiad.wmnet
[15:02:12] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1033.eqiad.wmnet with OS trixie
[15:03:02] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11723498 (MoritzMuehlenhoff)
[15:03:41] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:03:56] <logmsgbot>	 !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-eqiad
[15:04:04] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1328.eqiad.wmnet
[15:04:45] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1329.eqiad.wmnet
[15:05:08] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1330.eqiad.wmnet
[15:05:14] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1331.eqiad.wmnet
[15:06:15] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1332.eqiad.wmnet
[15:06:18] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1333.eqiad.wmnet
[15:06:55] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1334.eqiad.wmnet
[15:07:01] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1335.eqiad.wmnet
[15:07:07] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11723523 (Jgreen) Note to self: since these are Supermicro, the default management user is "ADMIN"
[15:07:14] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1336.eqiad.wmnet
[15:07:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1049.eqiad.wmnet
[15:07:25] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1337.eqiad.wmnet
[15:07:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1049.eqiad.wmnet
[15:08:51] <wikibugs>	 (CR) Cwhite: "Hmm, IIUC, this hard connection would make our dns servers a dependency for serving all of wikimediastatus.net.  Keeping at least the www " [puppet] - https://gerrit.wikimedia.org/r/1242499 (https://phabricator.wikimedia.org/T419887) (owner: Cwhite)
[15:09:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1050.eqiad.wmnet
[15:09:07] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1328.eqiad.wmnet
[15:09:45] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1338.eqiad.wmnet
[15:09:56] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1329.eqiad.wmnet
[15:10:06] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1339.eqiad.wmnet
[15:10:13] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1330.eqiad.wmnet
[15:10:18] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1331.eqiad.wmnet
[15:10:22] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1340.eqiad.wmnet
[15:10:27] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1341.eqiad.wmnet
[15:11:09] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Post reimage - btullis@cumin1003"
[15:11:14] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Post reimage - btullis@cumin1003"
[15:11:22] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1333.eqiad.wmnet
[15:11:40] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1332.eqiad.wmnet
[15:11:50] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1342.eqiad.wmnet
[15:11:55] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1343.eqiad.wmnet
[15:12:02] <wikibugs>	 ops-codfw, SRE, DC-Ops: bast2003 boot failure - https://phabricator.wikimedia.org/T420320#11723563 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[15:12:05] <wikibugs>	 (PS1) Btullis: Add dse-k8s-worker1027 into service [puppet] - https://gerrit.wikimedia.org/r/1254952 (https://phabricator.wikimedia.org/T414787)
[15:12:05] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1335.eqiad.wmnet
[15:12:13] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1334.eqiad.wmnet
[15:12:21] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1344.eqiad.wmnet
[15:12:28] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1345.eqiad.wmnet
[15:12:29] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1337.eqiad.wmnet
[15:12:31] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1336.eqiad.wmnet
[15:12:40] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1346.eqiad.wmnet
[15:12:46] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1348.eqiad.wmnet
[15:12:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1050.eqiad.wmnet
[15:13:22] <wikibugs>	 (PS1) Ayounsi: network/data.yaml: add ulsfo routed ganeti public [puppet] - https://gerrit.wikimedia.org/r/1254953 (https://phabricator.wikimedia.org/T418993)
[15:13:51] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254953 (https://phabricator.wikimedia.org/T418993) (owner: Ayounsi)
[15:14:39] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1016.eqiad.wmnet with OS bookworm
[15:14:49] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1338.eqiad.wmnet
[15:15:08] <moritzm>	 !log imported jenkins 2.541.3 for bullseye/bookworm/trixie
[15:15:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:11] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1339.eqiad.wmnet
[15:15:16] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1340.eqiad.wmnet
[15:15:29] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1254953 (https://phabricator.wikimedia.org/T418993) (owner: Ayounsi)
[15:15:32] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1341.eqiad.wmnet
[15:15:52] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1349.eqiad.wmnet
[15:16:24] <wikibugs>	 (CR) Btullis: [C:+2] Add dse-k8s-worker1027 into service [puppet] - https://gerrit.wikimedia.org/r/1254952 (https://phabricator.wikimedia.org/T414787) (owner: Btullis)
[15:16:24] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org
[15:16:40] <wikibugs>	 (CR) Ayounsi: [C:+2] network/data.yaml: add ulsfo routed ganeti public [puppet] - https://gerrit.wikimedia.org/r/1254953 (https://phabricator.wikimedia.org/T418993) (owner: Ayounsi)
[15:16:55] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1342.eqiad.wmnet
[15:17:01] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1343.eqiad.wmnet
[15:17:32] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1345.eqiad.wmnet
[15:17:33] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1344.eqiad.wmnet
[15:17:45] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1346.eqiad.wmnet
[15:17:51] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1348.eqiad.wmnet
[15:18:07] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3074.esams.wmnet with reason: host reimage
[15:18:18] <wikibugs>	 (CR) Clément Goubert: [C:+1] rest gateway: merge authed-other into authed-bot [deployment-charts] - https://gerrit.wikimedia.org/r/1254921 (https://phabricator.wikimedia.org/T420467) (owner: Daniel Kinzler)
[15:18:57] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3075.esams.wmnet with reason: host reimage
[15:20:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1050.eqiad.wmnet
[15:20:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1050.eqiad.wmnet
[15:20:46] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1349.eqiad.wmnet
[15:20:56] <wikibugs>	 (PS1) Ladsgroup: Remove VP8 from transcoding [mediawiki-config] - https://gerrit.wikimedia.org/r/1254955 (https://phabricator.wikimedia.org/T413031)
[15:21:58] <wikibugs>	 (CR) CI reject: [V:-1] Remove VP8 from transcoding [mediawiki-config] - https://gerrit.wikimedia.org/r/1254955 (https://phabricator.wikimedia.org/T413031) (owner: Ladsgroup)
[15:22:38] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1360.eqiad.wmnet
[15:22:43] <wikibugs>	 (PS1) C. Scott Ananian: Limit legacy postprocessing cache to pages where DT does apply [extensions/DiscussionTools] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254956 (https://phabricator.wikimedia.org/T376183)
[15:22:44] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1361.eqiad.wmnet
[15:22:50] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1362.eqiad.wmnet
[15:22:56] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1363.eqiad.wmnet
[15:23:03] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1364.eqiad.wmnet
[15:23:08] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1365.eqiad.wmnet
[15:23:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1051.eqiad.wmnet
[15:23:14] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1366.eqiad.wmnet
[15:23:20] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1367.eqiad.wmnet
[15:23:27] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1368.eqiad.wmnet
[15:23:32] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1369.eqiad.wmnet
[15:24:03] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host install4004.wikimedia.org with OS bookworm
[15:24:03] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host install4004.wikimedia.org
[15:24:07] <icinga-wm>	 ACKNOWLEDGEMENT - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 4134e5f01ac0575de459f204e1ba3c23cd5bfb2a, dns.git is f38df3b8f8408e4f3e4d008d1744ad43c7d241aa) Sukhbir Singh ACK https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:24:15] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[15:24:25] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr1-codfw and 208.80.153.48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:24:40] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr1-codfw and 208.80.153.48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:25:02] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Disk (sdm) failed in thanos-be2008 - https://phabricator.wikimedia.org/T419817#11723608 (Jhancock.wm) it's a good thing we didn't wait for dell to send us a new drive. their portal says shipped but the drive still hasn't been delivered to codfw.
[15:25:42] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[15:25:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host install4004.wikimedia.org with OS bookworm
[15:25:52] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3074.esams.wmnet with reason: host reimage
[15:26:05] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11723610 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host install4004.wikimedia.org wi...
[15:26:09] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1017
[15:27:32] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1360.eqiad.wmnet
[15:27:39] <wikibugs>	 (PS2) Jforrester: Remove VP8 from transcoding [mediawiki-config] - https://gerrit.wikimedia.org/r/1254955 (https://phabricator.wikimedia.org/T413031) (owner: Ladsgroup)
[15:27:43] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1017
[15:27:48] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1361.eqiad.wmnet
[15:27:55] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1362.eqiad.wmnet
[15:27:56] <wikibugs>	 (PS3) Jforrester: Remove VP8 from transcoding [mediawiki-config] - https://gerrit.wikimedia.org/r/1254955 (https://phabricator.wikimedia.org/T413031) (owner: Ladsgroup)
[15:28:02] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1363.eqiad.wmnet
[15:28:07] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1364.eqiad.wmnet
[15:28:11] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1365.eqiad.wmnet
[15:28:12] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1370.eqiad.wmnet
[15:28:18] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1371.eqiad.wmnet
[15:28:19] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1366.eqiad.wmnet
[15:28:24] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1367.eqiad.wmnet
[15:28:26] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host wikikube-worker1372.eqiad.wmnet
[15:28:31] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1368.eqiad.wmnet
[15:28:36] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1369.eqiad.wmnet
[15:29:03] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3747404) is awaiting input
[15:29:22] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1016.eqiad.wmnet with reason: host reimage
[15:30:05] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org
[15:30:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:31:05] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1027.eqiad.wmnet
[15:32:16] <wikibugs>	 ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11723627 (RobH) Remote hands cleaned the patch cable and reseated the optic along with photos to show the work.   This is now returned to #netops purview for moni...
[15:32:29] <wikibugs>	 ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11723630 (RobH) {F73035080}  {F73035081}  {F73035082}
[15:33:03] <jinxer-wm>	 FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[15:33:16] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1370.eqiad.wmnet
[15:33:23] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1371.eqiad.wmnet
[15:33:36] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wikikube-worker1372.eqiad.wmnet
[15:34:05] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3075.esams.wmnet with reason: host reimage
[15:34:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1051.eqiad.wmnet
[15:34:39] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Disk (sdm) failed in thanos-be2008 - https://phabricator.wikimedia.org/T419817#11723631 (MatthewVernon) I remain grateful that we have spare disks available, so thanks again :)
[15:35:06] <icinga-wm>	 PROBLEM - Host dse-k8s-worker1016 is DOWN: PING CRITICAL - Packet loss = 100%
[15:35:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:35:31] <wikibugs>	 SRE-SLO, Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11723635 (herron)
[15:35:38] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-main-eqiad
[15:36:43] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1016.eqiad.wmnet with reason: host reimage
[15:37:19] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1027.eqiad.wmnet
[15:37:29] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host deploy1003.eqiad.wmnet
[15:37:50] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Update for dse-k8s-worker1016 - btullis@cumin1003"
[15:38:03] <jinxer-wm>	 RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[15:38:13] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Update for dse-k8s-worker1016 - btullis@cumin1003"
[15:39:47] <wikibugs>	 ops-codfw, DC-Ops, decommission-hardware, Traffic: Decommission codfw cp hosts cp2027-cp2040 - https://phabricator.wikimedia.org/T419753#11723690 (BCornwall)
[15:39:52] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbproxy1022.eqiad.wmnet with reason: kernel update
[15:40:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1051.eqiad.wmnet
[15:40:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1051.eqiad.wmnet
[15:41:19] <wikibugs>	 (CR) Dzahn: [C:+1] "lgtm. it seems we could even go as high as 2048 with our amount of RAM" [puppet] - https://gerrit.wikimedia.org/r/1254940 (https://phabricator.wikimedia.org/T420189) (owner: Arnaudb)
[15:41:34] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3078.esams.wmnet with OS trixie
[15:41:43] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/DiscussionTools] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254956 (https://phabricator.wikimedia.org/T376183) (owner: C. Scott Ananian)
[15:41:46] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp3079.esams.wmnet with OS trixie
[15:41:58] <logmsgbot>	 btullis@cumin1003 reimage (PID 4173938) is awaiting input
[15:42:21] <wikibugs>	 (PS1) Btullis: Put dse-k8s-worker101[67] into insetup mode [puppet] - https://gerrit.wikimedia.org/r/1254959 (https://phabricator.wikimedia.org/T414787)
[15:42:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1052.eqiad.wmnet
[15:42:36] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup1014.eqiad.wmnet
[15:43:25] <wikibugs>	 (CR) Btullis: [C:+2] Put dse-k8s-worker101[67] into insetup mode [puppet] - https://gerrit.wikimedia.org/r/1254959 (https://phabricator.wikimedia.org/T414787) (owner: Btullis)
[15:45:05] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns2005.wikimedia.org
[15:45:28] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1017.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:45:53] <wikibugs>	 (CR) David Caro: "LGTM, though double check with Andrew first, he did a lot of tweaking might have experience with this setting" [puppet] - https://gerrit.wikimedia.org/r/1254877 (https://phabricator.wikimedia.org/T418444) (owner: Filippo Giunchedi)
[15:46:44] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_codfw and not P{cp2041.codfw.wmnet} and A:cp
[15:46:47] <moritzm>	 ml-etcd1003 will go down for a Ganeti reboot
[15:46:48] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3754269) is awaiting input
[15:46:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1052.eqiad.wmnet
[15:47:32] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy1003.eqiad.wmnet
[15:48:08] <icinga-wm>	 PROBLEM - Host ml-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[15:48:41] <logmsgbot>	 !log klausman@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:ml-serve-worker-eqiad
[15:48:53] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup1014.eqiad.wmnet
[15:49:24] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_codfw and not P{cp2042.codfw.wmnet} and A:cp
[15:49:40] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr1-codfw and 208.80.153.74 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:49:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:50:00] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job trafficserver-upload in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:50:33] <icinga-wm>	 RECOVERY - Host ml-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms
[15:51:31] <icinga-wm>	 PROBLEM - bacula sd process on backup1012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (bacula), command name bacula-sd https://wikitech.wikimedia.org/wiki/Bacula
[15:51:34] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3074.esams.wmnet with OS trixie
[15:51:43] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup1012.eqiad.wmnet
[15:52:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1052.eqiad.wmnet
[15:52:49] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1010.eqiad.wmnet
[15:52:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1052.eqiad.wmnet
[15:53:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1053.eqiad.wmnet
[15:53:41] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job trafficserver-upload in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:54:00] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for dbproxy1022.eqiad.wmnet
[15:54:01] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup1008.eqiad.wmnet
[15:54:01] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dbproxy1022.eqiad.wmnet
[15:54:25] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr1-codfw and 208.80.153.74 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[15:54:39] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1015.eqiad.wmnet
[15:54:41] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy1023.eqiad.wmnet with reason: kernel update
[15:54:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:54:58] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[15:55:05] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dse-k8s-worker1017.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:55:32] <icinga-wm>	 RECOVERY - bacula sd process on backup1012 is OK: PROCS OK: 1 process with UID = 110 (bacula), command name bacula-sd https://wikitech.wikimedia.org/wiki/Bacula
[15:55:45] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2043.codfw.wmnet
[15:56:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1053.eqiad.wmnet
[15:56:37] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm
[15:57:03] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3074.esams.wmnet [reason: trixie reimaging]
[15:57:09] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup1008.eqiad.wmnet
[15:57:15] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1010.eqiad.wmnet
[15:57:40] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3078.esams.wmnet [reason: trixie reimaging]
[15:58:00] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3078.esams.wmnet [reason: trixie reimaging]
[15:58:00] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2044.codfw.wmnet
[15:58:05] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup1012.eqiad.wmnet
[15:58:59] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3076.esams.wmnet [reason: trixie reimaging]
[15:59:17] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns2005.wikimedia.org
[15:59:45] <wikibugs>	 (Abandoned) Bking: dumps: Update cirrus index dumps path to point to new dumps [puppet] - https://gerrit.wikimedia.org/r/1210636 (owner: DCausse)
[16:00:08] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3076.esams.wmnet with OS trixie
[16:00:09] <wikibugs>	 (PS1) DLynch: Editcheck: fix tagging not happening for non-default checks [extensions/VisualEditor] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254965
[16:00:19] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3075.esams.wmnet with OS trixie
[16:00:23] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/VisualEditor] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254965 (owner: DLynch)
[16:00:30] <wikibugs>	 (PS2) Scott French: mw-(api-ext|jobrunner): Use envoy drain configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1254962 (https://phabricator.wikimedia.org/T364245)
[16:00:45] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1015.eqiad.wmnet
[16:00:55] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup1003.eqiad.wmnet
[16:01:59] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy1028.eqiad.wmnet with reason: kernel update
[16:02:43] <logmsgbot>	 !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm
[16:03:41] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:04:37] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad
[16:04:38] <logmsgbot>	 !log klausman@cumin1003 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-serve-worker-eqiad
[16:04:58] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003"
[16:05:40] <wikibugs>	 (CR) Clément Goubert: [C:+1] mw-(api-ext|jobrunner): Use envoy drain configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1254962 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[16:06:00] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3078.esams.wmnet with reason: host reimage
[16:06:01] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad
[16:06:03] <wikibugs>	 (CR) JMeybohm: [C:+1] mw-(api-ext|jobrunner): Use envoy drain configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1254962 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[16:06:08] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3079.esams.wmnet with reason: host reimage
[16:07:03] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup1003.eqiad.wmnet
[16:07:56] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup1009.eqiad.wmnet
[16:08:03] <logmsgbot>	 btullis@cumin1003 reimage (PID 4173938) is awaiting input
[16:08:22] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update gpt isvc image to one that supports concurrent request handling [deployment-charts] - https://gerrit.wikimedia.org/r/1254967 (https://phabricator.wikimedia.org/T418350)
[16:08:40] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:08:53] <logmsgbot>	 !log klausman@cumin1003 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-serve-worker-eqiad
[16:09:36] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad
[16:09:37] <logmsgbot>	 !log klausman@cumin1003 END (ERROR) - Cookbook sre.k8s.reboot-nodes (exit_code=97) rolling reboot on A:ml-serve-worker-eqiad
[16:09:57] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3078.esams.wmnet with reason: host reimage
[16:11:12] <moritzm>	 !log powercycling ganeti1053 (stuck on reboot)
[16:11:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:25] <wikibugs>	 ops-magru: Inbound errors on interface cr1-magru:xe-0/1/1 (Transport: cr2-eqiad:xe-1/0/1:3 (Telxius, CRT-008508) {#70089}) - https://phabricator.wikimedia.org/T413409#11723858 (RobH) Remote Hands Directions: I can write up the directions for them to pull the patch and clean it, and also reseat the optic in t...
[16:11:56] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm
[16:12:31] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003"
[16:12:31] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1016.eqiad.wmnet with OS bookworm
[16:12:53] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy1029.eqiad.wmnet with reason: kernel update
[16:13:41] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:13:42] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1011.eqiad.wmnet
[16:13:48] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3079.esams.wmnet with reason: host reimage
[16:14:11] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup1009.eqiad.wmnet
[16:14:15] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup1013.eqiad.wmnet
[16:14:17] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org
[16:16:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1053.eqiad.wmnet
[16:16:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1053.eqiad.wmnet
[16:16:46] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host install4004.wikimedia.org with OS bookworm
[16:16:57] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11723901 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host install4004.wikimedia.org with O...
[16:18:46] <logmsgbot>	 !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm
[16:18:48] <wikibugs>	 (PS11) Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028)
[16:18:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1054.eqiad.wmnet
[16:19:25] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr1-codfw and 208.80.153.107 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[16:19:38] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1011.eqiad.wmnet
[16:20:35] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup1013.eqiad.wmnet
[16:22:11] <logmsgbot>	 !log klausman@cumin1003 START - Cookbook sre.hosts.reboot-single for host ml-serve1012.eqiad.wmnet
[16:24:11] <wikibugs>	 (PS1) Muehlenhoff: Add library hints for alsa-lib [puppet] - https://gerrit.wikimedia.org/r/1254971
[16:24:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1054.eqiad.wmnet
[16:24:23] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3076.esams.wmnet with reason: host reimage
[16:24:25] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr1-codfw and 208.80.153.107 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[16:24:39] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy2005.codfw.wmnet with reason: kernel update
[16:27:43] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add library hints for alsa-lib [puppet] - https://gerrit.wikimedia.org/r/1254971 (owner: Muehlenhoff)
[16:27:57] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:28:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1054.eqiad.wmnet
[16:28:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1054.eqiad.wmnet
[16:29:06] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup2003.codfw.wmnet
[16:29:20] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3076.esams.wmnet with reason: host reimage
[16:29:23] <moritzm>	 !log failover Ganeti master in eqiad to ganeti1046
[16:29:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:06] <logmsgbot>	 sukhe@cumin1003 roll-reboot (PID 4103685) is awaiting input
[16:32:19] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti1048 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[16:32:41] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org
[16:32:49] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup2008.codfw.wmnet
[16:33:25] <wikibugs>	 (CR) Ozge: [C:+1] ml-services: update gpt isvc image to one that supports concurrent request handling [deployment-charts] - https://gerrit.wikimedia.org/r/1254967 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[16:33:39] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2045.codfw.wmnet
[16:33:41] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:33:52] <wikibugs>	 (CR) Kevin Bazira: [C:+2] ml-services: update gpt isvc image to one that supports concurrent request handling [deployment-charts] - https://gerrit.wikimedia.org/r/1254967 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[16:34:09] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[16:34:52] <moritzm>	 !log installing alsa-lib security updates
[16:34:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:56] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3078.esams.wmnet with OS trixie
[16:36:11] <wikibugs>	 (Merged) jenkins-bot: ml-services: update gpt isvc image to one that supports concurrent request handling [deployment-charts] - https://gerrit.wikimedia.org/r/1254967 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[16:36:48] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2046.codfw.wmnet
[16:37:32] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup2009.codfw.wmnet
[16:38:55] <moritzm>	 !log installing PHP 8.2 security updates
[16:38:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:03] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup2008.codfw.wmnet
[16:39:56] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3079.esams.wmnet with OS trixie
[16:40:21] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup2012.codfw.wmnet
[16:40:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[16:41:32] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy2007.codfw.wmnet with reason: kernel update
[16:42:59] <icinga-wm>	 PROBLEM - Host ml-serve1012 is DOWN: PING CRITICAL - Packet loss = 100%
[16:43:16] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reboot-cluster
[16:43:16] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1347.eqiad.wmnet with OS trixie
[16:43:16] <logmsgbot>	 !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
[16:43:45] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1347
[16:43:48] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup2009.codfw.wmnet
[16:44:05] <icinga-wm>	 RECOVERY - Host ml-serve1012 is UP: PING OK - Packet loss = 0%, RTA = 5.20 ms
[16:44:09] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.dns.netbox
[16:44:12] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup2013.codfw.wmnet
[16:45:01] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet
[16:46:07] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup2003.codfw.wmnet
[16:46:13] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3075.esams.wmnet [reason: trixie reimaging]
[16:46:53] <wikibugs>	 ops-magru: Inbound errors on interface cr1-magru:xe-0/1/1 (Transport: cr2-eqiad:xe-1/0/1:3 (Telxius, CRT-008508) {#70089}) - https://phabricator.wikimedia.org/T413409#11724144 (RobH) This also looks like it's no longer throwing errors, but I've done nothing:  https://grafana.wikimedia.org/d/5p97dAASz/queue-an...
[16:47:01] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.reboot-single for host backup2014.codfw.wmnet
[16:47:02] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup2012.codfw.wmnet
[16:47:02] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir and A:ncredir
[16:47:24] <logmsgbot>	 !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1012.eqiad.wmnet
[16:47:41] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org
[16:47:55] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1347 - jayme@cumin1003"
[16:47:59] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker1347 - jayme@cumin1003"
[16:47:59] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:48:00] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.dns.wipe-cache wikikube-worker1347.eqiad.wmnet 199.48.64.10.in-addr.arpa 9.9.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:48:03] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1347.eqiad.wmnet 199.48.64.10.in-addr.arpa 9.9.1.0.8.4.0.0.4.6.0.0.0.1.0.0.7.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:48:03] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1347
[16:48:59] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11724159 (KFrancis) Hi all, the NDA has been sent for signatures.  I'll confirm when it's complete. Thanks!
[16:49:16] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=97) rolling reboot on A:ncredir and A:ncredir
[16:49:23] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir2001.*
[16:50:23] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup2013.codfw.wmnet
[16:51:11] <logmsgbot>	 !log klausman@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ml-serve1013.eqiad.wmnet with reason: Reboot for security update
[16:51:41] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir-magru and A:ncredir
[16:51:50] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Disk (sdm) failed in thanos-be2008 - https://phabricator.wikimedia.org/T419817#11724171 (Jhancock.wm) Open→Resolved you're welcome!  I'm gonna close this just so I don't mess up my own SLA waiting for the drive.
[16:51:59] <icinga-wm>	 PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:52:08] <logmsgbot>	 !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbproxy2008.codfw.wmnet with reason: kernel update
[16:52:17] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir-eqsin and A:ncredir
[16:53:06] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup2014.codfw.wmnet
[16:53:17] <wikibugs>	 (PS2) Harroyo-wmf: Reapply "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254889 (https://phabricator.wikimedia.org/T419125)
[16:55:00] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:55:01] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[16:55:29] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3076.esams.wmnet with OS trixie
[16:55:35] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for ncredir2001.codfw.wmnet
[16:55:36] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ncredir2001.codfw.wmnet
[16:55:45] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir2001.*
[16:56:35] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 8 hosts with reason: upgrade
[16:57:54] <wikibugs>	 (PS3) Harroyo-wmf: Reapply "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254889 (https://phabricator.wikimedia.org/T419125)
[16:58:36] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=ncredir2002.*
[16:58:41] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job haproxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:58:59] <icinga-wm>	 RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:59:26] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host ncredir2002.codfw.wmnet
[17:00:05] <jouncebot>	 swfrench-wmf: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki infrastructure (UTC late) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1700).
[17:00:27] <swfrench-wmf>	 o/
[17:00:59] <wikibugs>	 (CR) Scott French: [C:+2] mw-(api-ext|jobrunner): Use envoy drain configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1254962 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[17:01:14] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3077.esams.wmnet [reason: trixie reimaging]
[17:01:42] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3077.esams.wmnet with OS trixie
[17:01:44] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3076.esams.wmnet [reason: trixie reimaging]
[17:02:38] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp3078.esams.wmnet [reason: trixie reimaging]
[17:02:53] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1028 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[17:02:56] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1347
[17:02:56] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1347
[17:03:03] <wikibugs>	 (Merged) jenkins-bot: mw-(api-ext|jobrunner): Use envoy drain configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1254962 (https://phabricator.wikimedia.org/T364245) (owner: Scott French)
[17:03:04] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp3078.esams.wmnet with OS trixie
[17:04:19] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir-magru and A:ncredir
[17:05:13] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir-ulsfo and A:ncredir
[17:05:19] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir-eqsin and A:ncredir
[17:05:38] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir2002.codfw.wmnet
[17:05:54] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir-drmrs and A:ncredir
[17:06:19] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=ncredir2002.*
[17:06:30] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply
[17:07:20] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir-esams and A:ncredir
[17:07:22] <wikibugs>	 ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11724272 (RobH) > Support, >  > The link came back up after your cleaning and re-seating the optic and patch cable, but the errors have resumed after the circuit...
[17:07:40] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply
[17:07:49] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir-eqiad and A:ncredir
[17:08:12] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3078.*
[17:08:14] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org
[17:08:15] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp3079.*
[17:08:20] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[17:08:49] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[17:09:13] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp3078.*
[17:09:13] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm
[17:10:44] <wikibugs>	 SRE, Traffic: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097#11724294 (Raine)
[17:11:20] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[17:12:28] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2047.codfw.wmnet
[17:12:47] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[17:13:02] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1029 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[17:14:40] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5031.eqsin.wmnet with OS trixie
[17:14:54] <logmsgbot>	 !log jayme@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1347.eqiad.wmnet with reason: host reimage
[17:15:34] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5032.eqsin.wmnet with OS trixie
[17:15:37] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2048.codfw.wmnet
[17:15:43] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir-ulsfo and A:ncredir
[17:16:09] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir-eqiad and A:ncredir
[17:18:29] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir-drmrs and A:ncredir
[17:19:26] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1347.eqiad.wmnet with reason: host reimage
[17:20:04] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir-esams and A:ncredir
[17:20:51] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
[17:21:18] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
[17:21:26] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[17:21:54] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[17:23:14] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns3004.wikimedia.org
[17:23:38] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1258: Ready
[17:25:05] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
[17:25:36] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
[17:26:31] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3077.esams.wmnet with reason: host reimage
[17:27:25] <icinga-wm>	 PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:27:49] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3078.esams.wmnet with reason: host reimage
[17:27:57] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:28:52] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
[17:29:38] <claime>	 !log rearmed keyholder on deploy1003
[17:29:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:50] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
[17:30:05] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3077.esams.wmnet with reason: host reimage
[17:30:52] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[17:31:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[17:32:11] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[17:32:14] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5031.eqsin.wmnet with OS trixie
[17:32:30] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5032.eqsin.wmnet with OS trixie
[17:32:32] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5031.eqsin.wmnet with OS trixie
[17:32:49] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5032.eqsin.wmnet with OS trixie
[17:33:53] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3078.esams.wmnet with reason: host reimage
[17:34:25] <icinga-wm>	 RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:35:59] <logmsgbot>	 !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1347.eqiad.wmnet with OS trixie
[17:38:13] <logmsgbot>	 !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on backupmon1001.eqiad.wmnet with reason: upgrade
[17:38:57] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
[17:39:02] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500 (AnnieKim_WMDE) NEW
[17:39:23] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
[17:40:02] <wikibugs>	 (PS17) Bking: WIP: dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041)
[17:40:37] <wikibugs>	 (PS18) Bking: dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041)
[17:40:37] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns3004.wikimedia.org
[17:42:21] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
[17:43:30] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
[17:46:16] <wikibugs>	 (PS6) Jcrespo: mediabackup: Deploy new ms-backup hosts on both dcs [puppet] - https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464)
[17:46:21] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464) (owner: Jcrespo)
[17:49:06] <wikibugs>	 (PS19) Bking: dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041)
[17:49:37] <wikibugs>	 (CR) CI reject: [V:-1] dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: Bking)
[17:51:14] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2049.codfw.wmnet
[17:52:27] <wikibugs>	 (PS20) Bking: dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041)
[17:54:08] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2050.codfw.wmnet
[17:55:37] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org
[17:56:25] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3077.esams.wmnet with OS trixie
[17:59:56] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3078.esams.wmnet with OS trixie
[18:00:05] <jouncebot>	 andre and brennen: OwO what's this, a deployment window?? MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1800). nyaa~
[18:00:13] <andre>	 nah.
[18:00:39] <wikibugs>	 (PS1) BCornwall: Add sre.cdn.roll-restart-reboot-proxoid [cookbooks] - https://gerrit.wikimedia.org/r/1254997
[18:01:15] <brennen>	 haha
[18:01:40] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1017.eqiad.wmnet with OS bookworm
[18:02:27] <wikibugs>	 (PS12) Jcrespo: mediabackup: Prepare mediabackup worker profile for new storage backend [puppet] - https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028)
[18:02:27] <wikibugs>	 (PS1) Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@eqiad [puppet] - https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028)
[18:02:31] <wikibugs>	 (PS1) Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028)
[18:03:40] <wikibugs>	 (CR) CI reject: [V:-1] mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[18:03:41] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:03:43] <wikibugs>	 SRE, Data-Persistence, Data-Persistence-Backup, Infrastructure Security, and 2 others: Unexpected media growth led to low disk resources on several media backup hosts - https://phabricator.wikimedia.org/T410028#11724749 (jcrespo)
[18:04:12] <wikibugs>	 SRE, Data-Persistence, Data-Persistence-Backup, Infrastructure Security, and 2 others: Unexpected media growth led to low disk resources on several media backup hosts - https://phabricator.wikimedia.org/T410028#11724753 (jcrespo)
[18:04:25] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:04:28] <wikibugs>	 SRE, Data-Persistence, Data-Persistence-Backup, Infrastructure Security, and 2 others: Unexpected media growth led to low disk resources on several media backup hosts - https://phabricator.wikimedia.org/T410028#11724754 (jcrespo) p:Triage→High
[18:04:35] <wikibugs>	 (PS7) Jcrespo: mediabackup: Deploy new ms-backup hosts on both dcs [puppet] - https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464)
[18:05:00] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:07:53] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5032.eqsin.wmnet with reason: host reimage
[18:08:56] <wikibugs>	 (PS2) BCornwall: Add sre.cdn.roll-restart-reboot-proxoid [cookbooks] - https://gerrit.wikimedia.org/r/1254997
[18:09:07] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) pool db1258: Ready
[18:09:25] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:10:17] <wikibugs>	 (PS3) BCornwall: Add sre.cdn.roll-restart-reboot-tcp-proxy [cookbooks] - https://gerrit.wikimedia.org/r/1254997
[18:12:28] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org
[18:12:53] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5032.eqsin.wmnet with reason: host reimage
[18:13:02] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3078.esams.wmnet [reason: trixie reimaging]
[18:13:08] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp3077.esams.wmnet [reason: trixie reimaging]
[18:13:15] <wikibugs>	 (PS1) Jcrespo: mediabackups: Open s3 storage port on storage hosts from working hosts [puppet] - https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028)
[18:13:50] <wikibugs>	 (PS2) Jcrespo: mediabackups: Open s3 storage port on storage hosts from working hosts [puppet] - https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028)
[18:14:04] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[18:14:43] <wikibugs>	 (PS2) Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028)
[18:15:24] <wikibugs>	 (PS3) Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028)
[18:15:25] <wikibugs>	 (CR) CI reject: [V:-1] mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[18:15:30] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[18:16:07] <wikibugs>	 (PS2) Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@eqiad [puppet] - https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028)
[18:16:09] <wikibugs>	 (CR) CI reject: [V:-1] mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[18:16:11] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254999 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[18:16:12] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp5017.eqsin.wmnet [reason: trixie reimaging]
[18:16:18] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254906 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[18:16:29] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp5017.eqsin.wmnet [reason: trixie reimaging]
[18:16:38] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp5018.eqsin.wmnet [reason: trixie reimaging]
[18:17:06] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5018.eqsin.wmnet with OS trixie
[18:17:17] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp5017.eqsin.wmnet [reason: trixie reimaging]
[18:17:19] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5031.eqsin.wmnet with reason: host reimage
[18:18:03] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5017.eqsin.wmnet with OS trixie
[18:18:18] <wikibugs>	 (PS4) Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028)
[18:18:26] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[18:18:41] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:18:52] <wikibugs>	 SRE-SLO, Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11724854 (herron)
[18:20:26] <wikibugs>	 (PS3) Jcrespo: mediabackups: Open s3 storage port on storage hosts from working hosts [puppet] - https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028)
[18:20:34] <wikibugs>	 (PS5) Jcrespo: mediabackup: Deploy new 24 shards (6 hosts) for mediabackups@codfw [puppet] - https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028)
[18:20:44] <wikibugs>	 (PS4) Jcrespo: mediabackups: Open s3 storage port on storage hosts from working hosts [puppet] - https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028)
[18:21:16] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254913 (https://phabricator.wikimedia.org/T420464) (owner: Jcrespo)
[18:21:34] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[18:21:44] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255000 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[18:23:41] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:24:54] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5031.eqsin.wmnet with reason: host reimage
[18:27:17] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255003 (https://phabricator.wikimedia.org/T410028) (owner: Jcrespo)
[18:27:28] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns4004.wikimedia.org
[18:29:58] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2051.codfw.wmnet
[18:32:56] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2052.codfw.wmnet
[18:34:25] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:35:00] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:36:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11724940 (Jclark-ctr) @BTullis  Performed firmware update on backplane   seems to of cleare...
[18:38:41] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:39:40] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:42:36] <wikibugs>	 (PS4) BCornwall: Add sre.cdn.roll-restart-reboot-tcp-proxy [cookbooks] - https://gerrit.wikimedia.org/r/1254997
[18:43:25] <wikibugs>	 (CR) BCornwall: [V:+1] "`" [cookbooks] - https://gerrit.wikimedia.org/r/1254997 (owner: BCornwall)
[18:44:25] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5032.eqsin.wmnet with OS trixie
[18:45:24] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5032.*
[18:45:25] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host install4004.wikimedia.org with OS bookworm
[18:45:37] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11724984 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ayounsi@cumin1003 for host install4004.wikimedia.org with OS bookworm
[18:46:18] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns4004.wikimedia.org
[18:46:35] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5030.eqsin.wmnet with OS trixie
[18:47:51] <wikibugs>	 (PS21) Bking: dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041)
[18:47:57] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5018.eqsin.wmnet with reason: host reimage
[18:48:53] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: Bking)
[18:49:49] <wikibugs>	 (CR) CI reject: [V:-1] dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: Bking)
[18:50:11] <icinga-wm>	 PROBLEM - Host cloudrabbit2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
[18:51:41] <icinga-wm>	 RECOVERY - Host cloudrabbit2001-dev is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms
[18:54:05] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5018.eqsin.wmnet with reason: host reimage
[18:55:54] <wikibugs>	 (PS1) Ahmon Dancy: scap.cfg.erb: [eqiad1.wikimedia.cloud] remove php_parsoid from mw_web_clusters [puppet] - https://gerrit.wikimedia.org/r/1255012 (https://phabricator.wikimedia.org/T420509)
[18:56:01] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5031.eqsin.wmnet with OS trixie
[18:56:04] <jinxer-wm>	 FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[18:56:17] <wikibugs>	 (PS22) Bking: dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041)
[18:56:23] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5031.*
[18:57:09] <icinga-wm>	 PROBLEM - Host cloudrabbit2002-dev is DOWN: PING CRITICAL - Packet loss = 100%
[18:57:32] <wikibugs>	 (CR) Jforrester: [C:+1] scap.cfg.erb: [eqiad1.wikimedia.cloud] remove php_parsoid from mw_web_clusters [puppet] - https://gerrit.wikimedia.org/r/1255012 (https://phabricator.wikimedia.org/T420509) (owner: Ahmon Dancy)
[18:59:28] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: Bking)
[18:59:41] <icinga-wm>	 RECOVERY - Host cloudrabbit2002-dev is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms
[19:00:20] <wikibugs>	 (CR) BCornwall: [V:+1] "Not sure if this is desired as there's already https://wikitech.wikimedia.org/wiki/Gerrit/tcp-proxy#Service_restarts_and_depooling" [cookbooks] - https://gerrit.wikimedia.org/r/1254997 (owner: BCornwall)
[19:01:04] <jinxer-wm>	 RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-api-ext - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[19:01:18] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org
[19:02:14] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp5029.eqsin.wmnet
[19:02:37] <wikibugs>	 (PS1) Jdlrobson: Guard for JS null deref on empty Parsoid sections [extensions/MobileFrontend] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255013 (https://phabricator.wikimedia.org/T419721)
[19:06:03] <logmsgbot>	 brett@cumin2002 upgrade-firmware (PID 3816247) is awaiting input
[19:06:11] <icinga-wm>	 PROBLEM - Host cloudrabbit2003-dev is DOWN: PING CRITICAL - Packet loss = 100%
[19:06:34] <wikibugs>	 (CR) Dzahn: [C:+2] scap.cfg.erb: [eqiad1.wikimedia.cloud] remove php_parsoid from mw_web_clusters [puppet] - https://gerrit.wikimedia.org/r/1255012 (https://phabricator.wikimedia.org/T420509) (owner: Ahmon Dancy)
[19:06:55] <wikibugs>	 (CR) Ssingh: [C:+1] "Happy to check the DNS hosts explicitly after this change, since they are more critical than the Wikidough ones." [homer/public] - https://gerrit.wikimedia.org/r/1254185 (https://phabricator.wikimedia.org/T420342) (owner: Ayounsi)
[19:07:41] <icinga-wm>	 RECOVERY - Host cloudrabbit2003-dev is UP: PING OK - Packet loss = 0%, RTA = 30.44 ms
[19:08:38] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on install4004.wikimedia.org with reason: host reimage
[19:08:47] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5029.eqsin.wmnet with OS trixie
[19:08:52] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2053.codfw.wmnet
[19:08:53] <logmsgbot>	 !log brett@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp5029.eqsin.wmnet
[19:09:25] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[19:11:41] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2054.codfw.wmnet
[19:13:43] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on install4004.wikimedia.org with reason: host reimage
[19:13:47] <wikibugs>	 (PS4) BCornwall: trafficserver: Update single_backend site comments [puppet] - https://gerrit.wikimedia.org/r/1254254
[19:13:57] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5030.eqsin.wmnet with reason: host reimage
[19:14:10] <wikibugs>	 (CR) BCornwall: trafficserver: Update single_backend site comments (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1254254 (owner: BCornwall)
[19:14:25] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[19:15:19] <wikibugs>	 SRE-swift-storage, Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11725109 (Ladsgroup) I'm about to make this a lot less gradual. On the ground that we have thumb steps now plus I really don't want to spend all of 2026 (and even 2027) babysitting...
[19:17:39] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5030.eqsin.wmnet with reason: host reimage
[19:17:50] <wikibugs>	 (CR) Xcollazo: [C:+1] "LGTM!" [dumps] - https://gerrit.wikimedia.org/r/1251169 (https://phabricator.wikimedia.org/T401296) (owner: WMDE-leszek)
[19:18:07] <wikibugs>	 (CR) Ssingh: [C:+1] trafficserver: Update single_backend site comments [puppet] - https://gerrit.wikimedia.org/r/1254254 (owner: BCornwall)
[19:18:27] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org
[19:19:11] <wikibugs>	 (PS1) Ottomata: mw-page-edit-type-enrich-next - increase taskmanager replicas while we backfill [deployment-charts] - https://gerrit.wikimedia.org/r/1255017 (https://phabricator.wikimedia.org/T351225)
[19:20:49] <wikibugs>	 (CR) Ottomata: [C:+2] mw-page-edit-type-enrich-next - increase taskmanager replicas while we backfill [deployment-charts] - https://gerrit.wikimedia.org/r/1255017 (https://phabricator.wikimedia.org/T351225) (owner: Ottomata)
[19:22:44] <wikibugs>	 (Merged) jenkins-bot: mw-page-edit-type-enrich-next - increase taskmanager replicas while we backfill [deployment-charts] - https://gerrit.wikimedia.org/r/1255017 (https://phabricator.wikimedia.org/T351225) (owner: Ottomata)
[19:23:43] <logmsgbot>	 !log otto@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply
[19:23:48] <wikibugs>	 (CR) BCornwall: [C:+2] trafficserver: Update single_backend site comments [puppet] - https://gerrit.wikimedia.org/r/1254254 (owner: BCornwall)
[19:23:57] <logmsgbot>	 !log otto@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply
[19:25:11] <wikibugs>	 SRE-swift-storage, Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11725166 (Ladsgroup)
[19:26:06] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5018.eqsin.wmnet with OS trixie
[19:26:11] <swfrench-wmf>	 FYI, I'm going to be testing something briefly in mw-debug (codfw)
[19:27:24] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[19:27:42] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[19:28:04] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp5018.eqsin.wmnet [reason: trixie reimaging]
[19:28:19] <logmsgbot>	 !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp5020.eqsin.wmnet [reason: trixie reimaging]
[19:29:02] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5020.eqsin.wmnet with OS trixie
[19:30:08] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host install4004.wikimedia.org with OS bookworm
[19:30:20] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11725172 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ayounsi@cumin1003 for host install4004.wikimedia.org with OS bookworm complet...
[19:33:27] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns5004.wikimedia.org
[19:34:42] <wikibugs>	 ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11725174 (RobH) > Comment generated in Smart Hands: Good afternoon, >  > We carried out the replacement of the fiber optic patch cable. A 10‑meter patch cable ava...
[19:35:04] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[19:35:17] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[19:35:24] <swfrench-wmf>	 all done
[19:35:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:35:29] <wikibugs>	 SRE-swift-storage, Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11725175 (Ladsgroup)
[19:39:25] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[19:39:30] <logmsgbot>	 !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5017.eqsin.wmnet with OS trixie
[19:39:47] <Reedy>	 jouncebot: nowandnext
[19:39:47] <jouncebot>	 For the next 0 hour(s) and 20 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T1800)
[19:39:47] <jouncebot>	 In 0 hour(s) and 20 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T2000)
[19:40:09] <wikibugs>	 (Abandoned) Ebernhardson: semanticsearch: Increase heap by 1gb [deployment-charts] - https://gerrit.wikimedia.org/r/1249382 (https://phabricator.wikimedia.org/T414623) (owner: Ebernhardson)
[19:41:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[19:42:42] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5029.eqsin.wmnet with reason: host reimage
[19:46:04] <wikibugs>	 ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11725225 (RobH) Errors returned, Arzhel redrained the link, update sent to ticket:     > Support, >  > Thank you for swapping out fiber 70152 with 260301, but it...
[19:48:15] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5030.eqsin.wmnet with OS trixie
[19:49:17] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1252684 (https://phabricator.wikimedia.org/T420142) (owner: Pppery)
[19:49:24] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/MobileFrontend] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255013 (https://phabricator.wikimedia.org/T419721) (owner: Jdlrobson)
[19:49:25] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr2-eqsin and 103.102.166.8 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[19:49:34] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1250095 (https://phabricator.wikimedia.org/T418066) (owner: Pppery)
[19:49:39] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2055.codfw.wmnet
[19:49:46] <logmsgbot>	 !log reedy@deploy2002 Synchronized private/PrivateSettings.php: Set $wgOATHSecretKey T404363 (duration: 05m 51s)
[19:49:46] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1242542 (https://phabricator.wikimedia.org/T414048) (owner: Pppery)
[19:49:52] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27): Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11725240 (VRiley-WMF) No problem. Let us know when this can be closed. Thank you @BTullis
[19:49:52] <stashbot>	 T404363: Set OATHSecretKey value within Wikimedia production and migrate older 2fa data within oathauth_devices - https://phabricator.wikimedia.org/T404363
[19:49:56] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5029.eqsin.wmnet with reason: host reimage
[19:50:10] <Reedy>	 !log running `mwscript extensions/OATHAuth/maintenance/UpdateSecretsToEncryptedFormat.php --wiki=metawiki` T404363
[19:50:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:50:21] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2056.codfw.wmnet
[19:50:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[19:50:44] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns5004.wikimedia.org
[19:51:30] <Reedy>	 !log running `foreachwikiindblist private.dblist extensions/OATHAuth/maintenance/UpdateSecretsToEncryptedFormat.php` T404363
[19:51:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:43] <Reedy>	 !log running `foreachwikiindblist fishbowl.dblist extensions/OATHAuth/maintenance/UpdateSecretsToEncryptedFormat.php` T404363
[19:51:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:52] <wikibugs>	 SRE-swift-storage, Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11725245 (Ladsgroup)
[19:56:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T2000).
[20:00:05] <jouncebot>	 hector-arroyo, cscott, Kemayo, Pppery, and jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:10] <Pppery>	 here
[20:00:16] <Kemayo>	 o/
[20:01:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:04:08] <Kemayo>	 Well, I'm going to go ahead and backport mine. Anyone else want theirs rolled in?
[20:04:28] <Jdlrobson>	 I am not free for next 30m but can help with mine and others in second half.  https://gerrit.wikimedia.org/r/c/1255013/ is likely to be a deploy blocker if I don't backport it.
[20:05:12] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254865 (https://phabricator.wikimedia.org/T418367) (owner: Kgraessle)
[20:05:18] <logmsgbot>	 !log herron@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-logging-codfw
[20:05:29] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5030.*
[20:05:32] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5017.eqsin.wmnet with OS trixie
[20:05:44] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org
[20:05:57] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254965 (owner: DLynch)
[20:05:59] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5028.eqsin.wmnet with OS trixie
[20:05:59] <Kemayo>	 No takers, so I have gone ahead with just my patch.
[20:07:40] <wikibugs>	 (Merged) jenkins-bot: Editcheck: fix tagging not happening for non-default checks [extensions/VisualEditor] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254965 (owner: DLynch)
[20:08:14] <logmsgbot>	 !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1254965|Editcheck: fix tagging not happening for non-default checks]]
[20:08:41] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[20:09:23] <icinga-wm>	 PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:09:34] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1033.eqiad.wmnet with OS trixie
[20:09:40] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between asw1-b12-drmrs and 185.15.58.5 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[20:10:23] <logmsgbot>	 !log kemayo@deploy2002 kemayo: Backport for [[gerrit:1254965|Editcheck: fix tagging not happening for non-default checks]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:10:54] <logmsgbot>	 !log kemayo@deploy2002 kemayo: Continuing with sync
[20:11:07] <cscott>	 o/
[20:11:20] <cscott>	 Kemayo: sorry i was slow.  but going ahead was the right thing!
[20:12:03] <Kemayo>	 🎉
[20:13:12] <wikibugs>	 SRE-SLO, Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11725317 (herron)
[20:14:09] <cscott>	 if hector-arroyo isn't here i'll go next i guess
[20:14:24] <hector-arroyo>	 I'm here
[20:14:29] <hector-arroyo>	 but go ahead
[20:14:42] <logmsgbot>	 !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254965|Editcheck: fix tagging not happening for non-default checks]] (duration: 06m 28s)
[20:14:49] <Kemayo>	 Mine's done, the floor is open.
[20:15:20] <cscott>	 ok, i'm jumping in; maybe hector-arroyo and Pppery can combine their config patches in the next slot
[20:15:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy2002 using scap backport" [extensions/DiscussionTools] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254956 (https://phabricator.wikimedia.org/T376183) (owner: C. Scott Ananian)
[20:15:25] <icinga-wm>	 RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 7 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:18:31] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org
[20:19:25] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between asw1-b12-drmrs and 185.15.58.5 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[20:19:41] <wikibugs>	 (Merged) jenkins-bot: Limit legacy postprocessing cache to pages where DT does apply [extensions/DiscussionTools] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254956 (https://phabricator.wikimedia.org/T376183) (owner: C. Scott Ananian)
[20:20:31] <jinxer-wm>	 FIRING: ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:20:39] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1033.eqiad.wmnet with reason: host reimage
[20:21:18] <wikibugs>	 SRE-SLO, Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11725330 (herron)
[20:21:36] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5029.eqsin.wmnet with OS trixie
[20:22:25] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5029.*
[20:22:58] <Jdlrobson>	 cscott: let me know when you are done. I can do the remaining deploys (if their owners show!)
[20:24:12] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5027.eqsin.wmnet with OS trixie
[20:24:20] <logmsgbot>	 !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1254956|Limit legacy postprocessing cache to pages where DT does apply (T376183)]]
[20:24:24] <stashbot>	 T376183: Use postprocessing cache for Discussion Tools - https://phabricator.wikimedia.org/T376183
[20:25:24] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1033.eqiad.wmnet with reason: host reimage
[20:25:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:26:20] <logmsgbot>	 !log cscott@deploy2002 cscott: Backport for [[gerrit:1254956|Limit legacy postprocessing cache to pages where DT does apply (T376183)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:28:29] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2057.codfw.wmnet
[20:28:29] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_codfw and not P{cp2041.codfw.wmnet} and A:cp
[20:28:45] <logmsgbot>	 !log brett@cumin2002 cookbooks.sre.cdn.roll-reboot finished rebooting cp2058.codfw.wmnet
[20:28:45] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_codfw and not P{cp2042.codfw.wmnet} and A:cp
[20:29:58] <Dreamy_Jazz>	 Gerrit seems down?
[20:30:04] <SomeRandomDev>	 Same for me
[20:30:23] <Reedy>	 It has been giving some 502s... The bot reported it too above
[20:30:32] <hector-arroyo>	 for me it is working, but was super slow just a few seconds ago
[20:30:43] <Dreamy_Jazz>	 (Maybe it was the reboot of the CDN above)?
[20:30:45] <SomeRandomDev>	 it was down for like 10-15 mins for me but now it's working again
[20:31:06] <SomeRandomDev>	 but it kept giving me 502s since this morning
[20:33:31] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns6002.wikimedia.org
[20:33:58] <wikibugs>	 SRE-SLO, Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11725380 (herron)
[20:34:21] <logmsgbot>	 !log cscott@deploy2002 cscott: Continuing with sync
[20:35:15] <logmsgbot>	 !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx-out2001.wikimedia.org with reason: T419960
[20:35:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:36:38] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[20:36:38] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[20:37:24] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:37:42] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5017.eqsin.wmnet with reason: host reimage
[20:38:13] <logmsgbot>	 !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1254956|Limit legacy postprocessing cache to pages where DT does apply (T376183)]] (duration: 13m 54s)
[20:38:17] <stashbot>	 T376183: Use postprocessing cache for Discussion Tools - https://phabricator.wikimedia.org/T376183
[20:38:41] <jinxer-wm>	 FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[20:39:18] <wikibugs>	 SRE-SLO, Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11725406 (herron)
[20:39:25] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between asw1-b13-drmrs and 185.15.58.37 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[20:41:09] <cscott>	 ok over to whoever is next
[20:41:16] <cscott>	 hector-arroyo?
[20:42:08] <hector-arroyo>	 ok
[20:42:21] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[20:42:25] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 7 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:42:40] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[20:42:42] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1033.eqiad.wmnet with OS trixie
[20:42:45] <logmsgbot>	 !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx-out1001.wikimedia.org with reason: T419960
[20:43:35] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns6002.wikimedia.org
[20:43:47] <hector-arroyo>	 newbie question: should I just click on "deploy change" on https://schedule-deployment.toolforge.org/window/1773864000? when I do so, I get an error ("access denied due to lack of permissions")
[20:43:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: Q2:rack/setup/install wdqs1033-1035 - https://phabricator.wikimedia.org/T411731#11725412 (Jclark-ctr) Open→Resolved
[20:44:25] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between asw1-b13-drmrs and 185.15.58.37 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[20:44:29] <Jdlrobson>	 hector-arroyo: i have a deploy blocker which is also MobileFrontend. Would it be okay to do them together?
[20:44:38] <Jdlrobson>	 (I can also do the deploys if that's helpful)
[20:44:39] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5017.eqsin.wmnet with reason: host reimage
[20:45:03] <hector-arroyo>	 sure
[20:45:09] <hector-arroyo>	 mine is just a config change
[20:45:15] <Jdlrobson>	 ok want me to deploy them?
[20:45:21] <hector-arroyo>	 yes, please
[20:45:25] <Jdlrobson>	 ok starting now
[20:45:28] <hector-arroyo>	 thx
[20:45:29] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255013 (https://phabricator.wikimedia.org/T419721) (owner: Jdlrobson)
[20:45:30] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254889 (https://phabricator.wikimedia.org/T419125) (owner: Harroyo-wmf)
[20:46:31] <Neriah>	 Hey
[20:46:32] <Neriah>	 What's the problem with the CI now?
[20:46:51] <wikibugs>	 (Merged) jenkins-bot: Reapply "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" [mediawiki-config] - https://gerrit.wikimedia.org/r/1254889 (https://phabricator.wikimedia.org/T419125) (owner: Harroyo-wmf)
[20:48:41] <jinxer-wm>	 FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[20:48:47] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5028.eqsin.wmnet with OS trixie
[20:49:09] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5028.eqsin.wmnet with OS trixie
[20:49:32] <taavi>	 Neriah: CI issues are #wikimedia-releng territory
[20:50:03] <logmsgbot>	 !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx-in2001.wikimedia.org with reason: T419960
[20:50:27] <logmsgbot>	 !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5020.eqsin.wmnet with OS trixie
[20:51:28] <logmsgbot>	 !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx-in1001.wikimedia.org with reason: T419960
[20:51:54] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5027.eqsin.wmnet with reason: host reimage
[20:52:38] <wikibugs>	 (PS1) Jforrester: Revert "OrchestratorRequest: Switch evaluations to v2 endpoint" [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255034 (https://phabricator.wikimedia.org/T418887)
[20:52:45] <logmsgbot>	 !log herron@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-logging-codfw
[20:56:29] <wikibugs>	 (PS1) CDanis: Add fundraising-data-uploader role user [puppet] - https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948)
[20:56:50] <Jdlrobson>	 hector-arroyo: almost ready for testing on debug
[20:56:54] <wikibugs>	 (Merged) jenkins-bot: Guard for JS null deref on empty Parsoid sections [extensions/MobileFrontend] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255013 (https://phabricator.wikimedia.org/T419721) (owner: Jdlrobson)
[20:57:02] <wikibugs>	 (CR) CI reject: [V:-1] Add fundraising-data-uploader role user [puppet] - https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: CDanis)
[20:57:30] <logmsgbot>	 !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1255013|Guard for JS null deref on empty Parsoid sections (T419721)]], [[gerrit:1254889|Reapply "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" (T419125)]]
[20:57:36] <stashbot>	 T419721: Various client errors relating to MobileFrontend section collapsing - https://phabricator.wikimedia.org/T419721
[20:57:36] <stashbot>	 T419125: hCaptcha: Update mediawiki-config to enforce checks for API edits coming from the MobileFrontend - https://phabricator.wikimedia.org/T419125
[20:58:01] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5027.eqsin.wmnet with reason: host reimage
[20:58:16] <wikibugs>	 SRE-SLO, Patch-For-Review: Sloth: onboard existing SLOs to sloth manifests - https://phabricator.wikimedia.org/T418163#11725460 (herron)
[20:58:35] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
[20:59:14] <logmsgbot>	 !log herron@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-logging-eqiad
[20:59:36] <logmsgbot>	 !log jdlrobson@deploy2002 jdlrobson, harroyo-wmf: Backport for [[gerrit:1255013|Guard for JS null deref on empty Parsoid sections (T419721)]], [[gerrit:1254889|Reapply "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" (T419125)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:59:57] <Jdlrobson>	 hector-arroyo: please test and give me green light to sync!
[21:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T2100)
[21:00:14] <wikibugs>	 (PS2) CDanis: Add fundraising-data-uploader role user [puppet] - https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948)
[21:00:33] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: CDanis)
[21:00:59] <wikibugs>	 (PS1) Ssingh: P:dns::auth: update check for authdns_update_run [puppet] - https://gerrit.wikimedia.org/r/1255038
[21:02:14] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8298/co" [puppet] - https://gerrit.wikimedia.org/r/1255038 (owner: Ssingh)
[21:02:14] <wikibugs>	 (PS1) Jforrester: Activate Abstract Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1255039
[21:02:20] <icinga-wm>	 PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:03:39] <Jdlrobson>	 hector-arroyo: all good? We are overrunning our window now so I'd like to wrap it up.
[21:03:59] <Dreamy_Jazz>	 If they are not around, it should be safe to sync
[21:04:11] <James_F>	 We're around.
[21:04:13] <hector-arroyo>	 I don't see my changes working in https://test.wikipedia.org/wiki/Test
[21:04:16] <Dreamy_Jazz>	 The idea was to test the broken functionality this enables on testwiki to find where it was broken
[21:04:16] <James_F>	 And waiting to create the wiki.
[21:04:25] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between asw1-b3-magru and 195.200.68.4 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[21:04:36] <Jdlrobson>	 should i continue to sync anyway and then you can debug further?
[21:04:46] <hector-arroyo>	 yes
[21:04:56] <logmsgbot>	 !log jdlrobson@deploy2002 jdlrobson, harroyo-wmf: Continuing with sync
[21:05:27] <Jdlrobson>	 good luck hector-arroyo !
[21:05:29] <wikibugs>	 (PS7) Jforrester: Create Abstract Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1247650 (https://phabricator.wikimedia.org/T411725)
[21:07:20] <icinga-wm>	 RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:07:26] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5028.eqsin.wmnet with OS trixie
[21:07:46] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5028.eqsin.wmnet with OS trixie
[21:08:50] <logmsgbot>	 !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255013|Guard for JS null deref on empty Parsoid sections (T419721)]], [[gerrit:1254889|Reapply "hcaptcha: Enforce hCaptcha on API edits coming from the MobileFrontend" (T419125)]] (duration: 11m 20s)
[21:08:55] <stashbot>	 T419721: Various client errors relating to MobileFrontend section collapsing - https://phabricator.wikimedia.org/T419721
[21:08:55] <stashbot>	 T419125: hCaptcha: Update mediawiki-config to enforce checks for API edits coming from the MobileFrontend - https://phabricator.wikimedia.org/T419125
[21:09:01] <James_F>	 Jdlrobson: Do you have more to deploy or can I start?
[21:09:25] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between asw1-b3-magru and 195.200.68.4 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[21:11:58] <James_F>	 I'm going to take that as a yes.
[21:12:11] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255034 (https://phabricator.wikimedia.org/T418887) (owner: Jforrester)
[21:12:32] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5020.eqsin.wmnet with OS trixie
[21:14:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255034 (https://phabricator.wikimedia.org/T418887) (owner: Jforrester)
[21:14:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1247650 (https://phabricator.wikimedia.org/T411725) (owner: Jforrester)
[21:15:15] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
[21:15:27] <wikibugs>	 (Merged) jenkins-bot: Create Abstract Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1247650 (https://phabricator.wikimedia.org/T411725) (owner: Jforrester)
[21:15:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and EdgeUno (200.25.58.212) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[21:16:45] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5017.eqsin.wmnet with OS trixie
[21:17:22] <wikibugs>	 (Merged) jenkins-bot: Revert "OrchestratorRequest: Switch evaluations to v2 endpoint" [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255034 (https://phabricator.wikimedia.org/T418887) (owner: Jforrester)
[21:17:54] <logmsgbot>	 !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1255034|Revert "OrchestratorRequest: Switch evaluations to v2 endpoint" (T418887)]], [[gerrit:1247650|Create Abstract Wikipedia (T411725 T411726)]]
[21:18:01] <stashbot>	 T418887: Collect and decide on whether and how to fix community-experienced changes with the v2 orchestrator - https://phabricator.wikimedia.org/T418887
[21:18:02] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms
[21:18:02] <stashbot>	 T411725: Set up Wikimedia production config to allow abstract.wikipedia.org to be a special wiki - https://phabricator.wikimedia.org/T411725
[21:18:02] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms
[21:18:02] <stashbot>	 T411726: Set up initial wiki settings for Abstract Wikipedia - https://phabricator.wikimedia.org/T411726
[21:19:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11725554 (VRiley-WMF) Had to reset the iDrac, but it should be good to go. @Jgreen
[21:20:08] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1255034|Revert "OrchestratorRequest: Switch evaluations to v2 endpoint" (T418887)]], [[gerrit:1247650|Create Abstract Wikipedia (T411725 T411726)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:20:15] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install frdb1008 - https://phabricator.wikimedia.org/T414374#11725558 (VRiley-WMF) a:VRiley-WMF→Jgreen
[21:20:45] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Continuing with sync
[21:23:52] <wikibugs>	 (CR) BCornwall: [C:+1] P:dns::auth: update check for authdns_update_run [puppet] - https://gerrit.wikimedia.org/r/1255038 (owner: Ssingh)
[21:24:38] <logmsgbot>	 !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1255034|Revert "OrchestratorRequest: Switch evaluations to v2 endpoint" (T418887)]], [[gerrit:1247650|Create Abstract Wikipedia (T411725 T411726)]] (duration: 06m 44s)
[21:24:48] <stashbot>	 T418887: Collect and decide on whether and how to fix community-experienced changes with the v2 orchestrator - https://phabricator.wikimedia.org/T418887
[21:24:48] <stashbot>	 T411725: Set up Wikimedia production config to allow abstract.wikipedia.org to be a special wiki - https://phabricator.wikimedia.org/T411725
[21:24:49] <stashbot>	 T411726: Set up initial wiki settings for Abstract Wikipedia - https://phabricator.wikimedia.org/T411726
[21:25:37] <wikibugs>	 (PS2) Daniel Kinzler: rest-gateway: update readme [deployment-charts] - https://gerrit.wikimedia.org/r/1254848
[21:26:12] <logmsgbot>	 !log jforrester@deploy2002 mwscript-k8s job started: extensions/WikimediaMaintenance/maintenance/addWiki.php --wiki=abstractwiki  # T411723 addWiki.php run
[21:26:16] <stashbot>	 T411723: Set up abstract.wikipedia.org as a new wiki - https://phabricator.wikimedia.org/T411723
[21:27:02] <logmsgbot>	 !log jforrester@deploy2002 mwscript-k8s job started: extensions/WikimediaMaintenance/maintenance/addWiki.php --wiki=abstractwiki  # T411723 addWiki.php run
[21:28:44] <James_F>	 Well that's unfortunate.
[21:29:33] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5027.eqsin.wmnet with OS trixie
[21:30:15] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
[21:30:27] <wikibugs>	 (CR) Hashar: [C:-1] "This is quite arbitrary and it has some issues:" [puppet] - https://gerrit.wikimedia.org/r/1254940 (https://phabricator.wikimedia.org/T420189) (owner: Arnaudb)
[21:30:40] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5028.eqsin.wmnet with OS trixie
[21:31:04] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5028.eqsin.wmnet with OS trixie
[21:33:09] <wikibugs>	 (PS4) JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - https://gerrit.wikimedia.org/r/1142675
[21:34:28] <icinga-wm>	 PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:34:40] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between asw1-b4-magru and 195.200.68.37 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[21:39:33] <wikibugs>	 (CR) Kamila Součková: [C:+1] rest gateway: merge authed-other into authed-bot [deployment-charts] - https://gerrit.wikimedia.org/r/1254921 (https://phabricator.wikimedia.org/T420467) (owner: Daniel Kinzler)
[21:40:02] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5020.eqsin.wmnet with reason: host reimage
[21:40:39] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1142675 (owner: JHathaway)
[21:41:08] <wikibugs>	 ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11725650 (RobH) The optic was swapped, but the errors resumed.  Arzhel got me setup with an EdgeUno portal account so I can view the two circuits and opened case...
[21:41:26] <wikibugs>	 ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11725652 (RobH) a:ayounsi→RobH
[21:41:27] <icinga-wm>	 RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 2 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:41:27] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5027.*
[21:44:04] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5020.eqsin.wmnet with reason: host reimage
[21:44:25] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between asw1-b4-magru and 195.200.68.37 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[21:45:50] <wikibugs>	 (PS4) Kamila Součková: shellbox: Setup shellbox-icu72 [deployment-charts] - https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548)
[21:49:06] <logmsgbot>	 !log sukhe@cumin1003 cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org
[21:49:07] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox
[21:49:20] <wikibugs>	 SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11725676 (TheDJ) I was testing File:High_quality_skull.stl locally via instantcommons.  And i'm not sure why, but it seems my setup...
[21:49:25] <wikibugs>	 (CR) Kamila Součková: shellbox: Setup shellbox-icu72 (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: Kamila Součková)
[21:51:27] <logmsgbot>	 !log herron@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-logging-eqiad
[21:53:27] <wikibugs>	 (PS5) Kamila Součková: shellbox: Setup shellbox-icu72 [deployment-charts] - https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548)
[21:56:01] <wikibugs>	 (CR) CI reject: [V:-1] shellbox: Setup shellbox-icu72 [deployment-charts] - https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: Kamila Součková)
[22:00:04] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260318T2200)
[22:03:21] <wikibugs>	 (CR) Kamila Součková: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1251475 (https://phabricator.wikimedia.org/T419548) (owner: Kamila Součková)
[22:04:54] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5028.eqsin.wmnet with reason: host reimage
[22:08:39] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5028.eqsin.wmnet with reason: host reimage
[22:16:38] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5020.eqsin.wmnet with OS trixie
[22:25:50] <icinga-wm>	 PROBLEM - Host logging-hd2004 is DOWN: PING CRITICAL - Packet loss = 100%
[22:27:10] <icinga-wm>	 RECOVERY - Host logging-hd2004 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms
[22:40:03] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5028.eqsin.wmnet with OS trixie
[22:40:15] <wikibugs>	 SRE-swift-storage, Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11725876 (Ladsgroup)
[22:47:24] <wikibugs>	 (CR) Dzahn: "https://puppet-compiler.wmflabs.org/output/1254331/8300/" [puppet] - https://gerrit.wikimedia.org/r/1254331 (https://phabricator.wikimedia.org/T420246) (owner: Dzahn)
[22:48:36] <wikibugs>	 SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, and 6 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11725889 (Ladsgroup) STL is basically the only file handler left that is not following thumb steps yet (everything else from T41480...
[23:01:57] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5028.*
[23:02:03] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5020.*
[23:04:50] <wikibugs>	 (PS23) Ryan Kemper: dse-k8s: Add automation for setting OpenSearch pod ureadahead values [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: Bking)
[23:06:02] <wikibugs>	 (PS24) Ryan Kemper: dse-k8s: Auto-set OpenSearch pod readahead values [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: Bking)
[23:08:12] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp5017.*
[23:08:20] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: Bking)
[23:15:57] <wikibugs>	 (CR) Jforrester: "This is blocked by T420531." [mediawiki-config] - https://gerrit.wikimedia.org/r/1255039 (owner: Jforrester)
[23:21:04] <wikibugs>	 (PS1) Sportzpikachu: Allow `ws://localhost:*` and `wss://localhost:*` in CSP [puppet] - https://gerrit.wikimedia.org/r/1255066 (https://phabricator.wikimedia.org/T420539)
[23:23:01] <logmsgbot>	 !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:cassandra-dev
[23:23:08] <wikibugs>	 (PS2) Sportzpikachu: Allow `ws://localhost:*` and `wss://localhost:*` in CSP [puppet] - https://gerrit.wikimedia.org/r/1255066 (https://phabricator.wikimedia.org/T420539)
[23:28:43] <wikibugs>	 (PS25) Ryan Kemper: dse-k8s: Auto-set OpenSearch pod readahead values [puppet] - https://gerrit.wikimedia.org/r/1254320 (https://phabricator.wikimedia.org/T419041) (owner: Bking)
[23:35:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:35:42] <wikibugs>	 (CR) Dzahn: "Here is the reason why the compiler output can be so confusing (as in "why does it create timers on BOTH sides"?):" [puppet] - https://gerrit.wikimedia.org/r/1254331 (https://phabricator.wikimedia.org/T420246) (owner: Dzahn)
[23:35:47] <wikibugs>	 (CR) Dzahn: [C:+2] releases: remove "unless" condition around rsync data copy [puppet] - https://gerrit.wikimedia.org/r/1254331 (https://phabricator.wikimedia.org/T420246) (owner: Dzahn)
[23:49:54] <logmsgbot>	 !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:cassandra-dev
[23:52:25] <wikibugs>	 (CR) Scardenasmolinar: [C:+1] Deploy Extension:PersonalDashboard to English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1254865 (https://phabricator.wikimedia.org/T418367) (owner: Kgraessle)
[23:57:45] <logmsgbot>	 !log herron@cumin1003 START - Cookbook sre.hosts.reboot-single for host kafkamon1003.eqiad.wmnet
[23:58:44] <mutante>	 !log releases2003 - kill 782 (stunnel4) - systemctl start stunnel4 - fix T420246 T420388 T420411
[23:58:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:58:51] <stashbot>	 T420246: SystemdUnitFailed - rsync releases2003 - https://phabricator.wikimedia.org/T420246
[23:58:51] <stashbot>	 T420388: SystemdUnitFailed - https://phabricator.wikimedia.org/T420388
[23:58:52] <stashbot>	 T420411: PuppetFailure - https://phabricator.wikimedia.org/T420411