[00:06:53] <wikibugs>	 ops-eqiad, SRE, DC-Ops: firmware troubleshooting: Unable to PXE boot cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11741583 (Papaul) @BCornwall The server is pxe booting but failed at see below  {F73537240}
[00:18:45] <logmsgbot>	 !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1115.eqiad.wmnet with OS trixie
[00:19:27] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie
[00:27:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:32:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:37:27] <icinga-wm>	 PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100%
[00:38:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[00:39:17] <wikibugs>	 ops-eqiad, SRE, DC-Ops: firmware troubleshooting: Unable to PXE boot cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11741630 (BCornwall) Hm. The specific output:   ` Mar 24 00:30:35 in-target: You are about to format nvme1n1, namespace 0x1.       Mar 24 00:30:35 in-target: WARNING: Fo...
[00:39:34] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1259309
[00:39:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1259309 (owner: TrainBranchBot)
[00:42:04] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hardware troubleshooting: Unable to PXE boot cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11741632 (BCornwall)
[00:43:15] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[00:43:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hardware troubleshooting: Unable to PXE boot cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11741634 (BCornwall) Marked cp1115 as "failed" in netbox
[00:44:19] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:50:35] <icinga-wm>	 RECOVERY - dump of db_inventory in eqiad on backupmon1001 is OK: Last dump for db_inventory at eqiad (db1215) taken on 2026-03-24 00:38:56 (3 MiB, -3.6 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:51:49] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1259309 (owner: TrainBranchBot)
[00:52:01] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1104.eqiad.wmnet with OS trixie
[00:55:57] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11741650 (BCornwall)
[00:57:14] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11741654 (BCornwall)
[00:59:04] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11741661 (BCornwall)
[01:05:35] <icinga-wm>	 RECOVERY - dump of db_inventory in codfw on backupmon1001 is OK: Last dump for db_inventory at codfw (db2185) taken on 2026-03-24 00:36:42 (3 MiB, -3.6 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:08:53] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1104.eqiad.wmnet with reason: host reimage
[01:09:20] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1259313
[01:09:20] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1259313 (owner: TrainBranchBot)
[01:14:07] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1104.eqiad.wmnet with reason: host reimage
[01:18:51] <icinga-wm>	 RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 123.19 ms
[01:19:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:23:04] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1259313 (owner: TrainBranchBot)
[01:37:39] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1104.eqiad.wmnet with OS trixie
[01:40:57] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp1104.*
[01:56:09] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[02:00:00] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[02:00:00] <jinxer-wm>	 FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[02:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T0200)
[02:00:52] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:08:57] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 08m 04s)
[02:09:19] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:22] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.21 [core] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1259332 (https://phabricator.wikimedia.org/T420479)
[02:09:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.21 [core] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1259332 (https://phabricator.wikimedia.org/T420479) (owner: TrainBranchBot)
[02:21:12] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.46.0-wmf.21 [core] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1259332 (https://phabricator.wikimedia.org/T420479) (owner: TrainBranchBot)
[02:31:55] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.13 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:34:19] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:19] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:46:59] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[03:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T0300)
[03:01:57] <wikibugs>	 (PS1) TrainBranchBot: testwikis to 1.46.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1259366 (https://phabricator.wikimedia.org/T420479)
[03:01:59] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1259366 (https://phabricator.wikimedia.org/T420479) (owner: TrainBranchBot)
[03:03:00] <wikibugs>	 (Merged) jenkins-bot: testwikis to 1.46.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1259366 (https://phabricator.wikimedia.org/T420479) (owner: TrainBranchBot)
[03:03:27] <logmsgbot>	 !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.21  refs T420479
[03:03:32] <stashbot>	 T420479: 1.46.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T420479
[03:30:48] <wikibugs>	 (PS1) Andrea Denisse: grafana: Add a SameSite attribute to cookies [puppet] - https://gerrit.wikimedia.org/r/1259382 (https://phabricator.wikimedia.org/T402844)
[03:30:48] <wikibugs>	 (CR) Andrea Denisse: "Even tho the docs state that this doesn't work with Oauth [1] I tested it on the grafana-next host and I was able to log-in with our setup" [puppet] - https://gerrit.wikimedia.org/r/1259382 (https://phabricator.wikimedia.org/T402844) (owner: Andrea Denisse)
[03:34:52] <wikibugs>	 (CR) Andrea Denisse: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8330/co" [puppet] - https://gerrit.wikimedia.org/r/1259254 (https://phabricator.wikimedia.org/T402844) (owner: Andrea Denisse)
[03:42:54] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.46.0-wmf.21  refs T420479 (duration: 39m 27s)
[03:42:59] <stashbot>	 T420479: 1.46.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T420479
[04:00:05] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T0400)
[04:01:15] <logmsgbot>	 !log mwpresync@deploy2002 Pruned MediaWiki: 1.46.0-wmf.18 (duration: 01m 13s)
[04:14:19] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:19:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:34:19] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:38:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:00] <jinxer-wm>	 FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[06:00:00] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T0600).
[06:31:55] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.13 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:57:44] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11742014 (ayounsi)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:11:12] <wikibugs>	 (PS1) Ayounsi: Remove old ulsfo ganeti cluster [puppet] - https://gerrit.wikimedia.org/r/1259748 (https://phabricator.wikimedia.org/T418993)
[07:12:28] <wikibugs>	 (PS2) Ayounsi: Remove old ulsfo ganeti cluster [puppet] - https://gerrit.wikimedia.org/r/1259748 (https://phabricator.wikimedia.org/T418993)
[07:40:20] <wikibugs>	 (CR) Ayounsi: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1255580 (owner: Slyngshede)
[07:45:47] <wikibugs>	 (CR) Slyngshede: [C:+2] C:external_clouds_vendors remove GeekyWorld [puppet] - https://gerrit.wikimedia.org/r/1255580 (owner: Slyngshede)
[07:59:44] <hashar>	 !log Changed https://logstash.wikimedia.org/ default page back to /app/dashboards
[07:59:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:41] <hashar>	 I am running the MediaWiki train
[08:01:54] <wikibugs>	 (PS1) TrainBranchBot: group0 to 1.46.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1259837 (https://phabricator.wikimedia.org/T420479)
[08:01:56] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by hashar@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1259837 (https://phabricator.wikimedia.org/T420479) (owner: TrainBranchBot)
[08:02:15] <hashar>	 the script generating the Deployment calendar got bugged for some reason
[08:03:05] <wikibugs>	 (Merged) jenkins-bot: group0 to 1.46.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1259837 (https://phabricator.wikimedia.org/T420479) (owner: TrainBranchBot)
[08:13:25] <logmsgbot>	 !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.21  refs T420479
[08:13:30] <stashbot>	 T420479: 1.46.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T420479
[08:19:35] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1259748 (https://phabricator.wikimedia.org/T418993) (owner: Ayounsi)
[08:20:57] <wikibugs>	 (PS2) Arnaudb: gerrit: forward Gitiles traffic to gerrit-replica [puppet] - https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595)
[08:20:57] <wikibugs>	 (CR) Arnaudb: "thanks, I've amended the change according to that very good idea!" [puppet] - https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) (owner: Arnaudb)
[08:21:07] <wikibugs>	 (PS2) Daniel Kinzler: rest gateway: add support for centralauthtoken [deployment-charts] - https://gerrit.wikimedia.org/r/1259242 (https://phabricator.wikimedia.org/T420280)
[08:21:44] <wikibugs>	 SRE-swift-storage, Ceph, Infrastructure-Foundations, Machine-Learning-Team: Move the Docker Registry's /ml prefix to S3/apus - https://phabricator.wikimedia.org/T420978#11742117 (MatthewVernon)
[08:24:24] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:25:31] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:27:07] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:27:08] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:29:54] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:31:03] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:31:58] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[08:34:52] <wikibugs>	 (PS1) Brouberol: dse-k8s-eqiad: document current version of 1.23 [puppet] - https://gerrit.wikimedia.org/r/1259845 (https://phabricator.wikimedia.org/T414484)
[08:35:48] <wikibugs>	 (CR) Ayounsi: [C:+2] Remove old ulsfo ganeti cluster [puppet] - https://gerrit.wikimedia.org/r/1259748 (https://phabricator.wikimedia.org/T418993) (owner: Ayounsi)
[08:36:37] <wikibugs>	 (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1259845 (https://phabricator.wikimedia.org/T414484) (owner: Brouberol)
[08:38:42] <wikibugs>	 (PS2) Brouberol: dse-k8s-eqiad: document current version of 1.23 [puppet] - https://gerrit.wikimedia.org/r/1259845 (https://phabricator.wikimedia.org/T414484)
[08:39:41] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti4008.ulsfo.wmnet with OS bookworm
[08:39:55] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11742187 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ayounsi@cumin1003 for host ganeti4008.ulsfo.wmnet w...
[08:42:15] <wikibugs>	 (CR) Brouberol: [V:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8331/console" [puppet] - https://gerrit.wikimedia.org/r/1259845 (https://phabricator.wikimedia.org/T414484) (owner: Brouberol)
[08:43:03] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:45:12] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool db1170: Degraded drive T420873
[08:45:17] <stashbot>	 T420873: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873
[08:45:26] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:45:29] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1170: Degraded drive T420873
[08:46:17] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[08:47:02] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873#11742221 (FCeratto-WMF) @VRiley-WMF I depooled the host to reduce I/O load when the new drive will be rebuilt, please go ahead and replace the drive ASAP. Thank you!
[08:47:20] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873#11742223 (FCeratto-WMF) Open→In progress p:Triage→High
[08:49:27] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:49:51] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:50:21] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old ulsfo ganeti VIP - ayounsi@cumin1003"
[08:50:31] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:51:31] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:51:51] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old ulsfo ganeti VIP - ayounsi@cumin1003"
[08:51:51] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:52:20] <dcausse>	 jouncebot: nowandnext
[08:52:20] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 7 minute(s)
[08:52:20] <jouncebot>	 In 1 hour(s) and 7 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1000)
[08:52:24] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:52:45] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[08:52:51] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:56:17] <wikibugs>	 (PS1) Ayounsi: Make ganeti4008 a Ganeti node on routed Ganeti/ulsfo [puppet] - https://gerrit.wikimedia.org/r/1259848 (https://phabricator.wikimedia.org/T418993)
[08:56:45] <wikibugs>	 (PS1) Ilias Sarantopoulos: alertmanager: Add Slack alerts receiver for ML team [puppet] - https://gerrit.wikimedia.org/r/1259849 (https://phabricator.wikimedia.org/T421040)
[08:57:11] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:57:15] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:57:17] <wikibugs>	 (PS1) Santiago Faci: Test Kitchen UI: Deploy v1.2.7 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1259850 (https://phabricator.wikimedia.org/T408186)
[08:57:21] <wikibugs>	 (CR) CI reject: [V:-1] alertmanager: Add Slack alerts receiver for ML team [puppet] - https://gerrit.wikimedia.org/r/1259849 (https://phabricator.wikimedia.org/T421040) (owner: Ilias Sarantopoulos)
[08:57:55] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11742249 (kera_wmde) @Scott_French Yes, my request is for Level 1.
[08:58:29] <wikibugs>	 (PS1) Santiago Faci: Test Kitchen UI: Deploy v1.2.7 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1259851 (https://phabricator.wikimedia.org/T408186)
[08:58:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[08:59:19] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:00:14] <wikibugs>	 (CR) Arnaudb: gerrit: add Envoy TLS termination for the CDN path (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1258976 (https://phabricator.wikimedia.org/T420909) (owner: Arnaudb)
[09:00:40] <jinxer-wm>	 FIRING: [6x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:01:10] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4008.ulsfo.wmnet with reason: host reimage
[09:04:19] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:05:04] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4008.ulsfo.wmnet with reason: host reimage
[09:09:19] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:09:33] <wikibugs>	 (PS2) Ilias Sarantopoulos: alertmanager: Add Slack alerts receiver for ML team [puppet] - https://gerrit.wikimedia.org/r/1259849 (https://phabricator.wikimedia.org/T421040)
[09:15:40] <jinxer-wm>	 FIRING: [6x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:17:23] <wikibugs>	 (PS1) JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1259859 (https://phabricator.wikimedia.org/T420448)
[09:18:20] <wikibugs>	 (CR) Cathal Mooney: [C:+1] Make ganeti4008 a Ganeti node on routed Ganeti/ulsfo [puppet] - https://gerrit.wikimedia.org/r/1259848 (https://phabricator.wikimedia.org/T418993) (owner: Ayounsi)
[09:19:21] <wikibugs>	 (PS4) Arnaudb: gerrit: add Envoy TLS termination for the CDN path [puppet] - https://gerrit.wikimedia.org/r/1258976 (https://phabricator.wikimedia.org/T420909)
[09:23:08] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4008.ulsfo.wmnet with OS bookworm
[09:23:22] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11742307 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ayounsi@cumin1003 for host ganeti4008.ulsfo.wmnet with...
[09:24:18] <wikibugs>	 (CR) Ayounsi: [C:+2] Make ganeti4008 a Ganeti node on routed Ganeti/ulsfo [puppet] - https://gerrit.wikimedia.org/r/1259848 (https://phabricator.wikimedia.org/T418993) (owner: Ayounsi)
[09:27:02] <wikibugs>	 (CR) Dpogorzelski: [C:+2] alertmanager: Add Slack alerts receiver for ML team [puppet] - https://gerrit.wikimedia.org/r/1259849 (https://phabricator.wikimedia.org/T421040) (owner: Ilias Sarantopoulos)
[09:29:08] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti4008.ulsfo.wmnet to cluster ulsfo02 and group 01
[09:29:22] <logmsgbot>	 !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4008.ulsfo.wmnet to cluster ulsfo02 and group 01
[09:29:45] <wikibugs>	 (CR) A-pizzata: [C:+1] stream: mw-page-html-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1259859 (https://phabricator.wikimedia.org/T420448) (owner: JavierMonton)
[09:30:28] <wikibugs>	 (CR) Arnaudb: [C:+2] gerrit: add Envoy TLS termination for the CDN path [puppet] - https://gerrit.wikimedia.org/r/1258976 (https://phabricator.wikimedia.org/T420909) (owner: Arnaudb)
[09:30:31] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr3-ulsfo and 198.35.26.13 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:30:40] <jinxer-wm>	 FIRING: [6x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:30:41] <wikibugs>	 ops-codfw, DC-Ops: Power Supply - PS Redundancy - issue on cirrussearch2079:9290 - https://phabricator.wikimedia.org/T421042 (phaultfinder) NEW
[09:31:17] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti4008.ulsfo.wmnet to cluster ulsfo02 and group 01
[09:31:26] <logmsgbot>	 !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4008.ulsfo.wmnet to cluster ulsfo02 and group 01
[09:33:30] <wikibugs>	 ops-codfw, DC-Ops: Power Supply - PS Redundancy - issue on wikikube-ctrl2001:9290 - https://phabricator.wikimedia.org/T421043 (phaultfinder) NEW
[09:34:01] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti4008.ulsfo.wmnet to cluster ulsfo02 and group 01
[09:35:27] <wikibugs>	 (CR) JavierMonton: [C:+2] stream: mw-page-html-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1259859 (https://phabricator.wikimedia.org/T420448) (owner: JavierMonton)
[09:35:31] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:37:25] <wikibugs>	 (Merged) jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1259859 (https://phabricator.wikimedia.org/T420448) (owner: JavierMonton)
[09:38:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:40:32] <logmsgbot>	 ayounsi@cumin1003 addnode (PID 1674010) is awaiting input
[09:43:27] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti4008.ulsfo.wmnet to cluster ulsfo02 and group 01
[09:46:36] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11742391 (ayounsi)
[09:46:44] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[09:47:01] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[09:52:22] <wikibugs>	 (PS1) Cathal Mooney: nftables: place notrack rules into the /etc/nftables/prerouting [puppet] - https://gerrit.wikimedia.org/r/1259874 (https://phabricator.wikimedia.org/T420715)
[09:52:44] <wikibugs>	 (PS1) DCausse: search: use the discovery ns record for the semanticsearch cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1259875 (https://phabricator.wikimedia.org/T414484)
[09:55:13] <wikibugs>	 (PS1) Arnaudb: gerrit: use Envoy on gerrit-spare [puppet] - https://gerrit.wikimedia.org/r/1259869 (https://phabricator.wikimedia.org/T420909)
[09:57:23] <wikibugs>	 ops-eqiad, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11742409 (MLechvien-WMF) @Jclark-ctr gentle follow-up on that?
[09:58:55] <wikibugs>	 SRE, collaboration-services, Ganeti, Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11742424 (ayounsi) Open→Resolved a:ayounsi All done here. I've also opened {T421044} to balance the VMs better.
[10:00:00] <wikibugs>	 (CR) DCausse: "Sorry I just saw this patch before uploading Ie7d9b5b489a38744d73eef9d2a704af532df74af. I think we can now use the discovery ns record and" [mediawiki-config] - https://gerrit.wikimedia.org/r/1259143 (owner: Ebernhardson)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1000)
[10:00:32] <jinxer-wm>	 FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[10:00:32] <jinxer-wm>	 FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[10:00:32] <wikibugs>	 ops-eqiad, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11742436 (Jclark-ctr) >>! In T412255#11742409, @MLechvien-WMF wrote: > @Jclark-ctr @Jhancock.wm  gentle follow-up on that?  For shipping updates procurement...
[10:06:24] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops-deprecated, Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11742476 (elukey)  To keep archives happy - I used the following workaround in provisioning and it worked:  `         # For som...
[10:07:41] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1006.eqiad.wmnet with OS trixie
[10:07:55] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops-deprecated, Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11742483 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host aux-k8s-worker1006....
[10:08:26] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:10:06] <wikibugs>	 SRE-swift-storage, Ceph, Infrastructure-Foundations, Machine-Learning-Team: Move the Docker Registry's /ml prefix to S3/apus - https://phabricator.wikimedia.org/T420978#11742508 (elukey) p:Triage→Medium
[10:10:17] <wikibugs>	 (CR) Arnaudb: [C:+2] gerrit: use Envoy on gerrit-spare [puppet] - https://gerrit.wikimedia.org/r/1259869 (https://phabricator.wikimedia.org/T420909) (owner: Arnaudb)
[10:10:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:13:52] <wikibugs>	 SRE-tools, ServiceOps new: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11742517 (MLechvien-WMF) Open→Resolved Tentatively resolving as this was tested and merged, please reopen if any concerns
[10:16:23] <wikibugs>	 (PS1) Cathal Mooney: nftables: remove 'notrack' directory from /etc/nftables [puppet] - https://gerrit.wikimedia.org/r/1259896 (https://phabricator.wikimedia.org/T420715)
[10:16:47] <logmsgbot>	 !log brouberol@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2006.codfw.wmnet with OS trixie
[10:16:48] <wikibugs>	 (PS1) Ayounsi: Create INSTALL_HOSTS firewall definition [puppet] - https://gerrit.wikimedia.org/r/1259897
[10:16:56] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops-deprecated: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#11742530 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1003 for host aux-k8s-worker2006.codfw.wmnet wit...
[10:17:32] <logmsgbot>	 !log brouberol@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2007.codfw.wmnet with OS trixie
[10:17:39] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aux-k8s-worker1006.eqiad.wmnet with OS trixie
[10:17:40] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops-deprecated: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#11742535 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1003 for host aux-k8s-worker2007.codfw.wmnet wit...
[10:17:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops-deprecated, Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11742537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host aux-k8s-worker1006.eqia...
[10:18:08] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:18:10] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:18:38] <wikibugs>	 (PS1) Cathal Mooney: nftables: remove the file definition for /etc/nftables/notrack [puppet] - https://gerrit.wikimedia.org/r/1259898 (https://phabricator.wikimedia.org/T420715)
[10:18:56] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:18:59] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:19:34] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1259897 (owner: Ayounsi)
[10:20:29] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:20:32] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:21:15] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:22:05] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:22:14] <wikibugs>	 (PS1) Arnaudb: Revert "gerrit: use Envoy on gerrit-spare" [puppet] - https://gerrit.wikimedia.org/r/1259899
[10:23:08] <wikibugs>	 (CR) Arnaudb: [C:+2] Revert "gerrit: use Envoy on gerrit-spare" [puppet] - https://gerrit.wikimedia.org/r/1259899 (owner: Arnaudb)
[10:24:23] <wikibugs>	 (PS1) Jcrespo: mediabackup: Make the recovery account obsolete (but not remove it yet) [puppet] - https://gerrit.wikimedia.org/r/1259901 (https://phabricator.wikimedia.org/T420506)
[10:28:01] <wikibugs>	 (CR) Arnaudb: [C:+2] gerrit: use Envoy on gerrit-spare (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1259869 (https://phabricator.wikimedia.org/T420909) (owner: Arnaudb)
[10:28:16] <wikibugs>	 (CR) JMeybohm: "I'd stack this on top of 1259141 so that the CI does not fail here" [deployment-charts] - https://gerrit.wikimedia.org/r/1259158 (https://phabricator.wikimedia.org/T414484) (owner: Btullis)
[10:28:31] <wikibugs>	 (PS1) Arnaudb: gerrit: use Envoy on gerrit-spare [puppet] - https://gerrit.wikimedia.org/r/1259902 (https://phabricator.wikimedia.org/T420909)
[10:28:41] <icinga-wm>	 RECOVERY - Host thanos-be2006 is UP: PING OK - Packet loss = 0%, RTA = 30.54 ms
[10:28:55] <logmsgbot>	 !log brouberol@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2006.codfw.wmnet with reason: host reimage
[10:29:36] <logmsgbot>	 !log brouberol@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2007.codfw.wmnet with reason: host reimage
[10:29:44] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:29:47] <wikibugs>	 (PS2) Jcrespo: mediabackup: Make the recovery account obsolete (but not remove it yet) [puppet] - https://gerrit.wikimedia.org/r/1259901 (https://phabricator.wikimedia.org/T420506)
[10:29:51] <wikibugs>	 (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1259901 (https://phabricator.wikimedia.org/T420506) (owner: Jcrespo)
[10:30:00] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:30:02] <wikibugs>	 (PS13) Majavah: nftables::service: Improve src/dst filter handling [puppet] - https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102)
[10:30:32] <jinxer-wm>	 RESOLVED: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[10:30:32] <jinxer-wm>	 RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[10:30:39] <wikibugs>	 (CR) Majavah: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: Majavah)
[10:32:24] <wikibugs>	 (PS2) Ayounsi: Create INSTALL_HOSTS firewall definition [puppet] - https://gerrit.wikimedia.org/r/1259897
[10:32:38] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1259897 (owner: Ayounsi)
[10:32:46] <wikibugs>	 (CR) Jcrespo: [C:+2] mediabackup: Make the recovery account obsolete (but not remove it yet) [puppet] - https://gerrit.wikimedia.org/r/1259901 (https://phabricator.wikimedia.org/T420506) (owner: Jcrespo)
[10:33:06] <logmsgbot>	 !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2006.codfw.wmnet with reason: host reimage
[10:35:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11742608 (Clement_Goubert)
[10:35:44] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#11742609 (Clement_Goubert)
[10:36:16] <logmsgbot>	 !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2007.codfw.wmnet with reason: host reimage
[10:37:38] <wikibugs>	 SRE, Traffic: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097#11742616 (MLechvien-WMF) #traffic  do we know when we can do this cleanup?
[10:38:26] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:40:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:41:26] <wikibugs>	 (PS3) Ayounsi: Create INSTALL_HOSTS firewall definition [puppet] - https://gerrit.wikimedia.org/r/1259897
[10:41:56] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1259897 (owner: Ayounsi)
[10:42:00] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:42:09] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:49:23] <logmsgbot>	 !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2006.codfw.wmnet with OS trixie
[10:49:36] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#11742695 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1003 for host aux-k8s-worker2006.codfw.wmnet with OS trixie completed: - aux-k8...
[10:51:53] <wikibugs>	 (CR) Majavah: "fixed in PS13." [puppet] - https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: Majavah)
[10:53:15] <logmsgbot>	 !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2007.codfw.wmnet with OS trixie
[10:53:21] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#11742710 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1003 for host aux-k8s-worker2007.codfw.wmnet with OS trixie completed: - aux-k8...
[10:55:23] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:55:34] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[10:57:17] <wikibugs>	 (PS1) Btullis: Allow wdqs::alternatives hosts to access kafka jumbo and test [puppet] - https://gerrit.wikimedia.org/r/1259921 (https://phabricator.wikimedia.org/T421048)
[10:57:44] <wikibugs>	 (CR) CI reject: [V:-1] Allow wdqs::alternatives hosts to access kafka jumbo and test [puppet] - https://gerrit.wikimedia.org/r/1259921 (https://phabricator.wikimedia.org/T421048) (owner: Btullis)
[10:59:28] <wikibugs>	 (PS2) Btullis: Allow wdqs::alternatives hosts to access kafka jumbo and test [puppet] - https://gerrit.wikimedia.org/r/1259921 (https://phabricator.wikimedia.org/T421048)
[10:59:29] <wikibugs>	 (CR) Majavah: [C:+1] conftool-data: move s3, x3 to new hosts (part 1) [puppet] - https://gerrit.wikimedia.org/r/1256417 (https://phabricator.wikimedia.org/T409557) (owner: FNegri)
[11:00:17] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1259921 (https://phabricator.wikimedia.org/T421048) (owner: Btullis)
[11:03:22] <wikibugs>	 (CR) Brouberol: [C:+1] Allow wdqs::alternatives hosts to access kafka jumbo and test [puppet] - https://gerrit.wikimedia.org/r/1259921 (https://phabricator.wikimedia.org/T421048) (owner: Btullis)
[11:03:57] <wikibugs>	 (CR) FNegri: [C:+2] conftool-data: move s3, x3 to new hosts (part 1) [puppet] - https://gerrit.wikimedia.org/r/1256417 (https://phabricator.wikimedia.org/T409557) (owner: FNegri)
[11:04:27] <wikibugs>	 (PS4) Ayounsi: Create INSTALL_HOSTS firewall definition [puppet] - https://gerrit.wikimedia.org/r/1259897
[11:04:58] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1259897 (owner: Ayounsi)
[11:06:59] <wikibugs>	 (CR) Btullis: [C:+1] dse-k8s-eqiad: document current version of 1.23 [puppet] - https://gerrit.wikimedia.org/r/1259845 (https://phabricator.wikimedia.org/T414484) (owner: Brouberol)
[11:07:20] <wikibugs>	 (CR) Brouberol: [V:+1 C:+2] dse-k8s-eqiad: document current version of 1.23 [puppet] - https://gerrit.wikimedia.org/r/1259845 (https://phabricator.wikimedia.org/T414484) (owner: Brouberol)
[11:07:22] <wikibugs>	 (CR) Btullis: [C:+2] Allow wdqs::alternatives hosts to access kafka jumbo and test [puppet] - https://gerrit.wikimedia.org/r/1259921 (https://phabricator.wikimedia.org/T421048) (owner: Btullis)
[11:07:47] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1006.eqiad.wmnet with OS trixie
[11:08:20] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11742757 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host aux-k8s-worker1006.eqiad.wmnet with OS trixie
[11:09:18] <wikibugs>	 (PS3) Brouberol: dse-k8s: ensure helm3.17 is used everywhere post upgrade [deployment-charts] - https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484)
[11:09:36] <wikibugs>	 (CR) Brouberol: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484) (owner: Brouberol)
[11:10:55] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "LGTM!  Nice work :)" [puppet] - https://gerrit.wikimedia.org/r/1259897 (owner: Ayounsi)
[11:11:49] <wikibugs>	 (CR) Ayounsi: [C:+2] Create INSTALL_HOSTS firewall definition [puppet] - https://gerrit.wikimedia.org/r/1259897 (owner: Ayounsi)
[11:14:51] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1022.eqiad.wmnet,service=s3
[11:14:58] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1023.eqiad.wmnet,service=s3
[11:15:05] <wikibugs>	 (PS4) Brouberol: dse-k8s: ensure helm3.17 is used everywhere post upgrade [deployment-charts] - https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484)
[11:17:35] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s3
[11:18:03] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s3
[11:18:37] <wikibugs>	 (CR) Phuedx: [C:+1] Test Kitchen UI: Deploy v1.2.7 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1259850 (https://phabricator.wikimedia.org/T408186) (owner: Santiago Faci)
[11:18:55] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1006.eqiad.wmnet with reason: host reimage
[11:19:06] <wikibugs>	 (PS1) Ayounsi: Define profile::installserver::dhcp::install_servers6 in cloud [puppet] - https://gerrit.wikimedia.org/r/1259934
[11:19:38] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1023.eqiad.wmnet,service=s3
[11:19:44] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1022.eqiad.wmnet,service=s3
[11:19:53] <wikibugs>	 (CR) Cathal Mooney: [C:+1] Define profile::installserver::dhcp::install_servers6 in cloud [puppet] - https://gerrit.wikimedia.org/r/1259934 (owner: Ayounsi)
[11:20:48] <wikibugs>	 (CR) CI reject: [V:-1] dse-k8s: ensure helm3.17 is used everywhere post upgrade [deployment-charts] - https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484) (owner: Brouberol)
[11:22:02] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1022.eqiad.wmnet,service=s3
[11:22:25] <wikibugs>	 (CR) Ayounsi: [C:+2] Define profile::installserver::dhcp::install_servers6 in cloud [puppet] - https://gerrit.wikimedia.org/r/1259934 (owner: Ayounsi)
[11:24:47] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1006.eqiad.wmnet with reason: host reimage
[11:25:53] <wikibugs>	 (PS1) Cathal Mooney: routed-ganeti: allow sandbox replies to HTTP from install hosts [puppet] - https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975)
[11:26:23] <wikibugs>	 (CR) CI reject: [V:-1] routed-ganeti: allow sandbox replies to HTTP from install hosts [puppet] - https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975) (owner: Cathal Mooney)
[11:26:54] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/weight=100; selector: name=clouddb1022.eqiad.wmnet,service=s3
[11:26:58] <wikibugs>	 (PS2) Cathal Mooney: routed-ganeti: allow sandbox replies to HTTP from install hosts [puppet] - https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975)
[11:27:29] <wikibugs>	 (CR) Cathal Mooney: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975) (owner: Cathal Mooney)
[11:27:30] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/weight=100; selector: name=clouddb1023.eqiad.wmnet,service=s3
[11:27:45] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1023.eqiad.wmnet,service=s3
[11:27:55] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s3
[11:31:37] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/weight=100; selector: name=clouddb1022.eqiad.wmnet,service=x3
[11:31:41] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/weight=100; selector: name=clouddb1023.eqiad.wmnet,service=x3
[11:31:54] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1022.eqiad.wmnet,service=x3
[11:31:58] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1023.eqiad.wmnet,service=x3
[11:32:07] <logmsgbot>	 !log volans@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudcumin1001.eqiad.wmnet
[11:32:15] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet,service=x3
[11:32:20] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet,service=x3
[11:34:14] <wikibugs>	 (PS3) Cathal Mooney: routed-ganeti: allow sandbox replies to HTTP from install hosts [puppet] - https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975)
[11:36:03] <logmsgbot>	 !log volans@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin1001.eqiad.wmnet
[11:37:29] <wikibugs>	 (CR) Hnowlan: trafficserver: Add api.w.o to gateway-check.lua.conf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) (owner: Clément Goubert)
[11:38:26] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:38:42] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11742853 (Daria-WMDE) @Scott_French thank you! Signed the NDA
[11:39:12] <wikibugs>	 (CR) Ayounsi: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975) (owner: Cathal Mooney)
[11:40:19] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1006.eqiad.wmnet with OS trixie
[11:40:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:40:33] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11742856 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host aux-k8s-worker1006.eqiad.wmnet with OS trixie comp...
[11:41:05] <wikibugs>	 (PS4) Cathal Mooney: routed-ganeti: allow sandbox replies to HTTP from install hosts [puppet] - https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975)
[11:42:25] <wikibugs>	 (PS3) Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145)
[11:42:29] <wikibugs>	 (CR) Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) (owner: Clément Goubert)
[11:42:49] <wikibugs>	 (PS3) Clément Goubert: trafficserver: 100% of linkrecommendation to rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1259071 (https://phabricator.wikimedia.org/T418148)
[11:43:02] <wikibugs>	 (PS3) Clément Goubert: trafficserver: 100% of device-analytics to rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1259075 (https://phabricator.wikimedia.org/T418147)
[11:44:28] <wikibugs>	 (PS3) Clément Goubert: trafficserver: 50% of /core/v1 to rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1259077 (https://phabricator.wikimedia.org/T418146)
[11:46:00] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[11:46:02] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[11:46:11] <wikibugs>	 (PS3) Clément Goubert: trafficserver: 100% of /core/v1 to rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1259078 (https://phabricator.wikimedia.org/T418146)
[11:47:23] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[11:47:24] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[11:47:25] <wikibugs>	 (PS5) Cathal Mooney: routed-ganeti: allow sandbox replies to HTTP from install hosts [puppet] - https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975)
[11:47:51] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[11:47:53] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[11:48:09] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[11:48:11] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[11:48:26] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:49:26] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[11:49:28] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[11:49:44] <wikibugs>	 (CR) Cathal Mooney: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975) (owner: Cathal Mooney)
[11:51:06] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s1
[11:51:18] <logmsgbot>	 !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1017.eqiad.wmnet
[11:51:19] <logmsgbot>	 !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1017.eqiad.wmnet
[11:51:47] <logmsgbot>	 !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Rebooting clouddb1017 T419960
[11:52:07] <wikibugs>	 (CR) Ayounsi: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975) (owner: Cathal Mooney)
[11:53:44] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[11:53:54] <logmsgbot>	 !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[11:53:56] <wikibugs>	 (CR) Arnaudb: [C:+2] gerrit: use Envoy on gerrit-spare [puppet] - https://gerrit.wikimedia.org/r/1259902 (https://phabricator.wikimedia.org/T420909) (owner: Arnaudb)
[11:57:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:57:53] <wikibugs>	 (03CR) 10Hnowlan: [C:04-2] "This API isn't used in public and doesn't need to be rerouted. It existing in parallel is an awkward side-effect of the first rollout of A" [puppet] - 10https://gerrit.wikimedia.org/r/1259075 (https://phabricator.wikimedia.org/T418147) (owner: 10Clément Goubert)
[11:58:18] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert)
[11:58:25] <wikibugs>	 (03Abandoned) 10Clément Goubert: trafficserver: 100% of device-analytics to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259075 (https://phabricator.wikimedia.org/T418147) (owner: 10Clément Goubert)
[11:58:26] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[11:58:53] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] "lgtm bar the unique-devices line" [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert)
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1200)
[12:00:16] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] routed-ganeti: allow sandbox replies to HTTP from install hosts [puppet] - 10https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975) (owner: 10Cathal Mooney)
[12:00:44] <wikibugs>	 (03PS1) 10Clément Goubert: Revert "rest-gateway: Add api.w.o device-analytics support" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259942
[12:01:41] <logmsgbot>	 !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s1
[12:01:55] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] Revert "rest-gateway: Add api.w.o device-analytics support" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259942 (owner: 10Clément Goubert)
[12:02:18] <logmsgbot>	 !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1017.eqiad.wmnet
[12:02:19] <logmsgbot>	 !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1017.eqiad.wmnet
[12:03:25] <wikibugs>	 (03PS4) 10Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145)
[12:03:36] <wikibugs>	 (03PS4) 10Clément Goubert: trafficserver: 100% of linkrecommendation to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259071 (https://phabricator.wikimedia.org/T418148)
[12:04:31] <wikibugs>	 (03PS5) 10Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145)
[12:05:36] <wikibugs>	 (03PS5) 10Clément Goubert: trafficserver: 50% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259077 (https://phabricator.wikimedia.org/T418146)
[12:06:47] <wikibugs>	 (03PS6) 10Clément Goubert: trafficserver: 100% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259078 (https://phabricator.wikimedia.org/T418146)
[12:07:50] <wikibugs>	 (03Abandoned) 10Clément Goubert: trafficserver: 100% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259078 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert)
[12:09:58] <wikibugs>	 (03PS1) 10Clément Goubert: trafficserver: 100% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259946 (https://phabricator.wikimedia.org/T418146)
[12:10:55] <wikibugs>	 (CR) Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) (owner: Clément Goubert)
[12:11:52] <wikibugs>	 (03PS1) 10Arnaudb: gerrit: use Envoy on gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1259944 (https://phabricator.wikimedia.org/T420909)
[12:12:25] <wikibugs>	 (03PS1) 10Arnaudb: gerrit: use Envoy on gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1259945 (https://phabricator.wikimedia.org/T420909)
[12:13:24] <wikibugs>	 (03PS6) 10Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145)
[12:14:11] <wikibugs>	 (03CR) 10Arnaudb: "if we merge this after 1259944 the backend config will have to be updated" [puppet] - 10https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) (owner: 10Arnaudb)
[12:15:24] <wikibugs>	 (03PS7) 10Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145)
[12:16:41] <wikibugs>	 (03PS8) 10Clément Goubert: trafficserver: 50% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259077 (https://phabricator.wikimedia.org/T418146)
[12:17:14] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11742983 (Jclark-ctr) a:BTullis→Jclark-ctr
[12:17:35] <wikibugs>	 (03PS3) 10Clément Goubert: trafficserver: 100% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259946 (https://phabricator.wikimedia.org/T418146)
[12:18:47] <wikibugs>	 (03PS6) 10Clément Goubert: trafficserver: 100% of linkrecommendation to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259071 (https://phabricator.wikimedia.org/T418148)
[12:34:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11743055 (Jclark-ctr) a:Jclark-ctr→Jgreen Updated passwords and sent temp password via private message to you
[12:38:56] <wikibugs>	 (03PS1) 10Daniel Kinzler: rest gateway: lower threshold for browser detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259956 (https://phabricator.wikimedia.org/T421031)
[12:43:00] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Atlas no longer reachable from monitoring on routed ganeti - https://phabricator.wikimedia.org/T420975#11743113 (cmooney) Open→Resolved a:cmooney This should now be working again. Big thanks to @ayounsi for the heavy-lifting with all the puppe...
[12:50:47] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1007.eqiad.wmnet with OS trixie
[12:51:04] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11743142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host aux-k8s-worker1007.eqiad.wmnet with OS trixie
[12:54:14] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet CI, Puppet-Infrastructure, Patch-For-Review: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one - https://phabricator.wikimedia.org/T367399#11743147 (hashar) Resolved→Open We still have the old Puppet 5 /...
[12:54:52] <wikibugs>	 (03PS2) 10Daniel Kinzler: rest gateway: lower threshold for browser detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259956 (https://phabricator.wikimedia.org/T421031)
[12:56:34] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[13:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1300).
[13:00:04] <jouncebot>	 Daimona: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:20] <Lucas_WMDE>	 I can’t really deploy today, anyone else around?
[13:00:21] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: modify records for payments servers frack - cmooney@cumin1003"
[13:00:27] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: modify records for payments servers frack - cmooney@cumin1003"
[13:00:27] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:00:32] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:01:59] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1007.eqiad.wmnet with reason: host reimage
[13:02:49] <wikibugs>	 (03PS1) 10Hashar: ci: fix typo in manage_srv [puppet] - 10https://gerrit.wikimedia.org/r/1259961
[13:03:26] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:03:38] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache payments1010.frack.eqiad.wmnet on all recursors
[13:03:41] <wikibugs>	 (CR) Hashar: jenkins: allow rsyncing of data for migrating a jenkins server (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1255136 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[13:03:42] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) payments1010.frack.eqiad.wmnet on all recursors
[13:03:53] <wikibugs>	 (03PS3) 10Daniel Kinzler: rest gateway: lower threshold for browser detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259956 (https://phabricator.wikimedia.org/T421031)
[13:03:58] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache payments1011.frack.eqiad.wmnet on all recursors
[13:04:02] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) payments1011.frack.eqiad.wmnet on all recursors
[13:04:07] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache payments1012.frack.eqiad.wmnet on all recursors
[13:04:11] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) payments1012.frack.eqiad.wmnet on all recursors
[13:08:31] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1007.eqiad.wmnet with reason: host reimage
[13:09:40] <Daimona>	 I'm here for the deployment BTW, lost track of time, sorry!
[13:09:44] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11743235 (10AnnieKim_WMDE) I'm in! Thanks for your help.
[13:10:01] <wikibugs>	 (03PS8) 10Andrew Bogott: cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134 (owner: 10Majavah)
[13:10:05] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259134 (owner: 10Majavah)
[13:10:32] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:13:01] <cmelo>	 \o/
[13:14:48] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+1] cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134 (owner: 10Majavah)
[13:15:32] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:15:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cmelo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259231 (https://phabricator.wikimedia.org/T419597) (owner: 10Daimona Eaytoy)
[13:15:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cmelo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259237 (https://phabricator.wikimedia.org/T414149) (owner: 10Daimona Eaytoy)
[13:16:56] <wikibugs>	 (03Merged) 10jenkins-bot: Enable the CampaignEvents extension on all wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259231 (https://phabricator.wikimedia.org/T419597) (owner: 10Daimona Eaytoy)
[13:17:09] <wikibugs>	 (03Merged) 10jenkins-bot: Enable $wgCampaignEventsEnableEventGoals in prod wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259237 (https://phabricator.wikimedia.org/T414149) (owner: 10Daimona Eaytoy)
[13:18:03] <logmsgbot>	 !log cmelo@deploy2002 Started scap sync-world: Backport for [[gerrit:1259231|Enable the CampaignEvents extension on all wikibooks (T419597)]], [[gerrit:1259237|Enable $wgCampaignEventsEnableEventGoals in prod wikis (T414149)]]
[13:18:09] <stashbot>	 T419597: Enable CampaignEvents extension on Wikibooks [week of March 23] - https://phabricator.wikimedia.org/T419597
[13:18:10] <stashbot>	 T414149: Enable event goals in production - https://phabricator.wikimedia.org/T414149
[13:20:12] <logmsgbot>	 !log cmelo@deploy2002 cmelo, daimona: Backport for [[gerrit:1259231|Enable the CampaignEvents extension on all wikibooks (T419597)]], [[gerrit:1259237|Enable $wgCampaignEventsEnableEventGoals in prod wikis (T414149)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:20:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11743394 (Jclark-ctr) a:Jclark-ctr→Jgreen @jgreen you should be good @cmooney was able to assist with vlan
[13:21:16] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] search: use the discovery ns record for the semanticsearch cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259875 (https://phabricator.wikimedia.org/T414484) (owner: 10DCausse)
[13:22:54] <wikibugs>	 (03Abandoned) 10Ebernhardson: search: Add codfw semanticsearch cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259143 (owner: 10Ebernhardson)
[13:23:08] <sukhe>	 !log sudo cumin 'C:bird' "disable-puppet 'merging CR 1248385, T413740'"
[13:23:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:13] <stashbot>	 T413740: Backport and test Bird 2.18 - https://phabricator.wikimedia.org/T413740
[13:24:16] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1007.eqiad.wmnet with OS trixie
[13:24:27] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11743426 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host aux-k8s-worker1007.eqiad.wmnet with OS trixie comp...
[13:24:53] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] Remove support for enabling Bird 2.18 selectively [puppet] - 10https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff)
[13:25:05] <wikibugs>	 (CR) CDanis: vector-search: add initial deployment chart (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1255948 (https://phabricator.wikimedia.org/T420379) (owner: Fabian Kaelin)
[13:26:29] <logmsgbot>	 !log cmelo@deploy2002 cmelo, daimona: Continuing with sync
[13:27:12] <wikibugs>	 (03PS1) 10Jforrester: Set json object before setting Abstract Wiki Id [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1259967 (https://phabricator.wikimedia.org/T420916)
[13:27:44] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] "lgtm, will merge" [puppet] - 10https://gerrit.wikimedia.org/r/1259961 (owner: 10Hashar)
[13:29:28] <wikibugs>	 (03CR) 10Bking: [C:03+1] search: use the discovery ns record for the semanticsearch cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259875 (https://phabricator.wikimedia.org/T414484) (owner: 10DCausse)
[13:30:46] <logmsgbot>	 !log cmelo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1259231|Enable the CampaignEvents extension on all wikibooks (T419597)]], [[gerrit:1259237|Enable $wgCampaignEventsEnableEventGoals in prod wikis (T414149)]] (duration: 12m 43s)
[13:30:52] <stashbot>	 T419597: Enable CampaignEvents extension on Wikibooks [week of March 23] - https://phabricator.wikimedia.org/T419597
[13:30:53] <stashbot>	 T414149: Enable event goals in production - https://phabricator.wikimedia.org/T414149
[13:31:49] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259875 (https://phabricator.wikimedia.org/T414484) (owner: 10DCausse)
[13:32:13] <sukhe>	 !log sudo cumin -b1 -s20 'C:bird' "run-puppet-agent --enable 'merging CR 1248385, T413740'"
[13:32:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:18] <stashbot>	 T413740: Backport and test Bird 2.18 - https://phabricator.wikimedia.org/T413740
[13:33:19] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1008.eqiad.wmnet with OS trixie
[13:33:26] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:33:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11743487 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host aux-k8s-worker1008.eqiad.wmnet with OS trixie
[13:33:53] <dcausse>	 just added a patch to the backport window
[13:35:25] <dcausse>	 o/ cmelo, Daimona are you done with your deploys?
[13:37:31] <wikibugs>	 (03CR) 10CDanis: [C:03+1] trixie: Add component/opensearch2 [puppet] - 10https://gerrit.wikimedia.org/r/1259232 (https://phabricator.wikimedia.org/T420759) (owner: 10Bking)
[13:37:54] <dcausse>	 jouncebot: now
[13:37:54] <jouncebot>	 For the next 0 hour(s) and 22 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1300)
[13:38:54] <wikibugs>	 (CR) CDanis: gerrit: forward Gitiles traffic to gerrit-replica (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) (owner: Arnaudb)
[13:39:10] <wikibugs>	 (03CR) 10Bking: [C:03+2] trixie: Add component/opensearch2 [puppet] - 10https://gerrit.wikimedia.org/r/1259232 (https://phabricator.wikimedia.org/T420759) (owner: 10Bking)
[13:41:38] <wikibugs>	 (CR) CDanis: gerrit: forward Gitiles traffic to gerrit-replica (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) (owner: Arnaudb)
[13:42:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259875 (https://phabricator.wikimedia.org/T414484) (owner: 10DCausse)
[13:43:57] <wikibugs>	 (03Merged) 10jenkins-bot: search: use the discovery ns record for the semanticsearch cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259875 (https://phabricator.wikimedia.org/T414484) (owner: 10DCausse)
[13:44:28] <logmsgbot>	 !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1259875|search: use the discovery ns record for the semanticsearch cluster (T414484)]]
[13:44:33] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1008.eqiad.wmnet with reason: host reimage
[13:44:34] <stashbot>	 T414484: Upgrade DSE clusters to kubernetes 1.31 - https://phabricator.wikimedia.org/T414484
[13:45:20] <wikibugs>	 (CR) Arnaudb: gerrit: forward Gitiles traffic to gerrit-replica (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) (owner: Arnaudb)
[13:46:13] <wikibugs>	 (03PS1) 10Btullis: Temporarily suspend the flink applications running in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259973 (https://phabricator.wikimedia.org/T414484)
[13:46:14] <wikibugs>	 (CR) Arnaudb: gerrit: forward Gitiles traffic to gerrit-replica (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) (owner: Arnaudb)
[13:46:31] <logmsgbot>	 !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1259875|search: use the discovery ns record for the semanticsearch cluster (T414484)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:48:11] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1008.eqiad.wmnet with reason: host reimage
[13:49:52] <Daimona>	 dcausse: belated yes, sorry
[13:49:59] <cmelo>	 Hi we are done, sorry
[13:50:03] <dcausse>	 np, thanks! :)
[13:52:20] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[13:52:38] <wikibugs>	 (03PS1) 10Ottomata: dse-k8s - unset some Flink JobManager off-heap.size override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259975 (https://phabricator.wikimedia.org/T397330)
[13:54:09] <wikibugs>	 (03CR) 10AKhatun: [C:03+1] dse-k8s - unset some Flink JobManager off-heap.size override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259975 (https://phabricator.wikimedia.org/T397330) (owner: 10Ottomata)
[13:54:35] <wikibugs>	 (03CR) 10JavierMonton: [C:03+1] dse-k8s - unset some Flink JobManager off-heap.size override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259975 (https://phabricator.wikimedia.org/T397330) (owner: 10Ottomata)
[13:54:50] <wikibugs>	 (03CR) 10Btullis: "Do not merge until the maintenance window on Thursday March 26th 2026 at 10:30 UTC" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259973 (https://phabricator.wikimedia.org/T414484) (owner: 10Btullis)
[13:57:38] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[13:58:00] <wikibugs>	 (03PS3) 10Arnaudb: gerrit: forward Gitiles traffic to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595)
[13:59:09] <logmsgbot>	 !log dcausse@deploy2002 Sync cancelled.
[13:59:30] <wikibugs>	 (03PS4) 10Arnaudb: gerrit: forward Gitiles traffic to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595)
[13:59:30] <logmsgbot>	 !log jforrester@deploy2002 mwscript-k8s job started: sql --wiki=abstractwiki /srv/mediawiki/php-1.46.0-wmf.20/extensions/Translate/sql/mysql/translate_message_group_subscriptions.sql  # T420656 translate_message_group_subscriptions
[13:59:36] <stashbot>	 T420656: Enable Translate extension for Abstract Wikipedia - https://phabricator.wikimedia.org/T420656
[14:00:05] <jouncebot>	 Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1400)
[14:00:17] <wikibugs>	 (03PS1) 10DCausse: Revert "search: use the discovery ns record for the semanticsearch cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259979
[14:00:34] <wikibugs>	 (03CR) 10DCausse: [C:03+2] Revert "search: use the discovery ns record for the semanticsearch cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259979 (owner: 10DCausse)
[14:00:47] <wikibugs>	 (03CR) 10Arnaudb: "added 1259121 as a dependency of this change" [puppet] - 10https://gerrit.wikimedia.org/r/1259944 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb)
[14:01:29] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "search: use the discovery ns record for the semanticsearch cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259979 (owner: 10DCausse)
[14:01:38] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2007.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[14:02:28] <wikibugs>	 (03PS2) 10Arnaudb: gerrit: use Envoy on gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1259944 (https://phabricator.wikimedia.org/T420909)
[14:02:48] <wikibugs>	 (03PS2) 10Arnaudb: gerrit: use Envoy on gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1259945 (https://phabricator.wikimedia.org/T420909)
[14:04:38] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1008.eqiad.wmnet with OS trixie
[14:04:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11744048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host aux-k8s-worker1008.eqiad.wmnet with OS trixie comp...
[14:05:43] <logmsgbot>	 !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1259979|Revert "search: use the discovery ns record for the semanticsearch cluster"]]
[14:07:00] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2007.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[14:07:46] <logmsgbot>	 !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1259979|Revert "search: use the discovery ns record for the semanticsearch cluster"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:08:17] <logmsgbot>	 !log dcausse@deploy2002 dcausse: Continuing with sync
[14:08:53] <wikibugs>	 (03PS1) 10Klausman: admin_ng/knative-serving: enable emptyDir feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259985 (https://phabricator.wikimedia.org/T421105)
[14:09:56] <wikibugs>	 (CR) Cathal Mooney: [C:+2] FR-Tech Provision Script: add some checks to validate rack for vlan (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/1238379 (https://phabricator.wikimedia.org/T403035) (owner: Cathal Mooney)
[14:10:23] <bjensen>	 hey folks, a reminder that we're going to start the services switchover (not mediawiki, that'll be tomorrow) at 15:00 UTC, no impact is expected
[14:10:52] <wikibugs>	 SRE, SRE-swift-storage, Observability-Metrics: thanos swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419713#11744153 (tappof) Open→Resolved a:tappof I filed a dedicated task ({T421078}) for offloading queries to remote instances with SSD disks. I think we can safely...
[14:11:59] <wikibugs>	 (03PS5) 10Brouberol: dse-k8s: ensure helm3.17 is used everywhere post upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484)
[14:12:11] <wikibugs>	 (03Merged) 10jenkins-bot: FR-Tech Provision Script: add some checks to validate rack for vlan [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1238379 (https://phabricator.wikimedia.org/T403035) (owner: 10Cathal Mooney)
[14:12:37] <logmsgbot>	 !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1259979|Revert "search: use the discovery ns record for the semanticsearch cluster"]] (duration: 06m 54s)
[14:13:12] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[14:13:17] <wikibugs>	 (03PS3) 10Btullis: Route dse-k8s API blackbox checks to team-data-platform [puppet] - 10https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264)
[14:13:26] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[14:13:35] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[14:13:45] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) (owner: 10Btullis)
[14:14:05] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[14:15:04] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2008.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[14:16:28] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1009.eqiad.wmnet with OS trixie
[14:16:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11744184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host aux-k8s-worker1009.eqiad.wmnet with OS trixie
[14:17:35] <wikibugs>	 (03Abandoned) 10Arnaudb: gerrit: remove read-only config [puppet] - 10https://gerrit.wikimedia.org/r/1240217 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[14:17:35] <wikibugs>	 (03Abandoned) 10Arnaudb: gitlab_runner: add nftables logic [puppet] - 10https://gerrit.wikimedia.org/r/1114726 (https://phabricator.wikimedia.org/T370677) (owner: 10Arnaudb)
[14:18:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dse-k8s: ensure helm3.17 is used everywhere post upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484) (owner: 10Brouberol)
[14:19:23] <logmsgbot>	 !log otto@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply
[14:19:35] <logmsgbot>	 !log otto@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply
[14:20:28] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2008.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[14:22:23] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:22:25] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:23:12] <logmsgbot>	 !log trueg@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs-queryhammer: apply
[14:23:26] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:23:26] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[14:23:29] <wikibugs>	 (03CR) 10ArielGlenn: "This looks really good, a big improvement in readability. I've left some small tweaks/questions." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254848 (owner: 10Daniel Kinzler)
[14:23:31] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert)
[14:23:51] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] trafficserver: 50% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259077 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert)
[14:24:04] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] trafficserver: 100% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259946 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert)
[14:25:02] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] trafficserver: 100% of linkrecommendation to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259071 (https://phabricator.wikimedia.org/T418148) (owner: 10Clément Goubert)
[14:25:32] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[14:25:32] <logmsgbot>	 !log trueg@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs-queryhammer: apply
[14:25:36] <wikibugs>	 (03PS2) 10Fabian Kaelin: vector-search: add initial deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255948 (https://phabricator.wikimedia.org/T420379)
[14:26:25] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[14:26:56] <wikibugs>	 (03PS3) 10Trueg: wdqs-queryhammer: Deployment fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415)
[14:27:23] <wikibugs>	 (03CR) 10CI reject: [V:04-1] vector-search: add initial deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255948 (https://phabricator.wikimedia.org/T420379) (owner: 10Fabian Kaelin)
[14:27:47] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1009.eqiad.wmnet with reason: host reimage
[14:28:02] <wikibugs>	 (03CR) 10Trueg: wdqs-queryhammer: Deployment fixes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg)
[14:30:04] <jouncebot>	 Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1430)
[14:31:43] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[14:33:00] <wikibugs>	 (03PS6) 10Brouberol: dse-k8s: ensure helm3.17 is used everywhere post upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484)
[14:34:08] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1009.eqiad.wmnet with reason: host reimage
[14:35:50] <wikibugs>	 (03CR) 10Btullis: [C:04-1] "This needs more work." [puppet] - 10https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) (owner: 10Btullis)
[14:36:08] <wikibugs>	 (03PS4) 10Daniel Kinzler: rest gateway: lower threshold for browser detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259956 (https://phabricator.wikimedia.org/T421031)
[14:37:03] <wikibugs>	 (03PS7) 10Brouberol: dse-k8s: ensure helm3.17 is used everywhere post upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484)
[14:37:42] <wikibugs>	 (03PS3) 10Fabian Kaelin: vector-search: add initial deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255948 (https://phabricator.wikimedia.org/T420379)
[14:40:07] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] miscweb: add wmf-navigator aux ingress record [dns] - 10https://gerrit.wikimedia.org/r/1255523 (https://phabricator.wikimedia.org/T414405) (owner: 10AOkoth)
[14:41:06] <logmsgbot>	 !log aokoth@dns1004 START - running authdns-update
[14:42:47] <logmsgbot>	 !log aokoth@dns1004 END - running authdns-update
[14:43:11] <wikibugs>	 (03PS1) 10Volans: Insetup role report: update receipients [puppet] - 10https://gerrit.wikimedia.org/r/1259989
[14:43:52] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] ats: add wmf-navigator entry [puppet] - 10https://gerrit.wikimedia.org/r/1255818 (https://phabricator.wikimedia.org/T414405) (owner: 10AOkoth)
[14:44:21] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2008.codfw.wmnet with OS trixie
[14:44:47] <wikibugs>	 (03CR) 10Clément Goubert: rest gateway: lower threshold for browser detection (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259956 (https://phabricator.wikimedia.org/T421031) (owner: 10Daniel Kinzler)
[14:44:52] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#11744373 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host aux-k8s-worker2008.codfw.wmnet with OS trixie
[14:45:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] Insetup role report: update receipients [puppet] - 10https://gerrit.wikimedia.org/r/1259989 (owner: 10Volans)
[14:48:22] <wikibugs>	 (03PS1) 10Jforrester: [abstractwiki] Don't list abstract as a langlist entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259992 (https://phabricator.wikimedia.org/T420654)
[14:49:19] <wikibugs>	 (03CR) 10Volans: [C:03+2] Insetup role report: update receipients [puppet] - 10https://gerrit.wikimedia.org/r/1259989 (owner: 10Volans)
[14:50:29] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1009.eqiad.wmnet with OS trixie
[14:50:44] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11744397 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host aux-k8s-worker1009.eqiad.wmnet with OS trixie comp...
[14:51:26] <wikibugs>	 (03PS1) 10Jforrester: dumpInterwiki: Re-generate to add Abstract Wikipedia (and others) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259993 (https://phabricator.wikimedia.org/T420654)
[14:51:57] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[14:52:45] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259993 (https://phabricator.wikimedia.org/T420654) (owner: 10Jforrester)
[14:52:59] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259992 (https://phabricator.wikimedia.org/T420654) (owner: 10Jforrester)
[14:53:23] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1259967 (https://phabricator.wikimedia.org/T420916) (owner: 10Jforrester)
[14:53:44] <wikibugs>	 (03PS1) 10Jforrester: AbstractPreview: apply selected preview language lang/dir to abstract preview body [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1259994 (https://phabricator.wikimedia.org/T420687)
[14:53:53] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1259994 (https://phabricator.wikimedia.org/T420687) (owner: 10Jforrester)
[14:55:13] <wikibugs>	 (03PS1) 10JMeybohm: wikikube: Switch to IPIP mode for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436)
[14:55:15] <wikibugs>	 (03PS1) 10JMeybohm: wikikube: Enable ipip_encapsulation and mh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1259996 (https://phabricator.wikimedia.org/T420436)
[14:55:44] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wikikube: Switch to IPIP mode for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm)
[14:56:22] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2008.codfw.wmnet with reason: host reimage
[14:56:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873#11744436 (10VRiley-WMF) This drive has been replaced
[14:57:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873#11744451 (10FCeratto-WMF) a:05VRiley-WMF→03FCeratto-WMF thank you @VRiley-WMF !  I'm claiming the task, checking the host and repooling once the raid rebuilding is done.
[14:58:05] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Nice. Thanks for this." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484) (owner: 10Brouberol)
[14:59:16] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2008.codfw.wmnet with reason: host reimage
[14:59:17] <bjensen>	 !log beginning the Traffic and Services portions of the DC switchover, operational followup will be in #wikimedia-sre
[14:59:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:43] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool codfw [reason: no reason specified, no task ID specified]
[14:59:48] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] dse-k8s: ensure helm3.17 is used everywhere post upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484) (owner: 10Brouberol)
[14:59:52] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool codfw [reason: no reason specified, no task ID specified]
[15:00:04] <jouncebot>	 jelto, arnoldokoth, mutante, and arnaudb: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1500).
[15:00:13] <wikibugs>	 (03PS2) 10JMeybohm: wikikube: Switch to IPIP mode for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436)
[15:00:14] <wikibugs>	 (03PS2) 10JMeybohm: wikikube: Enable ipip_encapsulation and mh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1259996 (https://phabricator.wikimedia.org/T420436)
[15:02:14] <wikibugs>	 (03CR) 10Elukey: "$ docker-pkg build images/ --select *mcrouter*" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1259148 (https://phabricator.wikimedia.org/T420223) (owner: 10Elukey)
[15:03:55] <wikibugs>	 (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259998 (https://phabricator.wikimedia.org/T420448)
[15:06:39] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259998 (https://phabricator.wikimedia.org/T420448) (owner: 10JavierMonton)
[15:09:38] <wikibugs>	 (03PS4) 10Daniel Kinzler: rest-gateway: update readme [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254848
[15:09:53] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11744533 (10VRiley-WMF) @BCornwall Is it okay to power down this unit and investigate this issue?
[15:10:23] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11744534 (10BCornwall) @VRiley-WMF Yes, please do!
[15:10:41] <wikibugs>	 (03CR) 10Daniel Kinzler: rest-gateway: update readme (0310 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254848 (owner: 10Daniel Kinzler)
[15:11:09] <wikibugs>	 (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259998 (https://phabricator.wikimedia.org/T420448) (owner: 10JavierMonton)
[15:13:35] <wikibugs>	 (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259998 (https://phabricator.wikimedia.org/T420448) (owner: 10JavierMonton)
[15:16:11] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2008.codfw.wmnet with OS trixie
[15:16:20] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#11744574 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host aux-k8s-worker2008.codfw.wmnet with OS trixie completed: - aux-k8s-w...
[15:18:44] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2009.codfw.wmnet with OS trixie
[15:19:02] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#11744590 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host aux-k8s-worker2009.codfw.wmnet with OS trixie
[15:19:40] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1113.eqiad.wmnet with OS trixie
[15:20:16] <logmsgbot>	 !log blake@cumin1003 START - Cookbook sre.discovery.datacenter depool all services in codfw: Datacenter Switchover - T413974
[15:20:20] <stashbot>	 T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad) - https://phabricator.wikimedia.org/T413974
[15:21:02] <wikibugs>	 (03CR) 10Majavah: [C:03+2] cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134 (owner: 10Majavah)
[15:22:37] <wikibugs>	 (03PS1) 10Majavah: cloudlb: Allow specifying multiple addresses per frontend [puppet] - 10https://gerrit.wikimedia.org/r/1260004 (https://phabricator.wikimedia.org/T420921)
[15:22:55] <wikibugs>	 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11744624 (10RobH) Fixed on Friday, synced up in meeting today and no more errors.  Cathal closing the ticket on the Lumen portal.
[15:23:08] <wikibugs>	 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11744626 (10RobH) 05Open→03Resolved
[15:23:16] <wikibugs>	 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11744627 (10RobH) 05Open→03Resolved a:03RobH
[15:23:18] <wikibugs>	 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11744629 (10RobH) 05Open→03Resolved a:03RobH
[15:23:43] <wikibugs>	 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru <-> asw1-b3-magru link down March 2026 - https://phabricator.wikimedia.org/T418978#11744633 (10RobH) 05Open→03Resolved
[15:24:50] <wikibugs>	 (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm)
[15:25:35] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8332/co" [puppet] - 10https://gerrit.wikimedia.org/r/1260004 (https://phabricator.wikimedia.org/T420921) (owner: 10Majavah)
[15:26:06] <wikibugs>	 10ops-magru: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T419298#11744655 (10RobH) description: Rule: Port with no description on access switch Faults: #1: ge-0/0/47 - ge-0/0/47  This is port https://netbox.wikimedia.org/dcim/int...
[15:30:35] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2009.codfw.wmnet with reason: host reimage
[15:32:02] <wikibugs>	 (03CR) 10JMeybohm: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm)
[15:32:09] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] wikikube: Enable ipip_encapsulation and mh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1259996 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm)
[15:33:26] <wikibugs>	 (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm)
[15:34:21] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2009.codfw.wmnet with reason: host reimage
[15:36:34] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[15:36:40] <wikibugs>	 (03PS1) 10D3r1ck01: Enable JWTs for OAuth1 consumers and OAuth2 owner-only consumers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260006 (https://phabricator.wikimedia.org/T417833)
[15:38:01] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1113.eqiad.wmnet with OS trixie
[15:38:13] <wikibugs>	 06SRE, 10SRE-swift-storage: ms swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419577#11744708 (10MatthewVernon) A quick back-of-the-envelope is about 73TB for commons transcoded buckets.
[15:38:16] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1113.eqiad.wmnet with OS trixie
[15:38:49] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[15:39:37] <wikibugs>	 (03PS3) 10JMeybohm: wikikube: Switch to IPIP mode for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436)
[15:39:38] <wikibugs>	 (03PS3) 10JMeybohm: wikikube: Enable ipip_encapsulation and mh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1259996 (https://phabricator.wikimedia.org/T420436)
[15:40:01] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_debmonitor_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:40:57] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service fasw2-c8b-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#fasw2-c8b-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:43:16] <wikibugs>	 (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm)
[15:44:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:45:01] <jinxer-wm>	 RESOLVED: [10x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_debmonitor_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:45:32] <wikibugs>	 (03PS1) 10Urbanecm: cleanup: Remove UserEmailConfirmationUseHTML (defaults to true) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260011 (https://phabricator.wikimedia.org/T411147)
[15:46:33] <logmsgbot>	 !log blake@cumin1003 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all services in codfw: Datacenter Switchover - T413974
[15:46:43] <stashbot>	 T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad) - https://phabricator.wikimedia.org/T413974
[15:49:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.7% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:50:07] <wikibugs>	 (03PS4) 10Fabian Kaelin: vector-search: add initial deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255948 (https://phabricator.wikimedia.org/T420379)
[15:50:10] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] wikikube: Switch to IPIP mode for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm)
[15:50:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:50:20] <jinxer-wm>	 FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[15:50:57] <jinxer-wm>	 FIRING: [2x] CertAlmostExpired: Certificate for service fasw2-c8a-codfw.mgmt.codfw.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:50:59] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2009.codfw.wmnet with OS trixie
[15:51:05] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#11744745 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host aux-k8s-worker2009.codfw.wmnet with OS trixie completed: - aux-k8s-w...
[15:52:03] <wikibugs>	 (03CR) 10CI reject: [V:04-1] vector-search: add initial deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255948 (https://phabricator.wikimedia.org/T420379) (owner: 10Fabian Kaelin)
[15:52:42] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] wikikube: Switch to IPIP mode for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm)
[15:53:18] <wikibugs>	 (03PS1) 10Blake: mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260013
[15:53:56] <wikibugs>	 (03CR) 10Jasmine: [C:03+1] mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260013 (owner: 10Blake)
[15:54:16] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260013 (owner: 10Blake)
[15:54:30] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:54:36] <bjensen>	 !log Services portion of the datacenter switchover is complete
[15:54:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:45] <wikibugs>	 (03CR) 10Elukey: [C:03+2] site: install the aux-k8s-worker1006-9 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1258704 (https://phabricator.wikimedia.org/T393053) (owner: 10Brouberol)
[15:54:49] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1113.eqiad.wmnet with reason: host reimage
[15:55:20] <jinxer-wm>	 RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[15:55:55] <wikibugs>	 (03CR) 10Blake: [C:03+2] mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260013 (owner: 10Blake)
[15:56:01] <wikibugs>	 (03PS5) 10Fabian Kaelin: vector-search: add initial deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255948 (https://phabricator.wikimedia.org/T420379)
[15:57:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:57:54] <wikibugs>	 (03Merged) 10jenkins-bot: mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260013 (owner: 10Blake)
[15:58:20] <jinxer-wm>	 FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[15:59:38] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1113.eqiad.wmnet with reason: host reimage
[16:00:05] <jouncebot>	 jhathaway and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:02:04] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to SQL Lab for cohi - https://phabricator.wikimedia.org/T420578#11744841 (10Scott_French)
[16:02:39] <wikibugs>	 (03CR) 10Fabian Kaelin: vector-search: add initial deployment chart (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255948 (https://phabricator.wikimedia.org/T420379) (owner: 10Fabian Kaelin)
[16:03:31] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
[16:03:32] <logmsgbot>	 !log blake@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[16:03:44] <logmsgbot>	 !log blake@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[16:03:45] <logmsgbot>	 !log blake@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[16:03:52] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
[16:03:59] <logmsgbot>	 !log blake@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[16:04:19] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[16:05:03] <logmsgbot>	 !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[16:05:43] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to SQL Lab for cohi - https://phabricator.wikimedia.org/T420578#11744861 (10Scott_French) 05Open→03Resolved a:03Scott_French I'm going to optimistically resolve this, on the basis that granting membership in `nda` should allow access to Superset et al....
[16:06:23] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11744870 (10Scott_French) @kera_wmde - Great, thank you for confirming.
[16:07:08] <jinxer-wm>	 FIRING: KubernetesCalicoDown: aux-k8s-worker1006.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-aux&var-instance=aux-k8s-worker1006.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:07:57] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11744878 (10Scott_French)
[16:08:20] <jinxer-wm>	 RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[16:08:26] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:09:14] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert)
[16:09:37] <wikibugs>	 10ops-eqiad, 06DC-Ops: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421137 (10phaultfinder) 03NEW
[16:09:58] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1258094 (https://phabricator.wikimedia.org/T420615) (owner: 10Pppery)
[16:12:08] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: aux-k8s-worker1006.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-aux&var-instance=aux-k8s-worker1006.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:14:20] <jinxer-wm>	 FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[16:15:55] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] dse-k8s - unset some Flink JobManager off-heap.size override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259975 (https://phabricator.wikimedia.org/T397330) (owner: 10Ottomata)
[16:16:57] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v1.2.7 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259850 (https://phabricator.wikimedia.org/T408186) (owner: 10Santiago Faci)
[16:18:01] <wikibugs>	 (03Merged) 10jenkins-bot: dse-k8s - unset some Flink JobManager off-heap.size override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259975 (https://phabricator.wikimedia.org/T397330) (owner: 10Ottomata)
[16:19:23] <wikibugs>	 (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.2.7 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259850 (https://phabricator.wikimedia.org/T408186) (owner: 10Santiago Faci)
[16:22:03] <icinga-wm>	 RECOVERY - MegaRAID on db1170 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:22:33] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1113.eqiad.wmnet with OS trixie
[16:24:15] <logmsgbot>	 !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply
[16:24:42] <logmsgbot>	 !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply
[16:26:32] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - Status - issue on cirrussearch2080:9290 - https://phabricator.wikimedia.org/T420760#11745087 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm multiple servers found on different breakers. no correlation other than rack.
[16:26:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873#11745094 (10FCeratto-WMF) Rebuild completed: ` /usr/local/lib/nagios/plugins/get-raid-status-megacli === RaidStatus (does not include components in optimal state) === RaidStatus completed `
[16:27:02] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - Status - issue on logstash2036:9290 - https://phabricator.wikimedia.org/T420761#11745098 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm multiple servers found on different breakers. no correlation other than rack.
[16:27:27] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - Status - issue on cirrussearch2079:9290 - https://phabricator.wikimedia.org/T420762#11745105 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm multiple servers found on different breakers. no correlation other than rack.
[16:28:15] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - Status - issue on wikikube-ctrl2001:9290 - https://phabricator.wikimedia.org/T420905#11745125 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm multiple servers found on different breakers. no correlation other than rack.
[16:29:20] <jinxer-wm>	 RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[16:30:32] <wikibugs>	 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Power Supply - Status - issue on cloudbackup2003:9290 - https://phabricator.wikimedia.org/T420948#11745154 (10Jhancock.wm) there were a few power supplies that went down in the same rack. it wasn't a breaker trip. all on different channels on the PDUs. I...
[16:30:55] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on cirrussearch2079:9290 - https://phabricator.wikimedia.org/T421042#11745162 (10Jhancock.wm) multiple servers found on different breakers. no correlation other than rack.
[16:31:01] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on cirrussearch2079:9290 - https://phabricator.wikimedia.org/T421042#11745164 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[16:31:27] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on wikikube-ctrl2001:9290 - https://phabricator.wikimedia.org/T421043#11745172 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm multiple servers found on different breakers. no correlation other than rack.
[16:32:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873#11745182 (10FCeratto-WMF) Icinga is green, the MySQL dashboard looks uneventful. Pooling in slowly.
[16:32:42] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1170: Degraded drive replaced T420873
[16:32:49] <stashbot>	 T420873: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873
[16:33:26] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:34:04] <wikibugs>	 10ops-codfw, 06collaboration-services, 06DC-Ops, 10Phabricator: phab2002: SEL System Event:, System Board Front LED Panel, Critical, management controller unavailable - https://phabricator.wikimedia.org/T420228#11745195 (10Jhancock.wm) I would definitely start with that and see if it clears the issue. the...
[16:34:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421137#11745196 (10phaultfinder)
[16:35:12] <wikibugs>	 (03PS1) 10Krinkle: Enable $wgTrackMediaRequestProvenance on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260029 (https://phabricator.wikimedia.org/T414338)
[16:36:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 18.28% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:36:20] <jinxer-wm>	 FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[16:37:20] <jinxer-wm>	 FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[16:37:26] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[16:38:17] <claime>	 dcausse: Should we repool search in codfw?
[16:38:39] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp1113.*
[16:39:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11745213 (10VRiley-WMF) I checked the iDRAC to see if there were any failures showing. It doesn't seem like any hardware problems are showing. I performed a flea power drain....
[16:41:14] <inflatador>	 claime looks like the P95s are trending down, let's give it 10m? https://grafana.wikimedia.org/goto/efh012v6gabcwb?orgId=1 dcausse ebernhardson does that work for you?
[16:41:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:41:37] <claime>	 bjensen: we may want to give mw-api-ext a few more replicas
[16:42:45] <jinxer-wm>	 FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.91% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:42:58] <dcausse>	 claime: latencies are getting better, can we wait ~10m to see if this gets better?
[16:43:38] <bjensen>	 claime: any sense of how many replicas we'd like to add?
[16:45:28] <wikibugs>	 (03PS1) 10Dpogorzelski: knative: update images to 1.21.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1260031 (https://phabricator.wikimedia.org/T419722)
[16:46:20] <jinxer-wm>	 FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[16:47:03] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie
[16:47:20] <jinxer-wm>	 RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[16:47:45] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.68% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:47:49] <wikibugs>	 (03PS1) 10Cathal Mooney: Nokia SR Linux: add BGP policy for aux K8S hosts [homer/public] - 10https://gerrit.wikimedia.org/r/1260033 (https://phabricator.wikimedia.org/T371088)
[16:48:17] <wikibugs>	 (03PS14) 10JHathaway: nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah)
[16:49:16] <dcausse>	 also the search latency alert (mw@codfw to dnsdisc) is probably just noise, we should add a threshold on the qps from the source cluster
[16:49:29] <wikibugs>	 (03CR) 10JHathaway: "Looks good @taavi@wikimedia.org, made a couple of additions" [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah)
[16:49:32] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421137#11745261 (10Jclark-ctr) a:03Jclark-ctr
[16:49:39] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah)
[16:50:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421137#11745273 (10Jclark-ctr) 05Open→03Resolved
[16:50:16] <inflatador>	 cirrussearch1079 looks to be thrashing, gonna try restarting services
[16:50:46] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] "+1 for now because we need to fix existing behaviour" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259956 (https://phabricator.wikimedia.org/T421031) (owner: 10Daniel Kinzler)
[16:51:20] <jinxer-wm>	 RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[16:52:01] <wikibugs>	 (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: lower threshold for browser detection (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259956 (https://phabricator.wikimedia.org/T421031) (owner: 10Daniel Kinzler)
[16:52:08] <claime>	 bjensen: give it 10% or so for now, we'll see
[16:52:17] <claime>	 So like +24
[16:52:20] <claime>	 25*
[16:52:33] <bjensen>	 claime: ack, on it
[16:52:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11745312 (10BCornwall) Unfortunately, it's still throwing the errors. :(
[16:53:42] <wikibugs>	 (03PS1) 10Blake: mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260038
[16:54:20] <wikibugs>	 (03Merged) 10jenkins-bot: rest gateway: lower threshold for browser detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259956 (https://phabricator.wikimedia.org/T421031) (owner: 10Daniel Kinzler)
[16:55:45] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260038 (owner: 10Blake)
[16:55:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1213 - https://phabricator.wikimedia.org/T420812#11745343 (10VRiley-WMF) Hey @BTullis   We have received a replacement drive for this unit, and we are able to swap it out at any time.
[16:56:37] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "not tested but overall lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/1260033 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney)
[16:56:41] <wikibugs>	 (03CR) 10Blake: [C:03+2] mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260038 (owner: 10Blake)
[16:56:54] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Nokia SR Linux: add BGP policy for aux K8S hosts [homer/public] - 10https://gerrit.wikimedia.org/r/1260033 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney)
[16:58:06] <dcausse>	 claime: we're still seeing some traffic to search from mw@codfw (20qps) is this expected?
[16:58:27] <claime>	 dcausse: it's not depooled so yes
[16:58:30] <wikibugs>	 (03Merged) 10jenkins-bot: Nokia SR Linux: add BGP policy for aux K8S hosts [homer/public] - 10https://gerrit.wikimedia.org/r/1260033 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney)
[16:58:36] <claime>	 Traffic is depooled from codfw, but jobs still run there for instance
[16:58:36] <dcausse>	 ack
[16:59:13] <wikibugs>	 (03Merged) 10jenkins-bot: mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260038 (owner: 10Blake)
[16:59:19] <claime>	 There's some residual traffic as well
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1700)
[17:00:08] <logmsgbot>	 !log blake@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[17:00:25] <wikibugs>	 (03CR) 10Elukey: "LGTM! Could you also remove the *patch files as well?" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1260031 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski)
[17:00:26] <logmsgbot>	 !log blake@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[17:00:27] <logmsgbot>	 !log blake@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[17:00:46] <logmsgbot>	 !log blake@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[17:00:50] <wikibugs>	 (03PS1) 10Clément Goubert: api-gateway: Chart version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260043
[17:01:24] <wikibugs>	 (03CR) 10ArielGlenn: [C:03+1] "Looks great!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254848 (owner: 10Daniel Kinzler)
[17:01:35] <wikibugs>	 (03PS1) 10Kamila Součková: rest gateway: bump chart version for previous [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260044
[17:02:42] <wikibugs>	 (03Abandoned) 10Clément Goubert: api-gateway: Chart version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260043 (owner: 10Clément Goubert)
[17:03:01] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] rest gateway: bump chart version for previous [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260044 (owner: 10Kamila Součková)
[17:04:44] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] rest gateway: bump chart version for previous [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260044 (owner: 10Kamila Součková)
[17:05:07] <wikibugs>	 (03PS1) 10DCausse: Revert^2 "search: use the discovery ns record for the semanticsearch cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260045
[17:05:49] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 103367000 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:06:49] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:06:55] <wikibugs>	 (03Merged) 10jenkins-bot: rest gateway: bump chart version for previous [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260044 (owner: 10Kamila Součková)
[17:07:46] <wikibugs>	 (03CR) 10Michael Große: [C:03+1] cleanup: Remove UserEmailConfirmationUseHTML (defaults to true) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260011 (https://phabricator.wikimedia.org/T411147) (owner: 10Urbanecm)
[17:08:58] <wikibugs>	 (03CR) 10Jasmine: [C:03+1] wmnet: update CNAME records for DB masters for dc switchover [dns] - 10https://gerrit.wikimedia.org/r/1255669 (https://phabricator.wikimedia.org/T416705) (owner: 10Gerrit maintenance bot)
[17:09:57] <claime>	 bjensen: I got confused between what I asked for and what you did, I assume because of the alert in between. I think we need to bump mw-api-ext as well as mw-web
[17:10:13] <claime>	 So mw-web now done, but mw-api-ext needs 25 replicas as well
[17:10:59] <bjensen>	 claime: ah, gotcha, on it
[17:11:36] <wikibugs>	 (03PS1) 10Cathal Mooney: AUX K8s: user underscore not dash in ASN mapping [homer/public] - 10https://gerrit.wikimedia.org/r/1260046 (https://phabricator.wikimedia.org/T371088)
[17:12:39] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] jenkins: allow rsyncing of data for migrating a jenkins server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1255136 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn)
[17:13:32] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] AUX K8s: user underscore not dash in ASN mapping [homer/public] - 10https://gerrit.wikimedia.org/r/1260046 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney)
[17:13:54] <wikibugs>	 (03PS1) 10Blake: mw-api-ext: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260047
[17:14:17] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] mw-api-ext: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260047 (owner: 10Blake)
[17:14:27] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] gerrit: use Envoy on gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1259944 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb)
[17:14:44] <wikibugs>	 (03Merged) 10jenkins-bot: AUX K8s: user underscore not dash in ASN mapping [homer/public] - 10https://gerrit.wikimedia.org/r/1260046 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney)
[17:15:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.35% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:17:30] <wikibugs>	 (03CR) 10Blake: [C:03+2] mw-api-ext: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260047 (owner: 10Blake)
[17:19:29] <wikibugs>	 (03Merged) 10jenkins-bot: mw-api-ext: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260047 (owner: 10Blake)
[17:20:19] <logmsgbot>	 !log blake@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[17:20:36] <logmsgbot>	 !log blake@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[17:20:37] <logmsgbot>	 !log blake@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[17:20:58] <logmsgbot>	 !log blake@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[17:30:41] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs7003.magru.wmnet} and A:liberica
[17:32:39] <icinga-wm>	 RECOVERY - MD RAID on aqs1010 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[17:34:17] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs7003.magru.wmnet} and A:liberica
[17:36:27] <wikibugs>	 (03PS1) 10AOkoth: aux: fix location of wmf-navigator cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260055 (https://phabricator.wikimedia.org/T414405)
[17:38:13] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] aux: fix location of wmf-navigator cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260055 (https://phabricator.wikimedia.org/T414405) (owner: 10AOkoth)