[00:06:53] ops-eqiad, SRE, DC-Ops: firmware troubleshooting: Unable to PXE boot cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11741583 (Papaul) @BCornwall The server is pxe booting but failed at see below {F73537240}
[00:18:45] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1115.eqiad.wmnet with OS trixie
[00:19:27] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie
[00:27:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:32:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:37:27] PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100%
[00:38:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[00:39:17] ops-eqiad, SRE, DC-Ops: firmware troubleshooting: Unable to PXE boot cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11741630 (BCornwall) Hm. The specific output: ` Mar 24 00:30:35 in-target: You are about to format nvme1n1, namespace 0x1. Mar 24 00:30:35 in-target: WARNING: Fo...
[00:39:34] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1259309
[00:39:34] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1259309 (owner: TrainBranchBot)
[00:42:04] ops-eqiad, SRE, DC-Ops: hardware troubleshooting: Unable to PXE boot cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11741632 (BCornwall)
[00:43:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[00:43:50] ops-eqiad, SRE, DC-Ops: hardware troubleshooting: Unable to PXE boot cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11741634 (BCornwall) Marked cp1115 as "failed" in netbox
[00:44:19] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:50:35] RECOVERY - dump of db_inventory in eqiad on backupmon1001 is OK: Last dump for db_inventory at eqiad (db1215) taken on 2026-03-24 00:38:56 (3 MiB, -3.6 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:51:49] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1259309 (owner: TrainBranchBot)
[00:52:01] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1104.eqiad.wmnet with OS trixie
[00:55:57] ops-eqiad, SRE, DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11741650 (BCornwall)
[00:57:14] ops-eqiad, SRE, DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11741654 (BCornwall)
[00:59:04] ops-eqiad, SRE, DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11741661 (BCornwall)
[01:05:35] RECOVERY - dump of db_inventory in codfw on backupmon1001 is OK: Last dump for db_inventory at codfw (db2185) taken on 2026-03-24 00:36:42 (3 MiB, -3.6 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:08:53] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1104.eqiad.wmnet with reason: host reimage
[01:09:20] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1259313
[01:09:20] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1259313 (owner: TrainBranchBot)
[01:14:07] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1104.eqiad.wmnet with reason: host reimage
[01:18:51] RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 123.19 ms
[01:19:55] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:23:04] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1259313 (owner: TrainBranchBot)
[01:37:39] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1104.eqiad.wmnet with OS trixie
[01:40:57] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp1104.*
[01:56:09] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[02:00:00] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[02:00:00] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T0200)
[02:00:52] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[02:08:57] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 08m 04s)
[02:09:19] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:22] (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.21 [core] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1259332 (https://phabricator.wikimedia.org/T420479)
[02:09:24] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.21 [core] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1259332 (https://phabricator.wikimedia.org/T420479) (owner: TrainBranchBot)
[02:21:12] (Merged) jenkins-bot: Branch commit for wmf/1.46.0-wmf.21 [core] (wmf/1.46.0-wmf.21) - https://gerrit.wikimedia.org/r/1259332 (https://phabricator.wikimedia.org/T420479) (owner: TrainBranchBot)
[02:31:55] FIRING: [4x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.13 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[02:34:19] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:19] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:46:59] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T0300)
[03:01:57] (PS1) TrainBranchBot: testwikis to 1.46.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1259366 (https://phabricator.wikimedia.org/T420479)
[03:01:59] (CR) TrainBranchBot: [C:+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1259366 (https://phabricator.wikimedia.org/T420479) (owner: TrainBranchBot)
[03:03:00] (Merged) jenkins-bot: testwikis to 1.46.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1259366 (https://phabricator.wikimedia.org/T420479) (owner: TrainBranchBot)
[03:03:27] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.21 refs T420479
[03:03:32] T420479: 1.46.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T420479
[03:30:48] (PS1) Andrea Denisse: grafana: Add a SameSite attribute to cookies [puppet] - https://gerrit.wikimedia.org/r/1259382 (https://phabricator.wikimedia.org/T402844)
[03:30:48] (CR) Andrea Denisse: "Even tho the docs state that this doesn't work with Oauth [1] I tested it on the grafana-next host and I was able to log-in with our setup" [puppet] - https://gerrit.wikimedia.org/r/1259382 (https://phabricator.wikimedia.org/T402844) (owner: Andrea Denisse)
[03:34:52] (CR) Andrea Denisse: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8330/co" [puppet] - https://gerrit.wikimedia.org/r/1259254 (https://phabricator.wikimedia.org/T402844) (owner: Andrea Denisse)
[03:42:54] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.46.0-wmf.21 refs T420479 (duration: 39m 27s)
[03:42:59] T420479: 1.46.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T420479
[04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T0400)
[04:01:15] !log mwpresync@deploy2002 Pruned MediaWiki: 1.46.0-wmf.18 (duration: 01m 13s)
[04:14:19] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:19:55] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:34:19] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:38:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:00] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[06:00:00] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T0600)
[06:00:05] marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T0600).
[06:31:55] FIRING: [4x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.13 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[06:57:44] SRE, collaboration-services, Ganeti, Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11742014 (ayounsi)
[07:00:05] Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:11:12] (PS1) Ayounsi: Remove old ulsfo ganeti cluster [puppet] - https://gerrit.wikimedia.org/r/1259748 (https://phabricator.wikimedia.org/T418993)
[07:12:28] (PS2) Ayounsi: Remove old ulsfo ganeti cluster [puppet] - https://gerrit.wikimedia.org/r/1259748 (https://phabricator.wikimedia.org/T418993)
[07:40:20] (CR) Ayounsi: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1255580 (owner: Slyngshede)
[07:45:47] (CR) Slyngshede: [C:+2] C:external_clouds_vendors remove GeekyWorld [puppet] - https://gerrit.wikimedia.org/r/1255580 (owner: Slyngshede)
[07:59:44] !log Changed https://logstash.wikimedia.org/ default page back to /app/dashboards
[07:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:41] I am running the MediaWiki train
[08:01:54] (PS1) TrainBranchBot: group0 to 1.46.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1259837 (https://phabricator.wikimedia.org/T420479)
[08:01:56] (CR) TrainBranchBot: [C:+2] "Initiated by hashar@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1259837 (https://phabricator.wikimedia.org/T420479) (owner: TrainBranchBot)
[08:02:15] the script generating the Deployment calendar got bugged for some reason
[08:03:05] (Merged) jenkins-bot: group0 to 1.46.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/1259837 (https://phabricator.wikimedia.org/T420479) (owner: TrainBranchBot)
[08:13:25] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.21 refs T420479
[08:13:30] T420479: 1.46.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T420479
[08:19:35] (CR) Cathal Mooney: [C:+1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1259748 (https://phabricator.wikimedia.org/T418993) (owner: Ayounsi)
[08:20:57] (PS2) Arnaudb: gerrit: forward Gitiles traffic to gerrit-replica [puppet] - https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595)
[08:20:57] (CR) Arnaudb: "thanks, I've amended the change according to that very good idea!" [puppet] - https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) (owner: Arnaudb)
[08:21:07] (PS2) Daniel Kinzler: rest gateway: add support for centralauthtoken [deployment-charts] - https://gerrit.wikimedia.org/r/1259242 (https://phabricator.wikimedia.org/T420280)
[08:21:44] SRE-swift-storage, Ceph, Infrastructure-Foundations, Machine-Learning-Team: Move the Docker Registry's /ml prefix to S3/apus - https://phabricator.wikimedia.org/T420978#11742117 (MatthewVernon)
[08:24:24] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:25:31] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:27:07] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:27:08] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:29:54] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:31:03] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:31:58] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[08:34:52] (PS1) Brouberol: dse-k8s-eqiad: document current version of 1.23 [puppet] - https://gerrit.wikimedia.org/r/1259845 (https://phabricator.wikimedia.org/T414484)
[08:35:48] (CR) Ayounsi: [C:+2] Remove old ulsfo ganeti cluster [puppet] - https://gerrit.wikimedia.org/r/1259748 (https://phabricator.wikimedia.org/T418993) (owner: Ayounsi)
[08:36:37] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1259845 (https://phabricator.wikimedia.org/T414484) (owner: Brouberol)
[08:38:42] (PS2) Brouberol: dse-k8s-eqiad: document current version of 1.23 [puppet] - https://gerrit.wikimedia.org/r/1259845 (https://phabricator.wikimedia.org/T414484)
[08:39:41] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reimage for host ganeti4008.ulsfo.wmnet with OS bookworm
[08:39:55] SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11742187 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ayounsi@cumin1003 for host ganeti4008.ulsfo.wmnet w...
[08:42:15] (CR) Brouberol: [V:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8331/console" [puppet] - https://gerrit.wikimedia.org/r/1259845 (https://phabricator.wikimedia.org/T414484) (owner: Brouberol)
[08:43:03] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:45:12] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool depool db1170: Degraded drive T420873
[08:45:17] T420873: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873
[08:45:26] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:45:29] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db1170: Degraded drive T420873
[08:46:17] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox
[08:47:02] ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873#11742221 (FCeratto-WMF) @VRiley-WMF I depooled the host to reduce I/O load when the new drive will be rebuilt, please go ahead and replace the drive ASAP. Thank you!
[08:47:20] ops-eqiad, SRE, DBA, DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873#11742223 (FCeratto-WMF) Open→In progress p:Triage→High
[08:49:27] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:49:51] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:50:21] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old ulsfo ganeti VIP - ayounsi@cumin1003"
[08:50:31] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet, ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:51:31] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:51:51] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove old ulsfo ganeti VIP - ayounsi@cumin1003"
[08:51:51] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:52:20] jouncebot: nowandnext
[08:52:20] No deployments scheduled for the next 1 hour(s) and 7 minute(s)
[08:52:20] In 1 hour(s) and 7 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1000)
[08:52:24] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:52:45] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[08:52:51] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:56:17] (PS1) Ayounsi: Make ganeti4008 a Ganeti node on routed Ganeti/ulsfo [puppet] - https://gerrit.wikimedia.org/r/1259848 (https://phabricator.wikimedia.org/T418993)
[08:56:45] (PS1) Ilias Sarantopoulos: alertmanager: Add Slack alerts receiver for ML team [puppet] - https://gerrit.wikimedia.org/r/1259849 (https://phabricator.wikimedia.org/T421040)
[08:57:11] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:57:15] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:57:17] (PS1) Santiago Faci: Test Kitchen UI: Deploy v1.2.7 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1259850 (https://phabricator.wikimedia.org/T408186)
[08:57:21] (CR) CI reject: [V:-1] alertmanager: Add Slack alerts receiver for ML team [puppet] - https://gerrit.wikimedia.org/r/1259849 (https://phabricator.wikimedia.org/T421040) (owner: Ilias Sarantopoulos)
[08:57:55] SRE, SRE-Access-Requests, Data-Engineering: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11742249 (kera_wmde) @Scott_French Yes, my request is for Level 1.
[08:58:29] (PS1) Santiago Faci: Test Kitchen UI: Deploy v1.2.7 release to production [deployment-charts] - https://gerrit.wikimedia.org/r/1259851 (https://phabricator.wikimedia.org/T408186)
[08:58:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[08:59:19] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:00:14] (CR) Arnaudb: gerrit: add Envoy TLS termination for the CDN path (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1258976 (https://phabricator.wikimedia.org/T420909) (owner: Arnaudb)
[09:00:40] FIRING: [6x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:01:10] !log ayounsi@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4008.ulsfo.wmnet with reason: host reimage
[09:04:19] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:05:04] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4008.ulsfo.wmnet with reason: host reimage
[09:09:19] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:09:33] (PS2) Ilias Sarantopoulos: alertmanager: Add Slack alerts receiver for ML team [puppet] - https://gerrit.wikimedia.org/r/1259849 (https://phabricator.wikimedia.org/T421040)
[09:15:40] FIRING: [6x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:17:23] (PS1) JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1259859 (https://phabricator.wikimedia.org/T420448)
[09:18:20] (CR) Cathal Mooney: [C:+1] Make ganeti4008 a Ganeti node on routed Ganeti/ulsfo [puppet] - https://gerrit.wikimedia.org/r/1259848 (https://phabricator.wikimedia.org/T418993) (owner: Ayounsi)
[09:19:21] (PS4) Arnaudb: gerrit: add Envoy TLS termination for the CDN path [puppet] - https://gerrit.wikimedia.org/r/1258976 (https://phabricator.wikimedia.org/T420909)
[09:23:08] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4008.ulsfo.wmnet with OS bookworm
[09:23:22] SRE, collaboration-services, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11742307 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ayounsi@cumin1003 for host ganeti4008.ulsfo.wmnet with...
[09:24:18] (CR) Ayounsi: [C:+2] Make ganeti4008 a Ganeti node on routed Ganeti/ulsfo [puppet] - https://gerrit.wikimedia.org/r/1259848 (https://phabricator.wikimedia.org/T418993) (owner: Ayounsi)
[09:27:02] (CR) Dpogorzelski: [C:+2] alertmanager: Add Slack alerts receiver for ML team [puppet] - https://gerrit.wikimedia.org/r/1259849 (https://phabricator.wikimedia.org/T421040) (owner: Ilias Sarantopoulos)
[09:29:08] !log ayounsi@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti4008.ulsfo.wmnet to cluster ulsfo02 and group 01
[09:29:22] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4008.ulsfo.wmnet to cluster ulsfo02 and group 01
[09:29:45] (CR) A-pizzata: [C:+1] stream: mw-page-html-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1259859 (https://phabricator.wikimedia.org/T420448) (owner: JavierMonton)
[09:30:28] (CR) Arnaudb: [C:+2] gerrit: add Envoy TLS termination for the CDN path [puppet] - https://gerrit.wikimedia.org/r/1258976 (https://phabricator.wikimedia.org/T420909) (owner: Arnaudb)
[09:30:31] RESOLVED: BFDdown: BFD session down between cr3-ulsfo and 198.35.26.13 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:30:40] FIRING: [6x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:30:41] ops-codfw, DC-Ops: Power Supply - PS Redundancy - issue on cirrussearch2079:9290 - https://phabricator.wikimedia.org/T421042 (phaultfinder) NEW
[09:31:17] !log ayounsi@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti4008.ulsfo.wmnet to cluster ulsfo02 and group 01
[09:31:26] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4008.ulsfo.wmnet to cluster ulsfo02 and group 01
[09:33:30] ops-codfw, DC-Ops: Power Supply - PS Redundancy - issue on wikikube-ctrl2001:9290 - https://phabricator.wikimedia.org/T421043 (phaultfinder) NEW
[09:34:01] !log ayounsi@cumin1003 START - Cookbook sre.ganeti.addnode for new host ganeti4008.ulsfo.wmnet to cluster ulsfo02 and group 01
[09:35:27] (CR) JavierMonton: [C:+2] stream: mw-page-html-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1259859 (https://phabricator.wikimedia.org/T420448) (owner: JavierMonton)
[09:35:31] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:37:25] (Merged) jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - https://gerrit.wikimedia.org/r/1259859 (https://phabricator.wikimedia.org/T420448) (owner: JavierMonton)
[09:38:26] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:40:32] ayounsi@cumin1003 addnode (PID 1674010) is awaiting input
[09:43:27] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti4008.ulsfo.wmnet to cluster ulsfo02 and group 01
[09:46:36] SRE, collaboration-services, Ganeti, Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11742391 (ayounsi)
[09:46:44] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[09:47:01] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply
[09:52:22] (PS1) Cathal Mooney: nftables: place notrack rules into the /etc/nftables/prerouting [puppet] - https://gerrit.wikimedia.org/r/1259874 (https://phabricator.wikimedia.org/T420715)
[09:52:44] (PS1) DCausse: search: use the discovery ns record for the semanticsearch cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1259875 (https://phabricator.wikimedia.org/T414484)
[09:55:13] (PS1) Arnaudb: gerrit: use Envoy on gerrit-spare [puppet] - https://gerrit.wikimedia.org/r/1259869 (https://phabricator.wikimedia.org/T420909)
[09:57:23] ops-eqiad, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11742409 (MLechvien-WMF) @Jclark-ctr gentle follow-up on that?
[09:58:55] SRE, collaboration-services, Ganeti, Infrastructure-Foundations: Migrating ulsfo to routed Ganeti - https://phabricator.wikimedia.org/T418993#11742424 (ayounsi) Open→Resolved a:ayounsi All done here. I've also opened {T421044} to balance the VMs better.
[10:00:00] (CR) DCausse: "Sorry I just saw this patch before uploading Ie7d9b5b489a38744d73eef9d2a704af532df74af. I think we can now use the discovery ns record and" [mediawiki-config] - https://gerrit.wikimedia.org/r/1259143 (owner: Ebernhardson)
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1000)
[10:00:32] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[10:00:32] FIRING: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability
[10:00:32] ops-eqiad, DC-Ops, ServiceOps new, ServiceOps-Upgrades-Hardware: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11742436 (Jclark-ctr) >>! In T412255#11742409, @MLechvien-WMF wrote: > @Jclark-ctr @Jhancock.wm gentle follow-up on that? For shipping updates procurement...
[10:06:24] ops-eqiad, SRE, DC-Ops, serviceops-deprecated, Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11742476 (elukey) To keep archives happy - I used the following workaround in provisioning and it worked: ` # For som...
[10:07:41] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1006.eqiad.wmnet with OS trixie
[10:07:55] ops-eqiad, SRE, DC-Ops, serviceops-deprecated, Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11742483 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host aux-k8s-worker1006....
[10:08:26] RESOLVED: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:10:06] 10SRE-swift-storage, 10Ceph, 06Infrastructure-Foundations, 06Machine-Learning-Team: Move the Docker Registry's /ml prefix to S3/apus - https://phabricator.wikimedia.org/T420978#11742508 (10elukey) p:05Triage→03Medium [10:10:17] (03CR) 10Arnaudb: [C:03+2] gerrit: use Envoy on gerrit-spare [puppet] - 10https://gerrit.wikimedia.org/r/1259869 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb) [10:10:32] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:13:52] 10SRE-tools, 06ServiceOps new: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11742517 (10MLechvien-WMF) 05Open→03Resolved Tentatively resolving as this was tested and merged, please reopen if any concerns [10:16:23] (03PS1) 10Cathal Mooney: nftables: remove 'notrack' directory from /etc/nftables [puppet] - 10https://gerrit.wikimedia.org/r/1259896 (https://phabricator.wikimedia.org/T420715) [10:16:47] !log brouberol@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2006.codfw.wmnet with OS trixie [10:16:48] (03PS1) 10Ayounsi: Create INSTALL_HOSTS firewall definition [puppet] - 10https://gerrit.wikimedia.org/r/1259897 [10:16:56] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops-deprecated: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#11742530 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1003 for host aux-k8s-worker2006.codfw.wmnet wit... 
[10:17:32] !log brouberol@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2007.codfw.wmnet with OS trixie [10:17:39] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aux-k8s-worker1006.eqiad.wmnet with OS trixie [10:17:40] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops-deprecated: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#11742535 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1003 for host aux-k8s-worker2007.codfw.wmnet wit... [10:17:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops-deprecated, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11742537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host aux-k8s-worker1006.eqia... [10:18:08] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:18:10] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:18:38] (03PS1) 10Cathal Mooney: nftables: remove the file definition for /etc/nftables/notrack [puppet] - 10https://gerrit.wikimedia.org/r/1259898 (https://phabricator.wikimedia.org/T420715) [10:18:56] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:18:59] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:19:34] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259897 (owner: 10Ayounsi) [10:20:29] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:20:32] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:21:15] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:22:05] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:22:14] (03PS1) 10Arnaudb: Revert "gerrit: use Envoy on gerrit-spare" [puppet] - 10https://gerrit.wikimedia.org/r/1259899 [10:23:08] (03CR) 10Arnaudb: [C:03+2] Revert "gerrit: use Envoy on gerrit-spare" [puppet] - 10https://gerrit.wikimedia.org/r/1259899 (owner: 10Arnaudb) [10:24:23] (03PS1) 10Jcrespo: mediabackup: Make the recovery account obsolete (but not remove it yet) [puppet] - 10https://gerrit.wikimedia.org/r/1259901 (https://phabricator.wikimedia.org/T420506) [10:28:01] (03CR) 10Arnaudb: [C:03+2] gerrit: use Envoy on gerrit-spare (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1259869 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb) [10:28:16] (03CR) 10JMeybohm: "I'd stack this on top of 1259141 so that the CI does not fail here" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259158 (https://phabricator.wikimedia.org/T414484) (owner: 10Btullis) [10:28:31] (03PS1) 10Arnaudb: gerrit: use Envoy on gerrit-spare [puppet] - 10https://gerrit.wikimedia.org/r/1259902 (https://phabricator.wikimedia.org/T420909) [10:28:41] RECOVERY - Host thanos-be2006 is UP: PING OK - Packet loss = 0%, RTA = 30.54 ms [10:28:55] !log brouberol@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2006.codfw.wmnet with reason: host reimage [10:29:36] !log brouberol@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2007.codfw.wmnet with reason: host reimage [10:29:44] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:29:47] (03PS2) 10Jcrespo: mediabackup: Make the recovery 
account obsolete (but not remove it yet) [puppet] - 10https://gerrit.wikimedia.org/r/1259901 (https://phabricator.wikimedia.org/T420506) [10:29:51] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259901 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo) [10:30:00] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:30:02] (03PS13) 10Majavah: nftables::service: Improve src/dst filter handling [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) [10:30:32] RESOLVED: SwiftLowContainerAvailability: Swift eqiad container availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowContainerAvailability [10:30:32] RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [10:30:39] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [10:32:24] (03PS2) 10Ayounsi: Create INSTALL_HOSTS firewall definition [puppet] - 10https://gerrit.wikimedia.org/r/1259897 [10:32:38] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259897 (owner: 10Ayounsi) [10:32:46] (03CR) 10Jcrespo: [C:03+2] mediabackup: Make the recovery account obsolete (but not remove it yet) [puppet] - 10https://gerrit.wikimedia.org/r/1259901 (https://phabricator.wikimedia.org/T420506) (owner: 10Jcrespo) [10:33:06] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
aux-k8s-worker2006.codfw.wmnet with reason: host reimage [10:35:35] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11742608 (10Clement_Goubert) [10:35:44] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#11742609 (10Clement_Goubert) [10:36:16] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2007.codfw.wmnet with reason: host reimage [10:37:38] 06SRE, 06Traffic: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup - https://phabricator.wikimedia.org/T411097#11742616 (10MLechvien-WMF) #traffic do we know when we can do this cleanup? [10:38:26] RESOLVED: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:40:32] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:41:26] (03PS3) 10Ayounsi: Create INSTALL_HOSTS firewall definition [puppet] - 10https://gerrit.wikimedia.org/r/1259897 [10:41:56] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259897 (owner: 10Ayounsi) [10:42:00] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:42:09] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:49:23] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host aux-k8s-worker2006.codfw.wmnet with OS trixie [10:49:36] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#11742695 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1003 for host aux-k8s-worker2006.codfw.wmnet with OS trixie completed: - aux-k8... [10:51:53] (03CR) 10Majavah: "fixed in PS13." [puppet] - 10https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: 10Majavah) [10:53:15] !log brouberol@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2007.codfw.wmnet with OS trixie [10:53:21] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#11742710 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1003 for host aux-k8s-worker2007.codfw.wmnet with OS trixie completed: - aux-k8... 
[10:55:23] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:55:34] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [10:57:17] (03PS1) 10Btullis: Allow wdqs::alternatives hosts to access kafka jumbo and test [puppet] - 10https://gerrit.wikimedia.org/r/1259921 (https://phabricator.wikimedia.org/T421048) [10:57:44] (03CR) 10CI reject: [V:04-1] Allow wdqs::alternatives hosts to access kafka jumbo and test [puppet] - 10https://gerrit.wikimedia.org/r/1259921 (https://phabricator.wikimedia.org/T421048) (owner: 10Btullis) [10:59:28] (03PS2) 10Btullis: Allow wdqs::alternatives hosts to access kafka jumbo and test [puppet] - 10https://gerrit.wikimedia.org/r/1259921 (https://phabricator.wikimedia.org/T421048) [10:59:29] (03CR) 10Majavah: [C:03+1] conftool-data: move s3, x3 to new hosts (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/1256417 (https://phabricator.wikimedia.org/T409557) (owner: 10FNegri) [11:00:17] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259921 (https://phabricator.wikimedia.org/T421048) (owner: 10Btullis) [11:03:22] (03CR) 10Brouberol: [C:03+1] Allow wdqs::alternatives hosts to access kafka jumbo and test [puppet] - 10https://gerrit.wikimedia.org/r/1259921 (https://phabricator.wikimedia.org/T421048) (owner: 10Btullis) [11:03:57] (03CR) 10FNegri: [C:03+2] conftool-data: move s3, x3 to new hosts (part 1) [puppet] - 10https://gerrit.wikimedia.org/r/1256417 (https://phabricator.wikimedia.org/T409557) (owner: 10FNegri) [11:04:27] (03PS4) 10Ayounsi: Create INSTALL_HOSTS firewall definition [puppet] - 10https://gerrit.wikimedia.org/r/1259897 [11:04:58] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259897 (owner: 10Ayounsi) [11:06:59] (03CR) 10Btullis: [C:03+1] dse-k8s-eqiad: document current version of 1.23 [puppet] - 
10https://gerrit.wikimedia.org/r/1259845 (https://phabricator.wikimedia.org/T414484) (owner: 10Brouberol) [11:07:20] (03CR) 10Brouberol: [V:03+1 C:03+2] dse-k8s-eqiad: document current version of 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/1259845 (https://phabricator.wikimedia.org/T414484) (owner: 10Brouberol) [11:07:22] (03CR) 10Btullis: [C:03+2] Allow wdqs::alternatives hosts to access kafka jumbo and test [puppet] - 10https://gerrit.wikimedia.org/r/1259921 (https://phabricator.wikimedia.org/T421048) (owner: 10Btullis) [11:07:47] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1006.eqiad.wmnet with OS trixie [11:08:20] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11742757 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host aux-k8s-worker1006.eqiad.wmnet with OS trixie [11:09:18] (03PS3) 10Brouberol: dse-k8s: ensure helm3.17 is used everywhere post upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484) [11:09:36] (03CR) 10Brouberol: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484) (owner: 10Brouberol) [11:10:55] (03CR) 10Cathal Mooney: [C:03+1] "LGTM! 
Nice work :)" [puppet] - 10https://gerrit.wikimedia.org/r/1259897 (owner: 10Ayounsi) [11:11:49] (03CR) 10Ayounsi: [C:03+2] Create INSTALL_HOSTS firewall definition [puppet] - 10https://gerrit.wikimedia.org/r/1259897 (owner: 10Ayounsi) [11:14:51] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1022.eqiad.wmnet,service=s3 [11:14:58] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1023.eqiad.wmnet,service=s3 [11:15:05] (03PS4) 10Brouberol: dse-k8s: ensure helm3.17 is used everywhere post upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484) [11:17:35] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s3 [11:18:03] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s3 [11:18:37] (03CR) 10Phuedx: [C:03+1] Test Kitchen UI: Deploy v1.2.7 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259850 (https://phabricator.wikimedia.org/T408186) (owner: 10Santiago Faci) [11:18:55] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1006.eqiad.wmnet with reason: host reimage [11:19:06] (03PS1) 10Ayounsi: Define profile::installserver::dhcp::install_servers6 in cloud [puppet] - 10https://gerrit.wikimedia.org/r/1259934 [11:19:38] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1023.eqiad.wmnet,service=s3 [11:19:44] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1022.eqiad.wmnet,service=s3 [11:19:53] (03CR) 10Cathal Mooney: [C:03+1] Define profile::installserver::dhcp::install_servers6 in cloud [puppet] - 10https://gerrit.wikimedia.org/r/1259934 (owner: 10Ayounsi) [11:20:48] (03CR) 10CI reject: [V:04-1] dse-k8s: ensure helm3.17 is used everywhere post upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259838 
(https://phabricator.wikimedia.org/T414484) (owner: 10Brouberol) [11:22:02] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1022.eqiad.wmnet,service=s3 [11:22:25] (03CR) 10Ayounsi: [C:03+2] Define profile::installserver::dhcp::install_servers6 in cloud [puppet] - 10https://gerrit.wikimedia.org/r/1259934 (owner: 10Ayounsi) [11:24:47] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1006.eqiad.wmnet with reason: host reimage [11:25:53] (03PS1) 10Cathal Mooney: routed-ganeti: allow sandbox replies to HTTP from install hosts [puppet] - 10https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975) [11:26:23] (03CR) 10CI reject: [V:04-1] routed-ganeti: allow sandbox replies to HTTP from install hosts [puppet] - 10https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975) (owner: 10Cathal Mooney) [11:26:54] !log fnegri@cumin1003 conftool action : set/weight=100; selector: name=clouddb1022.eqiad.wmnet,service=s3 [11:26:58] (03PS2) 10Cathal Mooney: routed-ganeti: allow sandbox replies to HTTP from install hosts [puppet] - 10https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975) [11:27:29] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975) (owner: 10Cathal Mooney) [11:27:30] !log fnegri@cumin1003 conftool action : set/weight=100; selector: name=clouddb1023.eqiad.wmnet,service=s3 [11:27:45] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1023.eqiad.wmnet,service=s3 [11:27:55] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s3 [11:31:37] !log fnegri@cumin1003 conftool action : set/weight=100; selector: name=clouddb1022.eqiad.wmnet,service=x3 [11:31:41] !log fnegri@cumin1003 conftool action : set/weight=100; selector: 
name=clouddb1023.eqiad.wmnet,service=x3 [11:31:54] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1022.eqiad.wmnet,service=x3 [11:31:58] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1023.eqiad.wmnet,service=x3 [11:32:07] !log volans@cumin1003 START - Cookbook sre.hosts.reboot-single for host cloudcumin1001.eqiad.wmnet [11:32:15] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1016.eqiad.wmnet,service=x3 [11:32:20] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet,service=x3 [11:34:14] (03PS3) 10Cathal Mooney: routed-ganeti: allow sandbox replies to HTTP from install hosts [puppet] - 10https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975) [11:36:03] !log volans@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin1001.eqiad.wmnet [11:37:29] (03CR) 10Hnowlan: trafficserver: Add api.w.o to gateway-check.lua.conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [11:38:26] RESOLVED: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:38:42] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11742853 (10Daria-WMDE) @Scott_French thank you! Signed the NDA [11:39:12] (03CR) 10Ayounsi: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975) (owner: 10Cathal Mooney) [11:40:19] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1006.eqiad.wmnet with OS trixie [11:40:32] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:40:33] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11742856 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host aux-k8s-worker1006.eqiad.wmnet with OS trixie comp... [11:41:05] (03PS4) 10Cathal Mooney: routed-ganeti: allow sandbox replies to HTTP from install hosts [puppet] - 10https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975) [11:42:25] (03PS3) 10Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) [11:42:29] (03CR) 10Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [11:42:49] (03PS3) 10Clément Goubert: trafficserver: 100% of linkrecommendation to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259071 (https://phabricator.wikimedia.org/T418148) [11:43:02] (03PS3) 10Clément Goubert: trafficserver: 100% of device-analytics to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259075 (https://phabricator.wikimedia.org/T418147) [11:44:28] (03PS3) 10Clément Goubert: trafficserver: 50% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259077 
(https://phabricator.wikimedia.org/T418146) [11:46:00] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [11:46:02] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [11:46:11] (03PS3) 10Clément Goubert: trafficserver: 100% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259078 (https://phabricator.wikimedia.org/T418146) [11:47:23] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [11:47:24] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [11:47:25] (03PS5) 10Cathal Mooney: routed-ganeti: allow sandbox replies to HTTP from install hosts [puppet] - 10https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975) [11:47:51] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [11:47:53] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [11:48:09] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [11:48:11] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [11:48:26] RESOLVED: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:49:26] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [11:49:28] !log 
javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [11:49:44] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975) (owner: 10Cathal Mooney) [11:51:06] !log fnegri@cumin1003 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s1 [11:51:18] !log fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1017.eqiad.wmnet [11:51:19] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1017.eqiad.wmnet [11:51:47] !log fnegri@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1017.eqiad.wmnet with reason: Rebooting clouddb1017 T419960 [11:52:07] (03CR) 10Ayounsi: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975) (owner: 10Cathal Mooney) [11:53:44] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [11:53:54] !log javiermonton@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-html-content-change-enrich: apply [11:53:56] (03CR) 10Arnaudb: [C:03+2] gerrit: use Envoy on gerrit-spare [puppet] - 10https://gerrit.wikimedia.org/r/1259902 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb) [11:57:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:57:53] (03CR) 10Hnowlan: [C:04-2] "This API isn't used in public and doesn't need to be rerouted. 
It existing in parallel is an awkward side-effect of the first rollout of A" [puppet] - 10https://gerrit.wikimedia.org/r/1259075 (https://phabricator.wikimedia.org/T418147) (owner: 10Clément Goubert) [11:58:18] (03CR) 10Hnowlan: [C:03+1] trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [11:58:25] (03Abandoned) 10Clément Goubert: trafficserver: 100% of device-analytics to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259075 (https://phabricator.wikimedia.org/T418147) (owner: 10Clément Goubert) [11:58:26] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:58:53] (03CR) 10Hnowlan: [C:03+1] "lgtm bar the unique-devices line" [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1200) [12:00:16] (03CR) 10Cathal Mooney: [C:03+2] routed-ganeti: allow sandbox replies to HTTP from install hosts [puppet] - 10https://gerrit.wikimedia.org/r/1259935 (https://phabricator.wikimedia.org/T420975) (owner: 10Cathal Mooney) [12:00:44] (03PS1) 10Clément Goubert: Revert "rest-gateway: Add api.w.o device-analytics support" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259942 [12:01:41] !log fnegri@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1017.eqiad.wmnet,service=s1 [12:01:55] (03CR) 10Hnowlan: [C:03+1] Revert "rest-gateway: Add api.w.o device-analytics support" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259942 (owner: 10Clément Goubert) [12:02:18] !log 
fnegri@cumin1003 START - Cookbook sre.hosts.remove-downtime for clouddb1017.eqiad.wmnet [12:02:19] !log fnegri@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for clouddb1017.eqiad.wmnet [12:03:25] (03PS4) 10Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) [12:03:36] (03PS4) 10Clément Goubert: trafficserver: 100% of linkrecommendation to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259071 (https://phabricator.wikimedia.org/T418148) [12:04:31] (03PS5) 10Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) [12:05:36] (03PS5) 10Clément Goubert: trafficserver: 50% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259077 (https://phabricator.wikimedia.org/T418146) [12:06:47] (03PS6) 10Clément Goubert: trafficserver: 100% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259078 (https://phabricator.wikimedia.org/T418146) [12:07:50] (03Abandoned) 10Clément Goubert: trafficserver: 100% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259078 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert) [12:09:58] (03PS1) 10Clément Goubert: trafficserver: 100% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259946 (https://phabricator.wikimedia.org/T418146) [12:10:55] (03CR) 10Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [12:11:52] (03PS1) 10Arnaudb: gerrit: use Envoy on gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1259944 (https://phabricator.wikimedia.org/T420909) [12:12:25] (03PS1) 10Arnaudb: gerrit: use Envoy on gerrit [puppet] - 
10https://gerrit.wikimedia.org/r/1259945 (https://phabricator.wikimedia.org/T420909) [12:13:24] (03PS6) 10Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) [12:14:11] (03CR) 10Arnaudb: "if we merge this after 1259944 the backend config will have to be updated" [puppet] - 10https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) (owner: 10Arnaudb) [12:15:24] (03PS7) 10Clément Goubert: trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) [12:16:41] (03PS8) 10Clément Goubert: trafficserver: 50% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259077 (https://phabricator.wikimedia.org/T418146) [12:17:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27), 07Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11742983 (10Jclark-ctr) a:05BTullis→03Jclark-ctr [12:17:35] (03PS3) 10Clément Goubert: trafficserver: 100% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259946 (https://phabricator.wikimedia.org/T418146) [12:18:47] (03PS6) 10Clément Goubert: trafficserver: 100% of linkrecommendation to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259071 (https://phabricator.wikimedia.org/T418148) [12:34:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11743055 (10Jclark-ctr) a:05Jclark-ctr→03Jgreen Updated passwords and sent temp password via private message to you [12:38:56] (03PS1) 10Daniel Kinzler: rest gateway: lower threshold for browser detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259956 (https://phabricator.wikimedia.org/T421031) [12:43:00] 06SRE, 
06Infrastructure-Foundations, 10netops: Atlas no longer reachable from monitoring on routed ganeti - https://phabricator.wikimedia.org/T420975#11743113 (10cmooney) 05Open→03Resolved a:03cmooney This should now be working again. Big thanks to @ayounsi for the heavy-lifting with all the puppe... [12:50:47] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1007.eqiad.wmnet with OS trixie [12:51:04] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11743142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host aux-k8s-worker1007.eqiad.wmnet with OS trixie [12:54:14] 06SRE, 06Infrastructure-Foundations, 10Puppet CI, 10Puppet-Infrastructure, 13Patch-For-Review: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one - https://phabricator.wikimedia.org/T367399#11743147 (10hashar) 05Resolved→03Open We still have the old Puppet 5 /... [12:54:52] (03PS2) 10Daniel Kinzler: rest gateway: lower threshold for browser detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259956 (https://phabricator.wikimedia.org/T421031) [12:56:34] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1300). [13:00:04] Daimona: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:20] I can’t really deploy today, anyone else around? 
[13:00:21] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: modify records for payments servers frack - cmooney@cumin1003" [13:00:27] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: modify records for payments servers frack - cmooney@cumin1003" [13:00:27] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:00:32] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:01:59] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1007.eqiad.wmnet with reason: host reimage [13:02:49] (03PS1) 10Hashar: ci: fix typo in manage_srv [puppet] - 10https://gerrit.wikimedia.org/r/1259961 [13:03:26] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:03:38] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache payments1010.frack.eqiad.wmnet on all recursors [13:03:41] (03CR) 10Hashar: jenkins: allow rsyncing of data for migrating a jenkins server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1255136 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [13:03:42] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) payments1010.frack.eqiad.wmnet on all recursors [13:03:53] (03PS3) 10Daniel Kinzler: rest gateway: lower threshold for browser detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259956 (https://phabricator.wikimedia.org/T421031) [13:03:58] !log 
cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache payments1011.frack.eqiad.wmnet on all recursors [13:04:02] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) payments1011.frack.eqiad.wmnet on all recursors [13:04:07] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache payments1012.frack.eqiad.wmnet on all recursors [13:04:11] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) payments1012.frack.eqiad.wmnet on all recursors [13:08:31] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1007.eqiad.wmnet with reason: host reimage [13:09:40] I'm here for the deployment BTW, lost track of time, sorry! [13:09:44] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11743235 (10AnnieKim_WMDE) I'm in! Thanks for your help. [13:10:01] (03PS8) 10Andrew Bogott: cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134 (owner: 10Majavah) [13:10:05] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259134 (owner: 10Majavah) [13:10:32] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:13:01] \o/ [13:14:48] (03CR) 10Andrew Bogott: [C:03+1] cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134 (owner: 10Majavah) [13:15:32] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:15:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cmelo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259231 (https://phabricator.wikimedia.org/T419597) (owner: 10Daimona Eaytoy) [13:15:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cmelo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259237 (https://phabricator.wikimedia.org/T414149) (owner: 10Daimona Eaytoy) [13:16:56] (03Merged) 10jenkins-bot: Enable the CampaignEvents extension on all wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259231 (https://phabricator.wikimedia.org/T419597) (owner: 10Daimona Eaytoy) [13:17:09] (03Merged) 10jenkins-bot: Enable $wgCampaignEventsEnableEventGoals in prod wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259237 (https://phabricator.wikimedia.org/T414149) (owner: 10Daimona Eaytoy) [13:18:03] !log cmelo@deploy2002 Started scap sync-world: Backport for [[gerrit:1259231|Enable the CampaignEvents extension on all wikibooks (T419597)]], [[gerrit:1259237|Enable $wgCampaignEventsEnableEventGoals in prod wikis (T414149)]] [13:18:09] T419597: Enable CampaignEvents extension on Wikibooks [week of March 23] - https://phabricator.wikimedia.org/T419597 [13:18:10] T414149: Enable event goals in production - https://phabricator.wikimedia.org/T414149 [13:20:12] !log cmelo@deploy2002 cmelo, daimona: Backport for [[gerrit:1259231|Enable the CampaignEvents extension on all wikibooks (T419597)]], [[gerrit:1259237|Enable $wgCampaignEventsEnableEventGoals in prod wikis (T414149)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:20:35] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install payments101[0-2] - https://phabricator.wikimedia.org/T416252#11743394 (10Jclark-ctr) a:05Jclark-ctr→03Jgreen @jgreen you should be good @cmooney was able to assist with vlan [13:21:16] (03CR) 10Ebernhardson: [C:03+1] search: use the discovery ns record for the semanticsearch cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259875 (https://phabricator.wikimedia.org/T414484) (owner: 10DCausse) [13:22:54] (03Abandoned) 10Ebernhardson: search: Add codfw semanticsearch cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259143 (owner: 10Ebernhardson) [13:23:08] !log sudo cumin 'C:bird' "disable-puppet 'merging CR 1248385, T413740'" [13:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:13] T413740: Backport and test Bird 2.18 - https://phabricator.wikimedia.org/T413740 [13:24:16] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1007.eqiad.wmnet with OS trixie [13:24:27] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11743426 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host aux-k8s-worker1007.eqiad.wmnet with OS trixie comp... 
[13:24:53] (03CR) 10Ssingh: [V:03+1 C:03+2] Remove support for enabling Bird 2.18 selectively [puppet] - 10https://gerrit.wikimedia.org/r/1248385 (https://phabricator.wikimedia.org/T413740) (owner: 10Muehlenhoff) [13:25:05] (03CR) 10CDanis: vector-search: add initial deployment chart (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255948 (https://phabricator.wikimedia.org/T420379) (owner: 10Fabian Kaelin) [13:26:29] !log cmelo@deploy2002 cmelo, daimona: Continuing with sync [13:27:12] (03PS1) 10Jforrester: Set json object before setting Abstract Wiki Id [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1259967 (https://phabricator.wikimedia.org/T420916) [13:27:44] (03CR) 10Arnaudb: [C:03+2] "lgtm, will merge" [puppet] - 10https://gerrit.wikimedia.org/r/1259961 (owner: 10Hashar) [13:29:28] (03CR) 10Bking: [C:03+1] search: use the discovery ns record for the semanticsearch cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259875 (https://phabricator.wikimedia.org/T414484) (owner: 10DCausse) [13:30:46] !log cmelo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1259231|Enable the CampaignEvents extension on all wikibooks (T419597)]], [[gerrit:1259237|Enable $wgCampaignEventsEnableEventGoals in prod wikis (T414149)]] (duration: 12m 43s) [13:30:52] T419597: Enable CampaignEvents extension on Wikibooks [week of March 23] - https://phabricator.wikimedia.org/T419597 [13:30:53] T414149: Enable event goals in production - https://phabricator.wikimedia.org/T414149 [13:31:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259875 (https://phabricator.wikimedia.org/T414484) (owner: 10DCausse) [13:32:13] !log sudo cumin -b1 -s20 'C:bird' "run-puppet-agent --enable 'merging CR 1248385, T413740'" [13:32:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:18] T413740: Backport and test Bird 2.18 - https://phabricator.wikimedia.org/T413740 [13:33:19] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1008.eqiad.wmnet with OS trixie [13:33:26] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [13:33:35] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11743487 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host aux-k8s-worker1008.eqiad.wmnet with OS trixie [13:33:53] just added a patch to the backport window [13:35:25] o/ cmelo, Daimona are you done with your deploys? 
[13:37:31] (03CR) 10CDanis: [C:03+1] trixie: Add component/opensearch2 [puppet] - 10https://gerrit.wikimedia.org/r/1259232 (https://phabricator.wikimedia.org/T420759) (owner: 10Bking) [13:37:54] jouncebot: now [13:37:54] For the next 0 hour(s) and 22 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1300) [13:38:54] (03CR) 10CDanis: gerrit: forward Gitiles traffic to gerrit-replica (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) (owner: 10Arnaudb) [13:39:10] (03CR) 10Bking: [C:03+2] trixie: Add component/opensearch2 [puppet] - 10https://gerrit.wikimedia.org/r/1259232 (https://phabricator.wikimedia.org/T420759) (owner: 10Bking) [13:41:38] (03CR) 10CDanis: gerrit: forward Gitiles traffic to gerrit-replica (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) (owner: 10Arnaudb) [13:42:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259875 (https://phabricator.wikimedia.org/T414484) (owner: 10DCausse) [13:43:57] (03Merged) 10jenkins-bot: search: use the discovery ns record for the semanticsearch cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259875 (https://phabricator.wikimedia.org/T414484) (owner: 10DCausse) [13:44:28] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1259875|search: use the discovery ns record for the semanticsearch cluster (T414484)]] [13:44:33] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1008.eqiad.wmnet with reason: host reimage [13:44:34] T414484: Upgrade DSE clusters to kubernetes 1.31 - https://phabricator.wikimedia.org/T414484 [13:45:20] (03CR) 10Arnaudb: gerrit: forward Gitiles traffic to gerrit-replica (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1259121 
(https://phabricator.wikimedia.org/T420595) (owner: 10Arnaudb) [13:46:13] (03PS1) 10Btullis: Temporarily suspend the flink applications running in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259973 (https://phabricator.wikimedia.org/T414484) [13:46:14] (03CR) 10Arnaudb: gerrit: forward Gitiles traffic to gerrit-replica (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) (owner: 10Arnaudb) [13:46:31] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1259875|search: use the discovery ns record for the semanticsearch cluster (T414484)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:48:11] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1008.eqiad.wmnet with reason: host reimage [13:49:52] dcausse: belated yes, sorry [13:49:59] Hi we are done, sorry [13:50:03] np, thanks! :) [13:52:20] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:52:38] (03PS1) 10Ottomata: dse-k8s - unset some Flink JobManager off-heap.size override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259975 (https://phabricator.wikimedia.org/T397330) [13:54:09] (03CR) 10AKhatun: [C:03+1] dse-k8s - unset some Flink JobManager off-heap.size override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259975 (https://phabricator.wikimedia.org/T397330) (owner: 10Ottomata) [13:54:35] (03CR) 10JavierMonton: [C:03+1] dse-k8s - unset some Flink JobManager off-heap.size override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259975 (https://phabricator.wikimedia.org/T397330) (owner: 10Ottomata) [13:54:50] (03CR) 10Btullis: "Do not merge until the maintenance window on Thursday March 26th 2026 at 10:30 UTC" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259973 
(https://phabricator.wikimedia.org/T414484) (owner: 10Btullis) [13:57:38] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [13:58:00] (03PS3) 10Arnaudb: gerrit: forward Gitiles traffic to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) [13:59:09] !log dcausse@deploy2002 Sync cancelled. [13:59:30] (03PS4) 10Arnaudb: gerrit: forward Gitiles traffic to gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1259121 (https://phabricator.wikimedia.org/T420595) [13:59:30] !log jforrester@deploy2002 mwscript-k8s job started: sql --wiki=abstractwiki /srv/mediawiki/php-1.46.0-wmf.20/extensions/Translate/sql/mysql/translate_message_group_subscriptions.sql # T420656 translate_message_group_subscriptions [13:59:36] T420656: Enable Translate extension for Abstract Wikipedia - https://phabricator.wikimedia.org/T420656 [14:00:05] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1400) [14:00:17] (03PS1) 10DCausse: Revert "search: use the discovery ns record for the semanticsearch cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259979 [14:00:34] (03CR) 10DCausse: [C:03+2] Revert "search: use the discovery ns record for the semanticsearch cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259979 (owner: 10DCausse) [14:00:47] (03CR) 10Arnaudb: "added 1259121 as a dependency of this change" [puppet] - 10https://gerrit.wikimedia.org/r/1259944 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb) [14:01:29] (03Merged) 10jenkins-bot: Revert "search: use the discovery ns record for the semanticsearch cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259979 (owner: 10DCausse) [14:01:38] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2007.mgmt.codfw.wmnet 
with chassis set policy GRACEFUL_RESTART [14:02:28] (03PS2) 10Arnaudb: gerrit: use Envoy on gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1259944 (https://phabricator.wikimedia.org/T420909) [14:02:48] (03PS2) 10Arnaudb: gerrit: use Envoy on gerrit [puppet] - 10https://gerrit.wikimedia.org/r/1259945 (https://phabricator.wikimedia.org/T420909) [14:04:38] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1008.eqiad.wmnet with OS trixie [14:04:47] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11744048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host aux-k8s-worker1008.eqiad.wmnet with OS trixie comp... [14:05:43] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1259979|Revert "search: use the discovery ns record for the semanticsearch cluster"]] [14:07:00] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2007.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:07:46] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1259979|Revert "search: use the discovery ns record for the semanticsearch cluster"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:08:17] !log dcausse@deploy2002 dcausse: Continuing with sync [14:08:53] (03PS1) 10Klausman: admin_ng/knative-serving: enable emptyDir feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259985 (https://phabricator.wikimedia.org/T421105) [14:09:56] (03CR) 10Cathal Mooney: [C:03+2] FR-Tech Provision Script: add some checks to validate rack for vlan (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1238379 (https://phabricator.wikimedia.org/T403035) (owner: 10Cathal Mooney) [14:10:23] hey folks, a reminder that we're going to start the services switchover (not mediawiki, that'll be tomorrow) at 15:00 UTC, no impact is expected [14:10:52] 06SRE, 10SRE-swift-storage, 10Observability-Metrics: thanos swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419713#11744153 (10tappof) 05Open→03Resolved a:03tappof I filed a dedicated task ({T421078}) for offloading queries to remote instances with SSD disks. I think we can safely... [14:11:59] (03PS5) 10Brouberol: dse-k8s: ensure helm3.17 is used everywhere post upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484) [14:12:11] (03Merged) 10jenkins-bot: FR-Tech Provision Script: add some checks to validate rack for vlan [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1238379 (https://phabricator.wikimedia.org/T403035) (owner: 10Cathal Mooney) [14:12:37] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1259979|Revert "search: use the discovery ns record for the semanticsearch cluster"]] (duration: 06m 54s) [14:13:12] !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [14:13:17] (03PS3) 10Btullis: Route dse-k8s API blackbox checks to team-data-platform [puppet] - 10https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) [14:13:26] !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras 
(exit_code=0) rolling restart_daemons on A:netbox-canary [14:13:35] !log cmooney@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [14:13:45] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) (owner: 10Btullis) [14:14:05] !log cmooney@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [14:15:04] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2008.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:16:28] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1009.eqiad.wmnet with OS trixie [14:16:38] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11744184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host aux-k8s-worker1009.eqiad.wmnet with OS trixie [14:17:35] (03Abandoned) 10Arnaudb: gerrit: remove read-only config [puppet] - 10https://gerrit.wikimedia.org/r/1240217 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [14:17:35] (03Abandoned) 10Arnaudb: gitlab_runner: add nftables logic [puppet] - 10https://gerrit.wikimedia.org/r/1114726 (https://phabricator.wikimedia.org/T370677) (owner: 10Arnaudb) [14:18:46] (03CR) 10CI reject: [V:04-1] dse-k8s: ensure helm3.17 is used everywhere post upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484) (owner: 10Brouberol) [14:19:23] !log otto@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [14:19:35] !log otto@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-page-edit-type-enrich-next: apply [14:20:28] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) 
for host aux-k8s-worker2008.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:22:23] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:22:25] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:23:12] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs-queryhammer: apply [14:23:26] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.58.139 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [14:23:26] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:23:29] (03CR) 10ArielGlenn: "This looks really good, a big improvement in readability. I've left some small tweaks/questions."
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1254848 (owner: 10Daniel Kinzler) [14:23:31] (03CR) 10JMeybohm: [C:03+1] trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [14:23:51] (03CR) 10JMeybohm: [C:03+1] trafficserver: 50% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259077 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert) [14:24:04] (03CR) 10JMeybohm: [C:03+1] trafficserver: 100% of /core/v1 to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259946 (https://phabricator.wikimedia.org/T418146) (owner: 10Clément Goubert) [14:25:02] (03CR) 10JMeybohm: [C:03+1] trafficserver: 100% of linkrecommendation to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1259071 (https://phabricator.wikimedia.org/T418148) (owner: 10Clément Goubert) [14:25:32] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:25:32] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs-queryhammer: apply [14:25:36] (03PS2) 10Fabian Kaelin: vector-search: add initial deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255948 (https://phabricator.wikimedia.org/T420379) [14:26:25] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:26:56] (03PS3) 10Trueg: wdqs-queryhammer: Deployment fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) [14:27:23] (03CR) 10CI reject: [V:04-1] vector-search: add initial deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255948 
(https://phabricator.wikimedia.org/T420379) (owner: 10Fabian Kaelin) [14:27:47] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1009.eqiad.wmnet with reason: host reimage [14:28:02] (03CR) 10Trueg: wdqs-queryhammer: Deployment fixes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1258956 (https://phabricator.wikimedia.org/T417415) (owner: 10Trueg) [14:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1430) [14:31:43] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:33:00] (03PS6) 10Brouberol: dse-k8s: ensure helm3.17 is used everywhere post upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484) [14:34:08] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1009.eqiad.wmnet with reason: host reimage [14:35:50] (03CR) 10Btullis: [C:04-1] "This needs more work." 
[puppet] - 10https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) (owner: 10Btullis) [14:36:08] (03PS4) 10Daniel Kinzler: rest gateway: lower threshold for browser detection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259956 (https://phabricator.wikimedia.org/T421031) [14:37:03] (03PS7) 10Brouberol: dse-k8s: ensure helm3.17 is used everywhere post upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484) [14:37:42] (03PS3) 10Fabian Kaelin: vector-search: add initial deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255948 (https://phabricator.wikimedia.org/T420379) [14:40:07] (03CR) 10AOkoth: [C:03+2] miscweb: add wmf-navigator aux ingress record [dns] - 10https://gerrit.wikimedia.org/r/1255523 (https://phabricator.wikimedia.org/T414405) (owner: 10AOkoth) [14:41:06] !log aokoth@dns1004 START - running authdns-update [14:42:47] !log aokoth@dns1004 END - running authdns-update [14:43:11] (03PS1) 10Volans: Insetup role report: update receipients [puppet] - 10https://gerrit.wikimedia.org/r/1259989 [14:43:52] (03CR) 10AOkoth: [C:03+2] ats: add wmf-navigator entry [puppet] - 10https://gerrit.wikimedia.org/r/1255818 (https://phabricator.wikimedia.org/T414405) (owner: 10AOkoth) [14:44:21] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2008.codfw.wmnet with OS trixie [14:44:47] (03CR) 10Clément Goubert: rest gateway: lower threshold for browser detection (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259956 (https://phabricator.wikimedia.org/T421031) (owner: 10Daniel Kinzler) [14:44:52] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#11744373 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host aux-k8s-worker2008.codfw.wmnet with OS trixie [14:45:57] (03CR) 10Filippo 
Giunchedi: [C:03+1] Insetup role report: update receipients [puppet] - 10https://gerrit.wikimedia.org/r/1259989 (owner: 10Volans) [14:48:22] (03PS1) 10Jforrester: [abstractwiki] Don't list abstract as a langlist entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259992 (https://phabricator.wikimedia.org/T420654) [14:49:19] (03CR) 10Volans: [C:03+2] Insetup role report: update receipients [puppet] - 10https://gerrit.wikimedia.org/r/1259989 (owner: 10Volans) [14:50:29] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1009.eqiad.wmnet with OS trixie [14:50:44] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install aux-k8s-worker100[6-9].eqiad.wmnet - https://phabricator.wikimedia.org/T393053#11744397 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host aux-k8s-worker1009.eqiad.wmnet with OS trixie comp... [14:51:26] (03PS1) 10Jforrester: dumpInterwiki: Re-generate to add Abstract Wikipedia (and others) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259993 (https://phabricator.wikimedia.org/T420654) [14:51:57] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[14:52:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259993 (https://phabricator.wikimedia.org/T420654) (owner: 10Jforrester) [14:52:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1259992 (https://phabricator.wikimedia.org/T420654) (owner: 10Jforrester) [14:53:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1259967 (https://phabricator.wikimedia.org/T420916) (owner: 10Jforrester) [14:53:44] (03PS1) 10Jforrester: AbstractPreview: apply selected preview language lang/dir to abstract preview body [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1259994 (https://phabricator.wikimedia.org/T420687) [14:53:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikiLambda] (wmf/1.46.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1259994 (https://phabricator.wikimedia.org/T420687) (owner: 10Jforrester) [14:55:13] (03PS1) 10JMeybohm: wikikube: Switch to IPIP mode for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) [14:55:15] (03PS1) 10JMeybohm: wikikube: Enable ipip_encapsulation and mh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1259996 (https://phabricator.wikimedia.org/T420436) [14:55:44] (03CR) 10CI reject: [V:04-1] wikikube: Switch to IPIP mode for kube-apiserver [puppet] 
- 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm) [14:56:22] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2008.codfw.wmnet with reason: host reimage [14:56:30] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873#11744436 (10VRiley-WMF) This drive has been replaced [14:57:30] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873#11744451 (10FCeratto-WMF) a:05VRiley-WMF→03FCeratto-WMF thank you @VRiley-WMF ! I'm claiming the task, checking the host and repooling once the raid rebuilding is done. [14:58:05] (03CR) 10Btullis: [C:03+1] "Nice. Thanks for this." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484) (owner: 10Brouberol) [14:59:16] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2008.codfw.wmnet with reason: host reimage [14:59:17] !log beginning the Traffic and Services portions of the DC switchover, operational followup will be in #wikimedia-sre [14:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:43] !log blake@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool codfw [reason: no reason specified, no task ID specified] [14:59:48] (03CR) 10Brouberol: [C:03+2] dse-k8s: ensure helm3.17 is used everywhere post upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259838 (https://phabricator.wikimedia.org/T414484) (owner: 10Brouberol) [14:59:52] !log blake@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool codfw [reason: no reason specified, no task ID specified] [15:00:04] jelto, arnoldokoth, mutante, and arnaudb: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1500). [15:00:13] (03PS2) 10JMeybohm: wikikube: Switch to IPIP mode for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) [15:00:14] (03PS2) 10JMeybohm: wikikube: Enable ipip_encapsulation and mh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1259996 (https://phabricator.wikimedia.org/T420436) [15:02:14] (03CR) 10Elukey: "$ docker-pkg build images/ --select *mcrouter*" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1259148 (https://phabricator.wikimedia.org/T420223) (owner: 10Elukey) [15:03:55] (03PS1) 10JavierMonton: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259998 (https://phabricator.wikimedia.org/T420448) [15:06:39] (03CR) 10Ottomata: [C:03+1] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259998 (https://phabricator.wikimedia.org/T420448) (owner: 10JavierMonton) [15:09:38] (03PS4) 10Daniel Kinzler: rest-gateway: update readme [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254848 [15:09:53] 10ops-eqiad, 06SRE, 06DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11744533 (10VRiley-WMF) @BCornwall Is it okay to power down this unit and investigate this issue? [15:10:23] 10ops-eqiad, 06SRE, 06DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11744534 (10BCornwall) @VRiley-WMF Yes, please do! 
[15:10:41] (03CR) 10Daniel Kinzler: rest-gateway: update readme (0310 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1254848 (owner: 10Daniel Kinzler) [15:11:09] (03CR) 10JavierMonton: [C:03+2] stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259998 (https://phabricator.wikimedia.org/T420448) (owner: 10JavierMonton) [15:13:35] (03Merged) 10jenkins-bot: stream: mw-page-html-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259998 (https://phabricator.wikimedia.org/T420448) (owner: 10JavierMonton) [15:16:11] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2008.codfw.wmnet with OS trixie [15:16:20] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#11744574 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host aux-k8s-worker2008.codfw.wmnet with OS trixie completed: - aux-k8s-w... 
[15:18:44] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker2009.codfw.wmnet with OS trixie [15:19:02] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#11744590 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1003 for host aux-k8s-worker2009.codfw.wmnet with OS trixie [15:19:40] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1113.eqiad.wmnet with OS trixie [15:20:16] !log blake@cumin1003 START - Cookbook sre.discovery.datacenter depool all services in codfw: Datacenter Switchover - T413974 [15:20:20] T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad) - https://phabricator.wikimedia.org/T413974 [15:21:02] (03CR) 10Majavah: [C:03+2] cloudlb: Merge http-by-host to main http service type [puppet] - 10https://gerrit.wikimedia.org/r/1259134 (owner: 10Majavah) [15:22:37] (03PS1) 10Majavah: cloudlb: Allow specifying multiple addresses per frontend [puppet] - 10https://gerrit.wikimedia.org/r/1260004 (https://phabricator.wikimedia.org/T420921) [15:22:55] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11744624 (10RobH) Fixed on Friday, synced up in meeting today and no more errors. Cathal closing the ticket on the Lumen portal.
[15:23:08] 10ops-magru: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}) - https://phabricator.wikimedia.org/T415743#11744626 (10RobH) 05Open→03Resolved [15:23:16] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11744627 (10RobH) 05Open→03Resolved a:03RobH [15:23:18] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11744629 (10RobH) 05Open→03Resolved a:03RobH [15:23:43] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru <-> asw1-b3-magru link down March 2026 - https://phabricator.wikimedia.org/T418978#11744633 (10RobH) 05Open→03Resolved [15:24:50] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm) [15:25:35] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8332/co" [puppet] - 10https://gerrit.wikimedia.org/r/1260004 (https://phabricator.wikimedia.org/T420921) (owner: 10Majavah) [15:26:06] 10ops-magru: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T419298#11744655 (10RobH) description: Rule: Port with no description on access switch Faults: #1: ge-0/0/47 - ge-0/0/47 This is port https://netbox.wikimedia.org/dcim/int... 
[15:30:35] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker2009.codfw.wmnet with reason: host reimage [15:32:02] (03CR) 10JMeybohm: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm) [15:32:09] (03CR) 10Clément Goubert: [C:03+1] wikikube: Enable ipip_encapsulation and mh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1259996 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm) [15:33:26] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm) [15:34:21] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker2009.codfw.wmnet with reason: host reimage [15:36:34] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [15:36:40] (03PS1) 10D3r1ck01: Enable JWTs for OAuth1 consumers and OAuth2 owner-only consumers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260006 (https://phabricator.wikimedia.org/T417833) [15:38:01] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1113.eqiad.wmnet with OS trixie [15:38:13] 06SRE, 10SRE-swift-storage: ms swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419577#11744708 (10MatthewVernon) A quick back-of-the-envelope is about 73TB for commons transcoded buckets. [15:38:16] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1113.eqiad.wmnet with OS trixie [15:38:49] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[15:39:37] (03PS3) 10JMeybohm: wikikube: Switch to IPIP mode for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) [15:39:38] (03PS3) 10JMeybohm: wikikube: Enable ipip_encapsulation and mh scheduler [puppet] - 10https://gerrit.wikimedia.org/r/1259996 (https://phabricator.wikimedia.org/T420436) [15:40:01] FIRING: [10x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_debmonitor_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:40:57] FIRING: CertAlmostExpired: Certificate for service fasw2-c8b-codfw.mgmt.codfw.wmnet:32767 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#fasw2-c8b-codfw.mgmt.codfw.wmnet:32767 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:43:16] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm) [15:44:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:45:01] RESOLVED: [10x] ProbeDown: Service pki1002:443 has failed probes (http_PKI_debmonitor_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#pki1002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:45:32] (03PS1) 10Urbanecm: cleanup: Remove UserEmailConfirmationUseHTML (defaults to true) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1260011 (https://phabricator.wikimedia.org/T411147) [15:46:33] !log blake@cumin1003 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all services in codfw: Datacenter Switchover - T413974 [15:46:43] T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad) - https://phabricator.wikimedia.org/T413974 [15:49:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.7% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:50:07] (03PS4) 10Fabian Kaelin: vector-search: add initial deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255948 (https://phabricator.wikimedia.org/T420379) [15:50:10] (03CR) 10Giuseppe Lavagetto: [C:03+1] wikikube: Switch to IPIP mode for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm) [15:50:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:50:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [15:50:57] 
FIRING: [2x] CertAlmostExpired: Certificate for service fasw2-c8a-codfw.mgmt.codfw.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:50:59] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker2009.codfw.wmnet with OS trixie [15:51:05] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install aux-k8s-worker200[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T393054#11744745 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1003 for host aux-k8s-worker2009.codfw.wmnet with OS trixie completed: - aux-k8s-w... [15:52:03] (03CR) 10CI reject: [V:04-1] vector-search: add initial deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255948 (https://phabricator.wikimedia.org/T420379) (owner: 10Fabian Kaelin) [15:52:42] (03CR) 10Clément Goubert: [C:03+1] wikikube: Switch to IPIP mode for kube-apiserver [puppet] - 10https://gerrit.wikimedia.org/r/1259995 (https://phabricator.wikimedia.org/T420436) (owner: 10JMeybohm) [15:53:18] (03PS1) 10Blake: mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260013 [15:53:56] (03CR) 10Jasmine: [C:03+1] mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260013 (owner: 10Blake) [15:54:16] (03CR) 10Clément Goubert: [C:03+1] mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260013 (owner: 10Blake) [15:54:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:54:36] !log Services portion of the datacenter switchover is 
complete [15:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:45] (03CR) 10Elukey: [C:03+2] site: install the aux-k8s-worker1006-9 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1258704 (https://phabricator.wikimedia.org/T393053) (owner: 10Brouberol) [15:54:49] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1113.eqiad.wmnet with reason: host reimage [15:55:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [15:55:55] (03CR) 10Blake: [C:03+2] mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260013 (owner: 10Blake) [15:56:01] (03PS5) 10Fabian Kaelin: vector-search: add initial deployment chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255948 (https://phabricator.wikimedia.org/T420379) [15:57:40] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:57:54] (03Merged) 10jenkins-bot: mw-web: upsize for single-DC serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/1260013 (owner: 10Blake) [15:58:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [15:59:38] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1113.eqiad.wmnet with reason: host reimage [16:00:05] jhathaway and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:02:04] 06SRE, 10SRE-Access-Requests: Requesting access to SQL Lab for cohi - https://phabricator.wikimedia.org/T420578#11744841 (10Scott_French) [16:02:39] (03CR) 10Fabian Kaelin: vector-search: add initial deployment chart (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1255948 (https://phabricator.wikimedia.org/T420379) (owner: 10Fabian Kaelin) [16:03:31] !log brouberol@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [16:03:32] !log blake@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [16:03:44] !log blake@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [16:03:45] !log blake@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [16:03:52] !log brouberol@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [16:03:59] !log blake@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [16:04:19] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [16:05:03] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:05:43] 06SRE, 10SRE-Access-Requests: Requesting access to SQL Lab for cohi - https://phabricator.wikimedia.org/T420578#11744861 (10Scott_French) 05Open→03Resolved a:03Scott_French I'm going to optimistically resolve this, on the basis that granting membership in `nda` should allow access to Superset et al.... 
[16:06:23] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11744870 (10Scott_French) @kera_wmde - Great, thank you for confirming. [16:07:08] FIRING: KubernetesCalicoDown: aux-k8s-worker1006.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-aux&var-instance=aux-k8s-worker1006.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:07:57] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Superset for keren.ramirezWMDE - https://phabricator.wikimedia.org/T420896#11744878 (10Scott_French) [16:08:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [16:08:26] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:14] (03CR) 10Clément Goubert: [C:03+2] trafficserver: Add api.w.o to gateway-check.lua.conf [puppet] - 10https://gerrit.wikimedia.org/r/1259067 (https://phabricator.wikimedia.org/T418145) (owner: 10Clément Goubert) [16:09:37] 10ops-eqiad, 06DC-Ops: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421137 (10phaultfinder) 03NEW [16:09:58] (03CR) 10BCornwall: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1258094 (https://phabricator.wikimedia.org/T420615) (owner: 10Pppery) [16:12:08] RESOLVED: KubernetesCalicoDown: aux-k8s-worker1006.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-aux&var-instance=aux-k8s-worker1006.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:14:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [16:15:55] (03CR) 10Ottomata: [C:03+2] dse-k8s - unset some Flink JobManager off-heap.size override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259975 (https://phabricator.wikimedia.org/T397330) (owner: 10Ottomata) [16:16:57] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploy v1.2.7 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259850 (https://phabricator.wikimedia.org/T408186) (owner: 10Santiago Faci) [16:18:01] (03Merged) 10jenkins-bot: dse-k8s - unset some Flink JobManager off-heap.size override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259975 (https://phabricator.wikimedia.org/T397330) (owner: 10Ottomata) [16:19:23] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploy v1.2.7 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259850 (https://phabricator.wikimedia.org/T408186) (owner: 10Santiago Faci) [16:22:03] RECOVERY - MegaRAID on db1170 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:22:33] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host cp1113.eqiad.wmnet with OS trixie [16:24:15] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen-next: apply [16:24:42] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen-next: apply [16:26:32] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - Status - issue on cirrussearch2080:9290 - https://phabricator.wikimedia.org/T420760#11745087 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm multiple servers found on different breakers. no correlation other than rack. [16:26:55] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873#11745094 (10FCeratto-WMF) Rebuild completed: ` /usr/local/lib/nagios/plugins/get-raid-status-megacli === RaidStatus (does not include components in optimal state) === RaidStatus completed ` [16:27:02] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - Status - issue on logstash2036:9290 - https://phabricator.wikimedia.org/T420761#11745098 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm multiple servers found on different breakers. no correlation other than rack. [16:27:27] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - Status - issue on cirrussearch2079:9290 - https://phabricator.wikimedia.org/T420762#11745105 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm multiple servers found on different breakers. no correlation other than rack. [16:28:15] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - Status - issue on wikikube-ctrl2001:9290 - https://phabricator.wikimedia.org/T420905#11745125 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm multiple servers found on different breakers. no correlation other than rack.
[16:29:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [16:30:32] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops: Power Supply - Status - issue on cloudbackup2003:9290 - https://phabricator.wikimedia.org/T420948#11745154 (10Jhancock.wm) there were a few power supplies that went down in the same rack. it wasn't a breaker trip. all on different channels on the PDUs. I... [16:30:55] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on cirrussearch2079:9290 - https://phabricator.wikimedia.org/T421042#11745162 (10Jhancock.wm) multiple servers found on different breakers. no correlation other than rack. [16:31:01] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on cirrussearch2079:9290 - https://phabricator.wikimedia.org/T421042#11745164 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:31:27] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on wikikube-ctrl2001:9290 - https://phabricator.wikimedia.org/T421043#11745172 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm multiple servers found on different breakers. no correlation other than rack. [16:32:30] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873#11745182 (10FCeratto-WMF) Icinga is green, the MySQL dashboard looks uneventful. Pooling in slowly.
[16:32:42] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool pool db1170: Degraded drive replaced T420873 [16:32:49] T420873: Degraded RAID on db1170 - https://phabricator.wikimedia.org/T420873 [16:33:26] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:04] 10ops-codfw, 06collaboration-services, 06DC-Ops, 10Phabricator: phab2002: SEL System Event:, System Board Front LED Panel, Critical, management controller unavailable - https://phabricator.wikimedia.org/T420228#11745195 (10Jhancock.wm) I would definitely start with that and see if it clears the issue. the... [16:34:59] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421137#11745196 (10phaultfinder) [16:35:12] (03PS1) 10Krinkle: Enable $wgTrackMediaRequestProvenance on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1260029 (https://phabricator.wikimedia.org/T414338) [16:36:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 18.28% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:36:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [16:37:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [16:37:26] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [16:38:17] dcausse: Should we repool search in codfw? [16:38:39] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp1113.* [16:39:47] 10ops-eqiad, 06SRE, 06DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11745213 (10VRiley-WMF) I checked the iDRAC to see if there were any failures showing. It doesn't seem like any hardware problems are showing. I performed a flea power drain.... [16:41:14] claime looks like the P95s are trending down, let's give it 10m? https://grafana.wikimedia.org/goto/efh012v6gabcwb?orgId=1 dcausse ebernhardson does that work for you? 
[16:41:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:41:37] bjensen: we may want to give a little more replicas to mw-api-ext [16:42:45] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.91% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:42:58] claime: latencies are getting better, can we wait ~10m to see if this gets better? [16:43:38] claime: any sense of how many replicas we'd like to add? [16:45:28] (03PS1) 10Dpogorzelski: knative: update images to 1.21.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1260031 (https://phabricator.wikimedia.org/T419722) [16:46:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [16:47:03] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS trixie [16:47:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh 
[16:47:45] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.68% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:47:49] (PS1) Cathal Mooney: Nokia SR Linux: add BGP policy for aux K8S hosts [homer/public] - https://gerrit.wikimedia.org/r/1260033 (https://phabricator.wikimedia.org/T371088)
[16:48:17] (PS14) JHathaway: nftables::service: Improve src/dst filter handling [puppet] - https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: Majavah)
[16:49:16] also the alert search latency alert (mw@codfw to dnsdisc) is probably just noise, we should add a threshold on the qps from the source cluster
[16:49:29] (CR) JHathaway: "Looks good @taavi@wikimedia.org, made a couple of additions" [puppet] - https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: Majavah)
[16:49:32] ops-eqiad, SRE, DC-Ops: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421137#11745261 (Jclark-ctr) a:Jclark-ctr
[16:49:39] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1212097 (https://phabricator.wikimedia.org/T411102) (owner: Majavah)
[16:50:11] ops-eqiad, SRE, DC-Ops: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T421137#11745273 (Jclark-ctr) Open→Resolved
[16:50:16] cirrussearch1079 looks to be thrashing, gonna try restarting services
[16:50:46] (CR) Clément Goubert: [C:+1] "+1 for now because we need to fix existing behaviour" [deployment-charts] - https://gerrit.wikimedia.org/r/1259956 (https://phabricator.wikimedia.org/T421031) (owner: Daniel Kinzler)
[16:51:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[16:52:01] (CR) Daniel Kinzler: [C:+2] rest gateway: lower threshold for browser detection (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1259956 (https://phabricator.wikimedia.org/T421031) (owner: Daniel Kinzler)
[16:52:08] bjensen: give it 10% or so for now, we'll see
[16:52:17] So like +24
[16:52:20] 25*
[16:52:33] claime: ack, on it
[16:52:59] ops-eqiad, SRE, DC-Ops: hardware troubleshooting: NVMe errors on cp1115.eqiad.wmnet - https://phabricator.wikimedia.org/T421007#11745312 (BCornwall) Unfortunately, it's still throwing the errors. :(
[16:53:42] (PS1) Blake: mw-web: upsize for single-DC serving [deployment-charts] - https://gerrit.wikimedia.org/r/1260038
[16:54:20] (Merged) jenkins-bot: rest gateway: lower threshold for browser detection [deployment-charts] - https://gerrit.wikimedia.org/r/1259956 (https://phabricator.wikimedia.org/T421031) (owner: Daniel Kinzler)
[16:55:45] (CR) Clément Goubert: [C:+1] mw-web: upsize for single-DC serving [deployment-charts] - https://gerrit.wikimedia.org/r/1260038 (owner: Blake)
[16:55:55] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1213 - https://phabricator.wikimedia.org/T420812#11745343 (VRiley-WMF) Hey @BTullis We have received a replacement drive for this unit, and we are able to swap it out at any time.
[16:56:37] (CR) Ayounsi: [C:+1] "not tested but overall lgtm" [homer/public] - https://gerrit.wikimedia.org/r/1260033 (https://phabricator.wikimedia.org/T371088) (owner: Cathal Mooney)
[16:56:41] (CR) Blake: [C:+2] mw-web: upsize for single-DC serving [deployment-charts] - https://gerrit.wikimedia.org/r/1260038 (owner: Blake)
[16:56:54] (CR) Cathal Mooney: [C:+2] Nokia SR Linux: add BGP policy for aux K8S hosts [homer/public] - https://gerrit.wikimedia.org/r/1260033 (https://phabricator.wikimedia.org/T371088) (owner: Cathal Mooney)
[16:58:06] claime: we're still seeing some traffic to search from mw@codfw (20qps) is this expected?
[16:58:27] dcausse: it's not depooled so yes
[16:58:30] (Merged) jenkins-bot: Nokia SR Linux: add BGP policy for aux K8S hosts [homer/public] - https://gerrit.wikimedia.org/r/1260033 (https://phabricator.wikimedia.org/T371088) (owner: Cathal Mooney)
[16:58:36] Traffic is depooled from codfw, but jobs still run there for instance
[16:58:36] ack
[16:59:13] (Merged) jenkins-bot: mw-web: upsize for single-DC serving [deployment-charts] - https://gerrit.wikimedia.org/r/1260038 (owner: Blake)
[16:59:19] There's some residual traffic as well
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260324T1700)
[17:00:08] !log blake@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[17:00:25] (CR) Elukey: "LGTM! Could you also remove the *patch files as well?" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1260031 (https://phabricator.wikimedia.org/T419722) (owner: Dpogorzelski)
[17:00:26] !log blake@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[17:00:27] !log blake@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[17:00:46] !log blake@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[17:00:50] (PS1) Clément Goubert: api-gateway: Chart version bump [deployment-charts] - https://gerrit.wikimedia.org/r/1260043
[17:01:24] (CR) ArielGlenn: [C:+1] "Looks great!" [deployment-charts] - https://gerrit.wikimedia.org/r/1254848 (owner: Daniel Kinzler)
[17:01:35] (PS1) Kamila Součková: rest gateway: bump chart version for previous [deployment-charts] - https://gerrit.wikimedia.org/r/1260044
[17:02:42] (Abandoned) Clément Goubert: api-gateway: Chart version bump [deployment-charts] - https://gerrit.wikimedia.org/r/1260043 (owner: Clément Goubert)
[17:03:01] (CR) Clément Goubert: [C:+1] rest gateway: bump chart version for previous [deployment-charts] - https://gerrit.wikimedia.org/r/1260044 (owner: Kamila Součková)
[17:04:44] (CR) Kamila Součková: [C:+2] rest gateway: bump chart version for previous [deployment-charts] - https://gerrit.wikimedia.org/r/1260044 (owner: Kamila Součková)
[17:05:07] (PS1) DCausse: Revert^2 "search: use the discovery ns record for the semanticsearch cluster" [mediawiki-config] - https://gerrit.wikimedia.org/r/1260045
[17:05:49] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 103367000 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:06:49] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:06:55] (Merged) jenkins-bot: rest gateway: bump chart version for previous [deployment-charts] - https://gerrit.wikimedia.org/r/1260044 (owner: Kamila Součková)
[17:07:46] (CR) Michael Große: [C:+1] cleanup: Remove UserEmailConfirmationUseHTML (defaults to true) [mediawiki-config] - https://gerrit.wikimedia.org/r/1260011 (https://phabricator.wikimedia.org/T411147) (owner: Urbanecm)
[17:08:58] (CR) Jasmine: [C:+1] wmnet: update CNAME records for DB masters for dc switchover [dns] - https://gerrit.wikimedia.org/r/1255669 (https://phabricator.wikimedia.org/T416705) (owner: Gerrit maintenance bot)
[17:09:57] bjensen: I got confused between what I asked for and what you did, I assume because of the alert in between. I think we need to bump mw-api-ext as well as mw-web
[17:10:13] So mw-web now done, but mw-api-ext needs 25 replicas as well
[17:10:59] claime: ah, gotcha, on it
[17:11:36] (PS1) Cathal Mooney: AUX K8s: user underscore not dash in ASN mapping [homer/public] - https://gerrit.wikimedia.org/r/1260046 (https://phabricator.wikimedia.org/T371088)
[17:12:39] (CR) Dzahn: [V:+1 C:+2] jenkins: allow rsyncing of data for migrating a jenkins server (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1255136 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn)
[17:13:32] (CR) Cathal Mooney: [C:+2] AUX K8s: user underscore not dash in ASN mapping [homer/public] - https://gerrit.wikimedia.org/r/1260046 (https://phabricator.wikimedia.org/T371088) (owner: Cathal Mooney)
[17:13:54] (PS1) Blake: mw-api-ext: upsize for single-DC serving [deployment-charts] - https://gerrit.wikimedia.org/r/1260047
[17:14:17] (CR) Clément Goubert: [C:+1] mw-api-ext: upsize for single-DC serving [deployment-charts] - https://gerrit.wikimedia.org/r/1260047 (owner: Blake)
[17:14:27] (CR) Dzahn: [C:+1] gerrit: use Envoy on gerrit-replica [puppet] - https://gerrit.wikimedia.org/r/1259944 (https://phabricator.wikimedia.org/T420909) (owner: Arnaudb)
[17:14:44] (Merged) jenkins-bot: AUX K8s: user underscore not dash in ASN mapping [homer/public] - https://gerrit.wikimedia.org/r/1260046 (https://phabricator.wikimedia.org/T371088) (owner: Cathal Mooney)
[17:15:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 22.35% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:17:30] (CR) Blake: [C:+2] mw-api-ext: upsize for single-DC serving [deployment-charts] - https://gerrit.wikimedia.org/r/1260047 (owner: Blake)
[17:19:29] (Merged) jenkins-bot: mw-api-ext: upsize for single-DC serving [deployment-charts] - https://gerrit.wikimedia.org/r/1260047 (owner: Blake)
[17:20:19] !log blake@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[17:20:36] !log blake@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[17:20:37] !log blake@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[17:20:58] !log blake@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[17:30:41] !log brett@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs7003.magru.wmnet} and A:liberica
[17:32:39] RECOVERY - MD RAID on aqs1010 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[17:34:17] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs7003.magru.wmnet} and A:liberica
[17:36:27] (PS1) AOkoth: aux: fix location of wmf-navigator cert [deployment-charts] - https://gerrit.wikimedia.org/r/1260055 (https://phabricator.wikimedia.org/T414405)
[17:38:13] (CR) Dzahn: [C:+1] aux: fix location of wmf-navigator cert [deployment-charts] - https://gerrit.wikimedia.org/r/1260055 (https://phabricator.wikimedia.org/T414405) (owner: AOkoth)