[00:08:46] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1195082
[00:08:46] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1195082 (owner: TrainBranchBot)
[00:18:50] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:23:50] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:30:48] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1195082 (owner: TrainBranchBot)
[00:48:50] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[00:58:50] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:00:49] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:08:50] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:14:21] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 32s)
[01:18:50] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:28:50] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:33:55] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:34:00] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:34:11] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:43:50] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:13:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:24:02] FIRING: KubernetesCalicoDown: dse-k8s-worker2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[02:28:50] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[02:46:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:33:50] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:38:50] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:48:50] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:53:50] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:22:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[04:27:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[04:30:42] PROBLEM - Backup freshness on backup1014 is CRITICAL: All failures: 1 (gitlab1004), Fresh: 145 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:43:50] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:48:50] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:48:50] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:08:50] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:08:50] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:25:26] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool es1034 gradually with 4 steps - Pool es1034.eqiad.wmnet in after cloning
[05:25:27] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool es1029 gradually with 4 steps - Pool es1029.eqiad.wmnet in after cloning
[05:28:24] (PS1) Marostegui: db1249: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1195098 (https://phabricator.wikimedia.org/T406541)
[05:30:04] (CR) Marostegui: [C:+2] db1249: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1195098 (https://phabricator.wikimedia.org/T406541) (owner: Marostegui)
[05:30:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[05:30:40] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 146 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:30:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1249 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83727 and previous config saved to /var/cache/conftool/dbconfig/20251010-053040-marostegui.json
[05:38:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1249 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83728 and previous config saved to /var/cache/conftool/dbconfig/20251010-053848-root.json
[05:38:50] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:38:50] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:43:50] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:53:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1249 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83731 and previous config saved to /var/cache/conftool/dbconfig/20251010-055354-root.json
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251010T0600)
[06:09:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1249 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83734 and previous config saved to /var/cache/conftool/dbconfig/20251010-060900-root.json
[06:10:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es1034 gradually with 4 steps - Pool es1034.eqiad.wmnet in after cloning
[06:10:56] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es1034.eqiad.wmnet onto es1057.eqiad.wmnet
[06:10:56] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es1029 gradually with 4 steps - Pool es1029.eqiad.wmnet in after cloning
[06:10:57] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es1029.eqiad.wmnet onto es1052.eqiad.wmnet
[06:13:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:24:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1249 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83737 and previous config saved to /var/cache/conftool/dbconfig/20251010-062406-root.json
[06:27:47] FIRING: KubernetesCalicoDown: dse-k8s-worker2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:28:50] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[06:29:32] (PS1) Muehlenhoff: Enable Hadoop access for a-pizzata [puppet] - https://gerrit.wikimedia.org/r/1195103 (https://phabricator.wikimedia.org/T406328)
[06:30:17] (CR) CI reject: [V:-1] Enable Hadoop access for a-pizzata [puppet] - https://gerrit.wikimedia.org/r/1195103 (https://phabricator.wikimedia.org/T406328) (owner: Muehlenhoff)
[06:32:31] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Data Platform for a-pizzata - https://phabricator.wikimedia.org/T406328#11263137 (MoritzMuehlenhoff)
[06:33:38] (PS2) Muehlenhoff: Enable Hadoop access for a-pizzata [puppet] - https://gerrit.wikimedia.org/r/1195103 (https://phabricator.wikimedia.org/T406328)
[06:36:41] (CR) Muehlenhoff: [C:+2] Enable Hadoop access for a-pizzata [puppet] - https://gerrit.wikimedia.org/r/1195103 (https://phabricator.wikimedia.org/T406328) (owner: Muehlenhoff)
[06:42:27] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Data Platform for a-pizzata - https://phabricator.wikimedia.org/T406328#11263149 (MoritzMuehlenhoff) Open→Resolved >>! In T406328#11257321, @Ahoelzl wrote: > Dear SRE, I'd appreciate if you could expedite this ticket. Thank...
[06:45:05] SRE, SRE-Access-Requests: Requesting access to Data Platform for JavierMonton - https://phabricator.wikimedia.org/T406331#11263154 (MoritzMuehlenhoff)
[06:46:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:48:11] (PS1) Muehlenhoff: Enable Hadoop access for javiermonton [puppet] - https://gerrit.wikimedia.org/r/1195104 (https://phabricator.wikimedia.org/T406331)
[06:50:53] (CR) Muehlenhoff: [C:+2] Enable Hadoop access for javiermonton [puppet] - https://gerrit.wikimedia.org/r/1195104 (https://phabricator.wikimedia.org/T406331) (owner: Muehlenhoff)
[06:51:48] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Data Platform for JavierMonton - https://phabricator.wikimedia.org/T406331#11263163 (MoritzMuehlenhoff) Sorry for the delay, there was some scheduling error, this should have been processed by the weekly SRE clinic duty. @JMonton-WMF...
[06:52:29] SRE, SRE-Access-Requests: Requesting access to deployment for VolkerE - https://phabricator.wikimedia.org/T406243#11263166 (MoritzMuehlenhoff) @thcipriani This needs your approval
[06:52:45] SRE, SRE-Access-Requests: Requesting access to deployment for VolkerE - https://phabricator.wikimedia.org/T406243#11263167 (MoritzMuehlenhoff) p:Triage→Medium
[06:56:21] (PS1) Muehlenhoff: Add gengh to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1195105 (https://phabricator.wikimedia.org/T405713)
[06:59:24] (CR) Muehlenhoff: [C:+2] Add gengh to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1195105 (https://phabricator.wikimedia.org/T405713) (owner: Muehlenhoff)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251010T0700)
[07:01:14] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work: Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11263177 (Gehel)
[07:03:11] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for gengh - https://phabricator.wikimedia.org/T405713#11263181 (MoritzMuehlenhoff) In progress→Resolved a:MoritzMuehlenhoff @gengh Your access has been enabled. I'm resolving the task, if you...
[07:10:36] (PS1) Muehlenhoff: Add aramilferaxa to restricted [puppet] - https://gerrit.wikimedia.org/r/1195107 (https://phabricator.wikimedia.org/T405796)
[07:12:21] (CR) Muehlenhoff: [C:+2] Add aramilferaxa to restricted [puppet] - https://gerrit.wikimedia.org/r/1195107 (https://phabricator.wikimedia.org/T405796) (owner: Muehlenhoff)
[07:15:11] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to restricted for AramilFeraxa - https://phabricator.wikimedia.org/T405796#11263215 (MoritzMuehlenhoff) In progress→Resolved a:MoritzMuehlenhoff @MKopec Your access has been enabled. It will take 30 minutes until Puppet has...
[07:17:02] SRE, SRE-Access-Requests: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11263220 (MoritzMuehlenhoff) @thcipriani This needs your approval
[07:17:15] SRE, SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11263222 (MoritzMuehlenhoff) @thcipriani This needs your approval
[07:19:10] SRE, SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927#11263234 (MoritzMuehlenhoff) @greg This needs your approval
[07:29:55] (CR) Elukey: [C:+1] "I don't see anything against the approach, LGTM!" [homer/public] - https://gerrit.wikimedia.org/r/1193467 (https://phabricator.wikimedia.org/T402577) (owner: Cathal Mooney)
[07:42:29] ops-eqiad, SRE, DC-Ops: Power Supply Redundancy alert on db1241 - https://phabricator.wikimedia.org/T406863#11263282 (FCeratto-WMF) Thanks!
[07:50:17] SRE, SRE-Access-Requests: Requesting access to Data Platform for a-pizzata - https://phabricator.wikimedia.org/T406328#11263310 (APizzata-WMF) @MoritzMuehlenhoff perfect, thank you very much!
[08:01:20] (PS4) Elukey: sre.hosts.provision: fix issue when moving a Dell host to UEFI [cookbooks] - https://gerrit.wikimedia.org/r/1194892
[08:03:40] ops-codfw, DC-Ops, Infrastructure-Foundations: No disk boot option when moving ms-be2078 to UEFI - https://phabricator.wikimedia.org/T406964 (elukey) NEW
[08:04:19] SRE-swift-storage, Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11263342 (elukey) Tests on ms-be2078 are blocked by T406964 :(
[08:07:54] PROBLEM - Host ms-be1088 is DOWN: PING CRITICAL - Packet loss = 100%
[08:10:34] (CR) Giuseppe Lavagetto: "LGTM as a fix, see my comment on how it could be maybe simplified" [puppet] - https://gerrit.wikimedia.org/r/1195041 (owner: CDanis)
[08:11:55] ms-be1088 is me, not pooled :)
[08:19:32] SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2025/26-Q1), Patch-For-Review: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478#11263371 (fgiunchedi) cloudcephosd1050 is now running with a single nic, and at least I c...
[08:27:30] (CR) Btullis: dse-k8s-eqiad: Add records for opensearch-test and opensearch-ipoid (1 comment) [dns] - https://gerrit.wikimedia.org/r/1195048 (https://phabricator.wikimedia.org/T357753) (owner: Bking)
[08:30:50] SRE, SRE-Access-Requests: Requesting access to Data Platform for JavierMonton - https://phabricator.wikimedia.org/T406331#11263395 (JMonton-WMF) Thank you @MoritzMuehlenhoff! You are right, I missed the Kerberos part, as you said, I'd need also a Kerberos principal. Many thanks!
[08:34:53] (PS1) Btullis: Re-enable YARN and HDFS on an-worker123[3-6] [puppet] - https://gerrit.wikimedia.org/r/1195154 (https://phabricator.wikimedia.org/T398438)
[08:35:36] RECOVERY - Host ms-be1088 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[08:36:12] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1195041 (owner: CDanis)
[08:37:10] SRE-swift-storage, Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11263414 (elukey) While checking the BIOS/etc.. settings for ms-be2078 (Dell), I noticed that in the config util of the RAID contro...
[08:37:14] (PS1) Muehlenhoff: Enable Kerberos principal for javiermonton [puppet] - https://gerrit.wikimedia.org/r/1195155 (https://phabricator.wikimedia.org/T406331)
[08:39:57] (CR) Muehlenhoff: [C:+2] Enable Kerberos principal for javiermonton [puppet] - https://gerrit.wikimedia.org/r/1195155 (https://phabricator.wikimedia.org/T406331) (owner: Muehlenhoff)
[08:39:59] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Data Platform for JavierMonton - https://phabricator.wikimedia.org/T406331#11263422 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff >>! In T406331#11263395, @JMonton-WMF wrote: > You are right, I missed the Kerbero...
[08:44:08] (CR) Vgutierrez: haproxy tls_terminator template cleanups (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1195041 (owner: CDanis)
[08:46:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:48:46] (PS1) Btullis: Update dummy keytabs to reflect the current list of hosts [labs/private] - https://gerrit.wikimedia.org/r/1195156
[08:48:50] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:51:03] SRE, Infrastructure-Foundations: nodesource node22 apt mirror is broken - https://phabricator.wikimedia.org/T406623#11263465 (elukey) Checked https://deb.nodesource.com/, and the setup script mentions: ` echo "deb [arch=$arch signed-by=/usr/share/keyrings/nodesource.gpg] https://deb.nodesource.com/node_...
[08:58:31] (CR) Btullis: [V:+2 C:+2] Update dummy keytabs to reflect the current list of hosts [labs/private] - https://gerrit.wikimedia.org/r/1195156 (owner: Btullis)
[08:58:43] (CR) Btullis: [C:+2] Re-enable YARN and HDFS on an-worker123[3-6] [puppet] - https://gerrit.wikimedia.org/r/1195154 (https://phabricator.wikimedia.org/T398438) (owner: Btullis)
[09:08:50] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:10:50] (PS1) Elukey: aptrepo: fix node22 updates config and re-enable it [puppet] - https://gerrit.wikimedia.org/r/1195160 (https://phabricator.wikimedia.org/T406623)
[09:13:37] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.09.26 - 2025.10.17), Essential-Work: Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11263489 (BTullis) Open→Resolved I checked the physical disks to see which one needed to be configured: ` PD LIST : =====...
[09:17:34] SRE-tools, homer, Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#11263494 (Volans) [note for future self] If we can wait for Python 3.14 to be around in our systems then we should evaluate also the new [[ https://docs.python.org/3/library/co...
[09:19:48] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:20:23] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:32:05] (CR) Muehlenhoff: Add the node labeller binary to the package. (1 comment) [debs/amd-k8s-device-plugin] - https://gerrit.wikimedia.org/r/1194942 (https://phabricator.wikimedia.org/T373806) (owner: Elukey)
[09:34:24] !log btullis@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[09:34:56] !log btullis@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[09:38:50] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:43:50] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:13:50] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:16:12] (PS1) Marostegui: db1248: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1195181 (https://phabricator.wikimedia.org/T406541)
[10:16:46] (CR) Marostegui: [C:+2] db1248: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1195181 (https://phabricator.wikimedia.org/T406541) (owner: Marostegui)
[10:17:17] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1248.eqiad.wmnet with reason: Maintenance
[10:17:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1248 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83738 and previous config saved to /var/cache/conftool/dbconfig/20251010-101720-marostegui.json
[10:24:29] (CR) Marostegui: "For the upgrade testing you can use db1247 - keep in mind it is in production" [cookbooks] - https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: Federico Ceratto)
[10:25:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1248 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83739 and previous config saved to /var/cache/conftool/dbconfig/20251010-102502-root.json
[10:27:02] (CR) Fabfur: [C:+1] haproxy tls_terminator template cleanups [puppet] - https://gerrit.wikimedia.org/r/1195041 (owner: CDanis)
[10:27:47] FIRING: KubernetesCalicoDown: dse-k8s-worker2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:28:50] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:30:39] (CR) Marostegui: [C:-1] "Needs to be removed also from insetup" [puppet] - https://gerrit.wikimedia.org/r/1194979 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto)
[10:32:07] !log restarting acme-chief and nginx on acme-chief instances
[10:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:14] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:36:22] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#11263901 (Marostegui) Can I reimage this myself then?
[10:40:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1248 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83740 and previous config saved to /var/cache/conftool/dbconfig/20251010-104008-root.json
[10:42:55] (PS1) GergesShamon: [arwikibooks] Update logos and wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1195183
[10:44:45] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1195183 (owner: GergesShamon)
[10:47:42] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1195160 (https://phabricator.wikimedia.org/T406623) (owner: Elukey)
[10:55:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1248 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83741 and previous config saved to /var/cache/conftool/dbconfig/20251010-105514-root.json
[11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251010T0700)
[11:00:05] jelto, arnoldokoth, and mutante: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251010T1100).
[11:04:50] (PS2) Federico Ceratto: site.pp: Add es2052, remove from insetup [puppet] - https://gerrit.wikimedia.org/r/1194979 (https://phabricator.wikimedia.org/T402859)
[11:05:27] (CR) Federico Ceratto: "Ah, good spot, I'll update the runbook" [puppet] - https://gerrit.wikimedia.org/r/1194979 (https://phabricator.wikimedia.org/T402859) (owner: Federico Ceratto)
[11:10:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1248 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83742 and previous config saved to /var/cache/conftool/dbconfig/20251010-111020-root.json
[11:11:25] (PS1) Marostegui: db1243: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1195185 (https://phabricator.wikimedia.org/T406541)
[11:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:12:07] (CR) Marostegui: [C:+2] db1243: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1195185 (https://phabricator.wikimedia.org/T406541) (owner: Marostegui)
[11:13:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[11:13:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1243 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83743 and previous config saved to /var/cache/conftool/dbconfig/20251010-111306-marostegui.json
[11:13:34] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[11:13:56] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[11:14:11] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [11:15:04] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [11:15:39] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [11:16:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Change es3 eqiad master to es1028 T406488', diff saved to https://phabricator.wikimedia.org/P83744 and previous config saved to /var/cache/conftool/dbconfig/20251010-111605-marostegui.json [11:16:09] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [11:16:09] T406488: Productionize es1049 - es1057 - https://phabricator.wikimedia.org/T406488 [11:16:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Change es1 eqiad master to es1029 T406488', diff saved to https://phabricator.wikimedia.org/P83745 and previous config saved to /var/cache/conftool/dbconfig/20251010-111630-marostegui.json [11:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:16:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Change es2 eqiad master to es1030 T406488', diff saved to https://phabricator.wikimedia.org/P83746 and previous config saved to /var/cache/conftool/dbconfig/20251010-111653-marostegui.json [11:19:25] (03Abandoned) 10Reedy: labs only: Enable multiple 2FA modules and new 2FA UI in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1188890 (https://phabricator.wikimedia.org/T404029) (owner: 10Catrope) [11:21:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1243 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83747 and previous config saved 
to /var/cache/conftool/dbconfig/20251010-112126-root.json [11:23:12] (03PS1) 10Reedy: CommonSettings-labs: Remove OATHAuth config that are the same as prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195191 [11:35:52] (03PS2) 10Filippo Giunchedi: cloudceph: handle double / single NIC transition [puppet] - 10https://gerrit.wikimedia.org/r/1194967 (https://phabricator.wikimedia.org/T405478) [11:35:53] (03PS1) 10Filippo Giunchedi: interface: only bring down existing tagged interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1195192 (https://phabricator.wikimedia.org/T405478) [11:35:54] (03PS1) 10Filippo Giunchedi: interface: add pre_down_command define [puppet] - 10https://gerrit.wikimedia.org/r/1195193 (https://phabricator.wikimedia.org/T405478) [11:35:56] (03PS1) 10Filippo Giunchedi: interface: del route on interface down [puppet] - 10https://gerrit.wikimedia.org/r/1195194 (https://phabricator.wikimedia.org/T405478) [11:36:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1243 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83748 and previous config saved to /var/cache/conftool/dbconfig/20251010-113632-root.json [11:40:02] (03CR) 10Filippo Giunchedi: "Thank you for the review; I have reworked things a little to be easier to understand and operate, there's now a few prerequisite changes a" [puppet] - 10https://gerrit.wikimedia.org/r/1194967 (https://phabricator.wikimedia.org/T405478) (owner: 10Filippo Giunchedi) [11:41:41] (03PS2) 10Muehlenhoff: Add missing Cumin alias for cloudrabbit/codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1193836 [11:41:48] (03CR) 10Muehlenhoff: Add missing Cumin alias for cloudrabbit/codfw1dev (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193836 (owner: 10Muehlenhoff) [11:51:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1243 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83749 and previous config saved to 
/var/cache/conftool/dbconfig/20251010-115138-root.json [11:55:22] (03CR) 10Reedy: "Noting the English one hasn't been backported..." [extensions/WikimediaMessages] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195021 (owner: 10Jforrester) [11:55:34] (03PS1) 10Reedy: Changing end date for Board election notification [extensions/WikimediaMessages] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195197 (https://phabricator.wikimedia.org/T392232) [12:01:32] (03CR) 10Elukey: [C:03+2] aptrepo: fix node22 updates config and re-enable it [puppet] - 10https://gerrit.wikimedia.org/r/1195160 (https://phabricator.wikimedia.org/T406623) (owner: 10Elukey) [12:06:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1243 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83750 and previous config saved to /var/cache/conftool/dbconfig/20251010-120643-root.json [12:13:29] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: nodesource node22 apt mirror is broken - https://phabricator.wikimedia.org/T406623#11264142 (10elukey) 05Open→03Resolved a:03elukey ` root@apt1002:/srv/wikimedia# reprepro --component thirdparty/node22 checkupdate trixie-wikimedia Calculating... [12:27:52] (03CR) 10Cathal Mooney: [C:03+2] Nokia: adjust how we load static YAML configs [homer/public] - 10https://gerrit.wikimedia.org/r/1193467 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [12:28:07] (03PS2) 10Elukey: Add the node labeller binary to the package. [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1194942 (https://phabricator.wikimedia.org/T373806) [12:28:26] (03CR) 10Elukey: Add the node labeller binary to the package. 
(031 comment) [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1194942 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [12:29:20] (03Merged) 10jenkins-bot: Nokia: adjust how we load static YAML configs [homer/public] - 10https://gerrit.wikimedia.org/r/1193467 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [12:29:21] 06SRE, 06Infrastructure-Foundations: ganeti105[34] implementation tracking - https://phabricator.wikimedia.org/T381581#11264209 (10MoritzMuehlenhoff) 05Open→03Resolved These servers are already in service for quite a while [12:34:46] (03PS1) 10Muehlenhoff: Bitu: Add approval config for airflow-wikidata-ops [puppet] - 10https://gerrit.wikimedia.org/r/1195202 (https://phabricator.wikimedia.org/T405557) [12:38:19] (03PS1) 10Elukey: profile::amd_gpu: use a system user for the GPU node labeller [puppet] - 10https://gerrit.wikimedia.org/r/1195205 (https://phabricator.wikimedia.org/T373806) [12:38:39] (03PS2) 10Cathal Mooney: inter.link: add BGP community in esams for ddos protection [homer/public] - 10https://gerrit.wikimedia.org/r/1194988 (https://phabricator.wikimedia.org/T400984) [12:38:39] (03PS1) 10Cathal Mooney: inter.link esams: also add ddos scrubbing community to v6 prefix [homer/public] - 10https://gerrit.wikimedia.org/r/1195206 (https://phabricator.wikimedia.org/T400984) [12:40:30] (03CR) 10CI reject: [V:04-1] profile::amd_gpu: use a system user for the GPU node labeller [puppet] - 10https://gerrit.wikimedia.org/r/1195205 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [12:44:21] (03PS2) 10Elukey: profile::amd_gpu: use a system user for the GPU node labeller [puppet] - 10https://gerrit.wikimedia.org/r/1195205 (https://phabricator.wikimedia.org/T373806) [12:45:17] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7247/co" [puppet] - 10https://gerrit.wikimedia.org/r/1195205 
(https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [12:48:50] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:54:22] (03CR) 10Elukey: Replace elasticsearch lib w/ spicerack APIClient (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [12:55:17] (03CR) 10Elukey: [C:03+2] Replace elasticsearch lib w/ spicerack APIClient [software/spicerack] - 10https://gerrit.wikimedia.org/r/1167299 (https://phabricator.wikimedia.org/T390860) (owner: 10Ryan Kemper) [12:56:36] (03PS2) 10Bking: dse-k8s-eqiad: Add records for opensearch-test and opensearch-ipoid [dns] - 10https://gerrit.wikimedia.org/r/1195048 (https://phabricator.wikimedia.org/T357753) [12:56:56] (03PS1) 10Elukey: setup.py: remove the elastic dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1195208 (https://phabricator.wikimedia.org/T390860) [13:01:18] (03PS3) 10Bking: dse-k8s-eqiad: Add records for opensearch-test and opensearch-ipoid [dns] - 10https://gerrit.wikimedia.org/r/1195048 (https://phabricator.wikimedia.org/T357753) [13:01:42] (03CR) 10Bking: dse-k8s-eqiad: Add records for opensearch-test and opensearch-ipoid (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1195048 (https://phabricator.wikimedia.org/T357753) (owner: 10Bking) [13:03:50] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps2011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:08:49] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1195208 (https://phabricator.wikimedia.org/T390860) 
(owner: 10Elukey) [13:10:35] (03PS1) 10Elukey: Remove the elasticsearch dependency [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1195211 (https://phabricator.wikimedia.org/T390860) [13:10:59] (03PS1) 10Kosta Harlan: wmgMonologChannels: Set CheckUser to info level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195212 [13:11:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 13 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195212 (owner: 10Kosta Harlan) [13:11:51] (03CR) 10Dreamy Jazz: [C:03+1] wmgMonologChannels: Set CheckUser to info level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195212 (owner: 10Kosta Harlan) [13:13:42] (03CR) 10Muehlenhoff: Add the node labeller binary to the package. (031 comment) [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1194942 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [13:14:37] (03PS1) 10Marostegui: db1242: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195213 (https://phabricator.wikimedia.org/T406541) [13:14:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Degraded RAID on an-worker1235 - https://phabricator.wikimedia.org/T406293#11264305 (10Jclark-ctr) a:05BTullis→03Jclark-ctr [13:15:04] (03CR) 10CI reject: [V:04-1] db1242: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195213 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui) [13:15:27] (03PS3) 10Elukey: Add the node labeller binary to the package. [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1194942 (https://phabricator.wikimedia.org/T373806) [13:15:28] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, as long as https://gerrit.wikimedia.org/r/c/operations/debs/amd-k8s-device-plugin/+/1194942 installs the matching UID." 
[puppet] - 10https://gerrit.wikimedia.org/r/1195205 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [13:15:40] (03CR) 10Tchanders: [C:03+1] wmgMonologChannels: Set CheckUser to info level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1195212 (owner: 10Kosta Harlan) [13:16:17] (03CR) 10Elukey: Add the node labeller binary to the package. (031 comment) [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1194942 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [13:16:23] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11264319 (10bking) 05Open→03Resolved This host has been reimaged as a dse-k8s worker, so I'm closing this out. Work t... [13:17:23] (03PS2) 10Marostegui: db1242: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195213 (https://phabricator.wikimedia.org/T406541) [13:17:32] !log revert haproxykafka to v0.3.16 on cp5021 and cp7001 (T404427) [13:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:23] (03CR) 10Marostegui: [C:03+2] db1242: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1195213 (https://phabricator.wikimedia.org/T406541) (owner: 10Marostegui) [13:20:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1242.eqiad.wmnet with reason: Maintenance [13:20:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1242 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P83752 and previous config saved to /var/cache/conftool/dbconfig/20251010-132003-marostegui.json [13:20:11] (03CR) 10Muehlenhoff: "A few nits, otherwise looks good" [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1194942 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [13:20:37] 10ops-eqiad, 06SRE, 06DC-Ops, 
06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Degraded RAID on druid1011 - https://phabricator.wikimedia.org/T406394#11264336 (10Jclark-ctr) a:05Jclark-ctr→03BTullis @BTullis Replaced Failed drive and converted to non-raid state via idrac. [13:21:32] (03CR) 10Btullis: dse-k8s-eqiad: Add records for opensearch-test and opensearch-ipoid (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1195048 (https://phabricator.wikimedia.org/T357753) (owner: 10Bking) [13:22:56] (03CR) 10Klausman: [C:03+1] profile::amd_gpu: use a system user for the GPU node labeller [puppet] - 10https://gerrit.wikimedia.org/r/1195205 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [13:22:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Degraded RAID on druid1011 - https://phabricator.wikimedia.org/T406394#11264347 (10Jclark-ctr) while logged in updated idrac from 6.00.30.00 to 7.00.00.182 [13:24:46] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1195211 (https://phabricator.wikimedia.org/T390860) (owner: 10Elukey) [13:26:36] (03CR) 10Bking: dse-k8s-eqiad: Add records for opensearch-test and opensearch-ipoid (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1195048 (https://phabricator.wikimedia.org/T357753) (owner: 10Bking) [13:28:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1242 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P83753 and previous config saved to /var/cache/conftool/dbconfig/20251010-132808-root.json [13:28:39] (03CR) 10Jforrester: [C:03+2] "Yeah. :-( In retrospect, instead of cherry-picking I should have made a custom one with the en original and the already-translated de one." [extensions/WikimediaMessages] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195021 (owner: 10Jforrester) [13:29:51] (03CR) 10Jforrester: "This should include the other translations that were updated, de and it." 
[extensions/WikimediaMessages] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195197 (https://phabricator.wikimedia.org/T392232) (owner: 10Reedy) [13:30:55] (03PS3) 10Elukey: profile::amd_gpu: use a system user for the GPU node labeller [puppet] - 10https://gerrit.wikimedia.org/r/1195205 (https://phabricator.wikimedia.org/T373806) [13:31:35] (03PS4) 10Elukey: profile::amd_gpu: use a system user for the GPU node labeller [puppet] - 10https://gerrit.wikimedia.org/r/1195205 (https://phabricator.wikimedia.org/T373806) [13:31:38] (03CR) 10Elukey: profile::amd_gpu: use a system user for the GPU node labeller (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1195205 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [13:35:21] (03PS1) 10Marostegui: control-mariadb-10.11-trixie: Add to repo [software] - 10https://gerrit.wikimedia.org/r/1195217 (https://phabricator.wikimedia.org/T406981) [13:36:00] (03CR) 10Marostegui: [C:03+2] control-mariadb-10.11-trixie: Add to repo [software] - 10https://gerrit.wikimedia.org/r/1195217 (https://phabricator.wikimedia.org/T406981) (owner: 10Marostegui) [13:38:50] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:39:04] (03PS4) 10Elukey: Add the node labeller binary to the package. [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1194942 (https://phabricator.wikimedia.org/T373806) [13:39:29] (03CR) 10Elukey: Add the node labeller binary to the package. 
(032 comments) [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1194942 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [13:43:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1242 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P83754 and previous config saved to /var/cache/conftool/dbconfig/20251010-134314-root.json [13:43:51] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:23] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195221 [13:47:47] RESOLVED: KubernetesCalicoDown: dse-k8s-worker2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:55:13] (03PS1) 10TChin: [mw-enrichment] Bump to v1.42.0 and Flink 1.20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195223 (https://phabricator.wikimedia.org/T401725) [13:58:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1242 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P83755 and previous config saved to /var/cache/conftool/dbconfig/20251010-135820-root.json [13:59:16] (03CR) 10Btullis: [C:03+1] dse-k8s-eqiad: Add records for opensearch-test and opensearch-ipoid [dns] - 10https://gerrit.wikimedia.org/r/1195048 (https://phabricator.wikimedia.org/T357753) (owner: 10Bking) [14:01:28] (03PS1) 10Elukey: Don't skip elasticsearch tests anymore on older py versions. 
[software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1195224 (https://phabricator.wikimedia.org/T390860) [14:01:39] (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: Add records for opensearch-test and opensearch-ipoid [dns] - 10https://gerrit.wikimedia.org/r/1195048 (https://phabricator.wikimedia.org/T357753) (owner: 10Bking) [14:02:32] !log bking@dns1004 START - running authdns-update [14:03:43] !log bking@dns1004 END - running authdns-update [14:06:23] (03PS2) 10CDanis: haproxy tls_terminator template cleanups [puppet] - 10https://gerrit.wikimedia.org/r/1195041 [14:06:36] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1195041 (owner: 10CDanis) [14:06:53] !log elukey@cumin1003 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for sretest2001.codfw.wmnet: Renew puppet certificate - elukey@cumin1003 [14:09:53] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1195224 (https://phabricator.wikimedia.org/T390860) (owner: 10Elukey) [14:12:59] (03PS1) 10Elukey: sre.puppet.renew-cert: add a rudimentary fence for puppetservers [cookbooks] - 10https://gerrit.wikimedia.org/r/1195225 (https://phabricator.wikimedia.org/T405580) [14:13:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1242 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P83756 and previous config saved to /var/cache/conftool/dbconfig/20251010-141326-root.json [14:13:48] FIRING: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:13:51] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:14:31] 06SRE, 
06Infrastructure-Foundations, 13Patch-For-Review: Puppet host certificate problem on puppetserver1001 - https://phabricator.wikimedia.org/T405580#11264510 (10elukey) Today I've run the sre.puppet.renew-cert cookbook for sretest2001 and it worked nicely. I created https://gerrit.wikimedia.org/r/c/opera... [14:16:15] (03CR) 10CDanis: haproxy tls_terminator template cleanups (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1195041 (owner: 10CDanis) [14:19:08] (03CR) 10Kamila Součková: "I assume you mean Oct 15? Otherwise LGTM :D" [puppet] - 10https://gerrit.wikimedia.org/r/1195013 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [14:19:20] (03CR) 10Kamila Součková: [C:03+1] url_downloader: remove hcaptcha proxy configuration [puppet] - 10https://gerrit.wikimedia.org/r/1195013 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [14:19:57] (03PS2) 10Jforrester: Changing end date for Board election notification [extensions/WikimediaMessages] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195197 (https://phabricator.wikimedia.org/T392232) (owner: 10Reedy) [14:20:03] (03CR) 10Jforrester: "Done." [extensions/WikimediaMessages] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195197 (https://phabricator.wikimedia.org/T392232) (owner: 10Reedy) [14:20:54] (03PS26) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [14:23:36] (03CR) 10Vgutierrez: [C:03+1] haproxy tls_terminator template cleanups [puppet] - 10https://gerrit.wikimedia.org/r/1195041 (owner: 10CDanis) [14:24:23] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1195205 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [14:24:34] (03CR) 10Fabfur: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1195041 (owner: 10CDanis) [14:24:55] (03CR) 10Kamila Součková: "sorry for resurrecting the dead, but it's dumb question time :D" [puppet] - 10https://gerrit.wikimedia.org/r/1193126 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [14:25:31] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1194942 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [14:27:25] (03PS6) 10CDanis: WIP: ja4h lua first draft [puppet] - 10https://gerrit.wikimedia.org/r/1194934 [14:27:25] (03PS1) 10CDanis: haproxylua: add core.concat() reimpl [puppet] - 10https://gerrit.wikimedia.org/r/1195228 [14:28:21] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/1195225 (https://phabricator.wikimedia.org/T405580) (owner: 10Elukey) [14:28:50] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:29:18] (03PS2) 10CDanis: haproxylua: add core.concat() reimpl [puppet] - 10https://gerrit.wikimedia.org/r/1195228 [14:29:18] (03PS3) 10CDanis: haproxy tls_terminator template cleanups [puppet] - 10https://gerrit.wikimedia.org/r/1195041 [14:29:18] (03PS7) 10CDanis: WIP: ja4h lua first draft [puppet] - 10https://gerrit.wikimedia.org/r/1194934 [14:30:48] (03CR) 10Muehlenhoff: [C:03+1] haptcha: add new role for hCaptcha proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1193126 (https://phabricator.wikimedia.org/T405631) (owner: 10Ssingh) [14:34:29] (03CR) 10Elukey: [C:03+2] sre.puppet.renew-cert: add a rudimentary fence for puppetservers [cookbooks] - 10https://gerrit.wikimedia.org/r/1195225 (https://phabricator.wikimedia.org/T405580) 
(owner: 10Elukey) [14:35:23] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Puppet host certificate problem on puppetserver1001 - https://phabricator.wikimedia.org/T405580#11264573 (10elukey) 05Open→03Resolved a:03elukey [14:36:28] (03CR) 10Kamila Součková: "I _think_ it might be sufficient to just do this without reimages." [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: 10Jasmine) [14:36:33] (03CR) 10Cathal Mooney: [C:03+2] inter.link esams: also add ddos scrubbing community to v6 prefix [homer/public] - 10https://gerrit.wikimedia.org/r/1195206 (https://phabricator.wikimedia.org/T400984) (owner: 10Cathal Mooney) [14:37:53] (03CR) 10Kamila Součková: wikikube: Add wikikube-worker2[248-330] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: 10Jasmine) [14:44:39] (03PS4) 10CDanis: haproxy tls_terminator template cleanups [puppet] - 10https://gerrit.wikimedia.org/r/1195041 [14:44:39] (03PS8) 10CDanis: WIP: ja4h lua first draft [puppet] - 10https://gerrit.wikimedia.org/r/1194934 [14:44:39] (03PS1) 10CDanis: ja3n: use core.concat() [puppet] - 10https://gerrit.wikimedia.org/r/1195231 [14:45:04] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [14:45:10] 06SRE, 10SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927#11264581 (10greg) Approved! [14:47:46] (03CR) 10Jforrester: [C:03+1] "Let's land this now?" 
[extensions/WikimediaMessages] (wmf/1.45.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1195197 (https://phabricator.wikimedia.org/T392232) (owner: 10Reedy) [14:49:44] (03PS27) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [14:53:48] RESOLVED: PuppetFailure: Puppet has failed on cumin1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:57:07] (03PS9) 10CDanis: WIP: ja4h lua first draft [puppet] - 10https://gerrit.wikimedia.org/r/1194934 [15:04:36] 06SRE, 10Hiddenparma: FY25/26 WE4.3.2: support JA4H - https://phabricator.wikimedia.org/T406990 (10CDanis) 03NEW [15:05:20] 10SRE-SLO, 10observability, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11264617 (10elukey) @Gehel I see some moving in the sprints tags, is it being planned/worked on? I can take car... 
[15:08:50] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:02] !log btullis@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker-codfw [15:10:43] 10SRE-SLO, 10observability, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11264640 (10elukey) [15:11:08] (03PS3) 10CDanis: haproxylua: add core.concat() reimpl [puppet] - 10https://gerrit.wikimedia.org/r/1195228 (https://phabricator.wikimedia.org/T406990) [15:11:10] (03PS2) 10CDanis: ja3n: use core.concat() [puppet] - 10https://gerrit.wikimedia.org/r/1195231 [15:11:10] (03PS5) 10CDanis: haproxy tls_terminator template cleanups [puppet] - 10https://gerrit.wikimedia.org/r/1195041 (https://phabricator.wikimedia.org/T406990) [15:11:12] (03PS10) 10CDanis: haproxy: add JA4H support [puppet] - 10https://gerrit.wikimedia.org/r/1194934 (https://phabricator.wikimedia.org/T406990) [15:11:14] (03PS1) 10CDanis: haproxy: enable ja4h on cp7008 [puppet] - 10https://gerrit.wikimedia.org/r/1195234 (https://phabricator.wikimedia.org/T406990) [15:17:07] (03CR) 10Vgutierrez: [C:03+1] haproxylua: add core.concat() reimpl [puppet] - 10https://gerrit.wikimedia.org/r/1195228 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [15:17:18] (03CR) 10Vgutierrez: [C:03+1] "nice catch, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1195231 (owner: 10CDanis) [15:18:12] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1195236 [15:20:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:21:50] (03PS5) 10Elukey: Add the node labeller binary to the package. [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1194942 (https://phabricator.wikimedia.org/T373806) [15:26:57] (03CR) 10Elukey: "I had a typo in the user conf filename, fixed, the package builds :)" [debs/amd-k8s-device-plugin] - 10https://gerrit.wikimedia.org/r/1194942 (https://phabricator.wikimedia.org/T373806) (owner: 10Elukey) [15:27:36] (03Abandoned) 10Jgiannelos: changeprop: Add header with event timestamp for PCS requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160753 (https://phabricator.wikimedia.org/T397072) (owner: 10Jgiannelos) [15:30:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:31:17] FIRING: ProbeDown: Service wdqs2015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:33:50] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:17] RESOLVED: ProbeDown: Service wdqs2015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:39:56] !log btullis@cumin1003 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:dse-k8s-worker-codfw [15:41:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.09.26 - 2025.10.17), 07Essential-Work: Degraded RAID on druid1011 - https://phabricator.wikimedia.org/T406394#11264794 (10BTullis) Thanks @Jclark-ctr - Looks good. I can see that the drive showed up as `/dev/sde` and had no partition table. ` [Fri O... [15:41:32] FIRING: KubernetesCalicoDown: dse-k8s-worker2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:46:32] RESOLVED: KubernetesCalicoDown: dse-k8s-worker2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:55:30] rolling out some envoy upgrades, staging only today [15:56:16] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [15:56:25] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [15:58:54] !log rzl@deploy1003 helmfile [staging] 
START helmfile.d/services/changeprop-jobqueue: apply [15:59:02] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [16:00:02] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/chart-renderer: apply [16:00:21] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply [16:02:46] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [16:02:54] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [16:03:26] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [16:03:42] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [16:03:59] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply [16:04:16] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply [16:04:35] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/data-gateway: apply [16:04:50] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [16:05:05] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/device-analytics: apply [16:05:20] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [16:06:31] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/echostore: apply [16:06:46] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/echostore: apply [16:07:12] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/edit-analytics: apply [16:07:28] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [16:08:06] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/editor-analytics: apply [16:08:11] (03PS1) 10Cwhite: alertmanager: use psi-alerts channel ID [puppet] - 10https://gerrit.wikimedia.org/r/1195255 [16:08:22] !log rzl@deploy1003 helmfile [staging] DONE 
helmfile.d/services/editor-analytics: apply [16:08:44] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [16:09:06] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [16:09:16] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-a2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T406911#11264896 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm freak spike [16:09:26] (03CR) 10Cwhite: [C:03+2] alertmanager: use psi-alerts channel ID [puppet] - 10https://gerrit.wikimedia.org/r/1195255 (owner: 10Cwhite) [16:09:33] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [16:09:43] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [16:10:07] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [16:10:38] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [16:11:00] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: apply [16:11:30] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [16:13:37] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply [16:14:01] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [16:14:14] (03CR) 10CDanis: haproxy: add JA4H support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1194934 (https://phabricator.wikimedia.org/T406990) (owner: 10CDanis) [16:14:21] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [16:14:37] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [16:14:48] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/geo-analytics: apply [16:15:03] !log rzl@deploy1003 helmfile [staging] DONE 
helmfile.d/services/geo-analytics: apply [16:15:28] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/image-suggestion: apply [16:15:44] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [16:16:21] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/ipoid: apply [16:16:29] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/ipoid: apply [16:19:51] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [16:19:59] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [16:23:28] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply [16:25:19] (03PS1) 10CDobbins: dnsrecursor: use config dir instead of standalone file [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) [16:26:04] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [16:27:19] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/mathoid: apply [16:27:26] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/mathoid: apply [16:27:37] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/media-analytics: apply [16:27:53] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [16:28:03] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7248/console" [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins) [16:30:17] (03PS2) 10CDobbins: dnsrecursor: use config dir instead of standalone file [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) [16:30:38] (03CR) 10Cathal Mooney: [C:03+2] inter.link: add BGP community in esams for ddos protection [homer/public] - 
10https://gerrit.wikimedia.org/r/1194988 (https://phabricator.wikimedia.org/T400984) (owner: 10Cathal Mooney) [16:31:12] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7249/console" [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins) [16:31:56] (03Merged) 10jenkins-bot: inter.link: add BGP community in esams for ddos protection [homer/public] - 10https://gerrit.wikimedia.org/r/1194988 (https://phabricator.wikimedia.org/T400984) (owner: 10Cathal Mooney) [16:31:56] (03Merged) 10jenkins-bot: inter.link esams: also add ddos scrubbing community to v6 prefix [homer/public] - 10https://gerrit.wikimedia.org/r/1195206 (https://phabricator.wikimedia.org/T400984) (owner: 10Cathal Mooney) [16:31:58] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [16:33:14] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:33:15] (03PS3) 10CDobbins: dnsrecursor: use config dir instead of standalone file [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) [16:34:38] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [16:35:01] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [16:35:38] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [16:35:42] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:36:01] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/page-analytics: apply [16:36:17] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [16:36:30] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply [16:36:38] !log rzl@deploy1003 helmfile [staging] DONE 
helmfile.d/services/proton: apply [16:36:52] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/push-notifications: apply [16:37:07] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [16:37:33] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/recommendation-api: apply [16:37:41] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [16:37:55] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/sessionstore: apply [16:38:10] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [16:38:38] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox: apply [16:39:04] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox: apply [16:39:46] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [16:39:57] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [16:40:26] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-media: apply [16:40:38] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [16:41:17] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [16:41:32] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [16:42:52] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [16:43:10] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [16:43:30] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-video: apply [16:43:52] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [16:45:09] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [16:45:27] (03PS4) 10CDobbins: dnsrecursor: use config dir instead of 
standalone file [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) [16:45:39] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [16:45:55] (03CR) 10CI reject: [V:04-1] dnsrecursor: use config dir instead of standalone file [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins) [16:46:12] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/termbox: apply [16:46:19] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/termbox: apply [16:46:43] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [16:46:49] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [16:47:43] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/toolhub: apply [16:47:51] (03PS5) 10CDobbins: dnsrecursor: use config dir instead of standalone file [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) [16:48:04] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/toolhub: apply [16:48:21] (03CR) 10CI reject: [V:04-1] dnsrecursor: use config dir instead of standalone file [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins) [16:48:49] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [16:49:01] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [16:49:04] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins) [16:49:18] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [16:49:22] FIRING: [2x] 
PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:49:36] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [16:50:17] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/zotero: apply [16:50:33] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/zotero: apply [16:53:17] (03PS6) 10CDobbins: dnsrecursor: use config dir instead of standalone file [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) [16:53:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-eqiad and Hurricane Electric (206.126.236.106) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr1-eqiad:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [16:57:03] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins) [16:58:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-eqiad and Hurricane Electric (206.126.236.106) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr1-eqiad:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDow [17:16:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: 
Decom Hurricane Electric Transit/Peering circuit eqiad - https://phabricator.wikimedia.org/T407008 (10cmooney) 03NEW p:05Triage→03Medium [17:17:12] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [17:19:53] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:23:09] (03PS1) 10Cathal Mooney: eqiad: remove dedicate BGP trasnit to HE and replace with IX service [homer/public] - 10https://gerrit.wikimedia.org/r/1195264 (https://phabricator.wikimedia.org/T407008) [17:25:18] (03CR) 10Cathal Mooney: [C:03+2] eqiad: remove dedicate BGP trasnit to HE and replace with IX service [homer/public] - 10https://gerrit.wikimedia.org/r/1195264 (https://phabricator.wikimedia.org/T407008) (owner: 10Cathal Mooney) [17:26:34] (03Merged) 10jenkins-bot: eqiad: remove dedicate BGP trasnit to HE and replace with IX service [homer/public] - 10https://gerrit.wikimedia.org/r/1195264 (https://phabricator.wikimedia.org/T407008) (owner: 10Cathal Mooney) [17:39:01] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:44:01] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:00:32] (03CR) 10Stoyofuku-wmf: [C:03+1] "Thanks for sticking with this!! 
Noting out loud that `dt` and `meta.dt` from the instrumentation spec are not in the stream config, but I" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) (owner: 10LorenMora) [18:03:40] (03PS3) 10DDesouza: Undeploy Design Research participant recruitment survey on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191688 (https://phabricator.wikimedia.org/T405577) [18:04:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1191688 (https://phabricator.wikimedia.org/T405577) (owner: 10DDesouza) [18:06:10] FIRING: BFDdown: BFD session down between cr2-eqiad and fe80::ee38:73ff:fee7:bc68 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:07:53] (03CR) 10Jdlrobson: [C:03+1] "Let's get this landed on Tuesday during the deployment window(s)!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) (owner: 10LorenMora) [18:11:10] RESOLVED: BFDdown: BFD session down between cr2-eqiad and fe80::ee38:73ff:fee7:bc68 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:12:46] (03CR) 10Santiago Faci: "You are right. neither `dt` nor `meta.dt` are contextual attributes. 
Those values are both "core properties" and will be added automatical" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T404999) (owner: 10LorenMora) [18:14:22] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:15:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom Hurricane Electric Transit/Peering circuit eqiad - https://phabricator.wikimedia.org/T407008#11265323 (10RobH) [18:16:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom Hurricane Electric Transit/Peering circuit eqiad - https://phabricator.wikimedia.org/T407008#11265326 (10RobH) I've pulled my old records off the google sheet and updated this task and netbox with the patch panel landing info and... [18:19:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom Hurricane Electric Transit/Peering circuit eqiad - https://phabricator.wikimedia.org/T407008#11265342 (10RobH) @VRiley-WMF, I don't want to accidentally disconnect the wrong thing, and as my old (and possibly outdated) google sh... 
[18:22:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom Hurricane Electric Transit/Peering circuit eqiad - https://phabricator.wikimedia.org/T407008#11265352 (10VRiley-WMF) Sure, I will look into that [18:29:22] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:43:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom Hurricane Electric Transit/Peering circuit eqiad - https://phabricator.wikimedia.org/T407008#11265424 (10VRiley-WMF) I have confirmed that patch ID 3909 lands on PP:0000:103234:P1/2 and then onto cr1-eqiad:xe-3/1/5. Also, it seem... [18:49:16] (03CR) 10Andrew Bogott: "I haven't done performance testing but I have confirmed that I can create a three-node pool, set a key on one node, and read the key on an" [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1194687 (https://phabricator.wikimedia.org/T406522) (owner: 10Andrew Bogott) [19:09:13] (03CR) 10Ebernhardson: [C:03+1] NetworkSession: enable only for private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193988 (owner: 10DCausse) [19:20:37] (03PS9) 10LorenMora: Add ReadingList Stream to EventStreamConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T406627) [19:20:46] (03CR) 10LorenMora: Add ReadingList Stream to EventStreamConfig (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193445 (https://phabricator.wikimedia.org/T406627) (owner: 10LorenMora) [19:22:27] 10ops-eqiad, 06SRE, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11265582 (10VRiley-WMF) Opened up a ticket with Juniper Case Number 2025-1010-896021 
[19:30:16] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install aqs102[3-7] - https://phabricator.wikimedia.org/T407032 (10RobH) 03NEW [19:30:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install aqs102[3-7] - https://phabricator.wikimedia.org/T407032#11265640 (10RobH) [19:32:30] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install aqs102[3-7] - https://phabricator.wikimedia.org/T407032#11265645 (10RobH) a:03Eevans @eevans, Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and... [19:36:08] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11265651 (10VRiley-WMF) Hey @cmooney Just checked it, and I apologize. It wasn't plugged in yet, however, that's been corrected. [19:53:55] (03PS1) 10Eevans: Provision hosts aqs102[3-7] (refresh of aqs101[0-2,4-5]) [puppet] - 10https://gerrit.wikimedia.org/r/1195276 (https://phabricator.wikimedia.org/T407032) [20:00:56] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:04:22] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:45:07] (03PS1) 10Andrew Bogott: Updates to build v2025.10.06 for Debian Trixie [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1195333 (https://phabricator.wikimedia.org/T406522) [20:45:55] (03PS3) 10Andrew Bogott: Updates to build v2025.10.06 for Debian Trixie [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1194687 (https://phabricator.wikimedia.org/T406522) 
[20:46:07] (03Abandoned) 10Andrew Bogott: Updates to build v2025.10.06 for Debian Trixie [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1195333 (https://phabricator.wikimedia.org/T406522) (owner: 10Andrew Bogott) [20:50:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom Hurricane Electric Transit/Peering circuit eqiad - https://phabricator.wikimedia.org/T407008#11265826 (10RobH) [20:51:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom Hurricane Electric Transit/Peering circuit eqiad - https://phabricator.wikimedia.org/T407008#11265827 (10RobH) Deinstall order 1-252958908448 submitted, feel free to unplug the patch cable 3909 from both ends. Once they confirm... [20:51:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom Hurricane Electric Transit/Peering circuit eqiad - https://phabricator.wikimedia.org/T407008#11265831 (10RobH) [20:52:31] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:53:03] andrew@cumin2002 reimage (PID 2863697) is awaiting input [20:54:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Decom Hurricane Electric Transit/Peering circuit eqiad - https://phabricator.wikimedia.org/T407008#11265835 (10RobH) @cmooney: Shouldn't the circuit be 'decommissioning' status in netbox at this point? https://netbox.wikimedia.org/cir... 
[20:57:05] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul1001.eqiad.wmnet with reason: WIP [20:58:56] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Updates to build v2025.10.06 for Debian Trixie [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/1194687 (https://phabricator.wikimedia.org/T406522) (owner: 10Andrew Bogott) [21:00:08] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2005-dev.codfw.wmnet with OS trixie [21:00:56] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:02:10] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:16:49] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2005-dev.codfw.wmnet with reason: host reimage [21:16:54] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudcontrol2005-dev.codfw.wmnet with reason: host reimage [21:32:43] (03PS28) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [21:39:22] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:49:50] (03PS1) 10Dzahn: zuul: add firewall rule to allow docker network 
to zookeeper port [puppet] - 10https://gerrit.wikimedia.org/r/1195340 (https://phabricator.wikimedia.org/T395938) [21:56:16] (03PS29) 10Bking: opensearch-cluster: Add secrets and network policy templates to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1193937 (https://phabricator.wikimedia.org/T397246) [22:00:18] (03CR) 10Dzahn: [C:03+2] zuul: add firewall rule to allow docker network to zookeeper port [puppet] - 10https://gerrit.wikimedia.org/r/1195340 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [22:06:52] (03PS1) 10Dzahn: zuul: fix srange in firewall rule, do not set host bits [puppet] - 10https://gerrit.wikimedia.org/r/1195341 (https://phabricator.wikimedia.org/T395938) [22:07:09] (03CR) 10CI reject: [V:04-1] zuul: fix srange in firewall rule, do not set host bits [puppet] - 10https://gerrit.wikimedia.org/r/1195341 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [22:07:14] (03PS2) 10Dzahn: zuul: fix srange in firewall rule, do not set host bits [puppet] - 10https://gerrit.wikimedia.org/r/1195341 (https://phabricator.wikimedia.org/T395938) [22:10:37] (03CR) 10Dzahn: [C:03+2] zuul: fix srange in firewall rule, do not set host bits [puppet] - 10https://gerrit.wikimedia.org/r/1195341 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [22:14:22] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:18:31] (03PS1) 10Bking: opensearch on k8s: add service definitions [puppet] - 10https://gerrit.wikimedia.org/r/1195342 (https://phabricator.wikimedia.org/T357753) [22:32:32] FIRING: HelmReleaseBadStatus: Helm release mw-script/amfcta11 on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:59:27] (03PS1) 10Dzahn: zuul: add firewall rule to allow zuul-web to httpd [puppet] - 10https://gerrit.wikimedia.org/r/1195347 (https://phabricator.wikimedia.org/T405119) [22:59:53] (03CR) 10CI reject: [V:04-1] zuul: add firewall rule to allow zuul-web to httpd [puppet] - 10https://gerrit.wikimedia.org/r/1195347 (https://phabricator.wikimedia.org/T405119) (owner: 10Dzahn) [23:00:06] (03PS2) 10Dzahn: zuul: add firewall rule to allow zuul-web to httpd [puppet] - 10https://gerrit.wikimedia.org/r/1195347 (https://phabricator.wikimedia.org/T405119) [23:00:30] (03CR) 10CI reject: [V:04-1] zuul: add firewall rule to allow zuul-web to httpd [puppet] - 10https://gerrit.wikimedia.org/r/1195347 (https://phabricator.wikimedia.org/T405119) (owner: 10Dzahn) [23:01:45] (03PS3) 10Dzahn: zuul: add firewall rule to allow zuul-web to httpd [puppet] - 10https://gerrit.wikimedia.org/r/1195347 (https://phabricator.wikimedia.org/T405119) [23:37:31] (03CR) 10Dzahn: [C:03+2] zuul: add firewall rule to allow zuul-web to httpd [puppet] - 10https://gerrit.wikimedia.org/r/1195347 (https://phabricator.wikimedia.org/T405119) (owner: 10Dzahn) [23:38:22] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1195349 [23:38:22] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1195349 (owner: 10TrainBranchBot) [23:45:52] (03PS1) 10Jasmine: wikikube: Add wikikube-ctrl200[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1195350 (https://phabricator.wikimedia.org/T390861) [23:48:49] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1195349 (owner: 10TrainBranchBot) [23:55:42] PROBLEM - 
ganeti-noded running on ganeti1023 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [23:56:42] RECOVERY - ganeti-noded running on ganeti1023 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti