[00:13:57] SRE, collaboration-services, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11386743 (TheDJ) Who will be responsible for security review, when this is sharing important top level domains ?
[00:22:59] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[00:22:59] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[00:22:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[00:38:27] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for catrope - https://phabricator.wikimedia.org/T410473 (Catrope) NEW
[00:38:56] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:40:15] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1206987
[00:40:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1206987 (owner: TrainBranchBot)
[00:48:11] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1074.eqiad.wmnet']
[00:48:56] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudvirt1074.eqiad.wmnet']
[00:55:12] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1206987 (owner: TrainBranchBot)
[01:00:58] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:09:08] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:10:03] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1206989
[01:10:03] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1206989 (owner: TrainBranchBot)
[01:14:17] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 18s)
[01:18:21] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1074.eqiad.wmnet']
[01:18:49] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1074.eqiad.wmnet']
[01:23:29] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1074.eqiad.wmnet']
[01:23:47] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1074.eqiad.wmnet']
[01:35:04] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1206989 (owner: TrainBranchBot)
[01:35:55] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1074.eqiad.wmnet with OS trixie
[01:35:55] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:37:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:50:11] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1074.eqiad.wmnet with reason: host reimage
[01:53:19] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1074.eqiad.wmnet with reason: host reimage
[02:34:44] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[02:34:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[02:34:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[02:51:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:59:28] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1074.eqiad.wmnet with OS trixie
[03:04:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[03:04:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[03:04:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[03:06:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:29:44] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[04:29:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[04:29:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[04:38:56] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:44:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[04:44:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[04:44:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[05:08:24] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:08] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:21:21] (PS1) Kevin Bazira: ml-services: deploy revertrisk-wikidata to the revision-models ns prod [deployment-charts] - https://gerrit.wikimedia.org/r/1207027 (https://phabricator.wikimedia.org/T406179)
[05:24:26] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:29:26] RESOLVED: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:31:59] FIRING: [3x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:33:24] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:36:59] FIRING: [4x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:37:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:46:59] RESOLVED: [4x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:16:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:25:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool ms3 T405942', diff saved to https://phabricator.wikimedia.org/P85372 and previous config saved to /var/cache/conftool/dbconfig/20251119-062509-marostegui.json
[06:25:21] T405942: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942
[06:25:28] PROBLEM - Host db2144 #page is DOWN: PING CRITICAL - Packet loss = 100%
[06:25:38] mmm what
[06:25:41] !incidents
[06:25:42] 7027 (UNACKED) Host db2144 (paged)
[06:25:42] 7024 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[06:25:42] 7025 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[06:25:42] 7023 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[06:25:42] 7017 (RESOLVED) Host db1221 (paged)
[06:25:43] 7022 (RESOLVED) db1233 (paged)/MariaDB Replica Lag: s2 (paged)
[06:25:43] 7021 (RESOLVED) db1259 (paged)/MariaDB Replica Lag: s2 (paged)
[06:25:43] 7020 (RESOLVED) db1259 (paged)/MariaDB Replica IO: s2 (paged)
[06:25:43] 7019 (RESOLVED) db1258 (paged)/MariaDB Replica IO: x3 (paged)
[06:25:44] 7018 (RESOLVED) db1258 (paged)/MariaDB Replica Lag: x3 (paged)
[06:25:44] 7016 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[06:25:45] 7015 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[06:25:45] 7014 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr3-eqsin.wikimedia.org)
[06:25:46] 7013 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org)
[06:25:49] !ack 7027
[06:26:14] PROBLEM - MariaDB Replica IO: ms2 on db1151 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2144.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2144.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:26:14] !incidents
[06:26:15] 7027 (ACKED) Host db2144 (paged)
[06:26:15] 7024 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[06:26:15] 7025 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[06:26:15] 7023 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[06:26:15] 7017 (RESOLVED) Host db1221 (paged)
[06:26:16] 7022 (RESOLVED) db1233 (paged)/MariaDB Replica Lag: s2 (paged)
[06:26:16] 7021 (RESOLVED) db1259 (paged)/MariaDB Replica Lag: s2 (paged)
[06:26:16] 7020 (RESOLVED) db1259 (paged)/MariaDB Replica IO: s2 (paged)
[06:26:16] 7019 (RESOLVED) db1258 (paged)/MariaDB Replica IO: x3 (paged)
[06:26:17] 7018 (RESOLVED) db1258 (paged)/MariaDB Replica Lag: x3 (paged)
[06:26:17] 7016 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[06:26:18] 7015 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[06:26:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool ms3 T405942', diff saved to https://phabricator.wikimedia.org/P85373 and previous config saved to /var/cache/conftool/dbconfig/20251119-062634-marostegui.json
[06:26:43] I will depool ms2
[06:27:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 2.885% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:27:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool ms2', diff saved to https://phabricator.wikimedia.org/P85374 and previous config saved to /var/cache/conftool/dbconfig/20251119-062728-marostegui.json
[06:28:17] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2144.codfw.wmnet,db1151.eqiad.wmnet with reason: db2144 went down
[06:32:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 17.31% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:33:20] ops-codfw, DBA, DC-Ops: db2144 memory error - https://phabricator.wikimedia.org/T410480 (Marostegui) NEW
[06:33:31] ops-codfw, DBA, DC-Ops: db2144 memory error - https://phabricator.wikimedia.org/T410480#11386969 (Marostegui) p:Triage→Medium
[06:34:13] ops-codfw, DBA, DC-Ops: db2144 memory error - https://phabricator.wikimedia.org/T410480#11386972 (Marostegui) I rebooted the host via idrac
[06:34:37] RECOVERY - Host db2144 #page is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms
[06:34:43] ops-codfw, DBA, DC-Ops: db2144 memory error - https://phabricator.wikimedia.org/T410480#11386973 (Marostegui) ms2 is depooled
[06:35:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool pc1 after network maint', diff saved to https://phabricator.wikimedia.org/P85375 and previous config saved to /var/cache/conftool/dbconfig/20251119-063522-marostegui.json
[06:36:14] RECOVERY - MariaDB Replica IO: ms2 on db1151 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:37:44] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[06:37:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[06:37:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[06:39:02] SRE, SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11386976 (WMDECyn) Chandra's position is fixed till maximum 31st Jan 2026
[06:39:03] ops-codfw, DBA, DC-Ops: db2144 memory error - https://phabricator.wikimedia.org/T410480#11386977 (Marostegui) ` 2025-11-19T06:23:24.670274+00:00 db2144 kernel: [8348456.319422] mce: Uncorrected hardware memory error in user-access at 2062ea3d80 2025-11-19T06:23:24.670289+00:00 db2144 kernel: [8348456...
[06:40:33] (PS1) Marostegui: ms2: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1207051 (https://phabricator.wikimedia.org/T410480)
[06:40:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1223 with weight 0 T410283', diff saved to https://phabricator.wikimedia.org/P85376 and previous config saved to /var/cache/conftool/dbconfig/20251119-064055-marostegui.json
[06:41:00] T410283: Switchover s3 master (db1189 -> db1223) - https://phabricator.wikimedia.org/T410283
[06:41:16] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s3 T410283
[06:41:58] (CR) Marostegui: [C:+2] mariadb: Promote db1223 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1206406 (https://phabricator.wikimedia.org/T410283) (owner: Gerrit maintenance bot)
[06:47:36] !log Starting s3 eqiad failover from db1189 to db1223 - T410283
[06:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:40] T410283: Switchover s3 master (db1189 -> db1223) - https://phabricator.wikimedia.org/T410283
[06:47:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1223 to s3 primary T410283', diff saved to https://phabricator.wikimedia.org/P85377 and previous config saved to /var/cache/conftool/dbconfig/20251119-064755-marostegui.json
[06:48:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1189 T410283', diff saved to https://phabricator.wikimedia.org/P85378 and previous config saved to /var/cache/conftool/dbconfig/20251119-064838-marostegui.json
[06:48:57] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1189 gradually with 4 steps - Repooling after switchover
[06:51:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:52:09] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1189 gradually with 4 steps - Repooling after switchover
[06:52:20] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1189 gradually with 4 steps - Repooling after switchover
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T0700)
[07:04:28] ops-codfw, DBA, DC-Ops, Patch-For-Review: db2144 (ms2) memory error - https://phabricator.wikimedia.org/T410480#11387011 (Marostegui)
[07:05:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet,pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: network maintenance
[07:06:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool pc4', diff saved to https://phabricator.wikimedia.org/P85380 and previous config saved to /var/cache/conftool/dbconfig/20251119-070656-marostegui.json
[07:07:20] ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11387015 (Marostegui) @Jclark-ctr db1189 pc1014 Those can be moved anytime when you get to the DC
[07:16:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:21:33] (PS3) DCausse: cirrus: index field to sort on title [mediawiki-config] - https://gerrit.wikimedia.org/r/1205130 (https://phabricator.wikimedia.org/T40403)
[07:21:43] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1205130 (https://phabricator.wikimedia.org/T40403) (owner: DCausse)
[07:37:46] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1189 gradually with 4 steps - Repooling after switchover
[07:41:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:55:27] PROBLEM - Thanos swift https on thanos-fe1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos
[07:57:48] (PS10) Arnaudb: apt-staging: logging and metrics [puppet] - https://gerrit.wikimedia.org/r/1205162 (https://phabricator.wikimedia.org/T409833)
[07:57:48] (CR) Arnaudb: "this change brings a bit more readability on the log output, and adds metrics to allow us to create alerts and be notified when something " [puppet] - https://gerrit.wikimedia.org/r/1205162 (https://phabricator.wikimedia.org/T409833) (owner: Arnaudb)
[07:58:17] RECOVERY - Thanos swift https on thanos-fe1005 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Thanos
[07:58:44] SRE, cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11387076 (fgiunchedi)
[07:59:54] !log started OSM import on maps-test2001 T409528
[07:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:58] (PS3) Arnaudb: apt-staging: error handling for reprepro [puppet] - https://gerrit.wikimedia.org/r/1206887 (https://phabricator.wikimedia.org/T409832)
[07:59:58] T409528: Setup a maps staging DB - https://phabricator.wikimedia.org/T409528
[07:59:58] (CR) Arnaudb: "this change brings a logic stem to plug onto if we want to add email notification in case of reprepro issues. It currently increments a me" [puppet] - https://gerrit.wikimedia.org/r/1206887 (https://phabricator.wikimedia.org/T409832) (owner: Arnaudb)
[08:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T0800).
[08:00:05] dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:13] o/
[08:00:16] I can deploy
[08:01:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:02:12] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1205130 (https://phabricator.wikimedia.org/T40403) (owner: DCausse)
[08:03:01] (Merged) jenkins-bot: cirrus: index field to sort on title [mediawiki-config] - https://gerrit.wikimedia.org/r/1205130 (https://phabricator.wikimedia.org/T40403) (owner: DCausse)
[08:04:10] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1205130|cirrus: index field to sort on title (T40403)]]
[08:04:16] T40403: Sortable search results - https://phabricator.wikimedia.org/T40403
[08:06:26] (CR) Brouberol: [C:+1] opensearch on k8s: Add CODFW environment to helmfiles [deployment-charts] - https://gerrit.wikimedia.org/r/1206973 (https://phabricator.wikimedia.org/T408643) (owner: Bking)
[08:08:29] (PS2) Brouberol: dse-k8s-codfw: set minimum resources for opensearch namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1206969 (https://phabricator.wikimedia.org/T408643) (owner: Bking)
[08:08:36] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1169.eqiad.wmnet']
[08:09:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1169.eqiad.wmnet']
[08:09:32] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1205130|cirrus: index field to sort on title (T40403)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:09:37] T40403: Sortable search results - https://phabricator.wikimedia.org/T40403
[08:12:31] !log filippo@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[08:12:36] !log dcausse@deploy2002 dcausse: Continuing with sync
[08:13:18] !log filippo@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[08:15:06] ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11387129 (Marostegui) @Jclark-ctr db1189 pc1014 Those can be moved anytime when you get to the DC
[08:15:57] SRE, SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11387131 (Volans)
[08:17:52] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1205130|cirrus: index field to sort on title (T40403)]] (duration: 13m 42s)
[08:17:56] T40403: Sortable search results - https://phabricator.wikimedia.org/T40403
[08:21:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:22:43] ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11387139 (Marostegui) @Jclark-ctr I think we scheduled db1189 for today but it was done yesterday? The spreadsheet marks it as done and also I can see: `...
[08:22:46] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for catrope - https://phabricator.wikimedia.org/T410473#11387140 (Volans) p:Triage→Medium
[08:23:53] (PS3) Ryan Kemper: wdqs: add availability sli recording rules [puppet] - https://gerrit.wikimedia.org/r/1202049 (https://phabricator.wikimedia.org/T393966)
[08:24:27] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for catrope - https://phabricator.wikimedia.org/T410473#11387144 (Volans)
[08:24:31] ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11387145 (jcrespo) Based on the spreedsheet, no more interruptions are expected on ` backup1006 backup1007 ms-backup1002 ` So I will restart eqiad med...
[08:25:19] (CR) MVernon: "@bcornwall@wikimedia.org sorry, I was away last week and missed this; the change message says it's not fixed in Debian and cites Debian bu" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1204941 (https://phabricator.wikimedia.org/T352003) (owner: BCornwall)
[08:25:59] (CR) Marostegui: [C:+2] ms2: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1207051 (https://phabricator.wikimedia.org/T410480) (owner: Marostegui)
[08:27:41] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for catrope - https://phabricator.wikimedia.org/T410473#11387152 (Volans) Adding #data-engineering for visibility, no approval required for WMF staff. Pending approval from @SCherukuwada
[08:28:37] SRE, Infrastructure-Foundations, netops: lsw1-d6-eqiad outage Nov 18 2025 - https://phabricator.wikimedia.org/T410455#11387153 (ayounsi)
[08:29:23] SRE, Infrastructure-Foundations, netops: lsw1-d6-eqiad outage Nov 18 2025 - https://phabricator.wikimedia.org/T410455#11387155 (ayounsi) Thanks for the great writeup. We should unfortunately look at upgrading Netbox first. TBD if we need to spend time on a workaround.
[08:30:00] filippo@cumin1003 reimage (PID 2877688) is awaiting input
[08:30:48] (CR) Ayounsi: UEFI: dup partition on MD RAID boxes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: JHathaway)
[08:34:50] SRE, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, vm-requests: Site: codfw 1 VM request for codfw1dev CAS test/dev, hostname: cloudidp2001-dev - https://phabricator.wikimedia.org/T410294#11387197 (dcaro) p:Triage→Medium
[08:35:15] !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for backup[1006-1007].eqiad.wmnet,ms-backup[1001-1002].eqiad.wmnet
[08:35:17] !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for backup[1006-1007].eqiad.wmnet,ms-backup[1001-1002].eqiad.wmnet
[08:38:56] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[08:39:59] dcausse: are you done deploying?
[08:40:13] kostajh: yes
[08:40:31] ok, I will deploy some patches
[08:41:00] (CR) Kosta Harlan: "recheck" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1206960 (https://phabricator.wikimedia.org/T410354) (owner: Kosta Harlan)
[08:42:03] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11387257 (ayounsi) Lots great thanks ! Not sure how best to show it on the diagram, but we also need to remove the 10G link between cr3 and cr4. Maybe you can...
[08:42:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206906 (https://phabricator.wikimedia.org/T410024) (owner: 10Kosta Harlan) [08:43:20] (03PS4) 10Ryan Kemper: wdqs: add availability sli recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1202049 (https://phabricator.wikimedia.org/T393966) [08:45:11] (03CR) 10Ryan Kemper: wdqs: add availability sli recording rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1202049 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [08:51:37] (03PS1) 10Gehel: wdqs: Do not create task on failure of the WDQS LDF endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1207107 (https://phabricator.wikimedia.org/T408853) [08:51:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:53:39] (03Merged) 10jenkins-bot: hCaptcha: Validate sitekey of /siteverify API call [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206906 (https://phabricator.wikimedia.org/T410024) (owner: 10Kosta Harlan) [08:54:13] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1206906|hCaptcha: Validate sitekey of /siteverify API call (T410024)]] [08:54:17] T410024: ConfirmEdit hCaptcha: Verify sitekey in `siteverify` response was the sitekey given to the client as part of validating the captcha - https://phabricator.wikimedia.org/T410024 [08:56:42] (03PS1) 10Kosta Harlan: hCaptcha: Record A/B test experiment group [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207108 (https://phabricator.wikimedia.org/T410354) [08:56:51] FIRING: [4x] CoreRouterInterfaceDown: Core router 
interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:57:39] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for catrope - https://phabricator.wikimedia.org/T410473#11387290 (SCherukuwada) Manager approves.
[08:58:11] !log filippo@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[08:58:47] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1206906|hCaptcha: Validate sitekey of /siteverify API call (T410024)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:59:55] (PS1) Itamar Givon: Replace 'let' with arithmetic expansion [dumps] - https://gerrit.wikimedia.org/r/1207109 (https://phabricator.wikimedia.org/T406044)
[08:59:57] (PS1) Itamar Givon: Clean up existing symlink before creating a new one [dumps] - https://gerrit.wikimedia.org/r/1207110 (https://phabricator.wikimedia.org/T406044)
[08:59:59] (PS1) Itamar Givon: Restore strict error handling [dumps] - https://gerrit.wikimedia.org/r/1207111 (https://phabricator.wikimedia.org/T406044)
[09:00:05] brennen and andre: Your horoscope predicts another MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T0900).
[09:00:44] !log kharlan@deploy2002 kharlan: Continuing with sync
[09:01:25] andre: still finishing up some backports, is it ok to continue for another 30 minutes?
[09:01:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:04:40] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1207108 (https://phabricator.wikimedia.org/T410354) (owner: Kosta Harlan)
[09:04:45] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206906|hCaptcha: Validate sitekey of /siteverify API call (T410024)]] (duration: 10m 32s)
[09:04:49] T410024: ConfirmEdit hCaptcha: Verify sitekey in `siteverify` response was the sitekey given to the client as part of validating the captcha - https://phabricator.wikimedia.org/T410024
[09:05:07] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1206960 (https://phabricator.wikimedia.org/T410354) (owner: Kosta Harlan)
[09:05:56] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1206830 (https://phabricator.wikimedia.org/T410354) (owner: Kosta Harlan)
[09:06:02] actually, I'll leave the deployments I have for later
[09:06:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down -
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:09:08] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:10:00] (PS5) Tiziano Fogli: metamonitoring/icinga: generate contacts list [puppet] - https://gerrit.wikimedia.org/r/1206886 (https://phabricator.wikimedia.org/T393625)
[09:10:01] (PS1) Tiziano Fogli: metamonitoring/icinga: trigger pages only for the active instance [puppet] - https://gerrit.wikimedia.org/r/1207113 (https://phabricator.wikimedia.org/T393625)
[09:10:41] (CR) CI reject: [V:-1] metamonitoring/icinga: generate contacts list [puppet] - https://gerrit.wikimedia.org/r/1206886 (https://phabricator.wikimedia.org/T393625) (owner: Tiziano Fogli)
[09:16:45] (PS3) Tiziano Fogli: metamonitoring/icinga: suppress script-managed notifications and pages [puppet] - https://gerrit.wikimedia.org/r/1206884 (https://phabricator.wikimedia.org/T393625)
[09:16:45] (PS4) Tiziano Fogli: metamonitoring/icinga: add smtp settings to config.yaml [puppet] - https://gerrit.wikimedia.org/r/1206885 (https://phabricator.wikimedia.org/T393625)
[09:16:45] (PS6) Tiziano Fogli: metamonitoring/icinga: generate contacts list [puppet] - https://gerrit.wikimedia.org/r/1206886 (https://phabricator.wikimedia.org/T393625)
[09:16:46] (PS2) Tiziano Fogli: metamonitoring/icinga: trigger pages only for the active instance [puppet] - https://gerrit.wikimedia.org/r/1207113 (https://phabricator.wikimedia.org/T393625)
[09:20:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy1001.wikimedia.org
[09:20:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy7001.wikimedia.org
[09:20:59] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:23:35]
(PS1) Volans: admin: add catrope to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1207114 (https://phabricator.wikimedia.org/T410473)
[09:24:22] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for catrope - https://phabricator.wikimedia.org/T410473#11387379 (Volans)
[09:24:39] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for catrope - https://phabricator.wikimedia.org/T410473#11387381 (Volans)
[09:24:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy1001.wikimedia.org
[09:25:54] (PS1) David Caro: toolforge:prometheus: use / as the path url instead of /tools [puppet] - https://gerrit.wikimedia.org/r/1207115
[09:26:44] jmm@cumin2002 makevm (PID 262502) is awaiting input
[09:31:10] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7001.wikimedia.org - jmm@cumin2002"
[09:31:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy1002.wikimedia.org
[09:31:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[09:32:02] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1207114 (https://phabricator.wikimedia.org/T410473) (owner: Volans)
[09:34:15] jmm@cumin2002 makevm (PID 262502) is awaiting input
[09:34:17] (CR) Volans: [C:+2] admin: add catrope to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1207114
(https://phabricator.wikimedia.org/T410473) (owner: Volans)
[09:35:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy1002.wikimedia.org
[09:35:37] (CR) Btullis: [C:+1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/1207107 (https://phabricator.wikimedia.org/T408853) (owner: Gehel)
[09:35:51] (PS2) Gehel: wdqs: Do not create task on failure of the WDQS LDF endpoints [puppet] - https://gerrit.wikimedia.org/r/1207107 (https://phabricator.wikimedia.org/T408853)
[09:36:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7001.wikimedia.org - jmm@cumin2002"
[09:36:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:36:12] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7001.wikimedia.org on all recursors
[09:36:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7001.wikimedia.org on all recursors
[09:36:24] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:36:50] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11387399 (ayounsi) Nice ! As the IPs are already available, we should change the cr3/cr4/mr1 loopbacks ahead of time, in a different maintenance window, so...
[09:36:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[09:37:12] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for catrope - https://phabricator.wikimedia.org/T410473#11387401 (Volans) @Catrope patch merged, will be live within ~30 minutes. Kerberos principal created, you should have received an email about it with in...
[09:37:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:38:13] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1206936 (https://phabricator.wikimedia.org/T405177) (owner: Sergio Gimeno)
[09:38:19] (CR) Gehel: [C:+2] wdqs: Do not create task on failure of the WDQS LDF endpoints [puppet] - https://gerrit.wikimedia.org/r/1207107 (https://phabricator.wikimedia.org/T408853) (owner: Gehel)
[09:42:08] jmm@cumin2002 makevm (PID 262502) is awaiting input
[09:44:03] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[09:44:11] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=93) for new host hcaptcha-proxy7001.wikimedia.org
[09:44:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy2001.wikimedia.org
[09:48:35] !log jmm@cumin2002 END (PASS)
- Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy2001.wikimedia.org
[09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:52:38] (PS1) Sergio Gimeno: fix(MigrateMentorStatusAway): ensure migration respects date format [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1207118 (https://phabricator.wikimedia.org/T409170)
[09:52:49] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - https://gerrit.wikimedia.org/r/1207118 (https://phabricator.wikimedia.org/T409170) (owner: Sergio Gimeno)
[09:56:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy2002.wikimedia.org
[09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:56:40] SRE, Bitu, Infrastructure-Foundations: Live validation of usernames - https://phabricator.wikimedia.org/T345168#11387450 (Tacsipacsi)
[09:59:40] SRE, SRE-Access-Requests: Grant Access to ops-limited for matthieulec - https://phabricator.wikimedia.org/T410291#11387469 (MLechvien-WMF) Thanks! I'm now able to SSH to Bastion, so it seems fine to close this.
[10:00:19] (PS1) Brouberol: airflow: update the base image to include the opensearch provider [deployment-charts] - https://gerrit.wikimedia.org/r/1207121 (https://phabricator.wikimedia.org/T408238)
[10:00:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy2002.wikimedia.org
[10:00:36] SRE, SRE-Access-Requests: Grant Access to ops-limited for matthieulec - https://phabricator.wikimedia.org/T410291#11387484 (MLechvien-WMF) Open→Resolved
[10:04:29] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts hcaptcha-proxy7001.wikimedia.org
[10:08:36] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:09:01] (PS1) Muehlenhoff: sre.hosts.decommission: Fix typo [cookbooks] - https://gerrit.wikimedia.org/r/1207122
[10:13:45] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1202768 (https://phabricator.wikimedia.org/T399199) (owner: Gergő Tisza)
[10:14:19] jmm@cumin2002 decommission (PID 285077) is awaiting input
[10:16:55] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:17:26] kostajh: see https://versions.toolforge.org/ - group0 is already on wmf.3 so there is no train :)
[10:20:00] jmm@cumin2002 decommission (PID 285077) is awaiting input
[10:20:00] (PS1) Muehlenhoff: EFI-enabled Partman recipe (WIP) [puppet] - https://gerrit.wikimedia.org/r/1207124 (https://phabricator.wikimedia.org/T410400)
[10:20:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy7001.wikimedia.org decommissioned,
removing all IPs except the asset tag one - jmm@cumin2002"
[10:20:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:20:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts hcaptcha-proxy7001.wikimedia.org
[10:20:21] SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11387503 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for ho...
[10:23:03] SRE, Infrastructure-Foundations, netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11387510 (ayounsi) I might have found something in Redfish for Dell: `lang=python r = spicerack.redfish('sretest2004') dump = r.scp_dump() dump.config['SystemConfiguration']['Comp...
[10:23:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[10:24:07] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on clouddb[1024-1025].eqiad.wmnet with reason: cloning
[10:25:37] (PS1) Marostegui: mariadb: Productionize clouddb1025 [puppet] - https://gerrit.wikimedia.org/r/1207125 (https://phabricator.wikimedia.org/T409557)
[10:25:41] (CR) Brouberol: [C:+2] airflow: update the base image to include the opensearch provider [deployment-charts] - https://gerrit.wikimedia.org/r/1207121 (https://phabricator.wikimedia.org/T408238) (owner: Brouberol)
[10:26:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 -
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[10:26:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[10:26:52] (CR) Marostegui: [C:+2] mariadb: Productionize clouddb1025 [puppet] - https://gerrit.wikimedia.org/r/1207125 (https://phabricator.wikimedia.org/T409557) (owner: Marostegui)
[10:28:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[10:28:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[10:30:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[10:31:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[10:31:36] (PS1) Marostegui: db2144: Remove note. [puppet] - https://gerrit.wikimedia.org/r/1207127
[10:32:12] (CR) Marostegui: [C:+2] db2144: Remove note.
[puppet] - https://gerrit.wikimedia.org/r/1207127 (owner: Marostegui)
[10:32:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy7001.wikimedia.org
[10:32:25] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:34:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy3001.wikimedia.org
[10:35:14] (PS1) Federico Ceratto: admin: add fceratto FIDO2 U2F SSH key [puppet] - https://gerrit.wikimedia.org/r/1207129
[10:35:14] (CR) Federico Ceratto: "As discussed on IRC" [puppet] - https://gerrit.wikimedia.org/r/1207129 (owner: Federico Ceratto)
[10:35:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy3001.wikimedia.org
[10:37:59] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[10:37:59] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[10:37:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[10:38:07] jmm@cumin2002 makevm (PID 299387) is awaiting input
[10:39:47] (CR) Arnaudb: "I forgot to @ any reviewer for this chance, sorry about the delay!"
[cookbooks] - https://gerrit.wikimedia.org/r/1195437 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[10:40:12] (CR) Arnaudb: "change*" [cookbooks] - https://gerrit.wikimedia.org/r/1195437 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[10:40:46] (PS5) Arnaudb: gerrit: add a local backup cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1193590 (https://phabricator.wikimedia.org/T387833)
[10:48:18] SRE, Infrastructure-Foundations, netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11387577 (ayounsi) Looks like it was a false hope, I looked at cirrussearch2115 which is showing the same behavior: ` lsw1-d3-codfw> show lldp neighbors | match xe-0/0/43 xe...
[10:48:52] (CR) Arnaudb: [C:+1] "small nitpicks but no blockers" [alerts] - https://gerrit.wikimedia.org/r/1201087 (https://phabricator.wikimedia.org/T408632) (owner: AOkoth)
[10:51:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:51:51] (CR) Arnaudb: "I'll need to remove the local backup logic from the failover cookbook after merging this" [cookbooks] - https://gerrit.wikimedia.org/r/1193590 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[10:52:29] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7001.wikimedia.org - jmm@cumin2002"
[10:52:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7001.wikimedia.org - jmm@cumin2002"
[10:52:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:52:36] !log jmm@cumin2002 START
- Cookbook sre.dns.wipe-cache hcaptcha-proxy7001.wikimedia.org on all recursors
[10:52:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7001.wikimedia.org on all recursors
[10:53:12] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy7001.wikimedia.org - jmm@cumin2002"
[10:53:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy7001.wikimedia.org - jmm@cumin2002"
[10:55:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy4001.wikimedia.org
[10:56:14] (PS1) Brouberol: airflow-platform-eng: define the opensearch_test connection [deployment-charts] - https://gerrit.wikimedia.org/r/1207132 (https://phabricator.wikimedia.org/T408238)
[10:56:18] jmm@cumin2002 makevm (PID 299387) is awaiting input
[10:58:43] (CR) Muehlenhoff: [C:+1] "Looks good!"
[puppet] - https://gerrit.wikimedia.org/r/1207129 (owner: Federico Ceratto)
[10:58:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy4001.wikimedia.org
[11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1100)
[11:00:42] (PS1) Cyndywikime: [Growth]:Remove GELevelingUpNewNotificationsEnabled config [mediawiki-config] - https://gerrit.wikimedia.org/r/1207133 (https://phabricator.wikimedia.org/T407431)
[11:00:51] (CR) Kosta Harlan: [C:+1] airflow-platform-eng: define the opensearch_test connection [deployment-charts] - https://gerrit.wikimedia.org/r/1207132 (https://phabricator.wikimedia.org/T408238) (owner: Brouberol)
[11:01:46] (CR) Brouberol: [C:+2] airflow-platform-eng: define the opensearch_test connection [deployment-charts] - https://gerrit.wikimedia.org/r/1207132 (https://phabricator.wikimedia.org/T408238) (owner: Brouberol)
[11:02:41] (CR) Federico Ceratto: [C:+2] admin: add fceratto FIDO2 U2F SSH key [puppet] - https://gerrit.wikimedia.org/r/1207129 (owner: Federico Ceratto)
[11:03:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:03:46] (CR) Cyndywikime: "This patch is ready for review" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207133 (https://phabricator.wikimedia.org/T407431) (owner: Cyndywikime)
[11:03:52] !log cgoubert@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-eqiad
[11:04:59] (PS2) Kevin Bazira: ml-services: deploy revertrisk-wikidata to the revision-models ns prod [deployment-charts] - https://gerrit.wikimedia.org/r/1207027 (https://phabricator.wikimedia.org/T406179)
[11:05:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:06:17] (CR) Hnowlan:
[C:+1] "lgtm, two nits" [puppet] - https://gerrit.wikimedia.org/r/1207113 (https://phabricator.wikimedia.org/T393625) (owner: Tiziano Fogli)
[11:06:40] (PS1) Brouberol: airflow-platform-eng: fix a tyop in the opensearch_test connection [deployment-charts] - https://gerrit.wikimedia.org/r/1207136 (https://phabricator.wikimedia.org/T408238)
[11:06:53] (PS2) Brouberol: airflow-platform-eng: fix a typo in the opensearch_test connection [deployment-charts] - https://gerrit.wikimedia.org/r/1207136 (https://phabricator.wikimedia.org/T408238)
[11:07:18] SRE, Infrastructure-Foundations, netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11387603 (ayounsi) Haven't dug yet, but maybe an option is to install Broadcom's niccli tool : https://docs.broadcom.com/docs/Linux_Niccli-233.0.198.0 Then disabling it with: ` D...
[11:08:52] (CR) Kosta Harlan: [C:+1] airflow-platform-eng: fix a typo in the opensearch_test connection [deployment-charts] - https://gerrit.wikimedia.org/r/1207136 (https://phabricator.wikimedia.org/T408238) (owner: Brouberol)
[11:09:10] (CR) Brouberol: [C:+2] airflow-platform-eng: fix a typo in the opensearch_test connection [deployment-charts] - https://gerrit.wikimedia.org/r/1207136 (https://phabricator.wikimedia.org/T408238) (owner: Brouberol)
[11:10:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:10:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:11:23] SRE, SRE-Access-Requests: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11387617 (ngkountas) Thank you @Volans, I can now run queries on super.wikimedia.org properly! Thanks to everyone involved! This task can be now resolved.
[11:11:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:12:45] (PS1) Brouberol: airflow-platform-eng: fix a typo in the opensearch_test connection [deployment-charts] - https://gerrit.wikimedia.org/r/1207137 (https://phabricator.wikimedia.org/T408238)
[11:13:57] (CR) Kosta Harlan: [C:+1] airflow-platform-eng: fix a typo in the opensearch_test connection [deployment-charts] - https://gerrit.wikimedia.org/r/1207137 (https://phabricator.wikimedia.org/T408238) (owner: Brouberol)
[11:15:04] (CR) Brouberol: [C:+2] airflow-platform-eng: fix a typo in the opensearch_test connection [deployment-charts] - https://gerrit.wikimedia.org/r/1207137 (https://phabricator.wikimedia.org/T408238) (owner: Brouberol)
[11:15:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:16:22] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:19:06] jouncebot: nowandnext
[11:19:07] For the next 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1100)
[11:19:07] In 0 hour(s) and 40 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1200)
[11:19:57] (CR) FNegri: [C:+1] P:toolforge::prometheus: Use native exporters for HAProxy targets [puppet] - https://gerrit.wikimedia.org/r/1203427 (https://phabricator.wikimedia.org/T343885) (owner: Majavah)
[11:23:24] (CR) Majavah: [C:+2] P:toolforge::prometheus: Use native exporters for HAProxy targets [puppet] - https://gerrit.wikimedia.org/r/1203427 (https://phabricator.wikimedia.org/T343885) (owner: Majavah)
[11:23:43] (PS1) Brouberol: airflow-platform-eng: configure SSL for opensearch API communication [deployment-charts] -
https://gerrit.wikimedia.org/r/1207140 (https://phabricator.wikimedia.org/T408238)
[11:24:37] (CR) Kosta Harlan: [C:+1] airflow-platform-eng: configure SSL for opensearch API communication [deployment-charts] - https://gerrit.wikimedia.org/r/1207140 (https://phabricator.wikimedia.org/T408238) (owner: Brouberol)
[11:24:39] RECOVERY - Kafka broker TLS certificate validity on kafka-main1006 is OK: SSL OK - Certificate kafka-main1006.eqiad.wmnet valid until 2026-10-20 13:49:00 +0000 (expires in 335 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[11:24:50] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-eqiad
[11:26:21] (CR) Brouberol: [C:+2] airflow-platform-eng: configure SSL for opensearch API communication [deployment-charts] - https://gerrit.wikimedia.org/r/1207140 (https://phabricator.wikimedia.org/T408238) (owner: Brouberol)
[11:26:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:27:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:30:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy7001.wikimedia.org with OS trixie
[11:30:49] !log Roll restarting mobileapps in codfw - unavailable replicas - T410296
[11:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:53] T410296: Significant increase in wikifeeds latency since 2025/11/13 - https://phabricator.wikimedia.org/T410296
[11:30:57] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: sync
[11:31:57] thanks, was thinking about that :D
[11:32:18] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: sync
[11:33:01] hnowlan: :D
[11:34:36] (PS1) Slyngshede: P:cache::base disable geoip in cloud environment
[puppet] - https://gerrit.wikimedia.org/r/1207141
[11:37:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[11:37:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[11:37:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[11:41:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy5001.wikimedia.org
[11:45:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy5001.wikimedia.org
[11:47:43] (PS3) Kevin Bazira: ml-services: deploy revertrisk-wikidata to the revertrisk ns prod [deployment-charts] - https://gerrit.wikimedia.org/r/1207027 (https://phabricator.wikimedia.org/T406179)
[11:54:20] (PS3) Majavah: P:toolforge: Remove legacy HAProxy exporters [puppet] - https://gerrit.wikimedia.org/r/1203428
[11:55:30] SRE, Infrastructure-Foundations, netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11387754 (cmooney) Another datapoint here, but the logspam seems worse on some switches: ` A:lsw1-d7-eqiad# show system logging buffer messages | grep -c "remote peer updated on i...
[11:55:32] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7647/co" [puppet] - https://gerrit.wikimedia.org/r/1203428 (owner: Majavah)
[11:56:34] (CR) AikoChou: [C:+1] "LGTM!
:)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207027 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira) [11:57:21] !log routing /api/rest_v1/page/lint/ via the rest-gateway for group1 [11:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:23] (03PS1) 10Majavah: Remove absented HAProxy exporters [puppet] - 10https://gerrit.wikimedia.org/r/1207144 [12:00:04] mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1200). [12:00:20] (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207027 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira) [12:01:47] (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206905 (owner: 10PipelineBot) [12:02:13] (03Merged) 10jenkins-bot: ml-services: deploy revertrisk-wikidata to the revertrisk ns prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207027 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira) [12:03:34] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206905 (owner: 10PipelineBot) [12:03:56] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy7001.wikimedia.org with reason: host reimage [12:04:40] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[12:05:00] !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [12:05:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy5002.wikimedia.org [12:05:23] !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [12:06:37] !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [12:07:09] !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [12:07:26] !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [12:07:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy7001.wikimedia.org with reason: host reimage [12:07:55] !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [12:09:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy5002.wikimedia.org [12:10:49] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [12:11:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy6001.wikimedia.org [12:13:26] (03CR) 10Clément Goubert: [C:03+1] "LGTM, will need testing in staging before roll out." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132) (owner: 10Daniel Kinzler) [12:14:38] 06SRE, 10SRE-Access-Requests: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11387772 (10Volans) 05In progress→03Resolved a:03Volans [12:15:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy6001.wikimedia.org [12:22:41] !log filippo@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [12:24:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy7001.wikimedia.org with OS trixie [12:24:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy7001.wikimedia.org [12:25:52] !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host db1169.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [12:27:10] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: allow rate limits per time unit (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132) (owner: 10Daniel Kinzler) [12:28:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy6002.wikimedia.org [12:28:54] (03CR) 10Clément Goubert: rest-gateway: implement per-route rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206898 (https://phabricator.wikimedia.org/T409044) (owner: 10Daniel Kinzler) [12:32:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy6002.wikimedia.org [12:32:43] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2144 (ms2) memory error - https://phabricator.wikimedia.org/T410480#11387861 (10Marostegui) [12:33:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host 
hcaptcha-proxy7002.wikimedia.org [12:33:12] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:35:35] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1169.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [12:37:31] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002" [12:38:56] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:39:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002" [12:39:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:39:19] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7002.wikimedia.org on all recursors [12:39:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7002.wikimedia.org on all recursors [12:39:59] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:40:01] (03CR) 10Clément Goubert: rest-gateway: assign ratelimit class by network range (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) (owner: 10Daniel Kinzler) [12:43:42] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002" [12:43:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate 
netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002" [12:43:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:43:49] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7002.wikimedia.org on all recursors [12:43:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7002.wikimedia.org on all recursors [12:43:52] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy7002.wikimedia.org [12:44:35] (03CR) 10Zabe: undeploy Extension:Capiunto (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206851 (https://phabricator.wikimedia.org/T410172) (owner: 10Novem Linguae) [12:45:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy7002.wikimedia.org [12:45:12] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:46:27] (03CR) 10Jforrester: [C:03+1] undeploy Extension:Capiunto (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206851 (https://phabricator.wikimedia.org/T410172) (owner: 10Novem Linguae) [12:49:06] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002" [12:49:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002" [12:49:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:49:12] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7002.wikimedia.org on all recursors [12:49:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7002.wikimedia.org on all recursors [12:49:47] !log jmm@cumin2002 START - 
Cookbook sre.dns.netbox [12:51:32] (03PS1) 10Filippo Giunchedi: install_server: workaround for mpt3sas large optimal_io_size [puppet] - 10https://gerrit.wikimedia.org/r/1207150 (https://phabricator.wikimedia.org/T407586) [12:52:24] (03CR) 10Daniel Kinzler: rest-gateway: allow rate limits per time unit (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132) (owner: 10Daniel Kinzler) [12:53:21] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002" [12:53:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002" [12:53:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:53:28] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7002.wikimedia.org on all recursors [12:53:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7002.wikimedia.org on all recursors [12:53:31] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy7002.wikimedia.org [12:53:52] (03CR) 10Daniel Kinzler: rest-gateway: implement per-route rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206898 (https://phabricator.wikimedia.org/T409044) (owner: 10Daniel Kinzler) [12:54:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy7002.wikimedia.org [12:54:32] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:55:10] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge: Remove legacy HAProxy exporters [puppet] - 10https://gerrit.wikimedia.org/r/1203428 (owner: 10Majavah) [12:55:17] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge: Remove 
legacy HAProxy exporters [puppet] - 10https://gerrit.wikimedia.org/r/1203428 (owner: 10Majavah) [12:58:42] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002" [12:58:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002" [12:58:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:58:47] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7002.wikimedia.org on all recursors [12:58:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7002.wikimedia.org on all recursors [12:59:24] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002" [12:59:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002" [13:00:44] (03CR) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) (owner: 10Daniel Kinzler) [13:03:23] jmm@cumin2002 makevm (PID 370070) is awaiting input [13:04:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy7002.wikimedia.org with OS trixie [13:09:08] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:14:50] !log installing systemd bugfix updates on trixie hosts [13:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:17] filippo@cumin1003 reimage (PID 2904929) is awaiting input [13:18:07] 10SRE-Access-Requests: New SSH key - https://phabricator.wikimedia.org/T410506 (10jijiki) 03NEW [13:19:32] (03PS1) 10Effie Mouzeli: admin: add new keys for effie [puppet] - 10https://gerrit.wikimedia.org/r/1207153 (https://phabricator.wikimedia.org/T410506) [13:29:53] (03PS2) 10Anzx: tcywikisource: Temporary increase of AccountCreationThrottle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207154 (https://phabricator.wikimedia.org/T410507) [13:30:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207154 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx) [13:30:41] (03CR) 10Hoo man: [C:04-1] Replace 'let' with arithmetic expansion (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1207109 (https://phabricator.wikimedia.org/T406044) (owner: 10Itamar Givon) [13:31:06] (03CR) 10Hoo man: [C:04-1] Replace 'let' with arithmetic expansion (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1207109 (https://phabricator.wikimedia.org/T406044) (owner: 10Itamar Givon) [13:33:12] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling reboot on A:swift-fe-eqiad [13:33:24] !log marostegui@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance [13:33:41] (03CR) 10Hoo man: Clean up existing symlink before creating a new one (032 comments) [dumps] - 10https://gerrit.wikimedia.org/r/1207110 (https://phabricator.wikimedia.org/T406044) (owner: 10Itamar Givon) [13:33:51] 
!log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance [13:33:51] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy7002.wikimedia.org with reason: host reimage [13:33:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1187 (T299441)', diff saved to https://phabricator.wikimedia.org/P85388 and previous config saved to /var/cache/conftool/dbconfig/20251119-133358-marostegui.json [13:34:03] T299441: Avoid depooling hosts if the schema change has been applied before - https://phabricator.wikimedia.org/T299441 [13:36:13] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:37:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:37:56] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance [13:39:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy7002.wikimedia.org with reason: host reimage [13:43:41] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1187 gradually with 4 steps - repool after schema change test [13:50:19] 06SRE, 06cloud-services-team, 13Patch-For-Review: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11388128 (10fgiunchedi) Reported to Debian as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1121006 [13:50:51] !log cumin2024@db2205.codfw.wmnet[(none)]> drop database if exists noboardwiki; 
drop database if exists ru_sibwiki; drop database if exists sep11wiki; drop database if exists strategyappswiki; (T297297) [13:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:56] T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297 [13:52:30] 06SRE, 06cloud-services-team, 13Patch-For-Review, 07Upstream: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11388143 (10taavi) [13:56:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy7002.wikimedia.org with OS trixie [13:56:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy7002.wikimedia.org [13:59:50] !log installing monitoring-plugins bugfix updates on trixie hosts [13:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1400). [14:00:05] edsanders, Sergi0, tgr, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:06] o/ [14:00:16] o/ [14:00:17] I can self deploy my config change [14:00:40] o/ [14:01:15] (03CR) 10Ssingh: [C:03+1] "Yes, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1207141 (owner: 10Slyngshede) [14:01:39] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11388181 (10MoritzMuehlenhoff) [14:02:29] I'll begin [14:02:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206880 (https://phabricator.wikimedia.org/T402532) (owner: 10Esanders) [14:02:51] 10ops-eqiad, 06SRE, 06DC-Ops: PXE failing on db1169 - https://phabricator.wikimedia.org/T410388#11388191 (10Jclark-ctr) 05Open→03Resolved Closing this ticket since it’s a configuration problem being addressed in T410400 [14:03:28] (03Merged) 10jenkins-bot: Freeze LiquidThreads on ptwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206880 (https://phabricator.wikimedia.org/T402532) (owner: 10Esanders) [14:04:00] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1206880|Freeze LiquidThreads on ptwikibooks (T402532)]] [14:04:02] I'll stay last, will need to do a lot of testing [14:04:04] T402532: ptwikibooks: LQT set to readonly and removed as default - https://phabricator.wikimedia.org/T402532 [14:04:08] (03CR) 10Ssingh: [C:03+2] P:cache::base disable geoip in cloud environment [puppet] - 10https://gerrit.wikimedia.org/r/1207141 (owner: 10Slyngshede) [14:04:54] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11388207 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Closing ticket — cabling subtask has been completed and server migration is in process [14:05:06] o/ [14:05:57] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11388212 (10MoritzMuehlenhoff) @ssingh The hcaptcha-proxy VMs in magru are up and running [14:07:01] 
06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11388216 (10ssingh) Oh wow, thanks @MoritzMuehlenhoff! But what was the issue for my understanding? [14:08:42] !log esanders@deploy2002 esanders: Backport for [[gerrit:1206880|Freeze LiquidThreads on ptwikibooks (T402532)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:08:50] (03CR) 10Ssingh: [C:03+2] hiera: lvs/interfaces: remove public1-ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1206424 (https://phabricator.wikimedia.org/T410047) (owner: 10Ssingh) [14:10:37] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11388223 (10MoritzMuehlenhoff) >>! In T409860#11388216, @ssingh wrote: > Oh wow, thanks @MoritzMu... 
[14:10:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [14:12:02] (03PS2) 10AOkoth: vrts: alert on vrts junk queue size [alerts] - 10https://gerrit.wikimedia.org/r/1201087 (https://phabricator.wikimedia.org/T408632) [14:12:05] !log esanders@deploy2002 esanders: Continuing with sync [14:12:59] !log cumin2024@db2205.codfw.wmnet[(none)]> drop database if exists tlhwiki; drop database if exists tlhwiktionary; drop database if exists ukwikimedia; drop database if exists zerowiki; drop database if exists zh_cnwiki; drop database if exists zh_twwiki; (T297297) [14:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:03] T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297 [14:14:44] (03PS1) 10Arnaudb: admin: add FIDO key for arnaudb [puppet] - 10https://gerrit.wikimedia.org/r/1207159 [14:15:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [14:16:00] (03CR) 10Lucas Werkmeister (WMDE): "Is this really the right way to change the throttle? 
I can’t find any similar modifications in Git since the config took on its current fo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207154 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx) [14:16:13] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206880|Freeze LiquidThreads on ptwikibooks (T402532)]] (duration: 12m 13s) [14:16:18] T402532: ptwikibooks: LQT set to readonly and removed as default - https://phabricator.wikimedia.org/T402532 [14:16:39] I think I can do mine together [14:16:49] ack [14:17:37] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11388240 (10ssingh) >>! In T409860#11388223, @MoritzMuehlenhoff wrote: >>>! In T409860#11388216,... [14:17:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206936 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno) [14:17:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207118 (https://phabricator.wikimedia.org/T409170) (owner: 10Sergio Gimeno) [14:19:05] (03CR) 10Anzx: "seems so if ip address is not known https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold#:~:text=If%20the%20IP%20is%2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207154 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx) [14:19:20] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and key verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1207159 (owner: 10Arnaudb) [14:20:11] (03CR) 10Arnaudb: [C:03+2] admin: add FIDO key for arnaudb [puppet] - 10https://gerrit.wikimedia.org/r/1207159 (owner: 10Arnaudb) [14:26:51] FIRING: [4x] 
CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:29:11] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1187 gradually with 4 steps - repool after schema change test [14:29:18] (03CR) 10Volans: [C:03+1] "Verified out of band with Effie" [puppet] - 10https://gerrit.wikimedia.org/r/1207153 (https://phabricator.wikimedia.org/T410506) (owner: 10Effie Mouzeli) [14:29:44] (03Merged) 10jenkins-bot: fix(ReviseToneExperimentInteractionLogger): prevent breaking homepage for unsampled users [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206936 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno) [14:30:18] (03Merged) 10jenkins-bot: fix(MigrateMentorStatusAway): ensure migration respects date format [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207118 (https://phabricator.wikimedia.org/T409170) (owner: 10Sergio Gimeno) [14:30:33] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: New SSH key - https://phabricator.wikimedia.org/T410506#11388307 (10Volans) p:05Triage→03Medium [14:30:50] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1206936|fix(ReviseToneExperimentInteractionLogger): prevent breaking homepage for unsampled users (T405177)]], [[gerrit:1207118|fix(MigrateMentorStatusAway): ensure migration respects date format (T409170)]] [14:30:56] T405177: Revise Tone: Instrumentation - https://phabricator.wikimedia.org/T405177 [14:30:57] T409170: Run MigrateMentorStatusAway migration script - https://phabricator.wikimedia.org/T409170 [14:31:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Remove lvs1018 L2 link to ssw1-e1-eqiad - 
https://phabricator.wikimedia.org/T405499#11388318 (10cmooney) 05Resolved→03Open a:05cmooney→03None Hi. Seems I made an error here as not all the work is complete on site. We still ne... [14:32:46] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Fair enough, let’s try it then." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207154 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx) [14:33:57] (03PS3) 10Ssingh: O:haptcha::proxy: add new role for hCaptcha proxy VMs (bird/anycast) [puppet] - 10https://gerrit.wikimedia.org/r/1204073 (https://phabricator.wikimedia.org/T409780) [14:34:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle) [14:34:23] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11388333 (10cmooney) 05Open→03Resolved >>! In T410047#11374122, @cmooney wrote: > Actually I discussed with @Papaul in relation to... [14:34:25] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad row C/D DC Ops host migrations - https://phabricator.wikimedia.org/T405021#11388335 (10Jclark-ctr) 05Open→03Resolved All dcops servers have been relocated to new switches [14:35:22] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: New SSH keys for effie - https://phabricator.wikimedia.org/T410506#11388338 (10A_smart_kitten) [14:35:28] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1206936|fix(ReviseToneExperimentInteractionLogger): prevent breaking homepage for unsampled users (T405177)]], [[gerrit:1207118|fix(MigrateMentorStatusAway): ensure migration respects date format (T409170)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:35:59] !log sgimeno@deploy2002 sgimeno: Continuing with sync [14:36:12] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:36:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11388344 (10Jclark-ctr) 05Open→03Resolved a:05RobH→03Jclark-ctr All Servers for Traffic have been migrated to new nokia switches [14:37:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:40:00] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206936|fix(ReviseToneExperimentInteractionLogger): prevent breaking homepage for unsampled users (T405177)]], [[gerrit:1207118|fix(MigrateMentorStatusAway): ensure migration respects date format (T409170)]] (duration: 09m 09s) [14:40:05] T405177: Revise Tone: Instrumentation - https://phabricator.wikimedia.org/T405177 [14:40:06] T409170: Run MigrateMentorStatusAway migration script - https://phabricator.wikimedia.org/T409170 [14:40:30] all yours @anzx [14:40:51] Or @Lucas_WMDE ? 
[14:41:07] yaeh, I can deploy this one :) [14:41:09] *yeah [14:41:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207154 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx) [14:42:29] (03Merged) 10jenkins-bot: tcywikisource: Temporary increase of AccountCreationThrottle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207154 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx) [14:43:02] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1207154|tcywikisource: Temporary increase of AccountCreationThrottle (T410507)]] [14:43:06] T410507: Increase AccountCreationThrottle for Tulu Wikisource - https://phabricator.wikimedia.org/T410507 [14:43:46] (03PS1) 10Ssingh: site.pp: reimage hcaptcha-proxy1001 to proper role [puppet] - 10https://gerrit.wikimedia.org/r/1207165 (https://phabricator.wikimedia.org/T409780) [14:44:11] (03PS1) 10Bking: opensearch-cluster: give 'opensearch' user access to bulk API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207166 (https://phabricator.wikimedia.org/T408012) [14:45:03] Lucas_WMDE: no need test, good to sync [14:45:15] makes sense [14:45:44] (03PS1) 10Ladsgroup: rdbms: Dismantle concept of groups [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207167 (https://phabricator.wikimedia.org/T405087) [14:47:09] (03PS1) 10Anzx: Revert "tcywikisource: Temporary increase of AccountCreationThrottle " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207169 (https://phabricator.wikimedia.org/T410507) [14:47:42] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [14:47:56] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [14:47:57] !log lucaswerkmeister-wmde@deploy2002 anzx, lucaswerkmeister-wmde: Backport for [[gerrit:1207154|tcywikisource: Temporary 
increase of AccountCreationThrottle (T410507)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:48:24] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling reboot on A:swift-fe-eqiad [14:48:32] !log lucaswerkmeister-wmde@deploy2002 anzx, lucaswerkmeister-wmde: Continuing with sync [14:49:33] (03CR) 10Lucas Werkmeister (WMDE): "Thanks :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207169 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx) [14:51:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:51:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:52:01] (03CR) 10Andrew Bogott: [C:03+1] "bonkers" [puppet] - 10https://gerrit.wikimedia.org/r/1207150 (https://phabricator.wikimedia.org/T407586) (owner: 10Filippo Giunchedi) [14:52:34] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207154|tcywikisource: Temporary increase of AccountCreationThrottle (T410507)]] (duration: 09m 32s) [14:52:38] Lucas_WMDE: thanks for deploying [14:52:39] T410507: Increase AccountCreationThrottle for Tulu Wikisource - https://phabricator.wikimedia.org/T410507 [14:52:42] np [14:52:49] tgr_: over to you [14:53:01] thx [14:54:02] (03PS2) 10Gergő Tisza: Use prefixed 'sub' field in OAuth 2 access tokens [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202768 (https://phabricator.wikimedia.org/T399199) [14:54:42] oh, right, I wanted to try running 
resetAuthenticationThrottle too [14:54:46] (shouldn’t interfere, hopefully) [14:55:32] !log lucaswerkmeister-wmde@deploy2002 mwscript-k8s job started: resetAuthenticationThrottle tcywikisource --signup # T410507 [14:55:52] !log (T410507 maintenance script failed, --ip is required and we don’t have it. oh well) [14:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:18] Lucas_WMDE: i thought without IP address it was not required, thanks [14:56:36] you can add the wiki to throttle.php instead [14:57:15] anzx: yeah I suspected it would fail but wanted to try it anyway [14:57:21] tgr_: we don’t have an IP range for the event apparently :/ [14:57:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202768 (https://phabricator.wikimedia.org/T399199) (owner: 10Gergő Tisza) [14:57:57] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11388433 (10MoritzMuehlenhoff) [14:57:58] (03CR) 10Bking: [C:03+2] dse-k8s-codfw: set minimum resources for opensearch namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206969 (https://phabricator.wikimedia.org/T408643) (owner: 10Bking) [14:58:15] (03Merged) 10jenkins-bot: Use prefixed 'sub' field in OAuth 2 access tokens [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202768 (https://phabricator.wikimedia.org/T399199) (owner: 10Gergő Tisza) [14:58:28] IP is optional for that [14:58:39] but then, maybe unwise to allow all IPs [14:58:45] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1202768|Use prefixed 'sub' field in OAuth 2 access tokens (T399199)]] [14:58:50] T399199: Update OAuth 2.0 sessions to include new JWT session data from core - https://phabricator.wikimedia.org/T399199 [14:58:59] hm, maybe https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold needs an update then? 
that’s what pointed to wgAccountCreationThrottle [15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1500) [15:00:39] you can just omit IP/range and then the higher limit will apply to all IPs [15:00:59] please let me know once you're done. I have a backport [15:01:01] $wgAccountCreationThrottle would work too, but then you can't limit it by date/wiki [15:03:37] !log tgr@deploy2002 tgr: Backport for [[gerrit:1202768|Use prefixed 'sub' field in OAuth 2 access tokens (T399199)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:03:52] (03CR) 10Gehel: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207166 (https://phabricator.wikimedia.org/T408012) (owner: 10Bking) [15:04:17] oh, you did already increase $wgAccountCreationThrottle. You don't really need the maintenance script then. [15:04:39] (03PS1) 10Lucas Werkmeister (WMDE): tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) [15:04:51] tgr_: does https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1207171 look better? 
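[editor's note] The exchange above describes how a temporary throttle exception is scoped: it can be limited by date and wiki, and if the IP/range is omitted the higher limit applies to all IPs. A minimal sketch of that matching logic follows. It is not the code in throttle-analyze.php or throttle.php. The class name, field names, dates, and the `applies` helper are all hypothetical. Only the limit of 75 and the wiki name come from the conversation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from ipaddress import ip_address, ip_network


@dataclass
class ThrottleException:
    """Hypothetical shape of one temporary account-creation throttle exception."""
    start: datetime
    end: datetime
    limit: int
    wikis: tuple = ()      # empty tuple: rule is not restricted to specific wikis
    ip_ranges: tuple = ()  # empty tuple: rule applies to every IP


def applies(rule: ThrottleException, wiki: str, ip: str, now: datetime) -> bool:
    """Return True if the raised limit should be used for this request."""
    # Outside the configured time window the rule is inert.
    if not (rule.start <= now <= rule.end):
        return False
    # A wiki restriction, when present, must match.
    if rule.wikis and wiki not in rule.wikis:
        return False
    # No IP/range given: the higher limit applies to all IPs.
    if not rule.ip_ranges:
        return True
    addr = ip_address(ip)
    return any(addr in ip_network(r, strict=False) for r in rule.ip_ranges)


# Illustrative values only; the window below is made up.
UTC = timezone.utc
event = ThrottleException(
    start=datetime(2025, 11, 19, 0, 0, tzinfo=UTC),
    end=datetime(2025, 11, 20, 0, 0, tzinfo=UTC),
    limit=75,
    wikis=("tcywikisource",),
)
print(applies(event, "tcywikisource", "203.0.113.7", datetime(2025, 11, 19, 15, 0, tzinfo=UTC)))
```

Leaving `ip_ranges` empty mirrors the trade-off raised in the channel: it is the only option when no event IP range is known, but it raises the limit for everyone on that wiki for the duration of the window.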
[15:05:12] yeah, I ran the maintenance script because wikitech said to (and I figured it wouldn’t hurt even if it errored out) [15:05:26] (03CR) 10CI reject: [V:04-1] tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) (owner: 10Lucas Werkmeister (WMDE)) [15:05:38] (but test your change first :)) [15:07:34] (03PS2) 10Lucas Werkmeister (WMDE): tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) [15:07:34] yeah looks good [15:07:50] not running the script means the effective limit will be 69 not 75 [15:07:56] which isn't a big deal [15:08:24] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:30] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [15:09:21] ack [15:09:28] (though I had to fix one test that failed on the missing IP/range ^^) [15:09:57] wtf, I wrote that test? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1073487 [15:10:14] (03PS1) 10Awight: Monitoring for WMDE dumps scraper [puppet] - 10https://gerrit.wikimedia.org/r/1207174 (https://phabricator.wikimedia.org/T402613) [15:10:38] so I guess we haven’t had throttling exceptions without IPs/ranges since at least September 2024 [15:10:50] hopefully they still work. anzx: does https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1207171 look okay to you? 
[15:11:25] !log tgr@deploy2002 tgr: Continuing with sync [15:12:11] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: push changes - cmooney@cumin1003" [15:12:17] the code is in throttle-analyze.php, looks pretty straightforward [15:13:02] (03CR) 10Anzx: "just to be safe extend endtime by 1 hour" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) (owner: 10Lucas Werkmeister (WMDE)) [15:13:31] Lucas_WMDE: looks ok, i have suggested to increase time by 1 hour just to be safe [15:13:37] anzx: sure [15:13:42] tgr_: true, fair enough [15:13:43] thanks! [15:13:45] (03PS2) 10Majavah: Remove absented HAProxy exporters [puppet] - 10https://gerrit.wikimedia.org/r/1207144 [15:13:49] then I’ll try to get that deployed later [15:13:50] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: push changes - cmooney@cumin1003" [15:13:50] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:14:24] (03CR) 10Daphne Smit: [C:03+2] wikifunctions: Bump the orchestrator timeout down a skosh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205263 (https://phabricator.wikimedia.org/T407503) (owner: 10Cory Massaro) [15:15:29] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202768|Use prefixed 'sub' field in OAuth 2 access tokens (T399199)]] (duration: 16m 43s) [15:15:34] T399199: Update OAuth 2.0 sessions to include new JWT session data from core - https://phabricator.wikimedia.org/T399199 [15:15:43] let's see if we break any OAuth clients this time [15:15:56] Amir1: you are good to go [15:16:03] (03Merged) 10jenkins-bot: wikifunctions: Bump the orchestrator timeout down a skosh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205263 (https://phabricator.wikimedia.org/T407503) (owner: 10Cory 
Massaro) [15:16:05] please ping Lucas_WMDE when done [15:16:18] (03PS3) 10Kgraessle: Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) [15:16:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [15:16:27] (03CR) 10Ladsgroup: [C:03+2] rdbms: Dismantle concept of groups [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207167 (https://phabricator.wikimedia.org/T405087) (owner: 10Ladsgroup) [15:16:28] (03CR) 10Majavah: [C:03+2] Remove absented HAProxy exporters [puppet] - 10https://gerrit.wikimedia.org/r/1207144 (owner: 10Majavah) [15:16:31] awesome [15:16:51] fingers crossed for OAuth [15:16:53] my patch is going to take a while to merge, so if it's mw-config, Lucas_WMDE you can go head [15:16:54] brouberol: Can you ping when you're done? deployment-charts git is dirty so we can't use our window. 
[15:17:01] (03PS3) 10Lucas Werkmeister (WMDE): tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) [15:17:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [15:17:32] (03CR) 10Lucas Werkmeister (WMDE): tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) (owner: 10Lucas Werkmeister (WMDE)) [15:17:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [15:17:45] Amir1: ok [15:17:49] (03CR) 10CI reject: [V:04-1] tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) (owner: 10Lucas Werkmeister (WMDE)) [15:17:55] bah, what now [15:18:11] “Comments should start on new line.” blhhhhhh [15:18:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [15:18:25] even the wikitech example has end-of-line comments 😡 https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold [15:18:42] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [15:19:06] (03PS4) 10Lucas Werkmeister (WMDE): tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) [15:19:21] Amir1: want to CR+1 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1207171 ? 
[15:19:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [15:19:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [15:20:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [15:20:26] (03CR) 10Ladsgroup: [C:03+1] tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) (owner: 10Lucas Werkmeister (WMDE)) [15:20:33] (03Merged) 10jenkins-bot: rdbms: Dismantle concept of groups [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207167 (https://phabricator.wikimedia.org/T405087) (owner: 10Ladsgroup) [15:20:35] ^ [15:20:38] ok, you go first [15:20:43] oh mine got merged [15:20:43] (asps. very dangerous!) [15:20:44] interesting [15:22:05] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1207167|rdbms: Dismantle concept of groups (T405087)]] [15:22:13] T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087 [15:22:25] !log ladsgroup@deploy2002 sync-world failed: Command '['sudo', '-u', 'mwbuilder', '-n', '--', '/usr/bin/scap', 'mwscript', '--no-local-config', '--directory', '/srv/mediawiki-staging', '--user', 'www-data', '--', 'mergeMessageFileList.php', '--wiki=aawiki', '--force-version', '1.46.0-wmf.3', '--list-file', '/srv/mediawiki-staging/wmf-config/extension-list', '--output', '/tmp/tmp.IxZM23pYxK']' returned [15:22:25] non-zero exit status 255. (scap version: 4.227.0) (duration: 00m 20s) [15:22:47] (03Abandoned) 10Anzx: Revert "tcywikisource: Temporary increase of AccountCreationThrottle " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207169 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx) [15:23:03] Oh dear. 
[15:23:30] https://www.irccloud.com/pastebin/hU21gQov/ [15:23:44] (03PS1) 10Ssingh: hiera: lvs/interfaces: remove VLAN sub-ints for edges [puppet] - 10https://gerrit.wikimedia.org/r/1207180 (https://phabricator.wikimedia.org/T409860) [15:23:46] (03PS1) 10TrainBranchBot: Revert "rdbms: Dismantle concept of groups" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207181 [15:23:46] (03CR) 10TrainBranchBot: "ladsgroup@deploy2002 created a revert of this change as Ie333b077a04b6846c711f1a97baef4b42b46ae0f" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207167 (https://phabricator.wikimedia.org/T405087) (owner: 10Ladsgroup) [15:24:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207181 (owner: 10TrainBranchBot) [15:24:52] brouberol: Ping again. We'd really like to use our deployment window if possible. [15:25:23] (03PS2) 10Ssingh: hiera: lvs/interfaces: remove VLAN sub-ints for edges [puppet] - 10https://gerrit.wikimedia.org/r/1207180 (https://phabricator.wikimedia.org/T410411) [15:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1500) [15:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1530) [15:30:07] !log rebooting sretest2004 to check LLDP settings [15:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:07] James_F: sorry, I missed the first ping. It's fixed [15:31:12] Thanks! 
[15:31:17] !log installing wtmpdb bugfix updates on trixie hosts [15:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:20] np, and apologies [15:31:28] !log daphnesmit@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:31:58] !log daphnesmit@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:32:10] PROBLEM - Host sretest2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:32:28] !log slyngshede@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool site drmrs [reason: no reason specified, T390813] [15:32:33] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site drmrs [reason: no reason specified, T390813] [15:32:34] T390813: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813 [15:33:11] !log daphnesmit@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:33:24] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:33:35] !log installing console-setup bugfix updates on trixie hosts [15:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:38] !log daphnesmit@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:33:51] !log daphnesmit@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:34:14] (03CR) 10Kosta Harlan: [C:03+1] opensearch-cluster: give 'opensearch' user access to bulk API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207166 (https://phabricator.wikimedia.org/T408012) (owner: 10Bking) [15:34:27] !log daphnesmit@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:35:26] (03CR) 10Effie Mouzeli: [C:03+2] admin: add new 
keys for effie [puppet] - 10https://gerrit.wikimedia.org/r/1207153 (https://phabricator.wikimedia.org/T410506) (owner: 10Effie Mouzeli) [15:36:18] (03CR) 10Daphne Smit: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-11-08-223341 to 2025-11-18-175356 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206981 (https://phabricator.wikimedia.org/T305612) (owner: 10Jforrester) [15:38:34] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-11-08-223341 to 2025-11-18-175356 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206981 (https://phabricator.wikimedia.org/T305612) (owner: 10Jforrester) [15:38:56] (03Merged) 10jenkins-bot: Revert "rdbms: Dismantle concept of groups" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207181 (owner: 10TrainBranchBot) [15:39:23] !log daphnesmit@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:39:27] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1207181|Revert "rdbms: Dismantle concept of groups"]] [15:39:47] !log daphnesmit@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:40:38] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: New SSH keys for effie - https://phabricator.wikimedia.org/T410506#11388761 (10jijiki) 05Open→03Resolved a:03jijiki [15:40:54] !log daphnesmit@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:40:54] 06SRE, 06Infrastructure-Foundations, 10netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11388764 (10Papaul) @ayounsi Please see below the steps to disable LLDP in the BIOS for Dell servers. - once in the BIOS go to "Device Settings" -pick the first NIC if it is 1G or... [15:41:20] (03CR) 10Jsn.sherman: [C:03+1] "LGTM; thanks for your work on this!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle) [15:41:28] !log daphnesmit@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:41:36] !log daphnesmit@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:42:10] !log daphnesmit@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:42:22] RECOVERY - Host sretest2004 is UP: PING OK - Packet loss = 0%, RTA = 33.54 ms [15:43:45] !log ladsgroup@deploy2002 trainbranchbot, ladsgroup: Backport for [[gerrit:1207181|Revert "rdbms: Dismantle concept of groups"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:43:54] Window clear from our end. [15:44:36] I’m waiting for Amir1 to be done deploying [15:44:42] !log ladsgroup@deploy2002 trainbranchbot, ladsgroup: Continuing with sync [15:44:50] and then can hopefully deploy my config cleanup in the break between wf/xLab and mw infra [15:44:51] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11388797 (10MoritzMuehlenhoff) [15:44:59] just got to test servers [15:45:22] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11388798 (10RobH) Update: * backup1006, backup1007, ms-backup1002 moved yesterday. * db1189 was moved yesterday by accident sorry about that! * The only d... [15:47:05] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11388802 (10Ladsgroup) Please ping me before moving of pc1014 so I depool pc4 cluster from rotation. 
[15:47:38] (03CR) 10Bking: [C:03+2] opensearch-cluster: give 'opensearch' user access to bulk API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207166 (https://phabricator.wikimedia.org/T408012) (owner: 10Bking) [15:48:31] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11388807 (10jcrespo) >>! In T405942#11388798, @RobH wrote: > ** moss-be1002 - no directions provided on moving this, please advise @Robh, not mine, but pl... [15:48:42] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207181|Revert "rdbms: Dismantle concept of groups"]] (duration: 09m 14s) [15:50:11] (03CR) 10Bking: [C:03+2] opensearch on k8s: Add CODFW environment to helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206973 (https://phabricator.wikimedia.org/T408643) (owner: 10Bking) [15:51:57] (03Merged) 10jenkins-bot: opensearch on k8s: Add CODFW environment to helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206973 (https://phabricator.wikimedia.org/T408643) (owner: 10Bking) [15:52:35] !log pt1979@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw1-b[12-13]-drmrs,cr[1-2]-drmrs,mr1-drmrs with reason: router upgrade [15:52:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [15:52:46] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [15:52:50] here it comes [15:53:15] Lucas_WMDE: I'm done with the deploy [15:53:20] !incidents [15:53:20] 7029 (ACKED) Primary inbound port utilisation over 80% (paged) network noc (cr1-esams.wikimedia.org) [15:53:20] 7027 (RESOLVED) Host db2144 (paged) [15:53:20] 7024 (RESOLVED) ATSBackendErrorsHigh 
cache_upload sre (swift.discovery.wmnet eqiad) [15:53:21] 7025 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [15:53:21] 7023 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [15:53:21] 7017 (RESOLVED) Host db1221 (paged) [15:53:21] 7022 (RESOLVED) db1233 (paged)/MariaDB Replica Lag: s2 (paged) [15:53:21] 7021 (RESOLVED) db1259 (paged)/MariaDB Replica Lag: s2 (paged) [15:53:22] 7020 (RESOLVED) db1259 (paged)/MariaDB Replica IO: s2 (paged) [15:53:22] 7019 (RESOLVED) db1258 (paged)/MariaDB Replica IO: x3 (paged) [15:53:22] 7018 (RESOLVED) db1258 (paged)/MariaDB Replica Lag: x3 (paged) [15:53:23] 7016 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [15:53:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [15:53:26] o/ [15:53:37] jhathaway: so this is because we depooled drmrs and now esams is suffering [15:53:42] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11388818 (10RobH) >>! In T405942#11388802, @Ladsgroup wrote: > Please ping me before moving of pc1014 so I depool pc4 cluster from rotation. Will do, it w... [15:53:43] topranks: I guess we weather this out for a bit? or what? [15:53:54] thanks sukhe [15:54:20] sukhe: do you want me to wait ? 
[15:54:29] sukhe: we should look for scrapers of originals in esams [15:54:32] (03PS7) 10JHathaway: UEFI: dup partition on MD RAID boxes [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) [15:54:43] (03CR) 10JHathaway: UEFI: dup partition on MD RAID boxes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [15:54:47] moving to private [15:57:46] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr1-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [15:59:06] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11388835 (10MoritzMuehlenhoff) [15:59:19] (03CR) 10Pmiazga: rest-gateway: assign ratelimit class by network range (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) (owner: 10Daniel Kinzler) [15:59:22] !log installing brltty bugfix updates on trixie hosts [15:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:15] (03PS1) 10Bking: opensearch-cluster: Add cluster ro permissions to 'opensearch' user [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207194 (https://phabricator.wikimedia.org/T408012) [16:01:51] (03PS1) 10DCausse: apt: update opensearch3 key [puppet] - 10https://gerrit.wikimedia.org/r/1207195 [16:02:21] (03CR) 10CI reject: [V:04-1] apt: update opensearch3 key [puppet] - 10https://gerrit.wikimedia.org/r/1207195 (owner: 10DCausse) [16:02:45] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:03:39] Amir1: 
thanks (sorry I missed the ping) [16:03:50] !log installing libvirt bugfix updates on trixie hosts [16:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:14] jouncebot: nowandnext [16:04:14] No deployments scheduled for the next 1 hour(s) and 55 minute(s) [16:04:14] In 1 hour(s) and 55 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1800) [16:04:22] though it sounds like it might not be a good idea to deploy right now [16:05:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [16:06:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [16:06:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:06:57] !log pt1979@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: router upgrade [16:07:45] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:07:46] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:08:13] sukhe: drmrs still depooled? 
[16:08:18] vgutierrez: yeah [16:08:22] see -sec [16:08:24] (03CR) 10Jasmine: [C:03+2] Cleanup maintenance_hosts hiera variable use [puppet] - 10https://gerrit.wikimedia.org/r/1206877 (https://phabricator.wikimedia.org/T400442) (owner: 10Alexandros Kosiaris) [16:08:32] (03CR) 10Jasmine: [C:03+2] Empty maintenance_hosts array [puppet] - 10https://gerrit.wikimedia.org/r/1206876 (https://phabricator.wikimedia.org/T400442) (owner: 10Alexandros Kosiaris) [16:08:53] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11388867 (10MoritzMuehlenhoff) [16:11:25] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11388895 (10MoritzMuehlenhoff) [16:14:04] (03PS2) 10DCausse: apt: update opensearch3 key [puppet] - 10https://gerrit.wikimedia.org/r/1207195 [16:15:56] PROBLEM - Host doh6001 is DOWN: PING CRITICAL - Packet loss = 100% [16:15:56] PROBLEM - Host durum6001 is DOWN: PING CRITICAL - Packet loss = 100% [16:16:14] PROBLEM - Host tcp-proxy6001 is DOWN: PING CRITICAL - Packet loss = 100% [16:16:18] PROBLEM - Host install6003 is DOWN: PING CRITICAL - Packet loss = 100% [16:16:30] PROBLEM - Host bast6003 is DOWN: PING CRITICAL - Packet loss = 100% [16:16:34] yeah expected ^ [16:16:36] that is me [16:16:41] (03CR) 10Bking: [C:03+1] "post-merge +1" [puppet] - 10https://gerrit.wikimedia.org/r/1207107 (https://phabricator.wikimedia.org/T408853) (owner: 10Gehel) [16:16:42] PROBLEM - Host ncredir6001 is DOWN: PING CRITICAL - Packet loss = 100% [16:16:45] reboting asw1-v12 [16:16:53] b12 [16:16:57] FIRING: [5x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:17:46] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr1-esams.wikimedia.org recovered 
from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:18:24] FIRING: [4x] ProbeDown: Service ganeti6001:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:19:30] FIRING: [2x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs6002:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:19:41] yeah well we should really silence all drmrs at this point [16:19:46] jhathaway: Raine: ^ [16:19:55] (PS3) Bking: apt: update opensearch3 key [puppet] - https://gerrit.wikimedia.org/r/1207195 (https://phabricator.wikimedia.org/T407123) (owner: DCausse) [16:20:01] ok [16:20:10] sgtm sukhe [16:20:23] do we have tooling to do that? 
[16:21:29] jhathaway: https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org something here [16:21:43] nod [16:21:57] FIRING: [7x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:22:57] jhathaway: Raine: sorry, Traffic should have really silenced this [16:22:59] I can take that on [16:23:24] FIRING: [24x] JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:23:37] happy to as well, but so far my alert manger foo is failing me sukhe [16:24:02] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.01e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [16:24:30] FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs6002:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled [16:24:34] done [16:24:40] silenced all drmrs [16:24:40] thanks sukhe [16:24:46] thanks <3 [16:26:13] (CR) Filippo Giunchedi: "Thank you for reaching out !" [puppet] - https://gerrit.wikimedia.org/r/1207174 (https://phabricator.wikimedia.org/T402613) (owner: Awight) [16:26:26] no worries, this is my bad. we should have silenced it. 
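[editor's note] On the "do we have tooling to do that?" question: the log does not show how the drmrs silence was actually created. As a generic illustration only, the sketch below builds the JSON body that the upstream Alertmanager v2 API accepts for a new silence. The label name `site`, the author string, and the helper name are assumptions, not taken from the channel or from Wikimedia's tooling. The one-hour duration and the "router upgrade" reason echo the downtime cookbook entries earlier in the log.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(label, value, hours, author, comment, now=None):
    """Build an Alertmanager v2 silence body that matches one label exactly.

    The result is only a dict; nothing is sent. Against a stock Alertmanager it
    would be POSTed as JSON to /api/v2/silences.
    """
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            # isRegex=False and isEqual=True mean: label == value, literally.
            {"name": label, "value": value, "isRegex": False, "isEqual": True}
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": comment,
    }


# Hypothetical label name and author; adjust to whatever labels the alerts carry.
body = build_silence(
    "site", "drmrs", 1, "example-user", "router upgrade",
    now=datetime(2025, 11, 19, 16, 24, tzinfo=timezone.utc),
)
print(json.dumps(body, indent=2))
```

Matching on a single site-level label silences everything at that site in one go, which is the effect described in the channel ("silenced all drmrs"). A narrower matcher set would be needed to keep unrelated alerts for the same site audible.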
[16:27:10] !log bking@deploy2002 helmfile [default] START helmfile.d/dse-k8s-services/opensearch-test: apply [16:27:11] !log bking@deploy2002 helmfile [default] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [16:27:20] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [16:27:30] RECOVERY - Host ncredir6001 is UP: PING OK - Packet loss = 0%, RTA = 87.71 ms [16:27:32] RECOVERY - Host durum6001 is UP: PING OK - Packet loss = 0%, RTA = 88.79 ms [16:27:34] RECOVERY - Host doh6001 is UP: PING OK - Packet loss = 0%, RTA = 87.56 ms [16:27:35] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [16:27:42] RECOVERY - Host tcp-proxy6001 is UP: PING OK - Packet loss = 0%, RTA = 87.48 ms [16:27:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply [16:27:46] RECOVERY - Host install6003 is UP: PING OK - Packet loss = 0%, RTA = 87.64 ms [16:27:58] RECOVERY - Host bast6003 is UP: PING OK - Packet loss = 0%, RTA = 87.48 ms [16:28:15] (03CR) 10Filippo Giunchedi: [C:03+2] install_server: workaround for mpt3sas large optimal_io_size [puppet] - 10https://gerrit.wikimedia.org/r/1207150 (https://phabricator.wikimedia.org/T407586) (owner: 10Filippo Giunchedi) [16:28:35] (03PS1) 10Muehlenhoff: Record LDAP access for kareid [puppet] - 10https://gerrit.wikimedia.org/r/1207200 [16:28:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply [16:29:16] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [16:29:40] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for kareid [puppet] - 10https://gerrit.wikimedia.org/r/1207200 (owner: 10Muehlenhoff) [16:29:59] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [16:30:25] FIRING: SystemdUnitFailed: 
netbox_ganeti_drmrs01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:31:06] godog, jasmine_: okay to puppet-merge your changes along? [16:31:20] (03PS2) 10Alexandros Kosiaris: Cleanup maintenance_hosts hiera variable use [puppet] - 10https://gerrit.wikimedia.org/r/1206877 (https://phabricator.wikimedia.org/T400442) [16:31:32] (03CR) 10Jasmine: [C:03+2] Cleanup maintenance_hosts hiera variable use [puppet] - 10https://gerrit.wikimedia.org/r/1206877 (https://phabricator.wikimedia.org/T400442) (owner: 10Alexandros Kosiaris) [16:31:46] moritzm: yes please [16:31:55] ok, merging [16:33:22] and done [16:33:41] thank you [16:34:04] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS bookworm [16:34:14] moritzm: ty! [16:35:22] !log filippo@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [16:35:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [16:36:51] !log filippo@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [16:36:52] PROBLEM - Host netflow6001 is DOWN: PING CRITICAL - Packet loss = 100% [16:36:52] PROBLEM - Host ncredir6002 is DOWN: PING CRITICAL - Packet loss = 100% [16:36:58] PROBLEM - Host doh6002 is DOWN: PING CRITICAL - Packet loss = 100% [16:36:58] PROBLEM - Host prometheus6002 is DOWN: PING CRITICAL - Packet loss = 100% [16:36:58] PROBLEM - Host durum6002 is DOWN: PING CRITICAL 
- Packet loss = 100% [16:37:00] well I did silence it sigh [16:37:12] this is probably icinga then hmm [16:37:14] PROBLEM - Host tcp-proxy6002 is DOWN: PING CRITICAL - Packet loss = 100% [16:37:45] does anyone recall the silencing in Icinga? [16:38:15] sukhe: cookbook? [16:38:34] yeah A:drmrs on hosts.downtime [16:38:36] running [16:38:56] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:39:14] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 39 hosts with reason: site depool [16:45:25] RESOLVED: SystemdUnitFailed: netbox_ganeti_drmrs01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:45:55] FIRING: [2x] SystemdUnitFailed: netbox_ganeti_drmrs01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:45:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [16:46:48] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11388988 (10Papaul) [16:47:46] I guess I’ll deploy 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1207171 tomorrow instead, doesn’t sound like it’s okay to deploy at the moment and I’m about to sign off [16:48:13] RECOVERY - Host ncredir6002 is UP: PING OK - Packet loss = 0%, RTA = 87.45 ms [16:48:21] RECOVERY - Host netflow6001 is UP: PING OK - Packet loss = 0%, RTA = 87.41 ms [16:48:29] RECOVERY - Host doh6002 is UP: PING OK - Packet loss = 0%, RTA = 87.36 ms [16:48:29] RECOVERY - Host prometheus6002 is UP: PING OK - Packet loss = 0%, RTA = 87.42 ms [16:48:29] RECOVERY - Host durum6002 is UP: PING OK - Packet loss = 0%, RTA = 87.45 ms [16:48:43] RECOVERY - Host tcp-proxy6002 is UP: PING OK - Packet loss = 0%, RTA = 87.56 ms [16:51:31] (03PS1) 10DLynch: TextMatchEditCheck: undo duplicate sub-type logging [extensions/VisualEditor] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207201 (https://phabricator.wikimedia.org/T407286) [16:51:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/VisualEditor] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207201 (https://phabricator.wikimedia.org/T407286) (owner: 10DLynch) [16:52:54] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1008-dev.eqiad.wmnet with reason: host reimage [16:53:03] 06SRE, 06Infrastructure-Foundations, 10netops: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841#11389010 (10cmooney) [16:53:50] (03PS1) 10Andrew Bogott: cloudcontrol2010-dev: remove pause-reboot [puppet] - 10https://gerrit.wikimedia.org/r/1207202 (https://phabricator.wikimedia.org/T409328) [16:56:04] (03CR) 10Andrew Bogott: [C:03+2] cloudcontrol2010-dev: remove pause-reboot [puppet] - 10https://gerrit.wikimedia.org/r/1207202 (https://phabricator.wikimedia.org/T409328) (owner: 10Andrew Bogott) [16:56:31] !log filippo@cumin1003 START - Cookbook 
sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage [16:56:55] (03PS1) 10Filippo Giunchedi: install_server: restore cloudcontrol2010-dev unattended installation [puppet] - 10https://gerrit.wikimedia.org/r/1207203 (https://phabricator.wikimedia.org/T407586) [16:57:09] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1008-dev.eqiad.wmnet with reason: host reimage [16:58:05] (03Abandoned) 10Filippo Giunchedi: install_server: restore cloudcontrol2010-dev unattended installation [puppet] - 10https://gerrit.wikimedia.org/r/1207203 (https://phabricator.wikimedia.org/T407586) (owner: 10Filippo Giunchedi) [16:58:37] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [16:58:50] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [16:59:06] Raine: want to repool drmrs in case you haven't done it before? 
it's good practise :) [16:59:19] sukhe: sure :D one sec [16:59:33] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [16:59:38] https://wikitech.wikimedia.org/wiki/DNS#Change_GeoDNS_/_Depool_a_Site [16:59:43] sudo cookbook sre.dns.admin pool drmrs [16:59:47] follow prompt and that's it [17:00:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.remove-downtime for cr[1-2]-drmrs IPv6,cr[1-2]-drmrs.mgmt [17:00:05] oh, when I was young, it was a puppet patch :D [17:00:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cr[1-2]-drmrs IPv6,cr[1-2]-drmrs.mgmt [17:00:13] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [17:00:52] !log kamila@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool site drmrs [reason: no reason specified, ] [17:00:55] RESOLVED: [2x] SystemdUnitFailed: netbox_ganeti_drmrs01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:00:58] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage [17:01:03] !log kamila@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site drmrs [reason: no reason specified, ] [17:01:58] sukhe: done [17:02:19] this is much better than a puppet patch \o/ [17:02:21] Raine: nice thanks! 
[17:02:27] RESOLVED: [6x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:02:35] yeah, we are encouraging everyone to run this when not in an emergency and hence the ask [17:02:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [17:02:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [17:02:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [17:02:45] excellent [17:02:56] ^ less excellent [17:03:00] !log sukhe@cumin1003 START - Cookbook sre.hosts.remove-downtime for 39 hosts [17:03:22] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 39 hosts [17:03:24] RESOLVED: [27x] JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:03:24] RESOLVED: [13x] ProbeDown: Service ganeti6001:1811 has failed probes (tcp_ganeti_noded_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:03:35] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on pc1014.eqiad.wmnet with reason: C/D Migration [17:03:54] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11389033 (10Marostegui) >>! 
In T405942#11388802, @Ladsgroup wrote: > Please ping me before moving of pc1014 so I depool pc4 cluster from rotation. pc4 was... [17:04:28] Raine: Just do a roll restart of mobileapps [17:04:45] FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:04:56] ok, let's look [17:05:10] !log filippo@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [17:05:40] it's just one node, wonder why it says widespread [17:06:14] claime: re mobileapps, will do, but any idea why it happened? [17:06:29] mobileapps has been flapping for close to a week [17:06:33] https://phabricator.wikimedia.org/T410296 [17:06:51] oh, okay [17:07:01] thanks for the context hnowlan [17:07:02] sukhe: how many hosts are there in drmrs though? [17:07:18] Graph says 12.5% failed [17:07:34] claime: 39, only one was failing when I looked at least, but maybe there were more before? [17:07:38] ah ok, which graph is that? 
[17:07:50] Widespread puppet failure is >3% [17:07:57] sukhe: https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 [17:08:02] The one linked in the alert [17:08:13] ha ok, right, I thought there was something in puppetboard too and I never knew [17:08:15] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Testing all optimize (T410401)', diff saved to https://phabricator.wikimedia.org/P85394 and previous config saved to /var/cache/conftool/dbconfig/20251119-170814-ladsgroup.json [17:08:19] T410401: Optimize all the things (=MySQL tables) - https://phabricator.wikimedia.org/T410401 [17:08:34] puppetboard was showing just one [17:08:36] it doesn't take that many new hosts to fail to make it "widespread" because the baseline is already close to the threshold [17:08:40] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [17:08:46] but I am guessing it was transient [17:08:55] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [17:08:55] yeah, 1 host is already 2.5% basically [17:09:08] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:09:11] so widespread means "1" :) [17:09:13] anyway should clear up now [17:09:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:09:54] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [17:10:02] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [17:10:12] !log kamila@deploy2002 
helmfile [codfw] START helmfile.d/services/mobileapps: apply [17:10:19] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [17:12:21] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [17:12:29] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [17:13:21] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: sync [17:14:08] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on dbstore1007.eqiad.wmnet with reason: C/D Migration [17:14:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203191 (https://phabricator.wikimedia.org/T409776) (owner: 10Aaron Schulz) [17:14:44] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: sync [17:16:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool pc4 T405942', diff saved to https://phabricator.wikimedia.org/P85395 and previous config saved to /var/cache/conftool/dbconfig/20251119-171622-marostegui.json [17:16:28] T405942: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942 [17:16:44] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1008-dev.eqiad.wmnet with OS bookworm [17:17:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11389068 (10Marostegui) Repooled pc4 as Rob confirmed pc1014 has been moved. [17:17:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... 
[17:17:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [17:17:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [17:21:19] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host releases1003.eqiad.wmnet with OS bookworm [17:22:20] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on moss-be1002.eqiad.wmnet with reason: C/D Migration [17:22:39] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11389081 (10Papaul) I think a am wrong on the public vlan for rack 22. We will not be re-imaging the servers in that rack with public vlan just changing the ne... [17:24:07] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11389084 (10Papaul) @ayounsi for the feed back i will work on it [17:30:54] (03CR) 10Andrea Denisse: [C:03+1] "The PCC results LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1207174 (https://phabricator.wikimedia.org/T402613) (owner: 10Awight) [17:32:58] !log wikikube c6 hosts depooling for migration [17:33:00] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1260-1269].eqiad.wmnet [17:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389152 (10RobH) Depooling wikikube in rack C6: sudo cookbook sre.k8s.pool-depool-node -t T405950 -r 'network migration' --k8s-cluster wikikube-eqiad depool wikikube-worker1... [17:37:36] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2144 (ms2) memory error - https://phabricator.wikimedia.org/T410480#11389156 (10Jhancock.wm) @Marostegui i rotated DIMM_A6 with DIMM_A10 to see if the error follows the stick. unfortunately, we do have to wait for it to happen again to diagnose it. Since the cpu error... [17:38:24] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on releases1003.eqiad.wmnet with reason: host reimage [17:38:46] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1260-1269].eqiad.wmnet [17:38:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389160 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1260-1269].eqiad.wmnet completed: - wikikub... 
[17:39:17] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1036,1051-1052,1054-1055,1083].eqiad.wmnet [17:42:17] (03PS1) 10Alexandros Kosiaris: relforge: Clarify comment about cumin masters role [puppet] - 10https://gerrit.wikimedia.org/r/1207212 [17:42:50] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1036,1051-1052,1054-1055,1083].eqiad.wmnet [17:42:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389174 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1036,1051-1052,1054-1055,1083].eqiad.wmnet... [17:43:42] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on releases1003.eqiad.wmnet with reason: host reimage [17:43:54] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1260.eqiad.wmnet with reason: C/D Migration [17:46:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11389179 (10Jclark-ctr) 05Open→03Resolved a:05BTullis→03Jclark-ctr no additional errors I will close ticket and figure out... [17:48:02] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1261.eqiad.wmnet with reason: C/D Migration [17:48:31] 10SRE-swift-storage, 06Commons, 10media-backups: File not found: /v1/AUTH_mw/wikipedia-commons-local-public ... for 3 files - https://phabricator.wikimedia.org/T400567#11389187 (10Bugreporter) >>! In T400567#11039161, @jcrespo wrote: >>>! In T400567#11038949, @GPSLeo wrote: >> As there are likely many more o... 
[17:50:46] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1262.eqiad.wmnet with reason: C/D Migration [17:53:14] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1263.eqiad.wmnet with reason: C/D Migration [17:53:31] (03CR) 10Aaron Schulz: rest-gateway: migrate /api/rest_v1/ sandbox to Special:RestSandbox (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1190754 (https://phabricator.wikimedia.org/T396807) (owner: 10Aaron Schulz) [17:53:32] (03CR) 10CDobbins: [C:03+2] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [17:55:25] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1264.eqiad.wmnet with reason: C/D Migration [17:56:57] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1265.eqiad.wmnet with reason: C/D Migration [17:58:26] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1266.eqiad.wmnet with reason: C/D Migration [17:59:52] 06SRE, 06Infrastructure-Foundations, 10netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11389226 (10ayounsi) Thanks, looks like I missed it in my first look but it seems doable through Redfish on Dell : ` >>> dump.set('NIC.Integrated.1-2-1', 'Broadcom_LLDPNearestBridge... [18:00:05] swfrench-wmf: Time to do the MediaWiki infrastructure (UTC late) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1800). 
[18:00:36] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1267.eqiad.wmnet with reason: C/D Migration [18:01:50] 10ops-eqiad, 06DC-Ops: eno8303 on db1219:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T410536 (10phaultfinder) 03NEW [18:01:51] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1051.eqiad.wmnet with reason: C/D Migration [18:02:28] o/ [18:03:45] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [18:04:01] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1052.eqiad.wmnet with reason: C/D Migration [18:04:08] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537 (10RLazarus) 03NEW p:05Triage→03Medium [18:05:41] (03Merged) 10jenkins-bot: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [18:05:52] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1053.eqiad.wmnet with reason: C/D Migration [18:07:35] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1054.eqiad.wmnet with reason: C/D Migration [18:08:31] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11389289 (10RLazarus) (I'm not married to the specific CLI syntax in the example. Among other things, making it an --optional-flag means that the positional `host... 
[18:09:18] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1055.eqiad.wmnet with reason: C/D Migration [18:09:31] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host releases1003.eqiad.wmnet with OS bookworm [18:10:49] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1083.eqiad.wmnet with reason: C/D Migration [18:11:50] 10ops-eqiad, 06DC-Ops: eno8303 on db1220:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T410539 (10phaultfinder) 03NEW [18:12:01] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs7003*} and A:liberica [18:12:36] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1268.eqiad.wmnet with reason: C/D Migration [18:12:54] PROBLEM - Host db1219 #page is DOWN: PING CRITICAL - Packet loss = 100% [18:13:57] db1219 is in C6 - robh: are you migrating that today? [18:14:40] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1269.eqiad.wmnet with reason: C/D Migration [18:15:05] RECOVERY - Host db1219 #page is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [18:15:31] PROBLEM - MariaDB Replica IO: s1 #page on db1219 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2005, Errmsg: error reconnecting to master repl2024@db1163.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Unknown server host db1163.eqiad.wmnet (-3) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:15:50] uh oh [18:16:26] possible inadvertent cable bump? [18:16:31] RECOVERY - MariaDB Replica IO: s1 #page on db1219 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:16:32] hope so :D [18:16:56] Or in the same rack? 
[18:17:02] yeah, it's on C6 [18:17:05] yeah [18:18:09] * swfrench-wmf is going to defer any deployments planned for this infra window [18:21:19] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1036.eqiad.wmnet with reason: C/D Migration [18:21:25] !log import prometheus-rdkafka-exporter 0.4~deb13u1 into trixie-wikimedia - T401832 [18:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:30] T401832: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832 [18:23:41] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1260-1269].eqiad.wmnet [18:23:44] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1036,1051-1052,1054-1055,1083].eqiad.wmnet [18:23:50] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1260-1269].eqiad.wmnet [18:23:51] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1036,1051-1052,1054-1055,1083].eqiad.wmnet [18:23:55] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage [18:24:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389329 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1260-1269].eqiad.wmnet completed: - wikikube-... [18:24:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389330 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1036,1051-1052,1054-1055,1083].eqiad.wmnet co... 
[18:27:11] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage [18:28:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389339 (10RobH) Ran: sudo cookbook sre.k8s.pool-depool-node -t T405950 -r 'network migration' --k8s-cluster wikikube-eqiad pool wikikube-worker126[0-9].eqiad.wmnet sudo cookb... [18:32:35] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2144 (ms2) memory error - https://phabricator.wikimedia.org/T410480#11389364 (10Marostegui) Thanks - I'll repool the host tomorrow! [18:35:28] (03CR) 10BCornwall: "thetimespedia.in is meant to be redirected to the diff post per legal." [puppet] - 10https://gerrit.wikimedia.org/r/1204093 (owner: 10Ncmonitor) [18:37:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:37:42] (03PS2) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1204093 (owner: 10Ncmonitor) [18:38:04] (03PS3) 10BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1204093 (owner: 10Ncmonitor) [18:38:10] FIRING: BFDdown: BFD session down between cr2-eqiad and 208.80.154.209 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:38:39] 10ops-eqiad, 06SRE, 06DC-Ops: eno8303 on db1220:9100 has the wrong speed: 1.25e+07. 
- https://phabricator.wikimedia.org/T410539#11389429 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr during maintenance of nokia refresh in C6 today this server went down to 100mbps Replaced faulty optic returned... [18:39:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389435 (10RobH) Going to depool wikikube in rack eqiad D1 for port migrations. sudo cookbook sre.k8s.pool-depool-node -t T405950 -r 'network migration' --k8s-cluster wikikube... [18:40:21] 06SRE, 06Infrastructure-Foundations, 10netops: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan - https://phabricator.wikimedia.org/T409690#11389442 (10cmooney) 05Open→03Resolved a:03cmooney Ok this is now done across the whole estate, eqiad and... [18:40:55] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1140-1141,1160-1161].eqiad.wmnet [18:42:05] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11389451 (10Ladsgroup) [18:43:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 208.80.154.209 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:43:14] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1140-1141,1160-1161].eqiad.wmnet [18:43:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389453 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1140-1141,1160-1161].eqiad.wmnet completed:... 
[18:44:46] 10ops-eqiad, 06SRE, 06DC-Ops: eno8303 on db1219:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T410536#11389469 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr during maintenance of nokia refresh in C6 today this server went down to 100mbps Speed did return to normal shor... [18:45:46] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1140.eqiad.wmnet with reason: C/D Migration [18:46:39] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1270-1275].eqiad.wmnet [18:48:26] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11389475 (10BCornwall) To add on, what about the maintenance of package.json and the dependencies that it pulls in? [18:49:25] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs7003*} and A:liberica [18:49:51] !log import purged 0.24+deb13u1 into trixie-wikimedia - T401832 [18:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:55] T401832: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832 [18:50:07] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1270-1275].eqiad.wmnet [18:50:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389493 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1270-1275].eqiad.wmnet completed: - wikikub... 
[18:50:56] (03CR) 10BCornwall: [C:03+2] DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1204092 (owner: 10Ncmonitor) [18:51:13] !log brett@dns1006 START - running authdns-update [18:51:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:52:13] !log brett@dns1006 END - running authdns-update [18:52:22] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11389503 (10Ladsgroup) [18:57:44] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [18:58:12] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1141.eqiad.wmnet with reason: C/D Migration [19:00:05] brennen and andre: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1900). 
[19:00:38] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1270.eqiad.wmnet with reason: C/D Migration [19:01:06] o/ [19:03:00] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1271.eqiad.wmnet with reason: C/D Migration [19:03:32] andrew@cumin2002 reimage (PID 563205) is awaiting input [19:03:48] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [19:03:52] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11389563 (10Ladsgroup) [19:04:34] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1272.eqiad.wmnet with reason: C/D Migration [19:05:03] (03CR) 10Pppery: [C:03+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1204093 (owner: 10Ncmonitor) [19:07:17] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1273.eqiad.wmnet with reason: C/D Migration [19:10:32] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1274.eqiad.wmnet with reason: C/D Migration [19:11:51] (03PS2) 10Bking: opensearch-cluster: Add cluster ro perms to 'opensearch' user, increase default num of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207194 (https://phabricator.wikimedia.org/T408012) [19:13:09] !log 1.46.0-wmf.3 train status (T408273): no current blockers, logs clean, rolling to group1 [19:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:14] T408273: 1.46.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T408273 [19:13:15] (03CR) 10CI reject: [V:04-1] opensearch-cluster: Add cluster ro perms to 'opensearch' user, increase default num of replicas 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1207194 (https://phabricator.wikimedia.org/T408012) (owner: 10Bking) [19:13:58] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1275.eqiad.wmnet with reason: C/D Migration [19:14:57] (03PS1) 10TrainBranchBot: group1 to 1.46.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207235 (https://phabricator.wikimedia.org/T408273) [19:14:59] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by brennen@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207235 (https://phabricator.wikimedia.org/T408273) (owner: 10TrainBranchBot) [19:15:46] (03Merged) 10jenkins-bot: group1 to 1.46.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207235 (https://phabricator.wikimedia.org/T408273) (owner: 10TrainBranchBot) [19:16:19] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1160.eqiad.wmnet with reason: C/D Migration [19:18:34] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1161.eqiad.wmnet with reason: C/D Migration [19:21:00] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1140-1141,1160-1161].eqiad.wmnet [19:21:06] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1140-1141,1160-1161].eqiad.wmnet [19:21:09] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1270-1275].eqiad.wmnet [19:21:16] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1270-1275].eqiad.wmnet [19:21:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389632 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host 
wikikube-worker[1140-1141,1160-1161].eqiad.wmnet completed: -... [19:21:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389633 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1270-1275].eqiad.wmnet completed: - wikikube-... [19:22:14] (03CR) 10Superpes15: [C:03+1] tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) (owner: 10Lucas Werkmeister (WMDE)) [19:23:53] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage [19:23:59] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.3 refs T408273 [19:24:04] T408273: 1.46.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T408273 [19:27:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11389651 (10RobH) Day 7 Update: * 33 hosts moved today, 44 remain * all row c wikikube migrated, some of row D wikikube migrated ** 23 wikikube hosts remain o... [19:27:49] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage [19:31:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389672 (10RobH) a:05RobH→03brouberol @brouberol, you were tagged into this task by T405950#11236474 but I don't have any feedback on the migration details for kafka-main1... 
[19:31:46] (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1204094 (owner: 10Ncmonitor) [19:31:57] (03CR) 10BCornwall: [C:03+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1204093 (owner: 10Ncmonitor) [19:34:57] FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [19:37:35] (03PS1) 10Novem Linguae: README: remove outdated advice about dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207241 [19:39:49] (03CR) 10Novem Linguae: "In response to code review comments at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1206851/1/README#10" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207241 (owner: 10Novem Linguae) [19:39:57] RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [19:40:30] (03CR) 10Novem Linguae: undeploy Extension:Capiunto (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206851 (https://phabricator.wikimedia.org/T410172) (owner: 10Novem Linguae) [19:41:58] (03PS1) 10Dzahn: admin: remove bvibber from releasers-mobile [puppet] - 10https://gerrit.wikimedia.org/r/1207243 [19:43:20] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 9525 
https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [19:46:56] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1070.eqiad.wmnet with OS trixie [19:47:26] (03PS3) 10Bking: opensearch-cluster: Add cluster ro perms to 'opensearch' user, increase default num of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207194 (https://phabricator.wikimedia.org/T408012) [19:49:26] (03PS1) 10Aude: Remove action_context from page_load events in ReadingList A/B test [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207245 (https://phabricator.wikimedia.org/T410535) [19:49:50] (03PS1) 10Aude: Remove action_context from page_load events in ReadingList A/B test [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207246 (https://phabricator.wikimedia.org/T410535) [19:49:54] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [19:50:08] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [19:50:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207246 (https://phabricator.wikimedia.org/T410535) (owner: 10Aude) [19:50:29] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [19:50:36] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [19:50:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207245 (https://phabricator.wikimedia.org/T410535) (owner: 10Aude) [19:52:10] !log denisse@deploy2002 Started deploy [librenms/librenms@d152b36]: Upgrade LibreNMS to 25.11.0 - T410519 [19:52:26] !log denisse@deploy2002 Finished deploy [librenms/librenms@d152b36]: Upgrade LibreNMS to 25.11.0 - T410519 (duration: 00m 16s) [20:01:58] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1070.eqiad.wmnet with reason: host reimage [20:07:10] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=loginwiki --logwiki=metawiki Manueldinardo08 'Renamed user 7fd4cfd08628d295620b39574c59750f' # T410545 [20:07:14] T410545: Unblock stuck global rename of Renamed user 7fd4cfd08628d295620b39574c59750f - https://phabricator.wikimedia.org/T410545 [20:07:35] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [20:07:54] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1070.eqiad.wmnet with reason: host reimage [20:09:55] (03CR) 10Kamila Součková: [C:03+1] "LGTM other than inline nits/questions" [puppet] - 10https://gerrit.wikimedia.org/r/1204073 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [20:12:51] (03PS4) 10Ssingh: O:haptcha::proxy: add new role for hCaptcha proxy VMs (bird/anycast) [puppet] - 10https://gerrit.wikimedia.org/r/1204073 (https://phabricator.wikimedia.org/T409780) [20:13:14] (03CR) 10Ssingh: O:haptcha::proxy: add new role for hCaptcha proxy VMs (bird/anycast) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1204073 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [20:16:06] (03CR) 10Bvibber: [C:03+1] "Can confirm I do not need to be in this group at this time. 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/1207243 (owner: 10Dzahn) [20:16:16] (03CR) 10Kamila Součková: [C:03+1] site.pp: reimage hcaptcha-proxy1001 to proper role [puppet] - 10https://gerrit.wikimedia.org/r/1207165 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [20:16:53] (03CR) 10Kamila Součková: [C:03+1] P:bird::anycast_monitoring: add hcaptcha-proxy.anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1204074 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [20:17:42] (03CR) 10Dzahn: gerrit: add dry run rsync (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1195437 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [20:18:09] (03CR) 10Dzahn: [C:03+2] admin: remove bvibber from releasers-mobile [puppet] - 10https://gerrit.wikimedia.org/r/1207243 (owner: 10Dzahn) [20:19:24] (03PS1) 10Kamila Součková: hcaptcha_proxy: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/1207250 [20:19:57] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207250 (owner: 10Kamila Součková) [20:33:37] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1070.eqiad.wmnet with OS trixie [20:38:57] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T2100) [21:00:05] kostajh, kemayo, AaronSchulz, and aude: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:21] hi [21:00:33] i'm here but can wait my turn [21:01:07] I'll start with mine, then [21:01:14] ok [21:01:22] Mine can be bundled in with anyone else's if you want. [21:02:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207108 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan) [21:02:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206960 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan) [21:02:47] mine can be bundled as well [21:04:01] (03Merged) 10jenkins-bot: hCaptcha: Record A/B test experiment group [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207108 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan) [21:04:06] mine can be too.  but idk how that works exactly [21:06:08] (03PS3) 10Majavah: Initial configuration for tokwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205954 (https://phabricator.wikimedia.org/T404457) [21:06:08] (03PS3) 10Majavah: Activate tokwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205955 (https://phabricator.wikimedia.org/T404457) [21:06:08] (03PS3) 10Majavah: Set up tokwiki namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205956 (https://phabricator.wikimedia.org/T404457) [21:06:08] (03PS1) 10Majavah: Allow account creation on tokwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207262 (https://phabricator.wikimedia.org/T404457) [21:06:58] (03CR) 10Majavah: Set up tokwiki namespaces (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205956 (https://phabricator.wikimedia.org/T404457) (owner: 10Majavah) [21:09:06] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for catrope - 
https://phabricator.wikimedia.org/T410473#11389974 (10Catrope) 05Open→03Resolved a:03Volans Everything works great, thanks! [21:09:08] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:10:40] (03PS2) 10Kamila Součková: hcaptcha_proxy: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/1207250 [21:10:42] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207250 (owner: 10Kamila Součková) [21:12:30] (03Merged) 10jenkins-bot: hCaptcha: Record A/B test experiment group [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206960 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan) [21:13:03] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1207108|hCaptcha: Record A/B test experiment group (T410354)]], [[gerrit:1206960|hCaptcha: Record A/B test experiment group (T410354)]] [21:13:08] T410354: hCaptcha: Enable A/B test for jawiki and zhwiki - https://phabricator.wikimedia.org/T410354 [21:15:26] FIRING: ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:16:59] FIRING: [3x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:17:48] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1207108|hCaptcha: Record A/B test experiment group (T410354)]], 
[[gerrit:1206960|hCaptcha: Record A/B test experiment group (T410354)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:20:10] !log kharlan@deploy2002 kharlan: Continuing with sync [21:21:04] (03CR) 10Kamila Součková: "I'll remove these from labs and puppet-private hiera too." [puppet] - 10https://gerrit.wikimedia.org/r/1207250 (owner: 10Kamila Součková) [21:21:44] (03PS3) 10Kamila Součková: hcaptcha_proxy: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/1207250 [21:22:08] (03PS4) 10Kamila Součková: hcaptcha_proxy: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/1207250 [21:24:19] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207108|hCaptcha: Record A/B test experiment group (T410354)]], [[gerrit:1206960|hCaptcha: Record A/B test experiment group (T410354)]] (duration: 11m 16s) [21:24:24] T410354: hCaptcha: Enable A/B test for jawiki and zhwiki - https://phabricator.wikimedia.org/T410354 [21:24:32] Syncing a security patch, then my config patch [21:25:09] (03PS1) 10Kamila Součková: hcaptcha_proxy: remove unused parameters [labs/private] - 10https://gerrit.wikimedia.org/r/1207265 [21:25:26] RESOLVED: ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:26:59] FIRING: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:28:40] (03CR) 10Kamila Součková: [C:03+1] O:haptcha::proxy: add new role for hCaptcha proxy VMs (bird/anycast) [puppet] - 
10https://gerrit.wikimedia.org/r/1204073 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh) [21:28:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [21:30:18] (03PS1) 10Aaron Schulz: rest-gateway: support REST sandbox requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207267 (https://phabricator.wikimedia.org/T396807) [21:31:05] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:31:36] syncing PrivateSettings.php now [21:31:59] RESOLVED: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:34:01] (03CR) 10Aaron Schulz: "Alternatively, I made https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1207267 to do more of this on the gateway level. 
Tha" [puppet] - 10https://gerrit.wikimedia.org/r/1190754 (https://phabricator.wikimedia.org/T396807) (owner: 10Aaron Schulz) [21:36:05] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:39:10] (03CR) 10Jforrester: [C:03+1] README: remove outdated advice about dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207241 (owner: 10Novem Linguae) [21:40:52] (03CR) 10Novem Linguae: "Can we +2 this and have it ride the train? Or does it need a backport?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207241 (owner: 10Novem Linguae) [21:40:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206830 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan) [21:41:05] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:41:44] (03Merged) 10jenkins-bot: hCaptcha: Enable A/B edit test on zhwiki and jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206830 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan) [21:41:56] (03CR) 10Jforrester: [C:03+1] "No, this is the production config repo, all merges must be immediately deployed. 
But it's not urgent to fix docs that should have been cor" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207241 (owner: 10Novem Linguae) [21:42:16] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1206830|hCaptcha: Enable A/B edit test on zhwiki and jawiki (T410354)]] [21:42:20] T410354: hCaptcha: Enable A/B test for jawiki and zhwiki - https://phabricator.wikimedia.org/T410354 [21:46:05] RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:46:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207241 (owner: 10Novem Linguae) [21:47:01] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1206830|hCaptcha: Enable A/B edit test on zhwiki and jawiki (T410354)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:47:56] I just added a README file change to the backport window if that's easy to squeeze in. If not don't worry about it. 
https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=prev&oldid=2363265 [21:49:11] !log kharlan@deploy2002 kharlan: Continuing with sync [21:49:49] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T410563, transfer main graph to lagged host) xfer wikidata_main from wdqs1015.eqiad.wmnet -> wdqs1011.eqiad.wmnet, repooling both afterwards [21:49:53] T410563: ProbeDown - https://phabricator.wikimedia.org/T410563 [21:51:47] (03PS1) 10Scott French: mobileapps: revert to 2025-10-13-122439-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207271 (https://phabricator.wikimedia.org/T410296) [21:52:34] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11390152 (10ATitkov) > Who will be responsible for security review, when this is sharing important top level domains ? @TheDJ Could it be possibly handled or at l... [21:53:11] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206830|hCaptcha: Enable A/B edit test on zhwiki and jawiki (T410354)]] (duration: 10m 55s) [21:53:16] T410354: hCaptcha: Enable A/B test for jawiki and zhwiki - https://phabricator.wikimedia.org/T410354 [21:53:27] I'm done [21:53:33] Kemayo aude over to you [21:53:37] sorry that took so long! 
[21:53:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207201 (https://phabricator.wikimedia.org/T407286) (owner: 10DLynch) [21:53:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207245 (https://phabricator.wikimedia.org/T410535) (owner: 10Aude) [21:53:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207246 (https://phabricator.wikimedia.org/T410535) (owner: 10Aude) [21:53:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207241 (owner: 10Novem Linguae) [21:54:33] thanks Kemayo! [21:54:40] (03Merged) 10jenkins-bot: README: remove outdated advice about dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207241 (owner: 10Novem Linguae) [21:54:53] kostajh: No worries. My only complaint is that there's not some vague "we can't tell you anything but here's a progress bar" for the waiting-for-a-security-patch part of it. :D [21:55:25] yeah, the process is far from ideal [21:56:12] Novem's patch lacks anything to test. aude, will you need to check anything on the testservers, or should I go ahead when it's ready? [21:56:37] i can quickly spot check on wmf.3 [21:56:54] * AaronSchulz still has https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1203191 [21:57:40] AaronSchulz: oops, sorry, I didn't realize you were here or I'd have offered to throw that in to this bundle as well. 
[21:58:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [21:59:02] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:59:06] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:59:22] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [21:59:40] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [22:00:04] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11390175 (10Volans) Just for context referencing past ideas on the topic: T327300 [22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T2200) [22:00:32] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [22:00:38] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [22:00:59] (03CR) 10Scott French: "I'm happy to give this a try today or tomorrow, or please feel free to go ahead and merge / deploy at your convenience in the interim. 
Tha" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207271 (https://phabricator.wikimedia.org/T410296) (owner: 10Scott French) [22:01:58] (03PS4) 10Bking: opensearch-cluster: Add cluster ro perms to 'opensearch' user, increase default num of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207194 (https://phabricator.wikimedia.org/T408012) [22:02:35] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply [22:02:44] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply [22:03:44] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [22:03:51] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [22:05:13] (03Merged) 10jenkins-bot: TextMatchEditCheck: undo duplicate sub-type logging [extensions/VisualEditor] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207201 (https://phabricator.wikimedia.org/T407286) (owner: 10DLynch) [22:05:14] (03Merged) 10jenkins-bot: Remove action_context from page_load events in ReadingList A/B test [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207245 (https://phabricator.wikimedia.org/T410535) (owner: 10Aude) [22:05:17] (03Merged) 10jenkins-bot: Remove action_context from page_load events in ReadingList A/B test [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207246 (https://phabricator.wikimedia.org/T410535) (owner: 10Aude) [22:05:55] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1207201|TextMatchEditCheck: undo duplicate sub-type logging (T407286)]], [[gerrit:1207245|Remove action_context from page_load events in ReadingList A/B test (T410535)]], [[gerrit:1207246|Remove action_context from page_load events in ReadingList A/B test (T410535)]], [[gerrit:1207241|README: remove outdated advice about dblists]] 
[22:06:01] T407286: Log sub-types of textmatch checks to VEFU - https://phabricator.wikimedia.org/T407286 [22:06:01] T410535: Remove action_context from ReadingLists AB test page_load event - https://phabricator.wikimedia.org/T410535 [22:07:55] (03PS1) 10Bvibber: Fix wgMediaViewerThumbnailBucketSizes to match wgThumbnailSteps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207273 (https://phabricator.wikimedia.org/T372165) [22:08:56] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [22:09:00] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [22:10:56] !log kemayo@deploy2002 aude, kemayo, novemlinguae: Backport for [[gerrit:1207201|TextMatchEditCheck: undo duplicate sub-type logging (T407286)]], [[gerrit:1207245|Remove action_context from page_load events in ReadingList A/B test (T410535)]], [[gerrit:1207246|Remove action_context from page_load events in ReadingList A/B test (T410535)]], [[gerrit:1207241|README: remove outdated advice about dblists]] synced to the tests [22:10:57] ervers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:11:02] T407286: Log sub-types of textmatch checks to VEFU - https://phabricator.wikimedia.org/T407286 [22:11:03] T410535: Remove action_context from ReadingLists AB test page_load event - https://phabricator.wikimedia.org/T410535 [22:11:04] checking [22:12:36] looks good [22:12:48] Excellent, continuing the sync. [22:12:51] !log kemayo@deploy2002 aude, kemayo, novemlinguae: Continuing with sync [22:12:53] thank you! 
[22:14:52] 06SRE, 10Phabricator: Replace deprecated Phabricator Conduit API call by @ProdPasteBot with its stable equivalent - https://phabricator.wikimedia.org/T410572 (10Aklapper) 03NEW p:05Triage→03Low [22:15:35] 06SRE, 10Phabricator: Replace deprecated Phabricator Conduit API call by @ProdPasteBot with its stable equivalent - https://phabricator.wikimedia.org/T410572#11390254 (10Aklapper) [22:16:13] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860 [22:16:13] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860 [22:16:17] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860 [22:16:52] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207201|TextMatchEditCheck: undo duplicate sub-type logging (T407286)]], [[gerrit:1207245|Remove action_context from page_load events in ReadingList A/B test (T410535)]], [[gerrit:1207246|Remove action_context from page_load events in ReadingList A/B test (T410535)]], [[gerrit:1207241|README: remove outdated advice about dblists]] (duration: 10m 57s) [22:16:58] T407286: Log sub-types of textmatch checks to VEFU - https://phabricator.wikimedia.org/T407286 [22:16:58] T410535: Remove action_context from ReadingLists AB test page_load event - https://phabricator.wikimedia.org/T410535 [22:22:21] (03PS1) 10Arlolra: Deploy Parsoid Read Views to 18 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207276 (https://phabricator.wikimedia.org/T410564) [22:22:47] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster 
search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860 [22:22:50] Kemayo: done? [22:22:52] T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860 [22:25:29] 06SRE, 10SRE-swift-storage, 10Infrastructure Security, 06serviceops, and 5 others: October 2025 Bullseye reboots: Search Platform-owned hosts - https://phabricator.wikimedia.org/T410573 (10bking) 03NEW [22:25:33] (03CR) 10Subramanya Sastry: [C:03+1] Deploy Parsoid Read Views to 18 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207276 (https://phabricator.wikimedia.org/T410564) (owner: 10Arlolra) [22:28:53] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.reboot [22:29:54] 06SRE, 10SRE-swift-storage, 10Infrastructure Security, 06serviceops, and 5 others: October 2025 Bullseye reboots: Search Platform-owned hosts - https://phabricator.wikimedia.org/T410573#11390286 (10bking) [22:32:00] AaronSchulz: yes, done. [22:32:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aaron@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203191 (https://phabricator.wikimedia.org/T409776) (owner: 10Aaron Schulz) [22:32:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11390294 (10RobH) Migration Update: Only 3 #data-persistence hosts remain for migration: pc101[678]. Chatted with @marosgui earlier in IRC and he'll be o... 
[22:33:55] (03Merged) 10jenkins-bot: Sandbox cleanup for the Wikimedia REST APIs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203191 (https://phabricator.wikimedia.org/T409776) (owner: 10Aaron Schulz) [22:34:26] !log aaron@deploy2002 Started scap sync-world: Backport for [[gerrit:1203191|Sandbox cleanup for the Wikimedia REST APIs (T409776 T402426)]] [22:34:31] T409776: Rename & clean up Wikimedia RESTBase APIs - https://phabricator.wikimedia.org/T409776 [22:34:32] T402426: OpenAPI description for Wikimedia REST API links to the wrong on-wiki documentation - https://phabricator.wikimedia.org/T402426 [22:37:03] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [22:37:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:39:27] !log aaron@deploy2002 aaron: Backport for [[gerrit:1203191|Sandbox cleanup for the Wikimedia REST APIs (T409776 T402426)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:39:33] T409776: Rename & clean up Wikimedia RESTBase APIs - https://phabricator.wikimedia.org/T409776 [22:39:33] T402426: OpenAPI description for Wikimedia REST API links to the wrong on-wiki documentation - https://phabricator.wikimedia.org/T402426 [22:39:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:42:03] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [22:43:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and Hurricane Electric (2001:7f8:54:5::13) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:43:40] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T410563, transfer main graph to lagged host) xfer wikidata_main from wdqs1015.eqiad.wmnet -> wdqs1011.eqiad.wmnet, repooling both afterwards [22:43:44] T410563: ProbeDown - https://phabricator.wikimedia.org/T410563 [22:44:05] !log aaron@deploy2002 aaron: Continuing with sync [22:48:09] !log aaron@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203191|Sandbox cleanup for the Wikimedia REST APIs (T409776 T402426)]] (duration: 13m 43s) [22:48:15] T409776: Rename & clean up Wikimedia RESTBase APIs - https://phabricator.wikimedia.org/T409776 [22:48:15] T402426: OpenAPI description for Wikimedia REST API links to the wrong on-wiki documentation - https://phabricator.wikimedia.org/T402426 [22:49:10] * AaronSchulz is done [22:49:20] !log ryankemper@cumin2002 END 
(FAIL) - Cookbook sre.wdqs.reboot (exit_code=99) [22:49:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11390359 (10RobH) @BTullis, We're now down to 44 hosts overall to migrate, and 12 of those belong to your team. Please... [22:51:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:55:04] (03PS1) 10Ryan Kemper: elastic: reboot should check uptime not jvm start time [cookbooks] - 10https://gerrit.wikimedia.org/r/1207280 (https://phabricator.wikimedia.org/T410577) [22:57:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: eqiad row C/D Observability host migrations - https://phabricator.wikimedia.org/T405946#11390399 (10RobH) p:05Triage→03High @herron, We've migrated 9 of the 10 #observability hosts. We're now only left with alert1002 which the notes detail will require s... [22:59:50] (03CR) 10Ryan Kemper: elastic: reboot should check uptime not jvm start time (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1207280 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [22:59:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T2300) [23:00:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11390409 (10RobH) >>! 
In T405950#11238805, @Scott_French wrote: > conf1009 is (1) a member of eqiad main-etcd cluster, so clients will attempt to issue writes to it, (2) the ups... [23:01:46] (03CR) 10CI reject: [V:04-1] elastic: reboot should check uptime not jvm start time [cookbooks] - 10https://gerrit.wikimedia.org/r/1207280 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [23:06:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11390432 (10RobH) Please note we didn't get to these two today, will do tomorrow! [23:08:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 07Essential-Work: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948#11390447 (10RobH) 05Open→03Resolved Please note all hosts listed on this task have been migrated. [23:23:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and Hurricane Electric (2001:7f8:54:5::13) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [23:48:16] (03PS1) 10Dzahn: switch wikipedia25.org from ncredir-lb to dyna [dns] - 10https://gerrit.wikimedia.org/r/1207288 [23:48:29] (03PS1) 10RLazarus: kubernetes: Set default Envoy version to 1.32.12 [puppet] - 10https://gerrit.wikimedia.org/r/1207289 (https://phabricator.wikimedia.org/T405808) [23:57:34] (03PS4) 10RLazarus: mesh.configuration: Envoy config updates for 1.32 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202881 (https://phabricator.wikimedia.org/T409510)