[00:13:57] <wikibugs>	 SRE, collaboration-services, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11386743 (TheDJ) Who will be responsible for security review, when this is sharing important top level domains ?
[00:22:59] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[00:22:59] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[00:22:59] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[00:38:27] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for catrope - https://phabricator.wikimedia.org/T410473 (Catrope) NEW
[00:38:56] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[00:40:15] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1206987
[00:40:15] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1206987 (owner: TrainBranchBot)
[00:48:11] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1074.eqiad.wmnet']
[00:48:56] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudvirt1074.eqiad.wmnet']
[00:55:12] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1206987 (owner: TrainBranchBot)
[01:00:58] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:09:08] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:10:03] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1206989
[01:10:03] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1206989 (owner: TrainBranchBot)
[01:14:17] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 18s)
[01:18:21] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1074.eqiad.wmnet']
[01:18:49] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1074.eqiad.wmnet']
[01:23:29] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1074.eqiad.wmnet']
[01:23:47] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1074.eqiad.wmnet']
[01:35:04] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1206989 (owner: TrainBranchBot)
[01:35:55] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1074.eqiad.wmnet with OS trixie
[01:35:55] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:37:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:50:11] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1074.eqiad.wmnet with reason: host reimage
[01:53:19] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1074.eqiad.wmnet with reason: host reimage
[02:34:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[02:34:44] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[02:34:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[02:51:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[02:59:28] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1074.eqiad.wmnet with OS trixie
[03:04:44] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[03:04:44] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[03:04:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[03:06:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:29:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[04:29:44] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[04:29:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[04:38:56] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[04:44:44] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[04:44:44] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[04:44:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[05:08:24] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:08] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:21:21] <wikibugs>	 (PS1) Kevin Bazira: ml-services: deploy revertrisk-wikidata to the revision-models ns prod [deployment-charts] - https://gerrit.wikimedia.org/r/1207027 (https://phabricator.wikimedia.org/T406179)
[05:24:26] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:29:26] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:31:59] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:33:24] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:36:59] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:37:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:46:59] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:16:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:25:10] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool ms3 T405942', diff saved to https://phabricator.wikimedia.org/P85372 and previous config saved to /var/cache/conftool/dbconfig/20251119-062509-marostegui.json
[06:25:21] <stashbot>	 T405942: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942
[06:25:28] <icinga-wm>	 PROBLEM - Host db2144 #page is DOWN: PING CRITICAL - Packet loss = 100%
[06:25:38] <marostegui>	 mmm what
[06:25:41] <marostegui>	 !incidents
[06:25:42] <sirenbot>	 7027 (UNACKED)  Host db2144 (paged)
[06:25:42] <sirenbot>	 7024 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[06:25:42] <sirenbot>	 7025 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[06:25:42] <sirenbot>	 7023 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[06:25:42] <sirenbot>	 7017 (RESOLVED)  Host db1221 (paged)
[06:25:43] <sirenbot>	 7022 (RESOLVED)  db1233 (paged)/MariaDB Replica Lag: s2 (paged)
[06:25:43] <sirenbot>	 7021 (RESOLVED)  db1259 (paged)/MariaDB Replica Lag: s2 (paged)
[06:25:43] <sirenbot>	 7020 (RESOLVED)  db1259 (paged)/MariaDB Replica IO: s2 (paged)
[06:25:43] <sirenbot>	 7019 (RESOLVED)  db1258 (paged)/MariaDB Replica IO: x3 (paged)
[06:25:44] <sirenbot>	 7018 (RESOLVED)  db1258 (paged)/MariaDB Replica Lag: x3 (paged)
[06:25:44] <sirenbot>	 7016 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[06:25:45] <sirenbot>	 7015 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[06:25:45] <sirenbot>	 7014 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) network noc (cr3-eqsin.wikimedia.org)
[06:25:46] <sirenbot>	 7013 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) network noc (cr1-codfw.wikimedia.org)
[06:25:49] <marostegui>	 !ack 7027
[06:26:14] <icinga-wm>	 PROBLEM - MariaDB Replica IO: ms2 on db1151 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db2144.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2144.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:26:14] <marostegui>	 !incidents
[06:26:15] <sirenbot>	 7027 (ACKED)  Host db2144 (paged)
[06:26:15] <sirenbot>	 7024 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[06:26:15] <sirenbot>	 7025 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[06:26:15] <sirenbot>	 7023 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[06:26:15] <sirenbot>	 7017 (RESOLVED)  Host db1221 (paged)
[06:26:16] <sirenbot>	 7022 (RESOLVED)  db1233 (paged)/MariaDB Replica Lag: s2 (paged)
[06:26:16] <sirenbot>	 7021 (RESOLVED)  db1259 (paged)/MariaDB Replica Lag: s2 (paged)
[06:26:16] <sirenbot>	 7020 (RESOLVED)  db1259 (paged)/MariaDB Replica IO: s2 (paged)
[06:26:16] <sirenbot>	 7019 (RESOLVED)  db1258 (paged)/MariaDB Replica IO: x3 (paged)
[06:26:17] <sirenbot>	 7018 (RESOLVED)  db1258 (paged)/MariaDB Replica Lag: x3 (paged)
[06:26:17] <sirenbot>	 7016 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[06:26:18] <sirenbot>	 7015 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[06:26:34] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool ms3 T405942', diff saved to https://phabricator.wikimedia.org/P85373 and previous config saved to /var/cache/conftool/dbconfig/20251119-062634-marostegui.json
[06:26:43] <marostegui>	 I will depool ms2
[06:27:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 2.885% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:27:29] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool ms2', diff saved to https://phabricator.wikimedia.org/P85374 and previous config saved to /var/cache/conftool/dbconfig/20251119-062728-marostegui.json
[06:28:17] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2144.codfw.wmnet,db1151.eqiad.wmnet with reason: db2144 went down
[06:32:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 17.31% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:33:20] <wikibugs>	 ops-codfw, DBA, DC-Ops: db2144 memory error - https://phabricator.wikimedia.org/T410480 (Marostegui) NEW
[06:33:31] <wikibugs>	 ops-codfw, DBA, DC-Ops: db2144 memory error - https://phabricator.wikimedia.org/T410480#11386969 (Marostegui) p:Triage→Medium
[06:34:13] <wikibugs>	 ops-codfw, DBA, DC-Ops: db2144 memory error - https://phabricator.wikimedia.org/T410480#11386972 (Marostegui) I rebooted the host via idrac
[06:34:37] <icinga-wm>	 RECOVERY - Host db2144 #page is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms
[06:34:43] <wikibugs>	 ops-codfw, DBA, DC-Ops: db2144 memory error - https://phabricator.wikimedia.org/T410480#11386973 (Marostegui) ms2 is depooled
[06:35:23] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool pc1 after network maint', diff saved to https://phabricator.wikimedia.org/P85375 and previous config saved to /var/cache/conftool/dbconfig/20251119-063522-marostegui.json
[06:36:14] <icinga-wm>	 RECOVERY - MariaDB Replica IO: ms2 on db1151 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:37:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[06:37:44] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[06:37:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[06:39:02] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11386976 (WMDECyn) Chandra's position is fixed till maximum 31st Jan 2026
[06:39:03] <wikibugs>	 ops-codfw, DBA, DC-Ops: db2144 memory error - https://phabricator.wikimedia.org/T410480#11386977 (Marostegui) ` 2025-11-19T06:23:24.670274+00:00 db2144 kernel: [8348456.319422] mce: Uncorrected hardware memory error in user-access at 2062ea3d80 2025-11-19T06:23:24.670289+00:00 db2144 kernel: [8348456...
[06:40:33] <wikibugs>	 (PS1) Marostegui: ms2: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1207051 (https://phabricator.wikimedia.org/T410480)
[06:40:56] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1223 with weight 0 T410283', diff saved to https://phabricator.wikimedia.org/P85376 and previous config saved to /var/cache/conftool/dbconfig/20251119-064055-marostegui.json
[06:41:00] <stashbot>	 T410283: Switchover s3 master (db1189 -> db1223) - https://phabricator.wikimedia.org/T410283
[06:41:16] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s3 T410283
[06:41:58] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db1223 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1206406 (https://phabricator.wikimedia.org/T410283) (owner: Gerrit maintenance bot)
[06:47:36] <marostegui>	 !log Starting s3 eqiad failover from db1189 to db1223 - T410283
[06:47:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:40] <stashbot>	 T410283: Switchover s3 master (db1189 -> db1223) - https://phabricator.wikimedia.org/T410283
[06:47:55] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1223 to s3 primary T410283', diff saved to https://phabricator.wikimedia.org/P85377 and previous config saved to /var/cache/conftool/dbconfig/20251119-064755-marostegui.json
[06:48:38] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1189 T410283', diff saved to https://phabricator.wikimedia.org/P85378 and previous config saved to /var/cache/conftool/dbconfig/20251119-064838-marostegui.json
[06:48:57] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1189 gradually with 4 steps - Repooling after switchover
[06:51:15] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:52:09] <logmsgbot>	 !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1189 gradually with 4 steps - Repooling after switchover
[06:52:20] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1189 gradually with 4 steps - Repooling after switchover
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T0700)
[07:04:28] <wikibugs>	 ops-codfw, DBA, DC-Ops, Patch-For-Review: db2144 (ms2) memory error - https://phabricator.wikimedia.org/T410480#11387011 (Marostegui)
[07:05:31] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet,pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: network maintenance
[07:06:56] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool pc4', diff saved to https://phabricator.wikimedia.org/P85380 and previous config saved to /var/cache/conftool/dbconfig/20251119-070656-marostegui.json
[07:07:20] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11387015 (Marostegui) @Jclark-ctr  db1189 pc1014  Those can be moved anytime when you get to the DC
[07:16:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:21:33] <wikibugs>	 (PS3) DCausse: cirrus: index field to sort on title [mediawiki-config] - https://gerrit.wikimedia.org/r/1205130 (https://phabricator.wikimedia.org/T40403)
[07:21:43] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1205130 (https://phabricator.wikimedia.org/T40403) (owner: DCausse)
[07:37:46] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1189 gradually with 4 steps - Repooling after switchover
[07:41:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:55:27] <icinga-wm>	 PROBLEM - Thanos swift https on thanos-fe1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos
[07:57:48] <wikibugs>	 (PS10) Arnaudb: apt-staging: logging and metrics [puppet] - https://gerrit.wikimedia.org/r/1205162 (https://phabricator.wikimedia.org/T409833)
[07:57:48] <wikibugs>	 (CR) Arnaudb: "this change brings a bit more readability on the log output, and adds metrics to allow us to create alerts and be notified when something " [puppet] - https://gerrit.wikimedia.org/r/1205162 (https://phabricator.wikimedia.org/T409833) (owner: Arnaudb)
[07:58:17] <icinga-wm>	 RECOVERY - Thanos swift https on thanos-fe1005 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Thanos
[07:58:44] <wikibugs>	 SRE, cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11387076 (fgiunchedi)
[07:59:54] <moritzm>	 !log started OSM import on maps-test2001 T409528
[07:59:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:58] <wikibugs>	 (PS3) Arnaudb: apt-staging: error handling for reprepro [puppet] - https://gerrit.wikimedia.org/r/1206887 (https://phabricator.wikimedia.org/T409832)
[07:59:58] <stashbot>	 T409528: Setup a maps staging DB - https://phabricator.wikimedia.org/T409528
[07:59:58] <wikibugs>	 (CR) Arnaudb: "this change brings a logic stem to plug onto if we want to add email notification in case of reprepro issues. It currently increments a me" [puppet] - https://gerrit.wikimedia.org/r/1206887 (https://phabricator.wikimedia.org/T409832) (owner: Arnaudb)
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T0800).
[08:00:05] <jouncebot>	 dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:13] <dcausse>	 o/
[08:00:16] <dcausse>	 I can deploy
[08:01:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:02:12] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1205130 (https://phabricator.wikimedia.org/T40403) (owner: DCausse)
[08:03:01] <wikibugs>	 (Merged) jenkins-bot: cirrus: index field to sort on title [mediawiki-config] - https://gerrit.wikimedia.org/r/1205130 (https://phabricator.wikimedia.org/T40403) (owner: DCausse)
[08:04:10] <logmsgbot>	 !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1205130|cirrus: index field to sort on title (T40403)]]
[08:04:16] <stashbot>	 T40403: Sortable search results - https://phabricator.wikimedia.org/T40403
[08:06:26] <wikibugs>	 (CR) Brouberol: [C:+1] opensearch on k8s: Add CODFW environment to helmfiles [deployment-charts] - https://gerrit.wikimedia.org/r/1206973 (https://phabricator.wikimedia.org/T408643) (owner: Bking)
[08:08:29] <wikibugs>	 (PS2) Brouberol: dse-k8s-codfw: set minimum resources for opensearch namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1206969 (https://phabricator.wikimedia.org/T408643) (owner: Bking)
[08:08:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1169.eqiad.wmnet']
[08:09:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1169.eqiad.wmnet']
[08:09:32] <logmsgbot>	 !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1205130|cirrus: index field to sort on title (T40403)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:09:37] <stashbot>	 T40403: Sortable search results - https://phabricator.wikimedia.org/T40403
[08:12:31] <logmsgbot>	 !log filippo@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[08:12:36] <logmsgbot>	 !log dcausse@deploy2002 dcausse: Continuing with sync
[08:13:18] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[08:15:06] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11387129 (Marostegui) @Jclark-ctr  db1189 pc1014  Those can be moved anytime when you get to the DC
[08:15:57] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11387131 (Volans)
[08:17:52] <logmsgbot>	 !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1205130|cirrus: index field to sort on title (T40403)]] (duration: 13m 42s)
[08:17:56] <stashbot>	 T40403: Sortable search results - https://phabricator.wikimedia.org/T40403
[08:21:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:22:43] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11387139 (Marostegui) @Jclark-ctr I think we scheduled db1189 for today but it was done yesterday? The spreadsheet marks it as done and also I can see: `...
[08:22:46] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for catrope - https://phabricator.wikimedia.org/T410473#11387140 (Volans) p:Triage→Medium
[08:23:53] <wikibugs>	 (PS3) Ryan Kemper: wdqs: add availability sli recording rules [puppet] - https://gerrit.wikimedia.org/r/1202049 (https://phabricator.wikimedia.org/T393966)
[08:24:27] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for catrope - https://phabricator.wikimedia.org/T410473#11387144 (Volans)
[08:24:31] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11387145 (jcrespo) Based on the spreadsheet, no more interruptions are expected on   ` backup1006 backup1007 ms-backup1002 `  So I will restart eqiad med...
[08:25:19] <wikibugs>	 (CR) MVernon: "@bcornwall@wikimedia.org sorry, I was away last week and missed this; the change message says it's not fixed in Debian and cites Debian bu" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1204941 (https://phabricator.wikimedia.org/T352003) (owner: BCornwall)
[08:25:59] <wikibugs>	 (CR) Marostegui: [C:+2] ms2: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1207051 (https://phabricator.wikimedia.org/T410480) (owner: Marostegui)
[08:27:41] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for catrope - https://phabricator.wikimedia.org/T410473#11387152 (Volans) Adding #data-engineering for visibility, no approval required for WMF staff. Pending approval from @SCherukuwada
[08:28:37] <wikibugs>	 SRE, Infrastructure-Foundations, netops: lsw1-d6-eqiad outage Nov 18 2025 - https://phabricator.wikimedia.org/T410455#11387153 (ayounsi)
[08:29:23] <wikibugs>	 SRE, Infrastructure-Foundations, netops: lsw1-d6-eqiad outage Nov 18 2025 - https://phabricator.wikimedia.org/T410455#11387155 (ayounsi) Thanks for the great writeup. We should unfortunately look at upgrading Netbox first. TBD if we need to spend time on a workaround.
[08:30:00] <logmsgbot>	 filippo@cumin1003 reimage (PID 2877688) is awaiting input
[08:30:48] <wikibugs>	 (03CR) 10Ayounsi: UEFI: dup partition on MD RAID boxes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway)
[08:34:50] <wikibugs>	 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10vm-requests: Site: codfw   1 VM request for codfw1dev CAS test/dev, hostname: cloudidp2001-dev - https://phabricator.wikimedia.org/T410294#11387197 (10dcaro) p:05Triage→03Medium
[08:35:15] <logmsgbot>	 !log jynus@cumin1003 START - Cookbook sre.hosts.remove-downtime for backup[1006-1007].eqiad.wmnet,ms-backup[1001-1002].eqiad.wmnet
[08:35:17] <logmsgbot>	 !log jynus@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for backup[1006-1007].eqiad.wmnet,ms-backup[1001-1002].eqiad.wmnet
[08:38:56] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[08:39:59] <kostajh>	 dcausse: are you done deploying?
[08:40:13] <dcausse>	 kostajh: yes
[08:40:31] <kostajh>	 ok, I will deploy some patches
[08:41:00] <wikibugs>	 (03CR) 10Kosta Harlan: "recheck" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206960 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan)
[08:42:03] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11387257 (10ayounsi) Lots great thanks !  Not sure how best to show it on the diagram, but we also need to remove the 10G link between cr3 and cr4. Maybe you can...
[08:42:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206906 (https://phabricator.wikimedia.org/T410024) (owner: 10Kosta Harlan)
[08:43:20] <wikibugs>	 (03PS4) 10Ryan Kemper: wdqs: add availability sli recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1202049 (https://phabricator.wikimedia.org/T393966)
[08:45:11] <wikibugs>	 (03CR) 10Ryan Kemper: wdqs: add availability sli recording rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1202049 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper)
[08:51:37] <wikibugs>	 (03PS1) 10Gehel: wdqs: Do not create task on failure of the WDQS LDF endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1207107 (https://phabricator.wikimedia.org/T408853)
[08:51:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:53:39] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Validate sitekey of /siteverify API call [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206906 (https://phabricator.wikimedia.org/T410024) (owner: 10Kosta Harlan)
[08:54:13] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1206906|hCaptcha: Validate sitekey of /siteverify API call (T410024)]]
[08:54:17] <stashbot>	 T410024: ConfirmEdit hCaptcha: Verify sitekey in `siteverify` response was the sitekey given to the client as part of validating the captcha - https://phabricator.wikimedia.org/T410024
[08:56:42] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptcha: Record A/B test experiment group [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207108 (https://phabricator.wikimedia.org/T410354)
[08:56:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:57:39] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for catrope - https://phabricator.wikimedia.org/T410473#11387290 (10SCherukuwada) Manager approves.
[08:58:11] <logmsgbot>	 !log filippo@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[08:58:47] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1206906|hCaptcha: Validate sitekey of /siteverify API call (T410024)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:59:55] <wikibugs>	 (03PS1) 10Itamar Givon: Replace 'let' with arithmetic expansion [dumps] - 10https://gerrit.wikimedia.org/r/1207109 (https://phabricator.wikimedia.org/T406044)
[08:59:57] <wikibugs>	 (03PS1) 10Itamar Givon: Clean up existing symlink before creating a new one [dumps] - 10https://gerrit.wikimedia.org/r/1207110 (https://phabricator.wikimedia.org/T406044)
[08:59:59] <wikibugs>	 (03PS1) 10Itamar Givon: Restore strict error handling [dumps] - 10https://gerrit.wikimedia.org/r/1207111 (https://phabricator.wikimedia.org/T406044)
[09:00:05] <jouncebot>	 brennen and andre: Your horoscope predicts another MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T0900).
[09:00:44] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[09:01:25] <kostajh>	 andre: still finishing up some backports, is it ok to continue for another 30 minutes?
[09:01:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:04:40] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207108 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan)
[09:04:45] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206906|hCaptcha: Validate sitekey of /siteverify API call (T410024)]] (duration: 10m 32s)
[09:04:49] <stashbot>	 T410024: ConfirmEdit hCaptcha: Verify sitekey in `siteverify` response was the sitekey given to the client as part of validating the captcha - https://phabricator.wikimedia.org/T410024
[09:05:07] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206960 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan)
[09:05:56] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206830 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan)
[09:06:02] <kostajh>	 actually, I'll leave the deployments I have for later
[09:06:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:09:08] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:10:00] <wikibugs>	 (03PS5) 10Tiziano Fogli: metamonitoring/icinga: generate contacts list [puppet] - 10https://gerrit.wikimedia.org/r/1206886 (https://phabricator.wikimedia.org/T393625)
[09:10:01] <wikibugs>	 (03PS1) 10Tiziano Fogli: metamonitoring/icinga: trigger pages only for the active instance [puppet] - 10https://gerrit.wikimedia.org/r/1207113 (https://phabricator.wikimedia.org/T393625)
[09:10:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] metamonitoring/icinga: generate contacts list [puppet] - 10https://gerrit.wikimedia.org/r/1206886 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli)
[09:16:45] <wikibugs>	 (03PS3) 10Tiziano Fogli: metamonitoring/icinga: suppress script-managed notifications and pages [puppet] - 10https://gerrit.wikimedia.org/r/1206884 (https://phabricator.wikimedia.org/T393625)
[09:16:45] <wikibugs>	 (03PS4) 10Tiziano Fogli: metamonitoring/icinga: add smtp settings to config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1206885 (https://phabricator.wikimedia.org/T393625)
[09:16:45] <wikibugs>	 (03PS6) 10Tiziano Fogli: metamonitoring/icinga: generate contacts list [puppet] - 10https://gerrit.wikimedia.org/r/1206886 (https://phabricator.wikimedia.org/T393625)
[09:16:46] <wikibugs>	 (03PS2) 10Tiziano Fogli: metamonitoring/icinga: trigger pages only for the active instance [puppet] - 10https://gerrit.wikimedia.org/r/1207113 (https://phabricator.wikimedia.org/T393625)
[09:20:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy1001.wikimedia.org
[09:20:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy7001.wikimedia.org
[09:20:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:23:35] <wikibugs>	 (03PS1) 10Volans: admin: add catrope to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1207114 (https://phabricator.wikimedia.org/T410473)
[09:24:22] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for catrope - https://phabricator.wikimedia.org/T410473#11387379 (10Volans)
[09:24:39] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for catrope - https://phabricator.wikimedia.org/T410473#11387381 (10Volans)
[09:24:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy1001.wikimedia.org
[09:25:54] <wikibugs>	 (03PS1) 10David Caro: toolforge:prometheus: use / as the path url instead of /tools [puppet] - 10https://gerrit.wikimedia.org/r/1207115
[09:26:44] <logmsgbot>	 jmm@cumin2002 makevm (PID 262502) is awaiting input
[09:31:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7001.wikimedia.org - jmm@cumin2002"
[09:31:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy1002.wikimedia.org
[09:31:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[09:32:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1207114 (https://phabricator.wikimedia.org/T410473) (owner: 10Volans)
[09:34:15] <logmsgbot>	 jmm@cumin2002 makevm (PID 262502) is awaiting input
[09:34:17] <wikibugs>	 (03CR) 10Volans: [C:03+2] admin: add catrope to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1207114 (https://phabricator.wikimedia.org/T410473) (owner: 10Volans)
[09:35:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy1002.wikimedia.org
[09:35:37] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1207107 (https://phabricator.wikimedia.org/T408853) (owner: 10Gehel)
[09:35:51] <wikibugs>	 (03PS2) 10Gehel: wdqs: Do not create task on failure of the WDQS LDF endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1207107 (https://phabricator.wikimedia.org/T408853)
[09:36:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7001.wikimedia.org - jmm@cumin2002"
[09:36:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:36:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7001.wikimedia.org on all recursors
[09:36:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7001.wikimedia.org on all recursors
[09:36:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:36:50] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11387399 (10ayounsi) Nice !  As the IPs are already available, we should change the cr3/cr4/mr1 loopbacks ahead of time, in a different maintenance window, so...
[09:36:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[09:37:12] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for catrope - https://phabricator.wikimedia.org/T410473#11387401 (10Volans) @Catrope patch merged, will be live within ~30 minutes. Kerberos principal created, you should have received an email about it with in...
[09:37:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:38:13] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206936 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno)
[09:38:19] <wikibugs>	 (03CR) 10Gehel: [C:03+2] wdqs: Do not create task on failure of the WDQS LDF endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1207107 (https://phabricator.wikimedia.org/T408853) (owner: 10Gehel)
[09:42:08] <logmsgbot>	 jmm@cumin2002 makevm (PID 262502) is awaiting input
[09:44:03] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[09:44:11] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=93) for new host hcaptcha-proxy7001.wikimedia.org
[09:44:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy2001.wikimedia.org
[09:48:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy2001.wikimedia.org
[09:51:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:52:38] <wikibugs>	 (03PS1) 10Sergio Gimeno: fix(MigrateMentorStatusAway): ensure migration respects date format [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207118 (https://phabricator.wikimedia.org/T409170)
[09:52:49] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207118 (https://phabricator.wikimedia.org/T409170) (owner: 10Sergio Gimeno)
[09:56:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy2002.wikimedia.org
[09:56:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:56:40] <wikibugs>	 06SRE, 10Bitu, 06Infrastructure-Foundations: Live validation of usernames - https://phabricator.wikimedia.org/T345168#11387450 (10Tacsipacsi)
[09:59:40] <wikibugs>	 06SRE, 10SRE-Access-Requests: Grant Access to ops-limited for matthieulec - https://phabricator.wikimedia.org/T410291#11387469 (10MLechvien-WMF) Thanks! I'm now able to SSH to Bastion, so it seems fine to close this.
[10:00:19] <wikibugs>	 (03PS1) 10Brouberol: airflow: update the base image to include the opensearch provider [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207121 (https://phabricator.wikimedia.org/T408238)
[10:00:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy2002.wikimedia.org
[10:00:36] <wikibugs>	 06SRE, 10SRE-Access-Requests: Grant Access to ops-limited for matthieulec - https://phabricator.wikimedia.org/T410291#11387484 (10MLechvien-WMF) 05Open→03Resolved
[10:04:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts hcaptcha-proxy7001.wikimedia.org
[10:08:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:09:01] <wikibugs>	 (03PS1) 10Muehlenhoff: sre.hosts.decommission: Fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/1207122
[10:13:45] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202768 (https://phabricator.wikimedia.org/T399199) (owner: 10Gergő Tisza)
[10:14:19] <logmsgbot>	 jmm@cumin2002 decommission (PID 285077) is awaiting input
[10:16:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:17:26] <andre>	 kostajh: see https://versions.toolforge.org/ - group0 is already on wmf.3 so there is no train :)
[10:20:00] <logmsgbot>	 jmm@cumin2002 decommission (PID 285077) is awaiting input
[10:20:00] <wikibugs>	 (03PS1) 10Muehlenhoff: EFI-enabled Partman recipe (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1207124 (https://phabricator.wikimedia.org/T410400)
[10:20:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:20:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:20:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts hcaptcha-proxy7001.wikimedia.org
[10:20:21] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11387503 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for ho...
[10:23:03] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11387510 (10ayounsi) I might have found something in Redfish for Dell: `lang=python r = spicerack.redfish('sretest2004') dump = r.scp_dump() dump.config['SystemConfiguration']['Comp...
[10:23:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[10:24:07] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on clouddb[1024-1025].eqiad.wmnet with reason: cloning
[10:25:37] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Productionize clouddb1025 [puppet] - 10https://gerrit.wikimedia.org/r/1207125 (https://phabricator.wikimedia.org/T409557)
[10:25:41] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: update the base image to include the opensearch provider [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207121 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol)
[10:26:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[10:26:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[10:26:52] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Productionize clouddb1025 [puppet] - 10https://gerrit.wikimedia.org/r/1207125 (https://phabricator.wikimedia.org/T409557) (owner: 10Marostegui)
[10:28:37] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[10:28:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[10:30:15] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[10:31:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::6687:88ff:fef2:6d48 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[10:31:36] <wikibugs>	 (03PS1) 10Marostegui: db2144: Remove note. [puppet] - 10https://gerrit.wikimedia.org/r/1207127
[10:32:12] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2144: Remove note. [puppet] - 10https://gerrit.wikimedia.org/r/1207127 (owner: 10Marostegui)
[10:32:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy7001.wikimedia.org
[10:32:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:34:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy3001.wikimedia.org
[10:35:14] <wikibugs>	 (03PS1) 10Federico Ceratto: admin: add fceratto FIDO2 U2F SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1207129
[10:35:14] <wikibugs>	 (03CR) 10Federico Ceratto: "As discussed on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1207129 (owner: 10Federico Ceratto)
[10:35:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy3001.wikimedia.org
[10:37:59] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[10:37:59] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[10:37:59] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[10:38:07] <logmsgbot>	 jmm@cumin2002 makevm (PID 299387) is awaiting input
[10:39:47] <wikibugs>	 (03CR) 10Arnaudb: "I forgot to @ any reviewer for this chance, sorry about the delay!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1195437 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[10:40:12] <wikibugs>	 (03CR) 10Arnaudb: "change*" [cookbooks] - 10https://gerrit.wikimedia.org/r/1195437 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[10:40:46] <wikibugs>	 (03PS5) 10Arnaudb: gerrit: add a local backup cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1193590 (https://phabricator.wikimedia.org/T387833)
[10:48:18] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11387577 (10ayounsi) Looks like it was a false hope, I looked at cirrussearch2115 which is showing the same behavior: ` lsw1-d3-codfw> show lldp neighbors | match xe-0/0/43       xe...
[10:48:52] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "small nitpicks but no blockers" [alerts] - 10https://gerrit.wikimedia.org/r/1201087 (https://phabricator.wikimedia.org/T408632) (owner: 10AOkoth)
[10:51:15] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:51:51] <wikibugs>	 (03CR) 10Arnaudb: "I'll need to remove the local backup logic from the failover cookbook after merging this" [cookbooks] - 10https://gerrit.wikimedia.org/r/1193590 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[10:52:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7001.wikimedia.org - jmm@cumin2002"
[10:52:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7001.wikimedia.org - jmm@cumin2002"
[10:52:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:52:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7001.wikimedia.org on all recursors
[10:52:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7001.wikimedia.org on all recursors
[10:53:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy7001.wikimedia.org - jmm@cumin2002"
[10:53:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy7001.wikimedia.org - jmm@cumin2002"
[10:55:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy4001.wikimedia.org
[10:56:14] <wikibugs>	 (03PS1) 10Brouberol: airflow-platform-eng: define the opensearch_test connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207132 (https://phabricator.wikimedia.org/T408238)
[10:56:18] <logmsgbot>	 jmm@cumin2002 makevm (PID 299387) is awaiting input
[10:58:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1207129 (owner: 10Federico Ceratto)
[10:58:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy4001.wikimedia.org
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1100)
[11:00:42] <wikibugs>	 (03PS1) 10Cyndywikime: [Growth]:Remove GELevelingUpNewNotificationsEnabled config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207133 (https://phabricator.wikimedia.org/T407431)
[11:00:51] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] airflow-platform-eng: define the opensearch_test connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207132 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol)
[11:01:46] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-platform-eng: define the opensearch_test connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207132 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol)
[11:02:41] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] admin: add fceratto FIDO2 U2F SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1207129 (owner: 10Federico Ceratto)
[11:03:25] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:03:46] <wikibugs>	 (03CR) 10Cyndywikime: "This patch is ready for review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207133 (https://phabricator.wikimedia.org/T407431) (owner: 10Cyndywikime)
[11:03:52] <logmsgbot>	 !log cgoubert@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-eqiad
[11:04:59] <wikibugs>	 (03PS2) 10Kevin Bazira: ml-services: deploy revertrisk-wikidata to the revision-models ns prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207027 (https://phabricator.wikimedia.org/T406179)
[11:05:08] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:06:17] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] "lgtm, two nits" [puppet] - 10https://gerrit.wikimedia.org/r/1207113 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli)
[11:06:40] <wikibugs>	 (03PS1) 10Brouberol: airflow-platform-eng: fix a tyop in the opensearch_test connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207136 (https://phabricator.wikimedia.org/T408238)
[11:06:53] <wikibugs>	 (03PS2) 10Brouberol: airflow-platform-eng: fix a typo in the opensearch_test connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207136 (https://phabricator.wikimedia.org/T408238)
[11:07:18] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11387603 (10ayounsi) Haven't dug yet, but maybe an option is to install Broadcom's niccli tool : https://docs.broadcom.com/docs/Linux_Niccli-233.0.198.0  Then disabling it with: ` D...
[11:08:52] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] airflow-platform-eng: fix a typo in the opensearch_test connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207136 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol)
[11:09:10] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-platform-eng: fix a typo in the opensearch_test connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207136 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol)
[11:10:01] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:10:04] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:11:23] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11387617 (10ngkountas) Thank you @Volans, I can now run queries on superset.wikimedia.org properly! Thanks to everyone involved! This task can now be resolved.
[11:11:35] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:12:45] <wikibugs>	 (03PS1) 10Brouberol: airflow-platform-eng: fix a typo in the opensearch_test connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207137 (https://phabricator.wikimedia.org/T408238)
[11:13:57] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] airflow-platform-eng: fix a typo in the opensearch_test connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207137 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol)
[11:15:04] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-platform-eng: fix a typo in the opensearch_test connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207137 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol)
[11:15:13] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:16:22] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:19:06] <hnowlan>	 jouncebot: nowandnext
[11:19:07] <jouncebot>	 For the next 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1100)
[11:19:07] <jouncebot>	 In 0 hour(s) and 40 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1200)
[11:19:57] <wikibugs>	 (03CR) 10FNegri: [C:03+1] P:toolforge::prometheus: Use native exporters for HAProxy targets [puppet] - 10https://gerrit.wikimedia.org/r/1203427 (https://phabricator.wikimedia.org/T343885) (owner: 10Majavah)
[11:23:24] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:toolforge::prometheus: Use native exporters for HAProxy targets [puppet] - 10https://gerrit.wikimedia.org/r/1203427 (https://phabricator.wikimedia.org/T343885) (owner: 10Majavah)
[11:23:43] <wikibugs>	 (03PS1) 10Brouberol: airflow-platform-eng: configure SSL for opensearch API communication [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207140 (https://phabricator.wikimedia.org/T408238)
[11:24:37] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] airflow-platform-eng: configure SSL for opensearch API communication [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207140 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol)
[11:24:39] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-main1006 is OK: SSL OK - Certificate kafka-main1006.eqiad.wmnet valid until 2026-10-20 13:49:00 +0000 (expires in 335 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[11:24:50] <logmsgbot>	 !log cgoubert@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-eqiad
[11:26:21] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-platform-eng: configure SSL for opensearch API communication [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207140 (https://phabricator.wikimedia.org/T408238) (owner: 10Brouberol)
[11:26:27] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:27:04] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[11:30:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy7001.wikimedia.org with OS trixie
[11:30:49] <claime>	 !log Roll restarting mobileapps in codfw - unavailable replicas - T410296
[11:30:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:53] <stashbot>	 T410296: Significant increase in wikifeeds latency since 2025/11/13 - https://phabricator.wikimedia.org/T410296
[11:30:57] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: sync
[11:31:57] <hnowlan>	 thanks, was thinking about that :D
[11:32:18] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: sync
[11:33:01] <claime>	 hnowlan: :D
[11:34:36] <wikibugs>	 (03PS1) 10Slyngshede: P:cache::base disable geoip in cloud environment [puppet] - 10https://gerrit.wikimedia.org/r/1207141
[11:37:44] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[11:37:44] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[11:37:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[11:41:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy5001.wikimedia.org
[11:45:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy5001.wikimedia.org
[11:47:43] <wikibugs>	 (03PS3) 10Kevin Bazira: ml-services: deploy revertrisk-wikidata to the revertrisk ns prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207027 (https://phabricator.wikimedia.org/T406179)
[11:54:20] <wikibugs>	 (03PS3) 10Majavah: P:toolforge: Remove legacy HAProxy exporters [puppet] - 10https://gerrit.wikimedia.org/r/1203428
[11:55:30] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11387754 (10cmooney) Another datapoint here, but the logspam seems worse on some switches: ` A:lsw1-d7-eqiad# show system logging buffer messages | grep -c "remote peer updated on i...
[11:55:32] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7647/co" [puppet] - 10https://gerrit.wikimedia.org/r/1203428 (owner: 10Majavah)
[11:56:34] <wikibugs>	 (03CR) 10AikoChou: [C:03+1] "LGTM! :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207027 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira)
[11:57:21] <hnowlan>	 !log routing /api/rest_v1/page/lint/ via the rest-gateway for group1
[11:57:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:23] <wikibugs>	 (03PS1) 10Majavah: Remove absented HAProxy exporters [puppet] - 10https://gerrit.wikimedia.org/r/1207144
[12:00:04] <jouncebot>	 mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1200).
[12:00:20] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+2] "Thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207027 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira)
[12:01:47] <wikibugs>	 (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206905 (owner: 10PipelineBot)
[12:02:13] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: deploy revertrisk-wikidata to the revertrisk ns prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207027 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira)
[12:03:34] <wikibugs>	 (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206905 (owner: 10PipelineBot)
[12:03:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy7001.wikimedia.org with reason: host reimage
[12:04:40] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[12:05:00] <logmsgbot>	 !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[12:05:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy5002.wikimedia.org
[12:05:23] <logmsgbot>	 !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[12:06:37] <logmsgbot>	 !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[12:07:09] <logmsgbot>	 !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[12:07:26] <logmsgbot>	 !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[12:07:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy7001.wikimedia.org with reason: host reimage
[12:07:55] <logmsgbot>	 !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[12:09:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy5002.wikimedia.org
[12:10:49] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[12:11:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy6001.wikimedia.org
[12:13:26] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] "LGTM, will need testing in staging before roll out." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132) (owner: 10Daniel Kinzler)
[12:14:38] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11387772 (10Volans) 05In progress→03Resolved a:03Volans
[12:15:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy6001.wikimedia.org
[12:22:41] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[12:24:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy7001.wikimedia.org with OS trixie
[12:24:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy7001.wikimedia.org
[12:25:52] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host db1169.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[12:27:10] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] rest-gateway: allow rate limits per time unit (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132) (owner: 10Daniel Kinzler)
[12:28:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host hcaptcha-proxy6002.wikimedia.org
[12:28:54] <wikibugs>	 (03CR) 10Clément Goubert: rest-gateway: implement per-route rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206898 (https://phabricator.wikimedia.org/T409044) (owner: 10Daniel Kinzler)
[12:32:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host hcaptcha-proxy6002.wikimedia.org
[12:32:43] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2144 (ms2) memory error - https://phabricator.wikimedia.org/T410480#11387861 (10Marostegui)
[12:33:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy7002.wikimedia.org
[12:33:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:35:35] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1169.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[12:37:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002"
[12:38:56] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[12:39:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002"
[12:39:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:39:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7002.wikimedia.org on all recursors
[12:39:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7002.wikimedia.org on all recursors
[12:39:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:40:01] <wikibugs>	 (03CR) 10Clément Goubert: rest-gateway: assign ratelimit class by network range (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) (owner: 10Daniel Kinzler)
[12:43:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002"
[12:43:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002"
[12:43:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:43:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7002.wikimedia.org on all recursors
[12:43:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7002.wikimedia.org on all recursors
[12:43:52] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy7002.wikimedia.org
[12:44:35] <wikibugs>	 (03CR) 10Zabe: undeploy Extension:Capiunto (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206851 (https://phabricator.wikimedia.org/T410172) (owner: 10Novem Linguae)
[12:45:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy7002.wikimedia.org
[12:45:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:46:27] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] undeploy Extension:Capiunto (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206851 (https://phabricator.wikimedia.org/T410172) (owner: 10Novem Linguae)
[12:49:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002"
[12:49:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002"
[12:49:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:49:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7002.wikimedia.org on all recursors
[12:49:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7002.wikimedia.org on all recursors
[12:49:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:51:32] <wikibugs>	 (03PS1) 10Filippo Giunchedi: install_server: workaround for mpt3sas large optimal_io_size [puppet] - 10https://gerrit.wikimedia.org/r/1207150 (https://phabricator.wikimedia.org/T407586)
[12:52:24] <wikibugs>	 (03CR) 10Daniel Kinzler: rest-gateway: allow rate limits per time unit (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132) (owner: 10Daniel Kinzler)
[12:53:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002"
[12:53:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002"
[12:53:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:53:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7002.wikimedia.org on all recursors
[12:53:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7002.wikimedia.org on all recursors
[12:53:31] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy7002.wikimedia.org
[12:53:52] <wikibugs>	 (03CR) 10Daniel Kinzler: rest-gateway: implement per-route rate limits (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206898 (https://phabricator.wikimedia.org/T409044) (owner: 10Daniel Kinzler)
[12:54:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy7002.wikimedia.org
[12:54:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:55:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge: Remove legacy HAProxy exporters [puppet] - 10https://gerrit.wikimedia.org/r/1203428 (owner: 10Majavah)
[12:55:17] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge: Remove legacy HAProxy exporters [puppet] - 10https://gerrit.wikimedia.org/r/1203428 (owner: 10Majavah)
[12:58:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002"
[12:58:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002"
[12:58:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:58:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7002.wikimedia.org on all recursors
[12:58:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7002.wikimedia.org on all recursors
[12:59:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002"
[12:59:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy7002.wikimedia.org - jmm@cumin2002"
[13:00:44] <wikibugs>	 (03CR) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) (owner: 10Daniel Kinzler)
[13:03:23] <logmsgbot>	 jmm@cumin2002 makevm (PID 370070) is awaiting input
[13:04:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy7002.wikimedia.org with OS trixie
[13:09:08] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:14:50] <moritzm>	 !log installing systemd bugfix updates on trixie hosts
[13:14:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:17] <logmsgbot>	 filippo@cumin1003 reimage (PID 2904929) is awaiting input
[13:18:07] <wikibugs>	 10SRE-Access-Requests: New SSH key - https://phabricator.wikimedia.org/T410506 (10jijiki) 03NEW
[13:19:32] <wikibugs>	 (03PS1) 10Effie Mouzeli: admin: add new keys for effie [puppet] - 10https://gerrit.wikimedia.org/r/1207153 (https://phabricator.wikimedia.org/T410506)
[13:29:53] <wikibugs>	 (03PS2) 10Anzx: tcywikisource: Temporary increase of AccountCreationThrottle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207154 (https://phabricator.wikimedia.org/T410507)
[13:30:04] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207154 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx)
[13:30:41] <wikibugs>	 (03CR) 10Hoo man: [C:04-1] Replace 'let' with arithmetic expansion (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1207109 (https://phabricator.wikimedia.org/T406044) (owner: 10Itamar Givon)
[13:31:06] <wikibugs>	 (03CR) 10Hoo man: [C:04-1] Replace 'let' with arithmetic expansion (031 comment) [dumps] - 10https://gerrit.wikimedia.org/r/1207109 (https://phabricator.wikimedia.org/T406044) (owner: 10Itamar Givon)
[13:33:12] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling reboot on A:swift-fe-eqiad
[13:33:24] <logmsgbot>	 !log marostegui@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[13:33:41] <wikibugs>	 (03CR) 10Hoo man: Clean up existing symlink before creating a new one (032 comments) [dumps] - 10https://gerrit.wikimedia.org/r/1207110 (https://phabricator.wikimedia.org/T406044) (owner: 10Itamar Givon)
[13:33:51] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[13:33:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy7002.wikimedia.org with reason: host reimage
[13:33:59] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1187 (T299441)', diff saved to https://phabricator.wikimedia.org/P85388 and previous config saved to /var/cache/conftool/dbconfig/20251119-133358-marostegui.json
[13:34:03] <stashbot>	 T299441: Avoid depooling hosts if the schema change has been applied before - https://phabricator.wikimedia.org/T299441
[13:36:13] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:37:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:37:56] <logmsgbot>	 !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[13:39:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy7002.wikimedia.org with reason: host reimage
[13:43:41] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1187 gradually with 4 steps - repool after schema change test
[13:50:19] <wikibugs>	 06SRE, 06cloud-services-team, 13Patch-For-Review: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11388128 (10fgiunchedi) Reported to Debian as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1121006
[13:50:51] <Amir1>	 !log cumin2024@db2205.codfw.wmnet[(none)]> drop database if exists noboardwiki; drop database if exists ru_sibwiki; drop database if exists sep11wiki; drop database if exists strategyappswiki; (T297297)
[13:50:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:56] <stashbot>	 T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297
[13:52:30] <wikibugs>	 06SRE, 06cloud-services-team, 13Patch-For-Review, 07Upstream: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11388143 (10taavi)
[13:56:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy7002.wikimedia.org with OS trixie
[13:56:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy7002.wikimedia.org
[13:59:50] <moritzm>	 !log installing monitoring-plugins bugfix updates on trixie hosts
[13:59:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1400).
[14:00:05] <jouncebot>	 edsanders, Sergi0, tgr, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:06] <edsanders>	 o/
[14:00:16] <sergi0>	 o/
[14:00:17] <edsanders>	 I can self deploy my config change
[14:00:40] <anzx>	 o/
[14:01:15] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Yes, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1207141 (owner: 10Slyngshede)
[14:01:39] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11388181 (10MoritzMuehlenhoff)
[14:02:29] <edsanders>	 I'll begin
[14:02:37] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206880 (https://phabricator.wikimedia.org/T402532) (owner: 10Esanders)
[14:02:51] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PXE failing on db1169 - https://phabricator.wikimedia.org/T410388#11388191 (10Jclark-ctr) 05Open→03Resolved Closing this ticket since it’s a configuration problem being addressed in T410400
[14:03:28] <wikibugs>	 (03Merged) 10jenkins-bot: Freeze LiquidThreads on ptwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206880 (https://phabricator.wikimedia.org/T402532) (owner: 10Esanders)
[14:04:00] <logmsgbot>	 !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1206880|Freeze LiquidThreads on ptwikibooks (T402532)]]
[14:04:02] <tgr_>	 I'll stay last, will need to do a lot of testing
[14:04:04] <stashbot>	 T402532: ptwikibooks: LQT set to readonly and removed as default - https://phabricator.wikimedia.org/T402532
[14:04:08] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] P:cache::base disable geoip in cloud environment [puppet] - 10https://gerrit.wikimedia.org/r/1207141 (owner: 10Slyngshede)
[14:04:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11388207 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Closing ticket — cabling subtask has been completed and server migration is in process
[14:05:06] <Lucas_WMDE>	 o/
[14:05:57] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11388212 (10MoritzMuehlenhoff) @ssingh The hcaptcha-proxy VMs in magru are up and running
[14:07:01] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11388216 (10ssingh) Oh wow, thanks @MoritzMuehlenhoff! But what was the issue for my understanding?
[14:08:42] <logmsgbot>	 !log esanders@deploy2002 esanders: Backport for [[gerrit:1206880|Freeze LiquidThreads on ptwikibooks (T402532)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:08:50] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] hiera: lvs/interfaces: remove public1-ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1206424 (https://phabricator.wikimedia.org/T410047) (owner: 10Ssingh)
[14:10:37] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11388223 (10MoritzMuehlenhoff) >>! In T409860#11388216, @ssingh wrote: > Oh wow, thanks @MoritzMu...
[14:10:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[14:12:02] <wikibugs>	 (03PS2) 10AOkoth: vrts: alert on vrts junk queue size [alerts] - 10https://gerrit.wikimedia.org/r/1201087 (https://phabricator.wikimedia.org/T408632)
[14:12:05] <logmsgbot>	 !log esanders@deploy2002 esanders: Continuing with sync
[14:12:59] <Amir1>	 !log cumin2024@db2205.codfw.wmnet[(none)]> drop database if exists tlhwiki; drop database if exists tlhwiktionary; drop database if exists ukwikimedia; drop database if exists zerowiki; drop database if exists zh_cnwiki; drop database if exists zh_twwiki; (T297297)
[14:13:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:03] <stashbot>	 T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297
[14:14:44] <wikibugs>	 (03PS1) 10Arnaudb: admin: add FIDO key for arnaudb [puppet] - 10https://gerrit.wikimedia.org/r/1207159
[14:15:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[14:16:00] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Is this really the right way to change the throttle? I can’t find any similar modifications in Git since the config took on its current fo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207154 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx)
[14:16:13] <logmsgbot>	 !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206880|Freeze LiquidThreads on ptwikibooks (T402532)]] (duration: 12m 13s)
[14:16:18] <stashbot>	 T402532: ptwikibooks: LQT set to readonly and removed as default - https://phabricator.wikimedia.org/T402532
[14:16:39] <sergi0>	 I think I can do mine together
[14:16:49] <Lucas_WMDE>	 ack
[14:17:37] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11388240 (10ssingh) >>! In T409860#11388223, @MoritzMuehlenhoff wrote: >>>! In T409860#11388216,...
[14:17:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206936 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno)
[14:17:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207118 (https://phabricator.wikimedia.org/T409170) (owner: 10Sergio Gimeno)
[14:19:05] <wikibugs>	 (03CR) 10Anzx: "seems so if ip address is not known https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold#:~:text=If%20the%20IP%20is%2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207154 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx)
[14:19:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good and key verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1207159 (owner: 10Arnaudb)
[14:20:11] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] admin: add FIDO key for arnaudb [puppet] - 10https://gerrit.wikimedia.org/r/1207159 (owner: 10Arnaudb)
[14:26:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[14:29:11] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1187 gradually with 4 steps - repool after schema change test
[14:29:18] <wikibugs>	 (03CR) 10Volans: [C:03+1] "Verified out of band with Effie" [puppet] - 10https://gerrit.wikimedia.org/r/1207153 (https://phabricator.wikimedia.org/T410506) (owner: 10Effie Mouzeli)
[14:29:44] <wikibugs>	 (03Merged) 10jenkins-bot: fix(ReviseToneExperimentInteractionLogger): prevent breaking homepage for unsampled users [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206936 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno)
[14:30:18] <wikibugs>	 (03Merged) 10jenkins-bot: fix(MigrateMentorStatusAway): ensure migration respects date format [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207118 (https://phabricator.wikimedia.org/T409170) (owner: 10Sergio Gimeno)
[14:30:33] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: New SSH key - https://phabricator.wikimedia.org/T410506#11388307 (10Volans) p:05Triage→03Medium
[14:30:50] <logmsgbot>	 !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1206936|fix(ReviseToneExperimentInteractionLogger): prevent breaking homepage for unsampled users (T405177)]], [[gerrit:1207118|fix(MigrateMentorStatusAway): ensure migration respects date format (T409170)]]
[14:30:56] <stashbot>	 T405177: Revise Tone: Instrumentation - https://phabricator.wikimedia.org/T405177
[14:30:57] <stashbot>	 T409170: Run MigrateMentorStatusAway migration script - https://phabricator.wikimedia.org/T409170
[14:31:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Remove lvs1018 L2 link to ssw1-e1-eqiad - https://phabricator.wikimedia.org/T405499#11388318 (10cmooney) 05Resolved→03Open a:05cmooney→03None Hi.  Seems I made an error here as not all the work is complete on site.  We still ne...
[14:32:46] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Fair enough, let’s try it then." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207154 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx)
[14:33:57] <wikibugs>	 (03PS3) 10Ssingh: O:haptcha::proxy: add new role for hCaptcha proxy VMs (bird/anycast) [puppet] - 10https://gerrit.wikimedia.org/r/1204073 (https://phabricator.wikimedia.org/T409780)
[14:34:00] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle)
[14:34:23] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11388333 (10cmooney) 05Open→03Resolved >>! In T410047#11374122, @cmooney wrote: > Actually I discussed with @Papaul in relation to...
[14:34:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: eqiad row C/D DC Ops host migrations - https://phabricator.wikimedia.org/T405021#11388335 (10Jclark-ctr) 05Open→03Resolved All dcops servers have been relocated to new switches
[14:35:22] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: New SSH keys for effie - https://phabricator.wikimedia.org/T410506#11388338 (10A_smart_kitten)
[14:35:28] <logmsgbot>	 !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1206936|fix(ReviseToneExperimentInteractionLogger): prevent breaking homepage for unsampled users (T405177)]], [[gerrit:1207118|fix(MigrateMentorStatusAway): ensure migration respects date format (T409170)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:35:59] <logmsgbot>	 !log sgimeno@deploy2002 sgimeno: Continuing with sync
[14:36:12] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:36:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11388344 (10Jclark-ctr) 05Open→03Resolved a:05RobH→03Jclark-ctr All Servers for Traffic have been migrated to new nokia switches
[14:37:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:40:00] <logmsgbot>	 !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206936|fix(ReviseToneExperimentInteractionLogger): prevent breaking homepage for unsampled users (T405177)]], [[gerrit:1207118|fix(MigrateMentorStatusAway): ensure migration respects date format (T409170)]] (duration: 09m 09s)
[14:40:05] <stashbot>	 T405177: Revise Tone: Instrumentation - https://phabricator.wikimedia.org/T405177
[14:40:06] <stashbot>	 T409170: Run MigrateMentorStatusAway migration script - https://phabricator.wikimedia.org/T409170
[14:40:30] <sergi0>	 all yours @anzx 
[14:40:51] <sergi0>	 Or @Lucas_WMDE ?
[14:41:07] <Lucas_WMDE>	 yaeh, I can deploy this one :)
[14:41:09] <Lucas_WMDE>	 *yeah
[14:41:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207154 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx)
[14:42:29] <wikibugs>	 (03Merged) 10jenkins-bot: tcywikisource: Temporary increase of AccountCreationThrottle  [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207154 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx)
[14:43:02] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1207154|tcywikisource: Temporary increase of AccountCreationThrottle  (T410507)]]
[14:43:06] <stashbot>	 T410507: Increase AccountCreationThrottle for Tulu Wikisource - https://phabricator.wikimedia.org/T410507
[14:43:46] <wikibugs>	 (03PS1) 10Ssingh: site.pp: reimage hcaptcha-proxy1001 to proper role [puppet] - 10https://gerrit.wikimedia.org/r/1207165 (https://phabricator.wikimedia.org/T409780)
[14:44:11] <wikibugs>	 (03PS1) 10Bking: opensearch-cluster: give 'opensearch' user access to bulk API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207166 (https://phabricator.wikimedia.org/T408012)
[14:45:03] <anzx>	 Lucas_WMDE: no need to test, good to sync
[14:45:15] <Lucas_WMDE>	 makes sense
[14:45:44] <wikibugs>	 (03PS1) 10Ladsgroup: rdbms: Dismantle concept of groups [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207167 (https://phabricator.wikimedia.org/T405087)
[14:47:09] <wikibugs>	 (03PS1) 10Anzx: Revert "tcywikisource: Temporary increase of AccountCreationThrottle " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207169 (https://phabricator.wikimedia.org/T410507)
[14:47:42] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply
[14:47:56] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[14:47:57] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 anzx, lucaswerkmeister-wmde: Backport for [[gerrit:1207154|tcywikisource: Temporary increase of AccountCreationThrottle  (T410507)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:48:24] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling reboot on A:swift-fe-eqiad
[14:48:32] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 anzx, lucaswerkmeister-wmde: Continuing with sync
[14:49:33] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Thanks :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207169 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx)
[14:51:15] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:51:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[14:52:01] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+1] "bonkers" [puppet] - 10https://gerrit.wikimedia.org/r/1207150 (https://phabricator.wikimedia.org/T407586) (owner: 10Filippo Giunchedi)
[14:52:34] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207154|tcywikisource: Temporary increase of AccountCreationThrottle  (T410507)]] (duration: 09m 32s)
[14:52:38] <anzx>	 Lucas_WMDE: thanks for deploying 
[14:52:39] <stashbot>	 T410507: Increase AccountCreationThrottle for Tulu Wikisource - https://phabricator.wikimedia.org/T410507
[14:52:42] <Lucas_WMDE>	 np
[14:52:49] <Lucas_WMDE>	 tgr_: over to you
[14:53:01] <tgr_>	 thx
[14:54:02] <wikibugs>	 (03PS2) 10Gergő Tisza: Use prefixed 'sub' field in OAuth 2 access tokens [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202768 (https://phabricator.wikimedia.org/T399199)
[14:54:42] <Lucas_WMDE>	 oh, right, I wanted to try running resetAuthenticationThrottle too
[14:54:46] <Lucas_WMDE>	 (shouldn’t interfere, hopefully)
[14:55:32] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 mwscript-k8s job started: resetAuthenticationThrottle tcywikisource --signup  # T410507
[14:55:52] <Lucas_WMDE>	 !log (T410507 maintenance script failed, --ip is required and we don’t have it. oh well)
[14:55:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:18] <anzx>	 Lucas_WMDE: i thought without IP address it was not required, thanks
[14:56:36] <tgr_>	 you can add the wiki to throttle.php instead
[14:57:15] <Lucas_WMDE>	 anzx: yeah I suspected it would fail but wanted to try it anyway
[14:57:21] <Lucas_WMDE>	 tgr_: we don’t have an IP range for the event apparently :/
[14:57:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202768 (https://phabricator.wikimedia.org/T399199) (owner: 10Gergő Tisza)
[14:57:57] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11388433 (10MoritzMuehlenhoff)
[14:57:58] <wikibugs>	 (03CR) 10Bking: [C:03+2] dse-k8s-codfw: set minimum resources for opensearch namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206969 (https://phabricator.wikimedia.org/T408643) (owner: 10Bking)
[14:58:15] <wikibugs>	 (03Merged) 10jenkins-bot: Use prefixed 'sub' field in OAuth 2 access tokens [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1202768 (https://phabricator.wikimedia.org/T399199) (owner: 10Gergő Tisza)
[14:58:28] <tgr_>	 IP is optional for that
[14:58:39] <tgr_>	 but then, maybe unwise to allow all IPs
[14:58:45] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1202768|Use prefixed 'sub' field in OAuth 2 access tokens (T399199)]]
[14:58:50] <stashbot>	 T399199: Update OAuth 2.0 sessions to include new JWT session data from core - https://phabricator.wikimedia.org/T399199
[14:58:59] <Lucas_WMDE>	 hm, maybe https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold needs an update then? that’s what pointed to wgAccountCreationThrottle
[15:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1500)
[15:00:39] <tgr_>	 you can just omit IP/range and then the higher limit will apply to all IPs
[15:00:59] <Amir1>	 please let me know once you're done. I have a backport
[15:01:01] <tgr_>	 $wgAccountCreationThrottle would work too, but then you can't limit it by date/wiki
[15:03:37] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1202768|Use prefixed 'sub' field in OAuth 2 access tokens (T399199)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:03:52] <wikibugs>	 (03CR) 10Gehel: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207166 (https://phabricator.wikimedia.org/T408012) (owner: 10Bking)
[15:04:17] <tgr_>	 oh, you did already increase $wgAccountCreationThrottle. You don't really need the maintenance script then.
[15:04:39] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507)
[15:04:51] <Lucas_WMDE>	 tgr_: does https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1207171 look better?
[15:05:12] <Lucas_WMDE>	 yeah, I ran the maintenance script because wikitech said to (and I figured it wouldn’t hurt even if it errored out)
[15:05:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) (owner: 10Lucas Werkmeister (WMDE))
[15:05:38] <Lucas_WMDE>	 (but test your change first :))
[15:07:34] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507)
[15:07:34] <tgr_>	 yeah looks good
[15:07:50] <tgr_>	 not running the script means the effective limit will be 69 not 75
[15:07:56] <tgr_>	 which isn't a big deal
[15:08:24] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:30] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.dns.netbox
[15:09:21] <Lucas_WMDE>	 ack
[15:09:28] <Lucas_WMDE>	 (though I had to fix one test that failed on the missing IP/range ^^)
[15:09:57] <Lucas_WMDE>	 wtf, I wrote that test? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1073487
[15:10:14] <wikibugs>	 (03PS1) 10Awight: Monitoring for WMDE dumps scraper [puppet] - 10https://gerrit.wikimedia.org/r/1207174 (https://phabricator.wikimedia.org/T402613)
[15:10:38] <Lucas_WMDE>	 so I guess we haven’t had throttling exceptions without IPs/ranges since at least September 2024
[15:10:50] <Lucas_WMDE>	 hopefully they still work. anzx: does https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1207171 look okay to you?
[15:11:25] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[15:12:11] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: push changes - cmooney@cumin1003"
[15:12:17] <tgr_>	 the code is in throttle-analyze.php, looks pretty straightforward
[15:13:02] <wikibugs>	 (03CR) 10Anzx: "just to be safe extend endtime by 1 hour" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) (owner: 10Lucas Werkmeister (WMDE))
[15:13:31] <anzx>	 Lucas_WMDE: looks ok, i have suggested to increase time by 1 hour just to be safe
[15:13:37] <Lucas_WMDE>	 anzx: sure
[15:13:42] <Lucas_WMDE>	 tgr_: true, fair enough
[15:13:43] <Lucas_WMDE>	 thanks!
[15:13:45] <wikibugs>	 (03PS2) 10Majavah: Remove absented HAProxy exporters [puppet] - 10https://gerrit.wikimedia.org/r/1207144
[15:13:49] <Lucas_WMDE>	 then I’ll try to get that deployed later
[15:13:50] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: push changes - cmooney@cumin1003"
[15:13:50] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:14:24] <wikibugs>	 (03CR) 10Daphne Smit: [C:03+2] wikifunctions: Bump the orchestrator timeout down a skosh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205263 (https://phabricator.wikimedia.org/T407503) (owner: 10Cory Massaro)
[15:15:29] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1202768|Use prefixed 'sub' field in OAuth 2 access tokens (T399199)]] (duration: 16m 43s)
[15:15:34] <stashbot>	 T399199: Update OAuth 2.0 sessions to include new JWT session data from core - https://phabricator.wikimedia.org/T399199
[15:15:43] <tgr_>	 let's see if we break any OAuth clients this time
[15:15:56] <tgr_>	 Amir1: you are good to go
[15:16:03] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Bump the orchestrator timeout down a skosh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1205263 (https://phabricator.wikimedia.org/T407503) (owner: 10Cory Massaro)
[15:16:05] <tgr_>	 please ping Lucas_WMDE when done
[15:16:18] <wikibugs>	 (03PS3) 10Kgraessle: Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727)
[15:16:23] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[15:16:27] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] rdbms: Dismantle concept of groups [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207167 (https://phabricator.wikimedia.org/T405087) (owner: 10Ladsgroup)
[15:16:28] <wikibugs>	 (03CR) 10Majavah: [C:03+2] Remove absented HAProxy exporters [puppet] - 10https://gerrit.wikimedia.org/r/1207144 (owner: 10Majavah)
[15:16:31] <Amir1>	 awesome
[15:16:51] <Lucas_WMDE>	 fingers crossed for OAuth
[15:16:53] <Amir1>	 my patch is going to take a while to merge, so if it's mw-config, Lucas_WMDE you can go ahead
[15:16:54] <James_F>	 brouberol: Can you ping when you're done? deployment-charts git is dirty so we can't use our window.
[15:17:01] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507)
[15:17:06] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[15:17:32] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) (owner: 10Lucas Werkmeister (WMDE))
[15:17:42] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[15:17:45] <Lucas_WMDE>	 Amir1: ok
[15:17:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) (owner: 10Lucas Werkmeister (WMDE))
[15:17:55] <Lucas_WMDE>	 bah, what now
[15:18:11] <Lucas_WMDE>	 “Comments should start on new line.”  blhhhhhh
[15:18:21] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[15:18:25] <Lucas_WMDE>	 even the wikitech example has end-of-line comments 😡 https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold
[15:18:42] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply
[15:19:06] <wikibugs>	 (03PS4) 10Lucas Werkmeister (WMDE): tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507)
[15:19:21] <Lucas_WMDE>	 Amir1: want to CR+1 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1207171 ?
[15:19:25] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply
[15:19:47] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply
[15:20:19] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply
[15:20:26] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) (owner: 10Lucas Werkmeister (WMDE))
[15:20:33] <wikibugs>	 (03Merged) 10jenkins-bot: rdbms: Dismantle concept of groups [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207167 (https://phabricator.wikimedia.org/T405087) (owner: 10Ladsgroup)
[15:20:35] <Amir1>	 ^
[15:20:38] <Lucas_WMDE>	 ok, you go first
[15:20:43] <Amir1>	 oh mine got merged
[15:20:43] <Lucas_WMDE>	 (asps. very dangerous!)
[15:20:44] <Amir1>	 interesting
[15:22:05] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1207167|rdbms: Dismantle concept of groups (T405087)]]
[15:22:13] <stashbot>	 T405087: Remove concept of groups in rdbms load balancer and replace it with shuffle sharding - https://phabricator.wikimedia.org/T405087
[15:22:25] <logmsgbot>	 !log ladsgroup@deploy2002 sync-world failed: <CalledProcessError> Command '['sudo', '-u', 'mwbuilder', '-n', '--', '/usr/bin/scap', 'mwscript', '--no-local-config', '--directory', '/srv/mediawiki-staging', '--user', 'www-data', '--', 'mergeMessageFileList.php', '--wiki=aawiki', '--force-version', '1.46.0-wmf.3', '--list-file', '/srv/mediawiki-staging/wmf-config/extension-list', '--output', '/tmp/tmp.IxZM23pYxK']' returned
[15:22:25] <logmsgbot>	 non-zero exit status 255. (scap version: 4.227.0) (duration: 00m 20s)
[15:22:47] <wikibugs>	 (03Abandoned) 10Anzx: Revert "tcywikisource: Temporary increase of AccountCreationThrottle " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207169 (https://phabricator.wikimedia.org/T410507) (owner: 10Anzx)
[15:23:03] <James_F>	 Oh dear.
[15:23:30] <Amir1>	 https://www.irccloud.com/pastebin/hU21gQov/
[15:23:44] <wikibugs>	 (03PS1) 10Ssingh: hiera: lvs/interfaces: remove VLAN sub-ints for edges [puppet] - 10https://gerrit.wikimedia.org/r/1207180 (https://phabricator.wikimedia.org/T409860)
[15:23:46] <wikibugs>	 (03PS1) 10TrainBranchBot: Revert "rdbms: Dismantle concept of groups" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207181
[15:23:46] <wikibugs>	 (03CR) 10TrainBranchBot: "ladsgroup@deploy2002 created a revert of this change as Ie333b077a04b6846c711f1a97baef4b42b46ae0f" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207167 (https://phabricator.wikimedia.org/T405087) (owner: 10Ladsgroup)
[15:24:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207181 (owner: 10TrainBranchBot)
[15:24:52] <James_F>	 brouberol: Ping again. We'd really like to use our deployment window if possible.
[15:25:23] <wikibugs>	 (03PS2) 10Ssingh: hiera: lvs/interfaces: remove VLAN sub-ints for edges [puppet] - 10https://gerrit.wikimedia.org/r/1207180 (https://phabricator.wikimedia.org/T410411)
[15:30:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1500)
[15:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1530)
[15:30:07] <papaul>	 !log rebooting sretest2004 to check LLDP settings 
[15:30:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:07] <brouberol>	 James_F: sorry, I missed the first ping. It's fixed
[15:31:12] <James_F>	 Thanks!
[15:31:17] <moritzm>	 !log installing wtmpdb bugfix updates on trixie hosts
[15:31:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:20] <brouberol>	 np, and apologies 
[15:31:28] <logmsgbot>	 !log daphnesmit@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:31:58] <logmsgbot>	 !log daphnesmit@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:32:10] <icinga-wm>	 PROBLEM - Host sretest2004 is DOWN: PING CRITICAL - Packet loss = 100%
[15:32:28] <logmsgbot>	 !log slyngshede@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool site drmrs [reason: no reason specified, T390813]
[15:32:33] <logmsgbot>	 !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site drmrs [reason: no reason specified, T390813]
[15:32:34] <stashbot>	 T390813: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813
[15:33:11] <logmsgbot>	 !log daphnesmit@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:33:24] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:33:35] <moritzm>	 !log installing console-setup bugfix updates on trixie hosts
[15:33:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:38] <logmsgbot>	 !log daphnesmit@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:33:51] <logmsgbot>	 !log daphnesmit@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:34:14] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] opensearch-cluster: give 'opensearch' user access to bulk API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207166 (https://phabricator.wikimedia.org/T408012) (owner: 10Bking)
[15:34:27] <logmsgbot>	 !log daphnesmit@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:35:26] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] admin: add new keys for effie [puppet] - 10https://gerrit.wikimedia.org/r/1207153 (https://phabricator.wikimedia.org/T410506) (owner: 10Effie Mouzeli)
[15:36:18] <wikibugs>	 (03CR) 10Daphne Smit: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-11-08-223341 to 2025-11-18-175356 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206981 (https://phabricator.wikimedia.org/T305612) (owner: 10Jforrester)
[15:38:34] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-11-08-223341 to 2025-11-18-175356 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206981 (https://phabricator.wikimedia.org/T305612) (owner: 10Jforrester)
[15:38:56] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "rdbms: Dismantle concept of groups" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207181 (owner: 10TrainBranchBot)
[15:39:23] <logmsgbot>	 !log daphnesmit@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:39:27] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1207181|Revert "rdbms: Dismantle concept of groups"]]
[15:39:47] <logmsgbot>	 !log daphnesmit@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:40:38] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: New SSH keys for effie - https://phabricator.wikimedia.org/T410506#11388761 (10jijiki) 05Open→03Resolved a:03jijiki
[15:40:54] <logmsgbot>	 !log daphnesmit@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:40:54] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11388764 (10Papaul) @ayounsi Please see below the steps to disable LLDP in the BIOS for Dell servers.  - once in the BIOS go to "Device Settings" -pick the first NIC if it is 1G or...
[15:41:20] <wikibugs>	 (03CR) 10Jsn.sherman: [C:03+1] "LGTM; thanks for your work on this!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727) (owner: 10Kgraessle)
[15:41:28] <logmsgbot>	 !log daphnesmit@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:41:36] <logmsgbot>	 !log daphnesmit@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:42:10] <logmsgbot>	 !log daphnesmit@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:42:22] <icinga-wm>	 RECOVERY - Host sretest2004 is UP: PING OK - Packet loss = 0%, RTA = 33.54 ms
[15:43:45] <logmsgbot>	 !log ladsgroup@deploy2002 trainbranchbot, ladsgroup: Backport for [[gerrit:1207181|Revert "rdbms: Dismantle concept of groups"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:43:54] <James_F>	 Window clear from our end.
[15:44:36] <Lucas_WMDE>	 I’m waiting for Amir1 to be done deploying
[15:44:42] <logmsgbot>	 !log ladsgroup@deploy2002 trainbranchbot, ladsgroup: Continuing with sync
[15:44:50] <Lucas_WMDE>	 and then can hopefully deploy my config cleanup in the break between wf/xLab and mw infra
[15:44:51] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11388797 (10MoritzMuehlenhoff)
[15:44:59] <Amir1>	 just got to test servers
[15:45:22] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11388798 (10RobH) Update:  * backup1006, backup1007, ms-backup1002 moved yesterday. * db1189 was moved yesterday by accident sorry about that! * The only d...
[15:47:05] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11388802 (10Ladsgroup) Please ping me before moving of pc1014 so I depool pc4 cluster from rotation.
[15:47:38] <wikibugs>	 (03CR) 10Bking: [C:03+2] opensearch-cluster: give 'opensearch' user access to bulk API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207166 (https://phabricator.wikimedia.org/T408012) (owner: 10Bking)
[15:48:31] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11388807 (10jcrespo) >>! In T405942#11388798, @RobH wrote: > ** moss-be1002 - no directions provided on moving this, please advise  @Robh, not mine, but pl...
[15:48:42] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207181|Revert "rdbms: Dismantle concept of groups"]] (duration: 09m 14s)
[15:50:11] <wikibugs>	 (03CR) 10Bking: [C:03+2] opensearch on k8s: Add CODFW environment to helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206973 (https://phabricator.wikimedia.org/T408643) (owner: 10Bking)
[15:51:57] <wikibugs>	 (03Merged) 10jenkins-bot: opensearch on k8s: Add CODFW environment to helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206973 (https://phabricator.wikimedia.org/T408643) (owner: 10Bking)
[15:52:35] <logmsgbot>	 !log pt1979@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw1-b[12-13]-drmrs,cr[1-2]-drmrs,mr1-drmrs with reason: router upgrade
[15:52:39] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[15:52:46] <jinxer-wm>	 FIRING: Primary inbound port utilisation over 80%  #page: Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[15:52:50] <sukhe>	 here it comes
[15:53:15] <Amir1>	 Lucas_WMDE: I'm done with the deploy
[15:53:20] <sukhe>	 !incidents
[15:53:20] <sirenbot>	 7029 (ACKED)  Primary inbound port utilisation over 80%  (paged) network noc (cr1-esams.wikimedia.org)
[15:53:20] <sirenbot>	 7027 (RESOLVED)  Host db2144 (paged)
[15:53:20] <sirenbot>	 7024 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[15:53:21] <sirenbot>	 7025 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[15:53:21] <sirenbot>	 7023 (RESOLVED)  ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[15:53:21] <sirenbot>	 7017 (RESOLVED)  Host db1221 (paged)
[15:53:21] <sirenbot>	 7022 (RESOLVED)  db1233 (paged)/MariaDB Replica Lag: s2 (paged)
[15:53:21] <sirenbot>	 7021 (RESOLVED)  db1259 (paged)/MariaDB Replica Lag: s2 (paged)
[15:53:22] <sirenbot>	 7020 (RESOLVED)  db1259 (paged)/MariaDB Replica IO: s2 (paged)
[15:53:22] <sirenbot>	 7019 (RESOLVED)  db1258 (paged)/MariaDB Replica IO: x3 (paged)
[15:53:22] <sirenbot>	 7018 (RESOLVED)  db1258 (paged)/MariaDB Replica Lag: x3 (paged)
[15:53:23] <sirenbot>	 7016 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[15:53:25] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[15:53:26] <jhathaway>	 o/
[15:53:37] <sukhe>	 jhathaway: so this is because we depooled drmrs and now esams is suffering
[15:53:42] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11388818 (10RobH) >>! In T405942#11388802, @Ladsgroup wrote: > Please ping me before moving of pc1014 so I depool pc4 cluster from rotation.  Will do, it w...
[15:53:43] <sukhe>	 topranks: I guess we weather this out for a bit? or what?
[15:53:54] <jhathaway>	 thanks sukhe
[15:54:20] <papaul>	 sukhe: do you want me to wait ?
[15:54:29] <cdanis>	 sukhe: we should look for scrapers of originals in esams
[15:54:32] <wikibugs>	 (03PS7) 10JHathaway: UEFI: dup partition on MD RAID boxes [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949)
[15:54:43] <wikibugs>	 (03CR) 10JHathaway: UEFI: dup partition on MD RAID boxes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway)
[15:54:47] <sukhe>	 moving to private
[15:57:46] <jinxer-wm>	 RESOLVED: Primary inbound port utilisation over 80%  #page: Device cr1-esams.wikimedia.org recovered from Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[15:59:06] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11388835 (10MoritzMuehlenhoff)
[15:59:19] <wikibugs>	 (03CR) 10Pmiazga: rest-gateway: assign ratelimit class by network range (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) (owner: 10Daniel Kinzler)
[15:59:22] <moritzm>	 !log installing brltty bugfix updates on trixie hosts
[15:59:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:15] <wikibugs>	 (03PS1) 10Bking: opensearch-cluster: Add cluster ro permissions to 'opensearch' user [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207194 (https://phabricator.wikimedia.org/T408012)
[16:01:51] <wikibugs>	 (03PS1) 10DCausse: apt: update opensearch3 key [puppet] - 10https://gerrit.wikimedia.org/r/1207195
[16:02:21] <wikibugs>	 (03CR) 10CI reject: [V:04-1] apt: update opensearch3 key [puppet] - 10https://gerrit.wikimedia.org/r/1207195 (owner: 10DCausse)
[16:02:45] <jinxer-wm>	 FIRING: Primary outbound port utilisation over 80%  #page: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[16:03:39] <Lucas_WMDE>	 Amir1: thanks (sorry I missed the ping)
[16:03:50] <moritzm>	 !log installing libvirt bugfix updates on trixie hosts
[16:03:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:14] <Lucas_WMDE>	 jouncebot: nowandnext
[16:04:14] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 55 minute(s)
[16:04:14] <jouncebot>	 In 1 hour(s) and 55 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1800)
[16:04:22] <Lucas_WMDE>	 though it sounds like it might not be a good idea to deploy right now
[16:05:14] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply
[16:06:23] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply
[16:06:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[16:06:57] <logmsgbot>	 !log pt1979@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: router upgrade
[16:07:45] <jinxer-wm>	 RESOLVED: Primary outbound port utilisation over 80%  #page: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[16:07:46] <jinxer-wm>	 FIRING: Primary inbound port utilisation over 80%  #page: Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[16:08:13] <vgutierrez>	 sukhe: drmrs still depooled?
[16:08:18] <sukhe>	 vgutierrez: yeah
[16:08:22] <sukhe>	 see -sec
[16:08:24] <wikibugs>	 (03CR) 10Jasmine: [C:03+2] Cleanup maintenance_hosts hiera variable use [puppet] - 10https://gerrit.wikimedia.org/r/1206877 (https://phabricator.wikimedia.org/T400442) (owner: 10Alexandros Kosiaris)
[16:08:32] <wikibugs>	 (03CR) 10Jasmine: [C:03+2] Empty maintenance_hosts array [puppet] - 10https://gerrit.wikimedia.org/r/1206876 (https://phabricator.wikimedia.org/T400442) (owner: 10Alexandros Kosiaris)
[16:08:53] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11388867 (10MoritzMuehlenhoff)
[16:11:25] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11388895 (10MoritzMuehlenhoff)
[16:14:04] <wikibugs>	 (03PS2) 10DCausse: apt: update opensearch3 key [puppet] - 10https://gerrit.wikimedia.org/r/1207195
[16:15:56] <icinga-wm>	 PROBLEM - Host doh6001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:15:56] <icinga-wm>	 PROBLEM - Host durum6001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:16:14] <icinga-wm>	 PROBLEM - Host tcp-proxy6001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:16:18] <icinga-wm>	 PROBLEM - Host install6003 is DOWN: PING CRITICAL - Packet loss = 100%
[16:16:30] <icinga-wm>	 PROBLEM - Host bast6003 is DOWN: PING CRITICAL - Packet loss = 100%
[16:16:34] <sukhe>	 yeah expected ^
[16:16:36] <papaul>	 that is me 
[16:16:41] <wikibugs>	 (03CR) 10Bking: [C:03+1] "post-merge +1" [puppet] - 10https://gerrit.wikimedia.org/r/1207107 (https://phabricator.wikimedia.org/T408853) (owner: 10Gehel)
[16:16:42] <icinga-wm>	 PROBLEM - Host ncredir6001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:16:45] <papaul>	 rebooting asw1-v12
[16:16:53] <papaul>	 b12
[16:16:57] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:17:46] <jinxer-wm>	 RESOLVED: Primary inbound port utilisation over 80%  #page: Device cr1-esams.wikimedia.org recovered from Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[16:18:24] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service ganeti6001:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:19:30] <jinxer-wm>	 FIRING: [2x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs6002:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[16:19:41] <sukhe>	 yeah well we should really silence all drmrs at this point
[16:19:46] <sukhe>	 jhathaway: Raine: ^
[16:19:55] <wikibugs>	 (03PS3) 10Bking: apt: update opensearch3 key [puppet] - 10https://gerrit.wikimedia.org/r/1207195 (https://phabricator.wikimedia.org/T407123) (owner: 10DCausse)
[16:20:01] <jhathaway>	 ok
[16:20:10] <Raine>	 sgtm sukhe 
[16:20:23] <jhathaway>	 do we have tooling to do that?
[16:21:29] <sukhe>	 jhathaway: https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org something here
[16:21:43] <jhathaway>	 nod
[16:21:57] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:22:57] <sukhe>	 jhathaway: Raine: sorry, Traffic should have really silenced this
[16:22:59] <sukhe>	 I can take that on
[16:23:24] <jinxer-wm>	 FIRING: [24x] JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:23:37] <jhathaway>	 happy to as well, but so far my alertmanager foo is failing me sukhe
[16:24:02] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.01e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[16:24:30] <jinxer-wm>	 FIRING: [4x] LibericaUnhealthyRealserverPooled: Liberica service upload-httpslb6_443 has 2 unhealthy realservers pooled on lvs6002:3003 - https://wikitech.wikimedia.org/wiki/Liberica#LibericaUnhealthyRealserverPooled  - https://alerts.wikimedia.org/?q=alertname%3DLibericaUnhealthyRealserverPooled
[16:24:34] <sukhe>	 done
[16:24:40] <sukhe>	 silenced all drmrs
[16:24:40] <jhathaway>	 thanks sukhe 
[16:24:46] <Raine>	 thanks <3
[16:26:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you for reaching out !" [puppet] - 10https://gerrit.wikimedia.org/r/1207174 (https://phabricator.wikimedia.org/T402613) (owner: 10Awight)
[16:26:26] <sukhe>	 no worries, this is my bad. we should have silenced it.
[16:27:10] <logmsgbot>	 !log bking@deploy2002 helmfile [default] START helmfile.d/dse-k8s-services/opensearch-test: apply
[16:27:11] <logmsgbot>	 !log bking@deploy2002 helmfile [default] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[16:27:20] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply
[16:27:30] <icinga-wm>	 RECOVERY - Host ncredir6001 is UP: PING OK - Packet loss = 0%, RTA = 87.71 ms
[16:27:32] <icinga-wm>	 RECOVERY - Host durum6001 is UP: PING OK - Packet loss = 0%, RTA = 88.79 ms
[16:27:34] <icinga-wm>	 RECOVERY - Host doh6001 is UP: PING OK - Packet loss = 0%, RTA = 87.56 ms
[16:27:35] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[16:27:42] <icinga-wm>	 RECOVERY - Host tcp-proxy6001 is UP: PING OK - Packet loss = 0%, RTA = 87.48 ms
[16:27:46] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply
[16:27:46] <icinga-wm>	 RECOVERY - Host install6003 is UP: PING OK - Packet loss = 0%, RTA = 87.64 ms
[16:27:58] <icinga-wm>	 RECOVERY - Host bast6003 is UP: PING OK - Packet loss = 0%, RTA = 87.48 ms
[16:28:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] install_server: workaround for mpt3sas large optimal_io_size [puppet] - 10https://gerrit.wikimedia.org/r/1207150 (https://phabricator.wikimedia.org/T407586) (owner: 10Filippo Giunchedi)
[16:28:35] <wikibugs>	 (03PS1) 10Muehlenhoff: Record LDAP access for kareid [puppet] - 10https://gerrit.wikimedia.org/r/1207200
[16:28:51] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply
[16:29:16] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply
[16:29:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for kareid [puppet] - 10https://gerrit.wikimedia.org/r/1207200 (owner: 10Muehlenhoff)
[16:29:59] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply
[16:30:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_ganeti_drmrs01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:31:06] <moritzm>	 godog, jasmine_: okay to puppet-merge your changes along?
[16:31:20] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Cleanup maintenance_hosts hiera variable use [puppet] - 10https://gerrit.wikimedia.org/r/1206877 (https://phabricator.wikimedia.org/T400442)
[16:31:32] <wikibugs>	 (03CR) 10Jasmine: [C:03+2] Cleanup maintenance_hosts hiera variable use [puppet] - 10https://gerrit.wikimedia.org/r/1206877 (https://phabricator.wikimedia.org/T400442) (owner: 10Alexandros Kosiaris)
[16:31:46] <godog>	 moritzm: yes please
[16:31:55] <moritzm>	 ok, merging
[16:33:22] <moritzm>	 and done
[16:33:41] <godog>	 thank you
[16:34:04] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS bookworm
[16:34:14] <jasmine_>	 moritzm: ty!
[16:35:22] <logmsgbot>	 !log filippo@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[16:35:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[16:36:51] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[16:36:52] <icinga-wm>	 PROBLEM - Host netflow6001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:36:52] <icinga-wm>	 PROBLEM - Host ncredir6002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:36:58] <icinga-wm>	 PROBLEM - Host doh6002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:36:58] <icinga-wm>	 PROBLEM - Host prometheus6002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:36:58] <icinga-wm>	 PROBLEM - Host durum6002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:37:00] <sukhe>	 well I did silence it sigh
[16:37:12] <sukhe>	 this is probably icinga then hmm
[16:37:14] <icinga-wm>	 PROBLEM - Host tcp-proxy6002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:37:45] <sukhe>	 does anyone recall the silencing in Icinga?
[16:38:15] <vgutierrez>	 sukhe: cookbook?
[16:38:34] <sukhe>	 yeah A:drmrs on hosts.downtime
[16:38:36] <sukhe>	 running
[16:38:56] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[16:39:14] <logmsgbot>	 !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 39 hosts with reason: site depool
[16:45:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_ganeti_drmrs01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:45:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: netbox_ganeti_drmrs01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:45:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[16:46:48] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11388988 (10Papaul)
[16:47:46] <Lucas_WMDE>	 I guess I’ll deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1207171 tomorrow instead, doesn’t sound like it’s okay to deploy at the moment and I’m about to sign off
[16:48:13] <icinga-wm>	 RECOVERY - Host ncredir6002 is UP: PING OK - Packet loss = 0%, RTA = 87.45 ms
[16:48:21] <icinga-wm>	 RECOVERY - Host netflow6001 is UP: PING OK - Packet loss = 0%, RTA = 87.41 ms
[16:48:29] <icinga-wm>	 RECOVERY - Host doh6002 is UP: PING OK - Packet loss = 0%, RTA = 87.36 ms
[16:48:29] <icinga-wm>	 RECOVERY - Host prometheus6002 is UP: PING OK - Packet loss = 0%, RTA = 87.42 ms
[16:48:29] <icinga-wm>	 RECOVERY - Host durum6002 is UP: PING OK - Packet loss = 0%, RTA = 87.45 ms
[16:48:43] <icinga-wm>	 RECOVERY - Host tcp-proxy6002 is UP: PING OK - Packet loss = 0%, RTA = 87.56 ms
[16:51:31] <wikibugs>	 (03PS1) 10DLynch: TextMatchEditCheck: undo duplicate sub-type logging [extensions/VisualEditor] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207201 (https://phabricator.wikimedia.org/T407286)
[16:51:42] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/VisualEditor] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207201 (https://phabricator.wikimedia.org/T407286) (owner: 10DLynch)
[16:52:54] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1008-dev.eqiad.wmnet with reason: host reimage
[16:53:03] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841#11389010 (10cmooney)
[16:53:50] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudcontrol2010-dev: remove pause-reboot [puppet] - 10https://gerrit.wikimedia.org/r/1207202 (https://phabricator.wikimedia.org/T409328)
[16:56:04] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] cloudcontrol2010-dev: remove pause-reboot [puppet] - 10https://gerrit.wikimedia.org/r/1207202 (https://phabricator.wikimedia.org/T409328) (owner: 10Andrew Bogott)
[16:56:31] <logmsgbot>	 !log filippo@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
[16:56:55] <wikibugs>	 (03PS1) 10Filippo Giunchedi: install_server: restore cloudcontrol2010-dev unattended installation [puppet] - 10https://gerrit.wikimedia.org/r/1207203 (https://phabricator.wikimedia.org/T407586)
[16:57:09] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1008-dev.eqiad.wmnet with reason: host reimage
[16:58:05] <wikibugs>	 (03Abandoned) 10Filippo Giunchedi: install_server: restore cloudcontrol2010-dev unattended installation [puppet] - 10https://gerrit.wikimedia.org/r/1207203 (https://phabricator.wikimedia.org/T407586) (owner: 10Filippo Giunchedi)
[16:58:37] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[16:58:50] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[16:59:06] <sukhe>	 Raine: want to repool drmrs in case you haven't done it before? it's good practise :)
[16:59:19] <Raine>	 sukhe: sure :D one sec
[16:59:33] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[16:59:38] <sukhe>	 https://wikitech.wikimedia.org/wiki/DNS#Change_GeoDNS_/_Depool_a_Site
[16:59:43] <sukhe>	 sudo cookbook sre.dns.admin pool drmrs
[16:59:47] <sukhe>	 follow prompt and that's it
[17:00:03] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.remove-downtime for cr[1-2]-drmrs IPv6,cr[1-2]-drmrs.mgmt
[17:00:05] <Raine>	 oh, when I was young, it was a puppet patch :D
[17:00:07] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cr[1-2]-drmrs IPv6,cr[1-2]-drmrs.mgmt
[17:00:13] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[17:00:52] <logmsgbot>	 !log kamila@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool site drmrs [reason: no reason specified, ]
[17:00:55] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: netbox_ganeti_drmrs01_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:00:58] <logmsgbot>	 !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
[17:01:03] <logmsgbot>	 !log kamila@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site drmrs [reason: no reason specified, ]
[17:01:58] <Raine>	 sukhe: done
[17:02:19] <Raine>	 this is much better than a puppet patch \o/
[17:02:21] <sukhe>	 Raine: nice thanks!
[17:02:27] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:02:35] <sukhe>	 yeah, we are encouraging everyone to run this when not in an emergency and hence the ask
[17:02:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[17:02:44] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[17:02:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:02:45] <Raine>	 excellent
[17:02:56] <Raine>	 ^ less excellent
[17:03:00] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.remove-downtime for 39 hosts
[17:03:22] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 39 hosts
[17:03:24] <jinxer-wm>	 RESOLVED: [27x] JobUnavailable: Reduced availability for job benthos in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:03:24] <jinxer-wm>	 RESOLVED: [13x] ProbeDown: Service ganeti6001:1811 has failed probes (tcp_ganeti_noded_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:03:35] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on pc1014.eqiad.wmnet with reason: C/D Migration
[17:03:54] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11389033 (10Marostegui) >>! In T405942#11388802, @Ladsgroup wrote: > Please ping me before moving of pc1014 so I depool pc4 cluster from rotation.  pc4 was...
[17:04:28] <claime>	 Raine: Just do a roll restart of mobileapps
[17:04:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:04:56] <sukhe>	 ok, let's look
[17:05:10] <logmsgbot>	 !log filippo@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[17:05:40] <sukhe>	 it's just one node, wonder why it says widespread
[17:06:14] <Raine>	 claime: re mobileapps, will do, but any idea why it happened?
[17:06:29] <hnowlan>	 mobileapps has been flapping for close to a week 
[17:06:33] <hnowlan>	 https://phabricator.wikimedia.org/T410296
[17:06:51] <Raine>	 oh, okay
[17:07:01] <Raine>	 thanks for the context hnowlan 
[17:07:02] <claime>	 sukhe: how many hosts are there in drmrs though?
[17:07:18] <claime>	 Graph says 12.5% failed
[17:07:34] <sukhe>	 claime: 39, only one was failing when I looked at least, but maybe there were more before?
[17:07:38] <sukhe>	 ah ok, which graph is that?
[17:07:50] <claime>	 Widespread puppet failure is >3%
[17:07:57] <claime>	 sukhe: https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6
[17:08:02] <claime>	 The one linked in the alert
[17:08:13] <sukhe>	 ha ok, right, I thought there was something in puppetboard too and I never knew
[17:08:15] <logmsgbot>	 !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Testing all optimize (T410401)', diff saved to https://phabricator.wikimedia.org/P85394 and previous config saved to /var/cache/conftool/dbconfig/20251119-170814-ladsgroup.json
[17:08:19] <stashbot>	 T410401: Optimize all the things (=MySQL tables) - https://phabricator.wikimedia.org/T410401
[17:08:34] <sukhe>	 puppetboard was showing just one
[17:08:36] <mutante>	 it doesn't take that many new hosts to fail to make it "widespread" because the baseline is already close to the threshold
[17:08:40] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply
[17:08:46] <sukhe>	 but I am guessing it was transient
[17:08:55] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[17:08:55] <claime>	 yeah, 1 host is already 2.5% basically
[17:09:08] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:09:11] <mutante>	 so widespread means "1" :)
[17:09:13] <sukhe>	 anyway should clear up now
[17:09:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
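The threshold arithmetic in the exchange above can be sketched as follows. This is a minimal illustration, assuming the 39-host count sukhe quotes for drmrs and the ">3%" threshold claime quotes; the function names are hypothetical, and the real alert expression is not shown in this log.

```python
import math

# Assumed values taken from the chat, not from the actual alert rule.
DRMRS_HOSTS = 39
THRESHOLD_PERCENT = 3.0


def failed_percent(failed_hosts: int, total_hosts: int) -> float:
    """Return the share of hosts with failed Puppet runs, as a percentage."""
    return 100.0 * failed_hosts / total_hosts


def is_widespread(failed_hosts: int, total_hosts: int,
                  threshold_percent: float = THRESHOLD_PERCENT) -> bool:
    """True when the failed share is strictly above the threshold."""
    return failed_percent(failed_hosts, total_hosts) > threshold_percent


def min_failures_to_fire(total_hosts: int,
                         threshold_percent: float = THRESHOLD_PERCENT) -> int:
    """Smallest number of failed hosts whose share exceeds the threshold."""
    return math.floor(total_hosts * threshold_percent / 100.0) + 1


if __name__ == "__main__":
    for failed in (1, 2):
        pct = failed_percent(failed, DRMRS_HOSTS)
        print(f"{failed} failed of {DRMRS_HOSTS}: {pct:.2f}% "
              f"widespread={is_widespread(failed, DRMRS_HOSTS)}")
    print("failures needed to fire:", min_failures_to_fire(DRMRS_HOSTS))
```

Under these assumptions one failed host is about 2.6% of the site, just under the threshold, and a second failure (about 5.1%) is enough to cross it, which matches the point that the baseline is already close to the threshold.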
[17:09:54] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply
[17:10:02] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[17:10:12] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[17:10:19] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[17:12:21] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply
[17:12:29] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[17:13:21] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: sync
[17:14:08] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on dbstore1007.eqiad.wmnet with reason: C/D Migration
[17:14:33] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1203191 (https://phabricator.wikimedia.org/T409776) (owner: Aaron Schulz)
[17:14:44] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: sync
[17:16:23] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Repool pc4 T405942', diff saved to https://phabricator.wikimedia.org/P85395 and previous config saved to /var/cache/conftool/dbconfig/20251119-171622-marostegui.json
[17:16:28] <stashbot>	 T405942: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942
[17:16:44] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1008-dev.eqiad.wmnet with OS bookworm
[17:17:28] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11389068 (Marostegui) Repooled pc4 as Rob confirmed pc1014 has been moved.
[17:17:44] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[17:17:44] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[17:17:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:21:19] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host releases1003.eqiad.wmnet with OS bookworm
[17:22:20] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on moss-be1002.eqiad.wmnet with reason: C/D Migration
[17:22:39] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11389081 (Papaul) I think I am wrong on the public vlan for rack 22. We will not be re-imaging the servers in that rack with public vlan just changing the ne...
[17:24:07] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11389084 (Papaul) @ayounsi for the feed back i will work on it
[17:30:54] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "The PCC results LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1207174 (https://phabricator.wikimedia.org/T402613) (owner: Awight)
[17:32:58] <robh>	 !log wikikube c6 hosts depooling for migration
[17:33:00] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1260-1269].eqiad.wmnet
[17:33:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:34] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389152 (RobH) Depooling wikikube in rack C6:    sudo cookbook sre.k8s.pool-depool-node -t T405950 -r 'network migration' --k8s-cluster wikikube-eqiad depool wikikube-worker1...
[17:37:36] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: db2144 (ms2) memory error - https://phabricator.wikimedia.org/T410480#11389156 (Jhancock.wm) @Marostegui i rotated DIMM_A6 with DIMM_A10 to see if the error follows the stick. unfortunately, we do have to wait for it to happen again to diagnose it. Since the cpu error...
[17:38:24] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on releases1003.eqiad.wmnet with reason: host reimage
[17:38:46] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1260-1269].eqiad.wmnet
[17:38:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389160 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1260-1269].eqiad.wmnet completed: - wikikub...
[17:39:17] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1036,1051-1052,1054-1055,1083].eqiad.wmnet
[17:42:17] <wikibugs>	 (PS1) Alexandros Kosiaris: relforge: Clarify comment about cumin masters role [puppet] - https://gerrit.wikimedia.org/r/1207212
[17:42:50] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1036,1051-1052,1054-1055,1083].eqiad.wmnet
[17:42:58] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389174 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1036,1051-1052,1054-1055,1083].eqiad.wmnet...
[17:43:42] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on releases1003.eqiad.wmnet with reason: host reimage
[17:43:54] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1260.eqiad.wmnet with reason: C/D Migration
[17:46:03] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11389179 (Jclark-ctr) Open→Resolved a:BTullis→Jclark-ctr no additional errors  I will close ticket and figure out...
[17:48:02] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1261.eqiad.wmnet with reason: C/D Migration
[17:48:31] <wikibugs>	 SRE-swift-storage, Commons, media-backups: File not found: /v1/AUTH_mw/wikipedia-commons-local-public ... for 3 files - https://phabricator.wikimedia.org/T400567#11389187 (Bugreporter) >>! In T400567#11039161, @jcrespo wrote: >>>! In T400567#11038949, @GPSLeo wrote: >> As there are likely many more o...
[17:50:46] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1262.eqiad.wmnet with reason: C/D Migration
[17:53:14] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1263.eqiad.wmnet with reason: C/D Migration
[17:53:31] <wikibugs>	 (CR) Aaron Schulz: rest-gateway: migrate /api/rest_v1/ sandbox to Special:RestSandbox (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1190754 (https://phabricator.wikimedia.org/T396807) (owner: Aaron Schulz)
[17:53:32] <wikibugs>	 (CR) CDobbins: [C:+2] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: CDobbins)
[17:55:25] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1264.eqiad.wmnet with reason: C/D Migration
[17:56:57] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1265.eqiad.wmnet with reason: C/D Migration
[17:58:26] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1266.eqiad.wmnet with reason: C/D Migration
[17:59:52] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11389226 (ayounsi) Thanks, looks like I missed it in my first look but it seems doable through Redfish on Dell : ` >>> dump.set('NIC.Integrated.1-2-1', 'Broadcom_LLDPNearestBridge...
[18:00:05] <jouncebot>	 swfrench-wmf: Time to do the MediaWiki infrastructure (UTC late) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1800).
[18:00:36] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1267.eqiad.wmnet with reason: C/D Migration
[18:01:50] <wikibugs>	 ops-eqiad, DC-Ops: eno8303 on db1219:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T410536 (phaultfinder) NEW
[18:01:51] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1051.eqiad.wmnet with reason: C/D Migration
[18:02:28] <swfrench-wmf>	 o/
[18:03:45] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[18:04:01] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1052.eqiad.wmnet with reason: C/D Migration
[18:04:08] <wikibugs>	 SRE-tools, Infrastructure-Foundations, serviceops: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537 (RLazarus) NEW p:Triage→Medium
[18:05:41] <wikibugs>	 (Merged) jenkins-bot: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: CDobbins)
[18:05:52] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1053.eqiad.wmnet with reason: C/D Migration
[18:07:35] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1054.eqiad.wmnet with reason: C/D Migration
[18:08:31] <wikibugs>	 SRE-tools, Infrastructure-Foundations, serviceops: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11389289 (RLazarus) (I'm not married to the specific CLI syntax in the example. Among other things, making it an --optional-flag means that the positional `host...
[18:09:18] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1055.eqiad.wmnet with reason: C/D Migration
[18:09:31] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host releases1003.eqiad.wmnet with OS bookworm
[18:10:49] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1083.eqiad.wmnet with reason: C/D Migration
[18:11:50] <wikibugs>	 ops-eqiad, DC-Ops: eno8303 on db1220:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T410539 (phaultfinder) NEW
[18:12:01] <logmsgbot>	 !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs7003*} and A:liberica
[18:12:36] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1268.eqiad.wmnet with reason: C/D Migration
[18:12:54] <icinga-wm>	 PROBLEM - Host db1219 #page is DOWN: PING CRITICAL - Packet loss = 100%
[18:13:57] <swfrench-wmf>	 db1219 is in C6 - robh: are you migrating that today?
[18:14:40] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1269.eqiad.wmnet with reason: C/D Migration
[18:15:05] <icinga-wm>	 RECOVERY - Host db1219 #page is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[18:15:31] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s1 #page on db1219 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2005, Errmsg: error reconnecting to master repl2024@db1163.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Unknown server host db1163.eqiad.wmnet (-3) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:15:50] <Raine>	 uh oh
[18:16:26] <swfrench-wmf>	 possible inadvertent cable bump?
[18:16:31] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s1 #page on db1219 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:16:32] <Raine>	 hope so :D
[18:16:56] <jhathaway>	 Or in the same rack? 
[18:17:02] <swfrench-wmf>	 yeah, it's on C6
[18:17:05] <Raine>	 yeah
[18:18:09] * swfrench-wmf is going to defer any deployments planned for this infra window
[18:21:19] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1036.eqiad.wmnet with reason: C/D Migration
[18:21:25] <brett>	 !log import prometheus-rdkafka-exporter 0.4~deb13u1 into trixie-wikimedia - T401832
[18:21:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:30] <stashbot>	 T401832: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832
[18:23:41] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1260-1269].eqiad.wmnet
[18:23:44] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1036,1051-1052,1054-1055,1083].eqiad.wmnet
[18:23:50] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1260-1269].eqiad.wmnet
[18:23:51] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1036,1051-1052,1054-1055,1083].eqiad.wmnet
[18:23:55] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
[18:24:00] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389329 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1260-1269].eqiad.wmnet completed: - wikikube-...
[18:24:03] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389330 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1036,1051-1052,1054-1055,1083].eqiad.wmnet co...
[18:27:11] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
[18:28:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389339 (RobH) Ran:  sudo cookbook sre.k8s.pool-depool-node -t T405950 -r 'network migration' --k8s-cluster wikikube-eqiad pool wikikube-worker126[0-9].eqiad.wmnet sudo cookb...
[18:32:35] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: db2144 (ms2) memory error - https://phabricator.wikimedia.org/T410480#11389364 (Marostegui) Thanks - I'll repool the host tomorrow!
[18:35:28] <wikibugs>	 (CR) BCornwall: "thetimespedia.in is meant to be redirected to the diff post per legal." [puppet] - https://gerrit.wikimedia.org/r/1204093 (owner: Ncmonitor)
[18:37:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:37:42] <wikibugs>	 (PS2) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1204093 (owner: Ncmonitor)
[18:38:04] <wikibugs>	 (PS3) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1204093 (owner: Ncmonitor)
[18:38:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-eqiad and 208.80.154.209 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:38:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: eno8303 on db1220:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T410539#11389429 (Jclark-ctr) Open→Resolved a:Jclark-ctr during maintenance  of nokia refresh in C6 today this server went down to 100mbps  Replaced faulty optic returned...
[18:39:31] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389435 (RobH) Going to depool wikikube in rack eqiad D1 for port migrations.  sudo cookbook sre.k8s.pool-depool-node -t T405950 -r 'network migration' --k8s-cluster wikikube...
[18:40:21] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan - https://phabricator.wikimedia.org/T409690#11389442 (cmooney) Open→Resolved a:cmooney Ok this is now done across the whole estate, eqiad and...
[18:40:55] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1140-1141,1160-1161].eqiad.wmnet
[18:42:05] <wikibugs>	 SRE-swift-storage, Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11389451 (Ladsgroup)
[18:43:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 208.80.154.209 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[18:43:14] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1140-1141,1160-1161].eqiad.wmnet
[18:43:29] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389453 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1140-1141,1160-1161].eqiad.wmnet completed:...
[18:44:46] <wikibugs>	 ops-eqiad, SRE, DC-Ops: eno8303 on db1219:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T410536#11389469 (Jclark-ctr) Open→Resolved a:Jclark-ctr during maintenance of nokia refresh in C6 today this server went down to 100mbps   Speed did return to normal shor...
[18:45:46] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1140.eqiad.wmnet with reason: C/D Migration
[18:46:39] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1270-1275].eqiad.wmnet
[18:48:26] <wikibugs>	 SRE, collaboration-services, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11389475 (BCornwall) To add on, what about the maintenance of package.json and the dependencies that it pulls in?
[18:49:25] <logmsgbot>	 !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs7003*} and A:liberica
[18:49:51] <brett>	 !log import purged 0.24+deb13u1 into trixie-wikimedia - T401832
[18:49:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:55] <stashbot>	 T401832: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832
[18:50:07] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1270-1275].eqiad.wmnet
[18:50:14] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389493 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1270-1275].eqiad.wmnet completed: - wikikub...
[18:50:56] <wikibugs>	 (CR) BCornwall: [C:+2] DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1204092 (owner: Ncmonitor)
[18:51:13] <logmsgbot>	 !log brett@dns1006 START - running authdns-update
[18:51:15] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:52:13] <logmsgbot>	 !log brett@dns1006 END - running authdns-update
[18:52:22] <wikibugs>	 SRE-swift-storage, Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11389503 (Ladsgroup)
[18:57:44] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[18:58:12] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1141.eqiad.wmnet with reason: C/D Migration
[19:00:05] <jouncebot>	 brennen and andre: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T1900).
[19:00:38] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1270.eqiad.wmnet with reason: C/D Migration
[19:01:06] <brennen>	 o/
[19:03:00] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1271.eqiad.wmnet with reason: C/D Migration
[19:03:32] <logmsgbot>	 andrew@cumin2002 reimage (PID 563205) is awaiting input
[19:03:48] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[19:03:52] <wikibugs>	 SRE-swift-storage, Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11389563 (Ladsgroup)
[19:04:34] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1272.eqiad.wmnet with reason: C/D Migration
[19:05:03] <wikibugs>	 (CR) Pppery: [C:+1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1204093 (owner: Ncmonitor)
[19:07:17] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1273.eqiad.wmnet with reason: C/D Migration
[19:10:32] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1274.eqiad.wmnet with reason: C/D Migration
[19:11:51] <wikibugs>	 (PS2) Bking: opensearch-cluster: Add cluster ro perms to 'opensearch' user, increase default num of replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1207194 (https://phabricator.wikimedia.org/T408012)
[19:13:09] <brennen>	 !log 1.46.0-wmf.3 train status (T408273): no current blockers, logs clean, rolling to group1
[19:13:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:14] <stashbot>	 T408273: 1.46.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T408273
[19:13:15] <wikibugs>	 (CR) CI reject: [V:-1] opensearch-cluster: Add cluster ro perms to 'opensearch' user, increase default num of replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1207194 (https://phabricator.wikimedia.org/T408012) (owner: Bking)
[19:13:58] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1275.eqiad.wmnet with reason: C/D Migration
[19:14:57] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.46.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1207235 (https://phabricator.wikimedia.org/T408273)
[19:14:59] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Initiated by brennen@deploy2002" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207235 (https://phabricator.wikimedia.org/T408273) (owner: TrainBranchBot)
[19:15:46] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.46.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1207235 (https://phabricator.wikimedia.org/T408273) (owner: TrainBranchBot)
[19:16:19] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1160.eqiad.wmnet with reason: C/D Migration
[19:18:34] <logmsgbot>	 !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1161.eqiad.wmnet with reason: C/D Migration
[19:21:00] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1140-1141,1160-1161].eqiad.wmnet
[19:21:06] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1140-1141,1160-1161].eqiad.wmnet
[19:21:09] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1270-1275].eqiad.wmnet
[19:21:16] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1270-1275].eqiad.wmnet
[19:21:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389632 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1140-1141,1160-1161].eqiad.wmnet completed: -...
[19:21:24] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389633 (ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1270-1275].eqiad.wmnet completed: - wikikube-...
[19:22:14] <wikibugs>	 (CR) Superpes15: [C:+1] tcywikisource: Migrate $wgAccountCreationThrottle to throttle.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1207171 (https://phabricator.wikimedia.org/T410507) (owner: Lucas Werkmeister (WMDE))
[19:23:53] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
[19:23:59] <logmsgbot>	 !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.46.0-wmf.3  refs T408273
[19:24:04] <stashbot>	 T408273: 1.46.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T408273
[19:27:02] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11389651 (RobH) Day 7 Update: * 33 hosts moved today, 44 remain * all row c wikikube migrated, some of row D wikikube migrated ** 23 wikikube hosts remain o...
[19:27:49] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
[19:31:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11389672 (RobH) a:RobH→brouberol @brouberol, you were tagged into this task by  T405950#11236474 but I don't have any feedback on the migration details for kafka-main1...
[19:31:46] <wikibugs>	 (CR) BCornwall: [C:+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1204094 (owner: Ncmonitor)
[19:31:57] <wikibugs>	 (CR) BCornwall: [C:+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1204093 (owner: Ncmonitor)
[19:34:57] <jinxer-wm>	 FIRING: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[19:37:35] <wikibugs>	 (PS1) Novem Linguae: README: remove outdated advice about dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/1207241
[19:39:49] <wikibugs>	 (CR) Novem Linguae: "In response to code review comments at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1206851/1/README#10" [mediawiki-config] - https://gerrit.wikimedia.org/r/1207241 (owner: Novem Linguae)
[19:39:57] <jinxer-wm>	 RESOLVED: CalicoHighMemoryUsage: Calico container calico-node-wb54d:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s-staging&var-container_name=calico-node - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[19:40:30] <wikibugs>	 (CR) Novem Linguae: undeploy Extension:Capiunto (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1206851 (https://phabricator.wikimedia.org/T410172) (owner: Novem Linguae)
[19:41:58] <wikibugs>	 (PS1) Dzahn: admin: remove bvibber from releasers-mobile [puppet] - https://gerrit.wikimedia.org/r/1207243
[19:43:20] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 9525 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[19:46:56] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1070.eqiad.wmnet with OS trixie
[19:47:26] <wikibugs>	 (03PS3) 10Bking: opensearch-cluster: Add cluster ro perms to 'opensearch' user, increase default num of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207194 (https://phabricator.wikimedia.org/T408012)
[19:49:26] <wikibugs>	 (03PS1) 10Aude: Remove action_context from page_load events in ReadingList A/B test [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207245 (https://phabricator.wikimedia.org/T410535)
[19:49:50] <wikibugs>	 (03PS1) 10Aude: Remove action_context from page_load events in ReadingList A/B test [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207246 (https://phabricator.wikimedia.org/T410535)
[19:49:54] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply
[19:50:08] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[19:50:23] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207246 (https://phabricator.wikimedia.org/T410535) (owner: 10Aude)
[19:50:29] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
[19:50:36] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
[19:50:39] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207245 (https://phabricator.wikimedia.org/T410535) (owner: 10Aude)
[19:52:10] <logmsgbot>	 !log denisse@deploy2002 Started deploy [librenms/librenms@d152b36]: Upgrade LibreNMS to 25.11.0 - T410519
[19:52:26] <logmsgbot>	 !log denisse@deploy2002 Finished deploy [librenms/librenms@d152b36]: Upgrade LibreNMS to 25.11.0 - T410519 (duration: 00m 16s)
[20:01:58] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1070.eqiad.wmnet with reason: host reimage
[20:07:10] <logmsgbot>	 !log ammarpad@deploy2002 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=loginwiki --logwiki=metawiki Manueldinardo08 'Renamed user 7fd4cfd08628d295620b39574c59750f'  # T410545
[20:07:14] <stashbot>	 T410545: Unblock stuck global rename of Renamed user 7fd4cfd08628d295620b39574c59750f - https://phabricator.wikimedia.org/T410545
[20:07:35] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie
[20:07:54] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1070.eqiad.wmnet with reason: host reimage
[20:09:55] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "LGTM other than inline nits/questions" [puppet] - 10https://gerrit.wikimedia.org/r/1204073 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[20:12:51] <wikibugs>	 (03PS4) 10Ssingh: O:haptcha::proxy: add new role for hCaptcha proxy VMs (bird/anycast) [puppet] - 10https://gerrit.wikimedia.org/r/1204073 (https://phabricator.wikimedia.org/T409780)
[20:13:14] <wikibugs>	 (03CR) 10Ssingh: O:haptcha::proxy: add new role for hCaptcha proxy VMs (bird/anycast) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1204073 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[20:16:06] <wikibugs>	 (03CR) 10Bvibber: [C:03+1] "Can confirm I do not need to be in this group at this time. :)" [puppet] - 10https://gerrit.wikimedia.org/r/1207243 (owner: 10Dzahn)
[20:16:16] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] site.pp: reimage hcaptcha-proxy1001 to proper role [puppet] - 10https://gerrit.wikimedia.org/r/1207165 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[20:16:53] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] P:bird::anycast_monitoring: add hcaptcha-proxy.anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1204074 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[20:17:42] <wikibugs>	 (03CR) 10Dzahn: gerrit: add dry run rsync (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1195437 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[20:18:09] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] admin: remove bvibber from releasers-mobile [puppet] - 10https://gerrit.wikimedia.org/r/1207243 (owner: 10Dzahn)
[20:19:24] <wikibugs>	 (03PS1) 10Kamila Součková: hcaptcha_proxy: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/1207250
[20:19:57] <wikibugs>	 (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207250 (owner: 10Kamila Součková)
[20:33:37] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1070.eqiad.wmnet with OS trixie
[20:38:57] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T2100)
[21:00:05] <jouncebot>	 kostajh, kemayo, AaronSchulz, and aude: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:21] <kostajh>	 hi
[21:00:33] <aude>	 i'm here but can wait my turn
[21:01:07] <kostajh>	 I'll start with mine, then
[21:01:14] <aude>	 ok
[21:01:22] <Kemayo>	 Mine can be bundled in with anyone else's if you want.
[21:02:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207108 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan)
[21:02:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206960 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan)
[21:02:47] <AaronSchulz>	 mine can be bundled as well
[21:04:01] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Record A/B test experiment group [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207108 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan)
[21:04:06] <aude>	 mine can be too.  but idk how that works exactly
[21:06:08] <wikibugs>	 (03PS3) 10Majavah: Initial configuration for tokwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205954 (https://phabricator.wikimedia.org/T404457)
[21:06:08] <wikibugs>	 (03PS3) 10Majavah: Activate tokwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205955 (https://phabricator.wikimedia.org/T404457)
[21:06:08] <wikibugs>	 (03PS3) 10Majavah: Set up tokwiki namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205956 (https://phabricator.wikimedia.org/T404457)
[21:06:08] <wikibugs>	 (03PS1) 10Majavah: Allow account creation on tokwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207262 (https://phabricator.wikimedia.org/T404457)
[21:06:58] <wikibugs>	 (03CR) 10Majavah: Set up tokwiki namespaces (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205956 (https://phabricator.wikimedia.org/T404457) (owner: 10Majavah)
[21:09:06] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for catrope - https://phabricator.wikimedia.org/T410473#11389974 (10Catrope) 05Open→03Resolved a:03Volans Everything works great, thanks!
[21:09:08] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:10:40] <wikibugs>	 (03PS2) 10Kamila Součková: hcaptcha_proxy: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/1207250
[21:10:42] <wikibugs>	 (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1207250 (owner: 10Kamila Součková)
[21:12:30] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Record A/B test experiment group [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206960 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan)
[21:13:03] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1207108|hCaptcha: Record A/B test experiment group (T410354)]], [[gerrit:1206960|hCaptcha: Record A/B test experiment group (T410354)]]
[21:13:08] <stashbot>	 T410354: hCaptcha: Enable A/B test for jawiki and zhwiki - https://phabricator.wikimedia.org/T410354
[21:15:26] <jinxer-wm>	 FIRING: ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:16:59] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:17:48] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1207108|hCaptcha: Record A/B test experiment group (T410354)]], [[gerrit:1206960|hCaptcha: Record A/B test experiment group (T410354)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:20:10] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[21:21:04] <wikibugs>	 (03CR) 10Kamila Součková: "I'll remove these from labs and puppet-private hiera too." [puppet] - 10https://gerrit.wikimedia.org/r/1207250 (owner: 10Kamila Součková)
[21:21:44] <wikibugs>	 (03PS3) 10Kamila Součková: hcaptcha_proxy: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/1207250
[21:22:08] <wikibugs>	 (03PS4) 10Kamila Součková: hcaptcha_proxy: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/1207250
[21:24:19] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207108|hCaptcha: Record A/B test experiment group (T410354)]], [[gerrit:1206960|hCaptcha: Record A/B test experiment group (T410354)]] (duration: 11m 16s)
[21:24:24] <stashbot>	 T410354: hCaptcha: Enable A/B test for jawiki and zhwiki - https://phabricator.wikimedia.org/T410354
[21:24:32] <kostajh>	 Syncing a security patch, then my config patch
[21:25:09] <wikibugs>	 (03PS1) 10Kamila Součková: hcaptcha_proxy: remove unused parameters [labs/private] - 10https://gerrit.wikimedia.org/r/1207265
[21:25:26] <jinxer-wm>	 RESOLVED: ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:26:59] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:28:40] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] O:haptcha::proxy: add new role for hCaptcha proxy VMs (bird/anycast) [puppet] - 10https://gerrit.wikimedia.org/r/1204073 (https://phabricator.wikimedia.org/T409780) (owner: 10Ssingh)
[21:28:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[21:30:18] <wikibugs>	 (03PS1) 10Aaron Schulz: rest-gateway: support REST sandbox requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207267 (https://phabricator.wikimedia.org/T396807)
[21:31:05] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:31:36] <kostajh>	 syncing PrivateSettings.php now
[21:31:59] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:34:01] <wikibugs>	 (03CR) 10Aaron Schulz: "Alternatively, I made https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1207267 to do more of this on the gateway level. Tha" [puppet] - 10https://gerrit.wikimedia.org/r/1190754 (https://phabricator.wikimedia.org/T396807) (owner: 10Aaron Schulz)
[21:36:05] <jinxer-wm>	 FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:39:10] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] README: remove outdated advice about dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207241 (owner: 10Novem Linguae)
[21:40:52] <wikibugs>	 (03CR) 10Novem Linguae: "Can we +2 this and have it ride the train? Or does it need a backport?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207241 (owner: 10Novem Linguae)
[21:40:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206830 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan)
[21:41:05] <jinxer-wm>	 FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:41:44] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Enable A/B edit test on zhwiki and jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206830 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan)
[21:41:56] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] "No, this is the production config repo, all merges must be immediately deployed. But it's not urgent to fix docs that should have been cor" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207241 (owner: 10Novem Linguae)
[21:42:16] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1206830|hCaptcha: Enable A/B edit test on zhwiki and jawiki (T410354)]]
[21:42:20] <stashbot>	 T410354: hCaptcha: Enable A/B test for jawiki and zhwiki - https://phabricator.wikimedia.org/T410354
[21:46:05] <jinxer-wm>	 RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:46:57] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207241 (owner: 10Novem Linguae)
[21:47:01] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1206830|hCaptcha: Enable A/B edit test on zhwiki and jawiki (T410354)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:47:56] <NovemLinguae>	 I just added a README file change to the backport window if that's easy to squeeze in. If not don't worry about it. https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=prev&oldid=2363265
[21:49:11] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[21:49:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T410563, transfer main graph to lagged host) xfer wikidata_main from wdqs1015.eqiad.wmnet -> wdqs1011.eqiad.wmnet, repooling both afterwards
[21:49:53] <stashbot>	 T410563: ProbeDown - https://phabricator.wikimedia.org/T410563
[21:51:47] <wikibugs>	 (03PS1) 10Scott French: mobileapps: revert to 2025-10-13-122439-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207271 (https://phabricator.wikimedia.org/T410296)
[21:52:34] <wikibugs>	 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11390152 (10ATitkov) > Who will be responsible for security review, when this is sharing important top level domains ?  @TheDJ Could it be possibly handled or at l...
[21:53:11] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206830|hCaptcha: Enable A/B edit test on zhwiki and jawiki (T410354)]] (duration: 10m 55s)
[21:53:16] <stashbot>	 T410354: hCaptcha: Enable A/B test for jawiki and zhwiki - https://phabricator.wikimedia.org/T410354
[21:53:27] <kostajh>	 I'm done 
[21:53:33] <kostajh>	 Kemayo aude over to you
[21:53:37] <kostajh>	 sorry that took so long!
[21:53:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207201 (https://phabricator.wikimedia.org/T407286) (owner: 10DLynch)
[21:53:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207245 (https://phabricator.wikimedia.org/T410535) (owner: 10Aude)
[21:53:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207246 (https://phabricator.wikimedia.org/T410535) (owner: 10Aude)
[21:53:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207241 (owner: 10Novem Linguae)
[21:54:33] <aude>	 thanks Kemayo!
[21:54:40] <wikibugs>	 (03Merged) 10jenkins-bot: README: remove outdated advice about dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207241 (owner: 10Novem Linguae)
[21:54:53] <Kemayo>	 kostajh: No worries. My only complaint is that there's not some vague "we can't tell you anything but here's a progress bar" for the waiting-for-a-security-patch part of it. :D
[21:55:25] <kostajh>	 yeah, the process is far from ideal
[21:56:12] <Kemayo>	 Novem's patch lacks anything to test. aude, will you need to check anything on the testservers, or should I go ahead when it's ready?
[21:56:37] <aude>	 i can quickly spot check on wmf.3
[21:56:54] * AaronSchulz still has https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1203191
[21:57:40] <Kemayo>	 AaronSchulz: oops, sorry, I didn't realize you were here or I'd have offered to throw that in to this bundle as well.
[21:58:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[21:59:02] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
[21:59:06] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
[21:59:22] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply
[21:59:40] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply
[22:00:04] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 06serviceops: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11390175 (10Volans) Just for context referencing past ideas on the topic: T327300
[22:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T2200)
[22:00:32] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply
[22:00:38] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply
[22:00:59] <wikibugs>	 (03CR) 10Scott French: "I'm happy to give this a try today or tomorrow, or please feel free to go ahead and merge / deploy at your convenience in the interim. Tha" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207271 (https://phabricator.wikimedia.org/T410296) (owner: 10Scott French)
[22:01:58] <wikibugs>	 (03PS4) 10Bking: opensearch-cluster: Add cluster ro perms to 'opensearch' user, increase default num of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1207194 (https://phabricator.wikimedia.org/T408012)
[22:02:35] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply
[22:02:44] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply
[22:03:44] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply
[22:03:51] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[22:05:13] <wikibugs>	 (03Merged) 10jenkins-bot: TextMatchEditCheck: undo duplicate sub-type logging [extensions/VisualEditor] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207201 (https://phabricator.wikimedia.org/T407286) (owner: 10DLynch)
[22:05:14] <wikibugs>	 (03Merged) 10jenkins-bot: Remove action_context from page_load events in ReadingList A/B test [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1207245 (https://phabricator.wikimedia.org/T410535) (owner: 10Aude)
[22:05:17] <wikibugs>	 (03Merged) 10jenkins-bot: Remove action_context from page_load events in ReadingList A/B test [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1207246 (https://phabricator.wikimedia.org/T410535) (owner: 10Aude)
[22:05:55] <logmsgbot>	 !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1207201|TextMatchEditCheck: undo duplicate sub-type logging (T407286)]], [[gerrit:1207245|Remove action_context from page_load events in ReadingList A/B test (T410535)]], [[gerrit:1207246|Remove action_context from page_load events in ReadingList A/B test (T410535)]], [[gerrit:1207241|README: remove outdated advice about dblists]]
[22:06:01] <stashbot>	 T407286: Log sub-types of textmatch checks to VEFU - https://phabricator.wikimedia.org/T407286
[22:06:01] <stashbot>	 T410535: Remove action_context from ReadingLists AB test page_load event - https://phabricator.wikimedia.org/T410535
[22:07:55] <wikibugs>	 (03PS1) 10Bvibber: Fix wgMediaViewerThumbnailBucketSizes to match wgThumbnailSteps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207273 (https://phabricator.wikimedia.org/T372165)
[22:08:56] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply
[22:09:00] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply
[22:10:56] <logmsgbot>	 !log kemayo@deploy2002 aude, kemayo, novemlinguae: Backport for [[gerrit:1207201|TextMatchEditCheck: undo duplicate sub-type logging (T407286)]], [[gerrit:1207245|Remove action_context from page_load events in ReadingList A/B test (T410535)]], [[gerrit:1207246|Remove action_context from page_load events in ReadingList A/B test (T410535)]], [[gerrit:1207241|README: remove outdated advice about dblists]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:11:02] <stashbot>	 T407286: Log sub-types of textmatch checks to VEFU - https://phabricator.wikimedia.org/T407286
[22:11:03] <stashbot>	 T410535: Remove action_context from ReadingLists AB test page_load event - https://phabricator.wikimedia.org/T410535
[22:11:04] <aude>	 checking
[22:12:36] <aude>	 looks good
[22:12:48] <Kemayo>	 Excellent, continuing the sync.
[22:12:51] <logmsgbot>	 !log kemayo@deploy2002 aude, kemayo, novemlinguae: Continuing with sync
[22:12:53] <aude>	 thank you!
[22:14:52] <wikibugs>	 06SRE, 10Phabricator: Replace deprecated Phabricator Conduit API call by @ProdPasteBot with its stable equivalent - https://phabricator.wikimedia.org/T410572 (10Aklapper) 03NEW p:05Triage→03Low
[22:15:35] <wikibugs>	 06SRE, 10Phabricator: Replace deprecated Phabricator Conduit API call by @ProdPasteBot with its stable equivalent - https://phabricator.wikimedia.org/T410572#11390254 (10Aklapper)
[22:16:13] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860
[22:16:13] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860
[22:16:17] <stashbot>	 T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860
[22:16:52] <logmsgbot>	 !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1207201|TextMatchEditCheck: undo duplicate sub-type logging (T407286)]], [[gerrit:1207245|Remove action_context from page_load events in ReadingList A/B test (T410535)]], [[gerrit:1207246|Remove action_context from page_load events in ReadingList A/B test (T410535)]], [[gerrit:1207241|README: remove outdated advice about dblists]] (duration: 10m 57s)
[22:16:58] <stashbot>	 T407286: Log sub-types of textmatch checks to VEFU - https://phabricator.wikimedia.org/T407286
[22:16:58] <stashbot>	 T410535: Remove action_context from ReadingLists AB test page_load event - https://phabricator.wikimedia.org/T410535
[22:22:21] <wikibugs>	 (03PS1) 10Arlolra: Deploy Parsoid Read Views to 18 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207276 (https://phabricator.wikimedia.org/T410564)
[22:22:47] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reboot (apply updates) - ryankemper@cumin2002 - T390860
[22:22:50] <AaronSchulz>	 Kemayo: done?
[22:22:52] <stashbot>	 T390860: Elasticsearch dependency upgrade in spicerack - https://phabricator.wikimedia.org/T390860
[22:25:29] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Infrastructure Security, 06serviceops, and 5 others: October 2025 Bullseye reboots: Search Platform-owned hosts - https://phabricator.wikimedia.org/T410573 (10bking) 03NEW
[22:25:33] <wikibugs>	 (03CR) 10Subramanya Sastry: [C:03+1] Deploy Parsoid Read Views to 18 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207276 (https://phabricator.wikimedia.org/T410564) (owner: 10Arlolra)
[22:28:53] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.reboot
[22:29:54] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Infrastructure Security, 06serviceops, and 5 others: October 2025 Bullseye reboots: Search Platform-owned hosts - https://phabricator.wikimedia.org/T410573#11390286 (10bking)
[22:32:00] <Kemayo>	 AaronSchulz: yes, done.
[22:32:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by aaron@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203191 (https://phabricator.wikimedia.org/T409776) (owner: 10Aaron Schulz)
[22:32:40] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11390294 (10RobH) Migration Update: Only 3 #data-persistence hosts remain for migration: pc101[678].   Chatted with @marosgui earlier in IRC and he'll be o...
[22:33:55] <wikibugs>	 (03Merged) 10jenkins-bot: Sandbox cleanup for the Wikimedia REST APIs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203191 (https://phabricator.wikimedia.org/T409776) (owner: 10Aaron Schulz)
[22:34:26] <logmsgbot>	 !log aaron@deploy2002 Started scap sync-world: Backport for [[gerrit:1203191|Sandbox cleanup for the Wikimedia REST APIs (T409776 T402426)]]
[22:34:31] <stashbot>	 T409776: Rename & clean up Wikimedia RESTBase APIs - https://phabricator.wikimedia.org/T409776
[22:34:32] <stashbot>	 T402426: OpenAPI description for Wikimedia REST API links to the wrong on-wiki documentation - https://phabricator.wikimedia.org/T402426
[22:37:03] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[22:37:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:39:27] <logmsgbot>	 !log aaron@deploy2002 aaron: Backport for [[gerrit:1203191|Sandbox cleanup for the Wikimedia REST APIs (T409776 T402426)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:39:33] <stashbot>	 T409776: Rename & clean up Wikimedia RESTBase APIs - https://phabricator.wikimedia.org/T409776
[22:39:33] <stashbot>	 T402426: OpenAPI description for Wikimedia REST API links to the wrong on-wiki documentation - https://phabricator.wikimedia.org/T402426
[22:39:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[22:42:03] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[22:43:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and Hurricane Electric (2001:7f8:54:5::13) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[22:43:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T410563, transfer main graph to lagged host) xfer wikidata_main from wdqs1015.eqiad.wmnet -> wdqs1011.eqiad.wmnet, repooling both afterwards
[22:43:44] <stashbot>	 T410563: ProbeDown - https://phabricator.wikimedia.org/T410563
[22:44:05] <logmsgbot>	 !log aaron@deploy2002 aaron: Continuing with sync
[22:48:09] <logmsgbot>	 !log aaron@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203191|Sandbox cleanup for the Wikimedia REST APIs (T409776 T402426)]] (duration: 13m 43s)
[22:48:15] <stashbot>	 T409776: Rename & clean up Wikimedia RESTBase APIs - https://phabricator.wikimedia.org/T409776
[22:48:15] <stashbot>	 T402426: OpenAPI description for Wikimedia REST API links to the wrong on-wiki documentation - https://phabricator.wikimedia.org/T402426
[22:49:10] * AaronSchulz is done
[22:49:20] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[22:49:56] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11390359 (RobH) @BTullis,  We're now down to 44 hosts overall to migrate, and 12 of those belong to your team.    Please...
[22:51:15] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:55:04] <wikibugs>	 (PS1) Ryan Kemper: elastic: reboot should check uptime not jvm start time [cookbooks] - https://gerrit.wikimedia.org/r/1207280 (https://phabricator.wikimedia.org/T410577)
[22:57:19] <wikibugs>	 ops-eqiad, SRE, DC-Ops, observability: eqiad row C/D Observability host migrations - https://phabricator.wikimedia.org/T405946#11390399 (RobH) p:Triage→High @herron,  We've migrated 9 of the 10 #observability hosts.  We're now only left with alert1002 which the notes detail will require s...
[22:59:50] <wikibugs>	 (CR) Ryan Kemper: elastic: reboot should check uptime not jvm start time (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1207280 (https://phabricator.wikimedia.org/T410577) (owner: Ryan Kemper)
[22:59:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[23:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251119T2300)
[23:00:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11390409 (RobH) >>! In T405950#11238805, @Scott_French wrote: > conf1009 is (1) a member of eqiad main-etcd cluster, so clients will attempt to issue writes to it, (2) the ups...
[23:01:46] <wikibugs>	 (CR) CI reject: [V:-1] elastic: reboot should check uptime not jvm start time [cookbooks] - https://gerrit.wikimedia.org/r/1207280 (https://phabricator.wikimedia.org/T410577) (owner: Ryan Kemper)
[23:06:17] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11390432 (RobH) Please note we didn't get to these two today, will do tomorrow!
[23:08:02] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: eqiad row C/D Search Platform host migrations - https://phabricator.wikimedia.org/T405948#11390447 (RobH) Open→Resolved Please note all hosts listed on this task have been migrated.
[23:23:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and Hurricane Electric (2001:7f8:54:5::13) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[23:48:16] <wikibugs>	 (PS1) Dzahn: switch wikipedia25.org from ncredir-lb to dyna [dns] - https://gerrit.wikimedia.org/r/1207288
[23:48:29] <wikibugs>	 (PS1) RLazarus: kubernetes: Set default Envoy version to 1.32.12 [puppet] - https://gerrit.wikimedia.org/r/1207289 (https://phabricator.wikimedia.org/T405808)
[23:57:34] <wikibugs>	 (PS4) RLazarus: mesh.configuration: Envoy config updates for 1.32 [deployment-charts] - https://gerrit.wikimedia.org/r/1202881 (https://phabricator.wikimedia.org/T409510)