[00:08:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[00:16:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:26:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:26:57] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1208 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 2 Failed : virtual_disk: 2 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T410238 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[00:27:06] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410238 (ops-monitoring-bot) NEW
[00:29:57] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-c4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T410234#11377456 (phaultfinder)
[00:38:03] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1206053
[00:38:03] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1206053 (owner: TrainBranchBot)
[00:51:59] <jinxer-wm>	 FIRING: ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:53:42] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1206053 (owner: TrainBranchBot)
[00:56:59] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:00:51] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:01:59] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:08:05] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1206054
[01:08:05] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1206054 (owner: TrainBranchBot)
[01:09:08] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:15:32] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 14m 41s)
[01:30:48] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1206054 (owner: TrainBranchBot)
[01:36:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:36:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[01:37:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:39:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[02:04:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[02:06:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[02:06:52] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11377477 (Jclark-ctr)
[02:06:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410214#11377484 (Jclark-ctr) →Duplicate dup:T409938
[02:06:55] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410238#11377483 (Jclark-ctr) →Duplicate dup:T409938
[02:06:57] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410212#11377485 (Jclark-ctr) →Duplicate dup:T409938
[02:06:59] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410210#11377486 (Jclark-ctr) →Duplicate dup:T409938
[02:07:00] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410209#11377487 (Jclark-ctr) →Duplicate dup:T409938
[02:07:22] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11377489 (Jclark-ctr)
[02:07:27] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410193#11377495 (Jclark-ctr) →Duplicate dup:T409938
[02:07:28] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410238#11377494 (Jclark-ctr) →Duplicate dup:T409938
[02:07:30] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410110#11377496 (Jclark-ctr) →Duplicate dup:T409938
[02:07:31] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410194#11377497 (Jclark-ctr) →Duplicate dup:T409938
[02:16:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:21:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[02:25:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[02:30:03] <wikibugs>	 (PS1) Scott French: Disable enrollment in PHP 8.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1204947 (https://phabricator.wikimedia.org/T405955)
[02:30:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[02:32:00] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[02:32:00] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[02:32:00] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[02:34:56] <wikibugs>	 ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11377506 (phaultfinder)
[02:35:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[02:39:51] <wikibugs>	 ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11377507 (phaultfinder)
[02:46:57] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1208 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 2 Failed : virtual_disk: 2 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T410239 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[02:47:05] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410239 (ops-monitoring-bot) NEW
[02:51:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:00:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[03:03:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[03:29:03] <wikibugs>	 (PS3) Superpes15: [arwikimedia] Change the logo/icon and add a wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1205421 (https://phabricator.wikimedia.org/T353218)
[03:33:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[03:36:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[03:46:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[03:51:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[04:02:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:06:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[04:07:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[04:17:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[04:18:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[04:23:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[04:32:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[04:37:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[04:39:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[04:44:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[04:46:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[05:08:23] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:08] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:33:23] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:37:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:01:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[06:04:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[06:27:33] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Add support for x4 [puppet] - https://gerrit.wikimedia.org/r/1205005 (https://phabricator.wikimedia.org/T404715) (owner: Marostegui)
[06:27:37] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1024.eqiad.wmnet with reason: Setting up
[06:32:00] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[06:32:00] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[06:32:00] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[06:39:53] <wikibugs>	 ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11377572 (phaultfinder)
[06:44:45] <wikibugs>	 (PS1) Marostegui: db1262: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1206062 (https://phabricator.wikimedia.org/T409374)
[06:44:55] <wikibugs>	 ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11377574 (phaultfinder)
[06:48:33] <wikibugs>	 (CR) Marostegui: [C:+2] db1262: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1206062 (https://phabricator.wikimedia.org/T409374) (owner: Marostegui)
[06:49:38] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1262 slowly with 10 steps - Repooling after replacing the DIMM
[06:49:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[06:50:02] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, and 2 others: db1262 is down - https://phabricator.wikimedia.org/T409374#11377577 (Marostegui) Host being repooled
[06:51:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:52:37] <logmsgbot>	 !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1262 slowly with 10 steps - Repooling after replacing the DIMM
[06:52:47] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1262 slowly with 10 steps - Repooling after replacing the DIMM
[06:53:01] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, and 2 others: db1262 is down - https://phabricator.wikimedia.org/T409374#11377578 (ops-monitoring-bot) Start pool of db1262 slowly with 10 steps - Repooling after replacing the DIMM - marostegui@cumin1003
[06:53:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[07:05:38] <kostajh>	 jouncebot: nowandnext
[07:05:38] <jouncebot>	 For the next 0 hour(s) and 54 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251116T0800)
[07:05:38] <jouncebot>	 In 0 hour(s) and 54 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251117T0800)
[07:07:48] <wikibugs>	 (PS1) Marostegui: installserver: Configure es2028 with db-trixie.cfg [puppet] - https://gerrit.wikimedia.org/r/1206067 (https://phabricator.wikimedia.org/T408777)
[07:08:33] <wikibugs>	 (PS2) Marostegui: installserver: Configure es2028 with db-trixie.cfg [puppet] - https://gerrit.wikimedia.org/r/1206067 (https://phabricator.wikimedia.org/T408777)
[07:11:09] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Configure es2028 with db-trixie.cfg [puppet] - https://gerrit.wikimedia.org/r/1206067 (https://phabricator.wikimedia.org/T408777) (owner: Marostegui)
[07:13:22] <icinga-wm>	 PROBLEM - Debian mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[07:18:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[07:22:33] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS bookworm
[07:23:25] <jinxer-wm>	 FIRING: MirrorHighLag: Mirrors - /srv/mirrors/debian synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[07:23:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[07:34:09] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add ability to limit requests from authenticated users [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1206068
[07:39:36] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] pontoon: clean puppet certs on host destroy [puppet] - https://gerrit.wikimedia.org/r/1204370 (https://phabricator.wikimedia.org/T409912) (owner: Filippo Giunchedi)
[07:39:45] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] pontoon: inject netbox metadata for stack hosts [puppet] - https://gerrit.wikimedia.org/r/1204361 (https://phabricator.wikimedia.org/T409905) (owner: Filippo Giunchedi)
[07:39:47] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] pontoon: introduce puppet::hosts function [puppet] - https://gerrit.wikimedia.org/r/1204360 (https://phabricator.wikimedia.org/T409905) (owner: Filippo Giunchedi)
[07:40:29] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
[07:44:26] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
[07:47:24] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1204890 (https://phabricator.wikimedia.org/T408734) (owner: DCausse)
[07:48:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and Init7 (2001:1620:1000::85) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:48:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/6 (Transit: Init7 (N/A) {#021469}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:52:17] <wikibugs>	 (CR) Giuseppe Lavagetto: [V:+2 C:+2] Add ability to limit requests from authenticated users [software/hiddenparma/deploy] - https://gerrit.wikimedia.org/r/1206068 (owner: Giuseppe Lavagetto)
[07:55:46] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:56:23] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:57:07] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[07:57:40] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251117T0800)
[08:00:06] <jouncebot>	 sfaci, kostajh, and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:22] <kostajh>	 good morning
[08:00:34] <logmsgbot>	 !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Rate-limit for authenticated users - oblivian@cumin1003"
[08:00:36] <logmsgbot>	 !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Rate-limit for authenticated users - oblivian@cumin1003
[08:01:16] <kostajh>	 I'll get started
[08:01:27] <logmsgbot>	 !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Rate-limit for authenticated users - oblivian@cumin1003
[08:01:28] <logmsgbot>	 !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Rate-limit for authenticated users - oblivian@cumin1003"
[08:01:46] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1205558 (https://phabricator.wikimedia.org/T410146) (owner: Kosta Harlan)
[08:01:46] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - https://gerrit.wikimedia.org/r/1205224 (https://phabricator.wikimedia.org/T410008) (owner: Kosta Harlan)
[08:03:03] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply
[08:03:12] <dcausse>	 o/
[08:03:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and Init7 (2001:1620:1000::85) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[08:03:51] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/6 (Transit: Init7 (N/A) {#021469}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:03:55] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply
[08:04:23] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply
[08:05:17] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply
[08:05:36] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply
[08:06:31] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply
[08:08:09] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2028.codfw.wmnet with OS bookworm
[08:09:56] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Record hcaptcha.execute() calls in VisualEditorFeatureUse [extensions/WikimediaEvents] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1205558 (https://phabricator.wikimedia.org/T410146) (owner: 10Kosta Harlan)
[08:09:57] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Track the interfaceName in open-callback events [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1205224 (https://phabricator.wikimedia.org/T410008) (owner: 10Kosta Harlan)
[08:11:15] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1205558|hCaptcha: Record hcaptcha.execute() calls in VisualEditorFeatureUse (T410146)]], [[gerrit:1205224|hCaptcha: Track the interfaceName in open-callback events (T410008 T402767)]]
[08:11:22] <stashbot>	 T410146: hCaptcha: Generate VisualEditorFeatureUse event when hCaptcha execute is invoked - https://phabricator.wikimedia.org/T410146
[08:11:22] <stashbot>	 T410008: hCaptcha: Update Grafana dashboard to include editing events - https://phabricator.wikimedia.org/T410008
[08:11:23] <stashbot>	 T402767: hCaptcha: Log hCaptcha error codes to Logstash and Prometheus - https://phabricator.wikimedia.org/T402767
[08:11:59] <kostajh>	 dcausse: do you want me to sync your patch? I have some other backports after these are done, but can sync yours ahead of them
[08:12:22] <dcausse>	 kostajh: o/ if you can that'd be great thanks! :)
[08:13:02] <kostajh>	 ok, I'll do that one next 
[08:14:54] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply
[08:15:36] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply
[08:18:07] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply
[08:18:36] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply
[08:20:15] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1205208 (https://phabricator.wikimedia.org/T409328) (owner: 10Majavah)
[08:23:30] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] Remove tilerator-admin group [puppet] - 10https://gerrit.wikimedia.org/r/1205089 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[08:30:50] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS bookworm
[08:32:28] <tappof>	 !log titan1002: switch /srv mount from /dev/md2 to /dev/vg0/srv T410152
[08:32:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:32] <stashbot>	 T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152
[08:36:09] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1205558|hCaptcha: Record hcaptcha.execute() calls in VisualEditorFeatureUse (T410146)]], [[gerrit:1205224|hCaptcha: Track the interfaceName in open-callback events (T410008 T402767)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[08:36:15] <stashbot>	 T410146: hCaptcha: Generate VisualEditorFeatureUse event when hCaptcha execute is invoked - https://phabricator.wikimedia.org/T410146
[08:36:16] <stashbot>	 T410008: hCaptcha: Update Grafana dashboard to include editing events - https://phabricator.wikimedia.org/T410008
[08:36:16] <stashbot>	 T402767: hCaptcha: Log hCaptcha error codes to Logstash and Prometheus - https://phabricator.wikimedia.org/T402767
[08:37:51] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] deployment_server: migrate mw-experimental to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1204945 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[08:38:07] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] "Thank you scott!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204947 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[08:40:13] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[08:41:04] <wikibugs>	 (03PS1) 10Slyngshede: Update Meta geo-mapping [dns] - 10https://gerrit.wikimedia.org/r/1206185 (https://phabricator.wikimedia.org/T409735)
[08:41:59] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:41:59] <wikibugs>	 (03PS2) 10Slyngshede: Update Meta geo-mapping [dns] - 10https://gerrit.wikimedia.org/r/1206185 (https://phabricator.wikimedia.org/T409735)
[08:42:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[08:43:55] <logmsgbot>	 !log ayounsi@cumin1003 START - Cookbook sre.network.debug for Netbox circuit ID 93
[08:44:08] <logmsgbot>	 !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 93
[08:46:59] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:48:02] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[08:48:24] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
[08:48:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[08:51:59] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:53:02] <jinxer-wm>	 FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[08:53:15] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1205558|hCaptcha: Record hcaptcha.execute() calls in VisualEditorFeatureUse (T410146)]], [[gerrit:1205224|hCaptcha: Track the interfaceName in open-callback events (T410008 T402767)]] (duration: 42m 00s)
[08:53:22] <stashbot>	 T410146: hCaptcha: Generate VisualEditorFeatureUse event when hCaptcha execute is invoked - https://phabricator.wikimedia.org/T410146
[08:53:22] <stashbot>	 T410008: hCaptcha: Update Grafana dashboard to include editing events - https://phabricator.wikimedia.org/T410008
[08:53:23] <stashbot>	 T402767: hCaptcha: Log hCaptcha error codes to Logstash and Prometheus - https://phabricator.wikimedia.org/T402767
[08:54:06] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
[08:54:08] <kostajh>	 dcausse: ok, I'll sync your config patch next 
[08:54:14] <dcausse>	 thx!
[08:54:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204890 (https://phabricator.wikimedia.org/T408734) (owner: 10DCausse)
[08:55:25] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus: enable wrong keyboard DWIM-style on hewiki and ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204890 (https://phabricator.wikimedia.org/T408734) (owner: 10DCausse)
[08:55:49] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1204890|cirrus: enable wrong keyboard DWIM-style on hewiki and ruwiki (T408734)]]
[08:55:53] <stashbot>	 T408734: Enable RU & HE DWIM-style Second Try mappings for autocomplete - https://phabricator.wikimedia.org/T408734
[08:56:59] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:00:23] <logmsgbot>	 !log kharlan@deploy2002 dcausse, kharlan: Backport for [[gerrit:1204890|cirrus: enable wrong keyboard DWIM-style on hewiki and ruwiki (T408734)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:00:37] <dcausse>	 testing ^
[09:01:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:01:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[09:02:30] <dcausse>	 kostajh: looks good from my side
[09:02:52] <kostajh>	 dcausse: ok
[09:02:55] <logmsgbot>	 !log kharlan@deploy2002 dcausse, kharlan: Continuing with sync
[09:03:02] <jinxer-wm>	 FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:06:59] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:08:50] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1262 slowly with 10 steps - Repooling after replacing the DIMM
[09:08:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops, 07Sustainability (Incident Followup): db1262 is down - https://phabricator.wikimedia.org/T409374#11377823 (10ops-monitoring-bot) Completed pool of db1262 slowly with 10 steps - Repooling after replacing the DIMM - marostegui@cumin1003
[09:09:08] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:09:45] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204890|cirrus: enable wrong keyboard DWIM-style on hewiki and ruwiki (T408734)]] (duration: 13m 56s)
[09:09:49] <stashbot>	 T408734: Enable RU & HE DWIM-style Second Try mappings for autocomplete - https://phabricator.wikimedia.org/T408734
[09:10:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205102 (https://phabricator.wikimedia.org/T410123) (owner: 10Kosta Harlan)
[09:10:58] <dcausse>	 kostajh: thanks for taking care of the deploy
[09:11:26] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Conditionally disable the addurl rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205102 (https://phabricator.wikimedia.org/T410123) (owner: 10Kosta Harlan)
[09:11:49] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1205102|hCaptcha: Conditionally disable the addurl rule (T410123)]]
[09:11:52] <kostajh>	 dcausse: no problem!
[09:11:53] <stashbot>	 T410123: hCaptcha: Conditionally disable the addurl rule - https://phabricator.wikimedia.org/T410123
[09:13:02] <jinxer-wm>	 FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:15:17] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] P:idp: Explicitely set internalProxies [puppet] - 10https://gerrit.wikimedia.org/r/1205208 (https://phabricator.wikimedia.org/T409328) (owner: 10Majavah)
[09:16:12] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2028.codfw.wmnet with OS bookworm
[09:16:18] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1205102|hCaptcha: Conditionally disable the addurl rule (T410123)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:17:31] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie
[09:18:02] <jinxer-wm>	 FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:18:34] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[09:21:15] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service idp1005:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:21:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:22:36] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1205102|hCaptcha: Conditionally disable the addurl rule (T410123)]] (duration: 10m 47s)
[09:22:40] <stashbot>	 T410123: hCaptcha: Conditionally disable the addurl rule - https://phabricator.wikimedia.org/T410123
[09:23:02] <jinxer-wm>	 FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:26:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205131 (https://phabricator.wikimedia.org/T405727) (owner: 10Santiago Faci)
[09:26:31] <wikibugs>	 (03CR) 10FNegri: [C:03+1] definitions: Add port for x4 on the wiki replicas [homer/public] - 10https://gerrit.wikimedia.org/r/1205090 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah)
[09:26:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:27:00] <wikibugs>	 (03Merged) 10jenkins-bot: Remove wgMetricsPlatformEnableExperimentOverrides config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205131 (https://phabricator.wikimedia.org/T405727) (owner: 10Santiago Faci)
[09:27:21] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1205131|Remove wgMetricsPlatformEnableExperimentOverrides config variable (T405727)]]
[09:27:25] <stashbot>	 T405727: Remove wgMetricsPlatformEnableExperimentOverrides config variable - https://phabricator.wikimedia.org/T405727
[09:27:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:31:44] <logmsgbot>	 !log kharlan@deploy2002 sfaci, kharlan: Backport for [[gerrit:1205131|Remove wgMetricsPlatformEnableExperimentOverrides config variable (T405727)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:32:42] <sfaci>	 Hi @kharlan! Thank you for taking care of the deployment.
[09:33:12] <sfaci>	 and I'm sorry because I was late
[09:33:43] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
[09:34:04] <logmsgbot>	 !log kharlan@deploy2002 sfaci, kharlan: Continuing with sync
[09:34:41] <kostajh>	 sfaci: no worries
[09:34:51] <wikibugs>	 (03CR) 10Jaime Nuche: [C:03+1] releases: stop/mask jenkins in eqiad, start/unmask jenkins in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1204982 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn)
[09:35:46] <duesen>	 Hi all! Is IDP login broken for anyone else? I'm hitting a redirect loop trying to log into Turnilo. More info here: https://phabricator.wikimedia.org/T410249
[09:36:10] <marostegui>	 duesen: yes, it is broken
[09:36:37] <duesen>	 marostegui: who can fix it?
[09:37:11] <marostegui>	 duesen: it is currently being investigated
[09:37:27] <duesen>	 ok thanks
[09:37:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:38:48] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
[09:38:50] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1205131|Remove wgMetricsPlatformEnableExperimentOverrides config variable (T405727)]] (duration: 11m 29s)
[09:38:54] <stashbot>	 T405727: Remove wgMetricsPlatformEnableExperimentOverrides config variable - https://phabricator.wikimedia.org/T405727
[09:40:51] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptcha::passCaptcha: Log the action, trigger and SiteKey [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1206191
[09:40:53] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test2001.codfw.wmnet
[09:41:05] <kostajh>	 continuing with backports 
[09:41:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1206191 (owner: 10Kosta Harlan)
[09:41:49] <wikibugs>	 (03PS1) 10Brouberol: Revert "P:idp: Explicitely set internalProxies" [puppet] - 10https://gerrit.wikimedia.org/r/1206192
[09:42:14] <wikibugs>	 (03CR) 10Btullis: [C:03+1] Revert "P:idp: Explicitely set internalProxies" [puppet] - 10https://gerrit.wikimedia.org/r/1206192 (owner: 10Brouberol)
[09:42:17] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] Revert "P:idp: Explicitely set internalProxies" [puppet] - 10https://gerrit.wikimedia.org/r/1206192 (owner: 10Brouberol)
[09:42:43] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[09:43:02] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:43:19] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, and 3 others: Hardware requirments for WDQS backend migration. - https://phabricator.wikimedia.org/T409769#11377967 (10Gehel)
[09:43:27] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, and 3 others: Hardware requirments for WDQS backend migration. - https://phabricator.wikimedia.org/T409769#11377968 (10Gehel) p:05Triage→03High
[09:43:34] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2230.codfw.wmnet onto db-test2001.codfw.wmnet
[09:43:48] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Revert "P:idp: Explicitely set internalProxies" [puppet] - 10https://gerrit.wikimedia.org/r/1206192 (owner: 10Brouberol)
[09:46:13] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Upgrade the superset-production-memcached image to Trixie [deployment-charts] - 10https://gerrit.wikimedia.org/r/1202101 (https://phabricator.wikimedia.org/T409151) (owner: 10Stevemunene)
[09:47:46] <marostegui>	 duesen: should be back now
[09:49:45] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[09:50:25] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply
[09:50:27] <wikibugs>	 (03PS1) 10Marostegui: installserver: Reformat es2028 [puppet] - 10https://gerrit.wikimedia.org/r/1206195 (https://phabricator.wikimedia.org/T408777)
[09:51:15] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service idp1005:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:52:22] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha::passCaptcha: Log the action, trigger and SiteKey [extensions/ConfirmEdit] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1206191 (owner: 10Kosta Harlan)
[09:52:41] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1206191|hCaptcha::passCaptcha: Log the action, trigger and SiteKey]]
[09:54:07] <wikibugs>	 (03PS1) 10Btullis: Remove the Data Platform SRE team from the contactgroup for wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/1206196 (https://phabricator.wikimedia.org/T382607)
[09:54:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1206195 (https://phabricator.wikimedia.org/T408777) (owner: 10Marostegui)
[09:56:03] <wikibugs>	 (03PS2) 10Btullis: Remove the Data Platform SRE team from the contactgroup for wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/1206196 (https://phabricator.wikimedia.org/T382607)
[09:56:54] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1206191|hCaptcha::passCaptcha: Log the action, trigger and SiteKey]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[09:57:22] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Reformat es2028 [puppet] - 10https://gerrit.wikimedia.org/r/1206195 (https://phabricator.wikimedia.org/T408777) (owner: 10Marostegui)
[09:59:54] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, and 3 others: Hardware requirements for WDQS backend migration. - https://phabricator.wikimedia.org/T409769#11378203 (10Gehel)
[10:00:12] <wikibugs>	 (03PS1) 10Muehlenhoff: Record LDAP access for hfanwmf [puppet] - 10https://gerrit.wikimedia.org/r/1206198
[10:00:48] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2028.codfw.wmnet with OS trixie
[10:01:11] <wikibugs>	 (03PS3) 10Btullis: Remove the Data Platform SRE team from the contactgroup for wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/1206196 (https://phabricator.wikimedia.org/T382607)
[10:02:22] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7630/co" [puppet] - 10https://gerrit.wikimedia.org/r/1206196 (https://phabricator.wikimedia.org/T382607) (owner: 10Btullis)
[10:03:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for hfanwmf [puppet] - 10https://gerrit.wikimedia.org/r/1206198 (owner: 10Muehlenhoff)
[10:04:14] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS bookworm
[10:04:44] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[10:09:39] <wikibugs>	 (03PS1) 10Jcrespo: garage: Productionize garage [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020)
[10:09:54] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206191|hCaptcha::passCaptcha: Log the action, trigger and SiteKey]] (duration: 17m 13s)
[10:12:20] <wikibugs>	 (03CR) 10Jcrespo: "Moritz: Ok for me to debian package this under the garage name for wmf repo?" [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo)
[10:15:04] <wikibugs>	 (03CR) 10Muehlenhoff: "Sounds good!" [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo)
[10:17:41] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] "+1, stacking commits here would be more readable." [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/1204891 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli)
[10:18:17] <wikibugs>	 (03CR) 10Jcrespo: "Matthew, I did a first debian packaging of garage, what would it be your preferred way for me to share it so you could have a look, if you" [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo)
[10:18:22] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.decommission for hosts db-test2002.codfw.wmnet
[10:19:57] <wikibugs>	 (03PS3) 10Muehlenhoff: sre.ganeti.reboot-vm: Use skip_acked=True [cookbooks] - 10https://gerrit.wikimedia.org/r/1203483 (https://phabricator.wikimedia.org/T330136)
[10:20:01] <logmsgbot>	 !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2028.codfw.wmnet with OS bookworm
[10:23:19] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.dns.netbox
[10:24:14] <wikibugs>	 (03PS1) 10Klausman: admin/data.yaml: Add FIDO SSH keys for klausman [puppet] - 10https://gerrit.wikimedia.org/r/1206201
[10:24:24] <wikibugs>	 (03CR) 10FNegri: [C:03+1] "Thanks, LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1206196 (https://phabricator.wikimedia.org/T382607) (owner: 10Btullis)
[10:25:12] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] definitions: Add port for x4 on the wiki replicas [homer/public] - 10https://gerrit.wikimedia.org/r/1205090 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah)
[10:29:00] <logmsgbot>	 fceratto@cumin1003 decommission (PID 986094) is awaiting input
[10:31:05] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply
[10:31:34] <wikibugs>	 (03PS1) 10Marostegui: installserver: Move es2028 [puppet] - 10https://gerrit.wikimedia.org/r/1206203
[10:31:36] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply
[10:32:00] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[10:32:00] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[10:32:00] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[10:33:34] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Move es2028 [puppet] - 10https://gerrit.wikimedia.org/r/1206203 (owner: 10Marostegui)
[10:33:49] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply
[10:34:03] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply
[10:35:01] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db-test2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003"
[10:35:52] <wikibugs>	 (03PS1) 10Hnowlan: rest-gateway: set 53s timeout for action API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206205 (https://phabricator.wikimedia.org/T408223)
[10:36:37] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db-test2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003"
[10:36:38] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[10:36:38] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db-test2002.codfw.wmnet
[10:37:40] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.ganeti.makevm for new host db-test2002.codfw.wmnet
[10:37:41] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.dns.netbox
[10:41:25] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[10:41:58] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS bookworm
[10:42:15] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply
[10:42:22] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, and 3 others: Hardware requirements for WDQS backend migration. - https://phabricator.wikimedia.org/T409769#11378338 (10gmodena) >>! In T409769#11375265, @bking wrote: > However, would everyone be OK with not ordering hardware specifically for WDQS until we have m...
[10:42:55] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[10:43:19] <logmsgbot>	 fceratto@cumin1003 makevm (PID 1006226) is awaiting input
[10:43:33] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[10:44:56] <wikibugs>	 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11378339 (10phaultfinder)
[10:48:43] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Remove the Data Platform SRE team from the contactgroup for wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/1206196 (https://phabricator.wikimedia.org/T382607) (owner: 10Btullis)
[10:48:50] <wikibugs>	 (03PS1) 10Marostegui: Revert "installserver: Move es2028" [puppet] - 10https://gerrit.wikimedia.org/r/1206221
[10:49:57] <wikibugs>	 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11378366 (10phaultfinder)
[10:51:04] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "installserver: Move es2028" [puppet] - 10https://gerrit.wikimedia.org/r/1206221 (owner: 10Marostegui)
[10:51:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:52:43] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] rest-gateway: set 53s timeout for action API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206205 (https://phabricator.wikimedia.org/T408223) (owner: 10Hnowlan)
[10:52:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[10:53:28] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Remove the Data Platform SRE team from the contactgroup for wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/1206196 (https://phabricator.wikimedia.org/T382607) (owner: 10Btullis)
[10:54:09] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] "nit: I would suggest to add T410007 too in the commit message" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206205 (https://phabricator.wikimedia.org/T408223) (owner: 10Hnowlan)
[10:54:27] <wikibugs>	 (03PS1) 10Marostegui: installserver: Add db-trixie.cfg to es2028 [puppet] - 10https://gerrit.wikimedia.org/r/1206319 (https://phabricator.wikimedia.org/T408777)
[10:58:46] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
[10:59:13] <wikibugs>	 (03PS2) 10Hnowlan: rest-gateway: set 53s timeout for action API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206205 (https://phabricator.wikimedia.org/T408223)
[10:59:50] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Add db-trixie.cfg to es2028 [puppet] - 10https://gerrit.wikimedia.org/r/1206319 (https://phabricator.wikimedia.org/T408777) (owner: 10Marostegui)
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251117T1100)
[11:01:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:03:27] <wikibugs>	 (03PS1) 10Kevin Bazira: ml-services: update revertrisk-wikidata image in both experimental and revision-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206344 (https://phabricator.wikimedia.org/T406179)
[11:03:30] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
[11:04:36] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test2002.codfw.wmnet - fceratto@cumin1003"
[11:07:40] <logmsgbot>	 fceratto@cumin1003 makevm (PID 1006226) is awaiting input
[11:07:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:09:22] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+1] "Thnx for deploying." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206344 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira)
[11:09:45] <wikibugs>	 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11378426 (10TheDJ) >>! In T408715#11332556, @AntiCompositeNumber wrote: > It's historically been easy for applications to generate their...
[11:10:27] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test2002.codfw.wmnet - fceratto@cumin1003"
[11:10:28] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:10:28] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.dns.wipe-cache db-test2002.codfw.wmnet on all recursors
[11:10:31] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db-test2002.codfw.wmnet on all recursors
[11:10:38] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] "Done, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206205 (https://phabricator.wikimedia.org/T408223) (owner: 10Hnowlan)
[11:11:03] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test2002.codfw.wmnet - fceratto@cumin1003"
[11:11:07] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test2002.codfw.wmnet - fceratto@cumin1003"
[11:12:46] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: set 53s timeout for action API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206205 (https://phabricator.wikimedia.org/T408223) (owner: 10Hnowlan)
[11:14:08] <logmsgbot>	 fceratto@cumin1003 makevm (PID 1006226) is awaiting input
[11:16:20] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[11:16:34] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[11:18:01] <wikibugs>	 07sre-alert-triage, 06serviceops: Alert in need of triage: KubernetesWorkerUnschedulable - https://phabricator.wikimedia.org/T400969#11378450 (10LSobanski) And one more time :)
[11:18:39] <wikibugs>	 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062#11378451 (10TheDJ)
[11:19:49] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[11:20:06] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[11:20:20] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.reimage for host db-test2002.codfw.wmnet with OS trixie
[11:22:35] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] definitions: Add port for x4 on the wiki replicas [homer/public] - 10https://gerrit.wikimedia.org/r/1205090 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah)
[11:23:25] <jinxer-wm>	 FIRING: MirrorHighLag: Mirrors - /srv/mirrors/debian synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[11:27:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete grants file [puppet] - 10https://gerrit.wikimedia.org/r/1204916 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[11:27:38] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2028.codfw.wmnet with OS bookworm
[11:28:07] <wikibugs>	 (03PS1) 10AikoChou: ml-services: update llm image and re-enable gpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206348
[11:29:06] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, and 3 others: Hardware requirements for WDQS backend migration. - https://phabricator.wikimedia.org/T409769#11378472 (10gmodena) >>! In T409769#11373497, @Gehel wrote: > Do you already have an idea of the size of SSD that you need? Looking at current servers, I su...
[11:29:52] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T410234#11378473 (10phaultfinder)
[11:31:04] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[11:31:46] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[11:32:57] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS bookworm
[11:33:07] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db-test2002.codfw.wmnet with reason: host reimage
[11:33:10] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+1] ml-services: update llm image and re-enable gpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206348 (owner: 10AikoChou)
[11:35:54] <wikibugs>	 (03CR) 10AikoChou: [C:03+2] ml-services: update llm image and re-enable gpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206348 (owner: 10AikoChou)
[11:38:00] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update llm image and re-enable gpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206348 (owner: 10AikoChou)
[11:38:11] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[11:38:15] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db-test2002.codfw.wmnet with reason: host reimage
[11:39:10] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Bugfix release [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1206349
[11:39:22] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Bugfix release [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1206349 (owner: 10Giuseppe Lavagetto)
[11:39:50] <logmsgbot>	 !log oblivian@cumin1003 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Bugfix - oblivian@cumin1003"
[11:39:52] <logmsgbot>	 !log oblivian@cumin1003 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfix - oblivian@cumin1003
[11:40:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good, see inline comment on how to simplify the user/group setup." [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo)
[11:40:38] <logmsgbot>	 !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Bugfix - oblivian@cumin1003
[11:40:40] <logmsgbot>	 !log oblivian@cumin1003 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Bugfix - oblivian@cumin1003"
[11:41:22] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[11:43:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good, keys have also been verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1206201 (owner: 10Klausman)
[11:50:21] <wikibugs>	 06SRE, 10SRE-Access-Requests: Update mfischerwmf ssh key - https://phabricator.wikimedia.org/T410270 (10MFischer) 03NEW
[11:50:27] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 07Security: Investigate iptables replacements - https://phabricator.wikimedia.org/T279683#11378546 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff We can close this task, since a replacement has been defined for about two years now: We're gradually...
[11:50:34] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
[11:50:39] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[11:53:03] <wikibugs>	 (03PS2) 10Muehlenhoff: test_import: Drop workaround for python-elasticsearch [cookbooks] - 10https://gerrit.wikimedia.org/r/1203491 (https://phabricator.wikimedia.org/T390860)
[11:53:12] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11378556 (10ayounsi) > Once the router change is done, therefore, we need to somehow adjust the netmask on all the existing hosts on the v...
[11:54:15] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
[11:57:26] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db-test2002.codfw.wmnet with OS trixie
[11:57:26] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host db-test2002.codfw.wmnet
[12:00:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] test_import: Drop workaround for python-elasticsearch [cookbooks] - 10https://gerrit.wikimedia.org/r/1203491 (https://phabricator.wikimedia.org/T390860) (owner: 10Muehlenhoff)
[12:00:40] <wikibugs>	 (03CR) 10Jcrespo: garage: Productionize garage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo)
[12:01:18] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11378569 (10ayounsi) a:03Papaul @papaul is that something you could look into ? Is there a way to disable the NIC's LLDP through the BIOS menu ? Maybe some solution from the la...
[12:02:18] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2090.codfw.wmnet with OS bullseye
[12:02:25] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11378576 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2090.codfw.wmnet with OS bullseye
[12:02:59] <wikibugs>	 (03CR) 10Majavah: [C:03+2] definitions: Add port for x4 on the wiki replicas [homer/public] - 10https://gerrit.wikimedia.org/r/1205090 (https://phabricator.wikimedia.org/T409560) (owner: 10Majavah)
[12:09:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.5s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:12:37] <wikibugs>	 (03PS1) 10Marostegui: installserver: Do not format /srv on es2028 [puppet] - 10https://gerrit.wikimedia.org/r/1206359 (https://phabricator.wikimedia.org/T408777)
[12:12:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1206359 (https://phabricator.wikimedia.org/T408777) (owner: 10Marostegui)
[12:13:42] <taavi>	 !log update CR firewall policy to add the x4 port to wiki replicas related rules T409560
[12:13:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:46] <stashbot>	 T409560: Add support for x1 and x4 sections on wiki replicas on the load balancer layer - https://phabricator.wikimedia.org/T409560
[12:14:36] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] installserver: Do not format /srv on es2028 [puppet] - 10https://gerrit.wikimedia.org/r/1206359 (https://phabricator.wikimedia.org/T408777) (owner: 10Marostegui)
[12:14:57] <logmsgbot>	 !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
[12:17:23] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2028.codfw.wmnet with OS bookworm
[12:18:40] <wikibugs>	 (03PS2) 10Jcrespo: garage: Productionize garage [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020)
[12:19:46] <wikibugs>	 (03PS1) 10Muehlenhoff: partman: Apply workarounds to swap handling affecting trixie installations [puppet] - 10https://gerrit.wikimedia.org/r/1206362 (https://phabricator.wikimedia.org/T408777)
[12:20:00] <wikibugs>	 (03CR) 10Jcrespo: "Question" [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo)
[12:21:18] <wikibugs>	 (03CR) 10Muehlenhoff: garage: Productionize garage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo)
[12:22:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:23:00] <wikibugs>	 (03PS1) 10Dpogorzelski: ml-services: fix resource request [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206363
[12:24:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.358s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:24:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove tilerator-admin group [puppet] - 10https://gerrit.wikimedia.org/r/1205089 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[12:25:00] <wikibugs>	 (03CR) 10Jcrespo: "this ok?" [puppet] - 10https://gerrit.wikimedia.org/r/1206199 (https://phabricator.wikimedia.org/T410020) (owner: 10Jcrespo)
[12:25:19] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2090.codfw.wmnet with reason: host reimage
[12:27:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:29:19] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test2002.codfw.wmnet
[12:29:26] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2090.codfw.wmnet with reason: host reimage
[12:32:41] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2230.codfw.wmnet onto db-test2002.codfw.wmnet
[12:32:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:34:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:36:30] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] "Looks good - but waiting for the reimage to finish correctly before merging." [puppet] - 10https://gerrit.wikimedia.org/r/1206362 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff)
[12:36:46] <jinxer-wm>	 FIRING: Traffic bill over quota: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[12:38:39] <wikibugs>	 (03PS5) 10Muehlenhoff: Switch maps-test2001 to maps::staging [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565)
[12:39:50] <wikibugs>	 (03PS12) 10Federico Ceratto: Add switchover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810)
[12:39:55] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS bookworm
[12:42:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.909s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:43:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11378727 (10BTullis) It's interesting. We've actually got ext4 errors from 4 drives on this host. ` btullis@an-worker1208:~$ sudo dmesg -T|grep "EXT4-fs e...
[12:47:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.475s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:48:11] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[12:48:25] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2090.codfw.wmnet with OS bullseye
[12:48:36] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11378733 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2090.codfw.wmnet with OS bullseye complete...
[12:51:42] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] "As discussed on IRC, LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 (owner: 10Ladsgroup)
[12:52:48] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[12:53:15] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[12:53:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch maps-test2001 to maps::staging [puppet] - 10https://gerrit.wikimedia.org/r/1201690 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[12:54:49] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11378749 (10ayounsi) > I personally prefer to use the first (ok second) address in each v6 subnet as the gateway, i.e. 2a02:ec80:400:1::1/64 Sounds good to me....
[12:54:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[12:56:46] <jinxer-wm>	 RESOLVED: Traffic bill over quota: Alert for device cr1-drmrs.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[12:57:33] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
[12:58:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.852s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:58:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:01:01] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Transport link saturation not alerting - https://phabricator.wikimedia.org/T409330#11378763 (10ayounsi) a:03ayounsi My bad ! I turned them off after adding the transit/peering saturation alerts. Forgetting transport and core links.... I'll take ca...
[13:03:00] <wikibugs>	 (03CR) 10Klausman: [C:03+2] admin/data.yaml: Add FIDO SSH keys for klausman [puppet] - 10https://gerrit.wikimedia.org/r/1206201 (owner: 10Klausman)
[13:03:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.818s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:04:05] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
[13:06:59] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:09:08] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:10:18] <wikibugs>	 (03PS3) 10Muehlenhoff: test_import: Drop workaround for python-elasticsearch [cookbooks] - 10https://gerrit.wikimedia.org/r/1203491 (https://phabricator.wikimedia.org/T390860)
[13:10:25] <wikibugs>	 (03CR) 10Dpogorzelski: [C:03+2] ml-services: fix resource request [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206363 (owner: 10Dpogorzelski)
[13:11:15] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2091.codfw.wmnet with OS bullseye
[13:11:28] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11378783 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2091.codfw.wmnet with OS bullseye
[13:13:10] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: fix resource request [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206363 (owner: 10Dpogorzelski)
[13:15:40] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11378805 (10MoritzMuehlenhoff)
[13:21:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:23:05] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10observability: titan200[12] RAM/SSD upgrade coordination - https://phabricator.wikimedia.org/T361229#11378833 (10tappof)
[13:25:07] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2028.codfw.wmnet with OS bookworm
[13:26:51] <moritzm>	 !log update trixie installer image to 13.2  T410147
[13:26:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:55] <stashbot>	 T410147: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147
[13:33:41] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2091.codfw.wmnet with reason: host reimage
[13:33:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:34:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[13:37:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:39:00] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2091.codfw.wmnet with reason: host reimage
[13:40:33] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 17 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1205115 (https://phabricator.wikimedia.org/T405586) (owner: Kosta Harlan)
[13:41:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:44:57] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-c4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T410234#11378892 (phaultfinder)
[13:52:02] <Amir1>	 !log cumin2024@db2205.codfw.wmnet[(none)]> drop database if exists dkwiki; drop database if exists dkwikibooks; drop database if exists dkwiktionary; (T297297)
[13:52:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:06] <stashbot>	 T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297
[13:53:06] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for AnkitaM - https://phabricator.wikimedia.org/T409893#11378918 (Volans)
[13:53:10] <tappof>	 !log titan2002: switch /srv mount from /dev/md2 to /dev/vg0/srv T410152
[13:53:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:14] <stashbot>	 T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152
[13:56:52] <wikibugs>	 (PS2) Kosta Harlan: hCaptcha: Update passive mode config for addurl trigger [mediawiki-config] - https://gerrit.wikimedia.org/r/1204830 (https://phabricator.wikimedia.org/T409957)
[13:56:53] <wikibugs>	 (PS2) DCausse: cirrus: index field to sort on title [mediawiki-config] - https://gerrit.wikimedia.org/r/1205130 (https://phabricator.wikimedia.org/T40403)
[13:58:47] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2091.codfw.wmnet with OS bullseye
[13:59:00] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11378947 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2091.codfw.wmnet with OS bullseye complete...
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251117T1400).
[14:00:05] <jouncebot>	 tgr, James_F, edsanders, and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:10] <edsanders>	 o/
[14:00:13] <Lucas_WMDE>	 I can’t deploy today, sorry
[14:00:27] <edsanders>	 I can self deploy 
[14:00:38] <James_F>	 o/
[14:00:55] <James_F>	 I’ll do mine afterwards.
[14:01:24] <kostajh>	 hi
[14:01:32] <kostajh>	 I can deploy mine as well
[14:02:33] <kostajh>	 edsanders: do you want to start? Then James_F, me, and tgr|away 
[14:02:37] <edsanders>	 sure
[14:02:47] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1205096 (https://phabricator.wikimedia.org/T402549) (owner: Esanders)
[14:03:08] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11378960 (Jclark-ctr) @BTullis  Originally, this server had three failed drives. One was listed as foreign (3F4EE0807AD4630F) in slot 7. I imported the f...
[14:03:25] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2092.codfw.wmnet with OS bullseye
[14:03:31] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11378961 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2092.codfw.wmnet with OS bullseye
[14:03:45] <wikibugs>	 (Merged) jenkins-bot: Make LQT opt-in on ptwikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1205096 (https://phabricator.wikimedia.org/T402549) (owner: Esanders)
[14:04:08] <logmsgbot>	 !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1205096|Make LQT opt-in on ptwikibooks (T402549)]]
[14:04:12] <stashbot>	 T402549: ptwikibooks: Convert LQT pages to Flow - https://phabricator.wikimedia.org/T402549
[14:04:31] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11378965 (Jclark-ctr)
[14:04:33] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T410239#11378967 (Jclark-ctr) →Duplicate dup:T409938
[14:04:53] <wikibugs>	 (CR) Gehel: [C:+2] airflow: assume the PYTHONPATH env var is defined in the airflow image [deployment-charts] - https://gerrit.wikimedia.org/r/1203451 (https://phabricator.wikimedia.org/T408711) (owner: Brouberol)
[14:06:47] <icinga-wm>	 RECOVERY - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1208 is OK: communication: 0 OK : controller: 0 OK : physical_disk: 0 OK : virtual_disk: 0 OK : bbu: 0 OK : enclosure: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[14:08:44] <logmsgbot>	 !log esanders@deploy2002 esanders: Backport for [[gerrit:1205096|Make LQT opt-in on ptwikibooks (T402549)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:09:43] <logmsgbot>	 !log esanders@deploy2002 esanders: Continuing with sync
[14:10:00] <moritzm>	 !log installing glibc security updates
[14:10:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:06] <James_F>	 !log jforrester@deploy2002:~$ mwscript sql --wiki=wikimaniawiki /srv/mediawiki/php-1.46.0-wmf.2/extensions/WikiLambda/sql/mysql/table-usage.sql # T401683
[14:11:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:10] <stashbot>	 T401683: Wikimania in-person request: Enable Wikifunctions client mode on Wikimania wiki - https://phabricator.wikimedia.org/T401683
[14:12:10] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1205192 (https://phabricator.wikimedia.org/T409893) (owner: Dzahn)
[14:14:58] <wikibugs>	 ops-codfw, SRE, DC-Ops: Alert for device ps1-c4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T410234#11379012 (phaultfinder)
[14:15:01] <logmsgbot>	 !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1205096|Make LQT opt-in on ptwikibooks (T402549)]] (duration: 10m 53s)
[14:15:05] <stashbot>	 T402549: ptwikibooks: Convert LQT pages to Flow - https://phabricator.wikimedia.org/T402549
[14:16:55] <wikibugs>	 (PS2) Jforrester: Enable embedded Wikifunctions on Wikimania wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1204979 (https://phabricator.wikimedia.org/T401683)
[14:21:54] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for AnkitaM - https://phabricator.wikimedia.org/T409893#11379026 (Volans) Adding #data-engineering for visibility and @Milimetric, @Ahoelzl for approval (either of them) from the data en...
[14:23:58] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1204979 (https://phabricator.wikimedia.org/T401683) (owner: Jforrester)
[14:25:13] <wikibugs>	 (Merged) jenkins-bot: Enable embedded Wikifunctions on Wikimania wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1204979 (https://phabricator.wikimedia.org/T401683) (owner: Jforrester)
[14:25:31] <logmsgbot>	 !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1204979|Enable embedded Wikifunctions on Wikimania wiki (T401683)]]
[14:25:35] <stashbot>	 T401683: Wikimania in-person request: Enable Wikifunctions client mode on Wikimania wiki - https://phabricator.wikimedia.org/T401683
[14:25:47] <Amir1>	 !log mwscript-k8s --dblist=large -- purgeUserOptions.php --login-age 15 (T406724)
[14:25:51] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2092.codfw.wmnet with reason: host reimage
[14:25:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:52] <stashbot>	 T406724: Clean up watchlist and user properties of users if they don't log in for certain time - https://phabricator.wikimedia.org/T406724
[14:26:46] <wikibugs>	 (CR) Marostegui: [C:+2] partman: Apply workarounds to swap handling affecting trixie installations [puppet] - https://gerrit.wikimedia.org/r/1206362 (https://phabricator.wikimedia.org/T408777) (owner: Muehlenhoff)
[14:29:18] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2092.codfw.wmnet with reason: host reimage
[14:29:22] <wikibugs>	 (PS1) Gehel: airflow: update base image [deployment-charts] - https://gerrit.wikimedia.org/r/1206380 (https://phabricator.wikimedia.org/T408711)
[14:29:38] <wikibugs>	 (PS1) Marostegui: installserver: Clean up es2028 [puppet] - https://gerrit.wikimedia.org/r/1206381 (https://phabricator.wikimedia.org/T408777)
[14:29:48] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1204979|Enable embedded Wikifunctions on Wikimania wiki (T401683)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:30:05] <wikibugs>	 (PS1) Majavah: P:tlsproxy::envoy: Allow customizing upstrem address per-service [puppet] - https://gerrit.wikimedia.org/r/1206382 (https://phabricator.wikimedia.org/T409328)
[14:30:07] <wikibugs>	 (PS1) Majavah: hieradata: cloudweb2002-dev: Use localhost to reach CAS from Envoy [puppet] - https://gerrit.wikimedia.org/r/1206383 (https://phabricator.wikimedia.org/T409328)
[14:30:07] <wikibugs>	 (CR) Brouberol: [C:+1] airflow: update base image [deployment-charts] - https://gerrit.wikimedia.org/r/1206380 (https://phabricator.wikimedia.org/T408711) (owner: Gehel)
[14:31:22] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Continuing with sync
[14:32:00] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[14:32:00] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[14:32:00] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[14:32:13] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7632/co" [puppet] - https://gerrit.wikimedia.org/r/1206383 (https://phabricator.wikimedia.org/T409328) (owner: Majavah)
[14:32:33] <wikibugs>	 (PS1) Volans: admin: update ssh key for mfischerwmf [puppet] - https://gerrit.wikimedia.org/r/1206385 (https://phabricator.wikimedia.org/T410270)
[14:33:02] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Update mfischerwmf ssh key - https://phabricator.wikimedia.org/T410270#11379082 (Volans) p:Triage→Medium Confirmed the ssh key out of band.
[14:33:09] <wikibugs>	 (PS2) Majavah: hieradata: cloudweb2002-dev: Use localhost to reach CAS from Envoy [puppet] - https://gerrit.wikimedia.org/r/1206383 (https://phabricator.wikimedia.org/T409328)
[14:33:58] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7633/co" [puppet] - https://gerrit.wikimedia.org/r/1206383 (https://phabricator.wikimedia.org/T409328) (owner: Majavah)
[14:34:14] <wikibugs>	 (CR) Gehel: [C:+2] airflow: update base image [deployment-charts] - https://gerrit.wikimedia.org/r/1206380 (https://phabricator.wikimedia.org/T408711) (owner: Gehel)
[14:34:26] <James_F>	 kostajh: Over to you once this finishes.
[14:34:35] <kostajh>	 James_F: thanks
[14:34:35] <wikibugs>	 ops-codfw, ops-eqiad, SRE, DC-Ops, and 3 others: Hardware requirements for WDQS backend migration. - https://phabricator.wikimedia.org/T409769#11379086 (bking) a:bking
[14:35:25] <logmsgbot>	 !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204979|Enable embedded Wikifunctions on Wikimania wiki (T401683)]] (duration: 09m 55s)
[14:35:30] <stashbot>	 T401683: Wikimania in-person request: Enable Wikifunctions client mode on Wikimania wiki - https://phabricator.wikimedia.org/T401683
[14:36:12] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1205115 (https://phabricator.wikimedia.org/T405586) (owner: Kosta Harlan)
[14:36:29] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1208.eqiad.wmnet
[14:37:01] <wikibugs>	 (Merged) jenkins-bot: hCaptcha: Enable hCaptcha editing for idwiki, jawiki, and ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1205115 (https://phabricator.wikimedia.org/T405586) (owner: Kosta Harlan)
[14:37:19] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11379093 (Papaul) @ayounsi yes I can look into it. Thanks.
[14:37:20] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1205115|hCaptcha: Enable hCaptcha editing for idwiki, jawiki, and ptwiki (T405586)]]
[14:37:24] <stashbot>	 T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586
[14:37:57] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy4002.wikimedia.org
[14:37:58] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy4002.wikimedia.org
[14:38:11] <wikibugs>	 (CR) Andrew Bogott: toolforge haproxy config: replace httpchk with http-check send (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1203175 (owner: Andrew Bogott)
[14:38:18] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy3002.wikimedia.org
[14:38:20] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[14:39:15] <logmsgbot>	 !log gehel@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:40:41] <wikibugs>	 (CR) Btullis: [C:+2] LVS: etcd data for druid-public-coordinator (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: Stevemunene)
[14:41:41] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[14:41:48] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1205115|hCaptcha: Enable hCaptcha editing for idwiki, jawiki, and ptwiki (T405586)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:43:12] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[14:43:12] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:43:12] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy3002.wikimedia.org on all recursors
[14:43:16] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy3002.wikimedia.org on all recursors
[14:43:45] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[14:43:49] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[14:44:16] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy3002.wikimedia.org with OS trixie
[14:44:28] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11379107 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10...
[14:45:56] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Traffic, Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11379108 (ssingh) >>! In T410047#11374122, @cmooney wrote: > @ssingh I made a patch and can kick off the changes in Netbox and on the ro...
[14:47:32] <wikibugs>	 (PS2) Ssingh: P:cache::base allow geoip to be disabled [puppet] - https://gerrit.wikimedia.org/r/1202986 (owner: Slyngshede)
[14:47:43] <wikibugs>	 (PS3) Ssingh: P:cache::base allow geoip to be disabled [puppet] - https://gerrit.wikimedia.org/r/1202986 (owner: Slyngshede)
[14:48:06] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[14:48:23] <wikibugs>	 (CR) Ssingh: "Good idea, might as well do it now. Updated." [puppet] - https://gerrit.wikimedia.org/r/1202986 (owner: Slyngshede)
[14:48:33] <logmsgbot>	 !log gehel@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[14:49:00] <wikibugs>	 (PS2) Majavah: P:tlsproxy::envoy: Allow customizing upstrem address per-service [puppet] - https://gerrit.wikimedia.org/r/1206382 (https://phabricator.wikimedia.org/T409328)
[14:49:00] <wikibugs>	 (PS3) Majavah: hieradata: cloudweb2002-dev: Use localhost to reach CAS from Envoy [puppet] - https://gerrit.wikimedia.org/r/1206383 (https://phabricator.wikimedia.org/T409328)
[14:49:25] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Traffic, Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11379134 (ayounsi) You can use 198.35.26.5/28. It's marked as reserved for infra, but we don't need it (and we will even less need it af...
[14:49:49] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7634/co" [puppet] - https://gerrit.wikimedia.org/r/1206383 (https://phabricator.wikimedia.org/T409328) (owner: Majavah)
[14:49:55] <wikibugs>	 ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11379135 (phaultfinder)
[14:50:31] <wikibugs>	 (PS2) Daniel Kinzler: rest-gateway: allow rate limits per time unit [deployment-charts] - https://gerrit.wikimedia.org/r/1205191 (https://phabricator.wikimedia.org/T408132)
[14:50:46] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2092.codfw.wmnet with OS bullseye
[14:50:49] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Traffic, Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11379141 (cmooney) >>! In T410047#11379108, @ssingh wrote: > My plan for now to unblock the hCaptcha work was to decommission one of the...
[14:50:54] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11379144 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2092.codfw.wmnet with OS bullseye complete...
[14:51:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:52:31] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Traffic, Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11379156 (ssingh) >>! In T410047#11379134, @ayounsi wrote: > You can use 198.35.26.5/28. It's marked as reserved for infra, but we don't...
[14:53:09] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Traffic, Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11379157 (ssingh) >>! In T410047#11379141, @cmooney wrote: >>>! In T410047#11379108, @ssingh wrote: >> My plan for now to unblock the hC...
[14:53:23] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1205115|hCaptcha: Enable hCaptcha editing for idwiki, jawiki, and ptwiki (T405586)]] (duration: 16m 02s)
[14:53:28] <stashbot>	 T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586
[14:54:14] <kostajh>	 tgr|away: over to you 
[14:54:51] <wikibugs>	 ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11379159 (phaultfinder)
[14:56:31] <icinga-wm>	 PROBLEM - Host an-worker1208 is DOWN: PING CRITICAL - Packet loss = 100%
[14:56:36] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Traffic, Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11379165 (cmooney) >>! In T410047#11379157, @ssingh wrote: > Yeah, good point about the LVS IPs since we no longer need them given Liber...
[14:58:23] <icinga-wm>	 RECOVERY - Host an-worker1208 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[15:00:19] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[15:00:35] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy4002.wikimedia.org
[15:00:37] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[15:01:49] <wikibugs>	 (PS5) Giuseppe Lavagetto: cache::text: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555)
[15:01:50] <wikibugs>	 (PS2) Giuseppe Lavagetto: cache::upload: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555)
[15:01:50] <wikibugs>	 (PS1) Giuseppe Lavagetto: Fixes [puppet] - https://gerrit.wikimedia.org/r/1206393
[15:02:18] <wikibugs>	 (PS6) Giuseppe Lavagetto: cache::text: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555)
[15:02:18] <wikibugs>	 (PS3) Giuseppe Lavagetto: cache::upload: introduce rate-limits by traffic class [puppet] - https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555)
[15:02:31] <wikibugs>	 (Abandoned) Giuseppe Lavagetto: Fixes [puppet] - https://gerrit.wikimedia.org/r/1206393 (owner: Giuseppe Lavagetto)
[15:02:52] <tgr_>	 kostajh: sorry, wasn't around. I'll reschedule my patch.
[15:03:06] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2093.codfw.wmnet with OS bullseye
[15:03:07] <wikibugs>	 (CR) Giuseppe Lavagetto: cache::text: introduce rate-limits by traffic class (5 comments) [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto)
[15:03:15] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11379181 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2093.codfw.wmnet with OS bullseye
[15:04:23] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy4002.wikimedia.org - sukhe@cumin1003"
[15:04:39] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy4002.wikimedia.org - sukhe@cumin1003"
[15:04:39] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:04:39] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy4002.wikimedia.org on all recursors
[15:04:42] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy4002.wikimedia.org on all recursors
[15:05:05] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy4002.wikimedia.org - sukhe@cumin1003"
[15:05:09] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy4002.wikimedia.org - sukhe@cumin1003"
[15:05:23] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy4002.wikimedia.org with OS trixie
[15:05:34] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11379188 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10...
[15:06:00] <wikibugs>	 ops-codfw, ops-eqiad, SRE, DC-Ops, and 3 others: Hardware requirements for WDQS backend migration. - https://phabricator.wikimedia.org/T409769#11379191 (bking) >>! In T409769#11378338, @gmodena wrote: >>>! In T409769#11375265, @bking wrote: >> However, would everyone be OK with not ordering hardw...
[15:06:03] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for AnkitaM - https://phabricator.wikimedia.org/T409893#11379193 (Milimetric) Approved!  Thank you @Volans
[15:08:23] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:10:08] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Traffic, Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11379212 (ssingh) >>! In T410047#11379165, @cmooney wrote: >>>! In T410047#11379157, @ssingh wrote: >> Yeah, good point about the LVS IP...
[15:11:03] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[15:11:57] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1001.eqiad.wmnet
[15:15:54] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1208.eqiad.wmnet
[15:17:31] <dbrant>	 jouncebot: nowandnext
[15:17:31] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 12 minute(s)
[15:17:31] <jouncebot>	 In 0 hour(s) and 12 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251117T1530)
[15:19:31] <wikibugs>	 (PS1) Cmelo: Release CampaignEvents extension to all remaining wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1206395 (https://phabricator.wikimedia.org/T409760)
[15:23:40] <jinxer-wm>	 FIRING: MirrorHighLag: Mirrors - /srv/mirrors/debian synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[15:25:20] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host an-worker1208.eqiad.wmnet
[15:25:38] <wikibugs>	 (CR) Dbrant: [C:+2] wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1204965 (owner: PipelineBot)
[15:25:42] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2093.codfw.wmnet with reason: host reimage
[15:25:57] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.decommission for hosts db-test1002.eqiad.wmnet
[15:27:46] <wikibugs>	 (Merged) jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1204965 (owner: PipelineBot)
[15:28:04] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy4002.wikimedia.org with reason: host reimage
[15:29:46] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2093.codfw.wmnet with reason: host reimage
[15:29:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11379313 (RobH) @BCornwall and @ssingh: We chatted about this last week, can we schedule this work to move dns1006 tomorrow, Tuesday November 18th at 9AM Pacific / 5PM GMT?
[15:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251117T1530)
[15:30:51] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.dns.netbox
[15:31:59] <logmsgbot>	 !log dbrant@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[15:32:22] <logmsgbot>	 !log dbrant@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[15:33:05] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1208.eqiad.wmnet
[15:33:14] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy4002.wikimedia.org with reason: host reimage
[15:33:24] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:33:52] <logmsgbot>	 !log dbrant@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[15:34:13] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1001.eqiad.wmnet
[15:34:17] <logmsgbot>	 !log dbrant@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[15:34:37] <logmsgbot>	 !log dbrant@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[15:34:59] <logmsgbot>	 !log dbrant@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[15:35:04] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: Netbox Cable report - incorrectly parsing Nokia power supplies - https://phabricator.wikimedia.org/T410073#11379334 (LSobanski) p:Triage→Medium a:ayounsi
[15:35:20] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11379336 (MoritzMuehlenhoff)
[15:35:56] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host hcaptcha-proxy3002.wikimedia.org with OS trixie
[15:35:56] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy3002.wikimedia.org
[15:36:03] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11379339 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f...
[15:36:26] <logmsgbot>	 fceratto@cumin1003 decommission (PID 1374925) is awaiting input
[15:36:50] <wikibugs>	 (PS1) Gehel: Airflow: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1206399 (https://phabricator.wikimedia.org/T408711)
[15:37:04] <wikibugs>	 (CR) Brouberol: [C:+1] Airflow: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1206399 (https://phabricator.wikimedia.org/T408711) (owner: Gehel)
[15:37:21] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11379344 (LSobanski) p:Triage→Medium
[15:37:59] <wikibugs>	 SRE, collaboration-services, Infrastructure-Foundations, vrts, and 2 others: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11379345 (jhathaway) p:High→Medium
[15:39:18] <wikibugs>	 SRE, serviceops, Datacenter-Switchover: Add scap lock/unlock steps to sre.switchdc.mediawiki cookbook - https://phabricator.wikimedia.org/T330996#11379346 (LSobanski)
[15:39:59] <wikibugs>	 (CR) Gehel: [C:+2] Airflow: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1206399 (https://phabricator.wikimedia.org/T408711) (owner: Gehel)
[15:41:33] <wikibugs>	 SRE-tools, Infrastructure-Foundations: firmware-upgrade cookbook fails after successful upgrade - https://phabricator.wikimedia.org/T331135#11379352 (LSobanski) Open→Declined Considering the age of this task I'm resolving it, please reopen if specific issues occur.
[15:41:53] <logmsgbot>	 !log gehel@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[15:42:03] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11379354 (BTullis) a:Jclark-ctr→BTullis OK, we're back to 12 data volumes, all cleanly mounted. ` btullis@an-worker1208:~$ findmnt -t ext4 TARGET...
[15:43:42] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db-test1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003"
[15:43:59] <logmsgbot>	 !log gehel@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[15:45:56] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops-radar, Datacenter-Switchover: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009#11379368 (LSobanski) Open→Declined Considering the age of the task I'm closing it, please reopen if this is...
[15:46:46] <logmsgbot>	 fceratto@cumin1003 decommission (PID 1374925) is awaiting input
[15:48:02] <wikibugs>	 (CR) Xcollazo: [C:-1] "This work has been paused. Let's not merge until further notice." [puppet] - https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) (owner: Xcollazo)
[15:48:27] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2093.codfw.wmnet with OS bullseye
[15:48:40] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11379387 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2093.codfw.wmnet with OS bullseye complete...
[15:49:25] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1 C:+2] cloudcephosd: switch 1048 to single interface [puppet] - https://gerrit.wikimedia.org/r/1203383 (https://phabricator.wikimedia.org/T399180) (owner: Filippo Giunchedi)
[15:49:44] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy4002.wikimedia.org with OS trixie
[15:49:45] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host hcaptcha-proxy4002.wikimedia.org
[15:50:03] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11379401 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f...
[15:50:11] <wikibugs>	 SRE-tools, Ganeti, Infrastructure-Foundations: Ganeti: consider --no-wait-for-sync as a default option for instance creation - https://phabricator.wikimedia.org/T335522#11379406 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This is being used and works well, resolving
[15:50:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy4001.ulsfo.wmnet
[15:53:34] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11379425 (RobH) @Clement_Goubert,  Is it possible that I could send the commands for this or do we need someone in your team?  If we need someone in your team, could we schedu...
[15:54:07] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db-test1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003"
[15:54:07] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:54:08] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db-test1002.eqiad.wmnet
[15:54:22] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.ganeti.makevm for new host db-test1002.eqiad.wmnet
[15:54:23] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.dns.netbox
[15:54:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy4001.ulsfo.wmnet
[15:55:12] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[15:55:32] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[15:56:37] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[15:57:49] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[15:58:44] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1002.eqiad.wmnet
[15:58:47] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11379455 (MatthewVernon) @Jhancock.wm for each of ms-be209[0-3] the install was failing because puppet couldn't run, because one of the spinning disks h...
[15:59:18] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test1002.eqiad.wmnet - fceratto@cumin1003"
[15:59:18] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[15:59:23] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test1002.eqiad.wmnet - fceratto@cumin1003"
[15:59:23] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:59:23] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.dns.wipe-cache db-test1002.eqiad.wmnet on all recursors
[15:59:26] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db-test1002.eqiad.wmnet on all recursors
[15:59:28] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[15:59:54] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test1002.eqiad.wmnet - fceratto@cumin1003"
[15:59:56] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11379459 (RobH) @Marostegui and/or @jcrespo,  Please note we've now migrated all #data-persistence hosts that do not require direct scheduling with you v...
[15:59:58] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test1002.eqiad.wmnet - fceratto@cumin1003"
[16:02:36] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.reimage for host db-test1002.eqiad.wmnet with OS trixie
[16:02:36] <wikibugs>	 (CR) Btullis: [C:+2] LVS: Add druid-public-coordinator to service list [puppet] - https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222) (owner: Stevemunene)
[16:02:46] <wikibugs>	 (CR) Btullis: [C:+2] druid: add druid-coordinator to druid public worker role [puppet] - https://gerrit.wikimedia.org/r/1199256 (https://phabricator.wikimedia.org/T406222) (owner: Stevemunene)
[16:04:31] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[16:04:41] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[16:04:59] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1002.eqiad.wmnet
[16:08:34] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[16:08:45] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[16:12:37] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db-test1002.eqiad.wmnet with reason: host reimage
[16:12:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy4002.ulsfo.wmnet
[16:12:52] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for AnkitaM - https://phabricator.wikimedia.org/T409893#11379541 (Volans) p:Triage→Medium
[16:13:29] <wikibugs>	 (PS2) Volans: create user for AnkitaM, add to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1205192 (https://phabricator.wikimedia.org/T409893) (owner: Dzahn)
[16:13:41] <jinxer-wm>	 FIRING: ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[16:14:53] <hnowlan>	 jouncebot: nowandnext
[16:14:53] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 15 minute(s)
[16:14:53] <jouncebot>	 In 0 hour(s) and 15 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251117T1630)
[16:15:04] <wikibugs>	 (PS1) Dpogorzelski: ml k8s: handle service start order [puppet] - https://gerrit.wikimedia.org/r/1206402
[16:15:39] <wikibugs>	 (PS3) Volans: create user for AnkitaM, add to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1205192 (https://phabricator.wikimedia.org/T409893) (owner: Dzahn)
[16:16:17] <wikibugs>	 (PS5) Hnowlan: trafficserver: Route group1 /page/lint(.*) to the rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1199034 (https://phabricator.wikimedia.org/T384216) (owner: Aaron Schulz)
[16:16:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy4002.ulsfo.wmnet
[16:17:32] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[16:18:36] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[16:18:44] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11379581 (Marostegui) @RobH thanks for working on this! For backup* and ms-backup1002 @jcrespo would be able to tell you better than I do For moss-be1002...
[16:18:55] <wikibugs>	 (CR) Hnowlan: [C:+2] trafficserver: Route group1 /page/lint(.*) to the rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1199034 (https://phabricator.wikimedia.org/T384216) (owner: Aaron Schulz)
[16:19:06] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db-test1002.eqiad.wmnet with reason: host reimage
[16:19:08] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-be2094.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:19:18] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1236 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1206404 (https://phabricator.wikimedia.org/T410281)
[16:19:41] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1206405 (https://phabricator.wikimedia.org/T410282)
[16:20:09] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1223 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1206406 (https://phabricator.wikimedia.org/T410283)
[16:20:17] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11379618 (Marostegui)
[16:20:55] <wikibugs>	 (CR) Hnowlan: "Relation chain is broken, need to fix" [puppet] - https://gerrit.wikimedia.org/r/1199034 (https://phabricator.wikimedia.org/T384216) (owner: Aaron Schulz)
[16:22:22] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: eqiad row C/D Data Platform host migrations - https://phabricator.wikimedia.org/T405943#11379646 (RobH) @btullis,  Please note we now only have 12 data platform hosts remaining for migration.  I still need cl...
[16:22:23] <logmsgbot>	 jhancock@cumin1003 provision (PID 1428712) is awaiting input
[16:22:40] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[16:23:01] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[16:23:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11379651 (RobH) p:Medium→High >>! In T405945#11372555, @RobH wrote: > @LSobanski, >  > The only two #infrastructure-foundations hosts le...
[16:25:16] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[16:26:20] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[16:26:58] <wikibugs>	 (PS1) RLazarus: admin: Add rzl FIDO ssh key [puppet] - https://gerrit.wikimedia.org/r/1206407
[16:27:20] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[16:27:50] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[16:27:52] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11379665 (Jclark-ctr) @Marostegui   Thanks, that works for me!  Was es1033 going to be decommissioned? Is that still the case? Also, will any cookbooks n...
[16:29:28] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[16:29:35] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11379673 (Marostegui) >>! In T405942#11379665, @Jclark-ctr wrote: > @Marostegui   Thanks, that works for me! >  Was es1033 going to be decommissioned? Is...
[16:29:57] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[16:30:05] <jouncebot>	 jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251117T1630).
[16:31:21] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1236 with weight 0 T410281', diff saved to https://phabricator.wikimedia.org/P85342 and previous config saved to /var/cache/conftool/dbconfig/20251117-163121-marostegui.json
[16:31:25] <stashbot>	 T410281: Switchover s7 master (db1181 -> db1236) - https://phabricator.wikimedia.org/T410281
[16:31:28] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[16:31:42] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[16:31:44] <logmsgbot>	 !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s7 T410281
[16:32:24] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db1236 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1206404 (https://phabricator.wikimedia.org/T410281) (owner: Gerrit maintenance bot)
[16:32:36] <wikibugs>	 (CR) Hnowlan: [C:+2] trafficserver: Route group1 /page/lint(.*) to the rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1199034 (https://phabricator.wikimedia.org/T384216) (owner: Aaron Schulz)
[16:33:01] <wikibugs>	 (CR) Hnowlan: [C:+2] Route transform/wikitext/to/lint(.*) to the gateway on all wikis [puppet] - https://gerrit.wikimedia.org/r/1194996 (https://phabricator.wikimedia.org/T385066) (owner: Aaron Schulz)
[16:33:34] <wikibugs>	 (CR) Hnowlan: [C:+1] Sandbox cleanup for the Wikimedia REST APIs [mediawiki-config] - https://gerrit.wikimedia.org/r/1203191 (https://phabricator.wikimedia.org/T409776) (owner: Aaron Schulz)
[16:35:03] <wikibugs>	 (CR) Scott French: [C:+1] "Looks good! Two thoughts:" [puppet] - https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: Giuseppe Lavagetto)
[16:35:15] <marostegui>	 !log Starting s7 eqiad failover from db1181 to db1236 - T410281
[16:35:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:36] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1236 to s7 primary T410281', diff saved to https://phabricator.wikimedia.org/P85343 and previous config saved to /var/cache/conftool/dbconfig/20251117-163535-marostegui.json
[16:35:37] <wikibugs>	 (PS6) Hnowlan: trafficserver: Route group1 /page/lint(.*) to the rest-gateway [puppet] - https://gerrit.wikimedia.org/r/1199034 (https://phabricator.wikimedia.org/T384216) (owner: Aaron Schulz)
[16:36:21] <logmsgbot>	 !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1181 T410281', diff saved to https://phabricator.wikimedia.org/P85344 and previous config saved to /var/cache/conftool/dbconfig/20251117-163620-marostegui.json
[16:36:54] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1181 gradually with 4 steps - Repooling after switchover
[16:37:06] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply
[16:37:38] <logmsgbot>	 !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1181 gradually with 4 steps - Repooling after switchover
[16:37:40] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply
[16:37:50] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1181 gradually with 4 steps - Repooling after switchover
[16:38:41] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[16:39:38] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db-test1002.eqiad.wmnet with OS trixie
[16:39:38] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host db-test1002.eqiad.wmnet
[16:40:35] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11379750 (MoritzMuehlenhoff) >>! In T409860#11372373, @ssingh wrote: > `hcaptcha-proxy3001` wor...
[16:40:41] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1002.eqiad.wmnet
[16:40:42] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2094.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:41:05] <wikibugs>	 ops-codfw, ops-eqiad, SRE, DC-Ops, and 3 others: Hardware requirements for WDQS backend migration. - https://phabricator.wikimedia.org/T409769#11379757 (bking) Per IRC conversation with @RobH , the proposed hosts  ` wdqs1028 wdqs1029 wdqs1030 wdqs1031 wdqs1032 ` Do not have 10G NICs. Unfortunatel...
[16:42:19] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11379764 (Volans)
[16:42:50] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11379766 (Volans) @ngkountas just to be sure, when you say `analytics-privatedata-users level 2` you mean with kerberos?
[16:43:20] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11379768 (Volans) p:Triage→Medium
[16:44:29] <wikibugs>	 (PS3) Jforrester: wikifunctions: Bump the orchestrator timeout down a skosh [deployment-charts] - https://gerrit.wikimedia.org/r/1205263 (https://phabricator.wikimedia.org/T407503) (owner: Cory Massaro)
[16:45:17] <wikibugs>	 (PS2) Volans: admin: update ssh key for mfischerwmf [puppet] - https://gerrit.wikimedia.org/r/1206385 (https://phabricator.wikimedia.org/T410270)
[16:45:17] <wikibugs>	 (PS4) Volans: admin: add user ankita97531 [puppet] - https://gerrit.wikimedia.org/r/1205192 (https://phabricator.wikimedia.org/T409893) (owner: Dzahn)
[16:45:17] <wikibugs>	 (PS1) Volans: admin: edit user ngkountas [puppet] - https://gerrit.wikimedia.org/r/1206416 (https://phabricator.wikimedia.org/T409854)
[16:46:09] <wikibugs>	 (CR) Volans: [C:+1] "LGTM, once you have a second key you could consider moving the old key to the buster_ssh_keys key (if it will still be there)." [puppet] - https://gerrit.wikimedia.org/r/1206407 (owner: RLazarus)
[16:47:02] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2230.codfw.wmnet onto db-test1002.eqiad.wmnet
[16:51:23] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.decommission for hosts db-test1003.eqiad.wmnet
[16:51:40] <wikibugs>	 SRE, collaboration-services, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11379832 (LSobanski) p:Triage→Medium a:Dzahn
[16:53:27] <wikibugs>	 ops-codfw, ops-eqiad, SRE, DC-Ops, and 3 others: Hardware requirements for WDQS backend migration. - https://phabricator.wikimedia.org/T409769#11379845 (Jclark-ctr) @bking all of these are  10g
[16:54:30] <wikibugs>	 (CR) Dzahn: [C:+1] admin: add user ankita97531 [puppet] - https://gerrit.wikimedia.org/r/1205192 (https://phabricator.wikimedia.org/T409893) (owner: Dzahn)
[16:55:04] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be2094.codfw.wmnet with OS bullseye
[16:55:12] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11379867 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-be2094.codfw.wmnet with OS bullseye
[16:56:05] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.dns.netbox
[16:57:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy5001.eqsin.wmnet
[16:57:22] <wikibugs>	 ops-codfw, ops-eqiad, SRE, DC-Ops, and 3 others: Hardware requirements for WDQS backend migration. - https://phabricator.wikimedia.org/T409769#11379875 (Jhancock.wm) @bking do you need me to set up anything in codfw at this time?
[16:57:31] <wikibugs>	 ops-codfw, ops-eqiad, SRE, DC-Ops, and 3 others: Hardware requirements for WDQS backend migration. - https://phabricator.wikimedia.org/T409769#11379876 (bking) Ah, thanks for the correction. I have crossed through the above.
[16:58:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:01:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy5001.eqsin.wmnet
[17:01:55] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db-test1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003"
[17:02:51] <wikibugs>	 (CR) Slyngshede: [C:+1] admin: edit user ngkountas [puppet] - https://gerrit.wikimedia.org/r/1206416 (https://phabricator.wikimedia.org/T409854) (owner: Volans)
[17:03:15] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to ops-limited for matthieulec - https://phabricator.wikimedia.org/T410291 (MLechvien-WMF) NEW
[17:03:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:03:44] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db-test1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003"
[17:03:44] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:03:45] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db-test1003.eqiad.wmnet
[17:04:23] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.ganeti.makevm for new host db-test1003.eqiad.wmnet
[17:04:24] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.dns.netbox
[17:04:32] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.decommission for hosts hcaptcha-proxy3002.wikimedia.org
[17:04:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:05:13] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1045.eqiad.wmnet with OS trixie
[17:05:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:06:32] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11379965 (10fgiunchedi)
[17:06:44] <logmsgbot>	 !log fceratto@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[17:06:49] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.dns.netbox
[17:07:14] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:08:23] <logmsgbot>	 jhancock@cumin1003 reimage (PID 1468492) is awaiting input
[17:08:48] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan - https://phabricator.wikimedia.org/T409690#11379974 (10fgiunchedi) Thank you @cmooney ! FYI as per Andrew we really only care about cloudcephosd1035 through c...
[17:09:08] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:09:28] <wikibugs>	 (03PS1) 10Matthieulec: admin: Adding matthieulec to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1206422 (https://phabricator.wikimedia.org/T410291)
[17:10:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:11:36] <wikibugs>	 (03CR) 10Klausman: [C:03+1] ml k8s: handle service start order (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1206402 (owner: 10Dpogorzelski)
[17:12:18] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[17:12:33] <logmsgbot>	 fceratto@cumin1003 makevm (PID 1477624) is awaiting input
[17:14:22] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test1003.eqiad.wmnet - fceratto@cumin1003"
[17:14:53] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM db-test1003.eqiad.wmnet - fceratto@cumin1003"
[17:14:53] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:14:53] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.dns.wipe-cache db-test1003.eqiad.wmnet on all recursors
[17:14:56] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) db-test1003.eqiad.wmnet on all recursors
[17:15:07] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:15:07] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts hcaptcha-proxy3002.wikimedia.org
[17:15:23] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11380025 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1003 for...
[17:15:24] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test1003.eqiad.wmnet - fceratto@cumin1003"
[17:15:28] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM db-test1003.eqiad.wmnet - fceratto@cumin1003"
[17:15:47] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.reimage for host db-test1003.eqiad.wmnet with OS trixie
[17:16:30] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy3002.wikimedia.org
[17:16:32] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[17:16:47] <wikibugs>	 (03PS2) 10Dpogorzelski: ml k8s: handle service start order [puppet] - 10https://gerrit.wikimedia.org/r/1206402
[17:16:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[17:17:11] <wikibugs>	 (03CR) 10Dpogorzelski: ml k8s: handle service start order (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1206402 (owner: 10Dpogorzelski)
[17:18:02] <wikibugs>	 (03PS3) 10Dpogorzelski: ml k8s: handle service start order [puppet] - 10https://gerrit.wikimedia.org/r/1206402
[17:18:35] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for lpintscher - https://phabricator.wikimedia.org/T409933#11380053 (10Volans)
[17:19:17] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for lpintscher - https://phabricator.wikimedia.org/T409933#11380058 (10Volans) p:05Triage→03Medium Pending approval from @WMDE-leszek
[17:20:17] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[17:20:31] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[17:20:31] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:20:31] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy3002.wikimedia.org on all recursors
[17:20:35] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy3002.wikimedia.org on all recursors
[17:20:42] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for lpintscher - https://phabricator.wikimedia.org/T409933#11380070 (10Dzahn) The approval is already here on the ticket.
[17:21:04] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[17:21:09] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy3002.wikimedia.org - sukhe@cumin1003"
[17:21:14] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: host reimage
[17:21:32] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy3002.wikimedia.org with OS trixie
[17:21:42] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11380072 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10...
[17:22:41] <wikibugs>	 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10vm-requests: Site: 1 VM %request for codfw1dev CAS test/dev, hostname: cloudidp - https://phabricator.wikimedia.org/T410294 (10Andrew) 03NEW
[17:23:16] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1181 gradually with 4 steps - Repooling after switchover
[17:24:57] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: host reimage
[17:25:15] <wikibugs>	 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10vm-requests: Site: codfw   1 VM request for codfw1dev CAS test/dev, hostname: cloudidp - https://phabricator.wikimedia.org/T410294#11380104 (10Andrew)
[17:25:40] <wikibugs>	 (03CR) 10Bearloga: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno)
[17:26:11] <wikibugs>	 (03PS1) 10Ssingh: hiera: lvs/interfaces: remove public1-ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1206424 (https://phabricator.wikimedia.org/T410047)
[17:26:13] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T410234#11380113 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm balanced power
[17:26:25] <wikibugs>	 (03CR) 10Volans: [C:03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1202704 (https://phabricator.wikimedia.org/T409409) (owner: 10Hnowlan)
[17:26:49] <kostajh>	 jouncebot: nowandnext
[17:26:49] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 33 minute(s)
[17:26:49] <jouncebot>	 In 0 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251117T1800)
[17:26:49] <jouncebot>	 In 0 hour(s) and 33 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251117T1800)
[17:26:58] <kostajh>	 I have a config patch to deploy, if that is ok
[17:27:35] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db-test1003.eqiad.wmnet with reason: host reimage
[17:28:38] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE) - https://phabricator.wikimedia.org/T409409#11380126 (10Volans)
[17:28:41] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptcha: Disable edit integration on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206425 (https://phabricator.wikimedia.org/T405586)
[17:29:50] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics_privatedata_users and SQL Lab for Arian Bozorg (WMDE) - https://phabricator.wikimedia.org/T409409#11380134 (10Volans) p:05Triage→03Medium @Arian_Bozorg patch merged, it should get live within 30 minutes from now. Once you've...
[17:30:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206425 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan)
[17:31:39] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Disable edit integration on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206425 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan)
[17:31:41] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to ops-limited for matthieulec - https://phabricator.wikimedia.org/T410291#11380154 (10Volans) p:05Triage→03Medium Pending approval from either @Kappakayala or @mark (from `data.yaml` approval list)
[17:31:59] <logmsgbot>	 !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1206425|hCaptcha: Disable edit integration on jawiki (T405586)]]
[17:32:03] <stashbot>	 T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586
[17:32:49] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db-test1003.eqiad.wmnet with reason: host reimage
[17:37:12] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1206425|hCaptcha: Disable edit integration on jawiki (T405586)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[17:37:16] <stashbot>	 T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586
[17:37:31] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Continuing with sync
[17:37:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:39:27] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11380223 (10BCornwall) @robh Works for me. Mind adding a calendar invite? Thanks.
[17:39:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and NTT (2001:728:0:5000::164c) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[17:39:57] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for lpintscher - https://phabricator.wikimedia.org/T409933#11380238 (10Volans)
[17:40:10] <wikibugs>	 (03CR) 10Daimona Eaytoy: [C:03+1] Release CampaignEvents extension to all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206395 (https://phabricator.wikimedia.org/T409760) (owner: 10Cmelo)
[17:40:35] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for lpintscher - https://phabricator.wikimedia.org/T409933#11380242 (10Volans) Whoops, my bad, it was not tracked in the summary and I missed it in the thread.
[17:40:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:xe-0/1/5 (Transit: NTT (345038) {#345038}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:41:56] <wikibugs>	 (03PS2) 10Dzahn: admin: upgrade lpintscher from ldap_only to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/1205187 (https://phabricator.wikimedia.org/T409933)
[17:42:57] <logmsgbot>	 !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206425|hCaptcha: Disable edit integration on jawiki (T405586)]] (duration: 10m 58s)
[17:43:01] <stashbot>	 T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586
[17:43:41] <wikibugs>	 (03CR) 10Volans: [C:03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1205187 (https://phabricator.wikimedia.org/T409933) (owner: 10Dzahn)
[17:45:13] <kostajh>	 done with the backport
[17:45:55] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for lpintscher - https://phabricator.wikimedia.org/T409933#11380265 (10Volans) @Lydia_Pintscher patch merged, it should get live within 30 minutes from now. Once you've verified all works as expected please res...
[17:46:48] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11380279 (10ssingh) >>! In T409860#11379750, @MoritzMuehlenhoff wrote: >>>! In T409860#11372373,...
[17:47:39] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: DNM: Prep patch for removal of old maps roles [puppet] - 10https://gerrit.wikimedia.org/r/1169636 (https://phabricator.wikimedia.org/T381565) (owner: 10Alexandros Kosiaris)
[17:52:11] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db-test1003.eqiad.wmnet with OS trixie
[17:52:11] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host db-test1003.eqiad.wmnet
[17:57:46] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on bast2003 - https://phabricator.wikimedia.org/T410195#11380368 (10Jhancock.wm) a:03Jhancock.wm Is this a false alert? I'm not seeing any issues physically with the server or in the idrac.   If this drive does need to be replaced, could i get some dmesg errors t...
[17:57:58] <wikibugs>	 (03CR) 10Scott French: "Thanks for the reviews, Effie!" [puppet] - 10https://gerrit.wikimedia.org/r/1204945 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[17:58:04] <jinxer-wm>	 FIRING: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[17:58:16] <wikibugs>	 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304 (10MatthewVernon) 03NEW
[17:59:37] <wikibugs>	 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11380408 (10MatthewVernon)
[18:00:05] <jouncebot>	 swfrench-wmf: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251117T1800).
[18:00:05] <jouncebot>	 ryankemper: #bothumor I � Unicode. All rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251117T1800).
[18:00:13] <swfrench-wmf>	 o/
[18:01:00] <wikibugs>	 (03CR) 10Scott French: [C:03+2] deployment_server: migrate mw-experimental to PHP 8.3 [puppet] - 10https://gerrit.wikimedia.org/r/1204945 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[18:01:26] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdf) failed in thanos-be2008 - https://phabricator.wikimedia.org/T409036#11380418 (10Jhancock.wm) @MatthewVernon avoiding reopening the ticket for metric reasons. did this one need any additional attention? i was holding onto the bad drive until...
[18:01:49] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1045.eqiad.wmnet with OS trixie
[18:03:04] <jinxer-wm>	 RESOLVED: MediaWikiElevatedUnknownLogins: Elevated number of login successes (source unknown) via mw-web - TODO - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?from=now-6h&orgId=1&to=now&viewPanel=26 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiElevatedUnknownLogins
[18:03:10] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS trixie
[18:03:55] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2094.codfw.wmnet with OS bullseye
[18:04:02] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11380431 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ms-be2094.codfw.wmnet with OS bullseye execute...
[18:05:32] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11380442 (10Jhancock.wm) @MatthewVernon got the main issue with this one fixed. it failed cause it's trying to reach the wrong puppet server. Gonna take a...
[18:05:51] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:xe-0/1/5 (Transit: NTT (345038) {#345038}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:07:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:xe-0/1/5 (Transit: NTT (345038) {#345038}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:08:29] <wikibugs>	 (03PS3) 10Volans: admin: update ssh key for mfischerwmf [puppet] - 10https://gerrit.wikimedia.org/r/1206385 (https://phabricator.wikimedia.org/T410270)
[18:08:29] <wikibugs>	 (03PS5) 10Volans: admin: add user ankita97531 [puppet] - 10https://gerrit.wikimedia.org/r/1205192 (https://phabricator.wikimedia.org/T409893) (owner: 10Dzahn)
[18:08:29] <wikibugs>	 (03PS2) 10Volans: admin: edit user ngkountas [puppet] - 10https://gerrit.wikimedia.org/r/1206416 (https://phabricator.wikimedia.org/T409854)
[18:09:55] <wikibugs>	 (03CR) 10Aaron Schulz: "set-time-limit was removed in e784ab5897c9479aab525dbe2573b76ed46c83f2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577642 (https://phabricator.wikimedia.org/T247114) (owner: 10Ppchelko)
[18:10:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy5002.eqsin.wmnet
[18:11:59] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.clone of db2230.codfw.wmnet onto db-test1003.eqiad.wmnet
[18:12:36] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host hcaptcha-proxy3002.wikimedia.org with OS trixie
[18:12:36] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy3002.wikimedia.org
[18:12:48] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11380462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f...
[18:12:52] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: No-deployment scap run to switch mw-experimental to PHP 8.3 - T405955
[18:13:09] <logmsgbot>	 !log swfrench@deploy2002 Stopping before sync operations
[18:14:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy5002.eqsin.wmnet
[18:15:42] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11380476 (10jcrespo) @Jclark-ctr Would me stopping backups tomorrow, Tuesday 18 before your TZ (e.g. before 11 am UTC/6am Eastern Timezone) and then those...
[18:17:03] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2230.codfw.wmnet onto db-test1003.eqiad.wmnet
[18:17:15] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11380482 (10MoritzMuehlenhoff) >>! In T409860#11380279, @ssingh wrote:  > Thanks! Can you share t...
[18:17:29] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[18:17:40] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206395 (https://phabricator.wikimedia.org/T409760) (owner: 10Cmelo)
[18:19:20] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1047.eqiad.wmnet with reason: host reimage
[18:21:36] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:22:07] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[18:22:34] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:23:20] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1047.eqiad.wmnet with reason: host reimage
[18:23:28] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply
[18:24:31] <wikibugs>	 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10vm-requests: Site: codfw   1 VM request for codfw1dev CAS test/dev, hostname: cloudidp - https://phabricator.wikimedia.org/T410294#11380505 (10MoritzMuehlenhoff) Looks good, but you definitely don't need 8G of RAM, 4G should be more than enough...
[18:25:56] <wikibugs>	 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10vm-requests: Site: codfw   1 VM request for codfw1dev CAS test/dev, hostname: cloudidp - https://phabricator.wikimedia.org/T410294#11380511 (10Andrew) >>! In T410294#11380505, @MoritzMuehlenhoff wrote: > Looks good, but you definitely don't need...
[18:26:07] <wikibugs>	 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10vm-requests: Site: codfw   1 VM request for codfw1dev CAS test/dev, hostname: cloudidp - https://phabricator.wikimedia.org/T410294#11380513 (10Andrew)
[18:27:44] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
[18:27:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:30:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204947 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[18:31:44] <wikibugs>	 (03Merged) 10jenkins-bot: Disable enrollment in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204947 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)
[18:32:00] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[18:32:00] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[18:32:00] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[18:32:01] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1204947|Disable enrollment in PHP 8.3 (T405955)]]
[18:32:05] <stashbot>	 T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
[18:34:15] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy3002.wikimedia.org with OS trixie
[18:34:26] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11380538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10...
[18:34:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and NTT (2001:728:0:5000::164c) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[18:36:12] <wikibugs>	 (03PS1) 10Bvibber: Reduced MediaViewer bucket sizes list to group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206433 (https://phabricator.wikimedia.org/T372165)
[18:36:33] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.ganeti.makevm for new host hcaptcha-proxy7002.wikimedia.org
[18:36:33] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1204947|Disable enrollment in PHP 8.3 (T405955)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[18:36:34] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.netbox
[18:36:49] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 17 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206433 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber)
[18:37:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:38:31] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Continuing with sync
[18:39:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and NTT (2001:728:0:5000::164c) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[18:39:54] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7002.wikimedia.org - sukhe@cumin1003"
[18:40:11] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM hcaptcha-proxy7002.wikimedia.org - sukhe@cumin1003"
[18:40:11] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:40:11] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.wipe-cache hcaptcha-proxy7002.wikimedia.org on all recursors
[18:40:15] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) hcaptcha-proxy7002.wikimedia.org on all recursors
[18:40:42] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy7002.wikimedia.org - sukhe@cumin1003"
[18:40:46] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM hcaptcha-proxy7002.wikimedia.org - sukhe@cumin1003"
[18:41:21] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy7002.wikimedia.org with OS trixie
[18:41:32] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11380553 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10...
[18:42:38] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204947|Disable enrollment in PHP 8.3 (T405955)]] (duration: 10m 37s)
[18:42:42] <stashbot>	 T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
[18:47:44] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1047.eqiad.wmnet with OS trixie
[18:48:35] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1048.eqiad.wmnet with OS trixie
[18:49:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-drmrs and NTT (2001:728:0:5000::164c) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[18:51:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:52:51] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-drmrs:xe-0/1/5 (Transit: NTT (345038) {#345038}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[18:53:36] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11380576 (Novem_Linguae) Level 2 is yes ssh key, no kerberos.  https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#Access_Levels
[18:54:39] <wikibugs>	 (CR) Novem Linguae: admin: edit user ngkountas (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1206416 (https://phabricator.wikimedia.org/T409854) (owner: Volans)
[18:54:47] <wikibugs>	 (CR) Dzahn: [C:+1] admin: add user ankita97531 [puppet] - https://gerrit.wikimedia.org/r/1205192 (https://phabricator.wikimedia.org/T409893) (owner: Dzahn)
[18:54:56] <wikibugs>	 ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11380577 (phaultfinder)
[18:55:50] <wikibugs>	 (CR) Dzahn: admin: edit user ngkountas (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1206416 (https://phabricator.wikimedia.org/T409854) (owner: Volans)
[18:56:44] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11380593 (Dzahn) linking to T405517 as another practical example
[18:56:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy6001.drmrs.wmnet
[18:57:23] <wikibugs>	 (CR) Dzahn: [C:-1] "kerberos should be level 3 - https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#Access_Levels" [puppet] - https://gerrit.wikimedia.org/r/1206416 (https://phabricator.wikimedia.org/T409854) (owner: Volans)
[18:57:47] <swfrench-wmf>	 !log disable puppet on A:lvs-codfw for pybal config change - T352245
[18:57:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:51] <stashbot>	 T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245
[18:58:18] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on hcaptcha-proxy3002.wikimedia.org with reason: host reimage
[18:59:43] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1206407 (owner: RLazarus)
[18:59:58] <wikibugs>	 ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11380618 (phaultfinder)
[19:00:02] <wikibugs>	 (CR) Scott French: "Thanks for the review!" [puppet] - https://gerrit.wikimedia.org/r/1203556 (https://phabricator.wikimedia.org/T352245) (owner: Scott French)
[19:00:22] <wikibugs>	 (CR) Scott French: [C:+2] hiera: temporarily point codfw LVS at conf2006 [puppet] - https://gerrit.wikimedia.org/r/1203556 (https://phabricator.wikimedia.org/T352245) (owner: Scott French)
[19:01:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy6001.drmrs.wmnet
[19:03:29] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on hcaptcha-proxy3002.wikimedia.org with reason: host reimage
[19:04:25] <swfrench-wmf>	 FYI here as well, I'm going to be applying some pybal config changes in codfw, which will result in some transient icinga check noise
[19:04:52] <wikibugs>	 (CR) RLazarus: [C:+2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1206407 (owner: RLazarus)
[19:05:28] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1048.eqiad.wmnet with reason: host reimage
[19:05:40] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw (T352245)
[19:05:44] <stashbot>	 T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245
[19:06:09] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:06:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy6002.drmrs.wmnet
[19:09:34] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1048.eqiad.wmnet with reason: host reimage
[19:10:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy6002.drmrs.wmnet
[19:10:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy7001.magru.wmnet
[19:11:43] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw (T352245)
[19:11:47] <stashbot>	 T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245
[19:14:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy7001.magru.wmnet
[19:15:30] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-codfw (T352245)
[19:15:39] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11380687 (KFrancis) Hi all, the NDA has been sent for signatures.  I'll confirm when it's complete.
[19:15:42] <wikibugs>	 (PS2) Kgraessle: Set AutoModeratorMultiLingualRevertRisk with available wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1203498 (https://phabricator.wikimedia.org/T400727)
[19:16:00] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:16:44] <wikibugs>	 (CR) BPirkle: [C:+1] Sandbox cleanup for the Wikimedia REST APIs [mediawiki-config] - https://gerrit.wikimedia.org/r/1203191 (https://phabricator.wikimedia.org/T409776) (owner: Aaron Schulz)
[19:19:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy7002.magru.wmnet
[19:20:43] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host hcaptcha-proxy3002.wikimedia.org with OS trixie
[19:20:51] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11380697 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f...
[19:21:37] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-codfw (T352245)
[19:21:41] <stashbot>	 T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245
[19:23:40] <jinxer-wm>	 FIRING: MirrorHighLag: Mirrors - /srv/mirrors/debian synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[19:24:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy7002.magru.wmnet
[19:26:04] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:26:37] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11380736 (MoritzMuehlenhoff)
[19:27:07] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic2-codfw (T352245)
[19:27:11] <stashbot>	 T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245
[19:27:29] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic2-codfw (T352245)
[19:30:54] <logmsgbot>	 !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic1-codfw (T352245)
[19:31:22] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11380759 (Jclark-ctr) @jcrespo  That works for me Thanks!
[19:31:31] <logmsgbot>	 !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic1-codfw (T352245)
[19:33:48] <wikibugs>	 ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11380788 (bking)
[19:33:50] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host hcaptcha-proxy7002.wikimedia.org with OS trixie
[19:33:51] <logmsgbot>	 !log sukhe@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host hcaptcha-proxy7002.wikimedia.org
[19:33:57] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11380794 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f...
[19:35:54] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:42:14] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy7002.wikimedia.org with OS trixie
[19:42:24] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11380805 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin10...
[19:46:31] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11380854 (bking)
[19:56:59] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:58:22] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1048.eqiad.wmnet with OS trixie
[19:59:11] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1049.eqiad.wmnet with OS trixie
[20:07:02] <wikibugs>	 (PS1) Santiago Faci: Test Kitchen UI: Deploying v1.1.2 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1206446 (https://phabricator.wikimedia.org/T409546)
[20:07:06] <wikibugs>	 (CR) Bearloga: [C:+1] "By the way "EventStreamConfig" is misspelled in the commit message (missing e), but not a blocker for merging." [mediawiki-config] - https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177) (owner: Sergio Gimeno)
[20:08:01] <wikibugs>	 SRE, cloud-services-team, Infrastructure-Foundations, vm-requests: Site: codfw   1 VM request for codfw1dev CAS test/dev, hostname: cloudidp2001-dev - https://phabricator.wikimedia.org/T410294#11380957 (Andrew)
[20:13:11] <wikibugs>	 (CR) Clare Ming: [C:+2] Test Kitchen UI: Deploying v1.1.2 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1206446 (https://phabricator.wikimedia.org/T409546) (owner: Santiago Faci)
[20:14:40] <wikibugs>	 (PS1) Andrew Bogott: Add cloudidp2001-dev [puppet] - https://gerrit.wikimedia.org/r/1206448 (https://phabricator.wikimedia.org/T410294)
[20:14:49] <wikibugs>	 (Merged) jenkins-bot: Test Kitchen UI: Deploying v1.1.2 release to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1206446 (https://phabricator.wikimedia.org/T409546) (owner: Santiago Faci)
[20:15:12] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1206448 (https://phabricator.wikimedia.org/T410294) (owner: Andrew Bogott)
[20:15:53] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1049.eqiad.wmnet with reason: host reimage
[20:19:10] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1049.eqiad.wmnet with reason: host reimage
[20:19:37] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be2094.codfw.wmnet with OS bullseye
[20:19:44] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11380986 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-be2094.codfw.wmnet with OS bullseye
[20:20:07] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1206448 (https://phabricator.wikimedia.org/T410294) (owner: Andrew Bogott)
[20:20:21] <mutante>	 !log codesearch9.codesearch - systemctl restart hound-search (T410310)
[20:20:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:24] <stashbot>	 T410310: Codesearch "everything" index stuck in pre-start state; unable to search "everything" - https://phabricator.wikimedia.org/T410310
[20:28:53] <logmsgbot>	 !log sukhe@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host hcaptcha-proxy7002.wikimedia.org with OS trixie
[20:29:01] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11381013 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 f...
[20:36:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[20:38:56] <jinxer-wm>	 FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:39:20] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2094.codfw.wmnet with reason: host reimage
[20:40:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[20:43:08] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2094.codfw.wmnet with reason: host reimage
[20:49:42] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1049.eqiad.wmnet with OS trixie
[20:51:00] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1051.eqiad.wmnet with OS trixie
[20:55:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[20:58:14] <icinga-wm>	 RECOVERY - Debian mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251117T2100).
[21:00:05] <jouncebot>	 cmelo and bvibber: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:10] <bvibber>	 o/
[21:00:26] <bvibber>	 i can spiderpig one or both in a pinch
[21:02:08] <cmelo>	 \o/
[21:02:10] <bvibber>	 woohoo
[21:02:23] <bvibber>	 cmelo: i can spiderpig both config patches together in a pinch
[21:03:04] <cmelo>	 that would be great thank you
[21:03:10] <bvibber>	 cool, starting...
[21:03:25] <jinxer-wm>	 RESOLVED: MirrorHighLag: Mirrors - /srv/mirrors/debian synchronization lag - https://wikitech.wikimedia.org/wiki/Mirrors - https://grafana.wikimedia.org/d/dbd8a904-eab2-48d1-a3b9-fa1851ef3ed2/mirrors?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DMirrorHighLag
[21:03:28] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1206433 (https://phabricator.wikimedia.org/T372165) (owner: Bvibber)
[21:03:28] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1206395 (https://phabricator.wikimedia.org/T409760) (owner: Cmelo)
[21:04:29] <wikibugs>	 (Merged) jenkins-bot: Reduced MediaViewer bucket sizes list to group1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1206433 (https://phabricator.wikimedia.org/T372165) (owner: Bvibber)
[21:04:33] <wikibugs>	 (Merged) jenkins-bot: Release CampaignEvents extension to all remaining wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1206395 (https://phabricator.wikimedia.org/T409760) (owner: Cmelo)
[21:04:54] <logmsgbot>	 !log bvibber@deploy2002 Started scap sync-world: Backport for [[gerrit:1206433|Reduced MediaViewer bucket sizes list to group1 (T372165)]], [[gerrit:1206395|Release CampaignEvents extension to all remaining wikis (T409760)]]
[21:04:59] <stashbot>	 T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165
[21:04:59] <stashbot>	 T409760: Release CampaignEvents extension to all remaining wikis - NOV 17 - https://phabricator.wikimedia.org/T409760
[21:05:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:05:59] <wikibugs>	 (PS1) Kosta Harlan: hCaptcha: Enable hCaptcha editing for fawiki, trwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1206455 (https://phabricator.wikimedia.org/T405586)
[21:07:44] <bvibber>	 cmelo: will there be anything you need to test against test servers?
[21:08:00] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1051.eqiad.wmnet with reason: host reimage
[21:08:51] <mutante>	 !log LDAP - added ankita97531 to group nda - T409894
[21:08:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:08:55] <stashbot>	 T409894: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894
[21:09:08] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:09:13] <bvibber>	 (be a couple minutes yet before test servers live)
[21:09:37] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894#11381211 (Dzahn) In progress→Resolved a:Dzahn You have been added the requested LDAP group "nda".
[21:09:56] <logmsgbot>	 !log bvibber@deploy2002 bvibber, cmelo: Backport for [[gerrit:1206433|Reduced MediaViewer bucket sizes list to group1 (T372165)]], [[gerrit:1206395|Release CampaignEvents extension to all remaining wikis (T409760)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:09:59] <wikibugs>	 (PS1) Scott French: hiera: temporarily move etcd replication to conf2006 [puppet] - https://gerrit.wikimedia.org/r/1206452 (https://phabricator.wikimedia.org/T352245)
[21:10:01] <wikibugs>	 (PS2) Scott French: hiera: switch codfw etcd-main cluster to cfssl/pki [puppet] - https://gerrit.wikimedia.org/r/1203557 (https://phabricator.wikimedia.org/T352245)
[21:10:02] <stashbot>	 T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165
[21:10:02] <wikibugs>	 (PS1) Scott French: hiera: move etcd replication back to conf2005 [puppet] - https://gerrit.wikimedia.org/r/1206453 (https://phabricator.wikimedia.org/T352245)
[21:10:02] <stashbot>	 T409760: Release CampaignEvents extension to all remaining wikis - NOV 17 - https://phabricator.wikimedia.org/T409760
[21:10:11] <bvibber>	 cmelo: ok they're ready to test :D
[21:10:36] <cmelo>	 thank you :)
[21:11:28] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1051.eqiad.wmnet with reason: host reimage
[21:11:58] <bvibber>	 ok mine looks good no explosions :D
[21:12:13] <cmelo>	 all good, thank you so much :)
[21:12:17] <bvibber>	 woohoo
[21:12:21] <logmsgbot>	 !log bvibber@deploy2002 bvibber, cmelo: Continuing with sync
[21:17:36] <logmsgbot>	 !log bvibber@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206433|Reduced MediaViewer bucket sizes list to group1 (T372165)]], [[gerrit:1206395|Release CampaignEvents extension to all remaining wikis (T409760)]] (duration: 12m 43s)
[21:17:42] <stashbot>	 T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165
[21:17:42] <stashbot>	 T409760: Release CampaignEvents extension to all remaining wikis - NOV 17 - https://phabricator.wikimedia.org/T409760
[21:17:45] <bvibber>	 \o/
[21:17:50] <bvibber>	 cmelo: all done
[21:18:30] <cmelo>	 thank you!!!!
[21:20:26] <bvibber>	 yw :D
[21:20:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:22:06] <Amir1>	 !log DROP table if exists securepoll_u4c2025_edits; on all wikis (T355594)
[21:22:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:10] <stashbot>	 T355594: For global elections, stop creating eligible voters table for each election on every wiki and keeping them forever - https://phabricator.wikimedia.org/T355594
[21:22:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[21:27:14] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1051.eqiad.wmnet with OS trixie
[21:30:48] <wikibugs>	 (PS1) Ahmon Dancy: scap.cfg.erb: Restore beta cluster php_fpm_restart_script setting [puppet] - https://gerrit.wikimedia.org/r/1206458 (https://phabricator.wikimedia.org/T396166)
[21:35:45] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1051.eqiad.wmnet with OS trixie
[21:37:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:51:20] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1051.eqiad.wmnet with reason: host reimage
[21:52:36] <wikibugs>	 (PS1) Aaron Schulz: Mark non-wikimedia.org math APIs as deprecated in the sandbox [mediawiki-config] - https://gerrit.wikimedia.org/r/1206466 (https://phabricator.wikimedia.org/T409773)
[21:55:09] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1051.eqiad.wmnet with reason: host reimage
[21:57:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:00:04] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251117T2200).
[22:03:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:11:35] <papaul>	 !log reboot sretest2004 to troubleshoot LLDP issue
[22:11:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:14] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1051.eqiad.wmnet with OS trixie
[22:21:50] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1052.eqiad.wmnet with OS trixie
[22:32:00] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[22:32:00] <jinxer-wm>	 Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[22:32:00] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[22:38:24] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1052.eqiad.wmnet with reason: host reimage
[22:43:06] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1052.eqiad.wmnet with reason: host reimage
[22:43:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:46:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[22:51:14] <jinxer-wm>	 FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:59:54] <wikibugs>	 ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11381549 (phaultfinder)
[23:01:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[23:04:54] <wikibugs>	 ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11381563 (phaultfinder)
[23:07:20] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS trixie
[23:07:28] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11381580 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2004.codfw.wmnet with OS trixie
[23:08:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[23:13:51] <wikibugs>	 SRE, collaboration-services, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11381585 (EdErhart-WMF) Hey folks, especially @Dzahn - we'd like to move forward with locating the YoW microsite at wikipedia25.org.
[23:15:17] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1052.eqiad.wmnet with OS trixie
[23:23:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[23:24:21] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[23:27:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage
[23:29:21] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[23:33:57] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage
[23:38:12] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1053.eqiad.wmnet with OS trixie
[23:39:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[23:50:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[23:51:27] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[23:51:29] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2004.codfw.wmnet with OS trixie
[23:51:40] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11381762 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2004.codfw.wmnet with OS trixie completed: - sretest2004 (**PASS**)...
[23:53:32] <icinga-wm>	 PROBLEM - Host sretest2004 is DOWN: PING CRITICAL - Packet loss = 100%
[23:54:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[23:55:02] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1053.eqiad.wmnet with reason: host reimage
[23:57:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[23:58:00] <icinga-wm>	 RECOVERY - Host sretest2004 is UP: PING OK - Packet loss = 0%, RTA = 30.48 ms
[23:58:31] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1053.eqiad.wmnet with reason: host reimage