[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T0000) [00:01:27] PROBLEM - Host sretest2004 is DOWN: PING CRITICAL - Packet loss = 100% [00:08:59] RECOVERY - Host sretest2004 is UP: PING OK - Packet loss = 0%, RTA = 30.48 ms [00:12:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:18:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:28:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:38:28] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1206492 [00:38:28] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1206492 (owner: 10TrainBranchBot) [00:38:56] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [00:39:51] FIRING: 
SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [00:52:32] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1206492 (owner: 10TrainBranchBot) [00:56:31] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1053.eqiad.wmnet with OS trixie [00:59:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [01:00:53] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:02:28] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 01m 35s) [01:04:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [01:08:17] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1206498 [01:08:17] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1206498 (owner: 10TrainBranchBot) [01:09:08] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - 
https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:14:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [01:20:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [01:31:03] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1206498 (owner: 10TrainBranchBot) [01:32:25] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:33:35] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11381926 (10Papaul) [01:50:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [01:55:55] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:57:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [02:06:49] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1055.eqiad.wmnet'] [02:06:56] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1055.eqiad.wmnet'] [02:08:04] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1056.eqiad.wmnet with OS trixie [02:08:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.46.0-wmf.3 [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206519 (https://phabricator.wikimedia.org/T408273) [02:08:08] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.46.0-wmf.3 [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206519 (https://phabricator.wikimedia.org/T408273) (owner: 10TrainBranchBot) [02:11:18] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11381964 (10Dzahn) Hi @cmadeo @EdErhart-WMF, You should be allowed to have subpages in any of the options. What I wanted to recommend now is to use the `25.wikip... 
[02:11:24] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11381965 (10Papaul) [02:20:46] (03Merged) 10jenkins-bot: Branch commit for wmf/1.46.0-wmf.3 [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206519 (https://phabricator.wikimedia.org/T408273) (owner: 10TrainBranchBot) [02:24:12] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1056.eqiad.wmnet with reason: host reimage [02:27:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [02:28:22] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1056.eqiad.wmnet with reason: host reimage [02:32:00] FIRING: KubernetesDeploymentUnavailableReplicas: ... [02:32:00] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... 
[02:32:00] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [02:36:16] (03PS2) 10Scott French: hiera: temporarily disable etcd replication on conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/1206452 (https://phabricator.wikimedia.org/T352245) [02:36:16] (03PS3) 10Scott French: hiera: switch codfw etcd-main cluster to cfssl/pki [puppet] - 10https://gerrit.wikimedia.org/r/1203557 (https://phabricator.wikimedia.org/T352245) [02:36:16] (03PS2) 10Scott French: hiera: move etcd replication back to conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/1206453 (https://phabricator.wikimedia.org/T352245) [02:45:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [02:47:29] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1206452 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [02:47:39] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1203557 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [02:47:45] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1206453 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [02:50:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [02:51:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:53:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [03:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T0300) [03:04:51] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11381998 (10phaultfinder) [03:09:54] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11381999 (10phaultfinder) [03:23:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [03:26:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [03:33:00] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1056.eqiad.wmnet with OS trixie [03:39:06] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1055.eqiad.wmnet with OS trixie [03:46:51] RESOLVED: 
SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [03:55:05] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage [03:58:59] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T0400) [04:02:29] (03PS1) 10TrainBranchBot: testwikis to 1.46.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206589 (https://phabricator.wikimedia.org/T408273) [04:02:31] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206589 (https://phabricator.wikimedia.org/T408273) (owner: 10TrainBranchBot) [04:03:19] (03Merged) 10jenkins-bot: testwikis to 1.46.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206589 (https://phabricator.wikimedia.org/T408273) (owner: 10TrainBranchBot) [04:03:52] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.3 refs T408273 [04:03:56] T408273: 1.46.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T408273 [04:11:45] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... 
[04:11:45] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [04:11:45] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [04:23:53] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1055.eqiad.wmnet with OS trixie [04:26:16] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1057.eqiad.wmnet with OS trixie [04:32:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:38:56] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [04:42:20] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1057.eqiad.wmnet with reason: host reimage [04:48:40] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1057.eqiad.wmnet with reason: host reimage [04:50:50] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.46.0-wmf.3 refs T408273 (duration: 46m 58s) [04:50:55] T408273: 1.46.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T408273 [05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment 
and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T0500) [05:02:40] !log mwpresync@deploy2002 Pruned MediaWiki: 1.45.0-wmf.25 (duration: 02m 38s) [05:08:24] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:08] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:14:55] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1057.eqiad.wmnet with OS trixie [05:15:40] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1058.eqiad.wmnet with OS trixie [05:20:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [05:20:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... 
[05:20:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [05:22:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:24:31] (03PS1) 10KartikMistry: Update cxserver to [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206643 (https://phabricator.wikimedia.org/T409688) [05:25:48] (03PS2) 10KartikMistry: Update cxserver to 2025-11-18-043632-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206643 (https://phabricator.wikimedia.org/T409688) [05:26:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:29:15] Deploying cxserver.. [05:29:20] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2025-11-18-043632-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206643 (https://phabricator.wikimedia.org/T409688) (owner: 10KartikMistry) [05:30:48] (03CR) 10KartikMistry: "Yes. MinT/machinetranslation is no longer using people.w.o for downloading models." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/931086 (https://phabricator.wikimedia.org/T335491) (owner: 10Clément Goubert) [05:31:29] (03Merged) 10jenkins-bot: Update cxserver to 2025-11-18-043632-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206643 (https://phabricator.wikimedia.org/T409688) (owner: 10KartikMistry) [05:31:43] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1058.eqiad.wmnet with reason: host reimage [05:32:47] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11382104 (10Papaul) [05:33:24] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:38:44] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1058.eqiad.wmnet with reason: host reimage [05:38:53] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:39:21] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:45:32] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [05:46:03] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [05:46:21] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [05:46:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:46:55] !log kartik@deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/cxserver: apply [05:47:17] !log Update cxserver to 2025-11-18-043632-production (T409688, T408515) [05:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:23] T409688: Upgrade sentencex library - https://phabricator.wikimedia.org/T409688 [05:47:23] T408515: Update Apertium service to Trixie - https://phabricator.wikimedia.org/T408515 [05:49:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:54:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:56:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:59:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:04:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - 
https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:10:15] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1206452 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [06:10:37] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update revertrisk-wikidata image in both experimental and revision-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206344 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira) [06:12:24] (03Merged) 10jenkins-bot: ml-services: update revertrisk-wikidata image in both experimental and revision-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206344 (https://phabricator.wikimedia.org/T406179) (owner: 10Kevin Bazira) [06:13:49] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [06:16:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:18:26] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11382134 (10Papaul) [06:20:25] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[06:22:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1163 with weight 0 T410282', diff saved to https://phabricator.wikimedia.org/P85350 and previous config saved to /var/cache/conftool/dbconfig/20251118-062209-marostegui.json [06:22:14] T410282: Switchover s1 master (db1184 -> db1163) - https://phabricator.wikimedia.org/T410282 [06:22:17] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894#11382136 (10MoritzMuehlenhoff) 05Resolved→03Open This misses the tracking entry in data.yaml [06:22:40] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s1 T410282 [06:22:57] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1163 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1206405 (https://phabricator.wikimedia.org/T410282) (owner: 10Gerrit maintenance bot) [06:23:04] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1203557 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [06:26:30] !log Starting s1 eqiad failover from db1184 to db1163 - T410282 [06:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1163 to s1 primary T410282', diff saved to https://phabricator.wikimedia.org/P85351 and previous config saved to /var/cache/conftool/dbconfig/20251118-062645-marostegui.json [06:27:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1184 T410282', diff saved to https://phabricator.wikimedia.org/P85353 and previous config saved to /var/cache/conftool/dbconfig/20251118-062720-marostegui.json [06:27:24] T410282: Switchover s1 master (db1184 -> db1163) - https://phabricator.wikimedia.org/T410282 [06:27:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy3002.esams.wmnet [06:27:40] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1184 gradually with 4 steps - Repooling after switchover [06:28:31] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11382155 (10MoritzMuehlenhoff) [06:28:38] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11382156 (10Papaul) @cmooney @ayouns I update the task with all the IPV4 and IPV6 addresses for the links, irb's and loopbacks. Please review and let me know i... 
[06:30:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool pc1 T405942', diff saved to https://phabricator.wikimedia.org/P85355 and previous config saved to /var/cache/conftool/dbconfig/20251118-063010-marostegui.json [06:30:14] T405942: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942 [06:30:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool ms3 T405942', diff saved to https://phabricator.wikimedia.org/P85356 and previous config saved to /var/cache/conftool/dbconfig/20251118-063048-marostegui.json [06:31:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy3002.esams.wmnet [06:34:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 18 hosts with reason: Network maint [06:34:35] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11382177 (10Marostegui) @Jclark-ctr the following hosts are ready for you to proceed. No special cookbooks or downtime are required: db1153 db1167 db1121 e... 
[06:34:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 20 hosts with reason: Network maint [06:36:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:37:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:37:32] (03PS2) 10Marostegui: installserver: Clean up es2028 [puppet] - 10https://gerrit.wikimedia.org/r/1206381 (https://phabricator.wikimedia.org/T408777) [06:37:32] (03PS1) 10Marostegui: installserver: Do not format es1055 [puppet] - 10https://gerrit.wikimedia.org/r/1206670 [06:38:18] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1058.eqiad.wmnet with OS trixie [06:41:08] (03CR) 10Marostegui: [C:03+2] installserver: Do not format es1055 [puppet] - 10https://gerrit.wikimedia.org/r/1206670 (owner: 10Marostegui) [06:41:12] (03CR) 10Marostegui: [C:03+2] installserver: Clean up es2028 [puppet] - 10https://gerrit.wikimedia.org/r/1206381 (https://phabricator.wikimedia.org/T408777) (owner: 10Marostegui) [06:42:08] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894#11382182 (10Dzahn) Yes, but It's because there is already https://gerrit.wikimedia.org/r/c/operations/puppet/+/1205192 for T409893 which will supersede it. Can be merged or rev... 
[06:42:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:44:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:49:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:50:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:51:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T0700) [07:00:05] marostegui, Amir1, and federico3: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T0700). [07:00:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:03:56] (03CR) 10Giuseppe Lavagetto: "Everything you just wrote is right:" [puppet] - 10https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [07:09:29] (03CR) 10Dzahn: "thank you! But can we remove these completely? Because they are not used anymore and that way we won't ask ourselves again when it comes t" [deployment-charts] - 10https://gerrit.wikimedia.org/r/931086 (https://phabricator.wikimedia.org/T335491) (owner: 10Clément Goubert) [07:09:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:09:57] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403273#11382211 (10phaultfinder) [07:13:09] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1184 gradually with 4 steps - Repooling after switchover [07:13:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206455 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [07:14:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - 
https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:14:57] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T403275#11382215 (10phaultfinder) [07:18:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:23:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:35:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:38:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy3001.esams.wmnet [07:41:07] (03PS1) 10Muehlenhoff: installserver: Readd es2028 with modified db-trixie partman config [puppet] - 10https://gerrit.wikimedia.org/r/1206708 (https://phabricator.wikimedia.org/T408777) [07:42:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy3001.esams.wmnet [07:43:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 18 UTC morning backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204830 (https://phabricator.wikimedia.org/T409957) (owner: 10Kosta Harlan) [07:50:13] !log installing libssh security updates [07:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy2002.codfw.wmnet [07:54:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy2002.codfw.wmnet [07:55:15] !log jmm@cumin1003 DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: dbprov1003.eqiad.wmnet [07:55:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission dbprov1003 - https://phabricator.wikimedia.org/T409524#11382308 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: dbprov1003.eqiad.wmnet [07:56:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy2001.codfw.wmnet [07:57:42] (03PS5) 10Superpes15: [kywiki] Add new rollbacker and eliminator usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205428 (https://phabricator.wikimedia.org/T410121) [08:00:05] Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T0800). [08:00:05] Superpes and kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy2001.codfw.wmnet [08:00:58] I’ll be around in 30-40 minutes and can help sync patches then. 
[08:01:39] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11382323 (10MoritzMuehlenhoff) [08:06:47] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1206453 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [08:15:03] !log installing openssl bugfix updates on trixie hosts [08:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:26] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:idp allow increase of Tomcat heap allocation [puppet] - 10https://gerrit.wikimedia.org/r/1205084 (owner: 10Slyngshede) [08:18:33] (03PS1) 10Bartosz Wójtowicz: ml-services: Update revise-tone-task-generator image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206756 (https://phabricator.wikimedia.org/T408538) [08:19:56] (03CR) 10AikoChou: [C:03+1] ml-services: Update revise-tone-task-generator image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206756 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [08:20:12] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11382376 (10MatthewVernon) [08:21:04] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update revise-tone-task-generator image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206756 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [08:22:16] (03PS2) 10Matthieulec: admin: Adding matthieulec to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1206422 (https://phabricator.wikimedia.org/T410291) [08:22:34] (03CR) 10CI reject: [V:04-1] admin: Adding matthieulec to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1206422 (https://phabricator.wikimedia.org/T410291) (owner: 10Matthieulec) [08:22:54] (03Merged) 10jenkins-bot: ml-services: Update revise-tone-task-generator image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1206756 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [08:23:53] !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [08:24:39] (03CR) 10KartikMistry: "Yes." [deployment-charts] - 10https://gerrit.wikimedia.org/r/931086 (https://phabricator.wikimedia.org/T335491) (owner: 10Clément Goubert) [08:25:24] (03CR) 10Slyngshede: [C:03+2] P:cache::base allow geoip to be disabled [puppet] - 10https://gerrit.wikimedia.org/r/1202986 (owner: 10Slyngshede) [08:25:59] PROBLEM - Host tcp-proxy1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:56] (03CR) 10Marostegui: "The change to partman also expected?" [puppet] - 10https://gerrit.wikimedia.org/r/1206708 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [08:28:58] (03PS4) 10Matthieulec: admin: Adding matthieulec to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1206422 (https://phabricator.wikimedia.org/T410291) [08:29:38] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [08:29:38] I'm back, and will start my backports [08:30:00] (03CR) 10Muehlenhoff: "Yes! It brings back db.cfg to the previous state, so that I can trigger the original error condition again." 
[puppet] - 10https://gerrit.wikimedia.org/r/1206708 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [08:30:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206455 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [08:30:10] (03CR) 10Marostegui: [C:03+1] installserver: Readd es2028 with modified db-trixie partman config [puppet] - 10https://gerrit.wikimedia.org/r/1206708 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [08:31:01] (03Merged) 10jenkins-bot: hCaptcha: Enable hCaptcha editing for fawiki, trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206455 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [08:31:05] RECOVERY - Host tcp-proxy1002 is UP: PING WARNING - Packet loss = 66%, RTA = 0.60 ms [08:31:18] (03CR) 10Dpogorzelski: [C:03+2] ml k8s: handle service start order [puppet] - 10https://gerrit.wikimedia.org/r/1206402 (owner: 10Dpogorzelski) [08:32:08] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1206455|hCaptcha: Enable hCaptcha editing for fawiki, trwiki (T405586)]] [08:32:12] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586 [08:32:43] jhancock@cumin1003 reimage (PID 1676231) is awaiting input [08:33:24] FIRING: JobUnavailable: Reduced availability for job tcp_proxy in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:35:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - 
https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:36:25] FIRING: SystemdUnitFailed: haproxy.service on tcp-proxy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:37:26] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2094.codfw.wmnet with OS bullseye [08:37:36] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11382425 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2094.codfw.wmnet with OS bullseye [08:37:37] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1206455|hCaptcha: Enable hCaptcha editing for fawiki, trwiki (T405586)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:37:40] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586 [08:38:56] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:39:43] (03PS4) 10Superpes15: [arwikimedia] Change the logo/icon and update the wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205421 (https://phabricator.wikimedia.org/T353218) [08:40:10] !log kharlan@deploy2002 kharlan: Continuing with sync [08:40:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... 
[08:40:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [08:40:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [08:46:34] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206455|hCaptcha: Enable hCaptcha editing for fawiki, trwiki (T405586)]] (duration: 14m 26s) [08:46:38] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586 [08:46:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:47:07] On to the next one [08:47:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204830 (https://phabricator.wikimedia.org/T409957) (owner: 10Kosta Harlan) [08:48:48] (03Merged) 10jenkins-bot: hCaptcha: Update passive mode config for addurl trigger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1204830 (https://phabricator.wikimedia.org/T409957) (owner: 10Kosta Harlan) [08:49:18] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1204830|hCaptcha: Update passive mode config for addurl trigger (T409957)]] [08:49:22] T409957: hCaptcha: Adjust config to not unset addurl rule if 100% passive mode is being used - https://phabricator.wikimedia.org/T409957 [08:49:26] (03PS5) 10Superpes15: [arwikimedia] Change the logo/icon and update the wordmark 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205421 (https://phabricator.wikimedia.org/T353218) [08:51:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:53:02] (03PS6) 10Superpes15: [arwikimedia] Change the logo and update the wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205421 (https://phabricator.wikimedia.org/T353218) [08:53:25] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: No free IPs on public1-ulsfo vlan (Nov 2025) - https://phabricator.wikimedia.org/T410047#11382501 (10ayounsi) See also {T367732} [08:53:42] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1204830|hCaptcha: Update passive mode config for addurl trigger (T409957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:54:51] (03PS7) 10Superpes15: [arwikimedia] Change the logo/icon and update the wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205421 (https://phabricator.wikimedia.org/T353218) [08:55:09] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "fix up sretest1005 - jmm@cumin2002" [08:56:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "fix up sretest1005 - jmm@cumin2002" [08:57:57] !log kharlan@deploy2002 kharlan: Continuing with sync [08:58:25] (03PS1) 10Filippo Giunchedi: install_server: add pause-reboot.cfg to debug boot problems [puppet] - 10https://gerrit.wikimedia.org/r/1206809 (https://phabricator.wikimedia.org/T407586) [08:58:59] (03CR) 10CI reject: [V:04-1] install_server: add pause-reboot.cfg to debug boot problems [puppet] - 10https://gerrit.wikimedia.org/r/1206809 (https://phabricator.wikimedia.org/T407586) (owner: 10Filippo Giunchedi) [08:59:22] kostajh So no time left for my patches? [08:59:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [08:59:47] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2094.codfw.wmnet with reason: host reimage [09:00:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [09:01:07] (03PS2) 10Filippo Giunchedi: install_server: add pause-reboot.cfg to debug boot problems [puppet] - 10https://gerrit.wikimedia.org/r/1206809 (https://phabricator.wikimedia.org/T407586) [09:01:19] Superpes: which ones are the highest priority? 
I can probably deploy at least some [09:02:05] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1204830|hCaptcha: Update passive mode config for addurl trigger (T409957)]] (duration: 12m 47s) [09:02:10] T409957: hCaptcha: Adjust config and logic to not unset addurl rule if 100% passive mode is being used - https://phabricator.wikimedia.org/T409957 [09:02:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205195 (https://phabricator.wikimedia.org/T353218) (owner: 10Superpes15) [09:03:01] (03PS3) 10Volans: admin: edit user ngkountas [puppet] - 10https://gerrit.wikimedia.org/r/1206416 (https://phabricator.wikimedia.org/T409854) [09:03:05] (03CR) 10Volans: "addressed comments" [puppet] - 10https://gerrit.wikimedia.org/r/1206416 (https://phabricator.wikimedia.org/T409854) (owner: 10Volans) [09:03:13] kostajh The idea was to do three together and then the logos one separately! 
If you can do at least one of these (maybe the logos), then I can schedule the other 3 for the next window :) [09:03:20] (03Merged) 10jenkins-bot: [arwikimedia] Disable local file uploading [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205195 (https://phabricator.wikimedia.org/T353218) (owner: 10Superpes15) [09:03:23] (03PS1) 10Filippo Giunchedi: installserver: set cloudcontrol2010-dev with standard recipes [puppet] - 10https://gerrit.wikimedia.org/r/1206812 (https://phabricator.wikimedia.org/T407586) [09:03:51] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1205195|[arwikimedia] Disable local file uploading (T353218)]] [09:03:56] T353218: ar.wikimedia.org - Change Project Logo / Visibility / File Uploads - https://phabricator.wikimedia.org/T353218 [09:04:26] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2094.codfw.wmnet with reason: host reimage [09:05:44] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 138915 [09:08:15] !log kharlan@deploy2002 superpes, kharlan: Backport for [[gerrit:1205195|[arwikimedia] Disable local file uploading (T353218)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:08:24] RESOLVED: JobUnavailable: Reduced availability for job tcp_proxy in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:08:27] Testing [09:09:05] Works fine :) [09:09:08] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:09:31] !log ayounsi@cumin1003 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 138915 [09:09:42] !log ayounsi@cumin1003 START - Cookbook sre.network.peering with action 'email' for AS: 397715 [09:09:51] kostajh ^^ (Sorry forgot to ping) [09:10:08] !log kharlan@deploy2002 superpes, kharlan: Continuing with sync [09:10:16] (03PS1) 10Esanders: Hackaround 2015 broken convert on ptwikibooks [extensions/Flow] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1206813 (https://phabricator.wikimedia.org/T402549) [09:10:21] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 397715 [09:10:24] (03CR) 10Vgutierrez: cache::text: introduce rate-limits by traffic class (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) (owner: 10Giuseppe Lavagetto) [09:10:30] (03PS1) 10Esanders: Hackaround 2015 broken convert on ptwikibooks [extensions/Flow] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206814 (https://phabricator.wikimedia.org/T402549) [09:10:58] !log VACUUM large container dbs on ms-be1070 T377827 [09:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:02] T377827: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - 
https://phabricator.wikimedia.org/T377827 [09:11:25] RESOLVED: SystemdUnitFailed: haproxy.service on tcp-proxy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:12:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/Flow] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1206813 (https://phabricator.wikimedia.org/T402549) (owner: 10Esanders) [09:12:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/Flow] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206814 (https://phabricator.wikimedia.org/T402549) (owner: 10Esanders) [09:14:13] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1205195|[arwikimedia] Disable local file uploading (T353218)]] (duration: 10m 22s) [09:14:17] T353218: ar.wikimedia.org - Change Project Logo / Visibility / File Uploads - https://phabricator.wikimedia.org/T353218 [09:14:50] Thanks for your help as always kostajh :3 [09:15:05] (03PS1) 10Ayounsi: Add nokia Console and PSx ports [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1206816 (https://phabricator.wikimedia.org/T410073) [09:15:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205421 (https://phabricator.wikimedia.org/T353218) (owner: 10Superpes15) [09:15:51] Superpes: sure thing [09:16:55] (03Merged) 10jenkins-bot: [arwikimedia] Change the logo/icon and update the wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205421 (https://phabricator.wikimedia.org/T353218) (owner: 10Superpes15) [09:17:27] !log 
kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1205421|[arwikimedia] Change the logo/icon and update the wordmark (T353218)]] [09:21:27] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1206385 (https://phabricator.wikimedia.org/T410270) (owner: 10Volans) [09:21:54] !log kharlan@deploy2002 kharlan, superpes: Backport for [[gerrit:1205421|[arwikimedia] Change the logo/icon and update the wordmark (T353218)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:21:58] T353218: ar.wikimedia.org - Change Project Logo / Visibility / File Uploads - https://phabricator.wikimedia.org/T353218 [09:22:09] (03CR) 10Cmelo: [C:03+1] Drop $wgCampaignEventsCountrySchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201814 (https://phabricator.wikimedia.org/T408932) (owner: 10Daimona Eaytoy) [09:22:29] (03CR) 10Muehlenhoff: "Also needs the krb: present on the user record" [puppet] - 10https://gerrit.wikimedia.org/r/1206416 (https://phabricator.wikimedia.org/T409854) (owner: 10Volans) [09:22:43] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1070.eqiad.wmnet with reason: vacuum overlarge container dbs [09:22:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:22:56] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#11382672 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6c2310e9-bbd5-42c9-9901-2414d47f819d) set by mvernon@cumin... 
[09:23:16] Superpes: can you check the patch on mwdebug? [09:23:20] Testing [09:23:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2094.codfw.wmnet with OS bullseye [09:24:07] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11382674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2094.codfw.wmnet with OS bullseye complete... [09:24:11] Looks fine! kostajh [09:24:17] !log kharlan@deploy2002 kharlan, superpes: Continuing with sync [09:25:07] (03PS4) 10Volans: admin: edit user ngkountas [puppet] - 10https://gerrit.wikimedia.org/r/1206416 (https://phabricator.wikimedia.org/T409854) [09:26:12] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11382680 (10MoritzMuehlenhoff) [09:27:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy1002.eqiad.wmnet [09:27:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:27:58] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1206416 (https://phabricator.wikimedia.org/T409854) (owner: 10Volans) [09:28:29] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1205421|[arwikimedia] Change the logo/icon and update the wordmark (T353218)]] (duration: 11m 02s) [09:28:33] T353218: ar.wikimedia.org - Change Project Logo / Visibility / File Uploads - https://phabricator.wikimedia.org/T353218 [09:28:46] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - 
https://phabricator.wikimedia.org/T405958#11382693 (10MatthewVernon) @Jhancock.wm reimage had stalled again because puppet wasn't happy, again because of an EFI/vfat partition on one of the spinni... [09:30:44] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "fixup sretest1006 - jmm@cumin1003" [09:31:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy1002.eqiad.wmnet [09:31:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:33:00] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "fixup sretest1006 - jmm@cumin1003" [09:34:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy1001.eqiad.wmnet [09:37:38] Many thanks again kostajh (sorry but I'm on a train and my connection keeps crashing)!
[09:37:50] Superpes: no problem, thanks for verifying it [09:37:55] (03PS1) 10Marostegui: db1169: Make a note [puppet] - 10https://gerrit.wikimedia.org/r/1206821 (https://phabricator.wikimedia.org/T410369) [09:38:26] !log installing curl security updates [09:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:31] (03CR) 10Marostegui: "This is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/1206821 (https://phabricator.wikimedia.org/T410369) (owner: 10Marostegui) [09:38:44] (03CR) 10Marostegui: [C:03+2] db1169: Make a note [puppet] - 10https://gerrit.wikimedia.org/r/1206821 (https://phabricator.wikimedia.org/T410369) (owner: 10Marostegui) [09:38:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy1001.eqiad.wmnet [09:42:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1169 T410369', diff saved to https://phabricator.wikimedia.org/P85361 and previous config saved to /var/cache/conftool/dbconfig/20251118-094246-marostegui.json [09:42:51] T410369: Install Debian Trixie on one s1 host - https://phabricator.wikimedia.org/T410369 [09:43:39] (03PS1) 10Marostegui: db1169: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1206822 (https://phabricator.wikimedia.org/T410369) [09:46:21] (03CR) 10Marostegui: [C:03+2] db1169: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1206822 (https://phabricator.wikimedia.org/T410369) (owner: 10Marostegui) [09:46:51] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11382802 (10MoritzMuehlenhoff) [09:47:13] (03CR) 10Volans: [C:03+2] admin: update ssh key for mfischerwmf [puppet] - 10https://gerrit.wikimedia.org/r/1206385 (https://phabricator.wikimedia.org/T410270) (owner: 10Volans) [09:47:19] (03CR) 10Volans: [C:03+2] admin: add user ankita97531 [puppet] - 10https://gerrit.wikimedia.org/r/1205192 (https://phabricator.wikimedia.org/T409893) 
(owner: 10Dzahn) [09:47:22] (03CR) 10Volans: [C:03+2] admin: edit user ngkountas [puppet] - 10https://gerrit.wikimedia.org/r/1206416 (https://phabricator.wikimedia.org/T409854) (owner: 10Volans) [09:48:09] !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-be1070.eqiad.wmnet [09:48:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be1070.eqiad.wmnet [09:48:27] !log mvernon@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1067.eqiad.wmnet with reason: vacuum overlarge container dbs [09:48:32] !log VACUUM large container dbs on ms-be1067 T377827 [09:48:33] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#11382814 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cee1089c-9a64-4ae6-8caa-f6e442f9ac23) set by mvernon@cumin... [09:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:36] T377827: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827 [09:51:00] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1206809 (https://phabricator.wikimedia.org/T407586) (owner: 10Filippo Giunchedi) [09:51:36] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894#11382841 (10Volans) 05Open→03Resolved p:05Triage→03Medium Patch merged, resolving. @AnkitaM you're currently part of the LDAP `nda` group and should be able to acces... 
[09:52:15] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for AnkitaM - https://phabricator.wikimedia.org/T409893#11382846 (10Volans) @MGerlach @AnkitaM: patch merged, it should get live within 30 minutes from now. Once you've verified all works... [09:52:49] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Update mfischerwmf ssh key - https://phabricator.wikimedia.org/T410270#11382859 (10Volans) @MFischer patch merged, it should get live within 30 minutes from now. Once you've verified all works as expected please resolve this task. [09:53:06] (03CR) 10Vgutierrez: [C:03+1] hiera: lvs/interfaces: remove public1-ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1206424 (https://phabricator.wikimedia.org/T410047) (owner: 10Ssingh) [09:53:28] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to run queries on superset.wikimedia.org for Nik Gkountas - https://phabricator.wikimedia.org/T409854#11382864 (10Volans) @ngkountas patch merged, it should get live within 30 minutes from now. Once you've verified all works as expected plea... 
[09:55:16] (03CR) 10Filippo Giunchedi: [C:03+2] install_server: add pause-reboot.cfg to debug boot problems [puppet] - 10https://gerrit.wikimedia.org/r/1206809 (https://phabricator.wikimedia.org/T407586) (owner: 10Filippo Giunchedi) [09:55:39] (03CR) 10Filippo Giunchedi: [C:03+2] installserver: set cloudcontrol2010-dev with standard recipes [puppet] - 10https://gerrit.wikimedia.org/r/1206812 (https://phabricator.wikimedia.org/T407586) (owner: 10Filippo Giunchedi) [09:56:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:57:48] (03CR) 10Filippo Giunchedi: [C:03+2] cloudcephosd: switch 1049 to single interface [puppet] - 10https://gerrit.wikimedia.org/r/1203384 (https://phabricator.wikimedia.org/T399180) (owner: 10Filippo Giunchedi) [09:58:07] (03PS1) 10D3r1ck01: session: Use fresh MW services container in CLI mode (take 2) [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206824 (https://phabricator.wikimedia.org/T405450) [09:58:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [core] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206824 (https://phabricator.wikimedia.org/T405450) (owner: 10D3r1ck01) [10:03:36] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100% [10:05:01] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "fixup sretest2010 - jmm@cumin1003" [10:05:16] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "fixup sretest2010 - jmm@cumin1003" [10:06:26] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - 
https://phabricator.wikimedia.org/T410147#11382889 (10MoritzMuehlenhoff) [10:08:38] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 64.48 ms [10:08:47] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11382907 (10fgiunchedi) [10:09:32] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Reimage to trixie [10:09:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [10:09:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [10:09:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [10:12:47] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11382932 (10MoritzMuehlenhoff) [10:12:53] 06SRE, 06Infrastructure-Foundations, 10netops: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan - https://phabricator.wikimedia.org/T409690#11382933 (10fgiunchedi) Something else I forgot: I'm assuming codfw also is applicable in this case? i.e. these hos... 
[10:13:23] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11382935 (10fgiunchedi) [10:13:40] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1169.eqiad.wmnet with OS trixie [10:13:47] !log hashar@deploy2002 Started deploy [integration/docroot@a7f5910]: build: Updating npm dependencies (linting only) [10:13:59] !log hashar@deploy2002 Finished deploy [integration/docroot@a7f5910]: build: Updating npm dependencies (linting only) (duration: 00m 11s) [10:15:12] !log mvernon@cumin2002 START - Cookbook sre.hosts.remove-downtime for ms-be1067.eqiad.wmnet [10:15:13] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be1067.eqiad.wmnet [10:15:56] (03PS1) 10Kosta Harlan: hCaptcha: Enable A/B edit test on zhwiki and jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206830 (https://phabricator.wikimedia.org/T410354) [10:16:37] 06SRE, 06cloud-services-team, 13Patch-For-Review: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11382952 (10fgiunchedi) I'm giving debugging this issue one more go, as part of this we now have `pause-reboot.cfg` included for `cloudcontrol2010... 
[10:17:31] (03CR) 10Muehlenhoff: [C:03+2] installserver: Readd es2028 with modified db-trixie partman config [puppet] - 10https://gerrit.wikimedia.org/r/1206708 (https://phabricator.wikimedia.org/T408777) (owner: 10Muehlenhoff) [10:19:23] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11382958 (10MoritzMuehlenhoff) [10:22:55] marostegui@cumin1003 reimage (PID 1965681) is awaiting input [10:24:18] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1169.eqiad.wmnet with OS trixie [10:24:38] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1169.eqiad.wmnet with OS trixie [10:25:12] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on backup[1006-1007].eqiad.wmnet,ms-backup[1001-1002].eqiad.wmnet with reason: Network maintenance [10:25:19] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for lpintscher - https://phabricator.wikimedia.org/T409933#11382972 (10Lydia_Pintscher) 05In progress→03Resolved a:03Lydia_Pintscher Works great now. Thank you! [10:25:21] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11382975 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8cc3d2b8-5b7e-411e-aa85-6a4983c97ec1) set by jynus@cumin1003 for 1 day, 0:00:0... 
[10:26:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:27:04] !log installing libxml2 security updates [10:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:36] (03PS2) 10Tiziano Fogli: check_icinga: add flags to suppress notifications/pages [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/1204891 (https://phabricator.wikimedia.org/T393625) [10:27:43] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11382978 (10jcrespo) Media backups processing on eqiad is stopped and the following hosts have been downtimed for 24 hours from now: ` backup1006 backup10... 
[10:28:58] (03PS4) 10Federico Ceratto: sre.mysql.clone: Refactor, Pool in source host ASAP [cookbooks] - 10https://gerrit.wikimedia.org/r/1202673 (https://phabricator.wikimedia.org/T410376) [10:30:08] !log filippo@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [10:30:57] (03PS1) 10Majavah: P:toolforge: haproxy: Increase tune.maxrewrite [puppet] - 10https://gerrit.wikimedia.org/r/1206834 (https://phabricator.wikimedia.org/T410352) [10:31:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:32:30] (03PS5) 10Federico Ceratto: sre.mysql.clone: Refactor, Pool in source host ASAP [cookbooks] - 10https://gerrit.wikimedia.org/r/1202673 (https://phabricator.wikimedia.org/T410376) [10:32:58] marostegui@cumin1003 reimage (PID 1975807) is awaiting input [10:33:43] (03PS5) 10Sergio Gimeno: EventStreamConfig: add stream for Growth and Editing team edit rates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177) [10:33:55] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1206834 (https://phabricator.wikimedia.org/T410352) (owner: 10Majavah) [10:34:14] (03CR) 10Majavah: [C:03+2] P:toolforge: haproxy: Increase tune.maxrewrite [puppet] - 10https://gerrit.wikimedia.org/r/1206834 (https://phabricator.wikimedia.org/T410352) (owner: 10Majavah) [10:34:18] (03PS1) 10David Caro: toolforge:haproxy: added limit rates to the logs [puppet] - 10https://gerrit.wikimedia.org/r/1206835 (https://phabricator.wikimedia.org/T410352) [10:34:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 18 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno) [10:34:45] (03CR) 10David Caro: "Tested in toolsbeta:" [puppet] - 10https://gerrit.wikimedia.org/r/1206835 (https://phabricator.wikimedia.org/T410352) (owner: 10David Caro) [10:35:10] 06SRE, 10SRE-Access-Requests: Update mfischerwmf ssh key - https://phabricator.wikimedia.org/T410270#11383005 (10MFischer) 05Open→03Resolved a:03MFischer Thank you kindly @Volans! :) [10:35:27] (03CR) 10Tiziano Fogli: [C:03+2] "Yeah, I know ... I'm sorry but I forgot to make two separate commits. I tried to simplify the review by leaving comments on the real chang" [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/1204891 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [10:35:50] (03Merged) 10jenkins-bot: check_icinga: add flags to suppress notifications/pages [software/external-monitoring] - 10https://gerrit.wikimedia.org/r/1204891 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [10:35:56] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11383008 (10MoritzMuehlenhoff) [10:39:52] (03CR) 10CI reject: [V:04-1] sre.mysql.clone: Refactor, Pool in source host ASAP [cookbooks] - 10https://gerrit.wikimedia.org/r/1202673 (https://phabricator.wikimedia.org/T410376) (owner: 10Federico Ceratto) [10:40:22] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1169.eqiad.wmnet with OS trixie [10:41:49] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1169.eqiad.wmnet with OS trixie [10:41:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - 
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:42:25] (03PS6) 10Federico Ceratto: sre.mysql.clone: Refactor, Pool in source host ASAP [cookbooks] - 10https://gerrit.wikimedia.org/r/1202673 (https://phabricator.wikimedia.org/T410376) [10:45:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:49:00] (03CR) 10Muehlenhoff: "sre.ganeti.makevm will also kick off the initial reimage, so I would really untangle the VM creation and the setup of the service (which w" [puppet] - 10https://gerrit.wikimedia.org/r/1206448 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [10:50:07] marostegui@cumin1003 reimage (PID 1992032) is awaiting input [10:50:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:51:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:51:27] (03CR) 10Muehlenhoff: Add cloudidp2001-dev (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1206448 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [10:51:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - 
https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:53:39] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1169.eqiad.wmnet with OS trixie [10:56:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:59:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T1100) [11:01:06] !log marostegui@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1169.eqiad.wmnet'] [11:02:52] (03CR) 10Marostegui: [C:03+1] sre.mysql.clone: Refactor, Pool in source host ASAP [cookbooks] - 10https://gerrit.wikimedia.org/r/1202673 (https://phabricator.wikimedia.org/T410376) (owner: 10Federico Ceratto) [11:04:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [11:04:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - 
https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:05:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [11:09:14] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [11:09:24] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1169.eqiad.wmnet'] [11:09:44] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [11:09:48] !log marostegui@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1169.eqiad.wmnet'] [11:09:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:09:53] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1169.eqiad.wmnet'] [11:09:56] !log marostegui@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1169.eqiad.wmnet'] [11:11:08] (03PS1) 10Santiago Faci: Test Kitchen UI: Deploying v1.1.2 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206840 (https://phabricator.wikimedia.org/T409546) [11:11:34] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Add an option to the reimage cookbook to also update firmware - https://phabricator.wikimedia.org/T410384 (10MoritzMuehlenhoff) 03NEW [11:13:04] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1206835 (https://phabricator.wikimedia.org/T410352) (owner: 10David Caro) [11:14:59] !log installing qemu security updates [11:15:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:36] 06SRE, 10SRE-Access-Requests: Update to FIDO backed production SSH key for btullis - https://phabricator.wikimedia.org/T409279#11383161 (10Volans) @BTullis can you confirm all is working fine and we can resolve this task? [11:17:10] FIRING: BFDdown: BFD session down between cr1-drmrs and 2a02:ec80:600:fe01::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:18:19] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1169.eqiad.wmnet'] [11:18:29] !log marostegui@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1169.eqiad.wmnet'] [11:18:45] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: codfw 1 VM request for codfw1dev CAS test/dev, hostname: cloudidp2001-dev - https://phabricator.wikimedia.org/T410294#11383162 (10MoritzMuehlenhoff) Specs look good [11:19:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [11:20:07] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [11:21:30] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, and 2 others: Site: codfw 1 VM request for codfw1dev CAS test/dev, hostname: cloudidp2001-dev - https://phabricator.wikimedia.org/T410294#11383168 (10taavi) [11:22:10] RESOLVED: BFDdown: BFD session down between cr1-drmrs and 2a02:ec80:600:fe01::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown 
[11:22:22] (03PS1) 10Tiziano Fogli: metamonitoring/icinga/ext-mon: add dummy smtp auth info [labs/private] - 10https://gerrit.wikimedia.org/r/1206846 (https://phabricator.wikimedia.org/T393625) [11:22:44] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [11:23:04] (03CR) 10Tiziano Fogli: [C:03+2] metamonitoring/icinga/ext-mon: add dummy smtp auth info [labs/private] - 10https://gerrit.wikimedia.org/r/1206846 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [11:23:07] (03CR) 10Tiziano Fogli: [V:03+2 C:03+2] metamonitoring/icinga/ext-mon: add dummy smtp auth info [labs/private] - 10https://gerrit.wikimedia.org/r/1206846 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [11:23:31] filippo@cumin1003 reimage (PID 1978293) is awaiting input [11:23:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [11:24:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [11:25:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [11:25:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2003.codfw.wmnet [11:25:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wikidata: apply [11:25:46] (03CR) 10Vgutierrez: [C:03+1] "nice job" [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [11:26:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wikidata: apply [11:26:22] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1169.eqiad.wmnet'] [11:26:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [11:27:15] 
!log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db1169.eqiad.wmnet with OS trixie [11:27:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [11:29:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2003.codfw.wmnet [11:29:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:30:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:33:42] 10ops-eqiad, 06DC-Ops: PXE failing on db1169 - https://phabricator.wikimedia.org/T410388 (10Marostegui) 03NEW [11:34:14] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1169.eqiad.wmnet with OS trixie [11:34:50] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: Add an option to the reimage cookbook to also update firmware - https://phabricator.wikimedia.org/T410384#11383237 (10Marostegui) The host can be rebooted if needed as many times as needed - it is out of the load balancer and mariadb is stopped. 
[11:35:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:35:22] (03PS3) 10Clément Goubert: rest-gateway: Point to DC-local mw-api-ext deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204865 (https://phabricator.wikimedia.org/T408223) [11:35:29] 10ops-eqiad, 06DC-Ops: PXE failing on db1169 - https://phabricator.wikimedia.org/T410388#11383239 (10Marostegui) The host can be rebooted as many times as needed - it is out of the load balancer and mariadb is stopped. [11:39:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:41:45] (03PS1) 10Ayounsi: Outbound saturation: add transport interfaces [alerts] - 10https://gerrit.wikimedia.org/r/1206849 (https://phabricator.wikimedia.org/T409330) [11:43:04] hnowlan: should we worry about those wikifeeds alerts? 
[11:43:20] (03CR) 10CI reject: [V:04-1] Outbound saturation: add transport interfaces [alerts] - 10https://gerrit.wikimedia.org/r/1206849 (https://phabricator.wikimedia.org/T409330) (owner: 10Ayounsi) [11:43:47] (03PS1) 10Urbanecm: beta: Set wgGEReviseToneRecommendationProvider to subpage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206850 (https://phabricator.wikimedia.org/T407356) [11:44:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:45:28] (03PS1) 10Novem Linguae: undeploy Extension:Capiunto [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206851 (https://phabricator.wikimedia.org/T410172) [11:46:05] (03PS4) 10Clément Goubert: rest-gateway: Point to DC-local mw-api-ext deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204865 (https://phabricator.wikimedia.org/T408223) [11:46:10] (03CR) 10Clément Goubert: "Yes, I was actually thinking about something similar. 
I've reopened https://phabricator.wikimedia.org/T401396 for discussion in a more sui" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1204865 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [11:48:45] vgutierrez: depends on your definition of "we" https://phabricator.wikimedia.org/T410296 :P [11:48:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:49:21] (but yes) [11:53:31] I'm going to roll restart wikifeeds just to see if anything changes, but it seems likely this is a mobileapps issue [11:53:33] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: sync [11:53:46] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: sync [11:54:53] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: migrate /api/rest_v1/ sandbox to Special:RestSandbox [puppet] - 10https://gerrit.wikimedia.org/r/1190754 (https://phabricator.wikimedia.org/T396807) (owner: 10Aaron Schulz) [11:55:41] hnowlan: I str roll restarting both on Thursday evening [11:55:49] Lots of pods in unknown states and whatnot [11:56:47] not surprising given how much OOMkills are part of expected behaviour after our last dance with its performance [11:56:53] 06SRE: Authorize blake for Icinga tasks - https://phabricator.wikimedia.org/T410390 (10Blake) 03NEW [11:57:02] hnowlan: yah [11:57:12] there's a little context in the ticket, but it seems like there might have been a code change in mobileapps that has had a knock-on effect [11:57:20] jouncebot: nowandnext [11:57:20] For the next 0 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T1100) [11:57:21] In 1 hour(s) and 2 
minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T1300) [11:57:33] (03CR) 10Hnowlan: [C:03+2] trafficserver: Route group1 /page/lint(.*) to the rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1199034 (https://phabricator.wikimedia.org/T384216) (owner: 10Aaron Schulz) [11:57:37] (03PS2) 10Ayounsi: Outbound saturation: add transport interfaces [alerts] - 10https://gerrit.wikimedia.org/r/1206849 (https://phabricator.wikimedia.org/T409330) [11:57:37] (03PS1) 10Ayounsi: Add alerting for core link saturation [alerts] - 10https://gerrit.wikimedia.org/r/1206855 (https://phabricator.wikimedia.org/T409330) [11:57:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1003.eqiad.wmnet [11:58:50] (03PS7) 10Giuseppe Lavagetto: cache::text: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203055 (https://phabricator.wikimedia.org/T406555) [11:58:50] (03PS4) 10Giuseppe Lavagetto: cache::upload: introduce rate-limits by traffic class [puppet] - 10https://gerrit.wikimedia.org/r/1203297 (https://phabricator.wikimedia.org/T406555) [11:58:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:59:07] !log installing rabbitmq-server security updates [11:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1003.eqiad.wmnet [12:05:38] (03CR) 10Clément Goubert: [C:03+1] wikikube: decommission worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048] [puppet] - 10https://gerrit.wikimedia.org/r/1205225 (https://phabricator.wikimedia.org/T409102) 
(owner: 10Jasmine) [12:07:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people2004.codfw.wmnet [12:09:08] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:10:51] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::elasticsearch::haproxy: Use firewall wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1203425 (owner: 10Majavah) [12:10:57] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:11:14] !incidents [12:11:15] Could not fetch teams from the api, sorry [12:11:15] could not find the team [12:11:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2004.codfw.wmnet [12:12:04] (03PS1) 10Blake: Authorize blake for icinga tasks [puppet] - 10https://gerrit.wikimedia.org/r/1206858 (https://phabricator.wikimedia.org/T410390) [12:12:24] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::elasticsearch::haproxy: Enable native Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1203426 (https://phabricator.wikimedia.org/T343885) (owner: 10Majavah) [12:12:31] jhathaway: I'm in a therapy session, you around? 
I can jump in if needed but I'd rather not '^^ [12:13:24] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:14:26] Raine: I'm around looking [12:15:01] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11383363 (10MatthewVernon) A couple of notes on extracting thumbnail size from `uri_path` - a [[ https://phabricator.wikimedia.org/T360589... [12:15:06] thank you <3 lmk if you need extra hands, there's already a few in the other channel [12:15:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people1005.eqiad.wmnet [12:15:57] FIRING: [2x] ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:16:12] (03PS1) 10KartikMistry: Update Recommendation API to 2025-11-17-092813-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206859 (https://phabricator.wikimedia.org/T406854) [12:17:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:18:24] FIRING: [2x] ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:18:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host people1005.eqiad.wmnet [12:19:08] FIRING: [2x] ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:19:36] (03CR) 10Ayounsi: "I also had a look at the .sh file, the overall logic lgtm but don't rely on me to find a bash bug in there :)" [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [12:19:44] (03CR) 10Ayounsi: [C:03+1] UEFI: dup partition on MD RAID boxes [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [12:19:47] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: sync [12:19:51] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [12:20:35] !incidents [12:20:36] 7012 (ACKED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [12:20:57] RESOLVED: [2x] ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:22:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:23:24] RESOLVED: [2x] ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:24:36] (03CR) 10Cathal Mooney: [C:03+1] "Thanks <3" [puppet] - 
10https://gerrit.wikimedia.org/r/1206424 (https://phabricator.wikimedia.org/T410047) (owner: 10Ssingh) [12:34:10] FIRING: BFDdown: BFD session down between cr2-drmrs and fe80::5e5e:ab00:103d:83c7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:34:58] 07sre-alert-triage, 06serviceops: Alert in need of triage: KubernetesWorkerUnschedulable - https://phabricator.wikimedia.org/T400969#11383492 (10Clement_Goubert) Silencing for 3 months. [12:36:22] (03PS1) 10Bartosz Wójtowicz: ml-services: Update revise-tone-task-generator image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206864 (https://phabricator.wikimedia.org/T408538) [12:37:06] (03PS1) 10David Caro: rados_quota_exporter: use secondary file [puppet] - 10https://gerrit.wikimedia.org/r/1206866 [12:38:08] (03CR) 10AikoChou: [C:03+1] ml-services: Update revise-tone-task-generator image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206864 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [12:38:31] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update revise-tone-task-generator image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1206864 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [12:38:56] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:39:01] (03CR) 10David Caro: [V:03+1] "Manually tested in cloudcontrol1007" [puppet] - 10https://gerrit.wikimedia.org/r/1206866 (owner: 10David Caro) [12:39:10] RESOLVED: BFDdown: BFD session down between cr2-drmrs and fe80::5e5e:ab00:103d:83c7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:39:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [12:39:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [12:39:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [12:40:09] (03Merged) 10jenkins-bot: ml-services: Update revise-tone-task-generator image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1206864 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [12:40:42] (03CR) 10David Caro: [C:03+2] toolforge:haproxy: added limit rates to the logs [puppet] - 10https://gerrit.wikimedia.org/r/1206835 (https://phabricator.wikimedia.org/T410352) (owner: 10David Caro) [12:42:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:43:27] !log bwojtowicz@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:44:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:46:51] (03CR) 10Clément Goubert: [C:03+1] "Will need an additional CR for `2052-2054,2063,2079-2084,2096-2101,2116-2123,2216-2241`" [puppet] - 10https://gerrit.wikimedia.org/r/1205225 (https://phabricator.wikimedia.org/T409102) (owner: 10Jasmine) [12:49:20] !log cgoubert@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on 74 hosts with reason: up for decom [12:49:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:53:13] !log cgoubert@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on 74 hosts with reason: up for decom [12:54:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:55:11] (03CR) 10Sergio Gimeno: [C:03+1] beta: Set wgGEReviseToneRecommendationProvider to subpage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206850 (https://phabricator.wikimedia.org/T407356) (owner: 10Urbanecm) [12:58:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206850 (https://phabricator.wikimedia.org/T407356) (owner: 10Urbanecm) [12:59:52] (03Merged) 10jenkins-bot: beta: Set wgGEReviseToneRecommendationProvider to subpage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206850 (https://phabricator.wikimedia.org/T407356) (owner: 10Urbanecm) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T1300) [13:02:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es1033 kernel reboot', diff saved to https://phabricator.wikimedia.org/P85363 and previous config saved to /var/cache/conftool/dbconfig/20251118-130200-marostegui.json [13:02:31] !log Reboot es1033 (Debian trixie) for kernel upgrade [13:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:05:40] Need to do a 
rolling restart of eqiad kafka-main brokers to pick up a new SSL cert, heads up Raine jhathaway [13:06:56] thanks claime, ack [13:07:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host hcaptcha-proxy7002.wikimedia.org with OS trixie [13:07:20] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11383539 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002... [13:07:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:07:56] OK to deploy recommendation API? [13:08:34] claime: ^^ [13:08:36] kart_: should be fine I think [13:08:45] Thanks. [13:09:02] kafka roll restart should not affect the kafka service in any major way [13:09:08] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:09:29] (03CR) 10KartikMistry: [C:03+2] Update Recommendation API to 2025-11-17-092813-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206859 (https://phabricator.wikimedia.org/T406854) (owner: 10KartikMistry) [13:09:59] Hmm crap, leaders are really unbalanced :/ [13:10:08] I need an adult cc brouberol elukey [13:10:28] Should we rebalance before running a roll-restart https://grafana.wikimedia.org/goto/ZJDYD2iDg?orgId=1 [13:10:45] In which case, should we actually take up the opportunity to do the actual reboot campaign? 
[13:11:14] Luca is out this week [13:11:17] The ssl cert will expire in 2 days (1 week before expiry may be a little short for that alert) [13:11:22] moritzm: ack, sorry about that [13:11:33] (03Merged) 10jenkins-bot: Update Recommendation API to 2025-11-17-092813-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206859 (https://phabricator.wikimedia.org/T406854) (owner: 10KartikMistry) [13:11:45] akosiaris, could use your opinion in lieu of e.lukey then [13:12:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:13:21] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:14:33] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[13:16:41] !log marostegui@cumin1003 START - Cookbook sre.mysql.upgrade for es1033.eqiad.wmnet [13:16:51] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool es1033 - Upgrading es1033.eqiad.wmnet [13:16:58] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es1033 - Upgrading es1033.eqiad.wmnet [13:18:06] claime: they've been unbalanced by that much for 6+ months heh [13:18:30] imo reboot one of the heaviest-loaded ones first heh [13:18:43] cdanis: I know, but we haven't had to do major operations on them in that span of time iirc [13:20:00] claime: could do https://wikitech.wikimedia.org/wiki/Kafka/Administration#Replica_Elections perhaps [13:20:52] or maybe actually you need the reassign-partitions stuff later on eh [13:21:03] sorry I'll stop yapping and continue drinking my coffee [13:25:33] Using topicmappr, it comes out to a 421GB relocation volume and a ton of leader changes [13:28:40] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for es1033.eqiad.wmnet [13:29:48] !log kartik@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:31:11] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool es1033 gradually with 4 steps - Repooling after upgrade [13:31:21] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) es1033 gradually with 4 steps - Repooling after upgrade [13:31:32] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool es1033 gradually with 4 steps - Repooling after upgrade [13:34:33] !log kartik@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:35:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11383609 (10Clement_Goubert) >>! 
In T405950#11379425, @RobH wrote: > @Clement_Goubert, > > Is it possible that I could send the commands for this or do we need someone in your... [13:35:51] !log Update Recommendation API to 2025-11-17-092813-production (T406854) [13:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:55] T406854: Collection data unavailable in several rec-api hosts - https://phabricator.wikimedia.org/T406854 [13:38:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [13:38:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [13:38:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [13:43:49] (03PS1) 10Dpogorzelski: ml-services: update llm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206870 [13:44:05] (03CR) 10Dpogorzelski: [C:03+2] ml-services: update llm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206870 (owner: 10Dpogorzelski) [13:44:32] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11383645 (10MoritzMuehlenhoff) [13:45:43] (03Merged) 10jenkins-bot: ml-services: update llm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206870 (owner: 10Dpogorzelski) [13:46:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host zuul1001.eqiad.wmnet [13:50:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host zuul1001.eqiad.wmnet [13:50:42] !log dpogorzelski@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . 
[13:51:29] 10ops-eqiad, 06SRE, 06DC-Ops: PXE failing on db1169 - https://phabricator.wikimedia.org/T410388#11383688 (10Jclark-ctr) @Marostegui Per @cmooney We should be using cookbook ` cookbook sre.hosts.reimage --new --uefi --no82 --os -t We need to reimage these UEFI with HTTP mod... [13:52:49] 10ops-eqiad, 06SRE, 06DC-Ops: PXE failing on db1169 - https://phabricator.wikimedia.org/T410388#11383695 (10Jclark-ctr) @cmooney is there any work around for getting legacy bios? [13:53:21] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:55:12] (03CR) 10Filippo Giunchedi: "LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1206866 (owner: 10David Caro) [13:56:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:57:08] 10ops-eqiad, 06SRE, 06DC-Ops: PXE failing on db1169 - https://phabricator.wikimedia.org/T410388#11383708 (10Marostegui) @cmooney at the moment our databases are only installed via legacy BIOS - does this mean we cannot reimage databases connected to nokia switches? 
[13:57:22] 10ops-eqiad, 06SRE, 06DC-Ops: PXE failing on db1169 - https://phabricator.wikimedia.org/T410388#11383710 (10Jclark-ctr) a:03Jclark-ctr [13:58:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [13:59:51] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host hcaptcha-proxy7002.wikimedia.org with OS trixie [14:00:00] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11383719 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for... [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T1400). [14:00:05] edsanders, Superpes, and Sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:09] o/ [14:00:26] Hi [14:01:00] I can’t deploy, sorry [14:01:08] *maybe* later in the hour, if the meeting finishes early ^^ [14:01:33] (03PS1) 10Alexandros Kosiaris: admin: Add (akosiaris) FIDO ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1206873 [14:01:40] I can deploy [14:02:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [extensions/Flow] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1206813 (https://phabricator.wikimedia.org/T402549) (owner: 10Esanders) [14:02:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [extensions/Flow] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206814 (https://phabricator.wikimedia.org/T402549) (owner: 10Esanders) [14:02:31] Looks like it's all config changes [14:04:11] (03Merged) 10jenkins-bot: Hackaround 2015 broken convert on ptwikibooks [extensions/Flow] (wmf/1.46.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1206813 (https://phabricator.wikimedia.org/T402549) (owner: 10Esanders) [14:04:19] (03Merged) 10jenkins-bot: Hackaround 2015 broken convert on ptwikibooks [extensions/Flow] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206814 (https://phabricator.wikimedia.org/T402549) (owner: 10Esanders) [14:04:55] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1206813|Hackaround 2015 broken convert on ptwikibooks (T402549)]], [[gerrit:1206814|Hackaround 2015 broken convert on ptwikibooks (T402549)]] [14:04:59] T402549: ptwikibooks: Convert LQT pages to Flow - https://phabricator.wikimedia.org/T402549 [14:08:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... 
[14:08:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [14:08:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [14:08:57] 10ops-eqiad, 06SRE, 06DC-Ops: PXE failing on db1169 - https://phabricator.wikimedia.org/T410388#11383743 (10Marostegui) As a long term follow up: T410400 [14:09:14] FIRING: KubernetesDeploymentUnavailableReplicas: ... [14:09:14] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [14:09:14] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [14:09:38] !log esanders@deploy2002 esanders: Backport for [[gerrit:1206813|Hackaround 2015 broken convert on ptwikibooks (T402549)]], [[gerrit:1206814|Hackaround 2015 broken convert on ptwikibooks (T402549)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:10:21] !log filippo@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [14:10:46] !log esanders@deploy2002 esanders: Continuing with sync [14:11:56] o/ [14:12:50] Superpes: shall I do your configs together? [14:13:03] or do you want to self deploy? 
[14:13:10] @edsanders Yes please :) [14:13:23] I can't deploy [14:14:12] No issue in deploying them together (they're both quite easy) [14:15:05] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206813|Hackaround 2015 broken convert on ptwikibooks (T402549)]], [[gerrit:1206814|Hackaround 2015 broken convert on ptwikibooks (T402549)]] (duration: 10m 09s) [14:15:09] T402549: ptwikibooks: Convert LQT pages to Flow - https://phabricator.wikimedia.org/T402549 [14:15:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205428 (https://phabricator.wikimedia.org/T410121) (owner: 10Superpes15) [14:15:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205443 (https://phabricator.wikimedia.org/T410199) (owner: 10Superpes15) [14:16:32] (03Merged) 10jenkins-bot: [kywiki] Add new rollbacker and eliminator usergroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205428 (https://phabricator.wikimedia.org/T410121) (owner: 10Superpes15) [14:16:35] (03Merged) 10jenkins-bot: [dewiki] Enable SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1205443 (https://phabricator.wikimedia.org/T410199) (owner: 10Superpes15) [14:16:56] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es1033 gradually with 4 steps - Repooling after upgrade [14:16:59] !log filippo@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [14:17:05] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1205428|[kywiki] Add new rollbacker and eliminator usergroups (T410121)]], [[gerrit:1205443|[dewiki] Enable SandboxLink extension (T410199)]] [14:17:11] T410121: Enable “Rollbacker” and “eliminator” user groups on kywiki - https://phabricator.wikimedia.org/T410121 [14:17:11] T410199: 
Enable SandboxLink on German Wikipedia - https://phabricator.wikimedia.org/T410199 [14:18:51] 10ops-eqiad, 06SRE, 06DC-Ops: PXE failing on db1169 - https://phabricator.wikimedia.org/T410388#11383774 (10cmooney) @Marostegui my apologies I should have perhaps discussed this more widely with the team. I was under the impression there was no barrier to using UEFI mode on any of our current hardware, but... [14:22:06] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11383782 (10Jhancock.wm) [14:22:24] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install ms-be209[0-4] - https://phabricator.wikimedia.org/T405958#11383784 (10Jhancock.wm) 05Open→03Resolved @MatthewVernon thanks for your help! [14:22:48] (03PS1) 10Alexandros Kosiaris: Empty maintenance_hosts array [puppet] - 10https://gerrit.wikimedia.org/r/1206876 (https://phabricator.wikimedia.org/T400442) [14:22:50] (03PS1) 10Alexandros Kosiaris: Cleanup maintenance_hosts hiera variable use [puppet] - 10https://gerrit.wikimedia.org/r/1206877 (https://phabricator.wikimedia.org/T400442) [14:22:52] !log esanders@deploy2002 superpes, esanders: Backport for [[gerrit:1205428|[kywiki] Add new rollbacker and eliminator usergroups (T410121)]], [[gerrit:1205443|[dewiki] Enable SandboxLink extension (T410199)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:22:58] T410121: Enable “Rollbacker” and “eliminator” user groups on kywiki - https://phabricator.wikimedia.org/T410121 [14:22:59] Testing [14:22:59] T410199: Enable SandboxLink on German Wikipedia - https://phabricator.wikimedia.org/T410199 [14:23:51] @edsanders Both look fine! 
[14:23:55] Thanks :) [14:23:55] !log esanders@deploy2002 superpes, esanders: Continuing with sync [14:24:09] 10ops-eqiad, 06SRE, 06DC-Ops: PXE failing on db1169 - https://phabricator.wikimedia.org/T410388#11383794 (10Marostegui) Thanks for the answer @cmooney - I've been talking to @MoritzMuehlenhoff about it and we went ahead and created T410400 As I mentioned to him, we'd need I/F to help with that but we'd have... [14:24:45] sergi0: are you able to self deploy next? [14:25:05] sure, only my change right? [14:25:18] yes - just you left [14:25:23] yep, np [14:26:00] 10ops-eqiad, 06SRE, 06DC-Ops: PXE failing on db1169 - https://phabricator.wikimedia.org/T410388#11383799 (10Marostegui) [14:27:57] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1205428|[kywiki] Add new rollbacker and eliminator usergroups (T410121)]], [[gerrit:1205443|[dewiki] Enable SandboxLink extension (T410199)]] (duration: 10m 52s) [14:28:03] T410121: Enable “Rollbacker” and “eliminator” user groups on kywiki - https://phabricator.wikimedia.org/T410121 [14:28:03] T410199: Enable SandboxLink on German Wikipedia - https://phabricator.wikimedia.org/T410199 [14:28:17] @esanders Many thanks for your assistance :3 [14:28:40] (03CR) 10Jforrester: [C:03+1] "LGTM, do you want to schedule this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206851 (https://phabricator.wikimedia.org/T410172) (owner: 10Novem Linguae) [14:29:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sgimeno@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno) [14:30:14] any deployers around? 
if so will add one to the window right now [14:30:19] (03Merged) 10jenkins-bot: EventStreamConfig: add stream for Growth and Editing team edit rates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1203812 (https://phabricator.wikimedia.org/T405177) (owner: 10Sergio Gimeno) [14:30:52] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1203812|EventStreamConfig: add stream for Growth and Editing team edit rates (T405177)]] [14:30:56] T405177: Revise Tone: Instrumentation - https://phabricator.wikimedia.org/T405177 [14:31:37] @NovemLinguae I can deploy it after mine finishes [14:31:48] ok, ready [14:31:56] thanks. is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1206851 [14:32:34] Could you add it to the window in https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T1400 [14:33:03] sure. doing [14:34:03] done. https://wikitech.wikimedia.org/w/index.php?diff=2362336 [14:36:00] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1203812|EventStreamConfig: add stream for Growth and Editing team edit rates (T405177)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:36:04] T405177: Revise Tone: Instrumentation - https://phabricator.wikimedia.org/T405177 [14:36:58] !log sgimeno@deploy2002 sgimeno: Continuing with sync [14:40:57] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1203812|EventStreamConfig: add stream for Growth and Editing team edit rates (T405177)]] (duration: 10m 05s) [14:41:17] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1059.eqiad.wmnet with OS trixie [14:43:23] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1206851|undeploy Extension:Capiunto (T410172)]] [14:43:27] T410172: Drop the decade-stalled Capiunto experimental extension from production - https://phabricator.wikimedia.org/T410172 [14:43:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:48:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:48:59] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [14:48:59] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... 
[14:48:59] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [14:51:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:55:28] scap has been stuck for 10 min now at `K8s images build/push output redirected to /var/lib/spiderpig/scap-image-build-and-push-log`, not sure how to check what's going on. Any deployer around? [14:56:22] sergi0: in the logs shown in spiderpig, do you see l10n updates? [14:57:20] the logs say they finished: `14:45:36 Finished l10n-update (duration: 02m 10s)` [14:57:39] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1059.eqiad.wmnet with reason: host reimage [14:57:46] sergi0: `14:44:17 542 languages rebuilt out of 542` - all that matters is that they happened [14:58:33] i.e., since they did, this triggers a full image rebuild, meaning ~20 minutes for the build, followed by ~20 minutes for the deployment [14:58:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1208 - https://phabricator.wikimedia.org/T409938#11383988 (10BTullis) No further kernel messages from `an-worker1208` about hard drives since yesterday. ` btullis@an-worker1208:~$ sudo dmesg -T|tail [Mon... 
[14:59:23] gotcha, I was scared from not seeing updates, thanks for clarifying [15:00:05] Deploy window Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T1500) [15:01:03] (03CR) 10Samuel (WMF): [C:03+1] hCaptcha: Enable A/B edit test on zhwiki and jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206830 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan) [15:02:06] (03CR) 10Btullis: [C:03+2] DNS: Add druid-public-coordinator record [dns] - 10https://gerrit.wikimedia.org/r/1198500 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [15:02:38] !log btullis@dns1004 START - running authdns-update [15:02:42] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:03:13] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1059.eqiad.wmnet with reason: host reimage [15:03:37] !log btullis@dns1004 END - running authdns-update [15:05:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [15:06:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [15:08:26] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:14] filippo@cumin1003 reimage (PID 2210505) is awaiting input [15:11:07] !log sgimeno@deploy2002 novemlinguae, sgimeno: Backport for [[gerrit:1206851|undeploy Extension:Capiunto (T410172)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:11:11] T410172: Drop the decade-stalled Capiunto experimental extension from production - https://phabricator.wikimedia.org/T410172 [15:11:20] (03PS2) 10Tiziano Fogli: metamonitoring/icinga: suppress script-managed notifications and pages [puppet] - 10https://gerrit.wikimedia.org/r/1206884 (https://phabricator.wikimedia.org/T393625) [15:11:25] (03PS3) 10Tiziano Fogli: metamonitoring/icinga: add smtp settings to config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1206885 (https://phabricator.wikimedia.org/T393625) [15:11:30] (03PS4) 10Tiziano Fogli: metamonitoring/icinga: generate contacts list [puppet] - 10https://gerrit.wikimedia.org/r/1206886 (https://phabricator.wikimedia.org/T393625) [15:11:46] @NovemLinguae are you around, can you test please? [15:12:14] This one may be hard to manually test. The extension was only deployed to wikidataclient-test wiki (I'm not even sure the domain for that one) [15:12:26] I can do a couple sanity checks to make sure major wikis aren't down. [15:12:42] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:12:54] Do you know if beta cluster config changes make it into MWDebug extension? If so I can test that, this was deployed to almost every beta wiki. [15:13:45] hmm, I'm not sure about that [15:13:48] wikidataclient-test is a dblist, not a wiki. see https://noc.wikimedia.org/conf/highlight.php?file=dblists/wikidataclient-test.dblist [15:14:09] and beta wikis operate outside of MWDebug. the changes are getting deployed 30 minutes after merged. [15:14:26] thanks. testwiki is in that dblist. will check there and report back. [15:15:41] Test looks good. OK to proceed. 
[15:16:47] (03PS2) 10David Caro: rados_quota_exporter: use secondary file [puppet] - 10https://gerrit.wikimedia.org/r/1206866 [15:16:47] (03CR) 10David Caro: rados_quota_exporter: use secondary file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1206866 (owner: 10David Caro) [15:16:59] !log sgimeno@deploy2002 novemlinguae, sgimeno: Continuing with sync [15:17:05] (03CR) 10David Caro: rados_quota_exporter: use secondary file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1206866 (owner: 10David Caro) [15:17:40] (03CR) 10David Caro: [C:03+2] rados_quota_exporter: use secondary file [puppet] - 10https://gerrit.wikimedia.org/r/1206866 (owner: 10David Caro) [15:19:36] sounds like it'll be another 20 minutes because of the language cache rebuild thing. sorry my patch was so time consuming [15:21:48] no worries [15:21:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:25:13] (03CR) 10Andrew Bogott: Add cloudidp2001-dev (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1206448 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [15:26:28] 06SRE, 06cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11384055 (10fgiunchedi) As far as I understand the problem, lvm metadata size and alignment can be related to the underlying block device reported data, specifically th... 
[15:26:45] 06SRE, 06Traffic, 13Patch-For-Review: Meta query about why we map 31.13.103.0/24 to US - https://phabricator.wikimedia.org/T409735#11384056 (10SLyngshede-WMF) Script and tooling https://gitlab.wikimedia.org/slyngshede/meta-geomap I'll move it to the SRE namespace after a review [15:29:10] (03PS2) 10Andrew Bogott: Config for cloudidp2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/1206448 (https://phabricator.wikimedia.org/T410294) [15:29:10] (03PS1) 10Andrew Bogott: Add cloudidp2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/1206895 (https://phabricator.wikimedia.org/T410294) [15:29:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 2 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406 (10bking) 03NEW [15:29:50] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1059.eqiad.wmnet with OS trixie [15:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T1530) [15:30:05] swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy etcd maintenance in codfw (one-off). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T1530). [15:30:08] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, and 3 others: Hardware requirements for WDQS backend migration. - https://phabricator.wikimedia.org/T409769#11384093 (10bking) @Jhancock.wm thanks for reaching out, but we are OK at the moment in CODFW. For some reason we just have more hosts there ;) . @Jclark-c... 
[15:30:14] o/ [15:30:14] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206851|undeploy Extension:Capiunto (T410172)]] (duration: 46m 51s) [15:30:18] T410172: Drop the decade-stalled Capiunto experimental extension from production - https://phabricator.wikimedia.org/T410172 [15:30:28] yay, that's done :) [15:30:36] jit, all yours [15:30:44] thanks sergi0 :) [15:32:47] 10SRE-SLO: Sloth: onboard subset of existing SLOs to pilot - https://phabricator.wikimedia.org/T409310#11384110 (10herron) >>! In T409310#11363845, @elukey wrote: > @herron Hi! Could you please backfill `slo:period_error_budget_remaining:ratio` too? I see that the time series start from Oct 27th, this is the rol... [15:33:24] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:27] (03PS1) 10Daniel Kinzler: rest-gateway: implement per-route rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206898 (https://phabricator.wikimedia.org/T409044) [15:34:53] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1060.eqiad.wmnet with OS trixie [15:35:10] !log disable puppet on A:conf-codfw - T352245 [15:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:18] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [15:35:57] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T409060#11384115 (10BTullis) Yes, please. Feel free to go ahead. Apologies for the delay. 
[15:37:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: eqiad row C/D Machine Learning host migrations - https://phabricator.wikimedia.org/T405647#11384128 (10RobH) 05Open→03Resolved All machine learning hosts have been migrated, resolving this task. [15:37:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host zuul1002.eqiad.wmnet [15:39:19] !log silenced EtcdReplicationDown db7447af-851f-4faa-a4fd-b535ee9fbcdb - T352245 [15:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:35] !log sukhe@cumin1003 START - Cookbook sre.hosts.decommission for hosts hcaptcha-proxy7002.wikimedia.org [15:40:40] 06SRE, 06cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11384157 (10fgiunchedi) And `lsblk -t` for comparison: ` root@cloudcontrol2010-dev:/# lsblk -t NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED... [15:41:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host zuul1002.eqiad.wmnet [15:42:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1088.eqiad.wmnet [15:42:58] (03CR) 10Scott French: "Thanks for the reviews!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1206452 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [15:43:01] (03CR) 10Scott French: [C:03+2] hiera: temporarily disable etcd replication on conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/1206452 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [15:43:19] (03CR) 10Ssingh: [C:03+1] LVS: set druid-coordinator to state lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1199763 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [15:44:25] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Cleaning up Puppet and Netbox VLAN sub-ints on edge sites - https://phabricator.wikimedia.org/T410411 (10ssingh) 03NEW [15:44:37] !log sukhe@cumin1003 START - Cookbook sre.dns.netbox [15:45:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 2 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11384193 (10bking) [15:45:49] (03PS2) 10Ssingh: hiera: lvs/interfaces: remove public1-ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1206424 (https://phabricator.wikimedia.org/T410047) [15:46:19] (03CR) 10Clare Ming: [C:03+2] Test Kitchen UI: Deploying v1.1.2 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206840 (https://phabricator.wikimedia.org/T409546) (owner: 10Santiago Faci) [15:46:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:47:50] (03CR) 10Ssingh: "No code change, added the bug # in the commit message." 
[puppet] - 10https://gerrit.wikimedia.org/r/1206424 (https://phabricator.wikimedia.org/T410047) (owner: 10Ssingh) [15:47:56] !log sukhe@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy7002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1003" [15:48:01] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploying v1.1.2 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206840 (https://phabricator.wikimedia.org/T409546) (owner: 10Santiago Faci) [15:48:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11384205 (10RobH) Awesome! We're also moving dns1006 at the same time (we'll move it first while k8 hosts drain) and then we'll move onto moving these! I'll ping you in about... [15:48:17] !log sukhe@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hcaptcha-proxy7002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1003" [15:48:18] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:48:18] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts hcaptcha-proxy7002.wikimedia.org [15:48:27] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11384208 (10Ladsgroup) At the scale we are talking, they won't make any dent in the stats. [15:48:35] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10vm-requests: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 - https://phabricator.wikimedia.org/T409860#11384209 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin1003 for... 
[15:48:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1088.eqiad.wmnet [15:48:43] !log transferred etcd-mirror replication to conf2006 - T352245 [15:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:50] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [15:49:09] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic, 13Patch-For-Review: Cleaning up Puppet and Netbox VLAN sub-ints on edge sites - https://phabricator.wikimedia.org/T410411#11384212 (10ssingh) p:05Triage→03Low [15:49:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [15:49:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [15:49:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [15:50:42] gonna roll restart mobileapps to at least temporarily quieten this ^ (if that's okay with your window swfrench-wmf) [15:51:07] hnowlan: ack, and thank you for checking! 
should be fine :) [15:51:08] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1060.eqiad.wmnet with reason: host reimage [15:51:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:51:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host zuul2001.codfw.wmnet [15:52:04] FYI, folks: in a couple of minutes, we'll enter the more hitful portion of this maintenance where I'll need to take the scap lock for a bit [15:52:05] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: sync [15:52:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host zuul2001.codfw.wmnet [15:53:24] FIRING: JobUnavailable: Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:53:24] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: sync [15:53:46] (03CR) 10Kamila Součková: [C:03+2] deployment-server: generate clusterinfo for helm [puppet] - 10https://gerrit.wikimedia.org/r/1201802 (https://phabricator.wikimedia.org/T388969) (owner: 10Kamila Součková) [15:53:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:54:30] (03CR) 10Dzahn: [C:03+1] "thanks, volans" [puppet] - 10https://gerrit.wikimedia.org/r/1205192 
(https://phabricator.wikimedia.org/T409893) (owner: 10Dzahn) [15:54:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [15:54:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... [15:54:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [15:55:12] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1060.eqiad.wmnet with reason: host reimage [15:58:51] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1206895 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [15:59:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 2 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11384283 (10bking) 05Open→03In progress p:05Triage→03Medium [16:00:05] swfrench-wmf: It is that lovely time of the day again! You are hereby commanded to deploy etcd maintenance in codfw (one-off). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T1530). [16:00:05] jelto, arnoldokoth, and mutante: OwO what's this, a deployment window?? SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T1600). 
nyaa~ [16:00:14] still here :) [16:00:18] (03PS1) 10Majavah: P:toolforge: k8s: haproxy: Increment frontend session limit [puppet] - 10https://gerrit.wikimedia.org/r/1206902 [16:00:28] (03CR) 10Scott French: [C:03+2] hiera: switch codfw etcd-main cluster to cfssl/pki [puppet] - 10https://gerrit.wikimedia.org/r/1203557 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [16:01:58] (03CR) 10Muehlenhoff: "One missing setting inline, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1206448 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [16:01:59] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on phab2002.codfw.wmnet with reason: deployment [16:02:29] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on phab1004.eqiad.wmnet with reason: deployment [16:02:35] !log brennen@deploy2002 Started deploy [phabricator/deployment@8b1bc09]: deploy phab2002 for T409947 [16:02:39] T409947: Merge and deploy upstream master from Phorge as of 2025-11-12 - https://phabricator.wikimedia.org/T409947 [16:02:48] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7638/co" [puppet] - 10https://gerrit.wikimedia.org/r/1206902 (owner: 10Majavah) [16:02:53] !log cumin2024@db2205.codfw.wmnet[(none)]> drop database if exists fixcopyrightwiki; drop database if exists langcomwiki; drop database if exists mowiki; drop database if exists mowiktionary; (T297297) [16:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:57] T297297: Investigate the unusual dbs in s3 - https://phabricator.wikimedia.org/T297297 [16:03:06] !log brennen@deploy2002 Finished deploy [phabricator/deployment@8b1bc09]: deploy phab2002 for T409947 (duration: 00m 31s) [16:03:17] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1206902 
(owner: 10Majavah) [16:03:23] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge: k8s: haproxy: Increment frontend session limit [puppet] - 10https://gerrit.wikimedia.org/r/1206902 (owner: 10Majavah) [16:03:32] !log brennen@deploy2002 Started deploy [phabricator/deployment@8b1bc09]: deploy phab1004 for T409947 [16:04:28] !log brennen@deploy2002 Finished deploy [phabricator/deployment@8b1bc09]: deploy phab1004 for T409947 (duration: 00m 56s) [16:08:29] brennen: just in case I need to briefly lock scap, are you largely done with your phab deployments? [16:08:46] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on releases1003.eqiad.wmnet with reason: releases [16:09:47] swfrench-wmf: all clear [16:10:04] brennen: great, thanks! [16:10:11] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206905 [16:10:12] !log migrating etcd to PKI certs on conf2004 - T352245 [16:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:17] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [16:10:40] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on releases2003.codfw.wmnet with reason: releases [16:13:43] (03CR) 10Dzahn: [C:03+2] releases: flip the active backend from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/1204933 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [16:15:40] (03CR) 10Dzahn: [V:03+1 C:03+2] releases: control jenkins service by DC name, not host name [puppet] - 10https://gerrit.wikimedia.org/r/1204980 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [16:15:55] (03PS1) 10Kosta Harlan: hCaptcha: Validate sitekey of /siteverify API call [extensions/ConfirmEdit] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206906 (https://phabricator.wikimedia.org/T410024) [16:16:22] (03CR) 10Dzahn: [V:03+1 C:03+2] 
releases: stop/mask jenkins in eqiad, start/unmask jenkins in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1204982 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [16:17:56] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Measure request frequency of thumbnail sizes - https://phabricator.wikimedia.org/T410304#11384438 (10MatthewVernon) It's about 0.5% difference in count of 250, which isn't a vast amount, but it's not nothing. And the ranking of... [16:18:24] RESOLVED: JobUnavailable: Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:18:30] !log swfrench@deploy2002 Locking from deployment [ALL REPOSITORIES]: Hold deployments during etcd certificate change - T352245 [16:18:34] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [16:18:46] going to hold the scap lock for a few minutes during this last restart [16:19:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host zuul2002.codfw.wmnet [16:21:25] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1060.eqiad.wmnet with OS trixie [16:22:02] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1062.eqiad.wmnet with OS trixie [16:22:29] !log migrating etcd to PKI certs on conf2005 - T352245 [16:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host zuul2002.codfw.wmnet [16:23:02] !log swfrench@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Hold deployments during etcd certificate change - T352245 (duration: 04m 32s) [16:25:00] !log begin rolling restarts of codfw-associated confds - T352245 [16:25:03] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:07] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [16:28:01] (03CR) 10BCornwall: [V:03+2 C:03+2] wmf-debci: Also create man1 dir [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1204941 (https://phabricator.wikimedia.org/T352003) (owner: 10BCornwall) [16:28:15] !log restarted navtiming on webperf2003 - T352245 [16:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:27] (03CR) 10Scott French: [C:03+2] hiera: move etcd replication back to conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/1206453 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [16:32:11] swfrench-wmf: can I have the puppet-merge lock for a second? [16:32:42] mutante: so, you don't want me to merge your patch? [16:32:52] swfrench-wmf: I do need it to be merged please [16:33:04] mutante: ack, so it's okay if I merge it right now [16:33:14] yes please [16:33:27] (03CR) 10Muehlenhoff: [C:03+1] "LGTM and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1206873 (owner: 10Alexandros Kosiaris) [16:33:34] mutante: awesome, doing :) [16:33:51] thanks! [16:34:15] {{done}} [16:34:41] (03CR) 10Alexandros Kosiaris: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1206873 (owner: 10Alexandros Kosiaris) [16:35:33] (03PS3) 10Andrew Bogott: Config for cloudidp2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/1206448 (https://phabricator.wikimedia.org/T410294) [16:35:40] (03CR) 10Andrew Bogott: [C:03+2] Add cloudidp2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/1206895 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [16:35:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:36:09] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1206448 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [16:36:46] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:36:56] !incidents [16:36:56] 7013 (UNACKED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org) [16:36:57] 7012 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [16:37:10] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:37:11] !ack 7013 [16:37:11] 7013 (ACKED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org) [16:37:26] o/ [16:37:44] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1062.eqiad.wmnet with reason: host reimage [16:37:45] FIRING: Primary inbound port utilisation over 80% 
#page: Alert for device cr3-eqsin.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:38:09] !log transferred etcd-mirror replication back to conf2005 - T352245 [16:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:12] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [16:38:56] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:41:20] !log andrew@cumin2002 START - Cookbook sre.ganeti.makevm for new host cloudidp2001-dev.wikimedia.org [16:41:22] !log andrew@cumin2002 START - Cookbook sre.dns.netbox [16:41:46] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:42:45] RESOLVED: [2x] Primary inbound port utilisation over 80% #page: Device cr1-magru.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:43:20] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1062.eqiad.wmnet with reason: host reimage [16:44:03] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS trixie [16:44:34] FIRING: ProbeDown: Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1003:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:45:35] !log andrew@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cloudidp2001-dev.wikimedia.org - andrew@cumin2002" [16:45:41] !log andrew@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cloudidp2001-dev.wikimedia.org - andrew@cumin2002" [16:45:41] !log andrew@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:45:41] !log andrew@cumin2002 START - Cookbook sre.dns.wipe-cache cloudidp2001-dev.wikimedia.org on all recursors [16:45:44] 06SRE, 06Infrastructure-Foundations, 10netops: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11384628 (10Papaul) I took a look at xe-1/0/8 as you mentioned it was cp5002 and i saw dns5004 and just to realized that this task has been open since 2020 5 years ago so now on por... 
[16:45:45] !log andrew@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudidp2001-dev.wikimedia.org on all recursors [16:46:17] !log andrew@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM cloudidp2001-dev.wikimedia.org - andrew@cumin2002" [16:46:22] !log andrew@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM cloudidp2001-dev.wikimedia.org - andrew@cumin2002" [16:47:06] 06SRE, 06Infrastructure-Foundations: Integrate Trixie 13.2 point update - https://phabricator.wikimedia.org/T410147#11384645 (10MoritzMuehlenhoff) [16:47:51] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for dsmit - https://phabricator.wikimedia.org/T410426 (10DSmit-WMF) 03NEW [16:49:34] FIRING: [3x] ProbeDown: Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:49:41] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudidp2001-dev.wikimedia.org with OS bookworm [16:50:01] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, and 2 others: Site: codfw 1 VM request for codfw1dev CAS test/dev, hostname: cloudidp2001-dev - https://phabricator.wikimedia.org/T410294#11384666 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cum... [16:50:06] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11384667 (10EdErhart-WMF) Thanks for those thoughts @Dzahn! We'd still like to put the new microsite at wikipedia25.org. They are two separate experiences that wil... 
[16:51:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:53:03] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, and 2 others: eqiad row C/D Data Persistence host migrations - https://phabricator.wikimedia.org/T405942#11384678 (10MatthewVernon) @RobH / @Jclark-ctr as I [[ https://phabricator.wikimedia.org/T405942#11276918 | noted above ]], `moss-be1002` can be done whene... [16:54:34] RESOLVED: [3x] ProbeDown: Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:55:29] !log deleted EtcdReplicationDown silence db7447af-851f-4faa-a4fd-b535ee9fbcdb - T352245 [16:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:33] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [16:56:00] !log silenced wikifeeds codfw swagger alert for 24h T410296 [16:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:03] T410296: Significant increase in wikifeeds latency since 2025/11/13 - https://phabricator.wikimedia.org/T410296 [16:57:55] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to ops-limited for matthieulec - https://phabricator.wikimedia.org/T410291#11384699 (10Kappakayala) Approved! [16:59:03] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage [17:00:05] jhathaway and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:02:17] (03CR) 10Volans: [C:03+1] "Approved on task, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1206422 (https://phabricator.wikimedia.org/T410291) (owner: 10Matthieulec) [17:04:48] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage [17:05:04] FIRING: [6x] ProbeDown: Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:05:36] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1062.eqiad.wmnet with OS trixie [17:07:11] (03PS1) 10Scott French: hiera: point codfw LVS back to conf2004 [puppet] - 10https://gerrit.wikimedia.org/r/1206922 (https://phabricator.wikimedia.org/T352245) [17:07:32] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1206922 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [17:07:41] (03CR) 10CI reject: [V:04-1] hiera: point codfw LVS back to conf2004 [puppet] - 10https://gerrit.wikimedia.org/r/1206922 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [17:08:09] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on releases1003.eqiad.wmnet with reason: failover [17:08:31] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=dns1006.wikimedia.org [reason: T405623 eqiad row C/D host migration] [17:08:34] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on releases2003.codfw.wmnet with reason: failover [17:08:35] T405623: eqiad row C/D Traffic host migrations - 
https://phabricator.wikimedia.org/T405623 [17:09:08] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:09:19] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/nda for AnkitaM - https://phabricator.wikimedia.org/T409894#11384792 (10MGerlach) Thank you @Dzahn and @Volans [17:09:22] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for dsmit - https://phabricator.wikimedia.org/T410426#11384793 (10DSantamaria) 👍 Approved if you need my approval! [17:10:08] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS trixie [17:10:13] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on dns1006.wikimedia.org with reason: C/D Migration [17:12:03] (03PS1) 10Alexandros Kosiaris: admin: Remove some older keys of mine (akosiaris) [puppet] - 10https://gerrit.wikimedia.org/r/1206923 [17:12:38] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for dsmit - https://phabricator.wikimedia.org/T410426#11384803 (10Volans) p:05Triage→03Medium [17:13:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:14:56] (03CR) 10Dzahn: [C:03+1] admin: Remove some older keys of mine (akosiaris) [puppet] - 10https://gerrit.wikimedia.org/r/1206923 (owner: 10Alexandros Kosiaris) [17:14:58] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for dsmit - https://phabricator.wikimedia.org/T410426#11384832 (10Volans) Please refer to 
https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#Access_Levels to clarify which level access you need to the `analytics-privateda... [17:15:03] (03PS1) 10Bvibber: MediaViewer buckets reduction to all groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206924 (https://phabricator.wikimedia.org/T372165) [17:16:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, November 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206924 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber) [17:18:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:18:40] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11384868 (10KFrancis) Hi all, the NDA has been signed. Thanks! [17:22:31] (03CR) 10Alexandros Kosiaris: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1206923 (owner: 10Alexandros Kosiaris) [17:25:02] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1016.eqiad.wmnet with reason: C/D Migration [17:25:44] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1016.eqiad.wmnet [17:25:48] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage [17:26:17] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1016.eqiad.wmnet [17:26:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11384918 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1003 depool for host wikikube-worker1016.eqiad.wmnet completed: - wikikube-w... [17:27:09] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1063.eqiad.wmnet with OS trixie [17:27:43] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=dns1006.wikimedia.org [17:29:42] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage [17:30:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:31:09] !incidents [17:31:10] 7015 (UNACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [17:31:10] 7014 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc 
(cr3-eqsin.wikimedia.org) [17:31:10] 7013 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org) [17:31:10] 7012 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [17:31:16] !ack 7015 [17:31:16] 7015 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [17:32:06] looking also Raine [17:33:11] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1065.eqiad.wmnet'] [17:35:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:35:55] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:36:06] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1167.eqiad.wmnet with reason: C/D Migration [17:37:09] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:37:44] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1181.eqiad.wmnet with reason: C/D Migration [17:38:50] 10ops-eqiad, 06DC-Ops: eno1 on wikikube-worker1016:9100 has the wrong speed: 1.25e+07. 
- https://phabricator.wikimedia.org/T410434 (10phaultfinder) 03NEW [17:40:35] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1189.eqiad.wmnet with reason: C/D Migration [17:42:39] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on releases2003.codfw.wmnet with reason: failover [17:42:52] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on pc1013.eqiad.wmnet with reason: C/D Migration [17:42:54] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on releases1003.eqiad.wmnet with reason: failover [17:43:29] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1044.eqiad.wmnet with OS trixie [17:44:23] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1153.eqiad.wmnet with reason: C/D Migration [17:45:45] (03PS1) 10Clément Goubert: kubernetes::node: Use netmask to determine network topology [puppet] - 10https://gerrit.wikimedia.org/r/1206929 (https://phabricator.wikimedia.org/T405950) [17:46:02] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1206929 (https://phabricator.wikimedia.org/T405950) (owner: 10Clément Goubert) [17:46:09] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1221.eqiad.wmnet with reason: C/D Migration [17:46:30] (03CR) 10Tchanders: [C:03+1] "Looks good. 
Deployable once 1.46.0-wmf.3 is deployed everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206882 (https://phabricator.wikimedia.org/T409717) (owner: 10Reedy) [17:48:43] (03PS1) 10Dzahn: releases: debug jenkins service masking (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1206930 [17:49:24] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1222.eqiad.wmnet with reason: C/D Migration [17:50:18] (03CR) 10Tchanders: [C:03+1] "We should also assign the ignore right to stewards via config, and undo the global rights change on the wikis: T409717#11385067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206882 (https://phabricator.wikimedia.org/T409717) (owner: 10Reedy) [17:52:18] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on ms-backup1002.eqiad.wmnet with reason: C/D Migration [17:52:40] !log andrew@cumin2002 START - Cookbook sre.ganeti.makevm for new host cloudidp2001-dev.wikimedia.org [17:52:42] !log andrew@cumin2002 START - Cookbook sre.dns.netbox [17:53:07] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1206929 (https://phabricator.wikimedia.org/T405950) (owner: 10Clément Goubert) [17:53:36] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T409060#11385091 (10VRiley-WMF) 05Open→03In progress Swapping now [17:53:57] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on backup1006.eqiad.wmnet with reason: C/D Migration [17:54:40] (03CR) 10Dzahn: [C:03+1] admin: Adding matthieulec to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1206422 (https://phabricator.wikimedia.org/T410291) (owner: 10Matthieulec) [17:55:23] !log andrew@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:55:24] !log andrew@cumin2002 START - Cookbook sre.dns.wipe-cache cloudidp2001-dev.wikimedia.org on all recursors [17:55:27] !log andrew@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudidp2001-dev.wikimedia.org on all recursors [17:55:47] (03CR) 10Clément Goubert: [C:03+2] kubernetes::node: Use netmask to determine network topology [puppet] - 10https://gerrit.wikimedia.org/r/1206929 (https://phabricator.wikimedia.org/T405950) (owner: 10Clément Goubert) [17:56:06] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1064.eqiad.wmnet with OS trixie [17:56:06] !log andrew@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM cloudidp2001-dev.wikimedia.org - andrew@cumin2002" [17:56:08] !log cgoubert@cumin1003:~$ sudo cumin 'A:wikikube-worker' "disable-puppet 'deploying network topology detection change - ${USER}'" [17:56:11] !log andrew@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM cloudidp2001-dev.wikimedia.org - andrew@cumin2002" [17:56:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:43] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1199 - https://phabricator.wikimedia.org/T409060#11385103 (10VRiley-WMF) 05In progress→03Resolved Disk has been swapped [17:56:47] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on backup1007.eqiad.wmnet with reason: C/D Migration [17:56:58] !log andrew@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=94) for new host cloudidp2001-dev.wikimedia.org [17:58:48] 10ops-eqiad, 06DC-Ops: eno8303 on db1221:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T410436 (10phaultfinder) 03NEW [17:58:59] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on db1184.eqiad.wmnet with reason: C/D Migration [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T1800) [18:00:11] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: host reimage [18:00:34] (03PS4) 10Andrew Bogott: Config for cloudidp2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/1206448 (https://phabricator.wikimedia.org/T410294) [18:00:34] (03PS1) 10Andrew Bogott: cloudidp2001-dev: force to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1206932 [18:01:35] PROBLEM - Host lsw1-d6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:01:41] (03CR) 10Andrew Bogott: [C:03+2] cloudidp2001-dev: force to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1206932 (owner: 10Andrew Bogott) [18:02:09] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on es1033.eqiad.wmnet with reason: C/D Migration [18:03:05] FIRING: OspfAdjError: OSPF Adjacency not formed on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - 
https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjError [18:03:10] PROBLEM - Host lsw1-d6-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [18:03:16] (03CR) 10Volans: [C:03+2] admin: Adding matthieulec to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1206422 (https://phabricator.wikimedia.org/T410291) (owner: 10Matthieulec) [18:03:24] FIRING: [12x] ProbeDown: Service aqs1022-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:03:29] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: host reimage [18:04:04] FIRING: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [18:04:39] FIRING: [2x] CoreBGPDown: Core BGP session down between ssw1-d1-eqiad and lsw1-d6-eqiad (10.64.128.29) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:05:04] !log cgoubert@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1016.eqiad.wmnet [18:05:06] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1016.eqiad.wmnet [18:05:51] !log sudo cumin 'A:wikikube-worker' "enable-puppet 'deploying network topology detection change - ${USER}'" [18:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:15] !log cgoubert@cumin1003 START - Cookbook 
sre.k8s.pool-depool-node pool for host wikikube-worker1016.eqiad.wmnet [18:06:16] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1016.eqiad.wmnet [18:08:04] RESOLVED: OspfAdjError: OSPF Adjacency not formed on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjError [18:08:57] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:08:59] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11385149 (10Dzahn) @EdErhart-WMF Gotcha! Alright, so we will replace that existing redirect from wikipedia25.org to the foundation site with a new microsite under... 
[18:09:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 2 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11385150 (10bking) [18:09:08] FIRING: [13x] ProbeDown: Service aqs1022-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:09:39] FIRING: [4x] CoreBGPDown: Core BGP session down between lsw1-d6-eqiad and ssw1-d1-eqiad (10.64.128.17) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:10:02] !incidents [18:10:03] 7016 (ACKED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [18:10:03] 7015 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [18:10:03] 7014 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr3-eqsin.wikimedia.org) [18:10:03] 7013 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org) [18:10:04] 7012 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [18:10:06] topranks: errr that what we're working on rn? 
^ [18:13:57] FIRING: [2x] ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:14:28] !incidents [18:14:29] 7016 (ACKED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [18:14:29] 7015 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [18:14:29] 7014 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr3-eqsin.wikimedia.org) [18:14:29] 7013 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org) [18:14:29] 7012 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [18:14:30] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1065.eqiad.wmnet with OS trixie [18:14:41] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1066.eqiad.wmnet with OS trixie [18:14:49] * topranks we appear to have an issue in eqiad rack D6 [18:14:56] Raine: jhathaway ^ [18:15:10] claime: thanks [18:15:13] thanks [18:17:00] topranks: what kind of issue? 
wondering if we should depool services there [18:17:23] yes depool is advised [18:17:34] ok, thank you [18:17:37] BGP EVPN is down between spine/leaf switches yet the spine is pingable [18:17:44] mmm tasty [18:18:24] FIRING: [14x] ProbeDown: Service aqs1022-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:18:57] RESOLVED: [2x] ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:20:21] mutante: we just ran into a switch error on d7 and its hard down [18:20:25] Raine: you can depool aqs1022 [18:20:27] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11385203 (10Volans) p:05Triage→03Medium [18:20:28] so we may have to push your migration due to no fault of your own [18:20:33] we dont wanna migrate more hosts mid outage [18:20:59] There's a bunch of db hosts as well, is a dba around to advise if something needs to be done ? [18:21:03] https://netbox.wikimedia.org/dcim/devices/?location_id=8&rack_id=40&sort=name [18:21:19] robh: makes sense! 
ack, thank you [18:21:19] indeed LOTS of db hosts [18:21:20] two es nodes, es1056 and 1053 [18:21:26] marostegui / jynus ^ [18:21:29] Two pc nodes [18:21:36] pc1014 pc1018 [18:21:45] we're investigating the network issue and netops is trying to resolve just be aware [18:21:50] PROBLEM - Host db1221 #page is DOWN: PING CRITICAL - Packet loss = 100% [18:21:52] restbase1045 sessionstore1006 wdqs1028 [18:22:13] RECOVERY - Host lsw1-d6-eqiad is UP: PING WARNING - Packet loss = 50%, RTA = 0.77 ms [18:22:14] PROBLEM - MariaDB Replica Lag: x3 #page on db1258 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1341.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:22:15] PROBLEM - MariaDB Replica IO: x3 #page on db1258 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2005, Errmsg: error reconnecting to master repl2024@db1255.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Unknown server host db1255.eqiad.wmnet (-3) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:22:16] PROBLEM - MariaDB Replica IO: s2 #page on db1259 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2005, Errmsg: error reconnecting to master repl2024@db1222.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Unknown server host db1222.eqiad.wmnet (-3) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:22:20] PROBLEM - Restbase root url on restbase1045 is CRITICAL: connect to address 10.64.48.23 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [18:22:20] Hey what's up? 
[18:22:21] wheeeeeee [18:22:27] rack down marostegui [18:22:28] marostegui: rack d6 down in eqiad [18:22:31] marostegui: need to depool db hosts in D6 [18:22:39] Ok checking what's in there [18:22:47] marostegui: https://netbox.wikimedia.org/dcim/devices/?location_id=8&rack_id=40&sort=name [18:22:58] we may be back up, checking [18:23:00] Merci [18:23:05] PROBLEM - MariaDB Replica Lag: s2 #page on db1233 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 558.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:23:06] PROBLEM - MariaDB Replica Lag: s2 #page on db1259 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 413.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:23:09] we're in a hangout working the issue [18:23:15] RECOVERY - MariaDB Replica Lag: x3 #page on db1258 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:23:16] RECOVERY - MariaDB Replica IO: x3 #page on db1258 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:23:17] RECOVERY - MariaDB Replica IO: s2 #page on db1259 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:23:21] RECOVERY - Restbase root url on restbase1045 is OK: HTTP OK: HTTP/1.1 200 - 18662 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/RESTBase [18:23:21] OSPF was somehow not configured for the core-facing interfaces on lsw1-d6-eqiad [18:23:28] ok seems to be recovering [18:23:32] re-adding the configuration has brought it back into operation seemingly [18:23:33] so may not need to fail things over [18:23:38] thanks :D [18:23:40] robh: that's good [18:23:45] RECOVERY - Host lsw1-d6-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [18:23:48] check
if there're any masters just in case [18:24:04] RECOVERY - MariaDB Replica Lag: s2 #page on db1233 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:24:05] RECOVERY - MariaDB Replica Lag: s2 #page on db1259 is OK: OK slave_sql_lag Replication lag: 0.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:24:38] db wise we don't have any masters, so that's good [18:24:45] checking the dbproxy to see if it was an active or sby [18:24:55] FIRING: [13x] ProbeDown: Service aqs1022-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:25:07] RESOLVED: [2x] OspfAdjDown: OSPF Adjacency down on ssw1-d1-eqiad interface ethernet-1/14.0 - https://wikitech.wikimedia.org/wiki/Network_monitoring#OSPF_status - https://grafana.wikimedia.org/d/b77db156-d852-4601-acc5-4065b888e5fe/ospf-status-nokia?orgId=1&var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DOspfAdjDown [18:25:11] cool, i think it's back now, we had a bad homer run where it stripped OSPF off the ports for no reason [18:25:11] RESOLVED: [13x] ProbeDown: Service aqs1022-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:25:20] RESOLVED: [4x] CoreBGPDown: Core BGP session down between lsw1-d6-eqiad and ssw1-d1-eqiad (10.64.128.17) - group ibgp_evpn - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [18:25:32] (03PS1) 10Bking: wdqs: add recycled hosts for migration [puppet] - 10https://gerrit.wikimedia.org/r/1206935 (https://phabricator.wikimedia.org/T410406) [18:25:34] Ok, databases are all good [18:25:38] And the proxy too [18:25:50] thanks marostegui [18:26:02] (03CR) 10CI reject: [V:04-1] wdqs: add recycled hosts for migration [puppet] - 10https://gerrit.wikimedia.org/r/1206935
(https://phabricator.wikimedia.org/T410406) (owner: 10Bking) [18:26:31] robh: from your side everything is up? [18:26:37] yep [18:26:39] now it is [18:26:47] so the homer script stripped OSPF flags and it shouldn't have [18:26:55] good, I am going to go offline then [18:26:57] Thanks everyone [18:27:01] tyvm marostegui <3 [18:27:04] enjoy your evening [18:27:24] mmmm [18:27:27] db1221 is not back yet [18:27:41] robh: ^ [18:28:02] that's in d1... [18:28:13] Yeah I was just seeing that [18:28:14] (03PS1) 10Sergio Gimeno: fix(ReviseToneExperimentInteractionLogger): prevent breaking homepage for unsampled users [extensions/GrowthExperiments] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206936 (https://phabricator.wikimedia.org/T405177) [18:28:27] [19:21:50] <+icinga-wm> PROBLEM - Host db1221 #page is DOWN: PING CRITICAL - Packet loss = 100% [18:28:31] was it coincidence? [18:28:37] No it was moved [18:28:48] But something else seems to have happened, we're checking physical link [18:29:02] ok [18:29:08] marostegui: Ok, that one we weren't supposed to move today and accidentally moved it [18:29:16] and it is indeed down when it was up post move [18:29:17] investigating [18:29:24] ok thanks [18:29:37] PROBLEM - MariaDB Replica Lag: s4 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 654.78 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:29:38] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1044.eqiad.wmnet with OS trixie [18:29:51] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1065.eqiad.wmnet with reason: host reimage [18:29:56] coming back now [18:30:01] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1066.eqiad.wmnet with reason: host reimage [18:30:06] RECOVERY - Host db1221 #page is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [18:30:13] marostegui: bad sfp-t [18:30:15] 
all good yep, replication catching up [18:30:16] it worked post move then didn't... sorry [18:30:18] thanks robh [18:30:19] ^^ this one seemingly blipped due to bad sfp [18:30:24] again we weren't supposed to move that today it was an accident! [18:30:27] sorry about that [18:30:32] we moved 2 of your tomorrow hosts today =P [18:30:33] no worries! [18:30:35] we got overzealous [18:30:46] ahead of schedule! [18:31:37] RECOVERY - MariaDB Replica Lag: s4 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:31:47] robh: no, I am checking, it was scheduled to be moved today :) [18:31:57] robh: tomorrow will be pc1014 and db1189 :) [18:32:09] db1155 may need some manual intervention, replication timed out [18:32:15] (03PS2) 10Bking: wdqs: add recycled hosts for migration [puppet] - 10https://gerrit.wikimedia.org/r/1206935 (https://phabricator.wikimedia.org/T410406) [18:32:29] jynus: refresh [18:32:31] connecting to db1221 [18:32:49] robh: we can postpone or still do it. your choice [18:32:52] 👍 [18:33:24] mutante: checking! [18:33:31] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1065.eqiad.wmnet with reason: host reimage [18:33:35] Going offline now, thanks everyone! [18:33:47] mutante: i think we're ok to move your three, checking [18:34:43] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on es1033.eqiad.wmnet with reason: C/D Migration [18:35:25] robh: joined the calendar meet.. 
no rush :) [18:35:40] mutante: sorry, that hangout is no good [18:35:47] feel free to join the large hangout we're in with netops [18:35:48] sending the link [18:36:51] (03PS1) 10Andrew Bogott: Correct site.pp entry for cloudidp2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/1206937 (https://phabricator.wikimedia.org/T410294) [18:37:14] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1066.eqiad.wmnet with reason: host reimage [18:37:44] (03PS5) 10Andrew Bogott: Config for cloudidp2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/1206448 (https://phabricator.wikimedia.org/T410294) [18:37:57] (03CR) 10Andrew Bogott: [C:03+2] Correct site.pp entry for cloudidp2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/1206937 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [18:38:47] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on gitlab-runner1004.eqiad.wmnet with reason: C/D Migration [18:39:07] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudidp2001-dev.wikimedia.org with OS trixie [18:41:15] (03PS6) 10Andrew Bogott: Config for cloudidp2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/1206448 (https://phabricator.wikimedia.org/T410294) [18:41:17] (03CR) 10Cwhite: [C:03+1] "LGTM from my side!" 
[dumps] - 10https://gerrit.wikimedia.org/r/1203410 (https://phabricator.wikimedia.org/T403482) (owner: 10Silvan Heintze) [18:41:19] folks I understand what happened with the D6 Rack problem, I will write up a task shortly [18:41:55] TL;DR we have a gap in the logic in our "move device attributes" Netbox script (https://netbox.wikimedia.org/extras/scripts/11/) [18:42:34] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on lists1004.wikimedia.org with reason: C/D Migration [18:43:16] that was done a few weeks ago to replace faulty hardware switch in D6, but it seems the moved objects (and I believe specifically the cable) did not all lose all their old attributes/attachments in the Netbox/Django backend [18:44:11] this caused an unintended consequence today when I deleted the (now sent back to vendor) old switch, Netbox deleted the links from its replacement to the spine switches [18:44:29] (03PS1) 10David Caro: prometheus_radosgw_quota_exporter.py: make prom file readable [puppet] - 10https://gerrit.wikimedia.org/r/1206940 [18:44:40] and then when homer was next run it ended up deleting the OSPF config for them [18:44:45] I'll write this up more sanely [18:44:55] (03PS1) 10Mstyles: Security-landing-page: bump image to 2025-11-18-151828 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206941 (https://phabricator.wikimedia.org/T408723) [18:45:10] <3 [18:45:25] (03CR) 10David Caro: [C:03+2] "Tested in cloudcontrol1007, this was not breaking the exporter itself, but the prometheus-node service that exposes the contents of the ge" [puppet] - 10https://gerrit.wikimedia.org/r/1206940 (owner: 10David Caro) [18:46:36] (03CR) 10SBassett: [C:03+2] Security-landing-page: bump image to 2025-11-18-151828 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206941 (https://phabricator.wikimedia.org/T408723) (owner: 10Mstyles) [18:47:26] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on 
wikikube-worker1062.eqiad.wmnet with reason: C/D Migration [18:48:02] (03CR) 10Ssingh: [C:03+1] "Merging this tomorrow morning, around 14:15 UTC" [puppet] - 10https://gerrit.wikimedia.org/r/1199763 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [18:49:03] (03Merged) 10jenkins-bot: Security-landing-page: bump image to 2025-11-18-151828 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206941 (https://phabricator.wikimedia.org/T408723) (owner: 10Mstyles) [18:49:59] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1063.eqiad.wmnet [18:50:35] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1063.eqiad.wmnet [18:50:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11385387 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker1063.eqiad.wmnet completed: - wikikube-worke... [18:51:08] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1063.eqiad.wmnet with reason: C/D Migration [18:51:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:52:29] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11385405 (10Dzahn) gitlab-runners and lists server have been moved. we verified lists server still up and activity in the mail log. it lost only 8 packe... 
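Editor's note on the root cause described at 18:41 to 18:44 (a Netbox deletion removed links on the replacement switch, and the next homer run then dropped OSPF from those ports): one way to catch this class of change is a check that compares current and proposed per-interface protocols before a config push. The following is only a minimal sketch under stated assumptions. The data shape, the function names, and the interface names are hypothetical and are not taken from homer or Netbox.

```python
from typing import Dict, List, Set

# Hypothetical data shape: interface name -> routing protocols enabled on it.
InterfaceProtocols = Dict[str, Set[str]]


def protocols_removed(current: InterfaceProtocols,
                      proposed: InterfaceProtocols) -> Dict[str, Set[str]]:
    """Return, per interface, protocols enabled now but missing from the proposal."""
    removed = {}
    for iface, protos in current.items():
        # An interface absent from the proposal loses everything it had.
        lost = protos - proposed.get(iface, set())
        if lost:
            removed[iface] = lost
    return removed


def guard(current: InterfaceProtocols,
          proposed: InterfaceProtocols,
          protected: frozenset = frozenset({"ospf"})) -> List[str]:
    """List warnings for protected protocols that the proposal would remove."""
    warnings = []
    for iface, lost in sorted(protocols_removed(current, proposed).items()):
        hit = sorted(lost & protected)
        if hit:
            warnings.append(f"{iface}: would remove {', '.join(hit)}")
    return warnings


# Made-up example state; the interface names are illustrative only.
current = {"ethernet-1/55": {"ospf", "bfd"}, "ethernet-1/56": {"ospf"}, "ethernet-1/1": set()}
proposed = {"ethernet-1/55": {"bfd"}, "ethernet-1/56": set(), "ethernet-1/1": set()}

warnings = guard(current, proposed)
for warning in warnings:
    print(warning)
```

A non-empty warning list would be a reason to stop and review the diff by hand before applying it.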
[18:53:00] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1063.eqiad.wmnet [18:53:04] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1063.eqiad.wmnet [18:53:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11385407 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker1063.eqiad.wmnet completed: - wikikube-worker1... [18:53:26] (03PS2) 10Scott French: hiera: point codfw LVS back to conf2004 [puppet] - 10https://gerrit.wikimedia.org/r/1206922 (https://phabricator.wikimedia.org/T352245) [18:53:47] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1305.eqiad.wmnet [18:53:51] (03CR) 10Btullis: [C:03+1] wdqs: add recycled hosts for migration [puppet] - 10https://gerrit.wikimedia.org/r/1206935 (https://phabricator.wikimedia.org/T410406) (owner: 10Bking) [18:53:55] (03CR) 10Ssingh: [C:03+1] hiera: point codfw LVS back to conf2004 [puppet] - 10https://gerrit.wikimedia.org/r/1206922 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [18:54:08] (03CR) 10Scott French: "And now without messed up tag order, heh." 
[puppet] - 10https://gerrit.wikimedia.org/r/1206922 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [18:54:20] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11385412 (10Dzahn) @LSobanski our part is done [18:54:23] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1305.eqiad.wmnet [18:54:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11385413 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker1305.eqiad.wmnet completed: - wikikube-worke... [18:54:31] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11385415 (10Dzahn) 05Open→03Resolved [18:54:32] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1313.eqiad.wmnet [18:55:05] (03CR) 10Muehlenhoff: cloudidp2001-dev: force to puppet 7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1206932 (owner: 10Andrew Bogott) [18:55:08] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1313.eqiad.wmnet [18:55:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11385420 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker1313.eqiad.wmnet completed: - wikikube-worke... 
[18:55:18] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1157.eqiad.wmnet [18:55:54] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1157.eqiad.wmnet [18:56:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11385422 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker1157.eqiad.wmnet completed: - wikikube-worke... [18:56:19] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1305.eqiad.wmnet with reason: C/D Migration [18:57:25] log: netbox changes reverted so it is safe to run homer against lsw1-d6-eqiad now [18:57:56] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudidp2001-dev.wikimedia.org with reason: host reimage [19:00:05] brennen and andre: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T1900). [19:00:50] o/ [19:01:10] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1305.eqiad.wmnet [19:01:13] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1066.eqiad.wmnet with OS trixie [19:01:13] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1305.eqiad.wmnet [19:01:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11385434 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker1305.eqiad.wmnet completed: - wikikube-worker1... 
[19:01:30] andre: https://ichef.bbci.co.uk/ace/standard/976/cpsprodpb/5B73/production/_87211432_trainmathurreuters.jpg.webp [19:01:52] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1065.eqiad.wmnet with OS trixie [19:02:01] !log 1.46.0-wmf.3 train status (T408273): no current blockers, rolling to group0 [19:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:06] T408273: 1.46.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T408273 [19:02:06] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1313.eqiad.wmnet with reason: C/D Migration [19:02:09] mutante: the goal of spiderpig [19:02:14] andre: pre-train it [19:02:37] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206945 (https://phabricator.wikimedia.org/T408273) [19:02:39] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by brennen@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206945 (https://phabricator.wikimedia.org/T408273) (owner: 10TrainBranchBot) [19:02:46] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudidp2001-dev.wikimedia.org with reason: host reimage [19:03:04] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1313.eqiad.wmnet [19:03:07] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1313.eqiad.wmnet [19:03:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11385458 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker1313.eqiad.wmnet completed: - wikikube-worker1... 
[19:03:28] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206945 (https://phabricator.wikimedia.org/T408273) (owner: 10TrainBranchBot) [19:03:34] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1157.eqiad.wmnet [19:03:37] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1157.eqiad.wmnet [19:03:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11385461 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker1157.eqiad.wmnet completed: - wikikube-worke... [19:03:57] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1157.eqiad.wmnet with reason: C/D Migration [19:04:09] !log mstyles@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [19:04:38] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1157.eqiad.wmnet [19:04:41] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1157.eqiad.wmnet [19:04:48] !log mstyles@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [19:04:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11385466 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker1157.eqiad.wmnet completed: - wikikube-worker1... 
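Editor's note: the cookbook START and END entries in this log follow a regular shape, so run durations can be recovered by pairing them. A minimal sketch, assuming one message per input string and the line layout seen above; the regular expression and the `pair_runs` helper are illustrative and not part of any Wikimedia tooling. It ignores `DONE` entries (which have no matching START here) and does not handle runs that cross midnight.

```python
import re
from datetime import datetime

# Matches "!log user@host START|END ... - Cookbook <name> ..." entries as they appear in this log.
LINE = re.compile(
    r"\[(?P<ts>\d{2}:\d{2}:\d{2})\] !log (?P<user>\S+)@(?P<host>\S+) "
    r"(?P<event>START|END) (?:\((?P<result>PASS|FAIL)\) )?- Cookbook (?P<name>\S+) "
    r"(?:\(exit_code=\d+\) )?(?P<detail>.*)"
)


def pair_runs(lines):
    """Pair START/END entries and return (cookbook, detail, result, seconds) tuples."""
    started = {}
    runs = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # Not a cookbook START/END entry.
        key = (m["user"], m["name"], m["detail"])
        ts = datetime.strptime(m["ts"], "%H:%M:%S")
        if m["event"] == "START":
            started[key] = ts
        elif key in started:
            seconds = int((ts - started.pop(key)).total_seconds())
            runs.append((m["name"], m["detail"], m["result"], seconds))
    return runs


# Four entries copied from the log above.
sample = [
    "[18:53:47] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1305.eqiad.wmnet",
    "[18:54:23] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1305.eqiad.wmnet",
    "[18:55:18] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1157.eqiad.wmnet",
    "[18:55:54] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1157.eqiad.wmnet",
]

runs = pair_runs(sample)
for name, detail, result, seconds in runs:
    print(f"{name} [{detail}] {result} in {seconds}s")
```

Keying on user, cookbook name, and the trailing detail keeps overlapping runs by different operators apart.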
[19:04:53] !log mstyles@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [19:05:13] !log mstyles@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [19:05:36] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1254-1256].eqiad.wmnet [19:05:41] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1254-1256].eqiad.wmnet [19:05:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11385475 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1254-1256].eqiad.wmnet completed: - wikikube-... [19:06:15] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1306.eqiad.wmnet [19:06:18] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1306.eqiad.wmnet [19:06:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11385477 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker1306.eqiad.wmnet completed: - wikikube-worker1... 
[19:06:30] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1306.eqiad.wmnet [19:07:06] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1306.eqiad.wmnet [19:07:13] (03CR) 10Bking: [C:03+2] wdqs: add recycled hosts for migration [puppet] - 10https://gerrit.wikimedia.org/r/1206935 (https://phabricator.wikimedia.org/T410406) (owner: 10Bking) [19:07:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11385481 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker1306.eqiad.wmnet completed: - wikikube-worke... [19:07:21] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1254-1256].eqiad.wmnet [19:08:50] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS trixie [19:08:51] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1068.eqiad.wmnet with OS trixie [19:09:04] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1254-1256].eqiad.wmnet [19:09:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11385498 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1254-1256].eqiad.wmnet completed: - wikikub... 
[19:09:32] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1135.eqiad.wmnet with reason: C/D Migration [19:10:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11385500 (10bking) a:05bking→03Jclark-ctr [19:11:00] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1136.eqiad.wmnet with reason: C/D Migration [19:11:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10[28-32] - https://phabricator.wikimedia.org/T410406#11385515 (10bking) @Jclark-ctr I added the hosts to Puppet per https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Planned_-%3E_Active , assigning back over to you. Feel fr... [19:12:24] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1137.eqiad.wmnet with reason: C/D Migration [19:13:15] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to analytics-privatedata-users for AnkitaM - https://phabricator.wikimedia.org/T409893#11385524 (10MGerlach) 05In progress→03Resolved a:03MGerlach @Volans looks like everything is working as expected. Thank you. [19:14:15] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1138.eqiad.wmnet with reason: C/D Migration [19:14:23] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.3 refs T408273 [19:14:27] T408273: 1.46.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T408273 [19:15:48] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1139.eqiad.wmnet with reason: C/D Migration [19:16:24] (03CR) 10Cwhite: "Any update to add? I'd like to get started scheduling the deployment." 
[puppet] - 10https://gerrit.wikimedia.org/r/1201773 (https://phabricator.wikimedia.org/T305223) (owner: 10Cwhite) [19:16:56] !log disable puppet on A:lvs-codfw for pybal config change - T352245 [19:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:00] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [19:17:38] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1154.eqiad.wmnet with reason: C/D Migration [19:17:49] (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1206922 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [19:17:50] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudidp2001-dev.wikimedia.org with OS trixie [19:17:55] (03CR) 10Scott French: [C:03+2] hiera: point codfw LVS back to conf2004 [puppet] - 10https://gerrit.wikimedia.org/r/1206922 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [19:18:38] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11385549 (10Volans) @WMDECyn, in case @Chandra-WMDE's position is a fixed term contract, could you provide us with the expiration date so that we can add it to `data.yaml` to tr... 
[19:18:51] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1155.eqiad.wmnet with reason: C/D Migration [19:18:55] 06SRE, 10SRE-Access-Requests: Requesting access to Analytics_Privatedata for Chandra-WMDE - https://phabricator.wikimedia.org/T409707#11385550 (10Volans) [19:20:36] FYI, similar to yesterday, I'll be applying some config changes to LVS in codfw in a few minutes, which may temporarily produce some alert noise (e.g., `Check if Pybal has been restarted after pybal.conf was changed` checks) [19:22:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:22:30] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1156.eqiad.wmnet with reason: C/D Migration [19:23:22] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw (T352245) [19:23:26] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [19:23:45] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1068.eqiad.wmnet with reason: host reimage [19:23:51] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:23:54] !log robh@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on wikikube-worker1306.eqiad.wmnet with reason: C/D Migration [19:24:08] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw (T352245) [19:24:14] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
cloudvirt1067.eqiad.wmnet with reason: host reimage [19:24:49] (03CR) 10Andrew Bogott: [C:03+2] Config for cloudidp2001-dev (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1206448 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [19:24:49] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1068.eqiad.wmnet with reason: host reimage [19:25:24] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1254-1256].eqiad.wmnet [19:25:29] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1254-1256].eqiad.wmnet [19:25:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11385581 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1254-1256].eqiad.wmnet completed: - wikikube-... [19:25:40] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1306.eqiad.wmnet [19:25:43] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1306.eqiad.wmnet [19:25:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11385582 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker1306.eqiad.wmnet completed: - wikikube-worker1... [19:25:54] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1206886 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [19:26:46] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:27:14] ^ expected - this is a persistently misconfigured service [19:27:55] (03PS1) 10Urbanecm: [Growth] Enable Add Link task pool generation for 3 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206948 (https://phabricator.wikimedia.org/T407818) [19:28:23] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1067.eqiad.wmnet with reason: host reimage [19:29:15] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-codfw (T352245) [19:29:19] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [19:29:27] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1206885 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [19:29:40] swfrench-wmf re: Pybal alert my team is responsible for the dse-k8s-codfw cluster, if there is something I can help with LMK [19:29:46] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:30:03] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-codfw (T352245) [19:30:19] (03CR) 10Andrea Denisse: [C:03+1] "All of the patches in the chain LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1206884 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [19:30:40] inflatador: thanks! 
believe it's just that none of the backend hosts on that service in codfw actually have ingress running. I'll follow up with y'all later on :) [19:31:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11385630 (10RobH) Day 6 Update: * 31 hosts moved today, 77 hosts remain * got directions from Clement on how to move wikikube hosts effectively, moved half... [19:32:36] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:35:32] (03CR) 10Ssingh: "@bcornwall@wikimedia.org: please review!" [dns] - 10https://gerrit.wikimedia.org/r/1206185 (https://phabricator.wikimedia.org/T409735) (owner: 10Slyngshede) [19:37:06] I'm not supposed to see `error: unable to upgrade connection: container mediawiki-qdzz3ma1-app not found in pod mw-script.codfw.qdzz3ma1-tpbdt_mw-script` when running mwscript-k8s, right? [19:37:32] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic2-codfw (T352245) [19:37:36] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [19:37:52] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic2-codfw (T352245) [19:39:03] "error: cannot attach a container in a completed pod; current phase is Failed" [19:39:04] hmm... [19:40:47] something's wrong [19:41:06] swfrench-wmf ACK, SGTM [19:41:57] 10ops-eqiad, 06SRE, 06DC-Ops: eno1 on wikikube-worker1016:9100 has the wrong speed: 1.25e+07. 
- https://phabricator.wikimedia.org/T410434#11385690 (10RobH) This is indeed detecting slow: {F70279116} [19:42:07] urbanecm: I'm wrapping up some work, but can take a look in a few minutes [19:42:17] swfrench-wmf: ty, ping me whenever ready [19:42:18] it sounds like the pod was never successfully created? [19:42:27] !log swfrench@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic1-codfw (T352245) [19:42:39] (03PS1) 10Andrew Bogott: trafficserver: redirect cloudidp-dev to the new server [puppet] - 10https://gerrit.wikimedia.org/r/1206951 (https://phabricator.wikimedia.org/T410294) [19:42:47] (03CR) 10Ssingh: "https://gitlab.wikimedia.org/slyngshede/meta-geomap" [dns] - 10https://gerrit.wikimedia.org/r/1206185 (https://phabricator.wikimedia.org/T409735) (owner: 10Slyngshede) [19:42:47] !log swfrench@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic1-codfw (T352245) [19:42:52] T352245: Migrate the etcd main cluster to cfssl-based PKI - https://phabricator.wikimedia.org/T352245 [19:44:10] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1016.eqiad.wmnet [19:44:46] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1016.eqiad.wmnet [19:44:52] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11385707 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker1016.eqiad.wmnet completed: - wikikube-worke... 
[19:45:08] (03PS1) 10BCornwall: bump changelog for wmf-debci:trixie [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1206952 [19:45:51] * swfrench-wmf is done with LVS-related work [19:46:01] !log robh@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1016.eqiad.wmnet [19:46:04] !log robh@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1016.eqiad.wmnet [19:46:08] (03CR) 10Ssingh: [C:03+1] bump changelog for wmf-debci:trixie [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1206952 (owner: 10BCornwall) [19:46:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11385710 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker1016.eqiad.wmnet completed: - wikikube-worker1... [19:46:13] 10ops-eqiad, 06SRE, 06DC-Ops: eno1 on wikikube-worker1016:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T410434#11385711 (10RobH) 05Open→03Resolved a:03RobH optic swap by john fixed it. [19:46:43] 10ops-eqiad, 06SRE, 06DC-Ops: eno8303 on db1221:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T410436#11385714 (10Jclark-ctr) a:03Jclark-ctr [19:46:46] (03CR) 10BCornwall: [V:03+2 C:03+2] bump changelog for wmf-debci:trixie [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1206952 (owner: 10BCornwall) [19:47:03] swfrench-wmf: i logged what i was trying to do at https://phabricator.wikimedia.org/T410451, incl. outputs. advice welcomed! [19:47:34] 10ops-eqiad, 06SRE, 06DC-Ops: eno8303 on db1221:9100 has the wrong speed: 1.25e+07. 
- https://phabricator.wikimedia.org/T410436#11385727 (10Jclark-ctr) 05Open→03Resolved Faulty optic error has cleared [19:47:45] (03CR) 10Andrew Bogott: [C:03+2] trafficserver: redirect cloudidp-dev to the new server [puppet] - 10https://gerrit.wikimedia.org/r/1206951 (https://phabricator.wikimedia.org/T410294) (owner: 10Andrew Bogott) [19:47:52] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [19:48:13] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [19:48:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11385733 (10RobH) a:05Clement_Goubert→03RobH IRC Discussion Update: We moved about half the wikikube workers today after a sync up with Clement and Cathal on the particular... [19:49:12] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1068.eqiad.wmnet with OS trixie [19:50:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: eqiad row C/D Machine Learning host migrations - https://phabricator.wikimedia.org/T405647#11385742 (10Jclark-ctr) a:05klausman→03Jclark-ctr [19:50:47] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eqiad row C/D Collaboration Services host migrations - https://phabricator.wikimedia.org/T405940#11385745 (10Jclark-ctr) a:05LSobanski→03Jclark-ctr [19:51:29] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1067.eqiad.wmnet with OS trixie [19:51:30] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [19:51:41] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[19:51:43] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1069.eqiad.wmnet with OS trixie [19:52:10] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1071.eqiad.wmnet with OS trixie [19:54:05] 06SRE, 06Infrastructure-Foundations, 10netops: lsw1-d6-eqiad outage Nov 18 2025 - https://phabricator.wikimedia.org/T410455 (10cmooney) 03NEW p:05Triage→03Low [19:56:31] urbanecm: so, it's definitely mwscript (i.e., the shell script that runs in the container) exiting immediately due to a command-line parsing problem. added some detail to the task you opened. [19:57:04] 10ops-ulsfo, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: ulsfo switch refresh - https://phabricator.wikimedia.org/T410456 (10RobH) 03NEW p:05Triage→03Medium [19:57:40] 10ops-ulsfo, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: ulsfo switch refresh - https://phabricator.wikimedia.org/T410456#11385835 (10RobH) 05Open→03Invalid dupe of T410456 [19:58:04] swfrench-wmf: thanks a lot! i didn't realise there might be output in the logs of the container. it looks like i should've used `--follow` instead :/. 
[19:58:23] 10ops-ulsfo, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: ulsfo switch refresh - https://phabricator.wikimedia.org/T410456#11385838 (10RobH) 05Invalid→03Resolved dupe of T408510 [19:58:59] (03PS2) 10Dzahn: releases: debug jenkins service masking (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1206930 [19:59:35] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510#11385845 (10RobH) [20:01:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:02:10] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510#11385856 (10RobH) [20:03:11] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510#11385859 (10ssingh) [20:03:26] urbanecm: so, I think the underlying issue may have just been the extra `--` in the argument list (https://phabricator.wikimedia.org/T410451#11385843) [20:03:37] 10ops-ulsfo, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: ulsfo switch refresh - https://phabricator.wikimedia.org/T410456#11385866 (10Aklapper) →14Duplicate dup:03T408510 [20:03:40] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510#11385868 (10Aklapper) [20:04:59] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510#11385870 (10RobH) [20:05:06] 06SRE, 06Infrastructure-Foundations, 10netops: lsw1-d6-eqiad outage Nov 18 2025 - 
https://phabricator.wikimedia.org/T410455#11385874 (10cmooney) [20:06:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:06:56] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1069.eqiad.wmnet with reason: host reimage [20:07:09] (03PS3) 10Dzahn: releases: debug jenkins service masking (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1206930 [20:07:52] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1071.eqiad.wmnet with reason: host reimage [20:09:34] 06SRE, 06Infrastructure-Foundations, 10netops: lsw1-d6-eqiad outage Nov 18 2025 - https://phabricator.wikimedia.org/T410455#11385892 (10cmooney) [20:11:58] 06SRE, 06Infrastructure-Foundations, 10netops: lsw1-d6-eqiad outage Nov 18 2025 - https://phabricator.wikimedia.org/T410455#11385900 (10cmooney) [20:13:34] 06SRE, 06Infrastructure-Foundations, 10netops: lsw1-d6-eqiad outage Nov 18 2025 - https://phabricator.wikimedia.org/T410455#11385909 (10cmooney) [20:14:03] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1071.eqiad.wmnet with reason: host reimage [20:15:55] (03PS4) 10Dzahn: releases: debug jenkins service masking (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1206930 [20:16:47] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1069.eqiad.wmnet with reason: host reimage [20:18:33] (03PS1) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) [20:21:14] (03PS2) 10Daniel Kinzler: rest-gateway: implement per-route rate 
limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206898 (https://phabricator.wikimedia.org/T409044) [20:21:52] (03PS2) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) [20:25:28] (03PS5) 10Dzahn: releases: debug jenkins service masking (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1206930 [20:32:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11385988 (10CDanis) @RobH please proceed at your convenience -- these two hosts are not in active service. In the future, you could run the cookb... [20:34:36] (03PS6) 10Dzahn: releases: debug jenkins service masking on releases1003 [puppet] - 10https://gerrit.wikimedia.org/r/1206930 (https://phabricator.wikimedia.org/T410422) [20:35:33] 06SRE, 06Infrastructure-Foundations, 10netops: lsw1-d6-eqiad outage Nov 18 2025 - https://phabricator.wikimedia.org/T410455#11385997 (10cmooney) [20:35:43] (03PS1) 10Dzahn: Revert "releases: control jenkins service by DC name, not host name" [puppet] - 10https://gerrit.wikimedia.org/r/1206957 [20:37:27] (03PS2) 10Dzahn: Revert "releases: control jenkins service by DC name, not host name" [puppet] - 10https://gerrit.wikimedia.org/r/1206957 [20:38:56] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:40:15] (03CR) 10Dzahn: [V:03+1 C:03+1] "this IS a difference in the compiler! so using host name vs just "eqiad" is NOT the same. It's about the Hiera lookups." 
[puppet] - 10https://gerrit.wikimedia.org/r/1206957 (owner: 10Dzahn) [20:41:35] 06SRE, 06collaboration-services, 10MW-on-K8s, 06serviceops: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858#11386004 (10Dzahn) [20:42:39] 06SRE, 06collaboration-services, 10MW-on-K8s, 06serviceops: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858#11386020 (10Dzahn) Tagging collab; since we are probably the ones who need to get back to this nowadays. Sorry for the delay; slipped off the radar. [20:43:06] (03PS3) 10Dzahn: Revert "releases: control jenkins service by DC name, not host name" [puppet] - 10https://gerrit.wikimedia.org/r/1206957 (https://phabricator.wikimedia.org/T391578) [20:44:23] (03CR) 10Dzahn: [C:03+2] "going back to the status before today" [puppet] - 10https://gerrit.wikimedia.org/r/1206957 (https://phabricator.wikimedia.org/T391578) (owner: 10Dzahn) [20:45:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: eqiad row C/D Infrastructure Foundations host migrations - https://phabricator.wikimedia.org/T405945#11386024 (10RobH) Thank you for the update, we'll likely move these two hosts tomorrow! 
[20:53:14] (03PS1) 10Dzahn: releases: control jenkins service by host name [puppet] - 10https://gerrit.wikimedia.org/r/1206959 (https://phabricator.wikimedia.org/T392127) [20:53:22] (03PS2) 10Dzahn: releases: control jenkins service by host name [puppet] - 10https://gerrit.wikimedia.org/r/1206959 (https://phabricator.wikimedia.org/T392127) [20:57:53] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1071.eqiad.wmnet with OS trixie [20:58:33] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1206959/7646/releases1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1206959 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [20:59:50] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1072.eqiad.wmnet with OS trixie [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T2100). [21:00:05] bvibber: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:40] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1069.eqiad.wmnet with OS trixie [21:02:09] here [21:02:18] i can go ahead and spiderpig this :) [21:02:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206924 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber) [21:03:52] (03Merged) 10jenkins-bot: MediaViewer buckets reduction to all groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206924 (https://phabricator.wikimedia.org/T372165) (owner: 10Bvibber) [21:04:26] !log bvibber@deploy2002 Started scap sync-world: Backport for [[gerrit:1206924|MediaViewer buckets reduction to all groups (T372165)]] [21:04:30] T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165 [21:06:37] !log import pcre3 8.45-1~deb13+wmf1 into trixie-wikimedia - T401832 [21:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:42] T401832: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832 [21:08:49] !log bvibber@deploy2002 bvibber: Backport for [[gerrit:1206924|MediaViewer buckets reduction to all groups (T372165)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:09:08] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:09:30] !log bvibber@deploy2002 bvibber: Continuing with sync [21:11:07] (03Abandoned) 10Dzahn: releases: debug jenkins service masking on releases1003 [puppet] - 10https://gerrit.wikimedia.org/r/1206930 (https://phabricator.wikimedia.org/T410422) (owner: 10Dzahn) [21:13:43] !log bvibber@deploy2002 Finished scap sync-world: Backport for [[gerrit:1206924|MediaViewer buckets reduction to all groups (T372165)]] (duration: 09m 17s) [21:13:47] T372165: Reduce number of bucketsizes for MediaViewer - https://phabricator.wikimedia.org/T372165 [21:13:50] all done [21:15:29] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1073.eqiad.wmnet with OS trixie [21:15:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [21:15:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... 
[21:15:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [21:15:54] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1072.eqiad.wmnet with reason: host reimage [21:19:35] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1072.eqiad.wmnet with reason: host reimage [21:22:12] (03CR) 10Ladsgroup: [C:03+2] mysql: Rename cookbook to be kebab-case instead of snake_case [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 (owner: 10Ladsgroup) [21:23:09] (03CR) 10Dzahn: [C:03+2] fail-over releases.wikimedia.org backend [dns] - 10https://gerrit.wikimedia.org/r/1204684 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [21:23:20] !log dzahn@dns1004 START - running authdns-update [21:23:37] !log switching backend of releases.wikimedia.org to codfw [21:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:42] !log dzahn@dns1004 END - running authdns-update [21:25:00] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11386177 (10Ladsgroup) [21:25:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11386182 (10Scott_French) @RLazarus and I were looking into verifying workloads returning to the migrated workers, and ran into a few surprises. Going by what's been marked com... 
[21:27:27] !log import trafficserver 9.2.11-1wm1 into trixie-wikimedia - T401832 [21:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:31] T401832: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832 [21:28:58] (03Merged) 10jenkins-bot: mysql: Rename cookbook to be kebab-case instead of snake_case [cookbooks] - 10https://gerrit.wikimedia.org/r/1202150 (owner: 10Ladsgroup) [21:30:17] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1073.eqiad.wmnet with reason: host reimage [21:33:18] (03PS1) 10Kosta Harlan: hCaptcha: Record A/B test experiment group [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206960 (https://phabricator.wikimedia.org/T410354) [21:34:08] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1073.eqiad.wmnet with reason: host reimage [21:34:10] FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:35:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [21:35:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... 
[21:35:47] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [21:36:10] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:38:19] (03CR) 10CI reject: [V:04-1] hCaptcha: Record A/B test experiment group [extensions/WikimediaEvents] (wmf/1.46.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1206960 (https://phabricator.wikimedia.org/T410354) (owner: 10Kosta Harlan) [21:39:10] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:43:30] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1072.eqiad.wmnet with OS trixie [21:44:01] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1074.eqiad.wmnet with OS trixie [21:47:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11386255 (10RobH) >>! In T405950#11386181, @Scott_French wrote: > @RLazarus and I were looking into verifying workloads returning to the migrated workers, and ran into a few sur... [21:50:28] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1075.eqiad.wmnet with OS trixie [21:50:47] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11386261 (10ATitkov) Hello @Dzahn! I am the person responsible for developing the microsite. It is a static html with a bunch of media assets. The tech stack is... 
[21:53:52] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11386273 (10ATitkov) Forgot to add the current repo [[ https://gitlab.wikimedia.org/toolforge-repos/wikipedia25-years-of-wikipedia | https://gitlab.wikimedia.org/t... [21:56:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11386300 (10Scott_French) Got it - thanks for clarifying, @RobH! Alright, in that case, let us know if you'd like a second pair of eyes on anything ahead of the next wave of mig... [21:56:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [21:57:13] !incidents [21:57:13] 7023 (UNACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [21:57:13] 7017 (RESOLVED) Host db1221 (paged) [21:57:14] 7022 (RESOLVED) db1233 (paged)/MariaDB Replica Lag: s2 (paged) [21:57:14] 7021 (RESOLVED) db1259 (paged)/MariaDB Replica Lag: s2 (paged) [21:57:14] 7020 (RESOLVED) db1259 (paged)/MariaDB Replica IO: s2 (paged) [21:57:14] 7019 (RESOLVED) db1258 (paged)/MariaDB Replica IO: x3 (paged) [21:57:14] 7018 (RESOLVED) db1258 (paged)/MariaDB Replica Lag: x3 (paged) [21:57:15] 7016 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [21:57:15] 7015 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [21:57:15] 7014 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr3-eqsin.wikimedia.org) [21:57:16] 7013 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc 
(cr1-codfw.wikimedia.org) [21:57:16] 7012 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [21:57:43] !ack 7023 [21:57:44] 7023 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad) [21:57:48] (03PS1) 10Daimona Eaytoy: Enable $wgCampaignEventsEnableContributionTracking in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206964 (https://phabricator.wikimedia.org/T404904) [21:58:52] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1073.eqiad.wmnet with OS trixie [21:59:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1206964 (https://phabricator.wikimedia.org/T404904) (owner: 10Daimona Eaytoy) [22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251118T2200) [22:01:39] andrew@cumin2002 reimage (PID 4108761) is awaiting input [22:01:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [22:05:07] same deal? [22:05:23] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1075.eqiad.wmnet with reason: host reimage [22:06:28] seems so quickly resolved on its own [22:06:50] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11386337 (10Dzahn) @ATitkov Hello, thanks for the details! 
All should be no problem as long as we can just let git clone the repo contents into a document root of... [22:07:11] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1076.eqiad.wmnet with OS trixie [22:07:18] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1074.eqiad.wmnet with OS trixie [22:07:59] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt1074.eqiad.wmnet with OS trixie [22:08:02] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:08:12] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [22:08:55] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1075.eqiad.wmnet with reason: host reimage [22:08:56] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:09:20] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [22:12:02] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:12:12] ^^indeed [22:13:52] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:21:28] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:21:41] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:22:05] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1076.eqiad.wmnet with reason: host reimage [22:24:19] andrew@cumin2002 reimage (PID 4123369) is awaiting input [22:29:12] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1076.eqiad.wmnet with reason: host reimage [22:32:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad row C/D Service Ops host migrations - https://phabricator.wikimedia.org/T405950#11386421 (10RobH) I think it'll be ok when we move things tomorrow, since I know exactly the mistake I made I don't think I'll make it again for a few months minimum ; D The cu... [22:34:47] 10SRE-SLO, 10Citoid, 10VisualEditor, 06Editing-team (Tracking): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11386437 (10ppelberg) a:03elukey @elukey: it //sounds// like the next step is for you. So, I... [22:36:42] (03PS1) 10Bking: dse-k8s-codfw: set minimum resources for opensearch namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206969 (https://phabricator.wikimedia.org/T408643) [22:37:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [22:37:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ... 
[22:37:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[22:43:51] (CR) CI reject: [V:-1] dse-k8s-codfw: set minimum resources for opensearch namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1206969 (https://phabricator.wikimedia.org/T408643) (owner: Bking)
[22:45:15] (PS1) Bking: opensearch on k8s: Add CODFW environment to helmfiles [deployment-charts] - https://gerrit.wikimedia.org/r/1206973 (https://phabricator.wikimedia.org/T408643)
[22:50:45] SRE, collaboration-services, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11386485 (ATitkov) > we can just let git clone the repo contents into a document root of a standard Apache and it doesn't require installing additional software...
[22:51:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:51:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[22:52:07] !incidents
[22:52:07] 7024 (UNACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[22:52:07] 7023 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[22:52:08] 7017 (RESOLVED) Host db1221 (paged)
[22:52:08] 7022 (RESOLVED) db1233 (paged)/MariaDB Replica Lag: s2 (paged)
[22:52:08] 7021 (RESOLVED) db1259 (paged)/MariaDB Replica Lag: s2 (paged)
[22:52:08] 7020 (RESOLVED) db1259 (paged)/MariaDB Replica IO: s2 (paged)
[22:52:08] 7019 (RESOLVED) db1258 (paged)/MariaDB Replica IO: x3 (paged)
[22:52:09] 7018 (RESOLVED) db1258 (paged)/MariaDB Replica Lag: x3 (paged)
[22:52:09] 7016 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[22:52:09] 7015 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[22:52:10] 7014 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr3-eqsin.wikimedia.org)
[22:52:10] 7013 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr1-codfw.wikimedia.org)
[22:52:11] 7012 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[22:52:12] !incidents
[22:52:13] 7024 (UNACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[22:52:13] 7023 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[22:52:13] 7017 (RESOLVED) Host db1221 (paged)
[22:52:51] !ack 7024
[22:52:52] 7024 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[22:55:52] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
[22:56:15] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
[22:57:03] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1075.eqiad.wmnet with OS trixie
[22:57:57] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:58:29] !incidents
[22:58:30] 7024 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[22:58:30] 7025 (UNACKED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[22:58:30] 7023 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[22:58:30] 7017 (RESOLVED) Host db1221 (paged)
[22:58:31] 7022 (RESOLVED) db1233 (paged)/MariaDB Replica Lag: s2 (paged)
[22:58:31] 7021 (RESOLVED) db1259 (paged)/MariaDB Replica Lag: s2 (paged)
[22:58:31] 7020 (RESOLVED) db1259 (paged)/MariaDB Replica IO: s2 (paged)
[22:58:31] 7019 (RESOLVED) db1258 (paged)/MariaDB Replica IO: x3 (paged)
[22:58:31] 7018 (RESOLVED) db1258 (paged)/MariaDB Replica Lag: x3 (paged)
[22:58:32] 7016 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[22:58:32] 7015 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqiad)
[22:58:33] !ack 7025
[22:58:33] 7014 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr3-eqsin.wikimedia.org)
[23:03:24] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:05:46] (PS6) JHathaway: UEFI: dup partition on MD RAID boxes [puppet] - https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949)
[23:06:10] FIRING: BFDdown: BFD session down between cr2-eqord and 208.80.154.208 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:06:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[23:07:57] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:08:24] RESOLVED: [2x] ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:10:38] (CR) JHathaway: UEFI: dup partition on MD RAID boxes (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: JHathaway)
[23:11:10] RESOLVED: BFDdown: BFD session down between cr2-eqord and 208.80.154.208 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:11:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[23:22:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[23:22:44] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[23:22:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[23:23:14] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[23:23:14] Deployment mobileapps-production in mobileapps at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mobileapps&var-deployment=mobileapps-production - ...
[23:23:14] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[23:27:16] SRE, Infrastructure-Foundations, netops: lsw1-d6-eqiad outage Nov 18 2025 - https://phabricator.wikimedia.org/T410455#11386635 (cmooney) To try to verify what happened here I tried to make the same change in netbox-next, (with [[ https://netbox-next.wikimedia.org/dcim/devices/6359/ | this ]] being th...
[23:32:07] SRE-swift-storage, Data-Persistence, MediaViewer, Thumbor, Traffic: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062#11386657 (MatthewVernon)
[23:42:32] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1076.eqiad.wmnet with OS trixie
[23:43:01] (PS4) Jforrester: wikifunctions: Bump the orchestrator timeout down a skosh [deployment-charts] - https://gerrit.wikimedia.org/r/1205263 (https://phabricator.wikimedia.org/T407503) (owner: Cory Massaro)
[23:43:01] (PS1) Jforrester: wikifunctions: Upgrade orchestrator from 2025-11-08-223341 to 2025-11-18-175356 [deployment-charts] - https://gerrit.wikimedia.org/r/1206981 (https://phabricator.wikimedia.org/T305612)
[23:44:35] (PS1) Pppery: Update source strings to latest upstream [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1206983
[23:52:37] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1074.eqiad.wmnet with OS trixie
[23:54:16] PROBLEM - Host sretest1006 is DOWN: PING CRITICAL - Packet loss = 100%
[23:56:46] RECOVERY - Host sretest1006 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms