[00:04:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 19.66% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:05:25] FIRING: [6x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:09:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.62% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:10:25] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[00:39:51] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1220032
[00:39:51] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1220032 (owner: TrainBranchBot)
[00:45:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:51:14] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1220032 (owner: TrainBranchBot)
[01:00:38] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:09:51] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1220034
[01:09:51] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1220034 (owner: TrainBranchBot)
[01:32:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:33:20] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1220034 (owner: TrainBranchBot)
[01:47:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:50:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:55:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:32:15] (PS1) Papaul: Add bgp config for mr1-codfw and lsw1-a3-codfw [homer/public] - https://gerrit.wikimedia.org/r/1220035 (https://phabricator.wikimedia.org/T410717)
[02:59:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:09:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:14:33] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.013e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[03:15:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:35:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:36:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:38:57] PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 66%, RTA = 5496.90 ms
[03:39:05] RECOVERY - Host wikikube-worker1275 is UP: PING WARNING - Packet loss = 50%, RTA = 1104.10 ms
[03:41:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:05:25] FIRING: [6x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:10:17] PROBLEM - Host wikikube-worker1053 is DOWN: PING CRITICAL - Packet loss = 71%, RTA = 6513.04 ms
[04:10:25] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:11:09] RECOVERY - Host wikikube-worker1053 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[04:12:43] PROBLEM - SSH on bast4005 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:13:43] RECOVERY - SSH on bast4005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:42:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:45:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:02:43] FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:09:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:32:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[05:34:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:38:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[05:53:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[05:55:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:05:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:10:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:10:59] ops-codfw, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11480498 (Marostegui) Thanks - I just checked that the UEFI partman recipe is assigned to it.
[06:13:06] (CR) Marostegui: [C:+2] data.yaml: Add akhatun to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1219964 (https://phabricator.wikimedia.org/T413140) (owner: Marostegui)
[06:16:16] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for akhatun - https://phabricator.wikimedia.org/T413140#11480506 (Marostegui) Open→Resolved a:Marostegui Patch merged and principal created in Kerberos - you should've received an email to res...
[06:25:13] (PS1) Marostegui: installserver: Placeholder for reuse-db-efi [puppet] - https://gerrit.wikimedia.org/r/1220052
[06:29:46] (CR) Marostegui: [C:+2] installserver: Placeholder for reuse-db-efi [puppet] - https://gerrit.wikimedia.org/r/1220052 (owner: Marostegui)
[06:33:36] (PS1) Marostegui: installserver: Format /srv on es2028 [puppet] - https://gerrit.wikimedia.org/r/1220053 (https://phabricator.wikimedia.org/T412807)
[06:37:52] (CR) Marostegui: [C:+2] installserver: Format /srv on es2028 [puppet] - https://gerrit.wikimedia.org/r/1220053 (https://phabricator.wikimedia.org/T412807) (owner: Marostegui)
[06:38:07] SRE, Infrastructure-Foundations, Patch-For-Review: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11480523 (Marostegui) I talked about this with @MoritzMuehlenhoff past Friday. Looks like the issue is the fact that the h...
[06:38:56] SRE, Infrastructure-Foundations, Patch-For-Review: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11480525 (Marostegui) I just merged a patch to keep es2028 entirely formatted to make sure it is all fixed there and if it...
[06:49:27] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie
[07:05:27] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
[07:09:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:09:28] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
[07:14:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:17:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:22:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:29:12] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:32:39] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2028.codfw.wmnet with OS trixie
[07:33:10] SRE, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11480547 (Marostegui) Open→Resolved a:cmooney The host was reimaged correctly (formatting everything) so closing this. Thank you...
[07:33:39] SRE, Data-Persistence, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11480551 (Marostegui)
[07:34:12] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:39:12] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:44:12] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:47:43] RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:48:13] FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:49:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:49:54] SRE, Data-Persistence, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11480570 (Marostegui)
[07:53:13] RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:54:12] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251222T0800)
[08:02:20] (PS1) Elukey: Fix requirements and rebuild for Trixie [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1220201
[08:02:43] (CR) Elukey: [V:+2 C:+2] Fix requirements and rebuild for Trixie [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1220201 (owner: Elukey)
[08:03:22] !log elukey@deploy2002 Started deploy [docker-pkg/deploy@1664255]: (no justification provided)
[08:03:32] !log elukey@deploy2002 Finished deploy [docker-pkg/deploy@1664255]: (no justification provided) (duration: 00m 11s)
[08:04:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:04:41] !log elukey@deploy2002 Started deploy [docker-pkg/deploy@1664255]: (no justification provided)
[08:04:47] !log elukey@deploy2002 Finished deploy [docker-pkg/deploy@1664255]: (no justification provided) (duration: 00m 07s)
[08:05:25] FIRING: [6x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:09:12] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:10:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:10:25] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:12:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:14:05] !log elukey@deploy2002 Started deploy [docker-pkg/deploy@1664255]: (no justification provided)
[08:14:11] !log elukey@deploy2002 Finished deploy [docker-pkg/deploy@1664255]: (no justification provided) (duration: 00m 08s)
[08:14:12] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:19:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:22:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:26:02] (PS1) Elukey: sre.hosts.provision: make some Supermicro checks dynamic [cookbooks] - https://gerrit.wikimedia.org/r/1220311 (https://phabricator.wikimedia.org/T407991)
[08:28:04] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:29:12] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:31:27] SRE, SRE-swift-storage, Infrastructure-Foundations, serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11480594 (elukey) The sync seems now complete! Nice work :) Next steps before closing: update documentati...
[08:31:45] (CR) CI reject: [V:-1] sre.hosts.provision: make some Supermicro checks dynamic [cookbooks] - https://gerrit.wikimedia.org/r/1220311 (https://phabricator.wikimedia.org/T407991) (owner: Elukey)
[08:32:28] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:36:54] SRE, SRE-swift-storage, Infrastructure-Foundations, serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11480595 (MatthewVernon) I think we're almost-but-not-quite-entirely caught up, but yes, I think the repor...
[08:38:07] (PS2) Elukey: sre.hosts.provision: make some Supermicro checks dynamic [cookbooks] - https://gerrit.wikimedia.org/r/1220311 (https://phabricator.wikimedia.org/T407991)
[08:38:41] (PS1) Kevin Bazira: ml-services: add embeddings isvc to the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1220313 (https://phabricator.wikimedia.org/T412338)
[08:39:12] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:40:10] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:40:13] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:45:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:46:46] (PS3) Elukey: sre.hosts.provision: make some Supermicro checks dynamic [cookbooks] - https://gerrit.wikimedia.org/r/1220311 (https://phabricator.wikimedia.org/T407991)
[08:47:11] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:47:13] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:49:12] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:54:12] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:55:32] (CR) Ozge: ml-services: add embeddings isvc to the experimental namespace (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1220313 (https://phabricator.wikimedia.org/T412338) (owner: Kevin Bazira)
[08:57:23] (PS4) Elukey: sre.hosts.provision: make some Supermicro checks dynamic [cookbooks] - https://gerrit.wikimedia.org/r/1220311 (https://phabricator.wikimedia.org/T407991)
[08:57:46] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:58:19] (CR) Dpogorzelski: [C:+2] ml-builder: clone production images [puppet] - https://gerrit.wikimedia.org/r/1219553 (owner: Dpogorzelski)
[08:59:12] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:00:04] (CR) Elukey: "Tested on db2249, worked fine :)" [cookbooks] - https://gerrit.wikimedia.org/r/1220311 (https://phabricator.wikimedia.org/T407991) (owner: Elukey)
[09:02:47] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[09:03:05] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11480603 (elukey) @Jhancock.wm Hi! I filed https://gerrit.wikimedia.org/r/1220311 to hopefully avoid this in the future, the provision cookbook should be smarte...
[09:05:58] (PS2) Kevin Bazira: ml-services: add embeddings isvc to the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1220313 (https://phabricator.wikimedia.org/T412338)
[09:07:22] (CR) Kevin Bazira: ml-services: add embeddings isvc to the experimental namespace (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1220313 (https://phabricator.wikimedia.org/T412338) (owner: Kevin Bazira)
[09:12:23] (PS8) Arnaudb: mailman: add UpstreamTlsContext on tlsproxy::envoy [puppet] - https://gerrit.wikimedia.org/r/1219770 (https://phabricator.wikimedia.org/T286066)
[09:12:23] (CR) Arnaudb: "This will unblock the connectivity between envoy and httpd." [puppet] - https://gerrit.wikimedia.org/r/1219770 (https://phabricator.wikimedia.org/T286066) (owner: Arnaudb)
[09:13:35] SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11480605 (ABran-WMF)
[09:14:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:24:04] (CR) Ozge: [C:+2] ml-services: add embeddings isvc to the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1220313 (https://phabricator.wikimedia.org/T412338) (owner: Kevin Bazira)
[09:24:12] SRE, SRE-swift-storage, Infrastructure-Foundations, serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11480607 (elukey) Open→Resolved a:elukey Added https://wikitech.wikimedia.org/wiki/Docker-...
[09:24:12] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:24:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:26:16] (Merged) jenkins-bot: ml-services: add embeddings isvc to the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1220313 (https://phabricator.wikimedia.org/T412338) (owner: Kevin Bazira)
[09:29:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:31:19] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:34:12] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:39:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:49:12] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:50:51] (PS1) Elukey: Move wqds10[28-1032] to insetup role [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451)
[09:54:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:55:04] (CR) Clément Goubert: [C:+2] rest-gateway: support REST sandbox requests [deployment-charts] - https://gerrit.wikimedia.org/r/1207267 (https://phabricator.wikimedia.org/T396807) (owner: Aaron Schulz)
[09:55:33] (PS2) Elukey: Move wqds10[28-32] to insetup role [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451)
[09:56:34] (PS1) Dpogorzelski: ml-build: fix image repo clone path [puppet] - https://gerrit.wikimedia.org/r/1220317
[09:56:52] (CR) Dpogorzelski: [C:+2] ml-build: fix image repo clone path [puppet] - https://gerrit.wikimedia.org/r/1220317 (owner: Dpogorzelski)
[09:56:54] (CR) Dpogorzelski: [V:+2 C:+2] ml-build: fix image repo clone path [puppet] - https://gerrit.wikimedia.org/r/1220317 (owner: Dpogorzelski)
[09:57:17] (Merged) jenkins-bot: rest-gateway: support REST sandbox requests [deployment-charts] - https://gerrit.wikimedia.org/r/1207267 (https://phabricator.wikimedia.org/T396807) (owner: Aaron Schulz)
[10:02:25] (CR) D3r1ck01: [C:+1] EditWatchlistPaginate feature flag has been removed from MW code, so remove it from config [mediawiki-config] - https://gerrit.wikimedia.org/r/1214584 (https://phabricator.wikimedia.org/T410908) (owner: Cparle)
[10:04:55] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:05:07] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:07:03] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[10:07:51] (CR) Gehel: [C:-1] "wdqs1028 is already reimaged and being used to test qlever. if you can exclude it from this CR, the rest should be ok." [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451) (owner: Elukey)
[10:11:31] (PS3) Elukey: Move wqds10[29-32] to insetup role [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451)
[10:11:58] (PS4) Elukey: Move wqds10[29-32] to insetup role [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451)
[10:12:01] (CR) CI reject: [V:-1] Move wqds10[29-32] to insetup role [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451) (owner: Elukey)
[10:12:27] (CR) CI reject: [V:-1] Move wqds10[29-32] to insetup role [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451) (owner: Elukey)
[10:12:49] (PS5) Elukey: Move wqds10[29-32] to insetup role [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451)
[10:13:33] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[10:14:09] (PS6) Elukey: Move wqds10[29-32] to insetup role [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451)
[10:14:28] (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451) (owner: Elukey)
[10:14:52] (PS1) Clément Goubert: api-gateway: Fix host_rewrite_path_regex substitution [deployment-charts] - https://gerrit.wikimedia.org/r/1220318 (https://phabricator.wikimedia.org/T396807)
[10:16:33] (CR) Elukey: "PCC looks good!" [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451) (owner: Elukey)
[10:18:27] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[10:18:44] (CR) Clément Goubert: [C:+2] api-gateway: Fix host_rewrite_path_regex substitution [deployment-charts] - https://gerrit.wikimedia.org/r/1220318 (https://phabricator.wikimedia.org/T396807) (owner: Clément Goubert)
[10:20:48] (Merged) jenkins-bot: api-gateway: Fix host_rewrite_path_regex substitution [deployment-charts] - https://gerrit.wikimedia.org/r/1220318 (https://phabricator.wikimedia.org/T396807) (owner: Clément Goubert)
[10:23:15] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:23:19] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:27:53] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:28:03] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:28:16] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[10:28:35] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[10:35:23] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[10:37:00] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[10:40:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[10:41:39] (PS5) Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573)
[10:42:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:42:59] (03CR) 10Clément Goubert: [C:03+2] restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway [puppet] - 10https://gerrit.wikimedia.org/r/1219604 (https://phabricator.wikimedia.org/T396807) (owner: 10Aaron Schulz) [10:46:09] (03CR) 10CI reject: [V:04-1] sre.mysql.newpool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [10:48:30] (03PS6) 10Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) [10:59:15] (03CR) 10Elukey: [C:03+2] Move wqds10[29-32] to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451) (owner: 10Elukey) [11:02:26] (03PS7) 10Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) [11:04:47] !log fceratto@cumin1003 START - Cookbook sre.mysql.newdepool es2051 - test T383674 [11:04:52] T383674: Abstract away different database depooling mechanisms into a cookbook - https://phabricator.wikimedia.org/T383674 [11:05:39] (03CR) 10Btullis: [C:03+1] Move wqds10[29-32] to insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451) (owner: 10Elukey) [11:06:41] PROBLEM - Host wikikube-worker1053 is DOWN: PING CRITICAL - Packet loss = 60%, RTA = 2615.58 ms [11:07:15] RECOVERY - Host wikikube-worker1053 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [11:08:25] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11480693 (10elukey) 
Random thoughts after reading: * Is sretest2003 the only one that shows this behavior, or do we have others? I am particularly interested...
[11:09:59] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie
[11:10:16] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[11:11:39] (PS1) Clément Goubert: Revert "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - https://gerrit.wikimedia.org/r/1220320 (https://phabricator.wikimedia.org/T396807)
[11:12:38] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.newdepool (exit_code=0) es2051 - test T383674
[11:12:42] T383674: Abstract away different database depooling mechanisms into a cookbook - https://phabricator.wikimedia.org/T383674
[11:13:27] (CR) CI reject: [V:-1] Revert "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - https://gerrit.wikimedia.org/r/1220320 (https://phabricator.wikimedia.org/T396807) (owner: Clément Goubert)
[11:13:50] (PS2) Clément Goubert: Revert "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - https://gerrit.wikimedia.org/r/1220320 (https://phabricator.wikimedia.org/T396807)
[11:14:36] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[11:15:53] (03CR) 10Clément Goubert: [C:03+2] Revert "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1220320 (https://phabricator.wikimedia.org/T396807) (owner: 10Clément Goubert) [11:26:15] (03CR) 10Clément Goubert: [C:03+1] imagecatalog: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1219876 (owner: 10Muehlenhoff) [11:26:42] (03PS1) 10Kevin Bazira: ml-services: remove revise-tone-task-generator from experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220321 (https://phabricator.wikimedia.org/T412338) [11:32:13] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [11:32:59] (03CR) 10Marostegui: "This is a long review, so I've started with the not much modified files." [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [11:38:28] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [11:43:12] (03CR) 10Ozge: [C:03+2] ml-services: remove revise-tone-task-generator from experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220321 (https://phabricator.wikimedia.org/T412338) (owner: 10Kevin Bazira) [11:43:53] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [11:45:24] (03Merged) 10jenkins-bot: ml-services: remove revise-tone-task-generator from experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220321 (https://phabricator.wikimedia.org/T412338) (owner: 10Kevin Bazira) [11:56:37] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1029.eqiad.wmnet with OS trixie [11:57:07] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[11:59:56] (03PS1) 10Elukey: installserver: remove pause/debug from wdqs10[29-32] [puppet] - 10https://gerrit.wikimedia.org/r/1220324 (https://phabricator.wikimedia.org/T412451) [12:02:18] (03CR) 10Elukey: [C:03+2] installserver: remove pause/debug from wdqs10[29-32] [puppet] - 10https://gerrit.wikimedia.org/r/1220324 (https://phabricator.wikimedia.org/T412451) (owner: 10Elukey) [12:05:25] FIRING: [6x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:10:25] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:21:00] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:30:31] !log fceratto@cumin1003 START - Cookbook sre.mysql.newpool es2051 gradually with 4 steps - test T383674 [12:30:35] T383674: Abstract away different database depooling mechanisms into a cookbook - https://phabricator.wikimedia.org/T383674 [12:34:30] (03CR) 10Federico Ceratto: "I renamed the cookbook as discussed and fixed the tests." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [12:39:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:39:34] (03CR) 10Marostegui: sre.mysql.newpool: [de]pool various section kinds (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [12:39:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and cr2-eqdfw (195.200.68.152) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:44:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [12:44:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and cr2-eqdfw (195.200.68.152) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:45:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:47:01] FIRING: SLOMetricAbsent: wdqs-main-availability magru - https://slo.wikimedia.org/?search=wdqs-main-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:47:18] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1198 gradually with 4 steps - repooling [12:52:01] FIRING: [2x] SLOMetricAbsent: wdqs-main-availability magru - https://slo.wikimedia.org/?search=wdqs-main-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:02:01] RESOLVED: SLOMetricAbsent: wdqs-scholarly-availability magru - https://slo.wikimedia.org/?search=wdqs-scholarly-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:05:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:15:59] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) es2051 gradually with 4 steps - test T383674 [13:16:03] T383674: Abstract away different database depooling mechanisms into a cookbook - https://phabricator.wikimedia.org/T383674 [13:23:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker11XX - https://phabricator.wikimedia.org/T408749#11480841 (10Jclark-ctr) a:05Jhancock.wm→03Jclark-ctr [13:32:45] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1198 gradually with 4 steps - repooling [13:37:46] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11480863 (10Jclark-ctr) @RKemper the replacement drive should arrive today. 
Is there any way you or @BTullis can prep this to be replaced?
[13:39:23] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1198 - https://phabricator.wikimedia.org/T413336#11480865 (BTullis)
[13:39:29] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1198 - https://phabricator.wikimedia.org/T413336#11480867 (Jclark-ctr) @RKemper @BTullis if anyone is able to prep this so when the disk arrives I can replace it. It should hopefully arrive Tuesday
[13:44:50] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1198 - https://phabricator.wikimedia.org/T413336#11480872 (Jclark-ctr) a:Jclark-ctr
[13:53:44] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1030.eqiad.wmnet with OS trixie
[13:59:02] PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1200 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[13:59:03] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1200 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T413360 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[13:59:12] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1200 - https://phabricator.wikimedia.org/T413360 (ops-monitoring-bot) NEW
[14:01:50] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1198 - https://phabricator.wikimedia.org/T413336#11480910 (Jclark-ctr) Dell Sr 220434974
[14:11:59] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1030.eqiad.wmnet with reason: host reimage [14:13:54] (03PS1) 10Marostegui: Revert "installserver: Format /srv on es2028" [puppet] - 10https://gerrit.wikimedia.org/r/1220343 [14:15:51] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1030.eqiad.wmnet with reason: host reimage [14:16:00] (03CR) 10Marostegui: [C:03+2] Revert "installserver: Format /srv on es2028" [puppet] - 10https://gerrit.wikimedia.org/r/1220343 (owner: 10Marostegui) [14:18:50] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [14:18:58] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [14:20:43] (03PS8) 10Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) [14:22:40] (03CR) 10Urbanecm: "FWIW, this is blocked by the _second_ train departing (rather than first train finishing). This is because we need to be at the point of n" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219541 (https://phabricator.wikimedia.org/T411479) (owner: 10Sergio Gimeno) [14:23:57] (03CR) 10Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [14:25:51] (03CR) 10CI reject: [V:04-1] sre.mysql.newpool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [14:28:03] 10ops-eqiad, 06SRE, 06DC-Ops: eno8303 on db1258:9100 has the wrong speed: 1.25e+07. 
- https://phabricator.wikimedia.org/T413320#11480943 (Jclark-ctr) Open→Resolved a:Jclark-ctr Replaced optic
[14:32:16] !log serveraction powercycle restbase2034 (down, unresponsive)
[14:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:25] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1030.eqiad.wmnet with OS trixie
[14:35:23] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1031.eqiad.wmnet with OS trixie
[14:35:25] (CR) Marostegui: sre.mysql.newpool: [de]pool various section kinds (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: Federico Ceratto)
[14:36:57] (PS1) Dpogorzelski: docker registry: add ml build user password [puppet] - https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524)
[14:37:18] RECOVERY - Host restbase2034 is UP: PING OK - Packet loss = 0%, RTA = 30.31 ms
[14:38:02] PROBLEM - Check whether ferm is active by checking the default input chain on restbase2034 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:40:31] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:42:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:44:12] RESOLVED: [6x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:49:26] (03PS1) 10Bking: opensearch-cluster: remove broken setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220356 (https://phabricator.wikimedia.org/T412447) [14:52:21] (03PS2) 10Dpogorzelski: docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) [14:53:49] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage [14:55:10] (03PS3) 10Dpogorzelski: docker registry: add ml build user password [puppet] - 10https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524) [14:57:06] 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364 (10KReid-WMF) 03NEW [14:58:06] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage [15:01:11] (03CR) 10Bking: [C:03+2] opensearch-cluster: remove broken setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220356 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [15:01:29] (03CR) 10Btullis: [C:03+1] opensearch-cluster: remove broken setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220356 (https://phabricator.wikimedia.org/T412447) (owner: 10Bking) [15:04:59] (03CR) 10Alexandros Kosiaris: [C:03+2] "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1219536 (owner: 10Alexandros Kosiaris) [15:08:01] RECOVERY - Check whether ferm is active by checking the default input chain on restbase2034 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:09:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:12:19] (03Merged) 10jenkins-bot: kube-state-metrics: Remove limits/requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1219536 (owner: 10Alexandros Kosiaris) [15:24:12] 06SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11481045 (10Novem_Linguae) [15:34:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:37] !log akosiaris@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:36:55] !log akosiaris@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:37:49] !log akosiaris@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:38:05] !log akosiaris@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:38:11] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:38:26] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:38:33] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[15:38:35] !log remove limits from kube-state-metrics in wikikube and wikikube-staging clusters, no point in resource limits for this workload, it's an important cluster component
[15:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:00] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[15:39:32] (PS4) Dpogorzelski: docker registry: add ml build user password [puppet] - https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524)
[15:41:06] Hello, I'd like to do an emergency deployment for this patch: 1220016: SpecialPageLanguage: Use OOUI infuse if language selector is present | https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1220016.
[15:42:04] (CR) Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: Federico Ceratto)
[15:42:24] abijeet: No objection from me, Emperor fabfur, heads up
[15:44:29] !log akosiaris@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[15:44:48] !log akosiaris@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[15:44:53] !log akosiaris@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[15:44:58] !log remove limits from kube-state-metrics in ml-serve-{eqiad,codfw} ml-staging-codfw dse-k8s-{eqiad,codfw} aux-k8s-{eqiad,codfw} kubernetes clusters. No point in resource limits for this workload, it's an important cluster component.
[15:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:09] !log akosiaris@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[15:45:15] !log akosiaris@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[15:45:22] (CR) Marostegui: sre.mysql.newpool: [de]pool various section kinds (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: Federico Ceratto)
[15:45:31] !log akosiaris@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[15:45:37] !log akosiaris@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:47:35] !log akosiaris@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:47:40] !log akosiaris@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[15:47:55] !log akosiaris@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[15:48:00] !log akosiaris@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:48:18] !log akosiaris@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:48:22] !log akosiaris@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
[15:48:38] !log akosiaris@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
[15:49:44] !log jclark@cumin1003 START - Cookbook sre.dns.netbox [15:50:53] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1031.eqiad.wmnet with OS trixie [15:51:23] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1031.eqiad.wmnet with OS trixie [15:53:32] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube1328-34 servers - jclark@cumin1003" [15:53:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube1328-34 servers - jclark@cumin1003" [15:53:37] !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:55:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:56:25] abijeet: Are you going to deploy yourself or does that patch need a deployer? [15:58:25] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1328.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [15:59:24] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1328.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:01:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker11XX - https://phabricator.wikimedia.org/T408749#11481183 (10Jclark-ctr) these are racked cabled and setup running into error with provisioning same as T407991 ` Retrieving the BMC's firmware version. BMC firmware release... 
[16:02:38] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481190 (Jclark-ctr)
[16:02:59] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481192 (Jclark-ctr)
[16:05:05] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1329.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:05:42] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1330.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:06:01] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1329.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:06:18] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1330.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:06:34] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481198 (Clement_Goubert) Hmm it's very possible these hosts need UEFI and the partman recipe are wrong, on top of needing @elukey's [[ https://gerrit.wikimedia.org/r/c/...
[16:06:37] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1331.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:07:03] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1332.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:07:18] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1331.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:07:46] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1332.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:08:35] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1333.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:08:50] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1334.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:09:16] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1333.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:09:22] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1334.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:09:23] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage [16:10:25] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:10:35] (03PS1) 10Btullis: Revert changes to the principal for the spark-thriftserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220369 (https://phabricator.wikimedia.org/T410017) [16:10:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install 
wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481210 (Jclark-ctr) @Clement_Goubert thanks for checking these. I had not even looked at the patch yet. These would be UEFI
[16:14:17] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage
[16:14:46] (CR) JavierMonton: [C:+2] Revert changes to the principal for the spark-thriftserver [deployment-charts] - https://gerrit.wikimedia.org/r/1220369 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[16:15:42] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1328.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:15:58] (PS1) Clément Goubert: partman: New wikikube-worker need UEFI [puppet] - https://gerrit.wikimedia.org/r/1220370 (https://phabricator.wikimedia.org/T408749)
[16:16:53] (Merged) jenkins-bot: Revert changes to the principal for the spark-thriftserver [deployment-charts] - https://gerrit.wikimedia.org/r/1220369 (https://phabricator.wikimedia.org/T410017) (owner: Btullis)
[16:19:18] (CR) Elukey: [C:+1] partman: New wikikube-worker need UEFI [puppet] - https://gerrit.wikimedia.org/r/1220370 (https://phabricator.wikimedia.org/T408749) (owner: Clément Goubert)
[16:19:30] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11481248 (Clement_Goubert) These are Supermicro nodes and require UEFI. Patch underway.
[16:20:00] (03CR) 10Clément Goubert: [C:03+2] partman: New wikikube-worker need UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1220370 (https://phabricator.wikimedia.org/T408749) (owner: 10Clément Goubert) [16:20:59] (03PS2) 10Sbisson: Fix section loading on desktop [extensions/ContentTranslation] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219900 (https://phabricator.wikimedia.org/T413305) [16:21:56] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1330.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:21:58] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1331.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:22:21] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1331.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:22:28] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1330.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:23:08] Hi Amir1, fabfur, requesting permission to deploy Content Translation UBN T413305 [16:23:09] T413305: ContentTranslation shows wrong text of section of source page at section translation - https://phabricator.wikimedia.org/T413305 [16:25:07] (03PS1) 10Clément Goubert: partman: New mc nodes need UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1220373 (https://phabricator.wikimedia.org/T412255) [16:26:17] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1328.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:28:17] hi stephanebisson ok for me [16:29:10] FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:29:10] fabfur, ok thanks, going ahead now [16:29:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219900 (https://phabricator.wikimedia.org/T413305) (owner: 10Sbisson) [16:30:36] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1329.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:32:16] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1031.eqiad.wmnet with OS trixie [16:33:55] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1032.eqiad.wmnet with OS trixie [16:34:10] RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:36:09] (03PS1) 10Fabfur: Add TXT record for jamf [dns] - 10https://gerrit.wikimedia.org/r/1220380 (https://phabricator.wikimedia.org/T413259) [16:40:29] (03Merged) 10jenkins-bot: Fix section loading on desktop [extensions/ContentTranslation] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219900 (https://phabricator.wikimedia.org/T413305) (owner: 10Sbisson) [16:41:08] !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1219900|Fix section loading on desktop (T413305)]] [16:41:12] T413305: ContentTranslation shows wrong text of section of source page at section translation - https://phabricator.wikimedia.org/T413305 [16:41:35] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1329.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:42:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on 
mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:43:33] (03PS2) 10Fabfur: Add TXT record for jamf [dns] - 10https://gerrit.wikimedia.org/r/1220380 (https://phabricator.wikimedia.org/T413259) [16:43:55] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1330.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:44:34] (03CR) 10BCornwall: [C:03+1] Add TXT record for jamf [dns] - 10https://gerrit.wikimedia.org/r/1220380 (https://phabricator.wikimedia.org/T413259) (owner: 10Fabfur) [16:45:01] (03CR) 10Fabfur: [C:03+2] Add TXT record for jamf [dns] - 10https://gerrit.wikimedia.org/r/1220380 (https://phabricator.wikimedia.org/T413259) (owner: 10Fabfur) [16:45:29] !log fabfur@dns1004 START - running authdns-update [16:46:32] !log fabfur@dns1004 END - running authdns-update [16:46:53] 06SRE, 10DNS, 06Traffic, 13Patch-For-Review: [Update DNS Record Request] - wikimedia.org - https://phabricator.wikimedia.org/T413259#11481405 (10Fabfur) The entry has been added with https://gerrit.wikimedia.org/r/c/operations/dns/+/1220380 and should be propagated shortly [16:46:55] (03CR) 10VolkerE: [C:04-1] "I'd want to see an optimized SVG according to our SVG opt guidelines to have a better representation of production environment, otherwise " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219865 (https://phabricator.wikimedia.org/T413217) (owner: 10Aude) [16:49:45] !log lvextend /dev/vg0/srv on titan1001, titan1002, titan2002. 
T410152 [16:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:49] T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152 [16:52:22] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1032.eqiad.wmnet with reason: host reimage [16:53:22] (03PS1) 10Tiziano Fogli: thanos-compact: reduce concurrency to 2 [puppet] - 10https://gerrit.wikimedia.org/r/1220385 (https://phabricator.wikimedia.org/T410152) [16:54:30] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1330.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:54:55] (03CR) 10Herron: [C:03+1] thanos-compact: reduce concurrency to 2 [puppet] - 10https://gerrit.wikimedia.org/r/1220385 (https://phabricator.wikimedia.org/T410152) (owner: 10Tiziano Fogli) [16:55:44] (03CR) 10Tiziano Fogli: [C:03+2] thanos-compact: reduce concurrency to 2 [puppet] - 10https://gerrit.wikimedia.org/r/1220385 (https://phabricator.wikimedia.org/T410152) (owner: 10Tiziano Fogli) [16:58:58] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1032.eqiad.wmnet with reason: host reimage [17:00:37] elukey@cumin1003 provision (PID 3950609) is awaiting input [17:00:44] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1331.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:02:50] (03PS9) 10Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) [17:11:19] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1331.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:13:21] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1332.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:14:23] Amir1, fabfur: 
deployment has been stuck for a while. According to the logs, it looks like pushing the image to the registry has failed. Any idea what to do now? [17:17:35] stephanebisson: o/ I've seen occurrences of very long deployments stuck in the pushing step, even 40/50 mins. [17:18:07] have you already waited that amount of time? If so I'd suggest to retry the deployment, if possible, because there is no easy solution [17:18:32] we have https://phabricator.wikimedia.org/T412951 planned for early Q3 that should make things better [17:18:47] elukey yeah it was stuck at the pushing step for 22 minutes but then: subprocess.CalledProcessError: Command '['sudo', '/usr/local/bin/docker-pusher', '-q', 'docker-registry.discovery.wmnet/restricted/mediawiki-singleversion:2025-12-22-010154-publish-83-next']' returned non-zero exit status 1. [17:19:20] I'll retry a little later if it's ok [17:20:21] stephanebisson: yeah it seems good, I may not be around and other SRE folks are probably too (basically on-call for pages). How urgent is this deployment? 
[17:20:47] elukey Content Translation UBN [17:20:58] lovely [17:22:43] !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1219900|Fix section loading on desktop (T413305)]] [17:22:47] T413305: ContentTranslation shows wrong text of section of source page at section translation - https://phabricator.wikimedia.org/T413305 [17:24:00] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1332.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:24:44] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1333.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:26:28] !log mforns@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [17:26:59] !log mforns@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [17:29:37] !log mforns@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [17:29:51] !log mforns@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [17:30:27] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1032.eqiad.wmnet with OS trixie [17:32:41] PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 77%, RTA = 3618.00 ms [17:33:03] RECOVERY - Host wikikube-worker1275 is UP: PING OK - Packet loss = 0%, RTA = 483.17 ms [17:35:16] RESOLVED: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:35:20] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1333.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:35:34] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1334.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:36:59] It failed again to push the 
image [17:46:09] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1334.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:50:58] !log sbisson@deploy2002 sync-world failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.5,1.46.0-wmf.7,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.230.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediawiki-staging/scap/image-build' returned non-zero exit status 1.
(scap version: 4.230.0) (duration: 28m 14s) [17:57:06] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1200 - https://phabricator.wikimedia.org/T413360#11481622 (10VRiley-WMF) a:03VRiley-WMF [18:00:49] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1328.eqiad.wmnet with OS trixie [18:01:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481625 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1328.eqiad.wmnet with OS trixie [18:02:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:10:59] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T413385 (10ops-monitoring-bot) 03NEW [18:12:10] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T413385#11481688 (10Marostegui) This is a sanitarium host. Can we get a disk to replace the failed one? Thanks [18:12:23] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T413385#11481689 (10Marostegui) p:05Triage→03High [18:16:44] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: root user not on newest batches of supermicro servers. - https://phabricator.wikimedia.org/T412458#11481709 (10wiki_willy) a:03VRiley-WMF [18:22:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481731 (10Jclark-ctr) @elukey Ran into another provisioning issue. It looks like IPv4 PXE was disabled. The screenshot was taken after I changed it, and that was the only... 
[18:28:43] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1328.eqiad.wmnet with reason: host reimage [18:33:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1328.eqiad.wmnet with reason: host reimage [18:36:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:47:05] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1334.eqiad.wmnet with OS trixie [18:47:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1334.eqiad.wmnet with OS trixie [18:48:56] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1329.eqiad.wmnet with OS trixie [18:49:06] !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1219900|Fix section loading on desktop (T413305)]] [18:49:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481798 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1329.eqiad.wmnet with OS trixie [18:49:10] T413305: ContentTranslation shows wrong text of section of source page at section translation - https://phabricator.wikimedia.org/T413305 [18:49:46] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [18:49:51] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1330.eqiad.wmnet with OS trixie [18:50:03] 10ops-eqiad, 
06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481800 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1330.eqiad.wmnet with OS trixie [18:50:16] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [18:50:17] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1328.eqiad.wmnet with OS trixie [18:50:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1328.eqiad.wmnet with OS trixie completed: - wikiku... [18:50:41] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1331.eqiad.wmnet with OS trixie [18:50:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481802 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1331.eqiad.wmnet with OS trixie [18:52:18] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1332.eqiad.wmnet with OS trixie [18:52:25] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1333.eqiad.wmnet with OS trixie [18:52:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1332.eqiad.wmnet with OS trixie [18:52:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: 
Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1333.eqiad.wmnet with OS trixie [18:58:33] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1334.eqiad.wmnet with reason: host reimage [18:59:59] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1329.eqiad.wmnet with reason: host reimage [19:01:01] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1330.eqiad.wmnet with reason: host reimage [19:01:52] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1331.eqiad.wmnet with reason: host reimage [19:02:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:03:33] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1332.eqiad.wmnet with reason: host reimage [19:03:48] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1333.eqiad.wmnet with reason: host reimage [19:03:58] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1334.eqiad.wmnet with reason: host reimage [19:05:18] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T413385#11481816 (10Jclark-ctr) a:03Jclark-ctr This server is out of warranty can it be replaced at any time? 
Solid State Disk 0:1:2 Removed 2 1787.88 GB Not Capable SATA SSD No 99% [19:06:22] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T413385#11481819 (10Marostegui) Yes, you can go for it any time [19:07:00] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T413385#11481820 (10Jclark-ctr) I am not on site at this moment but i am going to wipe an old drive from decom server prior to installing [19:07:04] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1329.eqiad.wmnet with reason: host reimage [19:07:30] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T413385#11481822 (10Marostegui) Thanks! [19:11:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:11:39] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: root user not on newest batches of supermicro servers. - https://phabricator.wikimedia.org/T412458#11481828 (10Jclark-ctr) 05Open→03Resolved a:05VRiley-WMF→03Jclark-ctr The servers VRiley had were not at a point where th... 
[19:14:09] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1332.eqiad.wmnet with reason: host reimage [19:18:24] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1331.eqiad.wmnet with reason: host reimage [19:20:01] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:20:18] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:20:18] !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1219900|Fix section loading on desktop (T413305)]] [19:20:19] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1334.eqiad.wmnet with OS trixie [19:20:24] T413305: ContentTranslation shows wrong text of section of source page at section translation - https://phabricator.wikimedia.org/T413305 [19:20:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481842 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1334.eqiad.wmnet with OS trixie completed: - wikiku... 
[19:22:14] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1330.eqiad.wmnet with reason: host reimage [19:22:41] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:23:00] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:23:01] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1329.eqiad.wmnet with OS trixie [19:23:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481845 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1329.eqiad.wmnet with OS trixie completed: - wikiku... [19:26:04] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1333.eqiad.wmnet with reason: host reimage [19:26:41] !log sbisson@deploy2002 sbisson: Backport for [[gerrit:1219900|Fix section loading on desktop (T413305)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[19:26:44] T413305: ContentTranslation shows wrong text of section of source page at section translation - https://phabricator.wikimedia.org/T413305 [19:28:15] !log sbisson@deploy2002 sbisson: Continuing with sync [19:30:28] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:30:47] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:30:48] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1332.eqiad.wmnet with OS trixie [19:30:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1332.eqiad.wmnet with OS trixie completed: - wikiku... [19:34:25] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:34:47] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:34:48] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1331.eqiad.wmnet with OS trixie [19:34:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481886 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1331.eqiad.wmnet with OS trixie completed: - wikiku... 
[19:36:02] (03PS1) 10Mforns: Bump up the page-analytics service image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220390 (https://phabricator.wikimedia.org/T405041) [19:39:28] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:39:46] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:39:47] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1330.eqiad.wmnet with OS trixie [19:39:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481900 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1330.eqiad.wmnet with OS trixie completed: - wikiku... 
[19:40:43] (03CR) 10Santiago Faci: [C:03+2] Bump up the page-analytics service image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220390 (https://phabricator.wikimedia.org/T405041) (owner: 10Mforns) [19:41:02] !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1219900|Fix section loading on desktop (T413305)]] (duration: 20m 44s) [19:41:07] T413305: ContentTranslation shows wrong text of section of source page at section translation - https://phabricator.wikimedia.org/T413305 [19:42:50] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:42:53] (03Merged) 10jenkins-bot: Bump up the page-analytics service image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220390 (https://phabricator.wikimedia.org/T405041) (owner: 10Mforns) [19:43:08] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003" [19:43:09] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1333.eqiad.wmnet with OS trixie [19:43:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481916 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1333.eqiad.wmnet with OS trixie completed: - wikiku... 
[19:44:02] !log mforns@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply [19:44:16] !log mforns@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [19:44:34] !log mforns@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [19:44:49] !log mforns@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [19:45:05] !log mforns@deploy2002 helmfile [codfw] START helmfile.d/services/page-analytics: apply [19:45:17] !log mforns@deploy2002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [19:47:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481922 (10Jclark-ctr) [19:48:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481923 (10Jclark-ctr) 05Open→03Resolved [19:58:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:03:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:10:25] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:32:20] FIRING: CirrusSearchFullTextLatencyTooHigh: 
CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [20:37:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [20:43:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1275:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1275 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:06:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11482291 (10VRiley-WMF) wikikube-worker1360 B2 U18 CableID 5003 Port 27 wikikube-worker1361 B4 U36 CableID 5369 Port 47 wikikube-worker1362 C3 U37 CableID 230304500071 Por... 
[21:20:20] (03CR) 10CDanis: [C:03+2] P:puppetserver::volatile add Spur anonymous-residential feed [puppet] - 10https://gerrit.wikimedia.org/r/1219881 (owner: 10Slyngshede) [21:23:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1275:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1275 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:24:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [21:24:27] (03PS1) 10CDanis: spur_feeds: fix path typo [puppet] - 10https://gerrit.wikimedia.org/r/1220398 [21:25:08] (03CR) 10CDanis: [C:03+2] spur_feeds: fix path typo [puppet] - 10https://gerrit.wikimedia.org/r/1220398 (owner: 10CDanis) [21:29:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [21:33:08] (03PS1) 10CDanis: spur_feeds: fix outfile/outdir confusion [puppet] - 10https://gerrit.wikimedia.org/r/1220401 [21:34:10] FIRING: [3x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:34:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-magru (2a02:ec80:700:fe0a::2) - group Confed_magru - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr1-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:35:00] (CR) CDanis: [C:+2] spur_feeds: fix outfile/outdir confusion [puppet] - https://gerrit.wikimedia.org/r/1220401 (owner: CDanis) [21:35:25] FIRING: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:37:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [21:39:10] FIRING: [7x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:39:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:40:25] RESOLVED: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:41:56] (PS1) CDanis: spur_feeds: use root user [puppet] - https://gerrit.wikimedia.org/r/1220402 [21:43:25] FIRING: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:44:07] (CR) CDanis: [C:+2] spur_feeds: use root user [puppet] - https://gerrit.wikimedia.org/r/1220402 (owner: CDanis) [21:44:10] FIRING: [11x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:44:39] RESOLVED: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:46:39] FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:47:55] ops-magru: Inbound errors on interface cr1-magru:xe-0/1/1 (Transport: cr2-eqiad:xe-1/0/1:3 (Telxius, CRT-008508) {#70089}) - https://phabricator.wikimedia.org/T413409 (phaultfinder) NEW [21:49:10] FIRING: [15x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:51:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:52:17] SRE, Fundraising-Backlog, Fundraising-Tech-Roadmap, MediaWiki-extensions-CentralNotice, and 2 others: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11482403 (AKanji-WMF) [21:53:00] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:53:25] RESOLVED: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:53:39] FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:53:58] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:54:10] FIRING: [16x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:54:58] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:55:00] RECOVERY - OSPF status on cr1-magru is 
OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:57:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [21:57:31] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [21:58:39] RESOLVED: [3x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [21:59:10] FIRING: [13x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [21:59:39] FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:02:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - 
https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [22:02:31] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [22:04:10] FIRING: [14x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:04:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:06:00] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 
1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:06:39] FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:09:00] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:09:10] FIRING: [13x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:11:39] RESOLVED: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:12:58] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:13:00] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 
1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:13:58] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:14:00] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:14:10] RESOLVED: [11x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:14:40] FIRING: [5x] BFDdown: BFD session down between cr2-eqiad and 2a02:ec80:700:fe0a::2 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:15:10] FIRING: [6x] BFDdown: BFD session down between cr1-magru and fe80::8618:88ff:fe0d:d947 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:15:39] FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:19:40] FIRING: [10x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:19:45] FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:20:10] FIRING: [12x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:20:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:23:00] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:23:00] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:23:58] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:24:40] FIRING: [11x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:25:00] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:25:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:27:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn 
[22:27:39] FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:29:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [22:29:40] RESOLVED: [7x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:30:55] FIRING: [4x] BFDdown: BFD session down between cr2-eqiad and 2a02:ec80:700:fe0a::2 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:31:10] FIRING: [5x] BFDdown: BFD session down between cr1-magru and fe80::8618:88ff:fe0d:d947 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:32:00] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 
1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:32:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:33:00] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:33:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:34:40] FIRING: [12x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:36:10] FIRING: [12x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:38:39] RESOLVED: [3x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:39:40] FIRING: [12x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:41:10] FIRING: [11x] BFDdown: BFD session down between cr1-magru and fe80::8618:88ff:fe0d:d947 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:43:39] FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:44:40] FIRING: [12x] BFDdown: BFD session down between cr1-magru and fe80::8618:88ff:fe0d:d947 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:45:00] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:46:10] FIRING: [12x] BFDdown: BFD session down between cr1-magru and fe80::8618:88ff:fe0d:d947 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:48:00] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:49:40] RESOLVED: [12x] BFDdown: BFD session down between cr1-magru and fe80::8618:88ff:fe0d:d947 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:49:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:51:00] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 
1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:51:55] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 2a02:ec80:700:fe0a::2 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:53:00] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:54:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [22:54:40] RESOLVED: [6x] BFDdown: BFD session down between cr1-magru and fe80::8618:88ff:fe0d:d947 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [22:54:58] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 7/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:55:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [22:55:58] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:58:39] RESOLVED: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [22:59:40] FIRING: [5x] BFDdown: BFD session down between cr1-magru and fe80::8618:88ff:fe0d:d947 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:01:00] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:01:39] FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:02:30] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:02:55] FIRING: [7x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:04:40] FIRING: [6x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:05:00] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:06:39] RESOLVED: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:07:00] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:07:39] FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:07:55] FIRING: [7x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:07:58] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:09:40] FIRING: [8x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:10:58] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:12:55] FIRING: [8x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:14:40] FIRING: [9x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - 
https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:15:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:17:39] RESOLVED: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:17:55] RESOLVED: [8x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:18:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-magru (2a02:ec80:700:fe0a::2) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr1-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:19:40] FIRING: [8x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:22:55] FIRING: [8x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:23:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-magru (2a02:ec80:700:fe0a::2) - group Confed_magru - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr1-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:24:40] RESOLVED: [7x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:25:02] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:25:46] (PS1) Cwhite: logstash: put logging-sd200[567] in service [puppet] - https://gerrit.wikimedia.org/r/1220405 (https://phabricator.wikimedia.org/T413414) [23:25:48] (PS1) Cwhite: logstash: put logging-sd100[567] in service [puppet] - https://gerrit.wikimedia.org/r/1220406 (https://phabricator.wikimedia.org/T413414) [23:26:40] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 2a02:ec80:700:fe0a::2 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:26:58] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:27:55] FIRING: [8x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:28:58] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:29:39] FIRING: CoreBGPDown: Core BGP session 
down between cr2-eqiad and cr1-magru (2a02:ec80:700:fe0a::2) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr1-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:31:02] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:31:40] FIRING: [10x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:32:55] FIRING: [11x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:34:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-magru (2a02:ec80:700:fe0a::2) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr1-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:36:40] FIRING: [12x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:37:55] FIRING: [12x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:39:41] (CR) Cwhite: [C:+2] logstash: put logging-sd200[567] in service [puppet] - https://gerrit.wikimedia.org/r/1220405 
(https://phabricator.wikimedia.org/T413414) (owner: Cwhite) [23:39:58] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 7/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:40:58] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:41:40] FIRING: [12x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:42:02] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:42:55] FIRING: [10x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:43:58] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 7/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:44:02] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:46:40] FIRING: [9x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:47:04] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 
1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:47:55] FIRING: [10x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:50:39] FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [23:51:40] FIRING: [10x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:51:58] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:52:04] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:52:55] FIRING: [9x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:54:59] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:55:07] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 
1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:56:40] FIRING: [9x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:57:55] FIRING: [10x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [23:57:59] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status