[00:04:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 19.66% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:05:25] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:09:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.62% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:10:25] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[00:39:51] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1220032
[00:39:51] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1220032 (owner: TrainBranchBot)
[00:45:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:51:14] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1220032 (owner: TrainBranchBot)
[01:00:38] <logmsgbot>	 !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:09:51] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1220034
[01:09:51] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1220034 (owner: TrainBranchBot)
[01:32:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:33:20] <wikibugs>	 (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1220034 (owner: TrainBranchBot)
[01:47:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:50:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[01:55:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:32:15] <wikibugs>	 (PS1) Papaul: Add bgp config for mr1-codfw and lsw1-a3-codfw [homer/public] - https://gerrit.wikimedia.org/r/1220035 (https://phabricator.wikimedia.org/T410717)
[02:59:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:09:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:14:33] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.013e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[03:15:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:35:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:36:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[03:38:57] <icinga-wm>	 PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 66%, RTA = 5496.90 ms
[03:39:05] <icinga-wm>	 RECOVERY - Host wikikube-worker1275 is UP: PING WARNING - Packet loss = 50%, RTA = 1104.10 ms
[03:41:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:05:25] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:10:17] <icinga-wm>	 PROBLEM - Host wikikube-worker1053 is DOWN: PING CRITICAL - Packet loss = 71%, RTA = 6513.04 ms
[04:10:25] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:11:09] <icinga-wm>	 RECOVERY - Host wikikube-worker1053 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[04:12:43] <icinga-wm>	 PROBLEM - SSH on bast4005 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:13:43] <icinga-wm>	 RECOVERY - SSH on bast4005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:42:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:45:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:02:43] <jinxer-wm>	 FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:09:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:32:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[05:34:12] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:38:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[05:53:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[05:55:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:05:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:10:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:10:59] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11480498 (Marostegui) Thanks - I just checked that the UEFI partman recipe is assigned to it.
[06:13:06] <wikibugs>	 (CR) Marostegui: [C:+2] data.yaml: Add akhatun to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1219964 (https://phabricator.wikimedia.org/T413140) (owner: Marostegui)
[06:16:16] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for akhatun - https://phabricator.wikimedia.org/T413140#11480506 (Marostegui) Open→Resolved a:Marostegui Patch merged and principal created in Kerberos - you should've received an email to res...
[06:25:13] <wikibugs>	 (PS1) Marostegui: installserver: Placeholder for reuse-db-efi [puppet] - https://gerrit.wikimedia.org/r/1220052
[06:29:46] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Placeholder for reuse-db-efi [puppet] - https://gerrit.wikimedia.org/r/1220052 (owner: Marostegui)
[06:33:36] <wikibugs>	 (PS1) Marostegui: installserver: Format /srv on es2028 [puppet] - https://gerrit.wikimedia.org/r/1220053 (https://phabricator.wikimedia.org/T412807)
[06:37:52] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Format /srv on es2028 [puppet] - https://gerrit.wikimedia.org/r/1220053 (https://phabricator.wikimedia.org/T412807) (owner: Marostegui)
[06:38:07] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11480523 (Marostegui) I talked about this with @MoritzMuehlenhoff past Friday. Looks like the issue is the fact that the h...
[06:38:56] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11480525 (Marostegui) I just merged a patch to keep es2028 entirely formatted to make sure it is all fixed there and if it...
[06:49:27] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie
[07:05:27] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
[07:09:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.08% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:09:28] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
[07:14:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 22.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:17:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 21.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:22:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:29:12] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:32:39] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2028.codfw.wmnet with OS trixie
[07:33:10] <wikibugs>	 SRE, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11480547 (Marostegui) Open→Resolved a:cmooney The host was reimaged correctly (formatting everything) so closing this. Thank you...
[07:33:39] <wikibugs>	 SRE, Data-Persistence, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11480551 (Marostegui)
[07:34:12] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:39:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:44:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:47:43] <jinxer-wm>	 RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:48:13] <jinxer-wm>	 FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:49:12] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:49:54] <wikibugs>	 SRE, Data-Persistence, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11480570 (Marostegui)
[07:53:13] <jinxer-wm>	 RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:54:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251222T0800)
[08:02:20] <wikibugs>	 (PS1) Elukey: Fix requirements and rebuild for Trixie [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1220201
[08:02:43] <wikibugs>	 (CR) Elukey: [V:+2 C:+2] Fix requirements and rebuild for Trixie [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/1220201 (owner: Elukey)
[08:03:22] <logmsgbot>	 !log elukey@deploy2002 Started deploy [docker-pkg/deploy@1664255]: (no justification provided)
[08:03:32] <logmsgbot>	 !log elukey@deploy2002 Finished deploy [docker-pkg/deploy@1664255]: (no justification provided) (duration: 00m 11s)
[08:04:12] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:04:41] <logmsgbot>	 !log elukey@deploy2002 Started deploy [docker-pkg/deploy@1664255]: (no justification provided)
[08:04:47] <logmsgbot>	 !log elukey@deploy2002 Finished deploy [docker-pkg/deploy@1664255]: (no justification provided) (duration: 00m 07s)
[08:05:25] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:09:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:10:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:10:25] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:12:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:14:05] <logmsgbot>	 !log elukey@deploy2002 Started deploy [docker-pkg/deploy@1664255]: (no justification provided)
[08:14:11] <logmsgbot>	 !log elukey@deploy2002 Finished deploy [docker-pkg/deploy@1664255]: (no justification provided) (duration: 00m 08s)
[08:14:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:19:12] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:22:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:26:02] <wikibugs>	 (PS1) Elukey: sre.hosts.provision: make some Supermicro checks dynamic [cookbooks] - https://gerrit.wikimedia.org/r/1220311 (https://phabricator.wikimedia.org/T407991)
[08:28:04] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:29:12] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:31:27] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations, serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11480594 (elukey) The sync seems now complete! Nice work :)  Next steps before closing: update documentati...
[08:31:45] <wikibugs>	 (CR) CI reject: [V:-1] sre.hosts.provision: make some Supermicro checks dynamic [cookbooks] - https://gerrit.wikimedia.org/r/1220311 (https://phabricator.wikimedia.org/T407991) (owner: Elukey)
[08:32:28] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:36:54] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations, serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11480595 (MatthewVernon) I think we're almost-but-not-quite-entirely caught up, but yes, I think the repor...
[08:38:07] <wikibugs>	 (PS2) Elukey: sre.hosts.provision: make some Supermicro checks dynamic [cookbooks] - https://gerrit.wikimedia.org/r/1220311 (https://phabricator.wikimedia.org/T407991)
[08:38:41] <wikibugs>	 (PS1) Kevin Bazira: ml-services: add embeddings isvc to the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1220313 (https://phabricator.wikimedia.org/T412338)
[08:39:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:40:10] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:40:13] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:45:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:46:46] <wikibugs>	 (PS3) Elukey: sre.hosts.provision: make some Supermicro checks dynamic [cookbooks] - https://gerrit.wikimedia.org/r/1220311 (https://phabricator.wikimedia.org/T407991)
[08:47:11] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:47:13] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:49:12] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:54:12] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:55:32] <wikibugs>	 (CR) Ozge: ml-services: add embeddings isvc to the experimental namespace (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1220313 (https://phabricator.wikimedia.org/T412338) (owner: Kevin Bazira)
[08:57:23] <wikibugs>	 (PS4) Elukey: sre.hosts.provision: make some Supermicro checks dynamic [cookbooks] - https://gerrit.wikimedia.org/r/1220311 (https://phabricator.wikimedia.org/T407991)
[08:57:46] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[08:58:19] <wikibugs>	 (CR) Dpogorzelski: [C:+2] ml-builder: clone production images [puppet] - https://gerrit.wikimedia.org/r/1219553 (owner: Dpogorzelski)
[08:59:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:00:04] <wikibugs>	 (CR) Elukey: "Tested on db2249, worked fine :)" [cookbooks] - https://gerrit.wikimedia.org/r/1220311 (https://phabricator.wikimedia.org/T407991) (owner: Elukey)
[09:02:47] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2249.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[09:03:05] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q2:rack/setup/install db2249 - https://phabricator.wikimedia.org/T407991#11480603 (elukey) @Jhancock.wm Hi! I filed https://gerrit.wikimedia.org/r/1220311 to hopefully avoid this in the future, the provision cookbook should be smarte...
[09:05:58] <wikibugs>	 (PS2) Kevin Bazira: ml-services: add embeddings isvc to the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1220313 (https://phabricator.wikimedia.org/T412338)
[09:07:22] <wikibugs>	 (CR) Kevin Bazira: ml-services: add embeddings isvc to the experimental namespace (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1220313 (https://phabricator.wikimedia.org/T412338) (owner: Kevin Bazira)
[09:12:23] <wikibugs>	 (PS8) Arnaudb: mailman: add UpstreamTlsContext on tlsproxy::envoy [puppet] - https://gerrit.wikimedia.org/r/1219770 (https://phabricator.wikimedia.org/T286066)
[09:12:23] <wikibugs>	 (CR) Arnaudb: "This will unblock the connectivity between envoy and httpd." [puppet] - https://gerrit.wikimedia.org/r/1219770 (https://phabricator.wikimedia.org/T286066) (owner: Arnaudb)
[09:13:35] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11480605 (ABran-WMF)
[09:14:12] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:24:04] <wikibugs>	 (CR) Ozge: [C:+2] ml-services: add embeddings isvc to the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1220313 (https://phabricator.wikimedia.org/T412338) (owner: Kevin Bazira)
[09:24:12] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations, serviceops, and 2 others: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008#11480607 (elukey) Open→Resolved a:elukey Added https://wikitech.wikimedia.org/wiki/Docker-...
[09:24:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:24:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:26:16] <wikibugs>	 (Merged) jenkins-bot: ml-services: add embeddings isvc to the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1220313 (https://phabricator.wikimedia.org/T412338) (owner: Kevin Bazira)
[09:29:12] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:31:19] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:34:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:39:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:49:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:50:51] <wikibugs>	 (PS1) Elukey: Move wqds10[28-1032] to insetup role [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451)
[09:54:12] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:55:04] <wikibugs>	 (CR) Clément Goubert: [C:+2] rest-gateway: support REST sandbox requests [deployment-charts] - https://gerrit.wikimedia.org/r/1207267 (https://phabricator.wikimedia.org/T396807) (owner: Aaron Schulz)
[09:55:33] <wikibugs>	 (PS2) Elukey: Move wqds10[28-32] to insetup role [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451)
[09:56:34] <wikibugs>	 (PS1) Dpogorzelski: ml-build: fix image repo clone path [puppet] - https://gerrit.wikimedia.org/r/1220317
[09:56:52] <wikibugs>	 (CR) Dpogorzelski: [C:+2] ml-build: fix image repo clone path [puppet] - https://gerrit.wikimedia.org/r/1220317 (owner: Dpogorzelski)
[09:56:54] <wikibugs>	 (CR) Dpogorzelski: [V:+2 C:+2] ml-build: fix image repo clone path [puppet] - https://gerrit.wikimedia.org/r/1220317 (owner: Dpogorzelski)
[09:57:17] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: support REST sandbox requests [deployment-charts] - https://gerrit.wikimedia.org/r/1207267 (https://phabricator.wikimedia.org/T396807) (owner: Aaron Schulz)
[10:02:25] <wikibugs>	 (CR) D3r1ck01: [C:+1] EditWatchlistPaginate feature flag has been removed from MW code, so remove it from config [mediawiki-config] - https://gerrit.wikimedia.org/r/1214584 (https://phabricator.wikimedia.org/T410908) (owner: Cparle)
[10:04:55] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:05:07] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:07:03] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[10:07:51] <wikibugs>	 (CR) Gehel: [C:-1] "wdqs1028 is already reimaged and being used to test qlever. if you can exclude it from this CR, the rest should be ok." [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451) (owner: Elukey)
[10:11:31] <wikibugs>	 (PS3) Elukey: Move wqds10[29-32] to insetup role [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451)
[10:11:58] <wikibugs>	 (PS4) Elukey: Move wqds10[29-32] to insetup role [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451)
[10:12:01] <wikibugs>	 (CR) CI reject: [V:-1] Move wqds10[29-32] to insetup role [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451) (owner: Elukey)
[10:12:27] <wikibugs>	 (CR) CI reject: [V:-1] Move wqds10[29-32] to insetup role [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451) (owner: Elukey)
[10:12:49] <wikibugs>	 (PS5) Elukey: Move wqds10[29-32] to insetup role [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451)
[10:13:33] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[10:14:09] <wikibugs>	 (PS6) Elukey: Move wqds10[29-32] to insetup role [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451)
[10:14:28] <wikibugs>	 (CR) Elukey: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451) (owner: Elukey)
[10:14:52] <wikibugs>	 (PS1) Clément Goubert: api-gateway: Fix host_rewrite_path_regex substitution [deployment-charts] - https://gerrit.wikimedia.org/r/1220318 (https://phabricator.wikimedia.org/T396807)
[10:16:33] <wikibugs>	 (CR) Elukey: "PCC looks good!" [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451) (owner: Elukey)
[10:18:27] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[10:18:44] <wikibugs>	 (CR) Clément Goubert: [C:+2] api-gateway: Fix host_rewrite_path_regex substitution [deployment-charts] - https://gerrit.wikimedia.org/r/1220318 (https://phabricator.wikimedia.org/T396807) (owner: Clément Goubert)
[10:20:48] <wikibugs>	 (Merged) jenkins-bot: api-gateway: Fix host_rewrite_path_regex substitution [deployment-charts] - https://gerrit.wikimedia.org/r/1220318 (https://phabricator.wikimedia.org/T396807) (owner: Clément Goubert)
[10:23:15] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:23:19] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:27:53] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[10:28:03] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[10:28:16] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[10:28:35] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[10:35:23] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[10:37:00] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[10:40:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[10:41:39] <wikibugs>	 (PS5) Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573)
[10:42:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:42:59] <wikibugs>	 (CR) Clément Goubert: [C:+2] restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway [puppet] - https://gerrit.wikimedia.org/r/1219604 (https://phabricator.wikimedia.org/T396807) (owner: Aaron Schulz)
[10:46:09] <wikibugs>	 (CR) CI reject: [V:-1] sre.mysql.newpool: [de]pool various section kinds [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: Federico Ceratto)
[10:48:30] <wikibugs>	 (PS6) Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573)
[10:59:15] <wikibugs>	 (CR) Elukey: [C:+2] Move wqds10[29-32] to insetup role [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451) (owner: Elukey)
[11:02:26] <wikibugs>	 (PS7) Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573)
[11:04:47] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.newdepool es2051 - test T383674
[11:04:52] <stashbot>	 T383674: Abstract away different database depooling mechanisms into a cookbook - https://phabricator.wikimedia.org/T383674
[11:05:39] <wikibugs>	 (CR) Btullis: [C:+1] Move wqds10[29-32] to insetup role [puppet] - https://gerrit.wikimedia.org/r/1220315 (https://phabricator.wikimedia.org/T412451) (owner: Elukey)
[11:06:41] <icinga-wm>	 PROBLEM - Host wikikube-worker1053 is DOWN: PING CRITICAL - Packet loss = 60%, RTA = 2615.58 ms
[11:07:15] <icinga-wm>	 RECOVERY - Host wikikube-worker1053 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[11:08:25] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11480693 (elukey) Random thoughts after reading:  * Is sretest2003 the only one that shows this behavior, or do we have others? I am particularly interested...
[11:09:59] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie
[11:10:16] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[11:11:39] <wikibugs>	 (PS1) Clément Goubert: Revert "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - https://gerrit.wikimedia.org/r/1220320 (https://phabricator.wikimedia.org/T396807)
[11:12:38] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.newdepool (exit_code=0) es2051 - test T383674
[11:12:42] <stashbot>	 T383674: Abstract away different database depooling mechanisms into a cookbook - https://phabricator.wikimedia.org/T383674
[11:13:27] <wikibugs>	 (CR) CI reject: [V:-1] Revert "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - https://gerrit.wikimedia.org/r/1220320 (https://phabricator.wikimedia.org/T396807) (owner: Clément Goubert)
[11:13:50] <wikibugs>	 (PS2) Clément Goubert: Revert "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - https://gerrit.wikimedia.org/r/1220320 (https://phabricator.wikimedia.org/T396807)
[11:14:36] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[11:15:53] <wikibugs>	 (CR) Clément Goubert: [C:+2] Revert "restgateway: migrate the /api/rest_v1/ sandbox to the rest gateway" [puppet] - https://gerrit.wikimedia.org/r/1220320 (https://phabricator.wikimedia.org/T396807) (owner: Clément Goubert)
[11:26:15] <wikibugs>	 (CR) Clément Goubert: [C:+1] imagecatalog: Remove support for buster [puppet] - https://gerrit.wikimedia.org/r/1219876 (owner: Muehlenhoff)
[11:26:42] <wikibugs>	 (PS1) Kevin Bazira: ml-services: remove revise-tone-task-generator from experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1220321 (https://phabricator.wikimedia.org/T412338)
[11:32:13] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[11:32:59] <wikibugs>	 (CR) Marostegui: "This is a long review, so I've started with the not much modified files." [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: Federico Ceratto)
[11:38:28] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[11:43:12] <wikibugs>	 (CR) Ozge: [C:+2] ml-services: remove revise-tone-task-generator from experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1220321 (https://phabricator.wikimedia.org/T412338) (owner: Kevin Bazira)
[11:43:53] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[11:45:24] <wikibugs>	 (Merged) jenkins-bot: ml-services: remove revise-tone-task-generator from experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1220321 (https://phabricator.wikimedia.org/T412338) (owner: Kevin Bazira)
[11:56:37] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1029.eqiad.wmnet with OS trixie
[11:57:07] <logmsgbot>	 !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[11:59:56] <wikibugs>	 (PS1) Elukey: installserver: remove pause/debug from wdqs10[29-32] [puppet] - https://gerrit.wikimedia.org/r/1220324 (https://phabricator.wikimedia.org/T412451)
[12:02:18] <wikibugs>	 (CR) Elukey: [C:+2] installserver: remove pause/debug from wdqs10[29-32] [puppet] - https://gerrit.wikimedia.org/r/1220324 (https://phabricator.wikimedia.org/T412451) (owner: Elukey)
[12:05:25] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:10:25] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[12:21:00] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[12:30:31] <logmsgbot>	 !log fceratto@cumin1003 START - Cookbook sre.mysql.newpool es2051 gradually with 4 steps - test T383674
[12:30:35] <stashbot>	 T383674: Abstract away different database depooling mechanisms into a cookbook - https://phabricator.wikimedia.org/T383674
[12:34:30] <wikibugs>	 (CR) Federico Ceratto: "I renamed the cookbook as discussed and fixed the tests." [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: Federico Ceratto)
[12:39:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[12:39:34] <wikibugs>	 (CR) Marostegui: sre.mysql.newpool: [de]pool various section kinds (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: Federico Ceratto)
[12:39:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr2-magru and cr2-eqdfw (195.200.68.152) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[12:44:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[12:44:39] <jinxer-wm>	 RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and cr2-eqdfw (195.200.68.152) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[12:45:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:47:01] <jinxer-wm>	 FIRING: SLOMetricAbsent: wdqs-main-availability magru - https://slo.wikimedia.org/?search=wdqs-main-availability   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[12:47:18] <logmsgbot>	 !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1198 gradually with 4 steps - repooling
[12:52:01] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: wdqs-main-availability magru - https://slo.wikimedia.org/?search=wdqs-main-availability   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:02:01] <jinxer-wm>	 RESOLVED: SLOMetricAbsent: wdqs-scholarly-availability magru - https://slo.wikimedia.org/?search=wdqs-scholarly-availability   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:05:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:15:59] <logmsgbot>	 !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) es2051 gradually with 4 steps - test T383674
[13:16:03] <stashbot>	 T383674: Abstract away different database depooling mechanisms into a cookbook - https://phabricator.wikimedia.org/T383674
[13:23:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker11XX - https://phabricator.wikimedia.org/T408749#11480841 (Jclark-ctr) a:Jhancock.wm→Jclark-ctr
[13:32:45] <logmsgbot>	 !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1198 gradually with 4 steps - repooling
[13:37:46] <wikibugs>	 ops-eqiad, SRE, DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11480863 (Jclark-ctr) @RKemper  the replacement drive should arrive today.   is there any way you or @BTullis  can prep this to be replaced ?
[13:39:23] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1198 - https://phabricator.wikimedia.org/T413336#11480865 (BTullis)
[13:39:29] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1198 - https://phabricator.wikimedia.org/T413336#11480867 (Jclark-ctr) @RKemper  @BTullis  if anyone is able to prep this so when disk arrives i can replace it.  it should arrive hopefully tuesday
[13:44:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1198 - https://phabricator.wikimedia.org/T413336#11480872 (Jclark-ctr) a:Jclark-ctr
[13:53:44] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1030.eqiad.wmnet with OS trixie
[13:59:02] <icinga-wm>	 PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1200 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[13:59:03] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1200 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T413360 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[13:59:12] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1200 - https://phabricator.wikimedia.org/T413360 (ops-monitoring-bot) NEW
[14:01:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1198 - https://phabricator.wikimedia.org/T413336#11480910 (Jclark-ctr) Dell Sr 220434974
[14:11:59] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1030.eqiad.wmnet with reason: host reimage
[14:13:54] <wikibugs>	 (PS1) Marostegui: Revert "installserver: Format /srv on es2028" [puppet] - https://gerrit.wikimedia.org/r/1220343
[14:15:51] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1030.eqiad.wmnet with reason: host reimage
[14:16:00] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "installserver: Format /srv on es2028" [puppet] - https://gerrit.wikimedia.org/r/1220343 (owner: Marostegui)
[14:18:50] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply
[14:18:58] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply
[14:20:43] <wikibugs>	 (PS8) Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573)
[14:22:40] <wikibugs>	 (CR) Urbanecm: "FWIW, this is blocked by the _second_ train departing (rather than first train finishing). This is because we need to be at the point of n" [mediawiki-config] - https://gerrit.wikimedia.org/r/1219541 (https://phabricator.wikimedia.org/T411479) (owner: Sergio Gimeno)
[14:23:57] <wikibugs>	 (CR) Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: Federico Ceratto)
[14:25:51] <wikibugs>	 (CR) CI reject: [V:-1] sre.mysql.newpool: [de]pool various section kinds [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: Federico Ceratto)
[14:28:03] <wikibugs>	 ops-eqiad, SRE, DC-Ops: eno8303 on db1258:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T413320#11480943 (Jclark-ctr) Open→Resolved a:Jclark-ctr Replaced optic
[14:32:16] <urandom>	 !log serveraction powercycle restbase2034 (down, unresponsive)
[14:32:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:25] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1030.eqiad.wmnet with OS trixie
[14:35:23] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1031.eqiad.wmnet with OS trixie
[14:35:25] <wikibugs>	 (CR) Marostegui: sre.mysql.newpool: [de]pool various section kinds (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: Federico Ceratto)
[14:36:57] <wikibugs>	 (PS1) Dpogorzelski: docker registry: add ml build user password [puppet] - https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524)
[14:37:18] <icinga-wm>	 RECOVERY - Host restbase2034 is UP: PING OK - Packet loss = 0%, RTA = 30.31 ms
[14:38:02] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on restbase2034 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:40:31] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[14:42:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:44:12] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:49:26] <wikibugs>	 (PS1) Bking: opensearch-cluster: remove broken setting [deployment-charts] - https://gerrit.wikimedia.org/r/1220356 (https://phabricator.wikimedia.org/T412447)
[14:52:21] <wikibugs>	 (PS2) Dpogorzelski: docker registry: add ml build user password [puppet] - https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524)
[14:53:49] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage
[14:55:10] <wikibugs>	 (PS3) Dpogorzelski: docker registry: add ml build user password [puppet] - https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524)
[14:57:06] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364 (KReid-WMF) NEW
[14:58:06] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage
[15:01:11] <wikibugs>	 (CR) Bking: [C:+2] opensearch-cluster: remove broken setting [deployment-charts] - https://gerrit.wikimedia.org/r/1220356 (https://phabricator.wikimedia.org/T412447) (owner: Bking)
[15:01:29] <wikibugs>	 (CR) Btullis: [C:+1] opensearch-cluster: remove broken setting [deployment-charts] - https://gerrit.wikimedia.org/r/1220356 (https://phabricator.wikimedia.org/T412447) (owner: Bking)
[15:04:59] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] "Thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/1219536 (owner: Alexandros Kosiaris)
[15:08:01] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on restbase2034 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:09:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:12:19] <wikibugs>	 (Merged) jenkins-bot: kube-state-metrics: Remove limits/requests [deployment-charts] - https://gerrit.wikimedia.org/r/1219536 (owner: Alexandros Kosiaris)
[15:24:12] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics-privatedata-users for kareid - https://phabricator.wikimedia.org/T413364#11481045 (Novem_Linguae)
[15:34:12] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:36:37] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[15:36:55] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[15:37:49] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[15:38:05] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[15:38:11] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[15:38:26] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[15:38:33] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[15:38:35] <akosiaris>	 !log remove limits from kube-state-metrics in wikikube and wikikube-staging clusters, no point in resource limits for this workload, it's an important cluster component
[15:38:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:00] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[15:39:32] <wikibugs>	 (PS4) Dpogorzelski: docker registry: add ml build user password [puppet] - https://gerrit.wikimedia.org/r/1220352 (https://phabricator.wikimedia.org/T412524)
[15:41:06] <abijeet>	 hello, i'd like to do an emergency deployment for this patch: 1220016: SpecialPageLanguage: Use OOUI infuse if language selector is present | https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1220016.
[15:42:04] <wikibugs>	 (CR) Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: Federico Ceratto)
[15:42:24] <claime>	 abijeet: No objection from me, Emperor fabfur, heads up
[15:44:29] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[15:44:48] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[15:44:53] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[15:44:58] <akosiaris>	 !log remove limits from kube-state-metrics in ml-serve-{eqiad,codfw} ml-staging-codfw dse-k8s-{eqiad,codfw} aux-k8s-{eqiad,codfw} kubernetes clusters. No point in resource limits for this workload, it's an important cluster component.
[15:45:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:09] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[15:45:15] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[15:45:22] <wikibugs>	 (CR) Marostegui: sre.mysql.newpool: [de]pool various section kinds (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: Federico Ceratto)
[15:45:31] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[15:45:37] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:47:35] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:47:40] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
[15:47:55] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
[15:48:00] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:48:18] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:48:22] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
[15:48:38] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
[15:49:44] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.dns.netbox
[15:50:53] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1031.eqiad.wmnet with OS trixie
[15:51:23] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1031.eqiad.wmnet with OS trixie
[15:53:32] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube1328-34 servers - jclark@cumin1003"
[15:53:37] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube1328-34 servers - jclark@cumin1003"
[15:53:37] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:55:16] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[15:56:25] <claime>	 abijeet: Are you going to deploy yourself or does that patch need a deployer?
[15:58:25] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1328.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[15:59:24] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1328.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:01:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker11XX - https://phabricator.wikimedia.org/T408749#11481183 (10Jclark-ctr) these are racked cabled and setup running into error with provisioning  same as T407991    ` Retrieving the BMC's firmware version. BMC firmware release...
[16:02:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481190 (10Jclark-ctr)
[16:02:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481192 (10Jclark-ctr)
[16:05:05] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1329.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:05:42] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1330.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:06:01] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1329.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:06:18] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1330.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:06:34] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481198 (10Clement_Goubert) Hmm it's very possible these hosts need UEFI and the partman recipe are wrong, on top of needing @elukey's [[ https://gerrit.wikimedia.org/r/c/...
[16:06:37] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1331.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:07:03] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1332.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:07:18] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1331.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:07:46] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1332.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:08:35] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1333.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:08:50] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1334.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:09:16] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1333.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:09:22] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1334.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:09:23] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage
[16:10:25] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[16:10:35] <wikibugs>	 (03PS1) 10Btullis: Revert changes to the principal for the spark-thriftserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220369 (https://phabricator.wikimedia.org/T410017)
[16:10:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481210 (10Jclark-ctr) @Clement_Goubert  thanks for checking these. I had not even looked at the patch yet. these would be UEFI
[16:14:17] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage
[16:14:46] <wikibugs>	 (03CR) 10JavierMonton: [C:03+2] Revert changes to the principal for the spark-thriftserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220369 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis)
[16:15:42] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1328.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:15:58] <wikibugs>	 (03PS1) 10Clément Goubert: partman: New wikikube-worker need UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1220370 (https://phabricator.wikimedia.org/T408749)
[16:16:53] <wikibugs>	 (03Merged) 10jenkins-bot: Revert changes to the principal for the spark-thriftserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220369 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis)
[16:19:18] <wikibugs>	 (03CR) 10Elukey: [C:03+1] partman: New wikikube-worker need UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1220370 (https://phabricator.wikimedia.org/T408749) (owner: 10Clément Goubert)
[16:19:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install mc1055-72 - https://phabricator.wikimedia.org/T412255#11481248 (10Clement_Goubert) These are supermicro nodes and require UEFI. Patch underway.
[16:20:00] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] partman: New wikikube-worker need UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1220370 (https://phabricator.wikimedia.org/T408749) (owner: 10Clément Goubert)
[16:20:59] <wikibugs>	 (03PS2) 10Sbisson: Fix section loading on desktop [extensions/ContentTranslation] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219900 (https://phabricator.wikimedia.org/T413305)
[16:21:56] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1330.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:21:58] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1331.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:22:21] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1331.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:22:28] <logmsgbot>	 !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1330.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:23:08] <stephanebisson>	 Hi Amir1, fabfur, requesting permission to deploy Content Translation UBN T413305
[16:23:09] <stashbot>	 T413305: ContentTranslation shows wrong text of section of source page at section translation - https://phabricator.wikimedia.org/T413305
[16:25:07] <wikibugs>	 (03PS1) 10Clément Goubert: partman: New mc nodes need UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1220373 (https://phabricator.wikimedia.org/T412255)
[16:26:17] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1328.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:28:17] <fabfur>	 hi stephanebisson ok for me
[16:29:10] <jinxer-wm>	 FIRING: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[16:29:10] <stephanebisson>	 fabfur, ok thanks, going ahead now
[16:29:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219900 (https://phabricator.wikimedia.org/T413305) (owner: 10Sbisson)
[16:30:36] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1329.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:32:16] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1031.eqiad.wmnet with OS trixie
[16:33:55] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1032.eqiad.wmnet with OS trixie
[16:34:10] <jinxer-wm>	 RESOLVED: BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[16:36:09] <wikibugs>	 (03PS1) 10Fabfur: Add TXT record for jamf [dns] - 10https://gerrit.wikimedia.org/r/1220380 (https://phabricator.wikimedia.org/T413259)
[16:40:29] <wikibugs>	 (03Merged) 10jenkins-bot: Fix section loading on desktop [extensions/ContentTranslation] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1219900 (https://phabricator.wikimedia.org/T413305) (owner: 10Sbisson)
[16:41:08] <logmsgbot>	 !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1219900|Fix section loading on desktop (T413305)]]
[16:41:12] <stashbot>	 T413305: ContentTranslation shows wrong text of section of source page at section translation - https://phabricator.wikimedia.org/T413305
[16:41:35] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1329.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:42:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:43:33] <wikibugs>	 (03PS2) 10Fabfur: Add TXT record for jamf [dns] - 10https://gerrit.wikimedia.org/r/1220380 (https://phabricator.wikimedia.org/T413259)
[16:43:55] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1330.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:44:34] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] Add TXT record for jamf [dns] - 10https://gerrit.wikimedia.org/r/1220380 (https://phabricator.wikimedia.org/T413259) (owner: 10Fabfur)
[16:45:01] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] Add TXT record for jamf [dns] - 10https://gerrit.wikimedia.org/r/1220380 (https://phabricator.wikimedia.org/T413259) (owner: 10Fabfur)
[16:45:29] <logmsgbot>	 !log fabfur@dns1004 START - running authdns-update
[16:46:32] <logmsgbot>	 !log fabfur@dns1004 END - running authdns-update
[16:46:53] <wikibugs>	 06SRE, 10DNS, 06Traffic, 13Patch-For-Review: [Update DNS Record Request] - wikimedia.org - https://phabricator.wikimedia.org/T413259#11481405 (10Fabfur) The entry has been added with https://gerrit.wikimedia.org/r/c/operations/dns/+/1220380 and should be propagated shortly
[16:46:55] <wikibugs>	 (03CR) 10VolkerE: [C:04-1] "I'd want to see an optimized SVG according to our SVG opt guidelines to have a better representation of production environment, otherwise " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1219865 (https://phabricator.wikimedia.org/T413217) (owner: 10Aude)
[16:49:45] <tappof>	 !log lvextend /dev/vg0/srv on titan1001, titan1002, titan2002. T410152
[16:49:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:49] <stashbot>	 T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152
[16:52:22] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1032.eqiad.wmnet with reason: host reimage
[16:53:22] <wikibugs>	 (03PS1) 10Tiziano Fogli: thanos-compact: reduce concurrency to 2 [puppet] - 10https://gerrit.wikimedia.org/r/1220385 (https://phabricator.wikimedia.org/T410152)
[16:54:30] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1330.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[16:54:55] <wikibugs>	 (03CR) 10Herron: [C:03+1] thanos-compact: reduce concurrency to 2 [puppet] - 10https://gerrit.wikimedia.org/r/1220385 (https://phabricator.wikimedia.org/T410152) (owner: 10Tiziano Fogli)
[16:55:44] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] thanos-compact: reduce concurrency to 2 [puppet] - 10https://gerrit.wikimedia.org/r/1220385 (https://phabricator.wikimedia.org/T410152) (owner: 10Tiziano Fogli)
[16:58:58] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1032.eqiad.wmnet with reason: host reimage
[17:00:37] <logmsgbot>	 elukey@cumin1003 provision (PID 3950609) is awaiting input
[17:00:44] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1331.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[17:02:50] <wikibugs>	 (03PS9) 10Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573)
[17:11:19] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1331.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[17:13:21] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1332.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[17:14:23] <stephanebisson>	 Amir1, fabfur: deployment has been stuck for a while. According to the logs, it looks like pushing the image to the registry has failed. Any idea what to do now?
[17:17:35] <elukey>	 stephanebisson: o/ I've seen occurrences of very long deployments stuck in the pushing step, even 40/50 mins. 
[17:18:07] <elukey>	 have you already waited that amount of time? If so, I'd suggest retrying the deployment, if possible, because there is no easy solution
[17:18:32] <elukey>	 we have https://phabricator.wikimedia.org/T412951 planned for early Q3 that should make things better
[17:18:47] <stephanebisson>	 elukey yeah it was stuck at the pushing step for 22 minutes but then: subprocess.CalledProcessError: Command '['sudo', '/usr/local/bin/docker-pusher', '-q', 'docker-registry.discovery.wmnet/restricted/mediawiki-singleversion:2025-12-22-010154-publish-83-next']' returned non-zero exit status 1.
[17:19:20] <stephanebisson>	 I'll retry a little later if it's ok
[17:20:21] <elukey>	 stephanebisson: yeah that sounds good. I may not be around, and other SRE folks probably won't be either (basically on-call for pages). How urgent is this deployment?
[17:20:47] <stephanebisson>	 elukey Content Translation UBN
[17:20:58] <elukey>	 lovely
[17:22:43] <logmsgbot>	 !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1219900|Fix section loading on desktop (T413305)]]
[17:22:47] <stashbot>	 T413305: ContentTranslation shows wrong text of section of source page at section translation - https://phabricator.wikimedia.org/T413305
[17:24:00] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1332.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[17:24:44] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1333.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[17:26:28] <logmsgbot>	 !log mforns@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply
[17:26:59] <logmsgbot>	 !log mforns@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply
[17:29:37] <logmsgbot>	 !log mforns@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply
[17:29:51] <logmsgbot>	 !log mforns@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply
[17:30:27] <logmsgbot>	 !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1032.eqiad.wmnet with OS trixie
[17:32:41] <icinga-wm>	 PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 77%, RTA = 3618.00 ms
[17:33:03] <icinga-wm>	 RECOVERY - Host wikikube-worker1275 is UP: PING OK - Packet loss = 0%, RTA = 483.17 ms
[17:35:16] <jinxer-wm>	 RESOLVED: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[17:35:20] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1333.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[17:35:34] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1334.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[17:36:59] <stephanebisson>	 It failed again to push the image
[17:46:09] <logmsgbot>	 !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1334.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[17:50:58] <logmsgbot>	 !log sbisson@deploy2002 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.5,1.46.0-wmf.7,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.230.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediawiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.230.0) (duration: 28m 14s)
[17:57:06] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1200 - https://phabricator.wikimedia.org/T413360#11481622 (10VRiley-WMF) a:03VRiley-WMF
[18:00:49] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1328.eqiad.wmnet with OS trixie
[18:01:02] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481625 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1328.eqiad.wmnet with OS trixie
[18:02:37] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:10:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T413385 (10ops-monitoring-bot) 03NEW
[18:12:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T413385#11481688 (10Marostegui) This is a sanitarium host. Can we get a disk to replace the failed one?  Thanks
[18:12:23] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T413385#11481689 (10Marostegui) p:05Triage→03High
[18:16:44] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: root user not on newest batches of supermicro servers. - https://phabricator.wikimedia.org/T412458#11481709 (10wiki_willy) a:03VRiley-WMF
[18:22:27] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481731 (10Jclark-ctr) @elukey Ran into another provisioning issue. It looks like IPv4 PXE was disabled. The screenshot was taken after I changed it, and that was the only...
[18:28:43] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1328.eqiad.wmnet with reason: host reimage
[18:33:46] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1328.eqiad.wmnet with reason: host reimage
[18:36:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[18:47:05] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1334.eqiad.wmnet with OS trixie
[18:47:15] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481797 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1334.eqiad.wmnet with OS trixie
[18:48:56] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1329.eqiad.wmnet with OS trixie
[18:49:06] <logmsgbot>	 !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1219900|Fix section loading on desktop (T413305)]]
[18:49:07] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481798 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1329.eqiad.wmnet with OS trixie
[18:49:10] <stashbot>	 T413305: ContentTranslation shows wrong text of section of source page at section translation - https://phabricator.wikimedia.org/T413305
[18:49:46] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[18:49:51] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1330.eqiad.wmnet with OS trixie
[18:50:03] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481800 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1330.eqiad.wmnet with OS trixie
[18:50:16] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[18:50:17] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1328.eqiad.wmnet with OS trixie
[18:50:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481801 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1328.eqiad.wmnet with OS trixie completed: - wikiku...
[18:50:41] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1331.eqiad.wmnet with OS trixie
[18:50:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481802 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1331.eqiad.wmnet with OS trixie
[18:52:18] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1332.eqiad.wmnet with OS trixie
[18:52:25] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1333.eqiad.wmnet with OS trixie
[18:52:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1332.eqiad.wmnet with OS trixie
[18:52:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1333.eqiad.wmnet with OS trixie
[18:58:33] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1334.eqiad.wmnet with reason: host reimage
[18:59:59] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1329.eqiad.wmnet with reason: host reimage
[19:01:01] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1330.eqiad.wmnet with reason: host reimage
[19:01:52] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1331.eqiad.wmnet with reason: host reimage
[19:02:37] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:03:33] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1332.eqiad.wmnet with reason: host reimage
[19:03:48] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1333.eqiad.wmnet with reason: host reimage
[19:03:58] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1334.eqiad.wmnet with reason: host reimage
[19:05:18] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T413385#11481816 (10Jclark-ctr) a:03Jclark-ctr This server is out of warranty  can it be replaced at any time?    Solid State Disk 0:1:2  Removed  2  1787.88 GB Not Capable  SATA  SSD  No  99%
[19:06:22] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T413385#11481819 (10Marostegui) Yes, you can go for it any time
[19:07:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T413385#11481820 (10Jclark-ctr) I am not on site at this moment but i am going to wipe an old drive from decom server prior to installing
[19:07:04] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1329.eqiad.wmnet with reason: host reimage
[19:07:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T413385#11481822 (10Marostegui) Thanks!
[19:11:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[19:11:39] <wikibugs>	 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: root user not on newest batches of supermicro servers. - https://phabricator.wikimedia.org/T412458#11481828 (10Jclark-ctr) 05Open→03Resolved a:05VRiley-WMF→03Jclark-ctr The servers VRiley had were not at a point where th...
[19:14:09] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1332.eqiad.wmnet with reason: host reimage
[19:18:24] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1331.eqiad.wmnet with reason: host reimage
[19:20:01] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[19:20:18] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[19:20:18] <logmsgbot>	 !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1219900|Fix section loading on desktop (T413305)]]
[19:20:19] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1334.eqiad.wmnet with OS trixie
[19:20:24] <stashbot>	 T413305: ContentTranslation shows wrong text of section of source page at section translation - https://phabricator.wikimedia.org/T413305
[19:20:26] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481842 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1334.eqiad.wmnet with OS trixie completed: - wikiku...
[19:22:14] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1330.eqiad.wmnet with reason: host reimage
[19:22:41] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[19:23:00] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[19:23:01] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1329.eqiad.wmnet with OS trixie
[19:23:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481845 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1329.eqiad.wmnet with OS trixie completed: - wikiku...
[19:26:04] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1333.eqiad.wmnet with reason: host reimage
[19:26:41] <logmsgbot>	 !log sbisson@deploy2002 sbisson: Backport for [[gerrit:1219900|Fix section loading on desktop (T413305)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[19:26:44] <stashbot>	 T413305: ContentTranslation shows wrong text of section of source page at section translation - https://phabricator.wikimedia.org/T413305
[19:28:15] <logmsgbot>	 !log sbisson@deploy2002 sbisson: Continuing with sync
[19:30:28] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[19:30:47] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[19:30:48] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1332.eqiad.wmnet with OS trixie
[19:30:55] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481870 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1332.eqiad.wmnet with OS trixie completed: - wikiku...
[19:34:25] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[19:34:47] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[19:34:48] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1331.eqiad.wmnet with OS trixie
[19:34:55] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481886 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1331.eqiad.wmnet with OS trixie completed: - wikiku...
[19:36:02] <wikibugs>	 (PS1) Mforns: Bump up the page-analytics service image [deployment-charts] - https://gerrit.wikimedia.org/r/1220390 (https://phabricator.wikimedia.org/T405041)
[19:39:28] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[19:39:46] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[19:39:47] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1330.eqiad.wmnet with OS trixie
[19:39:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481900 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1330.eqiad.wmnet with OS trixie completed: - wikiku...
[19:40:43] <wikibugs>	 (CR) Santiago Faci: [C:+2] Bump up the page-analytics service image [deployment-charts] - https://gerrit.wikimedia.org/r/1220390 (https://phabricator.wikimedia.org/T405041) (owner: Mforns)
[19:41:02] <logmsgbot>	 !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1219900|Fix section loading on desktop (T413305)]] (duration: 20m 44s)
[19:41:07] <stashbot>	 T413305: ContentTranslation shows wrong text of section of source page at section translation - https://phabricator.wikimedia.org/T413305
[19:42:50] <logmsgbot>	 !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[19:42:53] <wikibugs>	 (Merged) jenkins-bot: Bump up the page-analytics service image [deployment-charts] - https://gerrit.wikimedia.org/r/1220390 (https://phabricator.wikimedia.org/T405041) (owner: Mforns)
[19:43:08] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[19:43:09] <logmsgbot>	 !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1333.eqiad.wmnet with OS trixie
[19:43:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481916 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1333.eqiad.wmnet with OS trixie completed: - wikiku...
[19:44:02] <logmsgbot>	 !log mforns@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply
[19:44:16] <logmsgbot>	 !log mforns@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply
[19:44:34] <logmsgbot>	 !log mforns@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply
[19:44:49] <logmsgbot>	 !log mforns@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply
[19:45:05] <logmsgbot>	 !log mforns@deploy2002 helmfile [codfw] START helmfile.d/services/page-analytics: apply
[19:45:17] <logmsgbot>	 !log mforns@deploy2002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply
[19:47:09] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481922 (Jclark-ctr)
[19:48:28] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11481923 (Jclark-ctr) Open→Resolved
[19:58:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:03:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:10:25] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[20:32:20] <jinxer-wm>	 FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[20:37:20] <jinxer-wm>	 RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[20:43:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1275:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1275 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:06:22] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11482291 (VRiley-WMF) wikikube-worker1360 B2 U18 CableID 5003 Port 27  wikikube-worker1361 B4 U36 CableID 5369 Port 47  wikikube-worker1362 C3 U37 CableID 230304500071 Por...
[21:20:20] <wikibugs>	 (CR) CDanis: [C:+2] P:puppetserver::volatile add Spur anonymous-residential feed [puppet] - https://gerrit.wikimedia.org/r/1219881 (owner: Slyngshede)
[21:23:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1275:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1275 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:24:20] <jinxer-wm>	 FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[21:24:27] <wikibugs>	 (PS1) CDanis: spur_feeds: fix path typo [puppet] - https://gerrit.wikimedia.org/r/1220398
[21:25:08] <wikibugs>	 (CR) CDanis: [C:+2] spur_feeds: fix path typo [puppet] - https://gerrit.wikimedia.org/r/1220398 (owner: CDanis)
[21:29:20] <jinxer-wm>	 RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[21:33:08] <wikibugs>	 (PS1) CDanis: spur_feeds: fix outfile/outdir confusion [puppet] - https://gerrit.wikimedia.org/r/1220401
[21:34:10] <jinxer-wm>	 FIRING: [3x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[21:34:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-magru (2a02:ec80:700:fe0a::2) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr1-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[21:35:00] <wikibugs>	 (CR) CDanis: [C:+2] spur_feeds: fix outfile/outdir confusion [puppet] - https://gerrit.wikimedia.org/r/1220401 (owner: CDanis)
[21:35:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:37:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[21:39:10] <jinxer-wm>	 FIRING: [7x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[21:39:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[21:40:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:41:56] <wikibugs>	 (PS1) CDanis: spur_feeds: use root user [puppet] - https://gerrit.wikimedia.org/r/1220402
[21:43:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:44:07] <wikibugs>	 (CR) CDanis: [C:+2] spur_feeds: use root user [puppet] - https://gerrit.wikimedia.org/r/1220402 (owner: CDanis)
[21:44:10] <jinxer-wm>	 FIRING: [11x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[21:44:39] <jinxer-wm>	 RESOLVED: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[21:46:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[21:47:55] <wikibugs>	 ops-magru: Inbound errors on interface cr1-magru:xe-0/1/1 (Transport: cr2-eqiad:xe-1/0/1:3 (Telxius, CRT-008508) {#70089}) - https://phabricator.wikimedia.org/T413409 (phaultfinder) NEW
[21:49:10] <jinxer-wm>	 FIRING: [15x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[21:51:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[21:52:17] <wikibugs>	 SRE, Fundraising-Backlog, Fundraising-Tech-Roadmap, MediaWiki-extensions-CentralNotice, and 2 others: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11482403 (AKanji-WMF)
[21:53:00] <icinga-wm>	 PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:53:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: dump_proxy_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:53:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[21:53:58] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:54:10] <jinxer-wm>	 FIRING: [16x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[21:54:58] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:55:00] <icinga-wm>	 RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:57:20] <jinxer-wm>	 FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[21:57:31] <jinxer-wm>	 FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[21:58:39] <jinxer-wm>	 RESOLVED: [3x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[21:59:10] <jinxer-wm>	 FIRING: [13x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[21:59:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:02:20] <jinxer-wm>	 RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[22:02:31] <jinxer-wm>	 RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[22:04:10] <jinxer-wm>	 FIRING: [14x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:04:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:06:00] <icinga-wm>	 PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:06:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:09:00] <icinga-wm>	 RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:09:10] <jinxer-wm>	 FIRING: [13x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:11:39] <jinxer-wm>	 RESOLVED: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:12:58] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:13:00] <icinga-wm>	 PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:13:58] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:14:00] <icinga-wm>	 RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:14:10] <jinxer-wm>	 RESOLVED: [11x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:14:40] <jinxer-wm>	 FIRING: [5x] BFDdown: BFD session down between cr2-eqiad and 2a02:ec80:700:fe0a::2 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:15:10] <jinxer-wm>	 FIRING: [6x] BFDdown: BFD session down between cr1-magru and fe80::8618:88ff:fe0d:d947 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:15:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:19:40] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:19:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[22:20:10] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:20:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:23:00] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:23:00] <icinga-wm>	 PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:23:58] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:24:40] <jinxer-wm>	 FIRING: [11x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:25:00] <icinga-wm>	 RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:25:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:27:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[22:27:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:29:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[22:29:40] <jinxer-wm>	 RESOLVED: [7x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:30:55] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr2-eqiad and 2a02:ec80:700:fe0a::2 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:31:10] <jinxer-wm>	 FIRING: [5x] BFDdown: BFD session down between cr1-magru and fe80::8618:88ff:fe0d:d947 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:32:00] <icinga-wm>	 PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:32:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:33:00] <icinga-wm>	 RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:33:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:34:40] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:36:10] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:38:39] <jinxer-wm>	 RESOLVED: [3x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:39:40] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:41:10] <jinxer-wm>	 FIRING: [11x] BFDdown: BFD session down between cr1-magru and fe80::8618:88ff:fe0d:d947 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:43:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:44:40] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr1-magru and fe80::8618:88ff:fe0d:d947 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:45:00] <icinga-wm>	 PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:46:10] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr1-magru and fe80::8618:88ff:fe0d:d947 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:48:00] <icinga-wm>	 RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:49:40] <jinxer-wm>	 RESOLVED: [12x] BFDdown: BFD session down between cr1-magru and fe80::8618:88ff:fe0d:d947 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:49:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in magru - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[22:51:00] <icinga-wm>	 PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:51:55] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 2a02:ec80:700:fe0a::2 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:53:00] <icinga-wm>	 RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:54:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[22:54:40] <jinxer-wm>	 RESOLVED: [6x] BFDdown: BFD session down between cr1-magru and fe80::8618:88ff:fe0d:d947 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[22:54:58] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 7/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:55:16] <jinxer-wm>	 FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[22:55:58] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:58:39] <jinxer-wm>	 RESOLVED: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[22:59:40] <jinxer-wm>	 FIRING: [5x] BFDdown: BFD session down between cr1-magru and fe80::8618:88ff:fe0d:d947 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:01:00] <icinga-wm>	 PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:01:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:02:30] <icinga-wm>	 RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:02:55] <jinxer-wm>	 FIRING: [7x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:04:40] <jinxer-wm>	 FIRING: [6x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:05:00] <icinga-wm>	 PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:06:39] <jinxer-wm>	 RESOLVED: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:07:00] <icinga-wm>	 RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:07:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:07:55] <jinxer-wm>	 FIRING: [7x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:07:58] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:09:40] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:10:58] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:12:55] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:14:40] <jinxer-wm>	 FIRING: [9x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:15:16] <jinxer-wm>	 RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 <no value> - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[23:17:39] <jinxer-wm>	 RESOLVED: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:17:55] <jinxer-wm>	 RESOLVED: [8x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:18:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-magru (2a02:ec80:700:fe0a::2) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr1-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:19:40] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:22:55] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:23:39] <jinxer-wm>	 RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-magru (2a02:ec80:700:fe0a::2) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr1-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:24:40] <jinxer-wm>	 RESOLVED: [7x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:25:02] <icinga-wm>	 PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:25:46] <wikibugs>	 (PS1) Cwhite: logstash: put logging-sd200[567] in service [puppet] - https://gerrit.wikimedia.org/r/1220405 (https://phabricator.wikimedia.org/T413414)
[23:25:48] <wikibugs>	 (PS1) Cwhite: logstash: put logging-sd100[567] in service [puppet] - https://gerrit.wikimedia.org/r/1220406 (https://phabricator.wikimedia.org/T413414)
[23:26:40] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 2a02:ec80:700:fe0a::2 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:26:58] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:27:55] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:28:58] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:29:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-magru (2a02:ec80:700:fe0a::2) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr1-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:31:02] <icinga-wm>	 RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:31:40] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:32:55] <jinxer-wm>	 FIRING: [11x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:34:39] <jinxer-wm>	 RESOLVED: CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-magru (2a02:ec80:700:fe0a::2) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqiad:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr1-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:36:40] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:37:55] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:39:41] <wikibugs>	 (CR) Cwhite: [C:+2] logstash: put logging-sd200[567] in service [puppet] - https://gerrit.wikimedia.org/r/1220405 (https://phabricator.wikimedia.org/T413414) (owner: Cwhite)
[23:39:58] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 7/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:40:58] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:41:40] <jinxer-wm>	 FIRING: [12x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:42:02] <icinga-wm>	 PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:42:55] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:43:58] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 7/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:44:02] <icinga-wm>	 RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:46:40] <jinxer-wm>	 FIRING: [9x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:47:04] <icinga-wm>	 PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:47:55] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:50:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr1-magru:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:51:40] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:51:58] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:52:04] <icinga-wm>	 RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:52:55] <jinxer-wm>	 FIRING: [9x] BFDdown: BFD session down between cr1-magru and 2a02:ec80:700:fe0a::1 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:54:59] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:55:07] <icinga-wm>	 PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:56:40] <jinxer-wm>	 FIRING: [9x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:57:55] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[23:57:59] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status