[00:00:07] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:00:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:01:40] FIRING: [11x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [00:02:55] FIRING: [10x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [00:03:37] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:04:59] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 7/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:05:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:06:40] FIRING: [11x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [00:06:59] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:07:55] FIRING: [10x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [00:08:15] 
FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:09:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:09:59] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:10:25] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:10:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:11:07] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:11:40] FIRING: [10x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [00:12:55] FIRING: [8x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [00:13:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases 
routed via main at codfw: 24.32% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:14:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:14:59] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:15:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (195.200.68.150) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:16:12] ops-magru, Infrastructure-Foundations, netops: Flapping OSFP between cr1-magru and cr2-eqiad - https://phabricator.wikimedia.org/T413415 (Papaul) NEW [00:16:40] FIRING: [9x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [00:17:07] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 
1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:17:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:17:55] FIRING: [10x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [00:19:07] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:20:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:21:40] RESOLVED: [7x] BFDdown: BFD session down between cr1-magru and 195.200.68.150 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [00:25:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-magru and cr2-eqiad (2a02:ec80:700:fe0a::1) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [00:27:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [00:32:15] FIRING: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:37:15] FIRING: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:39:50] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1220408 [00:39:50] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1220408 (owner: TrainBranchBot) [00:42:15] RESOLVED: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [00:44:07] PROBLEM - Host wikikube-worker1053 is DOWN: PING CRITICAL - Packet loss = 75%, RTA = 7256.71 ms [00:44:47] I'm finding consistent session loss errors when editing [00:44:53] RECOVERY - Host wikikube-worker1053 is UP: PING OK - Packet loss = 0%, RTA = 471.81 ms [00:44:56] SRE-Access-Requests, Release-Engineering-Team (Radar): Add yubikey ssh keys for thcipriani - https://phabricator.wikimedia.org/T413416 (thcipriani) NEW [00:50:36] (PS1) Thcipriani: Yubikey-SSH-FIDO: add new keys for thcipriani [puppet] - https://gerrit.wikimedia.org/r/1220409 (https://phabricator.wikimedia.org/T413416) [00:50:38] (PS1) Thcipriani: Yubikey-SSH-FIDO: remove old key for thcipriani [puppet] - https://gerrit.wikimedia.org/r/1220410 (https://phabricator.wikimedia.org/T413416) [00:52:46] (CR) Thcipriani: [C:-1] "Would like to verify access before this is removed." 
[puppet] - https://gerrit.wikimedia.org/r/1220410 (https://phabricator.wikimedia.org/T413416) (owner: Thcipriani) [00:54:56] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1220408 (owner: TrainBranchBot) [01:00:02] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1220034 (owner: TrainBranchBot) [01:10:09] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1220411 [01:10:09] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1220411 (owner: TrainBranchBot) [01:20:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:29:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [01:31:13] RECOVERY - dump of matomo in eqiad on backupmon1001 is OK: Last dump for matomo at eqiad (db1208) taken on 2025-12-23 01:07:56 (435 MiB, -0.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:37:27] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1220411 (owner: TrainBranchBot) [02:14:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:15:15] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:16:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [02:20:15] RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:23:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:28:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:33:15] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:38:15] FIRING: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:43:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:51:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn 
[02:52:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [02:57:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [03:02:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [03:03:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:07:38] (CR) Anzx: [C:-1] "use tab instead of spaces in the beginning of line" [mediawiki-config] - https://gerrit.wikimedia.org/r/1216721 (owner: Nvdtn19) [03:08:15] RESOLVED: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:21:10] (PS3) Nvdtn19: Configuration for viwikivoyage per T405724 [mediawiki-config] - https://gerrit.wikimedia.org/r/1216721 [03:24:27] (CR) Nvdtn19: "I applied the fix." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1216721 (owner: Nvdtn19) [03:27:23] (CR) Anzx: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1216721 (owner: Nvdtn19) [03:39:36] (CR) Anzx: [C:+1] "clearer description of of task would be recommended, and please schedule your using schedule backport button below, at any upcoming backpo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1216721 (owner: Nvdtn19) [03:43:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [03:48:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [03:51:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [03:53:30] (PS4) Nvdtn19: viwikivoyage: enable relatedarticle and pop-up [mediawiki-config] - https://gerrit.wikimedia.org/r/1216721 [04:10:25] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:36:16] RESOLVED: ErrorBudgetBurn: 
xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [04:41:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:46:15] FIRING: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:51:15] RESOLVED: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [04:57:00] FIRING: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:01:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:01:45] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:02:37] !log mwpresync@deploy2002 Pruned MediaWiki: 1.46.0-wmf.4 (duration: 02m 34s) [05:07:45] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:08:00] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:12:45] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:13:00] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:17:45] RESOLVED: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:18:58] (CR) Nvdtn19: "Why in https://schedule-deployment.toolforge.org/backport/1216721, the backport window dropdown have nothing to chosen? 
It doesn't let me " [mediawiki-config] - https://gerrit.wikimedia.org/r/1216721 (owner: Nvdtn19) [05:20:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:25:30] (CR) Anzx: "It's because of Year end holidays, please after January 5 2026, till that date no deploying will be done" [mediawiki-config] - https://gerrit.wikimedia.org/r/1216721 (owner: Nvdtn19) [05:34:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:35:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [05:47:20] (PS1) Marostegui: mariadb: Decommission es2028. [puppet] - https://gerrit.wikimedia.org/r/1220415 (https://phabricator.wikimedia.org/T408407) [05:50:57] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts es2028.codfw.wmnet [05:51:07] (CR) Marostegui: [C:+2] mariadb: Decommission es2028. 
[puppet] - https://gerrit.wikimedia.org/r/1220415 (https://phabricator.wikimedia.org/T408407) (owner: Marostegui) [05:56:43] ops-eqiad, SRE, DC-Ops: Inbound errors on interface lswtest-d8-eqiad:mgmt0 () - https://phabricator.wikimedia.org/T413004#11482739 (VRiley-WMF) a:VRiley-WMF [05:56:45] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [06:00:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:00:28] ops-eqiad, SRE, DC-Ops: Inbound errors on interface lswtest-d8-eqiad:mgmt0 () - https://phabricator.wikimedia.org/T413004#11482746 (VRiley-WMF) →Duplicate dup:T412733 [06:00:31] SRE, Infrastructure-Foundations, netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11482748 (VRiley-WMF) [06:00:37] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2028.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [06:01:29] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2028.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [06:01:29] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:01:30] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2028.codfw.wmnet [06:01:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:02:15] ops-codfw, DBA, DC-Ops, 
decommission-hardware, Patch-For-Review: decommission es2028 - https://phabricator.wikimedia.org/T408407#11482768 (Marostegui) a:Marostegui→None [06:06:13] ops-codfw, DBA, DC-Ops, decommission-hardware, Patch-For-Review: decommission es2028 - https://phabricator.wikimedia.org/T408407#11482774 (Marostegui) This is ready for #dc-ops [06:06:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:07:41] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [06:09:44] !log marostegui@cumin1003 START - Cookbook sre.mysql.newdepool pc1013 - test [06:09:44] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [06:09:52] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [06:09:53] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newdepool (exit_code=0) pc1013 - test [06:10:30] !log marostegui@cumin1003 START - Cookbook sre.mysql.newpool pc1013 gradually with 4 steps - test [06:10:30] !log marostegui@cumin1003 START - Cookbook sre.mysql.parsercache [06:10:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.parsercache (exit_code=0) [06:10:43] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.newpool (exit_code=0) pc1013 gradually with 4 steps - test [06:11:24] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1360 - vriley@cumin1003" [06:11:29] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1360 - vriley@cumin1003" [06:11:29] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:11:48] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install 
wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11482775 (VRiley-WMF) [06:12:02] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1360 [06:13:18] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1360 [06:14:59] (CR) Marostegui: "I did a first test with pc1013 and it correctly depooled pc1." [cookbooks] - https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: Federico Ceratto) [06:16:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:16:58] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1360.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [06:22:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [06:25:25] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1360.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [06:26:24] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11482785 (VRiley-WMF) [06:28:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:34:08] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1360.eqiad.wmnet with OS trixie [06:34:24] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - 
https://phabricator.wikimedia.org/T408749#11482787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host wikikube-worker1360.eqiad.wmnet with OS trixie [06:38:15] RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:39:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:44:15] FIRING: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:45:26] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1360.eqiad.wmnet with reason: host reimage [06:48:32] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1360.eqiad.wmnet with reason: host reimage [06:49:15] RESOLVED: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:49:23] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11482825 (VRiley-WMF) [06:50:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:58:37] !log 
vriley@cumin1003 START - Cookbook sre.dns.netbox [07:03:13] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1361 - vriley@cumin1003" [07:03:17] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1361 - vriley@cumin1003" [07:03:17] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:03:49] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1361 [07:04:37] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [07:05:01] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1361 [07:05:47] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1361.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:05:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [07:06:39] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [07:06:40] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1360.eqiad.wmnet with OS trixie [07:06:55] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install 
wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11482833 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1360.eqiad.wmnet with OS trixie completed: - wi... [07:07:06] !incidents [07:07:07] 7220 (UNACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [07:07:07] 7219 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [07:07:07] 7218 (RESOLVED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [07:07:16] !ack 7720 [07:07:16] Attempt to ack incident 7720 failed. [07:07:20] what's up [07:07:39] !ack 7220 [07:07:39] 7220 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [07:10:43] (on my computer now) [07:13:52] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1361.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:15:05] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [07:15:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [07:16:01] !incidents [07:16:01] 7220 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [07:16:02] 7219 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [07:16:02] 7218 (RESOLVED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [07:17:42] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1361.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:18:28] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1362 - vriley@cumin1003" [07:18:33] !log vriley@cumin1003 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1362 - vriley@cumin1003" [07:18:33] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:19:03] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1362 [07:19:07] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1361.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:20:41] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1362 [07:22:05] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1362.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:22:09] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1361.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:23:29] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1361.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:25:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:25:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [07:29:28] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1361.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:30:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:30:42] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1361.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:30:55] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1362.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:32:22] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1362.eqiad.wmnet with OS trixie [07:32:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11482844 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host wikikube-worker1362.eqiad.wmnet with OS trixie [07:34:05] PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 100% [07:34:59] RECOVERY - Host wikikube-worker1275 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [07:35:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in eqsin #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [07:36:29] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1361.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:37:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [07:37:43] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1361.mgmt.eqiad.wmnet with chassis set policy 
FORCE_RESTART [07:38:26] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1361.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:38:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [07:39:10] !incidents [07:39:11] 7221 (UNACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [07:39:11] 7220 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [07:39:11] 7219 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [07:39:12] 7218 (RESOLVED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [07:39:18] !ack 7221 [07:39:18] 7221 (ACKED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [07:39:49] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1361.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:40:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:41:53] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [07:43:33] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1362.eqiad.wmnet with reason: host reimage [07:43:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - 
https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=codfw&var-cluster=text&var-origin=mw-web-ro.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [07:44:32] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:44:59] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1361 [07:45:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:46:06] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1361 [07:48:03] !incidents [07:48:03] 7221 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [07:48:03] 7220 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [07:48:03] 7219 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [07:48:04] 7218 (RESOLVED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [07:49:16] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1362.eqiad.wmnet with reason: host reimage [07:49:59] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1361.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:50:15] RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:50:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11482846 (10VRiley-WMF) [07:51:41] !log vriley@cumin1003 END (FAIL) - Cookbook 
sre.hosts.provision (exit_code=99) for host wikikube-worker1361.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [07:55:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [07:56:01] !incidents [07:56:01] 7222 (UNACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [07:56:01] 7221 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [07:56:02] 7220 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [07:56:02] 7219 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [07:56:02] 7218 (RESOLVED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [07:56:09] and now it's saturation [07:56:10] !ack 7222 [07:56:10] 7222 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [07:58:14] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [07:59:06] (03PS1) 10Slyngshede: data.yaml: Offboarding resquito [puppet] - 10https://gerrit.wikimedia.org/r/1220609 [08:02:04] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1363 - vriley@cumin1003" [08:02:08] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1363 - vriley@cumin1003" [08:02:09] !log 
vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:02:56] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1363 [08:03:06] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1363 [08:03:59] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1363.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:04:20] !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Resquito out of all services on: 2444 hosts [08:05:24] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [08:05:45] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [08:05:46] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1362.eqiad.wmnet with OS trixie [08:06:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11482849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1362.eqiad.wmnet with OS trixie completed: - wikikub... 
[08:06:38] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11482850 (VRiley-WMF) [08:07:18] !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Resquito out of all services on: 1 hosts [08:07:25] !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Resquito out of all services on: 1 hosts [08:07:33] !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Resquito out of all services on: 1 hosts [08:07:42] !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Resquito out of all services on: 1 hosts [08:10:25] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:10:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [08:11:55] !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Resquito out of all services on: 1 hosts [08:12:15] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1363.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:12:36] (CR) Slyngshede: [C:+2] data.yaml: Offboarding resquito [puppet] - https://gerrit.wikimedia.org/r/1220609 (owner: Slyngshede) [08:17:24] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host
wikikube-worker1363.eqiad.wmnet with OS trixie [08:17:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11482863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host wikikube-worker1363.eqiad.wmnet with OS trixie [08:22:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [08:23:02] !incidents [08:23:02] 7223 (UNACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [08:23:02] 7222 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [08:23:02] 7221 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [08:23:03] 7220 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [08:23:03] 7219 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [08:23:03] 7218 (RESOLVED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [08:23:05] !ack 7223 [08:23:06] 7223 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [08:25:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate 
[08:30:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:35:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:38:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:44:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [08:45:04] !incidents [08:45:04] 7223 (ACKED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [08:45:04] 7224 (UNACKED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [08:45:04] 7222 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [08:45:05] 7221 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [08:45:05] 7220 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [08:45:05] 7219 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [08:45:05] 7218 (RESOLVED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [08:45:17] !ack 7224 [08:45:18] 7224 (ACKED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [08:45:20] !incidents [08:45:21] 7223 (ACKED) 
TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [08:45:21] 7224 (ACKED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [08:45:21] 7222 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [08:45:21] 7221 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [08:45:21] 7220 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [08:45:22] 7219 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [08:45:22] 7218 (RESOLVED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [08:49:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [08:53:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:56:27] (CR) Dpogorzelski: [C:+2] ml: add ml specific config [docker-images/production-images] - https://gerrit.wikimedia.org/r/1219552 (https://phabricator.wikimedia.org/T394778) (owner: Dpogorzelski) [08:56:29] (CR) Dpogorzelski: [V:+2 C:+2] ml: add ml specific config [docker-images/production-images] - https://gerrit.wikimedia.org/r/1219552 (https://phabricator.wikimedia.org/T394778) (owner: Dpogorzelski) [08:59:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:07:51] RESOLVED:
TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [09:09:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:16:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [09:16:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [09:18:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:18:40] PROBLEM - Host wikikube-worker1053 is DOWN: PING CRITICAL - Packet loss = 80%, RTA = 4200.66 ms [09:18:44] RECOVERY - Host wikikube-worker1053 is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms [09:20:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:33:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane 
Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [09:37:39] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1363.eqiad.wmnet with OS trixie [09:37:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11482926 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1363.eqiad.wmnet with OS trixie executed with errors... [09:43:19] !incidents [09:43:19] 7224 (ACKED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [09:43:19] 7223 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [09:43:20] 7222 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [09:43:20] 7221 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [09:43:20] 7220 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [09:43:20] 7219 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [09:43:20] 7218 (RESOLVED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [09:51:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... 
[09:51:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [10:05:58] SRE, DNS, Traffic: [Update DNS Record Request] - wikimedia.org - https://phabricator.wikimedia.org/T413259#11482958 (Fabfur) Open→Resolved a:Fabfur [10:09:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:09:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:10:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ...
[10:10:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [10:12:11] 06SRE, 06DBA: High rate of DB errors on prod - https://phabricator.wikimedia.org/T413426#11482964 (10Peachey88) [10:15:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:15:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:15:44] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... 
[10:15:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [10:20:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:30:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:32:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [10:32:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [10:34:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [10:35:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:37:23] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1363.eqiad.wmnet with OS trixie [10:37:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: 
Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11482977 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host wikikube-worker1363.eqiad.wmnet with OS trixie [10:37:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [10:37:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [10:40:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:42:11] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1361.eqiad.wmnet with OS trixie [10:42:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11482980 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host wikikube-worker1361.eqiad.wmnet with OS trixie [10:44:51] RESOLVED: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet in codfw #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [10:45:15] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:50:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:53:24] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1361.eqiad.wmnet with reason: host reimage [10:55:15] FIRING: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:57:57] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11482989 (elukey) Resolved→Open >>! In T408749#11481731, @Jclark-ctr wrote: > @elukey Ran into another provisioning issue. It looks like IPv4 PXE was disabled. Th... [10:59:49] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1361.eqiad.wmnet with reason: host reimage [10:59:51] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11482994 (Jclark-ctr) It was all of them failed to image over http but when I enabled ode they imagined [11:00:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:00:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:00:26] (PS1) Clément Goubert: changeprop-jobqueue: Halve refreshLinks concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1220622 [11:00:37] ops-eqiad, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker refresh -
https://phabricator.wikimedia.org/T408760#11483006 (VRiley-WMF) [11:00:53] (CR) Lucas Werkmeister (WMDE): [C:+1] "let’s try it IMHO" [deployment-charts] - https://gerrit.wikimedia.org/r/1220622 (owner: Clément Goubert) [11:01:06] (CR) Jelto: [C:+1] "lgtm, lets try this" [deployment-charts] - https://gerrit.wikimedia.org/r/1220622 (owner: Clément Goubert) [11:03:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:05:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:05:33] (CR) Marostegui: [C:+1] changeprop-jobqueue: Halve refreshLinks concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1220622 (owner: Clément Goubert) [11:05:52] (CR) Clément Goubert: [C:+2] changeprop-jobqueue: Halve refreshLinks concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1220622 (owner: Clément Goubert) [11:07:48] (Merged) jenkins-bot: changeprop-jobqueue: Halve refreshLinks concurrency [deployment-charts] - https://gerrit.wikimedia.org/r/1220622 (owner: Clément Goubert) [11:08:12] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [11:09:31] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [11:09:51] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [11:10:15] RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:11:03] !log cgoubert@deploy2002 helmfile [codfw] DONE
helmfile.d/services/changeprop-jobqueue: apply [11:12:13] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [11:12:26] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [11:14:43] 06SRE, 06DBA: High rate of DB errors on prod - https://phabricator.wikimedia.org/T413426#11483022 (10TheDJ) Refreshlinks reduced via https://gerrit.wikimedia.org/r/1220622 [11:15:48] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [11:16:08] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [11:16:09] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1361.eqiad.wmnet with OS trixie [11:16:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11483024 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1361.eqiad.wmnet with OS trixie completed: - wikiku... [11:19:23] 06SRE, 06DBA: High rate of DB errors on prod - https://phabricator.wikimedia.org/T413426#11483026 (10TheDJ) Noting that SQL traffic had a significant increase over the night: {F71212645} https://grafana-rw.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-7d&to=now&timezone=utc&var-site=codfw&var-g... [11:19:26] 06SRE, 06DBA: High rate of DB errors on prod - https://phabricator.wikimedia.org/T413426#11483027 (10TheDJ) Noticeable edit traffic (as reported by @Sjoerddebruin, discovered via wikiscan). This may be related, but could be totally UNRELATED, no judgement.. 111k edits in the last 24 hours https://www.wikidat... 
[11:22:24] PROBLEM - Thanos swift https on thanos-fe1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [11:23:43] 06SRE, 06DBA: High rate of DB errors on prod - https://phabricator.wikimedia.org/T413426#11483029 (10Marostegui) >>! In T413426#11483026, @TheDJ wrote: > Noting that SQL traffic had a significant increase over the night: > > {F71212645} > > https://grafana-rw.wikimedia.org/d/000000278/mysql-aggregated?orgId=... [11:25:14] RECOVERY - Thanos swift https on thanos-fe1007 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Thanos [11:57:37] !log vriley@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1363.eqiad.wmnet with OS trixie [11:57:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11483077 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1363.eqiad.wmnet with OS trixie executed with errors... 
[12:03:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:06:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:10:25] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:15:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:20:15] RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:23:15] FIRING: [7x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:27:41] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [12:27:49] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [12:28:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:31:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:33:15] RESOLVED: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:34:55] (03PS10) 10Federico Ceratto: sre.mysql.newpool: [de]pool various section kinds [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) [12:36:31] (03PS1) 10Clément Goubert: cp-jobqueue: Partition wikibase-addUsagesForPage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220626 [12:37:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:37:17] (03CR) 10Marostegui: [C:03+1] cp-jobqueue: Partition wikibase-addUsagesForPage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220626 (owner: 10Clément Goubert) [12:37:33] (03CR) 10Federico Ceratto: "I removed the incorrect phabricator update. 
I did a good cleanup of the phabricator updating logic and added more tests on the expected ou" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [12:37:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [12:38:12] !incidents [12:38:13] 7225 (UNACKED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [12:38:13] 7224 (RESOLVED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [12:38:13] 7223 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [12:38:13] 7222 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [12:38:13] 7221 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [12:38:14] 7220 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [12:38:14] 7219 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [12:38:14] 7218 (RESOLVED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [12:39:07] (03CR) 10Clément Goubert: [C:03+2] cp-jobqueue: Partition wikibase-addUsagesForPage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220626 (owner: 10Clément Goubert) [12:39:46] !ack 7225 [12:39:47] 7225 (ACKED) TransitPeeringTransportOutSaturation network sre 
(cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [12:41:17] (03Merged) 10jenkins-bot: cp-jobqueue: Partition wikibase-addUsagesForPage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220626 (owner: 10Clément Goubert) [12:41:49] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [12:42:01] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [12:42:36] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [12:43:05] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [12:43:09] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [12:43:47] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [12:46:29] !incidents [12:46:29] 7225 (ACKED) TransitPeeringTransportOutSaturation network sre (cr3-eqsin:9804 Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016} xe-0/1/3 gnmi eqsin) [12:46:30] 7224 (RESOLVED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [12:46:30] 7223 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [12:46:30] 7222 (RESOLVED) TransitPeeringTransportOutSaturation network sre (cr1-codfw:9804 Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1} xe-1/1/1:0 gnmi codfw) [12:46:30] 7221 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [12:46:30] 7220 (RESOLVED) ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet codfw) [12:46:31] 7219 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [12:46:31] 7218 (RESOLVED) [2x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [12:53:14] (03PS1) 10Jgiannelos: 
wikifeeds: Add request definition for page analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220629 (https://phabricator.wikimedia.org/T411769) [12:55:30] (03CR) 10CI reject: [V:04-1] wikifeeds: Add request definition for page analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220629 (https://phabricator.wikimedia.org/T411769) (owner: 10Jgiannelos) [12:57:43] (03PS1) 10Clément Goubert: Revert "cp-jobqueue: Partition wikibase-addUsagesForPage" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220630 [12:59:24] (03CR) 10Marostegui: [C:03+1] Revert "cp-jobqueue: Partition wikibase-addUsagesForPage" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220630 (owner: 10Clément Goubert) [13:00:28] (03CR) 10Clément Goubert: [C:03+2] Revert "cp-jobqueue: Partition wikibase-addUsagesForPage" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220630 (owner: 10Clément Goubert) [13:01:36] (03PS2) 10Jgiannelos: wikifeeds: Add request definition for page analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220629 (https://phabricator.wikimedia.org/T411769) [13:02:02] (03Merged) 10jenkins-bot: Revert "cp-jobqueue: Partition wikibase-addUsagesForPage" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220630 (owner: 10Clément Goubert) [13:03:23] (03CR) 10CI reject: [V:04-1] wikifeeds: Add request definition for page analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220629 (https://phabricator.wikimedia.org/T411769) (owner: 10Jgiannelos) [13:05:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:12:40] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [13:12:50] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/analytics-test: apply [13:14:05] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [13:14:17] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [13:14:21] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [13:14:40] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [13:14:47] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [13:15:03] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [13:21:14] 06SRE, 06DBA: High rate of DB errors on prod - https://phabricator.wikimedia.org/T413426#11483147 (10TheDJ) This is likely unrelated, but i found an interesting increase in mediawiki reported db errors since essentially the last train {F71214117} [13:23:07] 06SRE, 06DBA: High rate of DB errors on prod - https://phabricator.wikimedia.org/T413426#11483148 (10Marostegui) p:05Unbreak!→03High We are still troubleshooting the issue, but it looks like the errors have stopped for now. 
[13:37:31] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [13:39:56] !log urbanecm@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [13:40:49] !log urbanecm@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [13:41:01] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1364 - vriley@cumin1003" [13:41:05] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt wikikube-worker1364 - vriley@cumin1003" [13:41:05] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:41:45] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1364 [13:42:07] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1364 [13:42:24] 06SRE: Unrecognised file under /srv/deployment-charts - https://phabricator.wikimedia.org/T413433 (10Urbanecm_WMF) 03NEW [13:42:51] !log vriley@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1364.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:43:37] (03PS1) 10Clément Goubert: cp-jobqueue: Revert refreshLinks throttle, throttle addUsagesForPage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220634 [13:47:27] (03CR) 10Clément Goubert: [C:03+2] cp-jobqueue: Revert refreshLinks throttle, throttle addUsagesForPage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220634 (owner: 10Clément Goubert) [13:49:33] (03Merged) 10jenkins-bot: cp-jobqueue: Revert refreshLinks throttle, throttle addUsagesForPage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220634 (owner: 10Clément Goubert) [13:50:13] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
wikikube-worker1364.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [13:52:16] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:53:19] marostegui is this a known problem https://logstash.wikimedia.org/goto/6171c455bbd00a05da7ab829441cd463 ? [13:53:22] !log vriley@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1364.eqiad.wmnet with OS trixie [13:53:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11483201 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1003 for host wikikube-worker1364.eqiad.wmnet with OS trixie [13:53:51] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [13:54:05] thedj: that's probably also coming from the current issue cc claime [13:54:09] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [13:54:13] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [13:54:32] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [13:55:52] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [13:56:05] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [13:57:16] RESOLVED: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:04:38] !log vriley@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1364.eqiad.wmnet with reason: host reimage [14:06:52] 10ops-magru, 06Infrastructure-Foundations, 10netops: 
Flapping OSFP between cr1-magru and cr2-eqiad - https://phabricator.wikimedia.org/T413415#11483229 (10Papaul) 05Open→03Invalid [14:08:01] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1364.eqiad.wmnet with reason: host reimage [14:09:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11483248 (10VRiley-WMF) [14:15:40] (03PS1) 10STran: Disable GeoIP2 lookups from WikimediaEvents on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1220635 (https://phabricator.wikimedia.org/T413100) [14:22:24] (03CR) 10STran: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1220635 (https://phabricator.wikimedia.org/T413100) (owner: 10STran) [14:23:19] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [14:23:40] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11483270 (10Papaul) I am still waiting for Nokia to get back in touch with me. [14:23:41] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1003" [14:23:42] !log vriley@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1364.eqiad.wmnet with OS trixie [14:23:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11483271 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1003 for host wikikube-worker1364.eqiad.wmnet with OS trixie completed: - wikikub... [14:25:54] (03CR) 10STran: "I think this config patch is right? 
Tests say I didn't make a change but I'm reasonably sure I'm overriding this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1220635 (https://phabricator.wikimedia.org/T413100) (owner: 10STran) [14:29:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11483278 (10elukey) @Jclark-ctr got it, do you remember more precisely the error while doing HTTP boot? Did you see any failure registered when the host tried to do it etc..? [14:37:02] PROBLEM - Host cp3079 is DOWN: PING CRITICAL - Packet loss = 100% [14:37:02] PROBLEM - Host cp3075 is DOWN: PING CRITICAL - Packet loss = 100% [14:37:22] RECOVERY - Host cp3075 is UP: PING OK - Packet loss = 0%, RTA = 81.29 ms [14:37:22] RECOVERY - Host cp3079 is UP: PING OK - Packet loss = 0%, RTA = 78.30 ms [14:39:43] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wikikube-worker1328.eqiad.wmnet with OS trixie [14:39:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11483302 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wikikube-worker1328.eqiad.wmnet with OS trixie [14:43:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11483304 (10Jclark-ctr) @elukey Although it was not working yesterday, it seems to be working now. I will revert all the servers back to PXE off and reimage them to verify. 
[14:45:17] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wikikube-worker1329.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [14:45:25] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1329.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [14:51:50] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1328.eqiad.wmnet with reason: host reimage [14:57:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:58:49] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1328.eqiad.wmnet with reason: host reimage [15:06:26] 06SRE, 06DBA: High rate of DB errors on prod - https://phabricator.wikimedia.org/T413426#11483358 (10Raine) 05Open→03Resolved a:03Raine Looks like the cause has been identified and dealt with, so closing this. Thanks everyone! [15:09:13] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:22] (03CR) 10Jgiannelos: "I was hoping there was a listener already for page-analytics but i don't think its defined somewhere." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1220629 (https://phabricator.wikimedia.org/T411769) (owner: 10Jgiannelos) [15:12:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [15:14:10] FIRING: BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:15:04] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1328.eqiad.wmnet with OS trixie [15:15:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11483371 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wikikube-worker1328.eqiad.wmnet with OS trixie completed: - wikiku... 
[15:15:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [15:19:10] RESOLVED: BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:19:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker13[28-34] - https://phabricator.wikimedia.org/T408749#11483395 (10Jclark-ctr) 05Open→03Resolved [15:25:04] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:25:54] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:34:13] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:42:13] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11483450 (10Jclark-ctr) This drive has arrived @RKemper and @BTullis [15:42:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1198 - https://phabricator.wikimedia.org/T413336#11483451 (10Jclark-ctr) This replacement drive has arrived @RKemper @BTullis [16:04:32] (03PS1) 10Federico Ceratto: mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) [16:05:48] (03CR) 10CI reject: [V:04-1] mariadb: monitor GTID usage in replication [alerts] - 10https://gerrit.wikimedia.org/r/1220640 (https://phabricator.wikimedia.org/T315642) (owner: 10Federico Ceratto) [16:09:37] 06SRE, 06DBA: High rate of DB errors on prod - https://phabricator.wikimedia.org/T413426#11483513 (10Marostegui) a:05Raine→03Clement_Goubert [16:10:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:10:25] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:24:10] FIRING: BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:24:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:29:10] RESOLVED: BFDdown: BFD session down between cr2-drmrs and fe80::ee38:7300:1ae8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:54:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:00:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:34:52] PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 50%, RTA = 2728.73 ms [17:35:08] RECOVERY - Host wikikube-worker1275 is UP: PING OK - Packet loss = 0%, RTA = 244.75 ms [17:36:43] (03PS1) 10Btullis: Add a kyuubi deployment to the spark-support chart for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220644 (https://phabricator.wikimedia.org/T410017) [17:38:27] (03CR) 10CI reject: [V:04-1] Add a kyuubi deployment to the spark-support chart for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220644 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) 
[17:40:41] (03PS2) 10Btullis: Add a kyuubi deployment to the spark-support chart for analytics-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1220644 (https://phabricator.wikimedia.org/T410017) [17:48:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:54:11] (03CR) 10Cwhite: [C:03+2] opensearch/curator: Remove support for buster [puppet] - 10https://gerrit.wikimedia.org/r/1219879 (owner: 10Muehlenhoff) [18:10:35] PROBLEM - Host cp3075 is DOWN: PING CRITICAL - Packet loss = 100% [18:10:35] PROBLEM - Host cp3069 is DOWN: PING CRITICAL - Packet loss = 100% [18:10:38] PROBLEM - Host ganeti3008 is DOWN: PING CRITICAL - Packet loss = 100% [18:10:42] RECOVERY - Host cp3075 is UP: PING OK - Packet loss = 0%, RTA = 80.08 ms [18:10:44] RECOVERY - Host cp3069 is UP: PING WARNING - Packet loss = 60%, RTA = 79.64 ms [18:11:00] RECOVERY - Host ganeti3008 is UP: PING OK - Packet loss = 0%, RTA = 78.48 ms [18:13:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqdfw and 208.80.153.214 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:18:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqdfw and 208.80.153.214 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [18:33:01] (03CR) 10Cwhite: [C:03+2] logstash: put logging-sd100[567] in service [puppet] - 10https://gerrit.wikimedia.org/r/1220406 (https://phabricator.wikimedia.org/T413414) (owner: 10Cwhite) [18:35:34] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T413385#11483659 (10Jclark-ctr) Crypto erase on drive prior to adding but the failed drive has 
been replaced on db1155 slot 2 [18:37:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:56:07] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1155 - https://phabricator.wikimedia.org/T413385#11483675 (10Jclark-ctr) Drive in progress of rebuilding ` RAID Information Progress 10% Used RAID Disk Space 1787.88 GB Available RAID Disk Space 0 GB Non RAID Disk Cache Policy Not Applicable ` [19:37:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:38:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:43:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [19:49:47] 10SRE-swift-storage, 06Commons, 10media-backups: File not found: /v1/AUTH_mw/wikipedia-commons-local-public ... for 3 files - https://phabricator.wikimedia.org/T400567#11483740 (10Nemo_bis) This is happening now for https://upload.wikimedia.org/wikipedia/commons/6/64/2025-11-16_ONEW_concert_032.jpg : > File... 
[19:58:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [20:10:25] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:58:31] (PS9) CDobbins: prometheus: add depooled cp* host check [puppet] - https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [21:59:12] (CR) CI reject: [V:-1] prometheus: add depooled cp* host check [puppet] - https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: CDobbins) [22:13:23] (PS10) CDobbins: prometheus: add depooled cp* host check [puppet] - https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [22:14:05] (CR) CI reject: [V:-1] prometheus: add depooled cp* host check [puppet] - https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: CDobbins) [22:15:30] (PS11) CDobbins: prometheus: add depooled cp* host check [puppet] - https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [22:19:13] (PS12) CDobbins: prometheus: add depooled cp* host check [puppet] - https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [22:42:00] (CR) CDobbins: [V:+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7847/console" [puppet] - https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: CDobbins) [22:45:16] FIRING: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - 
https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [22:49:33] (PS13) CDobbins: prometheus: add depooled cp* host check [puppet] - https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) [23:33:48] PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:12] RECOVERY - Host wikikube-worker1275 is UP: PING WARNING - Packet loss = 33%, RTA = 957.42 ms [23:35:16] RESOLVED: ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [23:38:17] (CR) BCornwall: "Great start! I'm wondering if the schema could be fleshed out a little more, though. Glancing at https://config-master.wikimedia.org/pools" [puppet] - https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641) (owner: CDobbins) [23:45:26] (PS14) CDobbins: prometheus: add depooled cp* host check [puppet] - https://gerrit.wikimedia.org/r/1219634 (https://phabricator.wikimedia.org/T406641)