[00:02:17] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:02:55] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:06:49] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:09:31] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:09:41] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:14:05] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:18:45] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:19:39] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:20:51] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:21:51] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:23:07] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:24:09] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:29:35] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:31:51] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:36:43] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:40:01] (BlazegraphJvmQuakeWarnGC) firing: (6) Blazegraph instance wdqs1004:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [00:50:17] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:50:27] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:52:03] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:52:35] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:54:11] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:56:21] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:00:51] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:08:07] PROBLEM - OSPF status on 
cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:10:13] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:12:29] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:14:57] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:16:45] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:19:19] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:23:51] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:28:07] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:34:53] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:37:11] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:40:22] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:09] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : 
OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:43:57] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:44:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [01:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:44:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [01:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:45:22] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:48:49] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:48:57] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:50:41] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:50:47] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:51:05] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:57:13] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:57:29] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:57:37] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:59:47] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:01:29] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:01:49] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:03:42] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:06:13] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:08:15] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:12:29] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:14:55] (PS1) Ladsgroup: Add add_lu_attachment_method_T305300.py [software/schema-changes] - https://gerrit.wikimedia.org/r/776462 (https://phabricator.wikimedia.org/T305300) [02:16:31] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:16:51] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:17:08] (PS2) Ladsgroup: Add 
add_lu_attachment_method_T305300.py [software/schema-changes] - https://gerrit.wikimedia.org/r/776462 (https://phabricator.wikimedia.org/T305300) [02:18:39] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:23:13] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:29:51] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:31:33] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:31:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [02:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [02:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:33:47] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:34:09] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:36:31] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:40:37] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:42:53] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:43:21] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:45:39] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:50:07] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:52:25] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:52:29] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:59:19] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:03:49] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:08:27] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:14:55] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:15:11] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:17:33] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:19:53] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:20:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [03:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:20:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [03:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:39] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:26:39] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:27:07] PROBLEM - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:29:17] RECOVERY - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 1.018 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [03:33:33] RECOVERY - WDQS SPARQL on 
wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 3.735 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:35:19] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:37:59] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:42:31] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:44:25] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:47:05] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:49:19] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:51:11] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:53:29] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:56:11] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:02:37] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:04:55] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:05:15] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:05:21] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:06:06] SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (Aitolkyn) - https://phabricator.wikimedia.org/T305299 (Aitolkyn) [04:09:35] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:11:43] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:11:51] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:12:01] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:12:09] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:14:19] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:15:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [04:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:15:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [04:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:15:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24023 and previous config saved to /var/cache/conftool/dbconfig/20220404-041545-ladsgroup.json [04:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:15:47] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:18:41] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:20:49] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:20:55] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:27:39] PROBLEM 
- BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:28:05] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:29:05] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:32:13] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:34:31] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:40:16] (BlazegraphJvmQuakeWarnGC) firing: (6) Blazegraph instance wdqs1004:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [04:43:35] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:45:41] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:45:59] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:47:03] (PS2) KartikMistry: Enable Content and Section Translation for Persian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/775829 (https://phabricator.wikimedia.org/T296475) [04:50:07] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:54:33] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:54:43] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:56:09] (CR) Marostegui: [C: +1] Add add_lu_attachment_method_T305300.py [software/schema-changes] - https://gerrit.wikimedia.org/r/776462 (https://phabricator.wikimedia.org/T305300) (owner: Ladsgroup) [04:56:34] (CR) Marostegui: [C: +1] dbtools: Drop unused control-mariadb files [software] - https://gerrit.wikimedia.org/r/776235 (owner: Ladsgroup) [04:57:05] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:59:01] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:03:37] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:04:22] (CR) Ladsgroup: [C: +2] dbtools: Drop unused control-mariadb files [software] - https://gerrit.wikimedia.org/r/776235 (owner: Ladsgroup) [05:04:31] (CR) Ladsgroup: [C: +2] Add add_lu_attachment_method_T305300.py [software/schema-changes] - https://gerrit.wikimedia.org/r/776462 (https://phabricator.wikimedia.org/T305300) (owner: Ladsgroup) [05:05:24] (Merged) jenkins-bot: dbtools: Drop unused 
control-mariadb files [software] - https://gerrit.wikimedia.org/r/776235 (owner: Ladsgroup) [05:05:26] (Merged) jenkins-bot: Add add_lu_attachment_method_T305300.py [software/schema-changes] - https://gerrit.wikimedia.org/r/776462 (https://phabricator.wikimedia.org/T305300) (owner: Ladsgroup) [05:08:29] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:10:27] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:11:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1130.eqiad.wmnet with OS bullseye [05:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:41] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:15:19] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:17:37] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:19:39] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:19:55] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:20:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24024 and previous config saved to /var/cache/conftool/dbconfig/20220404-052026-ladsgroup.json [05:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:20:30] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:20:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1130.eqiad.wmnet with reason: host reimage [05:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1130.eqiad.wmnet with reason: host reimage [05:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:10] (PS1) Marostegui: Revert "db1130: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/776335 [05:35:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P24025 and previous config saved to 
/var/cache/conftool/dbconfig/20220404-053531-ladsgroup.json [05:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:37] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:35:47] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:36:59] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:39:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1130.eqiad.wmnet with OS bullseye [05:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:15] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:45:45] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:50:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P24026 and previous config saved to /var/cache/conftool/dbconfig/20220404-055037-ladsgroup.json [05:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:53] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:52:23] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:52:41] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:54:39] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:54:47] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:58:07] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:58:15] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:59:25] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:02:17] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:03:33] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:05:13] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:05:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24027 and previous config saved to /var/cache/conftool/dbconfig/20220404-060542-ladsgroup.json [06:05:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [06:05:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [06:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:47] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:07] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:08:15] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:08:50] (PS7) Urbanecm: GrowthExperiments: Add mailing list question for eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) (owner: Kosta Harlan) [06:09:27] SRE, Observability-Metrics: Include apache_exporter in puppet module httpd (was: apache) - 
https://phabricator.wikimedia.org/T187434 (fgiunchedi) [06:09:33] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:09:56] SRE, serviceops, Developer Productivity, Performance-Team (Radar), Release-Engineering-Team (Radar): Debug hosts sometimes Fatal error: "The UdpSocket to 127.0.0.1:10514 has been closed" - https://phabricator.wikimedia.org/T214734 (fgiunchedi) [06:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:13:23] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:13:36] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:15:17] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:15:29] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:17:41] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:18:35] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs.
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:21:41] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:23:07] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:23:49] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:24:06] (CR) Urbanecm: [C: +1] "overall LGTM, but I don't understand why the config is not removed" [mediawiki-config] - https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) (owner: Kosta Harlan) [06:26:55] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:27:36] jouncebot: next [06:27:36] In 0 hour(s) and 32 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220404T0700) [06:27:43] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:28:22] (CR) Urbanecm: "actually, a q inline" [mediawiki-config] - https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) (owner: Kosta Harlan) [06:29:51] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:34:51] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:37:03] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:38:33] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP :
OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:44:45] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:45:33] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:47:47] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:50:29] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:54:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [06:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [06:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:25] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:57:50] SRE, Developer Productivity, Performance-Team (Radar), Release-Engineering-Team (Radar): Debug hosts sometimes Fatal error: "The UdpSocket to 127.0.0.1:10514 has been closed" - https://phabricator.wikimedia.org/T214734 (Joe) Removing serviceops as this is not actually a production issue and is l... [07:00:05] Amir1, awight, Urbanecm, and taavi: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220404T0700). Please do the needful.
[07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:08] o/ [07:00:28] kart_: do you want to self-service? [07:00:49] taavi: Sure. [07:00:53] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:01:03] cool, just let me know when you're done [07:01:09] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:01:39] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:01:43] taavi: i'm an idiot and put patches on last week [07:01:45] i here [07:01:57] ok, can you move them to the correct window? [07:02:16] taavi: done [07:02:16] (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for prometheus-atlas-exporter [puppet] - https://gerrit.wikimedia.org/r/775861 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [07:02:18] (CR) KartikMistry: [C: +2] Enable Content and Section Translation for Persian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/775829 (https://phabricator.wikimedia.org/T296475) (owner: KartikMistry) [07:02:37] thanks!
I'll deploy those after kart_ is done [07:02:45] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:03:12] (Merged) jenkins-bot: Enable Content and Section Translation for Persian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/775829 (https://phabricator.wikimedia.org/T296475) (owner: KartikMistry) [07:03:16] (CR) Kosta Harlan: GrowthExperiments: Add mailing list question for eswiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) (owner: Kosta Harlan) [07:03:26] (CR) Elukey: [C: +1] Use *.k8s-staging.discovery.wmnet for staging certificates [deployment-charts] - https://gerrit.wikimedia.org/r/776162 (https://phabricator.wikimedia.org/T300740) (owner: JMeybohm) [07:03:39] ty! [07:04:49] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:05:03] (CR) Kosta Harlan: GrowthExperiments: Add mailing list question for eswiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) (owner: Kosta Harlan) [07:06:50] (PS8) Kosta Harlan: GrowthExperiments: Add mailing list question for eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) [07:06:52] (PS2) Kosta Harlan: GrowthExperiments: Start mailing list campaign on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/775951 (https://phabricator.wikimedia.org/T303240) [07:06:59] (CR) Kosta Harlan: GrowthExperiments: Add mailing list question for eswiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) (owner: Kosta Harlan) [07:07:06] (CR) Elukey: [C: +1] Use *.k8s-staging.discovery.wmnet for staging
Ingress [deployment-charts] - https://gerrit.wikimedia.org/r/776163 (https://phabricator.wikimedia.org/T300740) (owner: JMeybohm) [07:07:11] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 47967 bytes in 6.459 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:07:16] (PS1) Muehlenhoff: Remove LDAP access for jrobell [puppet] - https://gerrit.wikimedia.org/r/776678 [07:07:51] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:08:25] (CR) Muehlenhoff: [C: +2] Remove LDAP access for jrobell [puppet] - https://gerrit.wikimedia.org/r/776678 (owner: Muehlenhoff) [07:08:33] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:775829|Enable Content and Section Translation for Persian Wikipedia (T296475)]] (duration: 00m 51s) [07:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:36] T296475: Enable Content and Section Translation for Persian Wikipedia - https://phabricator.wikimedia.org/T296475 [07:08:52] taavi: I'm done. [07:09:09] thanks!
[07:10:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:36] (PS3) Majavah: Revert "fawiki: Set new year celebration" [mediawiki-config] - https://gerrit.wikimedia.org/r/776329 (https://phabricator.wikimedia.org/T304314) (owner: RhinosF1) [07:10:37] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:10:41] (CR) Majavah: [C: +2] Revert "fawiki: Set new year celebration" [mediawiki-config] - https://gerrit.wikimedia.org/r/776329 (https://phabricator.wikimedia.org/T304314) (owner: RhinosF1) [07:10:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:25] (Merged) jenkins-bot: Revert "fawiki: Set new year celebration" [mediawiki-config] - https://gerrit.wikimedia.org/r/776329 (https://phabricator.wikimedia.org/T304314) (owner: RhinosF1) [07:11:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:04] RhinosF1: can you test the first one on mwdebug1001 please?
[07:12:24] taavi: lgtm [07:12:34] throttle is noop so can't be tested [07:12:52] fawiki old vector has gone [07:13:02] back to old [07:13:05] ok, syncing [07:13:32] (PS3) Majavah: Revert "fawiki: Set celebration logo for new vector" [mediawiki-config] - https://gerrit.wikimedia.org/r/776330 (https://phabricator.wikimedia.org/T304314) (owner: RhinosF1) [07:13:51] !log taavi@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:776329|Revert "fawiki: Set new year celebration" (T304314)]] (duration: 00m 51s) [07:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:53] T304314: Requesting temporary logo change for fa.wikipedia.org - https://phabricator.wikimedia.org/T304314 [07:14:37] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:14:47] !log taavi@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:776329|Revert "fawiki: Set new year celebration" (T304314)]] (duration: 00m 50s) [07:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:20] (CR) Majavah: [C: +2] Revert "fawiki: Set celebration logo for new vector" [mediawiki-config] - https://gerrit.wikimedia.org/r/776330 (https://phabricator.wikimedia.org/T304314) (owner: RhinosF1) [07:15:41] !log taavi@deploy1002 Synchronized static/images/project-logos: Config: [[gerrit:776329|Revert "fawiki: Set new year celebration" (T304314)]] (duration: 00m 50s) [07:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:04] (Merged) jenkins-bot: Revert "fawiki: Set celebration logo for new vector" [mediawiki-config] - https://gerrit.wikimedia.org/r/776330 (https://phabricator.wikimedia.org/T304314) (owner: RhinosF1) [07:16:32] (PS9) Kosta Harlan: GrowthExperiments: Add mailing list question for eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/773204
(https://phabricator.wikimedia.org/T303240) [07:16:34] (PS3) Kosta Harlan: GrowthExperiments: Start mailing list campaign on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/775951 (https://phabricator.wikimedia.org/T303240) [07:16:35] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:16:48] RhinosF1: second one available for testing too [07:16:53] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:17:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:05] taavi: lgtm [07:18:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:04] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:776330|Revert "fawiki: Set celebration logo for new vector" (T304314)]] (duration: 00m 50s) [07:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:33] (PS1) JMeybohm: Include latest ingress helper update into miscweb [deployment-charts] - https://gerrit.wikimedia.org/r/776720 [07:18:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:14] !log taavi@deploy1002 Synchronized static/images/mobile/copyright/: Config: [[gerrit:776330|Revert "fawiki: Set celebration logo for new vector"
(T304314)]] (duration: 00m 49s) [07:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:16] (PS3) Majavah: throttle: removed expired rule [mediawiki-config] - https://gerrit.wikimedia.org/r/776332 (https://phabricator.wikimedia.org/T304836) (owner: RhinosF1) [07:19:16] T304314: Requesting temporary logo change for fa.wikipedia.org - https://phabricator.wikimedia.org/T304314 [07:19:31] (CR) Majavah: [C: +2] throttle: removed expired rule [mediawiki-config] - https://gerrit.wikimedia.org/r/776332 (https://phabricator.wikimedia.org/T304836) (owner: RhinosF1) [07:19:37] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:19:52] (CR) JMeybohm: [C: +2] Use *.k8s-staging.discovery.wmnet for staging certificates [deployment-charts] - https://gerrit.wikimedia.org/r/776162 (https://phabricator.wikimedia.org/T300740) (owner: JMeybohm) [07:20:03] (CR) JMeybohm: [C: +2] Use *.k8s-staging.discovery.wmnet for staging Ingress [deployment-charts] - https://gerrit.wikimedia.org/r/776163 (https://phabricator.wikimedia.org/T300740) (owner: JMeybohm) [07:20:12] (Merged) jenkins-bot: throttle: removed expired rule [mediawiki-config] - https://gerrit.wikimedia.org/r/776332 (https://phabricator.wikimedia.org/T304836) (owner: RhinosF1) [07:21:03] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:21:29] !log taavi@deploy1002 Synchronized wmf-config/throttle.php: Config: [[gerrit:776332|throttle: removed expired rule (T304836)]] (duration: 00m 49s) [07:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:31] T304836: IP throttle lift request for Czech Wikigap 2022 in Brno - https://phabricator.wikimedia.org/T304836 [07:21:46] ok, that should be all unless someone has a last-minute patch
[07:22:05] thanks taavi [07:22:45] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:23:40] !log UTC morning deployments done [07:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:16] (Merged) jenkins-bot: Use *.k8s-staging.discovery.wmnet for staging certificates [deployment-charts] - https://gerrit.wikimedia.org/r/776162 (https://phabricator.wikimedia.org/T300740) (owner: JMeybohm) [07:24:18] (Merged) jenkins-bot: Use *.k8s-staging.discovery.wmnet for staging Ingress [deployment-charts] - https://gerrit.wikimedia.org/r/776163 (https://phabricator.wikimedia.org/T300740) (owner: JMeybohm) [07:26:17] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:28:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:14] SRE, Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (JMeybohm) [07:30:19] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:30:47] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:32:01] (CR) JMeybohm: [C: +2] Include latest ingress helper update
into miscweb [deployment-charts] - https://gerrit.wikimedia.org/r/776720 (owner: JMeybohm) [07:32:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:55] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:34:49] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:35:17] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:35:57] (Merged) jenkins-bot: Include latest ingress helper update into miscweb [deployment-charts] - https://gerrit.wikimedia.org/r/776720 (owner: JMeybohm) [07:38:21] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [07:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:03] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:39:04] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [07:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:11] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [07:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:41] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[07:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:47] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:40:41] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:41:17] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:41:33] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:41:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [07:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [07:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [07:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [07:42:01] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:46] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [07:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:02] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[07:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:15] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [07:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:26] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [07:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:46] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [07:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:21] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [07:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:05] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:46:18] (PS2) Hashar: ci: docker system prune on ci::master [puppet] - https://gerrit.wikimedia.org/r/773784 [07:46:22] (CR) Hashar: "Typo fixed!"
[puppet] - https://gerrit.wikimedia.org/r/773784 (owner: Hashar) [07:47:13] (CR) Hashar: "That follows "docker: move pruning to new profile docker::prune" https://gerrit.wikimedia.org/r/c/operations/puppet/+/773641/" [puppet] - https://gerrit.wikimedia.org/r/773784 (owner: Hashar) [07:49:01] (PS3) Giuseppe Lavagetto: cache::base: add check to netpmapper modification [puppet] - https://gerrit.wikimedia.org/r/773451 (https://phabricator.wikimedia.org/T302471) [07:49:08] (PS1) Volans: interactive: catch Ctrl+c / Ctrl+d on ask_input() [software/pywmflib] - https://gerrit.wikimedia.org/r/776852 [07:49:10] (PS1) Volans: prometheus: add support for other instances [software/pywmflib] - https://gerrit.wikimedia.org/r/776853 [07:49:12] (PS1) Volans: prometheus: add support for Thanos [software/pywmflib] - https://gerrit.wikimedia.org/r/776854 [07:49:38] (CR) Giuseppe Lavagetto: cache::base: add check to netpmapper modification (1 comment) [puppet] - https://gerrit.wikimedia.org/r/773451 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto) [07:49:43] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:54:23] !log imported scap 4.6.0 to stretch-/buster-/bullseye-wikimedia - T305250 [07:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:25] T305250: Deploy Scap version 4.6.0 - https://phabricator.wikimedia.org/T305250 [07:57:05] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:57:17] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:59:31] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:00:59]
(KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:01:33] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:01:46] !log jayme@deploy1002 Started deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) [08:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:00] !log jayme@deploy1002 Finished deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) (duration: 00m 14s) [08:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:26] (PS1) Giuseppe Lavagetto: Add log command [software/conftool] - https://gerrit.wikimedia.org/r/776855 [08:04:20] (CR) jerkins-bot: [V: -1] Add log command [software/conftool] - https://gerrit.wikimedia.org/r/776855 (owner: Giuseppe Lavagetto) [08:05:25] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:12:51] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs.
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:14:29] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:15:18] I downtimed the singtel alerts for 2 more days [08:15:36] they say the circuit is fixed, icinga disagree [08:16:12] (PS1) Elukey: Add a namespace selector to helmfile_istio-proxy's config [deployment-charts] - https://gerrit.wikimedia.org/r/776856 (https://phabricator.wikimedia.org/T297612) [08:18:15] (PS1) DCausse: wdqs: tune jvmquake settings (take 2) [puppet] - https://gerrit.wikimedia.org/r/776857 (https://phabricator.wikimedia.org/T293862) [08:18:48] (CR) jerkins-bot: [V: -1] wdqs: tune jvmquake settings (take 2) [puppet] - https://gerrit.wikimedia.org/r/776857 (https://phabricator.wikimedia.org/T293862) (owner: DCausse) [08:19:06] !log depool cp5003 for reimage - T290005 [08:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:10] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [08:19:31] jouncebot: nowandnext [08:19:32] No deployments scheduled for the next 4 hour(s) and 40 minute(s) [08:19:32] In 4 hour(s) and 40 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220404T1300) [08:19:35] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:19:44] (PS1) Urbanecm: Revert "cswiki: Add celebration logo for 500k" [mediawiki-config] - https://gerrit.wikimedia.org/r/776858 [08:19:54] (CR) Urbanecm: [C: +2] Revert "cswiki: Add celebration logo for 500k" [mediawiki-config] - https://gerrit.wikimedia.org/r/776858 (owner: Urbanecm) [08:20:26] (CR) jerkins-bot: [V: -1] Add a namespace selector to helmfile_istio-proxy's config
[deployment-charts] - https://gerrit.wikimedia.org/r/776856 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) [08:20:28] (PS2) DCausse: wdqs: tune jvmquake settings (take 2) [puppet] - https://gerrit.wikimedia.org/r/776857 (https://phabricator.wikimedia.org/T293862) [08:20:35] (Merged) jenkins-bot: Revert "cswiki: Add celebration logo for 500k" [mediawiki-config] - https://gerrit.wikimedia.org/r/776858 (owner: Urbanecm) [08:21:05] (PS2) MMandere: site: Reimage cp5003 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/775327 (https://phabricator.wikimedia.org/T290005) [08:23:23] (CR) Ayounsi: [C: +1] "Can't review the unit test but lgtm otherwise." [software/pywmflib] - https://gerrit.wikimedia.org/r/776852 (owner: Volans) [08:23:32] (PS2) Elukey: Add a namespace selector to helmfile_istio-proxy's config [deployment-charts] - https://gerrit.wikimedia.org/r/776856 (https://phabricator.wikimedia.org/T297612) [08:23:55] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: 158e0ce: Revert "cswiki: Add celebration logo for 500k" (1/3) (duration: 00m 51s) [08:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:59] (CR) MMandere: [C: +2] site: Reimage cp5003 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/775327 (https://phabricator.wikimedia.org/T290005) (owner: MMandere) [08:24:17] (CR) Vgutierrez: [C: +1] cache::base: add check to netpmapper modification [puppet] - https://gerrit.wikimedia.org/r/773451 (https://phabricator.wikimedia.org/T302471) (owner: Giuseppe Lavagetto) [08:24:32] (CR) Vgutierrez: [C: +1] "typos aside, LGTM" [puppet] - https://gerrit.wikimedia.org/r/773454 (owner: Giuseppe Lavagetto) [08:24:45] (CR) Vgutierrez: [C: +1] varnish::frontend: remove normalization for parameter [puppet] - https://gerrit.wikimedia.org/r/773455 (owner: Giuseppe Lavagetto) [08:24:46] !log urbanecm@deploy1002
Synchronized static/images/project-logos/: 158e0ce: Revert "cswiki: Add celebration logo for 500k" (2/3) (duration: 00m 50s) [08:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:36] !log urbanecm@deploy1002 Synchronized logos/config.yaml: 158e0ce: Revert "cswiki: Add celebration logo for 500k" (3/3) (duration: 00m 50s) [08:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:50] * urbanecm done [08:27:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:09] (CR) Elukey: [C: +2] Add a namespace selector to helmfile_istio-proxy's config [deployment-charts] - https://gerrit.wikimedia.org/r/776856 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) [08:28:30] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) (owner: Kosta Harlan) [08:28:42] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp5003.eqsin.wmnet with OS buster [08:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [08:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [08:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:27] !log ladsgroup@cumin1001
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24029 and previous config saved to /var/cache/conftool/dbconfig/20220404-083031-ladsgroup.json
[08:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:34] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[08:31:18] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5003.eqsin.wmnet with OS buster
[08:31:30] (PS2) MMandere: site: Reimage cp6008 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/775328 (https://phabricator.wikimedia.org/T290005)
[08:31:36] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:45] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[08:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:49] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[08:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:09] (PS1) Muehlenhoff: Add library hint for zlib [puppet] - https://gerrit.wikimedia.org/r/776860
[08:34:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:28] (PS2) Jakob: Use wgRestAPIAdditionalRouteFiles for WB REST API [mediawiki-config] - https://gerrit.wikimedia.org/r/774901
[08:35:30] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[08:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:04] (CR) Ayounsi: [C: +1] prometheus: add support for other instances (1 comment) [software/pywmflib] - https://gerrit.wikimedia.org/r/776853 (owner: Volans)
[08:36:18] (CR) Muehlenhoff: [C: +2] Add library hint for zlib [puppet] - https://gerrit.wikimedia.org/r/776860 (owner: Muehlenhoff)
[08:37:29] !log depool cp6008 for reimage - T290005
[08:37:29] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[08:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:31] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[08:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:34] !log installing flac security updates
[08:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:39] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[08:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:41] (CR) MMandere: [C: +2] site: Reimage cp6008 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/775328 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[08:41:49] (CR) Volans: "replied to comment" [software/pywmflib] - https://gerrit.wikimedia.org/r/776853 (owner: Volans)
[08:42:05] (CR) Filippo Giunchedi: [C: +2] sre: add ProbeDown paging alert for enabled services [alerts] - https://gerrit.wikimedia.org/r/773747 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[08:42:10] (PS3) Filippo Giunchedi: sre: add ProbeDown paging alert for enabled services [alerts] - https://gerrit.wikimedia.org/r/773747 (https://phabricator.wikimedia.org/T291946)
[08:42:14] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6008.drmrs.wmnet with OS buster
[08:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:21] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6008.drmrs.wmnet with OS buster
[08:43:14] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync'
command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[08:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:37] (CR) Volans: [C: +2] interactive: catch Ctrl+c / Ctrl+d on ask_input() [software/pywmflib] - https://gerrit.wikimedia.org/r/776852 (owner: Volans)
[08:45:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 5%: After reimage', diff saved to https://phabricator.wikimedia.org/P24030 and previous config saved to /var/cache/conftool/dbconfig/20220404-084523-root.json
[08:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:31] (CR) Marostegui: [C: +2] Revert "db1130: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/776335 (owner: Marostegui)
[08:47:05] (Merged) jenkins-bot: interactive: catch Ctrl+c / Ctrl+d on ask_input() [software/pywmflib] - https://gerrit.wikimedia.org/r/776852 (owner: Volans)
[08:55:41] (CR) David Caro: [C: +2] wmcs-backups: exclude integration-castor04, that vm has no disk image [puppet] - https://gerrit.wikimedia.org/r/774854 (https://phabricator.wikimedia.org/T304916) (owner: David Caro)
[08:55:45] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5003.eqsin.wmnet with reason: host reimage
[08:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:06] (CR) Filippo Giunchedi: [C: +1] "LGTM!"
[software/pywmflib] - https://gerrit.wikimedia.org/r/776853 (owner: Volans)
[08:56:22] !log installing glibc updates from buster 10.12 point release
[08:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:19] (CR) Jcrespo: [C: +2] dbbackups: Start backing up orchestrator & rename section db_inventory [puppet] - https://gerrit.wikimedia.org/r/776169 (https://phabricator.wikimedia.org/T301315) (owner: Jcrespo)
[08:58:23] (CR) Filippo Giunchedi: [C: +1] "LGTM, nice!" [software/pywmflib] - https://gerrit.wikimedia.org/r/776854 (owner: Volans)
[08:59:16] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5003.eqsin.wmnet with reason: host reimage
[08:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:51] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6008.drmrs.wmnet with reason: host reimage
[08:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:35] (CR) Jcrespo: "I will create a guide at https://wikitech.wikimedia.org/wiki/SRE/Data_Persistence/Backups/User_guides to document the procedure, as this i" [puppet] - https://gerrit.wikimedia.org/r/776169 (https://phabricator.wikimedia.org/T301315) (owner: Jcrespo)
[09:01:37] (PS2) Volans: prometheus: add support for other instances [software/pywmflib] - https://gerrit.wikimedia.org/r/776853
[09:01:39] (PS2) Volans: prometheus: add support for Thanos [software/pywmflib] - https://gerrit.wikimedia.org/r/776854
[09:01:56] (CR) Volans: "addressed comment" [software/pywmflib] - https://gerrit.wikimedia.org/r/776853 (owner: Volans)
[09:03:20] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6008.drmrs.wmnet with reason: host reimage
[09:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:56] 7
[09:11:17] (CR)
Filippo Giunchedi: [C: +1] prometheus: add support for other instances [software/pywmflib] - https://gerrit.wikimedia.org/r/776853 (owner: Volans)
[09:11:28] (CR) Filippo Giunchedi: [C: +1] prometheus: add support for Thanos [software/pywmflib] - https://gerrit.wikimedia.org/r/776854 (owner: Volans)
[09:11:46] (CR) Volans: [C: +2] prometheus: add support for other instances [software/pywmflib] - https://gerrit.wikimedia.org/r/776853 (owner: Volans)
[09:12:04] !log installing openssl updates from Buster 10.12 point release
[09:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:22] (JobUnavailable) firing: (5) Reduced availability for job cache_haproxy_tls in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:15:26] (Merged) jenkins-bot: prometheus: add support for other instances [software/pywmflib] - https://gerrit.wikimedia.org/r/776853 (owner: Volans)
[09:16:00] !log btullis@cumin1001 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons.
[09:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:54] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-uk.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:41] (CR) Volans: [C: +2] prometheus: add support for Thanos [software/pywmflib] - https://gerrit.wikimedia.org/r/776854 (owner: Volans)
[09:17:50] jelto: re: jobunavailable above, I see gitlab and gitlab-runner failing, known/expected ?
[09:20:18] (Merged) jenkins-bot: prometheus: add support for Thanos [software/pywmflib] - https://gerrit.wikimedia.org/r/776854 (owner: Volans)
[09:20:22] (JobUnavailable) firing: (5) Reduced availability for job cache_haproxy_tls in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:23:08] (PS1) Ayounsi: Network report: warning only for "no-mon" interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/776863
[09:23:19] godog: yes "expected". It's only about the two GitLab Runners which are not used publicly. Fix is in review https://gerrit.wikimedia.org/r/c/operations/puppet/+/775821
[09:24:02] (CR) Kosta Harlan: GrowthExperiments: Add mailing list question for eswiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) (owner: Kosta Harlan)
[09:24:17] (CR) jerkins-bot: [V: -1] Network report: warning only for "no-mon" interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/776863 (owner: Ayounsi)
[09:24:27] (PS2) Ayounsi: Network report: warning only for "no-mon" interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/776863
[09:25:33] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5003.eqsin.wmnet with OS buster
[09:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:41] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp5003.eqsin.wmnet with OS buster com...
[09:26:27] !log btullis@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons.
[09:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:47] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6008.drmrs.wmnet with OS buster
[09:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:56] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6008.drmrs.wmnet with OS buster com...
[09:26:57] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-presto1001.eqiad.wmnet
[09:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:27] (PS3) Ayounsi: Network report: warning only for "no-mon" interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/776863
[09:28:10] (PS4) Ayounsi: Network report: warning only for "no-mon" interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/776863
[09:29:06] jelto: ack, thanks
[09:29:06] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-presto1001.eqiad.wmnet
[09:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:56] (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/776863 (owner: Ayounsi)
[09:30:11] (CR) Ayounsi: [C: +2] Network report: warning only for "no-mon" interfaces [software/netbox-extras] - https://gerrit.wikimedia.org/r/776863 (owner: Ayounsi)
[09:30:49] (Merged) jenkins-bot: Network report: warning only for "no-mon" interfaces [software/netbox-extras] -
https://gerrit.wikimedia.org/r/776863 (owner: Ayounsi)
[09:31:43] !log rolling restart of FPM/Apache on mw canaries to pick up updated zlib/glibc/openssl/libxml
[09:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:27] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Peter) Hi @vgutierrez the performance team continuously runs synthetic tests where we test the performance of a couple of Wikipedia p...
[09:35:00] (PS1) MMandere: site: Reimage cp3054 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/776865 (https://phabricator.wikimedia.org/T290005)
[09:35:02] (PS1) MMandere: site: Reimage cp4028 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/776866 (https://phabricator.wikimedia.org/T290005)
[09:35:04] (PS1) MMandere: site: Reimage cp3055 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/776867 (https://phabricator.wikimedia.org/T290005)
[09:35:06] (PS1) MMandere: site: Reimage cp4022 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/776868 (https://phabricator.wikimedia.org/T290005)
[09:35:08] (PS1) MMandere: site: Reimage cp5008 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/776869 (https://phabricator.wikimedia.org/T290005)
[09:35:10] (PS1) MMandere: site: Reimage cp6015 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/776870 (https://phabricator.wikimedia.org/T290005)
[09:35:12] (PS1) MMandere: site: Reimage cp5002 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/776871 (https://phabricator.wikimedia.org/T290005)
[09:35:15] (PS1) MMandere: site: Reimage cp6007 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/776872 (https://phabricator.wikimedia.org/T290005)
[09:40:54] !log ladsgroup@cumin1001 dbctl commit (dc=all):
'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24031 and previous config saved to /var/cache/conftool/dbconfig/20220404-094053-ladsgroup.json
[09:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:57] SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (Vgutierrez) >>! In T290005#7828034, @Peter wrote: > Hi @vgutierrez the performance team continuously runs synthetic tests where we te...
[09:40:57] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[09:43:21] SRE, Analytics-Radar, observability: Set up cross DC topic mirroring for Kafka logging clusters - https://phabricator.wikimedia.org/T276972 (fgiunchedi) >>! In T276972#7824672, @Ottomata wrote: > In https://phabricator.wikimedia.org/T304373#7823916 @fgiunchedi wrote >> to clarify my position on T2769...
[09:44:51] !log pool cp5003 with HAProxy as TLS termination layer - T290005
[09:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:54] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[09:45:06] (CR) Silvan Heintze: [C: +1] Use wgRestAPIAdditionalRouteFiles for WB REST API [mediawiki-config] - https://gerrit.wikimedia.org/r/774901 (owner: Jakob)
[09:47:12] !log installing zlib security updates
[09:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:24] (CR) Ollie Shotton: [C: +1] Use wgRestAPIAdditionalRouteFiles for WB REST API [mediawiki-config] - https://gerrit.wikimedia.org/r/774901 (owner: Jakob)
[09:48:28] (PS2) Giuseppe Lavagetto: Add log command [software/conftool] - https://gerrit.wikimedia.org/r/776855
[09:48:32] (PS1) Giuseppe Lavagetto: Version bump [software/conftool] - https://gerrit.wikimedia.org/r/776875
[09:48:56] !log jelto@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM gitlab1001.wikimedia.org
[09:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:12] (PS1) Elukey: Change the Calico's pod IP subnet for ml-serve-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/776876 (https://phabricator.wikimedia.org/T304673)
[09:49:14] (PS1) Elukey: Change the Calico's pod IP subnet for ml-serve-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/776877 (https://phabricator.wikimedia.org/T304673)
[09:49:37] (CR) Btullis: [C: +2] Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[09:50:11] (CR) jerkins-bot: [V: -1] Version bump [software/conftool] - https://gerrit.wikimedia.org/r/776875 (owner: Giuseppe Lavagetto)
[09:50:44] (CR) Btullis: [C: +2] Add helm charts and a
helmfile configuration for datahub (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[09:50:48] (CR) jerkins-bot: [V: -1] Add log command [software/conftool] - https://gerrit.wikimedia.org/r/776855 (owner: Giuseppe Lavagetto)
[09:50:52] (CR) Btullis: [V: +2 C: +2] Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[09:51:20] !log pool cp6008 with HAProxy as TLS termination layer - T290005
[09:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:24] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[09:52:09] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-druid1005.eqiad.wmnet
[09:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:44] PROBLEM - Host gitlab.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[09:54:00] (PS1) Alexandros Kosiaris: Split watchrat URLs by need of proxy usage [puppet] - https://gerrit.wikimedia.org/r/776878 (https://phabricator.wikimedia.org/T303803)
[09:54:06] RECOVERY - Host gitlab.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[09:54:57] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[09:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:22] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:55:46] !log jelto@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM
gitlab1001.wikimedia.org
[09:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24032 and previous config saved to /var/cache/conftool/dbconfig/20220404-095558-ladsgroup.json
[09:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:01] (PS1) Elukey: role::ml_k8s::master: change the svc eqiad IP subnet [puppet] - https://gerrit.wikimedia.org/r/776879 (https://phabricator.wikimedia.org/T304673)
[09:57:03] (PS1) Elukey: role::ml_k8s::master: change the codfw svc IP range [puppet] - https://gerrit.wikimedia.org/r/776880 (https://phabricator.wikimedia.org/T304673)
[09:58:21] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-druid1005.eqiad.wmnet
[09:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:22] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:00:56] ^gitlab alerts expected due to maintenance
[10:02:15] (PS1) Volans: CHANGELOG: add changelogs for release v1.2.0 [software/pywmflib] - https://gerrit.wikimedia.org/r/776883
[10:02:26] (CR) Vgutierrez: [C: +1] site: Reimage cp3054 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/776865 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:03:19] (CR) Vgutierrez: [C: +1] site: Reimage cp4028 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/776866 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:03:50] (CR) Vgutierrez: [C: +1] site: Reimage cp3055 as cache::upload_haproxy [puppet] -
https://gerrit.wikimedia.org/r/776867 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:04:27] (CR) Vgutierrez: [C: +1] site: Reimage cp4022 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/776868 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:05:45] (CR) Vgutierrez: [C: +1] site: Reimage cp5008 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/776869 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:06:10] (CR) Vgutierrez: [C: +1] site: Reimage cp6015 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/776870 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:06:39] (CR) Vgutierrez: [C: +1] site: Reimage cp5002 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/776871 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:07:08] (CR) Vgutierrez: [C: +1] site: Reimage cp6007 as cache::upload_haproxy [puppet] - https://gerrit.wikimedia.org/r/776872 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[10:07:21] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v1.2.0 [software/pywmflib] - https://gerrit.wikimedia.org/r/776883 (owner: Volans)
[10:08:44] !log installing icu bugfix updates from buster 10.12 point release
[10:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:55] (CR) Klausman: [C: +1] Change the Calico's pod IP subnet for ml-serve-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/776877 (https://phabricator.wikimedia.org/T304673) (owner: Elukey)
[10:09:57] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-druid1004.eqiad.wmnet
[10:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:22] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v1.2.0 [software/pywmflib] - https://gerrit.wikimedia.org/r/776883
(owner: Volans)
[10:10:30] (PS1) Ayounsi: PuppetDB report: more explicit error messages [software/netbox-extras] - https://gerrit.wikimedia.org/r/776888
[10:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:10:45] (CR) Klausman: [C: +1] Change the Calico's pod IP subnet for ml-serve-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/776876 (https://phabricator.wikimedia.org/T304673) (owner: Elukey)
[10:11:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P24033 and previous config saved to /var/cache/conftool/dbconfig/20220404-101104-ladsgroup.json
[10:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:08] (CR) Klausman: [C: +1] role::ml_k8s::master: change the svc eqiad IP subnet [puppet] - https://gerrit.wikimedia.org/r/776879 (https://phabricator.wikimedia.org/T304673) (owner: Elukey)
[10:11:27] (CR) Klausman: [C: +1] role::ml_k8s::master: change the codfw svc IP range [puppet] - https://gerrit.wikimedia.org/r/776880 (https://phabricator.wikimedia.org/T304673) (owner: Elukey)
[10:12:06] (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/776888 (owner: Ayounsi)
[10:12:34] (CR) Ayounsi: [C: +2] PuppetDB report: more explicit error messages [software/netbox-extras] - https://gerrit.wikimedia.org/r/776888 (owner: Ayounsi)
[10:12:38] PROBLEM - Host gitlab.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[10:13:04] PROBLEM - Host gitlab1001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:13:46] (Merged) jenkins-bot: PuppetDB report: more explicit error messages [software/netbox-extras] -
https://gerrit.wikimedia.org/r/776888 (owner: Ayounsi)
[10:14:41] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-druid1004.eqiad.wmnet
[10:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:47] (PS1) Vgutierrez: traffic: Add HAProxyEdgeTrafficDrop [alerts] - https://gerrit.wikimedia.org/r/776890 (https://phabricator.wikimedia.org/T290005)
[10:14:50] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:15:16] RECOVERY - Host gitlab1001 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms
[10:15:22] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:17:25] (PS1) Ayounsi: Fix typo [software/netbox-extras] - https://gerrit.wikimedia.org/r/776891
[10:19:13] (CR) Kosta Harlan: GrowthExperiments: Add mailing list question for eswiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/773204 (https://phabricator.wikimedia.org/T303240) (owner: Kosta Harlan)
[10:19:39] (PS1) Marostegui: switchover-tmpl.sh: Add prerequisites link and calendar invite [software] - https://gerrit.wikimedia.org/r/776892 (https://phabricator.wikimedia.org/T303605)
[10:20:36] PROBLEM - SSH on gitlab1001 is CRITICAL: connect to address 208.80.154.6 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:21:11] (PS1) JMeybohm: Move datahub secrets into the right subchart YAML structure [labs/private] - https://gerrit.wikimedia.org/r/776893
[10:21:18] (PS1) Volans: Upstream release v1.2.0 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/776895
[10:21:34] (PS1) Btullis: Remove
the egress rules from datahub-fronted to mysql [deployment-charts] - https://gerrit.wikimedia.org/r/776896 (https://phabricator.wikimedia.org/T301454)
[10:21:45] (CR) JMeybohm: [V: +2 C: +2] Move datahub secrets into the right subchart YAML structure [labs/private] - https://gerrit.wikimedia.org/r/776893 (owner: JMeybohm)
[10:23:54] (CR) MMandere: [C: +1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/776890 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[10:24:16] SRE, Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (MoritzMuehlenhoff)
[10:25:23] (CR) Volans: [C: +2] Upstream release v1.2.0 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/776895 (owner: Volans)
[10:26:04] !log installing libxml2 security updates
[10:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T298565)', diff saved to https://phabricator.wikimedia.org/P24034 and previous config saved to /var/cache/conftool/dbconfig/20220404-102609-ladsgroup.json
[10:26:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[10:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[10:26:12] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565
[10:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:13] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24035 and previous config saved to /var/cache/conftool/dbconfig/20220404-102616-ladsgroup.json
[10:26:17] (PS2) Btullis: Remove the references to mysql_password from datahub-frontend [deployment-charts] - https://gerrit.wikimedia.org/r/776896 (https://phabricator.wikimedia.org/T301454)
[10:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:28] (PS3) Btullis: Remove the MySQL specific details from datahub-frontend [deployment-charts] - https://gerrit.wikimedia.org/r/776896 (https://phabricator.wikimedia.org/T301454)
[10:28:11] (Merged) jenkins-bot: Upstream release v1.2.0 [software/pywmflib] (debian) - https://gerrit.wikimedia.org/r/776895 (owner: Volans)
[10:29:13] RECOVERY - Host gitlab.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[10:30:22] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:31:54] (CR) Btullis: [C: +2] Remove the MySQL specific details from datahub-frontend [deployment-charts] - https://gerrit.wikimedia.org/r/776896 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[10:32:24] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-druid1003.eqiad.wmnet
[10:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:21] (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/776891 (owner: Ayounsi)
[10:38:12] !log uploaded python3-wmflib_1.2.0 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia
[10:38:16] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:15] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-druid1003.eqiad.wmnet [10:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:37] (03CR) 10Alexandros Kosiaris: [C: 03+1] vrts: rename module files and classes [puppet] - 10https://gerrit.wikimedia.org/r/776237 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [10:40:43] (03CR) 10Ayounsi: [C: 03+2] Fix typo [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/776891 (owner: 10Ayounsi) [10:42:19] RECOVERY - SSH on gitlab1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:42:42] (03Merged) 10jenkins-bot: Fix typo [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/776891 (owner: 10Ayounsi) [10:51:26] (03PS1) 10Amire80: Update feed URL for wikimedia.no blog [puppet] - 10https://gerrit.wikimedia.org/r/776905 [10:52:01] (03PS1) 10Btullis: Increment the chart version and allow version range matching [deployment-charts] - 10https://gerrit.wikimedia.org/r/776906 (https://phabricator.wikimedia.org/T301454) [10:53:09] (03CR) 10Jon Harald Søby: [C: 03+1] Update feed URL for wikimedia.no blog [puppet] - 10https://gerrit.wikimedia.org/r/776905 (owner: 10Amire80) [10:53:48] (03CR) 10Ayounsi: [C: 03+1] "Niiiiiiice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/776878 (https://phabricator.wikimedia.org/T303803) (owner: 10Alexandros Kosiaris) [10:53:49] !log depool cp3054 for reimage - T290005 [10:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:52] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:56:34] (03CR) 10MMandere: [C: 03+2] site: Reimage cp3054 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/776865 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [10:58:19] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:59:46] (03CR) 10JMeybohm: [C: 03+1] Increment the chart version and allow version range matching [deployment-charts] - 10https://gerrit.wikimedia.org/r/776906 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [11:00:29] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:04:58] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp3054.esams.wmnet with OS buster [11:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:08] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp3054.esams.wmnet with OS buster [11:05:20] jouncebot: next [11:05:20] In 1 hour(s) and 54 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220404T1300) [11:07:12] (03CR) 10Btullis: 
[C: 03+2] Increment the chart version and allow version range matching [deployment-charts] - 10https://gerrit.wikimedia.org/r/776906 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [11:07:19] !log installing cups security updates on buster (client side tools/libs) [11:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:01] (03PS1) 10Btullis: Remove the statsv source from the VarnishkafkaNoMessages alert [alerts] - 10https://gerrit.wikimedia.org/r/776912 (https://phabricator.wikimedia.org/T300246) [11:09:15] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [11:09:16] !log jforrester@deploy1002 Started deploy [integration/docroot@63b762d]: Id56cd5bf64ed Adding WikiLambda doc block [11:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:24] !log jforrester@deploy1002 Finished deploy [integration/docroot@63b762d]: Id56cd5bf64ed Adding WikiLambda doc block (duration: 00m 08s) [11:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:43] !log deploying python3-wmflib 1.2.0 fleet-wide [11:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:50] !log depool cp4028 for reimage - T290005 [11:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:52] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:15:25] (03PS1) 10Btullis: Remove an-test-coord* from the Hive JVM heap memory alerts [alerts] - 10https://gerrit.wikimedia.org/r/776919 (https://phabricator.wikimedia.org/T293399) [11:15:27] (03CR) 10MMandere: [C: 03+2] site: Reimage cp4028 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/776866 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [11:18:15] (03PS1) 10Muehlenhoff: Add mdadm processes to 
filter list for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/776929 (https://phabricator.wikimedia.org/T135991) [11:18:24] (03PS2) 10Btullis: Remove test hosts from the JVM heap memory alerts [alerts] - 10https://gerrit.wikimedia.org/r/776919 (https://phabricator.wikimedia.org/T293399) [11:18:42] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:49] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp4028.ulsfo.wmnet with OS buster [11:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:58] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4028.ulsfo.wmnet with OS buster [11:23:27] (03PS3) 10Giuseppe Lavagetto: Add log command [software/conftool] - 10https://gerrit.wikimedia.org/r/776855 [11:23:29] (03PS2) 10Giuseppe Lavagetto: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/776875 [11:25:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24036 and previous config saved to /var/cache/conftool/dbconfig/20220404-112506-ladsgroup.json [11:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:10] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:25:37] (03CR) 10jerkins-bot: [V: 04-1] Version bump [software/conftool] - 
10https://gerrit.wikimedia.org/r/776875 (owner: 10Giuseppe Lavagetto) [11:25:46] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [11:25:53] (03CR) 10jerkins-bot: [V: 04-1] Add log command [software/conftool] - 10https://gerrit.wikimedia.org/r/776855 (owner: 10Giuseppe Lavagetto) [11:27:39] !log installing jbig2dec security updates [11:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:21] PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:33:39] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3054.esams.wmnet with reason: host reimage [11:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:59] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [11:34:18] !log installing zziplib security updates [11:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:09] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3054.esams.wmnet with reason: host reimage [11:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:34] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4028.ulsfo.wmnet with reason: host reimage [11:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:35] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [11:39:44] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [11:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:12] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P24037 and previous config saved to /var/cache/conftool/dbconfig/20220404-114011-ladsgroup.json [11:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:59] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4028.ulsfo.wmnet with reason: host reimage [11:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:32] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:28] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [11:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:47] (03PS2) 10Jcrespo: check: Read list of valid sections/valid backup jobs from a file [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/776171 (https://phabricator.wikimedia.org/T301315) [11:55:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P24038 and previous config saved to /var/cache/conftool/dbconfig/20220404-115516-ladsgroup.json [11:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:40] (03PS3) 10Jcrespo: check: Read list of valid sections/valid backup jobs from a file [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/776171 (https://phabricator.wikimedia.org/T301315) [12:01:11] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3054.esams.wmnet with OS buster [12:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:14] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:01:19] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp3054.esams.wmnet with OS buster com... [12:01:34] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [12:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:56] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4028.ulsfo.wmnet with OS buster [12:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:05] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4028.ulsfo.wmnet with OS buster com... 
[12:05:14] !log pool cp3054 with HAProxy as TLS termination layer - T290005 [12:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:17] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:10:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T298565)', diff saved to https://phabricator.wikimedia.org/P24039 and previous config saved to /var/cache/conftool/dbconfig/20220404-121022-ladsgroup.json [12:10:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [12:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [12:10:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24040 and previous config saved to /var/cache/conftool/dbconfig/20220404-121030-ladsgroup.json [12:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:09] !log pool cp4028 with HAProxy as TLS termination layer - T290005 [12:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:12] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 
[12:18:21] !log installing expat updates (followups to earlier security fixes, no security impact by itself) [12:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:23] (03PS3) 10JMeybohm: Move miscweb back to state production [puppet] - 10https://gerrit.wikimedia.org/r/774917 (https://phabricator.wikimedia.org/T290966) [12:26:47] !log deleting empty typo topics from kafka main-codfw: codfw.mediawiki.page_delete, codfw.mediawiki.page_move, codfw.mediawiki.page_restore, codfw.mediawiki.revision_create (found while working on T241178) [12:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:23] !log deleting empty typo topics from kafka main-eqiad: eqiad.mediawiki.page-edit (found while working on T241178) [12:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:35] !log depool cp3055 for reimage - T290005 [12:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:37] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:34:02] (03CR) 10MMandere: [C: 03+2] site: Reimage cp3055 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/776867 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [12:35:49] !log removing retention.ms override from eventstreams publicly exposed topics in kafka main-eqiad and main-codfw - T241178 [12:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:27] (03Abandoned) 10Hashar: scap: automatize plugins handling [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/723992 (owner: 10Hashar) [12:36:30] (03Abandoned) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/709975 (owner: 10Hashar) [12:36:38] (03Abandoned) 10Hashar: mwdeploy user is provided by LDAP on WMCS [puppet] - 
10https://gerrit.wikimedia.org/r/699427 (https://phabricator.wikimedia.org/T73480) (owner: 10Hashar) [12:38:43] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp3055.esams.wmnet with OS buster [12:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:52] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp3055.esams.wmnet with OS buster [12:42:29] !log depool cp4022 for reimage - T290005 [12:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:32] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [12:43:35] !log installing gmp security updates [12:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:21] jouncebot: next [12:45:21] In 0 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220404T1300) [12:45:40] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye [12:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:23] (03CR) 10MMandere: [C: 03+2] site: Reimage cp4022 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/776868 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [12:48:53] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp4022.ulsfo.wmnet with OS buster [12:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:03] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4022.ulsfo.wmnet with OS buster [12:49:11] (03CR) 10Jelto: [V: 03+1] "This change is ready for review." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776230 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [12:49:31] (03PS3) 10Func: Add logo variants for zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775416 (https://phabricator.wikimedia.org/T273578) [12:49:58] (03CR) 10Marostegui: [C: 03+2] switchover-tmpl.sh: Add prerequisites link and calendar invite [software] - 10https://gerrit.wikimedia.org/r/776892 (https://phabricator.wikimedia.org/T303605) (owner: 10Marostegui) [12:50:35] (03Abandoned) 10Hashar: gerrit: move CI result table to a tab [puppet] - 10https://gerrit.wikimedia.org/r/756685 (owner: 10Hashar) [12:52:57] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1002.eqiad.wmnet with OS bullseye [12:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb2001.codfw.wmnet [12:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:47] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1003.eqiad.wmnet with OS bullseye [12:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:12] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [12:57:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2001.codfw.wmnet [12:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:01] (03CR) 10Hokwelum: [C: 03+1] "Ariel and I tested on the deployment-prep and it looks good" [dumps] - 10https://gerrit.wikimedia.org/r/767477 
(https://phabricator.wikimedia.org/T138208) (owner: 10Ladsgroup) [13:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220404T1300). [13:00:04] duesen, Lucas_WMDE, and Func: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:21] * urbanecm waves [13:00:52] duesen: do you want to self-deploy? [13:01:00] o/ [13:01:23] hey taavi [13:01:58] Func: hello, are you around? [13:02:03] yes [13:02:30] o/ [13:02:34] Func: okay, let's start with you then :). Can your patch be tested? [13:02:52] Nothing to test, this just wants to make sure we can see the outcome of change 773936 after the next train. [13:03:00] okay [13:03:05] (03CR) 10Urbanecm: [C: 03+2] Add logo variants for zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775416 (https://phabricator.wikimedia.org/T273578) (owner: 10Func) [13:03:39] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye [13:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:42] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt-wdqs1002.eqiad.wmnet with OS bullseye [13:03:44] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt-wdqs1003.eqiad.wmnet with OS bullseye [13:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:51] (03Merged) 10jenkins-bot: Add logo variants for zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775416 (https://phabricator.wikimedia.org/T273578) (owner: 10Func) [13:04:03] urbanecm: i'l like 
to give it a go, yes. I have only done a config deploy once before though [13:04:41] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4022.ulsfo.wmnet with reason: host reimage [13:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:08] duesen: i can guide you :) [13:05:25] i need to finish Func's patch first though [13:05:29] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 7ebad8ffa1826ed3429cd822d388807270cfe341: Add logo variants for zhwiki (T273578) (duration: 00m 51s) [13:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:31] T273578: Wikis with language variants must override logo in Common.css - https://phabricator.wikimedia.org/T273578 [13:05:31] urbanecm: cool! But Func is going first, right? [13:05:44] duesen: correct. i didn't see a re from you, so went with their patch instead :) [13:05:44] (the bot chatter in here is really distracting) [13:05:46] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp3055.esams.wmnet with reason: host reimage [13:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:56] (and I don't see Lucas, so I'll be skipping their patch) [13:06:01] Func: your patch is live now. [13:06:16] ok, thanks! [13:06:29] duesen: it also provides useful info though :)) [13:06:37] duesen: feel free to deploy your patch now [13:06:46] (03CR) 10Elukey: [C: 03+1] "Built all the images locally and verified that the patch was applied correctly." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/775277 (https://phabricator.wikimedia.org/T304092) (owner: 10JMeybohm) [13:06:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:59] https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers is the docs [13:07:45] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4022.ulsfo.wmnet with reason: host reimage [13:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24041 and previous config saved to /var/cache/conftool/dbconfig/20220404-130751-ladsgroup.json [13:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:54] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:08:30] urbanecm: sorry, irccloud doesn't do proper notifications [13:08:35] i'll get started on my patch now [13:08:50] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye [13:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:05] duesen: okay. let me know if i can help. 
[13:09:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:09:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:08] (03CR) 10Daniel Kinzler: [C: 03+2] "deploying now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776164 (https://phabricator.wikimedia.org/T305176) (owner: 10Daniel Kinzler) [13:10:40] urbanecm: I keep forgetting to allocate time for the patch to actually merge. luckily, for config, that should be quick [13:10:47] yup [13:10:49] (03PS1) 10Volans: spicerack: add wmflib.prometheus.Thanos support [software/spicerack] - 10https://gerrit.wikimedia.org/r/776946 [13:10:51] (03Merged) 10jenkins-bot: Always set MW_USE_CONFIG_SCHEMA. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776164 (https://phabricator.wikimedia.org/T305176) (owner: 10Daniel Kinzler) [13:10:56] here you go :) [13:11:07] excellent [13:11:08] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp3055.esams.wmnet with reason: host reimage [13:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:36] urbanecm: git log is clean [13:11:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:06] duesen: is that a question or just a mere statement? [13:12:33] urbanecm: just a statement :) [13:12:39] ok, wasn't sure:) [13:12:40] urbanecm: pulling to mwdebug1001 now [13:12:45] ack [13:13:55] urbanecm: i'm now waiting for the stats to show in grafana. 
Last time, it took a couple of minutes [13:14:03] ack [13:15:30] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34673/console" [puppet] - 10https://gerrit.wikimedia.org/r/776878 (https://phabricator.wikimedia.org/T303803) (owner: 10Alexandros Kosiaris) [13:15:36] there we go, looking good [13:15:51] great [13:16:14] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [13:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:38] urbanecm: so... I just scap sync-file? [13:16:41] yes [13:16:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:59] Amir1: thank you so much for writing deployment-commands! [13:17:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1001.eqiad.wmnet [13:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:58] duesen: ^^ as I said, I'm lazy [13:18:05] !log daniel@deploy1002 Synchronized multiversion/defines.php: Config: [[gerrit:776164|Always set MW_USE_CONFIG_SCHEMA. (T305176)]] (duration: 00m 50s) [13:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:08] T305176: Make loading defaults from the config schema the default - https://phabricator.wikimedia.org/T305176 [13:18:30] urbanecm: so, did I break wikipedia? [13:18:41] it still loads for me :) [13:19:21] and if you did, canaries would've told you, probably :) [13:19:30] duesen: anything else to deploy from you? 
[13:19:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:35] urbanecm: nope, all good, thank you! I just confirmed the stats, all requests seem to be using the new code now [13:20:40] excellent [13:20:49] in that case, we're done, as i still don't see Lucas [13:20:57] !log UTC afternoon B&C window done [13:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P24042 and previous config saved to /var/cache/conftool/dbconfig/20220404-132256-ladsgroup.json [13:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1001.eqiad.wmnet [13:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:58] (03CR) 10Ayounsi: [C: 03+1] spicerack: add wmflib.prometheus.Thanos support [software/spicerack] - 10https://gerrit.wikimedia.org/r/776946 (owner: 10Volans) [13:26:21] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10herron) 05Stalled→03Invalid >>! In T303398#7825617, @Dzahn wrote: > Is it ok if we close this ticket and you just reopen it again once he is back? Go... 
[13:26:27] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [13:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:44] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Access for new Data Platform Dev: Thomas Chin - https://phabricator.wikimedia.org/T305193 (10herron) [13:27:47] (03PS2) 10Herron: admin: add tchin to groups platform-engineering and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/775954 (https://phabricator.wikimedia.org/T305193) [13:29:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast5001.wikimedia.org [13:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:06] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4022.ulsfo.wmnet with OS buster [13:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:14] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4022.ulsfo.wmnet with OS buster com... 
[13:31:31] 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (Kanban): PXE boot failures on cloudvirt-wdqs100[1-3] - https://phabricator.wikimedia.org/T305368 (10Andrew) [13:32:04] (03CR) 10Func: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T298308) (owner: 10Winston Sung) [13:32:32] 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (Kanban): PXE boot failures on cloudvirt-wdqs100[1-3] - https://phabricator.wikimedia.org/T305368 (10Andrew) Here is the last thing I see before a blank screen and then grub: {F35037959} [13:34:00] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp3055.esams.wmnet with OS buster [13:34:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5001.wikimedia.org [13:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:07] RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:34:09] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp3055.esams.wmnet with OS buster com... 
[13:34:51] (03CR) 10Filippo Giunchedi: [C: 03+1] spicerack: add wmflib.prometheus.Thanos support [software/spicerack] - 10https://gerrit.wikimedia.org/r/776946 (owner: 10Volans) [13:35:31] (03CR) 10Volans: [C: 03+2] spicerack: add wmflib.prometheus.Thanos support [software/spicerack] - 10https://gerrit.wikimedia.org/r/776946 (owner: 10Volans) [13:35:50] !log pool cp4022 with HAProxy as TLS termination layer - T290005 [13:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:53] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [13:36:06] (03PS1) 10Ssingh: trafficserver: update Icinga check in check_trafficserver_log_fifo.py [puppet] - 10https://gerrit.wikimedia.org/r/776948 (https://phabricator.wikimedia.org/T305275) [13:36:58] (03PS1) 10Ssingh: haproxy: use Requires= in haproxy-mtail@tls.service [puppet] - 10https://gerrit.wikimedia.org/r/776949 (https://phabricator.wikimedia.org/T305275) [13:37:37] (03CR) 10jerkins-bot: [V: 04-1] trafficserver: update Icinga check in check_trafficserver_log_fifo.py [puppet] - 10https://gerrit.wikimedia.org/r/776948 (https://phabricator.wikimedia.org/T305275) (owner: 10Ssingh) [13:37:41] urbanecm: is there a good way to put a bashrc on the deployment/maintenance/debug hosts? [13:37:55] I mean, i can copy one around, but I was hoping there was a nicer way [13:38:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P24043 and previous config saved to /var/cache/conftool/dbconfig/20220404-133801-ladsgroup.json [13:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:14] duesen: a puppet patch. 
under `modules/admin/files/home/` [13:38:26] whatever you put there will be automatically propagated to all hosts you have access to [13:38:31] (03CR) 10jerkins-bot: [V: 04-1] haproxy: use Requires= in haproxy-mtail@tls.service [puppet] - 10https://gerrit.wikimedia.org/r/776949 (https://phabricator.wikimedia.org/T305275) (owner: 10Ssingh) [13:38:46] duesen: https://github.com/wikimedia/puppet/tree/production/modules/admin/files/home/urbanecm is my dotfiles :) [13:38:56] (you'll need to get a friendly SRE to merge it though) [13:41:10] (03PS25) 10Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T286291) [13:41:29] (03CR) 10Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [13:41:32] duesen: is that what you were looking for? 
[13:41:34] (03PS26) 10Winston Sung: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T286291) [13:41:43] (03PS2) 10Ssingh: haproxy: use Requires= in haproxy-mtail@tls.service [puppet] - 10https://gerrit.wikimedia.org/r/776949 (https://phabricator.wikimedia.org/T305275) [13:42:00] (03PS6) 10Winston Sung: Rearrange zh namespace names and namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776031 [13:42:15] (03PS3) 10Herron: admin: add tchin to groups platform-engineering and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/775954 (https://phabricator.wikimedia.org/T305193) [13:42:37] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add controller_sync_error_count metric [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/775277 (https://phabricator.wikimedia.org/T304092) (owner: 10JMeybohm) [13:43:00] (03Merged) 10jenkins-bot: spicerack: add wmflib.prometheus.Thanos support [software/spicerack] - 10https://gerrit.wikimedia.org/r/776946 (owner: 10Volans) [13:44:08] (03PS2) 10Ssingh: trafficserver: update Icinga check in check_trafficserver_log_fifo.py [puppet] - 10https://gerrit.wikimedia.org/r/776948 (https://phabricator.wikimedia.org/T305275) [13:44:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast5002.wikimedia.org [13:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:28] (03CR) 10Func: Revert "Add zh-hans and zh-hant translation of Module and Module_talk aliases" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747913 (https://phabricator.wikimedia.org/T286291) (owner: 10Winston Sung) [13:44:51] !log pool cp3055 with HAProxy as TLS termination layer - T290005 [13:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:54] T290005: Test haproxy as a WMF's CDN TLS terminator 
with real traffic - https://phabricator.wikimedia.org/T290005 [13:45:15] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34675/console" [puppet] - 10https://gerrit.wikimedia.org/r/776949 (https://phabricator.wikimedia.org/T305275) (owner: 10Ssingh) [13:47:22] (03PS1) 10Btullis: Apply kafka broker templates correctly in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/776950 (https://phabricator.wikimedia.org/T301454) [13:50:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5002.wikimedia.org [13:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T298565)', diff saved to https://phabricator.wikimedia.org/P24044 and previous config saved to /var/cache/conftool/dbconfig/20220404-135307-ladsgroup.json [13:53:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [13:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [13:53:10] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24045 and previous config saved to 
/var/cache/conftool/dbconfig/20220404-135314-ladsgroup.json [13:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:41] (03PS1) 10Volans: CHANGELOG: add changelogs for release v2.4.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/776952 [13:54:10] (03CR) 10Kormat: check: Read list of valid sections/valid backup jobs from a file (031 comment) [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/776171 (https://phabricator.wikimedia.org/T301315) (owner: 10Jcrespo) [13:54:31] (03PS3) 10Ssingh: certspotter: switch to a local CT logs list [puppet] - 10https://gerrit.wikimedia.org/r/776217 (https://phabricator.wikimedia.org/T204993) [13:57:24] (03CR) 10JMeybohm: [C: 03+1] Apply kafka broker templates correctly in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/776950 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [13:57:39] (03CR) 10Btullis: [C: 03+2] Apply kafka broker templates correctly in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/776950 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [13:58:15] !log depool cp5008 for reimage - T290005 [13:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:19] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [13:59:49] (03PS1) 10JMeybohm: Update cert-manager in staging to 1.5.4-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/776953 (https://phabricator.wikimedia.org/T304092) [14:01:37] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye [14:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:04] (03CR) 10Jcrespo: check: Read list of valid sections/valid backup jobs from a file (031 comment) [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/776171 
(https://phabricator.wikimedia.org/T301315) (owner: 10Jcrespo) [14:02:50] (03CR) 10Vgutierrez: [C: 04-2] "this isn't the right approach to fix the issue. This check reads data from a fifo-log-demux instance and returns OK if it's able to do so" [puppet] - 10https://gerrit.wikimedia.org/r/776948 (https://phabricator.wikimedia.org/T305275) (owner: 10Ssingh) [14:02:54] (03PS1) 10Btullis: Define the DATHUB_SECRET value [deployment-charts] - 10https://gerrit.wikimedia.org/r/776954 (https://phabricator.wikimedia.org/T301454) [14:02:55] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:03:18] (03CR) 10Bking: [C: 03+1] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/764830 (https://phabricator.wikimedia.org/T302330) (owner: 10Addshore) [14:05:51] (03PS4) 10Jcrespo: check: Read list of valid sections/valid backup jobs from a file [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/776171 (https://phabricator.wikimedia.org/T301315) [14:06:22] (03CR) 10MMandere: [C: 03+2] site: Reimage cp5008 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/776869 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [14:06:56] (03CR) 10Jcrespo: check: Read list of valid sections/valid backup jobs from a file (031 comment) [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/776171 (https://phabricator.wikimedia.org/T301315) (owner: 10Jcrespo) [14:07:26] (03CR) 10Kormat: [C: 03+1] check: Read list of valid sections/valid backup jobs from a file [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/776171 (https://phabricator.wikimedia.org/T301315) (owner: 10Jcrespo) [14:07:48] (03CR) 10Vgutierrez: [C: 03+1] "Thanks for working on this. 
I've tried to tell systemd that by adding the "After=haproxy-mtail@%i.socket" stanza on the haproxy-mtail@.sys" [puppet] - 10https://gerrit.wikimedia.org/r/776949 (https://phabricator.wikimedia.org/T305275) (owner: 10Ssingh) [14:08:00] (03CR) 10Jcrespo: "Thank you a lot for the review!" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/776171 (https://phabricator.wikimedia.org/T301315) (owner: 10Jcrespo) [14:08:37] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp5008.eqsin.wmnet with OS buster [14:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:46] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp5008.eqsin.wmnet with OS buster [14:08:56] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/775954 (https://phabricator.wikimedia.org/T305193) (owner: 10Herron) [14:10:45] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:11:43] (03PS4) 10Herron: admin: add tchin to groups platform-engineering and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/775954 (https://phabricator.wikimedia.org/T305193) [14:12:41] (03CR) 10JMeybohm: [C: 03+1] Define the DATHUB_SECRET value [deployment-charts] - 10https://gerrit.wikimedia.org/r/776954 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [14:13:06] (03PS3) 10Bking: Revert "Temp remove codfw from wikidata updateQueryServiceLag check" [puppet] - 10https://gerrit.wikimedia.org/r/764830 (https://phabricator.wikimedia.org/T302330) (owner: 10Addshore) 
[14:13:38] (03CR) 10Herron: [C: 03+2] admin: add tchin to groups platform-engineering and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/775954 (https://phabricator.wikimedia.org/T305193) (owner: 10Herron) [14:16:05] !log depool cp6015 for reimage - T290005 [14:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:09] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [14:16:58] (03CR) 10Bking: [C: 03+2] Revert "Temp remove codfw from wikidata updateQueryServiceLag check" [puppet] - 10https://gerrit.wikimedia.org/r/764830 (https://phabricator.wikimedia.org/T302330) (owner: 10Addshore) [14:17:49] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Access for new Data Platform Dev: Thomas Chin - https://phabricator.wikimedia.org/T305193 (10herron) 05Open→03Resolved a:03herron Hi @tchin, the requested access has now been provisioned and will be fully deployed within 30 minutes (as puppet runs compl... 
[14:18:45] (03CR) 10MMandere: [C: 03+2] site: Reimage cp6015 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/776870 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [14:24:39] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6015.drmrs.wmnet with OS buster [14:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:49] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6015.drmrs.wmnet with OS buster [14:26:26] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v2.4.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/776952 (owner: 10Volans) [14:28:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host releases2002.codfw.wmnet [14:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:45] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:30:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host releases2002.codfw.wmnet [14:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:31] (03CR) 10Ssingh: [V: 03+1 C: 03+2] haproxy: use Requires= in haproxy-mtail@tls.service [puppet] - 10https://gerrit.wikimedia.org/r/776949 (https://phabricator.wikimedia.org/T305275) (owner: 10Ssingh) [14:33:26] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5008.eqsin.wmnet with reason: host reimage [14:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:33:43] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v2.4.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/776952 (owner: 10Volans) [14:34:11] (03PS4) 10Giuseppe Lavagetto: Add log command [software/conftool] - 10https://gerrit.wikimedia.org/r/776855 [14:34:13] (03PS3) 10Giuseppe Lavagetto: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/776875 [14:36:27] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add log command [software/conftool] - 10https://gerrit.wikimedia.org/r/776855 (owner: 10Giuseppe Lavagetto) [14:36:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/776875 (owner: 10Giuseppe Lavagetto) [14:36:55] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5008.eqsin.wmnet with reason: host reimage [14:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:18] !log rebooting alert2001 [14:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:25] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (Aitolkyn) - https://phabricator.wikimedia.org/T305299 (10MoritzMuehlenhoff) @diego: We also need the estimated end date of the internship (you'll be contacted two weeks before it expires whether to extend access or not).... [14:37:58] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye [14:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:59] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (paramita_das) - https://phabricator.wikimedia.org/T305298 (10MoritzMuehlenhoff) @diego: We also need the estimated end date of the internship (you'll be contacted two weeks before it expires whether to extend access or not... 
[14:38:08] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (Kanban): PXE boot failures on cloudvirt-wdqs100[1-3] - https://phabricator.wikimedia.org/T305368 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bu... [14:38:11] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (Aitolkyn) - https://phabricator.wikimedia.org/T305299 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:38:17] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (paramita_das) - https://phabricator.wikimedia.org/T305298 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:38:35] (03Merged) 10jenkins-bot: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/776875 (owner: 10Giuseppe Lavagetto) [14:38:48] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (paramita_das) - https://phabricator.wikimedia.org/T305298 (10odimitrijevic) Approved! [14:39:25] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (Aitolkyn) - https://phabricator.wikimedia.org/T305299 (10diego) > @diego: We also need the estimated end date of the internship (you'll be contacted two weeks before it expires whether to extend access or not). Internship... [14:40:01] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:17] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (paramita_das) - https://phabricator.wikimedia.org/T305298 (10diego) >>! In T305298#7828668, @MoritzMuehlenhoff wrote: > @diego: We also need the estimated end date of the internship (you'll be contacted two weeks before it... 
[14:41:09] (03PS1) 10Volans: Upstream release v2.4.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/776957 [14:42:34] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6015.drmrs.wmnet with reason: host reimage [14:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:50] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (10MoritzMuehlenhoff) 05Open→03Stalled I'm setting this task to Stalled until @TheresNoTime or @thcipriani think it's ready to revisit. [14:44:03] (03CR) 10DCausse: [C: 03+1] cirrus: Migrate popularity_score configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775965 (owner: 10Ebernhardson) [14:44:17] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host alert1001.wikimedia.org [14:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:19] !log mmandere@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp6015.drmrs.wmnet with reason: host reimage [14:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:00] (03CR) 10Vgutierrez: [C: 03+1] "LGTM but this adds manual work that should be documented somewhere (wikitech?). Mainly how we should stay up to date regarding available C" [puppet] - 10https://gerrit.wikimedia.org/r/776217 (https://phabricator.wikimedia.org/T204993) (owner: 10Ssingh) [14:45:37] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (10TheresNoTime) Thank you 🙂 I'll see how T305191 goes! 
[14:46:16] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (Kanban): PXE boot failures on cloudvirt-wdqs100[1-3] - https://phabricator.wikimedia.org/T305368 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullse... [14:46:24] (03CR) 10Ssingh: certspotter: switch to a local CT logs list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776217 (https://phabricator.wikimedia.org/T204993) (owner: 10Ssingh) [14:48:33] (03CR) 10Ssingh: [C: 03+2] certspotter: switch to a local CT logs list [puppet] - 10https://gerrit.wikimedia.org/r/776217 (https://phabricator.wikimedia.org/T204993) (owner: 10Ssingh) [14:51:34] (03CR) 10Ssingh: trafficserver: update Icinga check in check_trafficserver_log_fifo.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776948 (https://phabricator.wikimedia.org/T305275) (owner: 10Ssingh) [14:51:59] PROBLEM - Host cp5008 is DOWN: PING CRITICAL - Packet loss = 100% [14:52:09] PROBLEM - Keyholder SSH agent on alert1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. 
https://wikitech.wikimedia.org/wiki/Keyholder [14:53:09] RECOVERY - Host cp5008 is UP: PING OK - Packet loss = 0%, RTA = 224.98 ms [14:53:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24047 and previous config saved to /var/cache/conftool/dbconfig/20220404-145323-ladsgroup.json [14:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:27] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:53:33] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (Aitolkyn) - https://phabricator.wikimedia.org/T305299 (10Ottomata) Approved. [14:53:36] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (paramita_das) - https://phabricator.wikimedia.org/T305298 (10Ottomata) Approved. [14:54:09] RECOVERY - Keyholder SSH agent on alert1001 is OK: OK: Keyholder is armed with all configured keys. 
https://wikitech.wikimedia.org/wiki/Keyholder [14:55:38] !log herron@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host alert1001.wikimedia.org [14:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:36] ^it actually looks fine, alert1001 is a special case where other hosts are checked from it [15:03:28] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5008.eqsin.wmnet with OS buster [15:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:30] (03CR) 10Volans: [C: 03+2] Upstream release v2.4.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/776957 (owner: 10Volans) [15:03:37] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp5008.eqsin.wmnet with OS buster com... [15:05:13] urbanecm: sorry, got distracted. Yea, that looks great! [15:05:41] !log pool cp5008 with HAProxy as TLS termination layer - T290005 [15:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:46] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [15:06:14] is it okay if I deploy a beta config change now? [15:06:33] (I’d added it to the UTC afternoon backport window, but didn’t open my laptop until after the window started, so I missed stashbot’s reminder) [15:06:59] 10SRE, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Dumps-Generation: Image tarball dumps on your.org are not being generated - https://phabricator.wikimedia.org/T53001 (10Mitar) I think all media files should be made available through IPFS. Then it would be easy to host a copy of files, or contr... 
[15:07:25] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6015.drmrs.wmnet with OS buster [15:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:34] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6015.drmrs.wmnet with OS buster com... [15:08:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P24048 and previous config saved to /var/cache/conftool/dbconfig/20220404-150828-ladsgroup.json [15:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:01] (03CR) 10Volans: "question inline" [puppet] - 10https://gerrit.wikimedia.org/r/776948 (https://phabricator.wikimedia.org/T305275) (owner: 10Ssingh) [15:11:07] (03Merged) 10jenkins-bot: Upstream release v2.4.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/776957 (owner: 10Volans) [15:13:56] (03PS2) 10Lucas Werkmeister (WMDE): Use "unexpectedUnconnectedPage" page prop on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774847 [15:14:08] ^ I’ll quickly deploy this unless someone yells at me to stop :) [15:15:07] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Use "unexpectedUnconnectedPage" page prop on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774847 (owner: 10Lucas Werkmeister (WMDE)) [15:15:49] (03Merged) 10jenkins-bot: Use "unexpectedUnconnectedPage" page prop on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774847 (owner: 10Lucas Werkmeister (WMDE)) [15:16:27] ooh, `scap pull` output looks different [15:17:13] !log pool cp6015 with HAProxy as TLS termination layer - T290005 [15:17:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:17] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [15:17:53] (03PS1) 10Volans: Upstream release v2.4.0 (take 2) [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/776961 [15:18:07] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:774847|Use "unexpectedUnconnectedPage" page prop on Beta]] (production no-op) (duration: 00m 50s) [15:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:29] PROBLEM - Check systemd state on releases2002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:32] 10SRE-OnFire, 10Wikidata, 10wdwb-tech, 10Discovery-Search (Current work), and 3 others: Only generate maxlag from pooled query service servers. 
- https://phabricator.wikimedia.org/T238751 (10MPhamWMF) [15:22:53] RECOVERY - Check systemd state on releases2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host releases1002.eqiad.wmnet [15:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P24049 and previous config saved to /var/cache/conftool/dbconfig/20220404-152333-ladsgroup.json [15:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:21] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:24:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:24:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host releases1002.eqiad.wmnet [15:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:40] (03PS4) 10Andrew Bogott: dynamicproxy: cleanup remaining x-novaproxy-edit-dns users [puppet] - 10https://gerrit.wikimedia.org/r/771406 (https://phabricator.wikimedia.org/T295246) (owner: 10Majavah) [15:27:58] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-cinder-volume-backup: use created_at 
rather than modified_at for purging [puppet] - 10https://gerrit.wikimedia.org/r/775997 (owner: 10Andrew Bogott) [15:28:19] !log remove stray debmonitor-server/cumin installs (cleanup of 548425ba5833089e5ad6025890a6db87fbe718b8) [15:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:05] (03CR) 10Andrew Bogott: [C: 03+2] dynamicproxy: cleanup remaining x-novaproxy-edit-dns users [puppet] - 10https://gerrit.wikimedia.org/r/771406 (https://phabricator.wikimedia.org/T295246) (owner: 10Majavah) [15:30:04] jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220404T1530). [15:31:05] (03PS3) 10Ssingh: trafficserver: update Icinga check in check_trafficserver_log_fifo.py [puppet] - 10https://gerrit.wikimedia.org/r/776948 (https://phabricator.wikimedia.org/T305275) [15:31:11] (03CR) 10Volans: [C: 03+2] Upstream release v2.4.0 (take 2) [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/776961 (owner: 10Volans) [15:31:44] (03CR) 10Ssingh: trafficserver: update Icinga check in check_trafficserver_log_fifo.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776948 (https://phabricator.wikimedia.org/T305275) (owner: 10Ssingh) [15:34:09] (03CR) 10Vgutierrez: [C: 03+1] "looking good, just a small nitpick but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/776948 (https://phabricator.wikimedia.org/T305275) (owner: 10Ssingh) [15:34:59] (03PS4) 10Ssingh: trafficserver: update Icinga check in check_trafficserver_log_fifo.py [puppet] - 10https://gerrit.wikimedia.org/r/776948 (https://phabricator.wikimedia.org/T305275) [15:37:50] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: update Icinga check in check_trafficserver_log_fifo.py [puppet] - 10https://gerrit.wikimedia.org/r/776948 (https://phabricator.wikimedia.org/T305275) (owner: 10Ssingh) [15:38:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1134 (T298565)', diff saved to https://phabricator.wikimedia.org/P24050 and previous config saved to /var/cache/conftool/dbconfig/20220404-153839-ladsgroup.json [15:38:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [15:38:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [15:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:46] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24051 and previous config saved to /var/cache/conftool/dbconfig/20220404-153846-ladsgroup.json [15:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:48] (03Merged) 10jenkins-bot: Upstream release v2.4.0 (take 2) [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/776961 (owner: 10Volans) [15:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:12] (03CR) 10Ssingh: [C: 03+2] trafficserver: update Icinga check in check_trafficserver_log_fifo.py [puppet] - 10https://gerrit.wikimedia.org/r/776948 (https://phabricator.wikimedia.org/T305275) (owner: 10Ssingh) [15:44:50] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye [15:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:00] 10SRE, 10ops-eqiad, 10Cloud-VPS, 
10DC-Ops, 10cloud-services-team (Kanban): PXE boot failures on cloudvirt-wdqs100[1-3] - https://phabricator.wikimedia.org/T305368 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bu... [15:50:24] (03PS1) 10Volans: sre.SREBatchBase: allow to customize grace sleep [cookbooks] - 10https://gerrit.wikimedia.org/r/776965 [15:50:26] (03PS1) 10Volans: sre.cdn.roll-restart-varnish: override grace sleep [cookbooks] - 10https://gerrit.wikimedia.org/r/776966 [15:54:26] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1002.eqiad.wmnet with OS bullseye [15:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:24] (03PS2) 10Jcrespo: dbbackups: Monitor db_inventory rather than zarcillo section [puppet] - 10https://gerrit.wikimedia.org/r/776170 (https://phabricator.wikimedia.org/T301315) [15:55:26] (03PS1) 10Jcrespo: dbbackups: Setup a valid_sections.txt config for db backup checks [puppet] - 10https://gerrit.wikimedia.org/r/776969 (https://phabricator.wikimedia.org/T301315) [15:56:58] RECOVERY - Check systemd state on gitlab2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:04] RECOVERY - DPKG on idp-test1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:58:34] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1003.eqiad.wmnet with OS bullseye [15:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:15] !log bblack@cumin1001 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 1 hosts matching query P{cp2027.codfw.wmnet} [16:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:05] !log bblack@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-varnish (exit_code=0) rolling restart of Varnish 
on 1 hosts matching query P{cp2027.codfw.wmnet} [16:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:43] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:05:29] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt-wdqs1002.eqiad.wmnet with reason: host reimage [16:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:34] (03PS2) 10Jcrespo: dbbackups: Setup a valid_sections.txt config for db backup checks [puppet] - 10https://gerrit.wikimedia.org/r/776969 (https://phabricator.wikimedia.org/T301315) [16:08:21] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: host reimage [16:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:51] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt-wdqs1002.eqiad.wmnet with reason: host reimage [16:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:53] !log uploaded spicerack_2.4.0 to apt.wikimedia.org bullseye-wikimedia [16:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:14] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt-wdqs1003.eqiad.wmnet with reason: host reimage [16:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:39] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: host reimage [16:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:24] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on cloudvirt-wdqs1003.eqiad.wmnet with reason: host reimage [16:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:05] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:51] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:11] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt-wdqs1002.eqiad.wmnet with OS bullseye [16:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:04] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:31:49] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye [16:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:57] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (Kanban): PXE boot failures on cloudvirt-wdqs100[1-3] - https://phabricator.wikimedia.org/T305368 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullse... 
[16:34:42] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt-wdqs1003.eqiad.wmnet with OS bullseye [16:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24052 and previous config saved to /var/cache/conftool/dbconfig/20220404-164144-ladsgroup.json [16:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:48] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:50:53] !log mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki "Brand" "Brand/Archive" "Majavah" --reason '[[:phab:T305387]]' # T305387 [16:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:56] T305387: Move the "Brand" portal page on Meta-Wiki - https://phabricator.wikimedia.org/T305387 [16:51:03] (03PS1) 10JMeybohm: Update cert-manager to 1.5.4-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/776971 (https://phabricator.wikimedia.org/T304092) [16:56:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P24053 and previous config saved to /var/cache/conftool/dbconfig/20220404-165649-ladsgroup.json [16:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:02] (03CR) 10Dzahn: [C: 03+2] Update feed URL for wikimedia.no blog [puppet] - 10https://gerrit.wikimedia.org/r/776905 (owner: 10Amire80) [17:00:04] ryankemper: Dear deployers, time to do the Wikidata Query Service weekly deploy deploy. 
Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220404T1700). [17:00:16] (03PS1) 10Ayounsi: uRPF: add DHCP exception [homer/public] - 10https://gerrit.wikimedia.org/r/776973 (https://phabricator.wikimedia.org/T285461) [17:00:41] (03CR) 10JMeybohm: [C: 03+2] Update cert-manager in staging to 1.5.4-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/776953 (https://phabricator.wikimedia.org/T304092) (owner: 10JMeybohm) [17:02:33] (03CR) 10Ayounsi: "From the doc: https://www.juniper.net/documentation/us/en/software/junos/security-services/topics/topic-map/interfaces-configuring-unicast" [homer/public] - 10https://gerrit.wikimedia.org/r/776973 (https://phabricator.wikimedia.org/T285461) (owner: 10Ayounsi) [17:05:22] (03Merged) 10jenkins-bot: Update cert-manager in staging to 1.5.4-3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/776953 (https://phabricator.wikimedia.org/T304092) (owner: 10JMeybohm) [17:06:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1076.eqiad.wmnet [17:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:50] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [17:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:42] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [17:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:37] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [17:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:30] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[17:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P24054 and previous config saved to /var/cache/conftool/dbconfig/20220404-171154-ladsgroup.json [17:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:20] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1076.eqiad.wmnet [17:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:07] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye [17:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:47] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2037.codfw.wmnet [17:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:12] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Management interface SSH icinga alerts - https://phabricator.wikimedia.org/T304289 (10Papaul) |Hostname|Old verssion|New version| |db2083.mgmt| |db2086.mgmt| |db2090.mgmt| |kubernetes2001.mgmt| |ms-fe2008.mgmt| |mw2252.mgmt| |mw2254... [17:20:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [17:20:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10wiki_willy) @Jclark-ctr - just following up Cathal's last comment >>! In T292095#7801403, @cmooney wrote: > @Jclark-ctr I'm not getting any... 
[17:21:10] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (Kanban): PXE boot failures on cloudvirt-wdqs100[1-3] - https://phabricator.wikimedia.org/T305368 (10ayounsi) The bug was introduced with this change: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/775279/ The following one sho... [17:23:58] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2037.codfw.wmnet [17:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:02] (03CR) 10Ayounsi: [C: 03+2] "Tested manually for https://phabricator.wikimedia.org/T305368 and confirmed working as expected." [homer/public] - 10https://gerrit.wikimedia.org/r/776973 (https://phabricator.wikimedia.org/T285461) (owner: 10Ayounsi) [17:24:40] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5001.eqsin.wmnet [17:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:50] (03Merged) 10jenkins-bot: uRPF: add DHCP exception [homer/public] - 10https://gerrit.wikimedia.org/r/776973 (https://phabricator.wikimedia.org/T285461) (owner: 10Ayounsi) [17:25:19] (03PS1) 10RLazarus: httpbb: Delete the git::clone and install via deb package [puppet] - 10https://gerrit.wikimedia.org/r/776977 (https://phabricator.wikimedia.org/T299705) [17:25:41] !log push urpf DHCP exception to all core routers with urpf configured - T285461 [17:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:45] T285461: Review filtering for cloud-hosts on CR routers eqiad - https://phabricator.wikimedia.org/T285461 [17:26:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T298565)', diff saved to https://phabricator.wikimedia.org/P24055 and previous config saved to /var/cache/conftool/dbconfig/20220404-172659-ladsgroup.json [17:27:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with 
reason: Maintenance [17:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [17:27:03] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24056 and previous config saved to /var/cache/conftool/dbconfig/20220404-172707-ladsgroup.json [17:27:09] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: host reimage [17:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:30] (03CR) 10Ahmon Dancy: [C: 03+1] docker: move pruning to new profile docker::prune [puppet] - 10https://gerrit.wikimedia.org/r/773641 (https://phabricator.wikimedia.org/T304644) (owner: 10Razzi) [17:29:53] (03CR) 10Ahmon Dancy: [C: 03+1] ci: docker system prune on ci::master [puppet] - 10https://gerrit.wikimedia.org/r/773784 (owner: 10Hashar) [17:30:33] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: host reimage [17:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:10] (03CR) 10RLazarus: "Note that PCC fails, but I think only because of the private Hiera lookup:" 
[puppet] - 10https://gerrit.wikimedia.org/r/776977 (https://phabricator.wikimedia.org/T299705) (owner: 10RLazarus) [17:34:30] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5001.eqsin.wmnet [17:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:57] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:13] (03CR) 10Dzahn: [C: 03+1] gitlab: move backups to /mnt/gitlab-backup [puppet] - 10https://gerrit.wikimedia.org/r/776230 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [17:52:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye [17:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:19] PROBLEM - SSH on aqs1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:08:06] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host apifeatureusage1001.eqiad.wmnet [18:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:10] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:18:15] PROBLEM - Host mc2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:21:45] PROBLEM - SSH on wtp1025.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:25:19] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
apifeatureusage1001.eqiad.wmnet [18:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:23] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4024.ulsfo.wmnet [18:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:28] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1077.eqiad.wmnet [18:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:09] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host apifeatureusage2001.codfw.wmnet [18:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:29] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2038.codfw.wmnet [18:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:13] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (paramita_das) - https://phabricator.wikimedia.org/T305298 (10KFrancis) Hi all, I am confirming as Paramita Das is a contractor with the WMF, an NDA is already on file. Please proceed with the access request. 
[18:31:15] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:32:24] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4024.ulsfo.wmnet [18:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24057 and previous config saved to /var/cache/conftool/dbconfig/20220404-183227-ladsgroup.json [18:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:31] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:33:20] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (Aitolkyn) - https://phabricator.wikimedia.org/T305299 (10KFrancis) @MoritzMuehlenhoff I am confirming the contractor NDA is already on file. Please proceed with the access request. 
[18:34:13] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2038.codfw.wmnet [18:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:29] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1077.eqiad.wmnet [18:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:53] 10SRE, 10Traffic: Resolve issues with cp hosts and the reboot-single cookbook - https://phabricator.wikimedia.org/T305275 (10ssingh) 05Open→03Resolved We have tested the above two changes with six cp host reboots and there are no concerns, confirming that this issue has been fixed. Thanks to everyone for... [18:36:24] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5004.eqsin.wmnet [18:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:52] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/776982 [18:37:53] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:38:15] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3058.esams.wmnet [18:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:25] (03PS2) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/776982 [18:38:42] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apifeatureusage2001.codfw.wmnet [18:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:28] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4025.ulsfo.wmnet [18:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:44] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5004.eqsin.wmnet [18:45:46] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:25] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4025.ulsfo.wmnet [18:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:55] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1078.eqiad.wmnet [18:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:22] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3058.esams.wmnet [18:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P24058 and previous config saved to /var/cache/conftool/dbconfig/20220404-184733-ladsgroup.json [18:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:37] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, and 2 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (10CBogen) [18:50:10] (03PS3) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/776982 [18:50:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Jclark-ctr) the two junipers are up now. @cmooney [18:51:27] (03CR) 10Krinkle: [C: 03+1] "Thanks. LGTM. Good to go." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/776257 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [18:51:43] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2039.codfw.wmnet [18:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5005.eqsin.wmnet [18:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:38] (03PS4) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/776982 [18:55:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1078.eqiad.wmnet [18:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:16] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/pcc-worker1001/34684/" [puppet] - 10https://gerrit.wikimedia.org/r/776982 (owner: 10Herron) [18:59:17] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2039.codfw.wmnet [18:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:47] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3059.esams.wmnet [19:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:33] 10SRE, 10SRE Observability: apifeatureusage hosts hanging on shutdown - https://phabricator.wikimedia.org/T305403 (10herron) p:05Triage→03Medium [19:02:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P24059 and previous config saved to /var/cache/conftool/dbconfig/20220404-190238-ladsgroup.json [19:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:50] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4026.ulsfo.wmnet [19:02:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:05] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cp5005.eqsin.wmnet [19:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:10] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, 10Service-deployment-requests: Setup Initial Image Suggestion Service CI and k8s params/stubs - https://phabricator.wikimedia.org/T305154 (10CBogen) [19:06:11] PROBLEM - Check systemd state on cp5005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter@frontend.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:06:42] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, and 2 others: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (10CBogen) [19:07:27] RECOVERY - SSH on aqs1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:09:36] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3059.esams.wmnet [19:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:58] (03PS5) 10Herron: logstash: set unit TimeoutStopSec of 2 minutes [puppet] - 10https://gerrit.wikimedia.org/r/776982 (https://phabricator.wikimedia.org/T305403) [19:10:15] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4026.ulsfo.wmnet [19:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:32] (03PS6) 10Herron: logstash: set unit TimeoutStopSec of 2 minutes [puppet] - 10https://gerrit.wikimedia.org/r/776982 (https://phabricator.wikimedia.org/T305403) [19:11:54] (03PS2) 10Volans: sre.SREBatchBase: additional customizations [cookbooks] - 10https://gerrit.wikimedia.org/r/776965 [19:11:56] (03PS2) 10Volans: 
sre.cdn.roll-restart-varnish: improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/776966 [19:16:01] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host centrallog1001.eqiad.wmnet [19:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:48] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5005.eqsin.wmnet,service=varnish-fe [19:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:57] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5005.eqsin.wmnet,service=ats-be [19:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:59] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp5005.eqsin.wmnet,service=ats-tls [19:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T298565)', diff saved to https://phabricator.wikimedia.org/P24060 and previous config saved to /var/cache/conftool/dbconfig/20220404-191743-ladsgroup.json [19:17:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [19:17:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [19:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:47] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24061 and previous config saved to /var/cache/conftool/dbconfig/20220404-191750-ladsgroup.json [19:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:21] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:21:23] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:22:47] RECOVERY - SSH on wtp1025.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:22:48] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog1001.eqiad.wmnet [19:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:31] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:26:37] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:29:09] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1079.eqiad.wmnet [19:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:27] PROBLEM - ensure kvm processes are running on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:33:09] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafkamon2002.codfw.wmnet [19:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:33:41] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 andrew bogott work in progress https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:33:42] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt-wdqs1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 andrew bogott work in progress https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:33:42] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 andrew bogott work in progress https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:35:06] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafkamon2002.codfw.wmnet [19:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:31] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafkamon1002.eqiad.wmnet [19:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:57] RECOVERY - Check systemd state on cp5005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:32] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafkamon1002.eqiad.wmnet [19:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:37] !log herron@cumin1001 START - Cookbook sre.hosts.reboot-single for host lists1001.wikimedia.org [19:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:24] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1079.eqiad.wmnet [19:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:23] 
!log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2040.codfw.wmnet [19:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:33] PROBLEM - Check systemd state on cp5005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter@frontend.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:43:38] (03CR) 10Cwhite: "As evidenced by this, it appears we haven't found the root cause of T275405. I propose we undo those changes and do something like this i" [puppet] - 10https://gerrit.wikimedia.org/r/776982 (https://phabricator.wikimedia.org/T305403) (owner: 10Herron) [19:43:59] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lists1001.wikimedia.org [19:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:43] RECOVERY - Check systemd state on cp5005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:37] 10SRE, 10Observability-Logging, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): apifeatureusage hosts hanging on shutdown - https://phabricator.wikimedia.org/T305403 (10colewhite) [19:46:49] (03PS7) 10Herron: logstash: set unit TimeoutStopSec of 2 minutes [puppet] - 10https://gerrit.wikimedia.org/r/776982 (https://phabricator.wikimedia.org/T305403) [19:47:20] 10SRE, 10Observability-Logging, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): apifeatureusage hosts hanging on shutdown - https://phabricator.wikimedia.org/T305403 (10colewhite) [19:48:41] (03PS8) 10Herron: logstash: set unit TimeoutStopSec of 2 minutes [puppet] - 10https://gerrit.wikimedia.org/r/776982 (https://phabricator.wikimedia.org/T305403) [19:50:10] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2040.codfw.wmnet [19:50:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5005.eqsin.wmnet [19:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:24] 10SRE, 10Observability-Logging, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): apifeatureusage hosts hanging on shutdown - https://phabricator.wikimedia.org/T305403 (10herron) [19:56:56] (03PS1) 10RLazarus: slo: Set a custom description for the Varnish dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/776992 (https://phabricator.wikimedia.org/T302842) [19:56:57] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5005.eqsin.wmnet [19:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:18] (03CR) 10Herron: "PCC seems confused at the moment, but looking at the full diffs this should do the right thing https://puppet-compiler.wmflabs.org/pcc-wor" [puppet] - 10https://gerrit.wikimedia.org/r/776982 (https://phabricator.wikimedia.org/T305403) (owner: 10Herron) [19:58:39] (03CR) 10Herron: logstash: set unit TimeoutStopSec of 2 minutes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776982 (https://phabricator.wikimedia.org/T305403) (owner: 10Herron) [20:00:04] RoanKattouw and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220404T2000). [20:00:04] AGueyte: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:26] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3060.esams.wmnet [20:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:27] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cp3060.esams.wmnet [20:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:39] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3060.esams.wmnet [20:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:42] * urbanecm is around, but he doesn't see AGueyte [20:02:43] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:03:05] She's joining [20:03:08] sorry for the delay [20:03:26] Tran: okay, I'll wait [20:03:34] hello [20:03:43] hello AnaisGueyte! I can deploy today [20:04:00] Great, thanks! It's my first deploy :) [20:04:38] okay, good to know! Feel free to ask if there's anything unclear -- the process can be confusing at first. No question is stupid :) [20:05:01] AnaisGueyte: do you have https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_usage installed for testing the change? [20:05:38] Yes! [20:05:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4027.ulsfo.wmnet [20:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:05] okay, great! 
[20:06:11] (03PS2) 10Urbanecm: Remove wgWMEIPAddressCopyActionEnabled from Beta and production config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774904 (https://phabricator.wikimedia.org/T296469) (owner: 10Tchanders) [20:06:16] I'll fetch the patch to the debug server now [20:06:29] (03CR) 10Urbanecm: [C: 03+2] Remove wgWMEIPAddressCopyActionEnabled from Beta and production config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774904 (https://phabricator.wikimedia.org/T296469) (owner: 10Tchanders) [20:06:40] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/776982 (https://phabricator.wikimedia.org/T305403) (owner: 10Herron) [20:07:19] urbanecm: I have one backport in the hopper -- as soon as https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/776226 merges, ok if I do that to wmf.5 after you're done? [20:07:28] (03Merged) 10jenkins-bot: Remove wgWMEIPAddressCopyActionEnabled from Beta and production config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774904 (https://phabricator.wikimedia.org/T296469) (owner: 10Tchanders) [20:07:32] (03CR) 10Herron: logstash: set unit TimeoutStopSec of 2 minutes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776982 (https://phabricator.wikimedia.org/T305403) (owner: 10Herron) [20:07:57] cjming: sure thing. Will you want to self-serve? [20:08:40] AnaisGueyte: your patch is at mwdebug1001. Can you have a look, please? [20:08:47] yes, thanks! [20:08:55] urbanecm: ah nvm - we want to test on beta cluster first so I'll schedule it for backport tomorrow instead [20:09:03] cjming: sounds good. [20:09:08] AnaisGueyte: let me know how it goes :) [20:10:06] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3060.esams.wmnet [20:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:57] (03CR) 10Herron: [C: 03+1] "LGTM!" 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/776992 (https://phabricator.wikimedia.org/T302842) (owner: 10RLazarus) [20:11:32] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1080.eqiad.wmnet [20:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24062 and previous config saved to /var/cache/conftool/dbconfig/20220404-201409-ladsgroup.json [20:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:12] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:14:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:13] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4027.ulsfo.wmnet [20:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:12] Hi @urbanecm. do you know if there's any reason the events log would not be fired on test wiki? 
[20:16:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5006.eqsin.wmnet [20:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:56] AnaisGueyte: i'm confused. I thought the patch is meant to disable instrumentation at all wikis? [20:17:48] It does, but it's not firing either on the non test server [20:18:01] ah [20:18:16] not from top of my head [20:18:18] but i can check [20:19:46] it does fire it outside of a debug server [20:20:12] i see a req to https://test.wikipedia.org/beacon/statsv?MediaWiki.ipinfo_address_copy.special_contributions=1c&MediaWiki.ipinfo_address_copy_by_wiki.testwiki.special_contributions=1c when i copy an address [20:20:14] (03CR) 10RLazarus: [V: 03+2 C: 03+2] slo: Set a custom description for the Varnish dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/776992 (https://phabricator.wikimedia.org/T302842) (owner: 10RLazarus) [20:21:00] AnaisGueyte: what do you observe? [20:21:15] (I hope my understanding of the instrumented action is correct) [20:21:31] I see it now but it appears very delayed. Is that something I should expect? [20:21:46] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1080.eqiad.wmnet [20:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:07] AnaisGueyte: okay, great. it can take a few seconds (AFAIK it tries to save some resources by batching events) [20:23:54] Great, I wasn't expecting the delay, testing again! 
Thank you [20:25:45] no problem [20:26:55] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5006.eqsin.wmnet [20:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:02] Good, I don't see the event being fired on the test server, it appears to be successful, thank you @urbanecm [20:28:17] AnaisGueyte: that's great news :) [20:28:19] I'm deploying the change now [20:29:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P24063 and previous config saved to /var/cache/conftool/dbconfig/20220404-202914-ladsgroup.json [20:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:29] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 8c81de9c732adef4537226ec6a7023fef40f3396: Remove wgWMEIPAddressCopyActionEnabled from Beta and production config (T296469) (duration: 00m 51s) [20:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:32] T296469: Log when a user copies an IP address - https://phabricator.wikimedia.org/T296469 [20:29:39] AnaisGueyte: should be live now. [20:29:44] anything else i can do for you today? [20:29:44] Thank you! [20:29:54] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp3061.esams.wmnet [20:29:55] Nope, that was a great first experience! Thanks! [20:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:04] happy to help! 
Talk to you later AnaisGueyte [20:30:10] !log UTC late B&C window completed [20:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:05] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5010.eqsin.wmnet [20:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1081.eqiad.wmnet [20:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3061.esams.wmnet [20:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:11] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:40:25] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5010.eqsin.wmnet [20:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:57] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1081.eqiad.wmnet [20:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P24064 and previous config saved to /var/cache/conftool/dbconfig/20220404-204419-ladsgroup.json [20:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24065 and previous config saved to /var/cache/conftool/dbconfig/20220404-205924-ladsgroup.json [20:59:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1099.eqiad.wmnet with reason: Maintenance [20:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [20:59:29] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24066 and previous config saved to /var/cache/conftool/dbconfig/20220404-205932-ladsgroup.json [20:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:41] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers - https://phabricator.wikimedia.org/T303857 (10thcipriani) [21:00:04] Reedy and sbassett: Dear deployers, time to do the Weekly Security deployment window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220404T2100). 
[21:02:04] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp2041.codfw.wmnet [21:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:27] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp5011.eqsin.wmnet [21:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:33] (03CR) 10Dzahn: [C: 03+2] geoip::maxmind: rename the update timers, don't use 'legacy' term [puppet] - 10https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464) (owner: 10Dzahn) [21:05:55] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1082.eqiad.wmnet [21:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2041.codfw.wmnet [21:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:55] (03CR) 10Dzahn: "[puppetmaster2003:~] $ sudo systemctl status geoip_update" [puppet] - 10https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464) (owner: 10Dzahn) [21:11:30] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5011.eqsin.wmnet [21:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:17] !log puppetmaster1001/puppetmaster2003 - geoip / maxmind database update timers renamed. 'geoip_update_legacy' became 'geoip_update_main', 'geoip_update' became 'geoip_update_ipinfo'. 
Not using the confusing 'legacy' term anymore as was suggested as part of (T303464) [21:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:20] T303464: Disable GeoIP Legacy Download - https://phabricator.wikimedia.org/T303464 [21:14:48] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1082.eqiad.wmnet [21:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:16] (03CR) 10Dzahn: "only runs on puppetmaster1001, the active one, but works. confirmed. manually started etc" [puppet] - 10https://gerrit.wikimedia.org/r/773845 (https://phabricator.wikimedia.org/T303464) (owner: 10Dzahn) [21:50:14] (03PS1) 10Bking: elastic: don't wait for green on first node in cluster [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) [21:52:15] (03PS2) 10Ryan Kemper: elastic: don't wait for green on first node [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [21:53:47] (03PS3) 10Ryan Kemper: elastic: don't wait for green on first node [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [22:00:07] (03CR) 10jerkins-bot: [V: 04-1] elastic: don't wait for green on first node [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [22:03:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24067 and previous config saved to /var/cache/conftool/dbconfig/20220404-220313-ladsgroup.json [22:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:18] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, 
user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:12:25] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:18:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24068 and previous config saved to /var/cache/conftool/dbconfig/20220404-221818-ladsgroup.json [22:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:47] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [22:28:21] ACKNOWLEDGEMENT - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project andrew bogott investigating https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [22:31:07] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 3 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [22:32:30] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:33:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', 
diff saved to https://phabricator.wikimedia.org/P24069 and previous config saved to /var/cache/conftool/dbconfig/20220404-223323-ladsgroup.json [22:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:46] (BlazegraphJvmQuakeWarnGC) firing: Blazegraph instance wdqs2003:9100 is entering a GC death spiral - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphJvmQuakeWarnGC [22:41:44] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10RobH) [22:42:20] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10RobH) [22:42:27] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10RobH) [22:48:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24070 and previous config saved to /var/cache/conftool/dbconfig/20220404-224828-ladsgroup.json [22:48:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [22:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [22:48:32] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 
[22:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24071 and previous config saved to /var/cache/conftool/dbconfig/20220404-224836-ladsgroup.json [22:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:22] (03PS4) 10Ryan Kemper: elastic: don't wait for green on first node [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [22:59:33] (03PS5) 10Ryan Kemper: elastic: don't wait for green on first node [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [23:01:27] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [23:03:37] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [23:05:44] (03CR) 10jerkins-bot: [V: 04-1] elastic: don't wait for green on first node [software/spicerack] - 10https://gerrit.wikimedia.org/r/776999 (https://phabricator.wikimedia.org/T304570) (owner: 10Bking) [23:35:43] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:36:28] (03PS4) 10Dzahn: aptrepo: import gitlab-runner package for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/767604 (https://phabricator.wikimedia.org/T297659) [23:37:43] (03CR) 10Dzahn: [C: 03+2] aptrepo: import gitlab-runner package for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/767604 (https://phabricator.wikimedia.org/T297659) (owner: 10Dzahn) [23:45:33] (03CR) 
10Dzahn: "[apt1001:~] $ sudo -E reprepro --component thirdparty/gitlab-runner checkupdate bullseye-wikimedia" [puppet] - 10https://gerrit.wikimedia.org/r/767604 (https://phabricator.wikimedia.org/T297659) (owner: 10Dzahn) [23:48:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24072 and previous config saved to /var/cache/conftool/dbconfig/20220404-234850-ladsgroup.json [23:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:55] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:50:14] 10SRE-OnFire (FY2021/2022-Q3), 10WMF-NDA: non-wikimedia.org domain names for status page - https://phabricator.wikimedia.org/T293504 (10CDanis) [23:51:38] !log apt1001 - importing gitlab-runner package for bullseye via: 'sudo -E reprepro --noskipold --component thirdparty/gitlab-runner update bullseye-wikimedia' after gerrit:767604 (T297659) [23:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:41] T297659: upgrade gitlab-runners to bullseye - https://phabricator.wikimedia.org/T297659