[00:08:44] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719
[00:08:49] T316719: Upgrade codfw cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T316719
[00:12:18] (Abandoned) Ryan Kemper: 6.8.23-wmf2 search-extra for bullseye [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/818507 (https://phabricator.wikimedia.org/T314078) (owner: Ryan Kemper)
[00:12:46] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-08-23 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:13:33] (CR) Ryan Kemper: [C: +1] prometheus-elasticsearch-exporter: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/826835 (owner: Muehlenhoff)
[00:13:35] (CR) Ryan Kemper: [C: +2] prometheus-elasticsearch-exporter: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/826835 (owner: Muehlenhoff)
[00:14:09] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719
[00:14:14] T316719: Upgrade codfw cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T316719
[00:14:50] !log T316719 First elastic host upgraded properly.
Cancelling cookbook to kick off a new rolling upgrade that will go 3 nodes at a time (first run was just one host as a sanity check)
[00:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:20] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719
[00:19:12] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:20:18] ops-eqiad, decommission-hardware: decommission elastic10[48-52].eqiad.wmnet - https://phabricator.wikimedia.org/T316728 (RKemper)
[00:20:26] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:20:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:20:46] PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-08-23 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:22:54] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.242 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:23:56] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48534 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:25:39] ops-codfw, decommission-hardware: decommission elastic2035.codfw.wmnet - https://phabricator.wikimedia.org/T316729 (RKemper)
[00:26:33] (CR) Ryan Kemper: elastic: decom elastic2035 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/759637
(https://phabricator.wikimedia.org/T294805) (owner: Ryan Kemper)
[00:26:45] (PS3) Ryan Kemper: elastic: decom elastic2035 [puppet] - https://gerrit.wikimedia.org/r/759637 (https://phabricator.wikimedia.org/T316729)
[00:27:07] (CR) CI reject: [V: -1] elastic: decom elastic2035 [puppet] - https://gerrit.wikimedia.org/r/759637 (https://phabricator.wikimedia.org/T316729) (owner: Ryan Kemper)
[00:30:48] PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2022-08-23 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:33:42] PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-08-23 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:40:45] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:40:57] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[00:52:36] ACKNOWLEDGEMENT - DNS on cloudservices1003.mgmt is CRITICAL: Domain cloudservices1003.mgmt.eqiad.wmnet was not found by the server Andrew Bogott not urgent.
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:21:44] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:25:40] PROBLEM - PHP7 rendering on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:28:04] RECOVERY - PHP7 rendering on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 6.284 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:34:00] PROBLEM - PHP7 rendering on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:38:46] RECOVERY - PHP7 rendering on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.987 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:40:45] (JobUnavailable) firing: (9) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:00] PROBLEM - PHP7 jobrunner on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[01:44:46] PROBLEM - PHP7 rendering on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:45:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:50:45] (JobUnavailable) firing: (11) Reduced availability for job
gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:54:24] RECOVERY - PHP7 rendering on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:55:02] RECOVERY - PHP7 jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:00:04] PROBLEM - PHP7 jobrunner on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[02:00:26] PROBLEM - PHP7 rendering on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:02:22] RECOVERY - PHP7 jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:02:42] RECOVERY - PHP7 rendering on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:10:45] (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:24:38] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:49:27] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 -
T316719
[02:49:32] T316719: Upgrade codfw cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T316719
[02:50:04] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719
[02:58:50] PROBLEM - PHP7 rendering on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:01:08] RECOVERY - PHP7 rendering on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 1.986 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:11:22] PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[03:11:42] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters.
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[03:16:38] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[03:17:23] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719
[03:17:28] T316719: Upgrade codfw cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T316719
[03:23:50] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719
[03:23:50] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719
[03:23:55] T316719: Upgrade codfw cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T316719
[03:27:33] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:05:18] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters.
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[04:07:46] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[04:41:12] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[04:43:53] (PS2) Marostegui: mariadb: Promote db1159 to m3 master [puppet] - https://gerrit.wikimedia.org/r/828009 (https://phabricator.wikimedia.org/T316506)
[04:44:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2132,2160].codfw.wmnet,db[1117,1195].eqiad.wmnet with reason: switchover m1 T316506
[04:44:31] T316506: Switchover m3 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T316506
[04:44:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2132,2160].codfw.wmnet,db[1117,1195].eqiad.wmnet with reason: switchover m1 T316506
[04:45:27] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:46:12] (CR) Marostegui: [C: +2] mariadb: Promote db1159 to m3 master [puppet] - https://gerrit.wikimedia.org/r/828009 (https://phabricator.wikimedia.org/T316506) (owner: Marostegui)
[04:46:38] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:51:50] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL -
Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[04:52:36] I am switching over phabricator db master in a few minutes
[04:56:42] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[05:00:02] !log Failover m3 from db1183 to db1159 - T316506
[05:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:07] T316506: Switchover m3 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T316506
[05:07:39] (PS1) Marostegui: dbproxy1020,dbproxy1016: Add db1117:3323 back as standby [puppet] - https://gerrit.wikimedia.org/r/828384 (https://phabricator.wikimedia.org/T316742)
[05:10:04] (CR) Marostegui: [C: +2] dbproxy1020,dbproxy1016: Add db1117:3323 back as standby [puppet] - https://gerrit.wikimedia.org/r/828384 (https://phabricator.wikimedia.org/T316742) (owner: Marostegui)
[05:12:01] (PS1) Marostegui: db1183: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/828396
[05:17:10] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:18:28] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:19:44] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:21:08] (CR) Marostegui: [C: +2] db1183: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/828396 (owner: Marostegui)
[05:25:26] (PS1) Marostegui: mariadb:
Move db1183 to m5 [puppet] - https://gerrit.wikimedia.org/r/828397 (https://phabricator.wikimedia.org/T316742)
[05:26:07] (CR) Marostegui: [C: +2] mariadb: Move db1183 to m5 [puppet] - https://gerrit.wikimedia.org/r/828397 (https://phabricator.wikimedia.org/T316742) (owner: Marostegui)
[05:28:52] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:31:21] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[05:35:31] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[05:35:35] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:36:15] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters.
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[05:40:01] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[05:58:15] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:09:21] (PS1) Giuseppe Lavagetto: role::ci::master: remove admin dependency hack [puppet] - https://gerrit.wikimedia.org/r/828399
[06:11:00] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:18:09] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[06:21:57] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[06:32:45] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:54:10] SRE, Infrastructure-Foundations, Observability-Metrics: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid.
- https://phabricator.wikimedia.org/T315388 (andrea.denisse) In progress→Resolved
[06:54:16] SRE, Infrastructure-Foundations, Observability-Metrics, netops, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (andrea.denisse)
[06:55:01] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:55:19] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:55:33] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 247, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:55:33] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:56:49] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:56:57] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:56:57] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:57:57] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 252, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:59:13] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:00:05] Amir1 and Urbanecm: #bothumor Q:How do
functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220831T0700).
[07:00:05] _joe_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:42] <_joe_> yeah it won't be deployed
[07:01:45] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[07:02:35] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:02:35] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/7 UP : OSPFv3: 5/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:02:49] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:04:01] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:04:11] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:04:11] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:04:39] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:04:59] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:14:38] (PS1) DCausse: Relax elasticsearch marster node detection [puppet] -
https://gerrit.wikimedia.org/r/828403
[07:15:31] (PS1) Muehlenhoff: tlsproxy:ssl: Remove ssl_ecdhe_curve [puppet] - https://gerrit.wikimedia.org/r/828404
[07:15:37] (PS2) DCausse: Relax elasticsearch master node detection [puppet] - https://gerrit.wikimedia.org/r/828403
[07:15:59] !log bounce thanos-compact on thanos-fe2001
[07:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:06] (CR) CI reject: [V: -1] tlsproxy:ssl: Remove ssl_ecdhe_curve [puppet] - https://gerrit.wikimedia.org/r/828404 (owner: Muehlenhoff)
[07:18:01] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:18:01] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[07:18:31] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:18:39] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:18:52] (CR) Jelto: [C: +2] admin: add tsepothoabala to analytics_privatedata_users [puppet] - https://gerrit.wikimedia.org/r/827980 (https://phabricator.wikimedia.org/T315409) (owner: Jelto)
[07:20:10] (PS2) Muehlenhoff: tlsproxy:ssl: Remove ssl_ecdhe_curve [puppet] - https://gerrit.wikimedia.org/r/828404
[07:20:45] (JobUnavailable) resolved: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets -
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:20:47] (PS1) Ebernhardson: elasticsearch: Simplify routine to start masters last [software/spicerack] - https://gerrit.wikimedia.org/r/828406
[07:20:54] SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff)
[07:20:57] (ThanosCompactIsDown) resolved: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[07:20:59] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:22:41] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48535 bytes in 0.139 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:23:15] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:23:19] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:23:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.325 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:26:10] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/828404 (owner: Muehlenhoff)
[07:26:24] (PS2) Ebernhardson: elasticsearch: Simplify routine to start masters last [software/spicerack] - https://gerrit.wikimedia.org/r/828406
[07:27:33] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[07:27:39] (CR) Ebernhardson: [C: +1] Relax elasticsearch master node detection [puppet] - https://gerrit.wikimedia.org/r/828403 (owner: DCausse)
[07:27:43] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[07:29:19] ^ this alert will be flapping for a while (til we merge https://gerrit.wikimedia.org/r/828403)
[07:31:30] (PS1) Marostegui: dbproxy1017,dbproxy1021: Add db1183 to m5 [puppet] - https://gerrit.wikimedia.org/r/828408 (https://phabricator.wikimedia.org/T316742)
[07:32:09] (CR) Marostegui: [C: +2] dbproxy1017,dbproxy1021: Add db1183 to m5 [puppet] - https://gerrit.wikimedia.org/r/828408 (https://phabricator.wikimedia.org/T316742) (owner: Marostegui)
[07:32:56] (CR) CI reject: [V: -1] elasticsearch: Simplify routine to start masters last [software/spicerack] -
https://gerrit.wikimedia.org/r/828406 (owner: Ebernhardson)
[07:35:27] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:37:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
[07:39:55] (PS3) Muehlenhoff: tlsproxy:ssl: Remove ssl_ecdhe_curve [puppet] - https://gerrit.wikimedia.org/r/828404
[07:39:55] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet
[07:40:13] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1006.eqiad.wmnet
[07:41:18] (PS1) Gerrit maintenance bot: mariadb: Promote db1120 to x1 master [puppet] - https://gerrit.wikimedia.org/r/828468 (https://phabricator.wikimedia.org/T316745)
[07:41:22] (PS1) Gerrit maintenance bot: wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/828469 (https://phabricator.wikimedia.org/T316745)
[07:42:28] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Operations: Access request to analytics system(s) for TThoabala - https://phabricator.wikimedia.org/T315409 (Jelto) @TThoabala access was granted. Can you please verify that you have access to the requested data/notebook?
[07:43:32] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/828404 (owner: Muehlenhoff)
[07:43:42] (CR) Marostegui: [C: -2] "Wait for the failover date" [puppet] - https://gerrit.wikimedia.org/r/828468 (https://phabricator.wikimedia.org/T316745) (owner: Gerrit maintenance bot)
[07:44:15] (CR) Marostegui: [C: -2] "Wait for the failover date" [dns] - https://gerrit.wikimedia.org/r/828469 (https://phabricator.wikimedia.org/T316745) (owner: Gerrit maintenance bot)
[07:45:05] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:45:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
[07:47:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[07:47:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1120 for upgrade', diff saved to https://phabricator.wikimedia.org/P33705 and previous config saved to /var/cache/conftool/dbconfig/20220831-074748-root.json
[07:50:15] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[07:50:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2022.codfw.wmnet to cluster codfw and group B
[07:50:34] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1006.eqiad.wmnet
[07:51:38] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host
ganeti2022.codfw.wmnet to cluster codfw and group B
[07:53:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33706 and previous config saved to /var/cache/conftool/dbconfig/20220831-075310-root.json
[07:54:22] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host prometheus2006.codfw.wmnet
[07:55:37] (CR) Volans: "Minor nits reported by CI, see inline comments for the details. Beside that LGTM." [software/spicerack] - https://gerrit.wikimedia.org/r/828406 (owner: Ebernhardson)
[07:57:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[07:58:09] SRE, Observability-Metrics: Not all carbon service start at graphite reboot - https://phabricator.wikimedia.org/T316747 (fgiunchedi)
[08:00:05] dduvall and hashar: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220831T0800).
[08:00:58] (03PS1) 10Filippo Giunchedi: carbon: start at boot [puppet] - 10https://gerrit.wikimedia.org/r/828471 (https://phabricator.wikimedia.org/T316747) [08:01:00] (03PS1) 10Filippo Giunchedi: graphite: properly shut carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/828472 (https://phabricator.wikimedia.org/T316747) [08:01:17] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Set action=never-cache for caching=websockets|pipe [puppet] - 10https://gerrit.wikimedia.org/r/827506 (https://phabricator.wikimedia.org/T316545) (owner: 10Vgutierrez) [08:01:59] (03CR) 10CI reject: [V: 04-1] graphite: properly shut carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/828472 (https://phabricator.wikimedia.org/T316747) (owner: 10Filippo Giunchedi) [08:03:41] (03PS2) 10Filippo Giunchedi: graphite: start carbon.service at boot [puppet] - 10https://gerrit.wikimedia.org/r/828471 (https://phabricator.wikimedia.org/T316747) [08:03:43] (03PS2) 10Filippo Giunchedi: graphite: properly shut carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/828472 (https://phabricator.wikimedia.org/T316747) [08:06:48] (03CR) 10Clément Goubert: [C: 03+1] "LGTM, no forced ordering is good." 
[puppet] - 10https://gerrit.wikimedia.org/r/828399 (owner: 10Giuseppe Lavagetto) [08:07:22] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (10fnegri) a:03Cmjohnson [08:08:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 2%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33707 and previous config saved to /var/cache/conftool/dbconfig/20220831-080815-root.json [08:09:17] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:12:03] !log test trafficserver: Hide non session cookies during cache lookup in cp6016 - T316338 [08:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:07] T316338: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 [08:14:12] (03PS3) 10DCausse: elasticsearch: Simplify routine to start masters last [software/spicerack] - 10https://gerrit.wikimedia.org/r/828406 (owner: 10Ebernhardson) [08:14:17] (03CR) 10DCausse: elasticsearch: Simplify routine to start masters last (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/828406 (owner: 10Ebernhardson) [08:20:05] !log end test trafficserver: Hide non session cookies during cache lookup in cp6016 - T316338 [08:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:10] T316338: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 [08:23:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33708 and previous config saved to 
/var/cache/conftool/dbconfig/20220831-082319-root.json [08:26:11] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:27:55] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 24 hosts with reason: Downtiming php7.4 parsoid servers until they are ready to pool [08:28:13] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 24 hosts with reason: Downtiming php7.4 parsoid servers until they are ready to pool [08:28:16] !log upgrading ganeti2016/ganeti2018 to 3.0.2 T312637 [08:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:21] T312637: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 [08:30:36] (03CR) 10DCausse: [C: 03+1] elasticsearch: Simplify routine to start masters last [software/spicerack] - 10https://gerrit.wikimedia.org/r/828406 (owner: 10Ebernhardson) [08:30:38] (03PS1) 10Filippo Giunchedi: Remove upstart configs in /etc/init/ [puppet] - 10https://gerrit.wikimedia.org/r/828477 [08:32:27] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10fgiunchedi) [08:32:56] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1001.eqiad.wmnet [08:33:05] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:34:05] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This is a symlink (half-manually - as in there is an automating script) in our current setup, so this would break our setup." 
[puppet] - 10https://gerrit.wikimedia.org/r/828078 (owner: 10AOkoth) [08:36:07] (03CR) 10Muehlenhoff: [C: 03+1] "Good riddance :-)" [puppet] - 10https://gerrit.wikimedia.org/r/828477 (owner: 10Filippo Giunchedi) [08:38:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 4%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33709 and previous config saved to /var/cache/conftool/dbconfig/20220831-083824-root.json [08:38:34] (03Abandoned) 10Jelto: gitlab: rotate backups on replica [puppet] - 10https://gerrit.wikimedia.org/r/824739 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [08:39:06] (03Abandoned) 10Jelto: gitlab: use actual backup name instead of latest on replica [puppet] - 10https://gerrit.wikimedia.org/r/824730 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [08:39:23] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1001.eqiad.wmnet [08:40:37] (03PS1) 10Jelto: Revert "install_server: change partman config for gitlab" [puppet] - 10https://gerrit.wikimedia.org/r/827578 [08:42:33] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:42:59] (03CR) 10Filippo Giunchedi: sre: followup on Kafka partition replication alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/826214 (https://phabricator.wikimedia.org/T309010) (owner: 10Filippo Giunchedi) [08:43:07] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:43:24] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1002.eqiad.wmnet [08:44:29] (03CR) 10Jelto: "Instead of 
reserving more space for backups we want to store less backups on GitLab hosts (and use bacula instead). This will allow the ro" [puppet] - 10https://gerrit.wikimedia.org/r/827578 (owner: 10Jelto) [08:44:51] (03CR) 10Alexandros Kosiaris: Label the eight dse-k8s-worker nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828052 (https://phabricator.wikimedia.org/T310177) (owner: 10Btullis) [08:44:53] (03CR) 10Filippo Giunchedi: Remove upstart configs in /etc/init/ (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828477 (owner: 10Filippo Giunchedi) [08:45:23] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [08:51:03] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1002.eqiad.wmnet [08:51:07] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:53:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33710 and previous config saved to /var/cache/conftool/dbconfig/20220831-085329-root.json [08:56:46] 10SRE, 10Traffic: ATS isn't honoring the cache policy set in cache::alternate_domains on some cases - https://phabricator.wikimedia.org/T316545 (10Jersione) What do I do [08:58:09] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [08:59:33] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [09:00:37] (03CR) 10Klausman: [C: 03+1] "Spot checked a few of the host-row assignments, all LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/828052 (https://phabricator.wikimedia.org/T310177) (owner: 10Btullis) [09:01:38] (03CR) 10Klausman: [C: 03+1] Add a kublet node_label to each master of the dse-k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/828049 (https://phabricator.wikimedia.org/T310172) (owner: 10Btullis) [09:01:55] (03CR) 10Volans: "This one should have been merged before deploying spicerack 3.2.0 to the cumin hosts (it's currently on cumin2002 AFAICT). Is anything blo" [puppet] - 10https://gerrit.wikimedia.org/r/819562 (owner: 10Ayounsi) [09:02:13] (03CR) 10DCausse: [C: 04-1] cirrus: Handle transition to elasticsearch 7.10 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson) [09:02:47] (03CR) 10Volans: [C: 04-1] "comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/819562 (owner: 10Ayounsi) [09:05:15] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:08:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33711 and previous config saved to /var/cache/conftool/dbconfig/20220831-090834-root.json [09:10:09] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) [09:10:47] (03CR) 10Clément Goubert: [C: 03+2] role::ci::master: remove admin dependency hack [puppet] - 10https://gerrit.wikimedia.org/r/828399 (owner: 10Giuseppe Lavagetto) [09:11:05] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1003.eqiad.wmnet 
[09:13:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf2003.codfw.wmnet [09:14:00] (03CR) 10DCausse: [C: 04-1] cirrus: Handle transition to elasticsearch 7.10 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson) [09:16:23] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [09:17:23] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1003.eqiad.wmnet [09:17:43] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host parse1002.eqiad.wmnet with OS buster [09:19:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2003.codfw.wmnet [09:22:39] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [09:22:52] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2001.codfw.wmnet [09:22:57] (03PS4) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) [09:23:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33712 and previous config saved to /var/cache/conftool/dbconfig/20220831-092339-root.json [09:26:44] (03CR) 10CI reject: [V: 04-1] trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) (owner: 10Vgutierrez) [09:27:20] !log installing docker.io bugfix updates from Bullseye point release [09:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:21] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:29:45] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1002.eqiad.wmnet with reason: host reimage [09:31:12] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:31:54] 10SRE-tools, 10Infrastructure-Foundations, 10Release-Engineering-Team: Investigate sharing releng common python code to pywmflib - https://phabricator.wikimedia.org/T316757 (10hashar) [09:33:00] (03PS1) 10Phuedx: beta: $wgIPInfoGeoIP2Prefix -> $wgIPInfoGeoLite2Prefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828482 [09:33:20] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on parse1002.eqiad.wmnet with reason: host reimage [09:34:43] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-fe2001.codfw.wmnet [09:37:15] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2002.codfw.wmnet [09:37:48] (03CR) 10Effie Mouzeli: [C: 03+1] Update calico to v3.23.3 [debs/calico] (v3.23) - 10https://gerrit.wikimedia.org/r/826230 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:38:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33713 and previous config saved to /var/cache/conftool/dbconfig/20220831-093844-root.json [09:44:02] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2002.codfw.wmnet [09:44:09] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2003.codfw.wmnet [09:44:56] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [09:51:44] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2003.codfw.wmnet [09:53:30] RECOVERY - dump of es4 in codfw on backupmon1001 is OK: Last dump for es4 at codfw (es2022) taken on 2022-08-30 07:48:35 (3454 GiB, +1.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:53:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33714 and previous config saved to /var/cache/conftool/dbconfig/20220831-095348-root.json [09:57:04] RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-08-30 07:31:40 (3433 GiB, 
+1.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:59:12] (03PS1) 10Volans: peeringdb: minor fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/828486 [09:59:14] (03PS1) 10Volans: CHANGELOG: fix typos and uniform format [software/spicerack] - 10https://gerrit.wikimedia.org/r/828487 [09:59:47] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:00:23] (03CR) 10Volans: "Updated comment, based on new patch." [puppet] - 10https://gerrit.wikimedia.org/r/819562 (owner: 10Ayounsi) [10:00:56] (03CR) 10Volans: "I've left some post-merge comment. I've sent a patch with some small fixes, see:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/816701 (owner: 10Ayounsi) [10:06:01] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1002.eqiad.wmnet with OS buster [10:06:04] (03CR) 10CI reject: [V: 04-1] peeringdb: minor fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/828486 (owner: 10Volans) [10:08:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33715 and previous config saved to /var/cache/conftool/dbconfig/20220831-100853-root.json [10:10:07] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [10:11:43] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [10:18:22] (03CR) 10Ladsgroup: "We talked about details of this in 1:1. Are we good to go?" 
[software] - 10https://gerrit.wikimedia.org/r/826522 (owner: 10Ladsgroup) [10:18:43] (03PS2) 10Volans: peeringdb: minor fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/828486 [10:18:45] (03PS2) 10Volans: CHANGELOG: fix typos and uniform format [software/spicerack] - 10https://gerrit.wikimedia.org/r/828487 [10:21:09] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [10:23:47] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [10:23:54] is there a way to get a PHP 7.4 shell.php on mwdebug? [10:24:37] aha, `PHP=php7.4 mwscript` works \o/ [10:26:31] (03CR) 10Volans: [C: 03+2] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/828406 (owner: 10Ebernhardson) [10:28:32] Lucas_WMDE: fyi mwscript doesn't do much more than `sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php`. sometimes that's useful, so just informing :). [10:29:29] PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [10:31:33] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [10:32:27] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:33:25] (03Merged) 10jenkins-bot: elasticsearch: Simplify routine to start masters last [software/spicerack] - 10https://gerrit.wikimedia.org/r/828406 (owner: 10Ebernhardson) [10:36:25] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [10:39:23] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [10:41:12] (03PS5) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) [10:42:18] <_joe_> !log updating php 7.4 on mwdebug1002 to test the new patched packages T316601 [10:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:23] T316601: PHP Warning: Erroneous data format for unserializing 'Wikimedia\Rdbms\MySQLPrimaryPos' - https://phabricator.wikimedia.org/T316601 [10:44:58] (03CR) 10CI reject: [V: 04-1] trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) (owner: 10Vgutierrez) [10:46:00] (03PS1) 10Clément Goubert: ipmi::monitor: Order service after package install [puppet] - 10https://gerrit.wikimedia.org/r/828494 [10:46:46] (03PS6) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) [10:48:36] (03CR) 10Clément Goubert: [V: 03+1] 
"PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37051/console" [puppet] - 10https://gerrit.wikimedia.org/r/828494 (owner: 10Clément Goubert) [10:49:03] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [10:50:31] (03CR) 10CI reject: [V: 04-1] trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) (owner: 10Vgutierrez) [10:50:53] (03CR) 10Clément Goubert: ipmi::monitor: Order service after package install [puppet] - 10https://gerrit.wikimedia.org/r/828494 (owner: 10Clément Goubert) [10:51:07] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:57:46] (03CR) 10TsepoThoabala: [C: 03+1] "I only have +1 rights on this repo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828482 (owner: 10Phuedx) [10:59:46] (03PS7) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) [10:59:54] (03CR) 10Marostegui: [C: 03+1] auto_schema: More work on multidc support [software] - 10https://gerrit.wikimedia.org/r/826522 (owner: 10Ladsgroup) [11:00:39] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [11:00:43] <_joe_> !log updating php 7.4 packages in wikimedia/buster T316601 [11:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:51] T316601: PHP Warning: Erroneous data format for unserializing 'Wikimedia\Rdbms\MySQLPrimaryPos' - https://phabricator.wikimedia.org/T316601 [11:01:03] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: More work on multidc support [software] - 10https://gerrit.wikimedia.org/r/826522 (owner: 10Ladsgroup) [11:01:09] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:01:49] (03Merged) 10jenkins-bot: auto_schema: More work on multidc support [software] - 10https://gerrit.wikimedia.org/r/826522 (owner: 10Ladsgroup) [11:04:09] !log test trafficserver: Hide non session cookies during cache lookup in cp6016 - T316338 [11:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:13] T316338: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 [11:06:35] (03PS3) 10Samtar: InitialiseSettings.php: Enable Realtime Preview on Group 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828059 (https://phabricator.wikimedia.org/T314828) [11:07:58] (03CR) 10Volans: [C: 03+2] "trivial comment removal only, self-merging" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/828022 (owner: 10Volans) [11:09:14] (03CR) 10Volans: [C: 03+2] "Most fixes are trivial, self-merging to have them released with today's release. Happy to fix anything that might come up later in review."
[software/spicerack] - 10https://gerrit.wikimedia.org/r/828486 (owner: 10Volans) [11:09:20] (03PS3) 10Volans: peeringdb: minor fixes [software/spicerack] - 10https://gerrit.wikimedia.org/r/828486 [11:09:53] (03PS1) 10Hnowlan: image-suggestion: temporarily enable debug logging in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/828497 (https://phabricator.wikimedia.org/T313973) [11:10:01] (03CR) 10Volans: [C: 03+2] "CHANGELOG only changes, self-merging" [software/spicerack] - 10https://gerrit.wikimedia.org/r/828487 (owner: 10Volans) [11:13:04] (03Merged) 10jenkins-bot: tests: remove unnecessary pylint disable [software/pywmflib] - 10https://gerrit.wikimedia.org/r/828022 (owner: 10Volans) [11:13:59] RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-08-30 07:31:40 (3454 GiB, +1.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [11:16:18] (03PS1) 10Clément Goubert: P:mediawiki::php: Order wmerrors config and package install [puppet] - 10https://gerrit.wikimedia.org/r/828500 [11:16:37] finally it finished! 
[11:16:54] (03CR) 10CI reject: [V: 04-1] P:mediawiki::php: Order wmerrors config and package install [puppet] - 10https://gerrit.wikimedia.org/r/828500 (owner: 10Clément Goubert) [11:17:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf2004.codfw.wmnet [11:18:04] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37052/console" [puppet] - 10https://gerrit.wikimedia.org/r/828500 (owner: 10Clément Goubert) [11:21:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2004.codfw.wmnet [11:21:49] (03PS8) 10Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) [11:22:23] !log draining ganeti2015 for eventual reimage T311686 [11:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:27] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [11:25:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1003.eqiad.wmnet [11:27:18] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version [11:27:32] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version [11:27:33] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [11:28:31] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [11:28:35] (03PS1) 10Clément Goubert: C:cpufrequtils: Order install configuration and service [puppet] - 10https://gerrit.wikimedia.org/r/828502 [11:29:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1003.eqiad.wmnet [11:29:57] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37053/console" [puppet] - 10https://gerrit.wikimedia.org/r/828502 (owner: 10Clément Goubert) [11:30:49] (03PS2) 10Clément Goubert: P:mediawiki::php: Order wmerrors config and package install [puppet] - 10https://gerrit.wikimedia.org/r/828500 [11:32:01] (03PS2) 10Clément Goubert: C:ipmi::monitor: Order service after package install [puppet] - 10https://gerrit.wikimedia.org/r/828494 [11:32:46] (03PS1) 10Hnowlan: Fix environment in prep stage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/828503 (https://phabricator.wikimedia.org/T312104) [11:33:19] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [11:34:23] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:37:25] (03CR) 10Hnowlan: [C: 03+2] api-gateway: custom host overrides in discovery services. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/825729 (owner: 10Hnowlan) [11:38:55] (03PS1) 10Clément Goubert: P:prometheus::nutcracker_exporter: Order service and package [puppet] - 10https://gerrit.wikimedia.org/r/828504 [11:39:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1004.eqiad.wmnet [11:40:00] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37054/console" [puppet] - 10https://gerrit.wikimedia.org/r/828504 (owner: 10Clément Goubert) [11:41:21] (03Merged) 10jenkins-bot: api-gateway: custom host overrides in discovery services. [deployment-charts] - 10https://gerrit.wikimedia.org/r/825729 (owner: 10Hnowlan) [11:49:12] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host webperf1004.eqiad.wmnet [11:49:26] (03PS3) 10Volans: CHANGELOG: fix typos and uniform format [software/spicerack] - 10https://gerrit.wikimedia.org/r/828487 [11:49:53] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens14.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:47] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:22] (03PS1) 10Clément Goubert: C:httpd::mpm: Remove mod_php* for php7.4 [puppet] - 10https://gerrit.wikimedia.org/r/828507 [11:57:43] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [11:58:01] !log Reboot sanitarium hosts, lag will appear on clouddb* hosts [11:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:31] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37055/console" [puppet] - 10https://gerrit.wikimedia.org/r/828507 (owner: 10Clément Goubert) [12:03:25] PROBLEM - mediawiki-installation DSH group on parse1002 is CRITICAL: Host parse1002 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [12:04:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet [12:05:01] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [12:05:35] (03PS1) 10Andrew Bogott: Remove backy2 backups from cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/828508 (https://phabricator.wikimedia.org/T316731) [12:06:11] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [12:06:54] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet [12:08:00] (03PS1) 10Clément Goubert: C:mediawiki::packages::fonts: Order install and config [puppet] - 10https://gerrit.wikimedia.org/r/828510 [12:08:23] (03PS5) 10Jcrespo: bacula: Setup backup1008, backup2008 as new database backup storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/816120 (https://phabricator.wikimedia.org/T313582) [12:08:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2002.codfw.wmnet [12:13:23] (03CR) 10CI reject: [V: 04-1] 
C:mediawiki::packages::fonts: Order install and config [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert) [12:13:45] (JobUnavailable) firing: Reduced availability for job k8s-pods in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:16:59] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2001.codfw.wmnet [12:17:15] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging2002.codfw.wmnet [12:18:09] (03PS2) 10Clément Goubert: C:mediawiki::packages::fonts: Order install and config [puppet] - 10https://gerrit.wikimedia.org/r/828510 [12:18:45] (JobUnavailable) resolved: Reduced availability for job k8s-pods in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:19:24] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37057/console" [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert) [12:22:45] (03CR) 10Clément Goubert: "Related puppet log: https://phabricator.wikimedia.org/P33716$1" [puppet] - 10https://gerrit.wikimedia.org/r/828494 (owner: 10Clément Goubert) [12:23:09] (03CR) 10Clément Goubert: "Related puppet log: https://phabricator.wikimedia.org/P33716$19" [puppet] - 10https://gerrit.wikimedia.org/r/828500 (owner: 10Clément Goubert) [12:23:58] (KubernetesCalicoDown) firing: ml-staging2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:24:12] (03CR) 10Muehlenhoff: 
C:mediawiki::packages::fonts: Order install and config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert) [12:24:14] (03CR) 10Clément Goubert: P:mediawiki::php: Order wmerrors config and package install (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828500 (owner: 10Clément Goubert) [12:25:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1002.eqiad.wmnet [12:25:56] (03CR) 10Clément Goubert: [V: 03+1] C:mediawiki::packages::fonts: Order install and config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert) [12:26:42] (03CR) 10Clément Goubert: [V: 03+1] "Related puppet log: https://phabricator.wikimedia.org/P33716$19" [puppet] - 10https://gerrit.wikimedia.org/r/828502 (owner: 10Clément Goubert) [12:27:15] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2002.codfw.wmnet [12:27:27] (03CR) 10Clément Goubert: [V: 03+1] "Related puppet log: https://phabricator.wikimedia.org/P33716$30" [puppet] - 10https://gerrit.wikimedia.org/r/828504 (owner: 10Clément Goubert) [12:27:55] (03CR) 10Clément Goubert: [V: 03+1] "Related puppet log: https://phabricator.wikimedia.org/P33716$35" [puppet] - 10https://gerrit.wikimedia.org/r/828507 (owner: 10Clément Goubert) [12:28:43] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-staging-ctrl2001.codfw.wmnet with reason: Reboot to pick up kernel 5.10.136 (T316185) [12:28:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1002.eqiad.wmnet [12:28:57] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-staging-ctrl2001.codfw.wmnet with reason: Reboot to pick up kernel 5.10.136 (T316185) [12:28:58] (KubernetesCalicoDown) resolved: ml-staging2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:29:46] (03CR) 10Jcrespo: [C: 03+2] bacula: Setup backup1008, backup2008 as new database backup storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/816120 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [12:30:47] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [12:31:25] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:31:45] (03CR) 10Herron: [C: 03+1] graphite: start carbon.service at boot [puppet] - 10https://gerrit.wikimedia.org/r/828471 (https://phabricator.wikimedia.org/T316747) (owner: 10Filippo Giunchedi) [12:31:55] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:31:57] (03CR) 10Herron: [C: 03+1] graphite: properly shut carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/828472 (https://phabricator.wikimedia.org/T316747) (owner: 10Filippo Giunchedi) [12:33:15] (03CR) 10Andrew Bogott: [C: 03+2] Remove backy2 backups from cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/828508 (https://phabricator.wikimedia.org/T316731) (owner: 10Andrew Bogott) [12:33:21] (03PS2) 10Andrew Bogott: Remove backy2 backups from cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/828508 (https://phabricator.wikimedia.org/T316731) [12:33:41] !log klausman@cumin1001 START - Cookbook sre.hosts.remove-downtime for ml-staging-ctrl2001.codfw.wmnet [12:33:41] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 
ml-staging-ctrl2001.codfw.wmnet [12:34:25] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-staging-ctrl2002.codfw.wmnet with reason: Reboot to pick up kernel 5.10.136 (T316185) [12:34:39] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-staging-ctrl2002.codfw.wmnet with reason: Reboot to pick up kernel 5.10.136 (T316185) [12:35:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet [12:35:41] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:39:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet [12:39:19] !log klausman@cumin1001 START - Cookbook sre.hosts.remove-downtime for ml-staging-ctrl2002.codfw.wmnet [12:39:19] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ml-staging-ctrl2002.codfw.wmnet [12:39:26] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.dnsdisc.Discovery should not allow pooling active/passive services in both datacenters - https://phabricator.wikimedia.org/T315560 (10Volans) Wasn't that the required behaviour to allow to failover an active/passive service without downtime? A... 
[12:43:33] (03CR) 10JMeybohm: [C: 03+1] image-suggestion: temporarily enable debug logging in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/828497 (https://phabricator.wikimedia.org/T313973) (owner: 10Hnowlan) [12:52:11] (03CR) 10Vgutierrez: "Tested in cp6016:" [puppet] - 10https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) (owner: 10Vgutierrez) [12:52:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2002.codfw.wmnet [12:53:04] (03PS8) 10DCausse: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson) [12:53:12] (03CR) 10DCausse: cirrus: Handle transition to elasticsearch 7.10 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson) [12:54:05] (03PS9) 10DCausse: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson) [12:55:05] (03CR) 10CI reject: [V: 04-1] cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson) [12:55:11] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [12:55:42] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) (owner: 10Vgutierrez) [12:56:08] (03PS1) 10Jcrespo: bacula: Migrate new database dump long term backups to backup[12]008 [puppet] - 10https://gerrit.wikimedia.org/r/828515 (https://phabricator.wikimedia.org/T313582) [12:57:22] !log test trafficserver: Hide non session cookies during cache lookup in drmrs - T316338 [12:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:27] T316338: 
strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 [12:57:41] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.dnsdisc.Discovery should not allow pooling active/passive services in both datacenters - https://phabricator.wikimedia.org/T315560 (10JMeybohm) Ah, okay. Makes sense in that case. I think I was assuming pool() would check because depool() does... [12:58:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2002.codfw.wmnet [12:59:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow3002.esams.wmnet [12:59:35] (03PS10) 10DCausse: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220831T1300) [13:00:05] TheresNoTime and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:12] o/ [13:00:15] hai [13:00:22] does anyone want to deploy, or should i? [13:00:37] I can deploy [13:00:47] TheresNoTime: what’s your current deployer status? ^^ [13:01:26] she's a deployer, too [13:01:30] I can self-deploy if you'd prefer? [13:01:40] sure! 
[13:01:46] and nice \o/ [13:02:03] Lucas_WMDE: did you want to do your patch first [13:02:10] nah, it’s not important at all [13:02:19] just a no-op to simplify the config a tiny bit [13:02:32] ack, will deploy mine :) [13:03:19] (03CR) 10Samtar: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828059 (https://phabricator.wikimedia.org/T314828) (owner: 10Samtar) [13:04:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3002.esams.wmnet [13:04:54] (03Merged) 10jenkins-bot: InitialiseSettings.php: Enable Realtime Preview on Group 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828059 (https://phabricator.wikimedia.org/T314828) (owner: 10Samtar) [13:05:54] (03PS2) 10Jcrespo: bacula: Migrate new database dump long term backups to backup[12]008 [puppet] - 10https://gerrit.wikimedia.org/r/828515 (https://phabricator.wikimedia.org/T313582) [13:05:55] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: IcingaHosts.wait_for_downtimed() does not honor dry_run - https://phabricator.wikimedia.org/T315537 (10Volans) a:03SLyngshede-WMF Indeed, I can confirm the issue. The problem comes from the a bit //automagic// dry-run handling in the `@retry` decorato... 
[13:07:18] !log klausman@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw [13:07:29] (syncing mine) [13:10:16] (03CR) 10Muehlenhoff: C:mediawiki::packages::fonts: Order install and config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert) [13:10:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:11:10] !log samtar@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:828059|InitialiseSettings.php: Enable Realtime Preview on Group 2 (T314828)]] (duration: 03m 54s) [13:11:14] T314828: Enable Realtime preview on group2 - https://phabricator.wikimedia.org/T314828 [13:11:20] Lucas_WMDE: all yours :) [13:11:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:11:26] ok :) [13:11:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:11:52] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [13:11:53] I’ll first verify that I can test this behavior on mwdebug [13:12:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:12:32] RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-08-30 07:48:35 (3433 GiB, +1.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [13:12:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4002.ulsfo.wmnet [13:13:19] !log installing zlib security updates on bullseye [13:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:22] ok, that works [13:14:38] if I comment out the assignment that I’m tampering with, then searchEntities.php for “English” returns slightly different results [13:14:51] so I should be able to use that to test that the assignments are still effective after the config file changes [13:14:57] (03PS3) 10Lucas Werkmeister (WMDE): Only set WikibaseCirrusSearch settings if wmg globals are set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806872 [13:15:00] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Only set WikibaseCirrusSearch settings if wmg globals are set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806872 (owner: 10Lucas Werkmeister (WMDE)) [13:16:22] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [13:16:41] (03Merged) 10jenkins-bot: Only set WikibaseCirrusSearch settings if wmg globals are set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806872 (owner: 10Lucas Werkmeister (WMDE)) [13:17:15] looks good on mwdebug, syncing [13:18:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4002.ulsfo.wmnet [13:18:23] (03PS3) 
10Lucas Werkmeister (WMDE): Directly set WikibaseCirrusSearch settings in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806873 [13:18:26] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Directly set WikibaseCirrusSearch settings in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806873 (owner: 10Lucas Werkmeister (WMDE)) [13:18:59] (03PS1) 10Andrew Bogott: P:systemd::timedated: exclude /mnt from accessible paths [puppet] - 10https://gerrit.wikimedia.org/r/828526 (https://phabricator.wikimedia.org/T310643) [13:19:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow5002.eqsin.wmnet [13:20:23] (03Merged) 10jenkins-bot: Directly set WikibaseCirrusSearch settings in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806873 (owner: 10Lucas Werkmeister (WMDE)) [13:21:15] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/SearchSettingsForWikibase.php: Config: [[gerrit:806872|Only set WikibaseCirrusSearch settings if wmg globals are set]] (duration: 03m 42s) [13:22:29] second change also looks good on mwdebug, syncing [13:22:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:22:51] (03CR) 10Clément Goubert: [V: 03+1] C:mediawiki::packages::fonts: Order install and config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert) [13:23:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:23:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:24:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:25:02] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [13:25:56] PROBLEM - Check systemd state on netflow4002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:26:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5002.eqsin.wmnet [13:26:45] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:806873|Directly set WikibaseCirrusSearch settings in IS.php]] (1/3) (duration: 03m 47s) [13:28:10] PROBLEM - Check systemd state on db2173 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:28:10] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:28:58] (KubernetesRsyslogDown) firing: rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:29:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:29:58] (KubernetesCalicoDown) firing: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:30:09] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (10fgiunchedi) In Matthew's absence I can confirm that the drives are hot swappable @Jclark-ctr ! 
[13:30:28] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:30:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:30:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:30:37] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:806873|Directly set WikibaseCirrusSearch settings in IS.php]] (2/3) (duration: 03m 39s) [13:30:49] (03PS3) 10Lucas Werkmeister (WMDE): Remove unused assignments from SearchSettingsForWikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806874 [13:31:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:31:41] (03PS3) 10Clément Goubert: C:mediawiki::packages::fonts: Install fontconfig-config [puppet] - 10https://gerrit.wikimedia.org/r/828510 [13:31:50] !log restarting exim on the MXes to pick up zlib update [13:31:52] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [13:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:47] (03CR) 10Clément Goubert: C:mediawiki::packages::fonts: Install fontconfig-config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert) [13:33:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert) [13:33:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet [13:34:26] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37062/console" [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert) 
[13:34:45] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/SearchSettingsForWikidata.php: Config: [[gerrit:806873|Directly set WikibaseCirrusSearch settings in IS.php]] (3/3) (duration: 03m 42s) [13:34:52] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 111.6 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [13:34:58] (KubernetesCalicoDown) resolved: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:35:23] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove unused assignments from SearchSettingsForWikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806874 (owner: 10Lucas Werkmeister (WMDE)) [13:36:31] (03Merged) 10jenkins-bot: Remove unused assignments from SearchSettingsForWikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806874 (owner: 10Lucas Werkmeister (WMDE)) [13:37:32] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [13:37:32] checking on mwdebug [13:37:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet [13:37:46] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] C:mediawiki::packages::fonts: Install fontconfig-config [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert) [13:39:38] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:42:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:42:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:43:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:43:57] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/SearchSettingsForWikibase.php: Config: [[gerrit:806874|Remove unused assignments from SearchSettingsForWikibase.php]] (1/2) (duration: 03m 38s) [13:44:28] PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 102.2 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [13:47:06] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [13:47:11] There's an issue with the patch https://gerrit.wikimedia.org/r/828510 I just merged, reverting [13:47:45] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/SearchSettingsForWikidata.php: Config: [[gerrit:806874|Remove unused assignments from SearchSettingsForWikibase.php]] (2/2) 
(duration: 03m 33s) [13:47:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Jclark-ctr) cloudvirt1054 E4 U29 Port 36/37 Cableid 20220045 / 20220041 cloudvirt1055 E4 U30 Port 38/39 Cableid 20220046... [13:48:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Jclark-ctr) [13:48:44] I think I’m done [13:48:49] anything else to deploy? [13:49:14] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [13:49:28] !log UTC afternoon backport+config window done [13:49:29] (03PS1) 10Clément Goubert: Revert "C:mediawiki::packages::fonts: Install fontconfig-config" [puppet] - 10https://gerrit.wikimedia.org/r/828536 [13:49:38] RECOVERY - Check systemd state on db2173 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:57] (03PS1) 10JMeybohm: image-suggestion: Update to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/828537 (https://phabricator.wikimedia.org/T313973) [13:51:10] PROBLEM - Check systemd state on mw1383 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:42] (03CR) 10Clément Goubert: "There's an issue with the previous patch, it keeps desinstalling and reinstalling fontconfig-config at each run on parse hosts. 
I'll hunt " [puppet] - 10https://gerrit.wikimedia.org/r/828536 (owner: 10Clément Goubert) [13:52:55] (03PS2) 10Clément Goubert: Revert "C:mediawiki::packages::fonts: Install fontconfig-config" [puppet] - 10https://gerrit.wikimedia.org/r/828536 [13:54:28] PROBLEM - Check systemd state on elastic1080 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:25] (03CR) 10Clément Goubert: [C: 03+2] Revert "C:mediawiki::packages::fonts: Install fontconfig-config" [puppet] - 10https://gerrit.wikimedia.org/r/828536 (owner: 10Clément Goubert) [14:00:31] (03CR) 10Muehlenhoff: P:systemd::timedated: exclude /mnt from accessible paths (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828526 (https://phabricator.wikimedia.org/T310643) (owner: 10Andrew Bogott) [14:02:39] (03CR) 10Hnowlan: [C: 03+1] image-suggestion: Update to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/828537 (https://phabricator.wikimedia.org/T313973) (owner: 10JMeybohm) [14:03:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:04:17] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] "Reverted in https://gerrit.wikimedia.org/r/c/operations/puppet/+/828536" [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert) [14:06:31] 10SRE, 10Traffic, 10Patch-For-Review: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ssingh) Downgrading and reimaging drmrs ATS9 hosts cp6008 and cp6016 to ATS8 for a week so that we can have comparative data later when we upgrade all instances to ATS9 in drmrs. 
[14:07:01] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10Krinkle) [14:07:30] (03CR) 10Muehlenhoff: [C: 03+1] C:mediawiki::packages::fonts: Install fontconfig-config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert) [14:08:13] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6008.drmrs.wmnet with OS buster [14:08:21] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6008.drmrs.wmnet with OS buster [14:08:50] !log deploy trafficserver: Hide non session cookies during cache lookup globally - T316338 [14:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:54] T316338: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338 [14:08:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:09:36] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp6008.drmrs.wmnet with OS buster [14:09:43] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6008.drmrs.wmnet with OS buster executed with errors: - cp6008 (**FAIL... 
[14:10:47] (03CR) 10Jcrespo: [C: 03+2] bacula: Migrate new database dump long term backups to backup[12]008 [puppet] - 10https://gerrit.wikimedia.org/r/828515 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [14:11:12] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [14:11:46] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp6008.drmrs.wmnet,service=ats-tls [14:11:46] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp6008.drmrs.wmnet,service=ats-be [14:11:47] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp6008.drmrs.wmnet,service=varnish-fe [14:13:56] (03PS1) 10Ssingh: hiera: downgrade cp6008 to ATS8 [puppet] - 10https://gerrit.wikimedia.org/r/828540 (https://phabricator.wikimedia.org/T309651) [14:14:13] (03CR) 10JMeybohm: [C: 03+2] image-suggestion: Update to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/828537 (https://phabricator.wikimedia.org/T313973) (owner: 10JMeybohm) [14:14:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Jclark-ctr) [14:15:03] (03CR) 10Ssingh: [C: 03+2] hiera: downgrade cp6008 to ATS8 [puppet] - 10https://gerrit.wikimedia.org/r/828540 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [14:15:50] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6008.drmrs.wmnet with OS buster [14:16:13] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6008.drmrs.wmnet with OS buster [14:18:26] RECOVERY - ElasticSearch 
numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [14:18:46] (03PS1) 10Ssingh: hiera: downgrade cp6016 to ATS8 [puppet] - 10https://gerrit.wikimedia.org/r/828543 (https://phabricator.wikimedia.org/T309651) [14:19:33] (03Merged) 10jenkins-bot: image-suggestion: Update to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/828537 (https://phabricator.wikimedia.org/T313973) (owner: 10JMeybohm) [14:22:49] (03PS1) 10Clément Goubert: C:mediawiki::packages::fonts: match conf and class ensures [puppet] - 10https://gerrit.wikimedia.org/r/828544 [14:22:59] (03CR) 10Hnowlan: [C: 03+2] image-suggestion: temporarily enable debug logging in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/828497 (https://phabricator.wikimedia.org/T313973) (owner: 10Hnowlan) [14:23:03] (03PS2) 10Hnowlan: image-suggestion: temporarily enable debug logging in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/828497 (https://phabricator.wikimedia.org/T313973) [14:23:26] (03PS2) 10Clément Goubert: C:mediawiki::packages::fonts: match conf and class ensures [puppet] - 10https://gerrit.wikimedia.org/r/828544 [14:24:10] (03PS1) 10JMeybohm: Merge cert-manager/sample-external-issuer@55b043b [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/828545 (https://phabricator.wikimedia.org/T310486) [14:26:05] (03PS3) 10Clément Goubert: C:mediawiki::packages::fonts: match conf and class ensures [puppet] - 10https://gerrit.wikimedia.org/r/828544 [14:27:23] (03PS2) 10JMeybohm: Merge cert-manager/sample-external-issuer@55b043b [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/828545 (https://phabricator.wikimedia.org/T310486) [14:28:16] (03CR) 10Andrew Bogott: P:systemd::timedated: exclude /mnt from accessible paths (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828526 (https://phabricator.wikimedia.org/T310643) (owner: 10Andrew 
Bogott) [14:28:28] (03PS2) 10Andrew Bogott: P:systemd::timedated: exclude /mnt from accessible paths [puppet] - 10https://gerrit.wikimedia.org/r/828526 (https://phabricator.wikimedia.org/T310643) [14:29:11] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37064/console" [puppet] - 10https://gerrit.wikimedia.org/r/828544 (owner: 10Clément Goubert) [14:33:25] (03CR) 10Hnowlan: [C: 03+2] image-suggestion: temporarily enable debug logging in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/828497 (https://phabricator.wikimedia.org/T313973) (owner: 10Hnowlan) [14:37:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6008.drmrs.wmnet with reason: host reimage [14:37:43] (03Merged) 10jenkins-bot: image-suggestion: temporarily enable debug logging in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/828497 (https://phabricator.wikimedia.org/T313973) (owner: 10Hnowlan) [14:40:46] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6008.drmrs.wmnet with reason: host reimage [14:41:06] !log klausman@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-codfw [14:42:27] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [14:42:33] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM. Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/828471 (https://phabricator.wikimedia.org/T316747) (owner: 10Filippo Giunchedi) [14:42:52] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [14:43:17] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [14:43:26] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM. Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/828472 (https://phabricator.wikimedia.org/T316747) (owner: 10Filippo Giunchedi) [14:43:58] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [14:44:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Jclark-ctr) druid1009 A5 U06 druid1010 B5 U13 druid1011 D6 U37 [14:45:27] (03CR) 10Giuseppe Lavagetto: [C: 03+1] C:mediawiki::packages::fonts: match conf and class ensures [puppet] - 10https://gerrit.wikimedia.org/r/828544 (owner: 10Clément Goubert) [14:47:00] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/image-suggestion: apply [14:47:45] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply [14:48:39] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] C:mediawiki::packages::fonts: match conf and class ensures [puppet] - 10https://gerrit.wikimedia.org/r/828544 (owner: 10Clément Goubert) [14:48:51] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply [14:49:26] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply [14:50:00] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-serve-ctrl2002.codfw.wmnet with reason: Reboot to pick up kernel 5.10.136 (T316185) [14:50:14] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-serve-ctrl2002.codfw.wmnet with reason: Reboot to pick up kernel 5.10.136 (T316185) [14:54:08] RECOVERY - k8s requests count to the API on ml-serve-ctrl2001 is OK: (C)100 ge (W)50 ge 32.26 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [14:54:20] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - 
kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:54:20] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:56:14] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:56:14] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:56:37] !log klausman@cumin1001 START - Cookbook sre.hosts.remove-downtime for ml-serve-ctrl2002.codfw.wmnet [14:56:38] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ml-serve-ctrl2002.codfw.wmnet [14:56:44] !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-serve-ctrl2001.codfw.wmnet with reason: Reboot to pick up kernel 5.10.136 (T316185) [14:56:57] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-serve-ctrl2001.codfw.wmnet with reason: Reboot to pick up kernel 5.10.136 (T316185) [15:00:38] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6008.drmrs.wmnet with OS buster [15:00:47] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6008.drmrs.wmnet with OS buster completed: - cp6008 (**WARN**) - Dow... 
[15:01:03] !log klausman@cumin1001 START - Cookbook sre.hosts.remove-downtime for ml-serve-ctrl2001.codfw.wmnet [15:01:04] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ml-serve-ctrl2001.codfw.wmnet [15:04:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:04:04] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp6008.drmrs.wmnet,service=ats-tls [15:04:04] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp6008.drmrs.wmnet,service=ats-be [15:04:05] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp6008.drmrs.wmnet,service=varnish-fe [15:05:49] (03PS1) 10Jcrespo: bacula: Add new hosts backup1009 & backup2009 as new storage servers [puppet] - 10https://gerrit.wikimedia.org/r/828559 (https://phabricator.wikimedia.org/T313582) [15:06:24] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp6016.drmrs.wmnet,service=ats-tls [15:06:24] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp6016.drmrs.wmnet,service=ats-be [15:06:24] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp6016.drmrs.wmnet,service=varnish-fe [15:06:38] (03CR) 10CI reject: [V: 04-1] bacula: Add new hosts backup1009 & backup2009 as new storage servers [puppet] - 10https://gerrit.wikimedia.org/r/828559 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [15:06:42] (03CR) 10Ssingh: [C: 03+2] hiera: downgrade cp6016 to ATS8 [puppet] - 10https://gerrit.wikimedia.org/r/828543 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh) [15:07:28] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host 
cp6016.drmrs.wmnet with OS buster [15:07:36] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6016.drmrs.wmnet with OS buster [15:07:52] (03PS1) 10Volans: CHANGELOG: add changelogs for release v3.2.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/828560 [15:08:30] (03CR) 10Volans: [C: 03+2] "changelog for new release, self-merging" [software/spicerack] - 10https://gerrit.wikimedia.org/r/828560 (owner: 10Volans) [15:08:35] (03PS2) 10Jcrespo: bacula: Add new hosts backup1009 & backup2009 as new storage servers [puppet] - 10https://gerrit.wikimedia.org/r/828559 (https://phabricator.wikimedia.org/T313582) [15:09:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:09:33] (03CR) 10CI reject: [V: 04-1] bacula: Add new hosts backup1009 & backup2009 as new storage servers [puppet] - 10https://gerrit.wikimedia.org/r/828559 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [15:11:32] (03PS3) 10Jcrespo: bacula: Add new hosts backup1009 & backup2009 as new storage servers [puppet] - 10https://gerrit.wikimedia.org/r/828559 (https://phabricator.wikimedia.org/T313582) [15:14:10] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [15:15:46] (03PS1) 10Vgutierrez: trafficserver: Replace session cookies with Token=1 iff V:C isn't there [puppet] - 10https://gerrit.wikimedia.org/r/828564 (https://phabricator.wikimedia.org/T316338) [15:17:34] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v3.2.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/828560 (owner: 10Volans) [15:24:01] (03CR) 10Jcrespo: [C: 03+2] bacula: Add new hosts backup1009 & backup2009 as new storage servers [puppet] - 10https://gerrit.wikimedia.org/r/828559 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [15:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:27:42] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6016.drmrs.wmnet with reason: host reimage [15:28:20] (03PS3) 10Jelto: sre.gitlab.reboot-runner: add cookbook to restart gitlab-runners [cookbooks] - 10https://gerrit.wikimedia.org/r/827456 (https://phabricator.wikimedia.org/T295481) [15:31:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6016.drmrs.wmnet with reason: host reimage [15:32:46] (03PS1) 10Volans: Upstream release v3.2.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/828568 [15:35:04] 10SRE, 10Image-Suggestions, 10Structured-Data-Backlog (Current Work): [M] Schedule image suggestions notifications - https://phabricator.wikimedia.org/T300024 (10matthiasmullie) 05Open→03Resolved [15:35:44] (03PS1) 10Ahmon Dancy: Revert "Turn mw_releases into a list" [puppet] - 10https://gerrit.wikimedia.org/r/828586 [15:36:28] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [15:37:36] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:37:50] (03PS3) 10Clément Goubert: C:ipmi::monitor: Order service after package install [puppet] - 10https://gerrit.wikimedia.org/r/828494 [15:39:32] (03Abandoned) 10Vgutierrez: trafficserver: Replace session cookies with Token=1 iff V:C isn't there [puppet] - 10https://gerrit.wikimedia.org/r/828564 (https://phabricator.wikimedia.org/T316338) (owner: 10Vgutierrez) [15:46:06] (03CR) 10Volans: [C: 03+2] Upstream release v3.2.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/828568 (owner: 10Volans) [15:52:14] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [15:54:09] (03Merged) 10jenkins-bot: Upstream release v3.2.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/828568 (owner: 10Volans) [15:54:25] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on ganeti2015.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [15:54:31] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [15:54:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti2015.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [15:55:35] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6016.drmrs.wmnet with OS buster [15:55:44] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host 
cp6016.drmrs.wmnet with OS buster completed: - cp6016 (**WARN**) - Dow... [15:56:58] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:56:59] <_joe_> !log updated php 7.4 in all of production T316691 [15:57:01] (03CR) 10Muehlenhoff: "This looks fine, but please don't merge yet. This is a fleet-wide available service and I need to first doublecheck that we don't run into" [puppet] - 10https://gerrit.wikimedia.org/r/828526 (https://phabricator.wikimedia.org/T310643) (owner: 10Andrew Bogott) [15:57:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:04] T316691: Vue: reimplement LanguageSelector clear strategy - https://phabricator.wikimedia.org/T316691 [15:57:16] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [15:58:46] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [16:00:26] !log uploaded spicerack_3.2.1 to apt.wikimedia.org bullseye-wikimedia [16:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:36] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp6016.drmrs.wmnet,service=ats-tls [16:00:37] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp6016.drmrs.wmnet,service=ats-be [16:00:37] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp6016.drmrs.wmnet,service=varnish-fe [16:07:47] (03PS1) 10Jcrespo: Migrate production backups from backup1001 to backup1009 [puppet] - 10https://gerrit.wikimedia.org/r/828575 (https://phabricator.wikimedia.org/T313582) [16:08:21] (03CR) 10CI reject: [V: 04-1] Migrate production backups from backup1001 to backup1009 [puppet] - 10https://gerrit.wikimedia.org/r/828575 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [16:08:34] (03PS2) 10Jcrespo: bacula: Migrate production backups from backup1001 to backup1009 [puppet] - 10https://gerrit.wikimedia.org/r/828575 (https://phabricator.wikimedia.org/T313582) [16:09:09] (03CR) 10CI reject: [V: 04-1] bacula: Migrate production backups from backup1001 to backup1009 [puppet] - 10https://gerrit.wikimedia.org/r/828575 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [16:09:34] (03PS3) 10Jcrespo: bacula: Migrate production backups from backup1001 to backup1009 [puppet] - 10https://gerrit.wikimedia.org/r/828575 (https://phabricator.wikimedia.org/T313582) [16:11:16] (03CR) 10Jaime Nuche: [C: 04-1] "From Joe's latest explanation, we won't be able to know the namespace names in advance. 
That means key lookup is not required, we will nee" [puppet] - 10https://gerrit.wikimedia.org/r/828586 (owner: 10Ahmon Dancy) [16:19:00] (03PS4) 10Jcrespo: bacula: Migrate production backups from backup1001 to backup1009 [puppet] - 10https://gerrit.wikimedia.org/r/828575 (https://phabricator.wikimedia.org/T313582) [16:21:43] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:52] (03CR) 10Jcrespo: [C: 03+2] bacula: Migrate production backups from backup1001 to backup1009 [puppet] - 10https://gerrit.wikimedia.org/r/828575 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [16:34:17] (03Abandoned) 10Ahmon Dancy: Revert "Turn mw_releases into a list" [puppet] - 10https://gerrit.wikimedia.org/r/828586 (owner: 10Ahmon Dancy) [16:35:56] (03PS1) 10Ahmon Dancy: Revert comment change [puppet] - 10https://gerrit.wikimedia.org/r/828583 [16:37:45] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:43:30] (03PS1) 10Jcrespo: bacula: Fix old references to old pools [puppet] - 10https://gerrit.wikimedia.org/r/828584 (https://phabricator.wikimedia.org/T313582) [16:44:06] (03CR) 10CI reject: [V: 04-1] bacula: Fix old references to old pools [puppet] - 10https://gerrit.wikimedia.org/r/828584 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [16:47:11] (03PS1) 10CDanis: dbctl: python 3.10 & x2 section [software/conftool] - 10https://gerrit.wikimedia.org/r/828585 [16:47:13] (03PS1) 10CDanis: dbctl: Add omit_replicas_in_mwconfig section attribute [software/conftool] - 10https://gerrit.wikimedia.org/r/828606 (https://phabricator.wikimedia.org/T316482) [16:47:55] (03PS2) 10Jcrespo: bacula: Fix old references to old pools [puppet] - 10https://gerrit.wikimedia.org/r/828584 
(https://phabricator.wikimedia.org/T313582) [16:48:11] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:48:52] (03CR) 10CDanis: "Pre-review -- will add docs once you think the patch is sound." [software/conftool] - 10https://gerrit.wikimedia.org/r/828606 (https://phabricator.wikimedia.org/T316482) (owner: 10CDanis) [16:49:25] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [16:50:01] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:52:01] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:17] (03PS3) 10Jcrespo: bacula: Fix old references to old pools [puppet] - 10https://gerrit.wikimedia.org/r/828584 (https://phabricator.wikimedia.org/T313582) [16:52:41] !log installing spicerack 3.2.1 on cumin2002 [16:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:11] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [16:56:17] (03CR) 10Jcrespo: [C: 03+2] bacula: Fix old references to old pools [puppet] - 10https://gerrit.wikimedia.org/r/828584 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [16:58:35] (03PS1) 10Volans: sre.hardware.upgrade-firmware: sort drivers files [cookbooks] - 
10https://gerrit.wikimedia.org/r/828609 [17:04:05] PROBLEM - cassandra-c CQL 10.64.48.153:9042 on restbase1033 is CRITICAL: connect to address 10.64.48.153 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [17:04:21] PROBLEM - cassandra-a CQL 10.64.48.151:9042 on restbase1033 is CRITICAL: connect to address 10.64.48.151 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [17:04:39] PROBLEM - cassandra-b service on restbase1033 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:04:45] PROBLEM - cassandra-c service on restbase1033 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:04:53] PROBLEM - cassandra-b CQL 10.64.48.152:9042 on restbase1033 is CRITICAL: connect to address 10.64.48.152 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [17:04:57] PROBLEM - cassandra-b SSL 10.64.48.152:7001 on restbase1033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:05:09] PROBLEM - cassandra-c SSL 10.64.48.153:7001 on restbase1033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:06:14] !log installing spicerack 3.2.1 on cumin1001 [17:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:17] (03CR) 10Volans: [C: 03+1] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/828585 (owner: 10CDanis) [17:07:48] (03CR) 10CDanis: [C: 03+2] dbctl: python 3.10 & x2 section [software/conftool] - 10https://gerrit.wikimedia.org/r/828585 (owner: 10CDanis) [17:08:57] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on 
search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [17:10:05] (03Merged) 10jenkins-bot: dbctl: python 3.10 & x2 section [software/conftool] - 10https://gerrit.wikimedia.org/r/828585 (owner: 10CDanis) [17:17:28] (03PS1) 10Jcrespo: bacula: Fix backups with custom jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/828610 (https://phabricator.wikimedia.org/T313582) [17:29:13] (KubernetesRsyslogDown) firing: rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:33:05] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [17:39:36] (03CR) 10Jcrespo: [C: 03+2] bacula: Fix backups with custom jobdefaults [puppet] - 10https://gerrit.wikimedia.org/r/828610 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [17:43:09] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [17:47:59] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [17:54:23] (03PS1) 10DDesouza: Deploy Research Incentive Survey to enwiki on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828613 (https://phabricator.wikimedia.org/T316464) [17:54:51] jouncebot: now [17:54:51] No deployments scheduled for the next 0 hour(s) and 5 minute(s) [17:58:17] (03PS1) 10DDesouza: Deploy Research Incentive Survey to idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828614 (https://phabricator.wikimedia.org/T316466) [18:00:05] dduvall and hashar: Dear deployers, time to do the Train log triage with CPT deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220831T1800). [18:00:05] dduvall and hashar: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220831T1800). [18:06:57] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [18:08:06] (03PS1) 10Bernard Wang: Remove Vector grid config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828616 [18:08:29] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:08:43] (03PS2) 10Bernard Wang: Remove Vector grid config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828616 (https://phabricator.wikimedia.org/T313559) [18:12:06] (03CR) 10Jdlrobson: [C: 03+1] "This is unused config so can be removed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/828616 (https://phabricator.wikimedia.org/T313559) (owner: 10Bernard Wang) [18:18:31] (03PS1) 10TrainBranchBot: group1 wikis to 1.39.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828620 (https://phabricator.wikimedia.org/T314188) [18:18:33] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.39.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828620 (https://phabricator.wikimedia.org/T314188) (owner: 10TrainBranchBot) [18:19:22] (03Abandoned) 10AOkoth: gitlab: copy ssh host keys for failover [puppet] - 10https://gerrit.wikimedia.org/r/820163 (https://phabricator.wikimedia.org/T296713) (owner: 10AOkoth) [18:19:35] (03Abandoned) 10AOkoth: vrts: create /opt/otrs folder [puppet] - 10https://gerrit.wikimedia.org/r/828078 (owner: 10AOkoth) [18:19:39] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828620 (https://phabricator.wikimedia.org/T314188) (owner: 10TrainBranchBot) [18:21:47] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:22:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:23:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:23:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:24:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:24:16] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.27 refs T314188 [18:24:20] T314188: 1.39.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T314188 [18:27:54] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.27 refs T314188 (duration: 03m 37s) [18:29:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply [18:30:09] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [18:33:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:33:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:33:32] (03PS11) 10Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 [18:35:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T314041)', diff saved to https://phabricator.wikimedia.org/P33721 and previous config saved to /var/cache/conftool/dbconfig/20220831-183513-ladsgroup.json [18:35:19] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [18:36:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:38:19] (03PS1) 10Dzahn: prometheus: fix/invert comments about matching in blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/828622 [18:39:02] 10SRE-OnFire, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Wikimedia-Incident: Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 (10Zabe) [18:39:50] (03PS2) 10Dzahn: prometheus: fix/invert comments about matching in blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/828622 [18:40:21] (03CR) 10Dzahn: [C: 03+2] prometheus::blackbox::http: add/edit parameter comments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807176 (owner: 10Dzahn) [18:46:23] RECOVERY - cassandra-a CQL 10.64.48.151:9042 on restbase1033 is OK: TCP OK - 0.000 second response time on 10.64.48.151 port 9042 https://phabricator.wikimedia.org/T93886 [18:46:58] (03PS1) 10Andrew Bogott: Refactor 
profile::wmcs::backup_glance_images to run on a dedicated backup host [puppet] - 10https://gerrit.wikimedia.org/r/828623 (https://phabricator.wikimedia.org/T316738) [18:47:33] (03CR) 10CI reject: [V: 04-1] Refactor profile::wmcs::backup_glance_images to run on a dedicated backup host [puppet] - 10https://gerrit.wikimedia.org/r/828623 (https://phabricator.wikimedia.org/T316738) (owner: 10Andrew Bogott) [18:50:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P33722 and previous config saved to /var/cache/conftool/dbconfig/20220831-185020-ladsgroup.json [18:52:23] (03PS2) 10Andrew Bogott: Refactor profile::wmcs::backup_glance_images to run on a dedicated backup host [puppet] - 10https://gerrit.wikimedia.org/r/828623 (https://phabricator.wikimedia.org/T316738) [18:52:34] (03CR) 10Dzahn: [C: 03+1] "This looks reasonable to me. Compiler output is a bit hard to diff but production networks for install servers makes sense to me and if an" [puppet] - 10https://gerrit.wikimedia.org/r/827964 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [18:56:04] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719 [18:56:09] T316719: Upgrade codfw cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T316719 [18:56:35] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (10Jclark-ctr) Thanks will be done within a hour [18:56:56] (03CR) 10Dzahn: [C: 04-1] "so Gerrit thinks it's my turn here and it's in my attention set and that's why it keeps showing it to as my todo / it's waiting for me. 
He" [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: 10Brennen Bearnes) [18:58:51] (03PS3) 10Ryan Kemper: Relax elasticsearch master node detection [puppet] - 10https://gerrit.wikimedia.org/r/828403 (owner: 10DCausse) [18:59:03] (03CR) 10Dzahn: [C: 04-1] "ack:) I consider it stalled but don't want to abandon it. Gerrit thinks it's "my turn" so I am replying to get it out of that attention se" [puppet] - 10https://gerrit.wikimedia.org/r/824412 (owner: 10Jbond) [19:00:53] (03CR) 10Dzahn: "ACK, currently on-ice." [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: 10Jbond) [19:01:21] (03PS3) 10Andrew Bogott: Refactor profile::wmcs::backup_glance_images to run on a dedicated backup host [puppet] - 10https://gerrit.wikimedia.org/r/828623 (https://phabricator.wikimedia.org/T316738) [19:02:14] (03PS2) 10Dzahn: Revert "install_server: change partman config for gitlab" [puppet] - 10https://gerrit.wikimedia.org/r/827578 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [19:02:26] (03PS3) 10Dzahn: Revert "install_server: change partman config for gitlab" [puppet] - 10https://gerrit.wikimedia.org/r/827578 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [19:02:29] (03PS4) 10Andrew Bogott: Move profile::wmcs::backup_glance_images from cloudcontrols to backup servers [puppet] - 10https://gerrit.wikimedia.org/r/828623 (https://phabricator.wikimedia.org/T316738) [19:02:32] (03CR) 10Dzahn: [C: 03+2] Revert "install_server: change partman config for gitlab" [puppet] - 10https://gerrit.wikimedia.org/r/827578 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [19:03:04] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/pcc-worker1002/37071/" [puppet] - 10https://gerrit.wikimedia.org/r/828623 (https://phabricator.wikimedia.org/T316738) (owner: 10Andrew Bogott) [19:05:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1138', diff saved to https://phabricator.wikimedia.org/P33723 and previous config saved to /var/cache/conftool/dbconfig/20220831-190526-ladsgroup.json [19:06:02] (03PS4) 10Ryan Kemper: Relax elasticsearch master node detection [puppet] - 10https://gerrit.wikimedia.org/r/828403 (https://phabricator.wikimedia.org/T308676) (owner: 10DCausse) [19:06:08] (03CR) 10Ryan Kemper: [C: 03+2] Relax elasticsearch master node detection [puppet] - 10https://gerrit.wikimedia.org/r/828403 (https://phabricator.wikimedia.org/T308676) (owner: 10DCausse) [19:06:10] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Relax elasticsearch master node detection [puppet] - 10https://gerrit.wikimedia.org/r/828403 (https://phabricator.wikimedia.org/T308676) (owner: 10DCausse) [19:07:26] squeezes in between the merges, puppetmaster can be fast nowadays [19:08:39] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:12:35] PROBLEM - SSH on mw1311.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:15:44] !log gitlab: reimaging gitlab2003 with cookbook after reverting partman change and comment on gerrit:827578 T274463 [19:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:51] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463 [19:16:27] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye [19:18:16] !log dzahn@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host gitlab2003.wikimedia.org with OS bullseye [19:19:23] reimaging failed again [19:20:10] Ctrl+c pressed..whoops [19:20:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T314041)', diff saved to https://phabricator.wikimedia.org/P33724 and previous config saved to 
/var/cache/conftool/dbconfig/20220831-192032-ladsgroup.json [19:20:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [19:20:39] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [19:20:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [19:21:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [19:21:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [19:21:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T314041)', diff saved to https://phabricator.wikimedia.org/P33725 and previous config saved to /var/cache/conftool/dbconfig/20220831-192120-ladsgroup.json [19:21:30] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye [19:21:48] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719 [19:21:53] T316719: Upgrade codfw cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T316719 [19:22:27] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [19:24:05] ^ Merged a fix for that flapping but didn't manually run puppet, so I expect it to resolve within 10 mins or so [19:24:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10Jclark-ctr) What is the status on this one? it has been sitting for a while [19:24:58] ryankemper: ack, thanks for that [19:25:05] figured it was the maintenance [19:25:22] because codfw only [19:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [19:29:05] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [19:30:43] !log T316719 Rolling upgrade operation complete; all of elastic codfw is now on `7.10.2`. 
Next week our related cirrus changes will go out with the mediawiki deploy train in `1.39.0-wmf.28` [19:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:49] T316719: Upgrade codfw cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T316719 [19:34:50] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [19:37:20] RECOVERY - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [19:37:24] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [19:41:03] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [19:43:58] (03PS1) 10Jcrespo: bacula: Fix job and restore defaults to use the new production pool [puppet] - 10https://gerrit.wikimedia.org/r/828629 (https://phabricator.wikimedia.org/T313582) [19:46:33] (03CR) 10Jcrespo: [C: 03+2] bacula: Fix job and restore defaults to use the new production pool [puppet] - 10https://gerrit.wikimedia.org/r/828629 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [19:50:01] (03PS1) 10Ebernhardson: admin: Update my home directory [puppet] - 10https://gerrit.wikimedia.org/r/828630 [19:53:22] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (gerrit1001), No backups: 109 (an-master1002, ...), Fresh: 5 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [19:56:40] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2003.wikimedia.org with OS bullseye [19:57:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:00:05] RoanKattouw, Urbanecm, cjming, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220831T2000). [20:00:05] ebernhardson and danisztls: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] o/ [20:00:13] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (10Jclark-ctr) Replaced 2 failed drives [20:00:18] I can deploy today [20:00:20] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (10Jclark-ctr) 05Open→03Resolved [20:01:09] \o [20:01:38] urbanecm: my patch should have no visible change, it's prep for next week's train [20:01:49] ack [20:02:19] is there any order the files should be synced in? [20:02:32] also, feel free to self-deploy if you want, looks like danisztls's not around today [20:02:54] urbanecm: sure i can deploy it [20:03:10] go ahead then :) [20:05:21] sry, I'm late [20:05:49] no worries [20:05:57] urbanecm: actually i'm going to delay mine till tomorrow, i notice dcausse changed a part of it and i'm not sure which way is correct, will have to check with him and ship tomorrow [20:06:04] okay, sounds good! 
[20:06:10] so only danisztls's patch then [20:06:11] well, i think my way is correct and he thinks his is, we should agree first :) [20:06:20] yep, sounds like a good idea [20:06:27] taking over the window [20:06:31] (03PS2) 10Urbanecm: Deploy Research Incentive Survey to enwiki on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828613 (https://phabricator.wikimedia.org/T316464) (owner: 10DDesouza) [20:06:47] danisztls: your patch has zero coverage, is that intentional? [20:06:52] urbanecm: yes [20:06:54] okay [20:07:00] (03CR) 10Urbanecm: [C: 03+2] Deploy Research Incentive Survey to enwiki on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828613 (https://phabricator.wikimedia.org/T316464) (owner: 10DDesouza) [20:07:01] urbanecm: will trigger via parameter [20:07:06] makes sense [20:07:08] shipping :) [20:07:45] (03Merged) 10jenkins-bot: Deploy Research Incentive Survey to enwiki on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828613 (https://phabricator.wikimedia.org/T316464) (owner: 10DDesouza) [20:08:24] Hello team, I'll reboot the netmon1003 instance in 30 minutes for a kernel update. [20:08:55] !log bking@cumin1001 conftool action : get/pooled; selector: dnsdisc=wdqs,name=codfw [20:09:41] 10SRE, 10ops-eqiad, 10DC-Ops: ps1-e4-eqiad alerts - https://phabricator.wikimedia.org/T314027 (10Jclark-ctr) ps2-e4 had a failed network card. started rma for card. in meantime swapped card from unconfigured pdu to verify fixed [20:10:03] danisztls: should be deployed to beta soon. anything else to deploy today? [20:10:06] 10SRE, 10ops-eqiad, 10DC-Ops: ps1-e4-eqiad alerts - https://phabricator.wikimedia.org/T314027 (10Jclark-ctr) 05Open→03Resolved [20:10:09] 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10Jclark-ctr) [20:10:43] urbanecm: just this patch, thanks! 
[20:10:49] okay, then we're done :) [20:10:56] (with deployment, i mean :D) [20:11:04] ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: All failures: 1 (gerrit1001), No backups: 109 (an-master1002, ...), Fresh: 5 jobs Jcrespo known issue - backups are being refactored https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [20:13:02] RECOVERY - SSH on mw1311.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:13:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:13:14] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/826867 (owner: 10Muehlenhoff) [20:14:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:14:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:15:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:16:22] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:23:27] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye [20:24:38] (03CR) 10Dzahn: "ah, cool, I suggested creating this yesterday without realizing the patch for it already existed" [puppet] - 10https://gerrit.wikimedia.org/r/826867 (owner: 10Muehlenhoff) [20:25:43] (03CR) 10Dzahn: "I would have expected that we set a specific UID/GID that we reserve in admin module. Like for librenms and phd. Not needed here?" 
[puppet] - 10https://gerrit.wikimedia.org/r/826867 (owner: 10Muehlenhoff) [20:31:24] !log rebooting netmon1003 for a kernel upgrade [20:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:51] !log denisse@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot [20:36:58] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:37:10] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:38:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10Andrew) @Jclark-ctr These are blocked on a variety of tech decisions; no action needed in the DC for now. Thanks for checking in! [20:38:21] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0) [20:39:50] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [20:40:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [20:41:30] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:43:31] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [20:52:16] (03PS12) 10Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 [20:52:18] (03CR) 10Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 (031 comment) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson) [20:56:18] (03CR) 10Ori: [C: 03+1] "' => $body_regex_not_matches," [puppet] - 10https://gerrit.wikimedia.org/r/828622 (owner: 10Dzahn) [20:56:49] (03CR) 10Ori: [C: 03+1] prometheus: fix/invert comments about matching in blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828622 (owner: 10Dzahn) [20:57:51] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2003.wikimedia.org with OS bullseye [20:58:05] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@94b160c]: drop_old_data: Add new required param --allowed-interval [20:58:37] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (10Andrew) 05Open→03Resolved [21:00:12] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@94b160c]: drop_old_data: Add new required param --allowed-interval (duration: 02m 07s) [21:02:07] (03CR) 10Dzahn: [C: 03+2] Revert "install_server: change partman config for gitlab" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/827578 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [21:04:09] (03CR) 10Dzahn: prometheus: fix/invert comments about matching in blackbox monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828622 (owner: 10Dzahn) [21:05:41] (03CR) 10Muehlenhoff: rancid: Switch to systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826867 (owner: 10Muehlenhoff) [21:17:08] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:20:23] (03CR) 10Dzahn: [C: 03+1] rancid: Switch to systemd::sysuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826867 (owner: 10Muehlenhoff) [21:20:33] (KeyholderUnarmed) resolved: 1 
unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:20:46] (03CR) 10Dzahn: [C: 03+2] prometheus: fix/invert comments about matching in blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/828622 (owner: 10Dzahn) [21:21:03] (03CR) 10Dzahn: [C: 03+2] "just the comments for now to avoid wrong docs" [puppet] - 10https://gerrit.wikimedia.org/r/828622 (owner: 10Dzahn) [21:27:43] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37072/netmon1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/826867 (owner: 10Muehlenhoff) [21:29:13] (KubernetesRsyslogDown) firing: rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:29:46] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on etherpad1003.eqiad.wmnet with reason: kernel upgrade [21:30:13] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on etherpad1003.eqiad.wmnet with reason: kernel upgrade [21:30:14] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on etherpad1003.eqiad.wmnet with reason: kernel upgrade [21:30:18] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on etherpad1003.eqiad.wmnet with reason: kernel upgrade [21:32:53] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1028.eqiad.wmnet: Restart to apply new certificates (T316697) - eevans@cumin1001 [21:32:58] T316697: Replace expiring Cassandra SSL certificates - https://phabricator.wikimedia.org/T316697 [21:34:12] RECOVERY - cassandra-a SSL 10.64.0.209:7001 on restbase1028 is OK: SSL OK - Certificate restbase1028-a valid until 2024-08-30 21:25:17 +0000 (expires in 729 days) 
https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:36:52] RECOVERY - cassandra-b SSL 10.64.0.210:7001 on restbase1028 is OK: SSL OK - Certificate restbase1028-b valid until 2024-08-30 21:25:20 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:38:38] RECOVERY - cassandra-c SSL 10.64.0.211:7001 on restbase1028 is OK: SSL OK - Certificate restbase1028-c valid until 2024-08-30 21:25:22 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:40:41] !log run search index creation for guwwiktionary [21:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:10] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:41:31] !log run search index creation for bjnwiktionary [21:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:19] !log run search index creation for pcmwiki [21:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:28] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1028.eqiad.wmnet: Restart to apply new certificates (T316697) - eevans@cumin1001 [21:42:32] T316697: Replace expiring Cassandra SSL certificates - https://phabricator.wikimedia.org/T316697 [21:42:40] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1029.eqiad.wmnet: Restart to apply new certificates (T316697) - eevans@cumin1001 [21:44:10] RECOVERY - cassandra-a SSL 10.64.16.180:7001 on restbase1029 is OK: SSL OK - Certificate restbase1029-a valid until 2024-08-30 21:39:09 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:46:10] RECOVERY 
- cassandra-b SSL 10.64.16.181:7001 on restbase1029 is OK: SSL OK - Certificate restbase1029-b valid until 2024-08-30 21:39:11 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:48:25] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1030.eqiad.wmnet: Restart to apply new certificates (T316697) - eevans@cumin1001 [21:48:31] T316697: Replace expiring Cassandra SSL certificates - https://phabricator.wikimedia.org/T316697 [21:48:40] RECOVERY - cassandra-c SSL 10.64.16.182:7001 on restbase1029 is OK: SSL OK - Certificate restbase1029-c valid until 2024-08-30 21:39:14 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:52:18] RECOVERY - cassandra-a SSL 10.64.48.234:7001 on restbase1030 is OK: SSL OK - Certificate restbase1030-a valid until 2024-08-30 21:39:16 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:52:36] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1029.eqiad.wmnet: Restart to apply new certificates (T316697) - eevans@cumin1001 [21:53:04] RECOVERY - cassandra-b SSL 10.64.48.235:7001 on restbase1030 is OK: SSL OK - Certificate restbase1030-b valid until 2024-08-30 21:39:18 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:55:22] RECOVERY - cassandra-c SSL 10.64.48.236:7001 on restbase1030 is OK: SSL OK - Certificate restbase1030-c valid until 2024-08-30 21:39:21 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [21:55:42] ACKNOWLEDGEMENT - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service daniel_zahn https://phabricator.wikimedia.org/T310395 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:55:46] ACKNOWLEDGEMENT - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service daniel_zahn https://phabricator.wikimedia.org/T310395 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:56:37] ACKNOWLEDGEMENT - mediawiki-installation DSH group on parse1002 is CRITICAL: Host parse1002 is not in mediawiki-installation dsh group daniel_zahn https://phabricator.wikimedia.org/T312638 https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:58:20] !log mw1383 start php7.2-fpm_check_restart.service [21:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:08] !log etherpad (etherpad1003) - rebooting for maintenance [21:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:08] RECOVERY - Check systemd state on mw1383 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:00:13] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1030.eqiad.wmnet: Restart to apply new certificates (T316697) - eevans@cumin1001 [22:00:17] T316697: Replace expiring Cassandra SSL certificates - https://phabricator.wikimedia.org/T316697 [22:30:25] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:41:29] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:58:48] (03PS5) 10Krinkle: Remove references to the 'electron' service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634935 (owner: 10Giuseppe Lavagetto) [23:02:06] (03CR) 10Krinkle: [C: 03+2] 
Remove references to the 'electron' service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634935 (owner: 10Giuseppe Lavagetto) [23:03:51] (03Merged) 10jenkins-bot: Remove references to the 'electron' service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634935 (owner: 10Giuseppe Lavagetto) [23:06:01] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [23:06:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:07:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:07:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:08:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:12:10] !log krinkle@deploy1002 Change /srv/mediawiki-staging/private to remove wmgElectronSecret [23:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:13] (03PS1) 10Dduvall: phabricator: Reintroduce script to ensure correct config ownership/perms [puppet] - 10https://gerrit.wikimedia.org/r/828654 (https://phabricator.wikimedia.org/T313953) [23:12:37] (03PS4) 10Krinkle: Remove reference to unreachable eventlogging-processor service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822666 (https://phabricator.wikimedia.org/T238230) [23:13:17] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [23:13:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:13:41] !log krinkle@deploy1002 Synchronized wmf-config/: Ibdac0a (duration: 03m 44s) [23:14:09] (03CR) 10Krinkle: [C: 
03+2] Remove reference to unreachable eventlogging-processor service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822666 (https://phabricator.wikimedia.org/T238230) (owner: 10Krinkle) [23:14:09] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:14:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:14:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:15:16] (03Merged) 10jenkins-bot: Remove reference to unreachable eventlogging-processor service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822666 (https://phabricator.wikimedia.org/T238230) (owner: 10Krinkle) [23:15:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:16:22] (03CR) 10CI reject: [V: 04-1] phabricator: Reintroduce script to ensure correct config ownership/perms [puppet] - 10https://gerrit.wikimedia.org/r/828654 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [23:17:34] !log krinkle@deploy1002 Synchronized private/: (no justification provided) (duration: 03m 42s) [23:18:11] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [23:20:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:21:09] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:21:29] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:21:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:21:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:22:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:25:42] (03PS1) 10Dduvall: Run all puppetized deploy scripts as checks [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/828655 (https://phabricator.wikimedia.org/T313953) [23:27:34] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [23:29:55] (03PS2) 10Dduvall: Run all puppetized deploy scripts as checks [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/828655 (https://phabricator.wikimedia.org/T313953) [23:31:29] !log krinkle@deploy1002 Synchronized wmf-config/: I493b5e4662 (duration: 03m 43s) [23:35:59] (03PS2) 10Dduvall: phabricator: Reintroduce script to ensure correct config ownership/perms [puppet] - 10https://gerrit.wikimedia.org/r/828654 (https://phabricator.wikimedia.org/T313953) [23:36:30] (03CR) 10Dduvall: [V: 03+1] 
"Successfully tested in devtools." [puppet] - 10https://gerrit.wikimedia.org/r/828654 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [23:46:54] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [23:52:04] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:52:40] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:54:43] (03CR) 10Krinkle: [C: 03+1] "Test fixture LGTM. In particular, absence of replicas in the pool, but fine to keep in hostname map indeed." [software/conftool] - 10https://gerrit.wikimedia.org/r/828606 (https://phabricator.wikimedia.org/T316482) (owner: 10CDanis) [23:56:44] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [23:57:13] (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:57:45] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure: Upgrade deployment-prep Swift cluster to Debian Buster or newer - https://phabricator.wikimedia.org/T298253 (10Zabe)