[00:08:44] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719
[00:08:49] <stashbot>	 T316719: Upgrade codfw cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T316719
[00:12:18] <wikibugs>	 (Abandoned) Ryan Kemper: 6.8.23-wmf2 search-extra for bullseye [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/818507 (https://phabricator.wikimedia.org/T314078) (owner: Ryan Kemper)
[00:12:46] <icinga-wm>	 PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-08-23 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:13:33] <wikibugs>	 (CR) Ryan Kemper: [C: +1] prometheus-elasticsearch-exporter: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/826835 (owner: Muehlenhoff)
[00:13:35] <wikibugs>	 (CR) Ryan Kemper: [C: +2] prometheus-elasticsearch-exporter: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/826835 (owner: Muehlenhoff)
[00:14:09] <logmsgbot>	 !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719
[00:14:14] <stashbot>	 T316719: Upgrade codfw cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T316719
[00:14:50] <ryankemper>	 !log T316719 First elastic host upgraded properly. Cancelling cookbook to kick off a new rolling upgrade that will go 3 nodes at a time (first run was just one host as a sanity check)
[00:14:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:20] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719
[00:19:12] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:20:18] <wikibugs>	 ops-eqiad, decommission-hardware: decommission elastic10[48-52].eqiad.wmnet - https://phabricator.wikimedia.org/T316728 (RKemper)
[00:20:26] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:20:36] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:20:46] <icinga-wm>	 PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-08-23 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:22:54] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.242 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:23:56] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48534 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:25:39] <wikibugs>	 ops-codfw, decommission-hardware: decommission elastic2035.codfw.wmnet - https://phabricator.wikimedia.org/T316729 (RKemper)
[00:26:33] <wikibugs>	 (CR) Ryan Kemper: elastic: decom elastic2035 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/759637 (https://phabricator.wikimedia.org/T294805) (owner: Ryan Kemper)
[00:26:45] <wikibugs>	 (PS3) Ryan Kemper: elastic: decom elastic2035 [puppet] - https://gerrit.wikimedia.org/r/759637 (https://phabricator.wikimedia.org/T316729)
[00:27:07] <wikibugs>	 (CR) CI reject: [V: -1] elastic: decom elastic2035 [puppet] - https://gerrit.wikimedia.org/r/759637 (https://phabricator.wikimedia.org/T316729) (owner: Ryan Kemper)
[00:30:48] <icinga-wm>	 PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2022-08-23 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:33:42] <icinga-wm>	 PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-08-23 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:40:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:40:57] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[00:52:36] <icinga-wm>	 ACKNOWLEDGEMENT - DNS on cloudservices1003.mgmt is CRITICAL: Domain cloudservices1003.mgmt.eqiad.wmnet was not found by the server Andrew Bogott not urgent. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:21:44] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:25:40] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:28:04] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 6.284 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:34:00] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:38:46] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 4.987 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:40:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:00] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[01:44:46] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:45:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:50:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:54:24] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:55:02] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:00:04] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[02:00:26] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:02:22] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:02:42] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:10:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:24:38] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:49:27] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719
[02:49:32] <stashbot>	 T316719: Upgrade codfw cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T316719
[02:50:04] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719
[02:58:50] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:01:08] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 325 bytes in 1.986 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:11:22] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[03:11:42] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[03:16:38] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[03:17:23] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719
[03:17:28] <stashbot>	 T316719: Upgrade codfw cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T316719
[03:23:50] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719
[03:23:50] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719
[03:23:55] <stashbot>	 T316719: Upgrade codfw cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T316719
[03:27:33] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:05:18] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[04:07:46] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[04:41:12] <jinxer-wm>	 (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[04:43:53] <wikibugs>	 (PS2) Marostegui: mariadb: Promote db1159 to m3 master [puppet] - https://gerrit.wikimedia.org/r/828009 (https://phabricator.wikimedia.org/T316506)
[04:44:26] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2132,2160].codfw.wmnet,db[1117,1195].eqiad.wmnet with reason: switchover m1 T316506
[04:44:31] <stashbot>	 T316506: Switchover m3 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T316506
[04:44:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2132,2160].codfw.wmnet,db[1117,1195].eqiad.wmnet with reason: switchover m1 T316506
[04:45:27] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:46:12] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1159 to m3 master [puppet] - https://gerrit.wikimedia.org/r/828009 (https://phabricator.wikimedia.org/T316506) (owner: Marostegui)
[04:46:38] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:51:50] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[04:52:36] <marostegui>	 I am switching over phabricator db master in a few minutes
[04:56:42] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[05:00:02] <marostegui>	 !log Failover m3 from db1183 to db1159 - T316506
[05:00:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:07] <stashbot>	 T316506: Switchover m3 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T316506
[05:07:39] <wikibugs>	 (PS1) Marostegui: dbproxy1020,dbproxy1016: Add db1117:3323 back as standby [puppet] - https://gerrit.wikimedia.org/r/828384 (https://phabricator.wikimedia.org/T316742)
[05:10:04] <wikibugs>	 (CR) Marostegui: [C: +2] dbproxy1020,dbproxy1016: Add db1117:3323 back as standby [puppet] - https://gerrit.wikimedia.org/r/828384 (https://phabricator.wikimedia.org/T316742) (owner: Marostegui)
[05:12:01] <wikibugs>	 (PS1) Marostegui: db1183: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/828396
[05:17:10] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:18:28] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:19:44] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:21:08] <wikibugs>	 (CR) Marostegui: [C: +2] db1183: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/828396 (owner: Marostegui)
[05:25:26] <wikibugs>	 (PS1) Marostegui: mariadb: Move db1183 to m5 [puppet] - https://gerrit.wikimedia.org/r/828397 (https://phabricator.wikimedia.org/T316742)
[05:26:07] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Move db1183 to m5 [puppet] - https://gerrit.wikimedia.org/r/828397 (https://phabricator.wikimedia.org/T316742) (owner: Marostegui)
[05:28:52] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:31:21] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[05:35:31] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[05:35:35] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:36:15] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[05:40:01] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[05:58:15] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:09:21] <wikibugs>	 (PS1) Giuseppe Lavagetto: role::ci::master: remove admin dependency hack [puppet] - https://gerrit.wikimedia.org/r/828399
[06:11:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:18:09] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[06:21:57] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[06:32:45] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:54:10] <wikibugs>	 SRE, Infrastructure-Foundations, Observability-Metrics: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (andrea.denisse) In progress→Resolved
[06:54:16] <wikibugs>	 SRE, Infrastructure-Foundations, Observability-Metrics, netops, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (andrea.denisse)
[06:55:01] <icinga-wm>	 PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:55:19] <icinga-wm>	 PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:55:33] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 247, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:55:33] <icinga-wm>	 PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:56:49] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:56:57] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:56:57] <icinga-wm>	 PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:57:57] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 252, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:59:13] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220831T0700).
[07:00:05] <jouncebot>	 _joe_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:42] <_joe_>	 yeah it won't be deployed
[07:01:45] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[07:02:35] <icinga-wm>	 RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:02:35] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/7 UP : OSPFv3: 5/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:02:49] <icinga-wm>	 RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:04:01] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:04:11] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:04:11] <icinga-wm>	 RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:04:39] <icinga-wm>	 RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:04:59] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:14:38] <wikibugs>	 (PS1) DCausse: Relax elasticsearch marster node detection [puppet] - https://gerrit.wikimedia.org/r/828403
[07:15:31] <wikibugs>	 (PS1) Muehlenhoff: tlsproxy:ssl: Remove ssl_ecdhe_curve [puppet] - https://gerrit.wikimedia.org/r/828404
[07:15:37] <wikibugs>	 (PS2) DCausse: Relax elasticsearch master node detection [puppet] - https://gerrit.wikimedia.org/r/828403
[07:15:59] <godog>	 !log bounce thanos-compact on thanos-fe2001
[07:16:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:06] <wikibugs>	 (CR) CI reject: [V: -1] tlsproxy:ssl: Remove ssl_ecdhe_curve [puppet] - https://gerrit.wikimedia.org/r/828404 (owner: Muehlenhoff)
[07:18:01] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:18:01] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[07:18:31] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:18:39] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:18:52] <wikibugs>	 (CR) Jelto: [C: +2] admin: add tsepothoabala to analytics_privatedata_users [puppet] - https://gerrit.wikimedia.org/r/827980 (https://phabricator.wikimedia.org/T315409) (owner: Jelto)
[07:20:10] <wikibugs>	 (PS2) Muehlenhoff: tlsproxy:ssl: Remove ssl_ecdhe_curve [puppet] - https://gerrit.wikimedia.org/r/828404
[07:20:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:20:47] <wikibugs>	 (PS1) Ebernhardson: elasticsearch: Simplify routine to start masters last [software/spicerack] - https://gerrit.wikimedia.org/r/828406
[07:20:54] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff)
[07:20:57] <jinxer-wm>	 (ThanosCompactIsDown) resolved: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[07:20:59] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:22:41] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48535 bytes in 0.139 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:23:15] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:23:19] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:23:19] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.325 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:26:10] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/828404 (owner: Muehlenhoff)
[07:26:24] <wikibugs>	 (PS2) Ebernhardson: elasticsearch: Simplify routine to start masters last [software/spicerack] - https://gerrit.wikimedia.org/r/828406
[07:27:33] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[07:27:39] <wikibugs>	 (CR) Ebernhardson: [C: +1] Relax elasticsearch master node detection [puppet] - https://gerrit.wikimedia.org/r/828403 (owner: DCausse)
[07:27:43] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[07:29:19] <dcausse>	 ^ this alert will be flapping for a while (til we merge https://gerrit.wikimedia.org/r/828403)
[07:31:30] <wikibugs>	 (PS1) Marostegui: dbproxy1017,dbproxy1021: Add db1183 to m5 [puppet] - https://gerrit.wikimedia.org/r/828408 (https://phabricator.wikimedia.org/T316742)
[07:32:09] <wikibugs>	 (CR) Marostegui: [C: +2] dbproxy1017,dbproxy1021: Add db1183 to m5 [puppet] - https://gerrit.wikimedia.org/r/828408 (https://phabricator.wikimedia.org/T316742) (owner: Marostegui)
[07:32:56] <wikibugs>	 (CR) CI reject: [V: -1] elasticsearch: Simplify routine to start masters last [software/spicerack] - https://gerrit.wikimedia.org/r/828406 (owner: Ebernhardson)
[07:35:27] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:37:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
[07:39:55] <wikibugs>	 (PS3) Muehlenhoff: tlsproxy:ssl: Remove ssl_ecdhe_curve [puppet] - https://gerrit.wikimedia.org/r/828404
[07:39:55] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet
[07:40:13] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1006.eqiad.wmnet
[07:41:18] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1120 to x1 master [puppet] - https://gerrit.wikimedia.org/r/828468 (https://phabricator.wikimedia.org/T316745)
[07:41:22] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/828469 (https://phabricator.wikimedia.org/T316745)
[07:42:28] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Operations: Access request to analytics system(s) for TThoabala - https://phabricator.wikimedia.org/T315409 (Jelto) @TThoabala access was granted. Can you please verify that you have access to the requested data/notebook?
[07:43:32] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/828404 (owner: Muehlenhoff)
[07:43:42] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for the failover date" [puppet] - https://gerrit.wikimedia.org/r/828468 (https://phabricator.wikimedia.org/T316745) (owner: Gerrit maintenance bot)
[07:44:15] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for the failover date" [dns] - https://gerrit.wikimedia.org/r/828469 (https://phabricator.wikimedia.org/T316745) (owner: Gerrit maintenance bot)
[07:45:05] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:45:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
[07:47:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[07:47:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1120 for upgrade', diff saved to https://phabricator.wikimedia.org/P33705 and previous config saved to /var/cache/conftool/dbconfig/20220831-074748-root.json
[07:50:15] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[07:50:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2022.codfw.wmnet to cluster codfw and group B
[07:50:34] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1006.eqiad.wmnet
[07:51:38] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2022.codfw.wmnet to cluster codfw and group B
[07:53:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33706 and previous config saved to /var/cache/conftool/dbconfig/20220831-075310-root.json
[07:54:22] <logmsgbot>	 !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host prometheus2006.codfw.wmnet
[07:55:37] <wikibugs>	 (CR) Volans: "Minor nits reported by CI, see inline comments for the details. Beside that LGTM." [software/spicerack] - https://gerrit.wikimedia.org/r/828406 (owner: Ebernhardson)
[07:57:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[07:58:09] <wikibugs>	 SRE, Observability-Metrics: Not all carbon service start at graphite reboot - https://phabricator.wikimedia.org/T316747 (fgiunchedi)
[08:00:05] <jouncebot>	 dduvall and hashar: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220831T0800).
[08:00:58] <wikibugs>	 (PS1) Filippo Giunchedi: carbon: start at boot [puppet] - https://gerrit.wikimedia.org/r/828471 (https://phabricator.wikimedia.org/T316747)
[08:01:00] <wikibugs>	 (PS1) Filippo Giunchedi: graphite: properly shut carbon-c-relay [puppet] - https://gerrit.wikimedia.org/r/828472 (https://phabricator.wikimedia.org/T316747)
[08:01:17] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] trafficserver: Set action=never-cache for caching=websockets|pipe [puppet] - https://gerrit.wikimedia.org/r/827506 (https://phabricator.wikimedia.org/T316545) (owner: Vgutierrez)
[08:01:59] <wikibugs>	 (CR) CI reject: [V: -1] graphite: properly shut carbon-c-relay [puppet] - https://gerrit.wikimedia.org/r/828472 (https://phabricator.wikimedia.org/T316747) (owner: Filippo Giunchedi)
[08:03:41] <wikibugs>	 (PS2) Filippo Giunchedi: graphite: start carbon.service at boot [puppet] - https://gerrit.wikimedia.org/r/828471 (https://phabricator.wikimedia.org/T316747)
[08:03:43] <wikibugs>	 (PS2) Filippo Giunchedi: graphite: properly shut carbon-c-relay [puppet] - https://gerrit.wikimedia.org/r/828472 (https://phabricator.wikimedia.org/T316747)
[08:06:48] <wikibugs>	 (CR) Clément Goubert: [C: +1] "LGTM, no forced ordering is good." [puppet] - https://gerrit.wikimedia.org/r/828399 (owner: Giuseppe Lavagetto)
[08:07:22] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (Kanban): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (fnegri) a:Cmjohnson
[08:08:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 2%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33707 and previous config saved to /var/cache/conftool/dbconfig/20220831-080815-root.json
[08:09:17] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:12:03] <vgutierrez>	 !log test trafficserver: Hide non session cookies during cache lookup in cp6016 - T316338
[08:12:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:07] <stashbot>	 T316338: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338
[08:14:12] <wikibugs>	 (PS3) DCausse: elasticsearch: Simplify routine to start masters last [software/spicerack] - https://gerrit.wikimedia.org/r/828406 (owner: Ebernhardson)
[08:14:17] <wikibugs>	 (CR) DCausse: elasticsearch: Simplify routine to start masters last (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/828406 (owner: Ebernhardson)
[08:20:05] <vgutierrez>	 !log end test trafficserver: Hide non session cookies during cache lookup in cp6016 - T316338
[08:20:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:10] <stashbot>	 T316338: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338
[08:23:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33708 and previous config saved to /var/cache/conftool/dbconfig/20220831-082319-root.json
[08:26:11] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:27:55] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 24 hosts with reason: Downtiming php7.4 parsoid servers until they are ready to pool
[08:28:13] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 24 hosts with reason: Downtiming php7.4 parsoid servers until they are ready to pool
[08:28:16] <moritzm>	 !log upgrading ganeti2016/ganeti2018 to 3.0.2 T312637
[08:28:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:21] <stashbot>	 T312637: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637
[08:30:36] <wikibugs>	 (CR) DCausse: [C: +1] elasticsearch: Simplify routine to start masters last [software/spicerack] - https://gerrit.wikimedia.org/r/828406 (owner: Ebernhardson)
[08:30:38] <wikibugs>	 (PS1) Filippo Giunchedi: Remove upstart configs in /etc/init/ [puppet] - https://gerrit.wikimedia.org/r/828477
[08:32:27] <wikibugs>	 SRE-swift-storage, Maps, Product-Infrastructure-Team-Backlog: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (fgiunchedi)
[08:32:56] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1001.eqiad.wmnet
[08:33:05] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:34:05] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] "This is a symlink (half-manually - as in there is an automating script) in our current setup, so this would break our setup." [puppet] - https://gerrit.wikimedia.org/r/828078 (owner: AOkoth)
[08:36:07] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Good riddance :-)" [puppet] - https://gerrit.wikimedia.org/r/828477 (owner: Filippo Giunchedi)
[08:38:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 4%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33709 and previous config saved to /var/cache/conftool/dbconfig/20220831-083824-root.json
[08:38:34] <wikibugs>	 (Abandoned) Jelto: gitlab: rotate backups on replica [puppet] - https://gerrit.wikimedia.org/r/824739 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[08:39:06] <wikibugs>	 (Abandoned) Jelto: gitlab: use actual backup name instead of latest on replica [puppet] - https://gerrit.wikimedia.org/r/824730 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[08:39:23] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1001.eqiad.wmnet
[08:40:37] <wikibugs>	 (PS1) Jelto: Revert "install_server: change partman config for gitlab" [puppet] - https://gerrit.wikimedia.org/r/827578
[08:42:33] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:42:59] <wikibugs>	 (CR) Filippo Giunchedi: sre: followup on Kafka partition replication alerts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/826214 (https://phabricator.wikimedia.org/T309010) (owner: Filippo Giunchedi)
[08:43:07] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:43:24] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1002.eqiad.wmnet
[08:44:29] <wikibugs>	 (CR) Jelto: "Instead of reserving more space for backups we want to store less backups on GitLab hosts (and use bacula instead). This will allow the ro" [puppet] - https://gerrit.wikimedia.org/r/827578 (owner: Jelto)
[08:44:51] <wikibugs>	 (CR) Alexandros Kosiaris: Label the eight dse-k8s-worker nodes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/828052 (https://phabricator.wikimedia.org/T310177) (owner: Btullis)
[08:44:53] <wikibugs>	 (CR) Filippo Giunchedi: Remove upstart configs in /etc/init/ (1 comment) [puppet] - https://gerrit.wikimedia.org/r/828477 (owner: Filippo Giunchedi)
[08:45:23] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (MoritzMuehlenhoff)
[08:51:03] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1002.eqiad.wmnet
[08:51:07] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:53:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33710 and previous config saved to /var/cache/conftool/dbconfig/20220831-085329-root.json
[08:56:46] <wikibugs>	 SRE, Traffic: ATS isn't honoring the cache policy set in cache::alternate_domains on some cases - https://phabricator.wikimedia.org/T316545 (Jersione) What do I do
[08:58:09] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:59:33] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[09:00:37] <wikibugs>	 (CR) Klausman: [C: +1] "Spot checked a few of the host-row assignments, all LGTM" [puppet] - https://gerrit.wikimedia.org/r/828052 (https://phabricator.wikimedia.org/T310177) (owner: Btullis)
[09:01:38] <wikibugs>	 (CR) Klausman: [C: +1] Add a kublet node_label to each master of the dse-k8s cluster [puppet] - https://gerrit.wikimedia.org/r/828049 (https://phabricator.wikimedia.org/T310172) (owner: Btullis)
[09:01:55] <wikibugs>	 (CR) Volans: "This one should have been merged before deploying spicerack 3.2.0 to the cumin hosts (it's currently on cumin2002 AFAICT). Is anything blo" [puppet] - https://gerrit.wikimedia.org/r/819562 (owner: Ayounsi)
[09:02:13] <wikibugs>	 (CR) DCausse: [C: -1] cirrus: Handle transition to elasticsearch 7.10 (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/824787 (owner: Ebernhardson)
[09:02:47] <wikibugs>	 (CR) Volans: [C: -1] "comment inline" [puppet] - https://gerrit.wikimedia.org/r/819562 (owner: Ayounsi)
[09:05:15] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:08:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33711 and previous config saved to /var/cache/conftool/dbconfig/20220831-090834-root.json
[09:10:09] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff)
[09:10:47] <wikibugs>	 (CR) Clément Goubert: [C: +2] role::ci::master: remove admin dependency hack [puppet] - https://gerrit.wikimedia.org/r/828399 (owner: Giuseppe Lavagetto)
[09:11:05] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1003.eqiad.wmnet
[09:13:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf2003.codfw.wmnet
[09:14:00] <wikibugs>	 (CR) DCausse: [C: -1] cirrus: Handle transition to elasticsearch 7.10 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/824787 (owner: Ebernhardson)
[09:16:23] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[09:17:23] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1003.eqiad.wmnet
[09:17:43] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host parse1002.eqiad.wmnet with OS buster
[09:19:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2003.codfw.wmnet
[09:22:39] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[09:22:52] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2001.codfw.wmnet
[09:22:57] <wikibugs>	 (PS4) Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338)
[09:23:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33712 and previous config saved to /var/cache/conftool/dbconfig/20220831-092339-root.json
[09:26:44] <wikibugs>	 (CR) CI reject: [V: -1] trafficserver: Hide non session cookies during cache lookup [puppet] - https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) (owner: Vgutierrez)
[09:27:20] <moritzm>	 !log installing docker.io bugfix updates from Bullseye point release
[09:27:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:21] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:29:45] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on parse1002.eqiad.wmnet with reason: host reimage
[09:31:12] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:31:54] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Release-Engineering-Team: Investigate sharing releng common python code to pywmflib - https://phabricator.wikimedia.org/T316757 (hashar)
[09:33:00] <wikibugs>	 (PS1) Phuedx: beta: $wgIPInfoGeoIP2Prefix -> $wgIPInfoGeoLite2Prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/828482
[09:33:20] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on parse1002.eqiad.wmnet with reason: host reimage
[09:34:43] <logmsgbot>	 !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-fe2001.codfw.wmnet
[09:37:15] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2002.codfw.wmnet
[09:37:48] <wikibugs>	 (CR) Effie Mouzeli: [C: +1] Update calico to v3.23.3 [debs/calico] (v3.23) - https://gerrit.wikimedia.org/r/826230 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[09:38:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33713 and previous config saved to /var/cache/conftool/dbconfig/20220831-093844-root.json
[09:44:02] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2002.codfw.wmnet
[09:44:09] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2003.codfw.wmnet
[09:44:56] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:51:44] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2003.codfw.wmnet
[09:53:30] <icinga-wm>	 RECOVERY - dump of es4 in codfw on backupmon1001 is OK: Last dump for es4 at codfw (es2022) taken on 2022-08-30 07:48:35 (3454 GiB, +1.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[09:53:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33714 and previous config saved to /var/cache/conftool/dbconfig/20220831-095348-root.json
[09:57:04] <icinga-wm>	 RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-08-30 07:31:40 (3433 GiB, +1.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[09:59:12] <wikibugs>	 (PS1) Volans: peeringdb: minor fixes [software/spicerack] - https://gerrit.wikimedia.org/r/828486
[09:59:14] <wikibugs>	 (PS1) Volans: CHANGELOG: fix typos and uniform format [software/spicerack] - https://gerrit.wikimedia.org/r/828487
[09:59:47] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:00:23] <wikibugs>	 (CR) Volans: "Updated comment, based on new patch." [puppet] - https://gerrit.wikimedia.org/r/819562 (owner: Ayounsi)
[10:00:56] <wikibugs>	 (CR) Volans: "I've left some post-merge comment. I've sent a patch with some small fixes, see:" [software/spicerack] - https://gerrit.wikimedia.org/r/816701 (owner: Ayounsi)
[10:06:01] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host parse1002.eqiad.wmnet with OS buster
[10:06:04] <wikibugs>	 (CR) CI reject: [V: -1] peeringdb: minor fixes [software/spicerack] - https://gerrit.wikimedia.org/r/828486 (owner: Volans)
[10:08:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33715 and previous config saved to /var/cache/conftool/dbconfig/20220831-100853-root.json
[10:10:07] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[10:11:43] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[10:18:22] <wikibugs>	 (CR) Ladsgroup: "We talked about details of this in 1:1. Are we good to go?" [software] - https://gerrit.wikimedia.org/r/826522 (owner: Ladsgroup)
[10:18:43] <wikibugs>	 (PS2) Volans: peeringdb: minor fixes [software/spicerack] - https://gerrit.wikimedia.org/r/828486
[10:18:45] <wikibugs>	 (PS2) Volans: CHANGELOG: fix typos and uniform format [software/spicerack] - https://gerrit.wikimedia.org/r/828487
[10:21:09] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[10:23:47] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[10:23:54] <Lucas_WMDE>	 is there a way to get a PHP 7.4 shell.php on mwdebug?
[10:24:37] <Lucas_WMDE>	 aha, `PHP=php7.4 mwscript` works \o/
[10:26:31] <wikibugs>	 (CR) Volans: [C: +2] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/828406 (owner: Ebernhardson)
[10:28:32] <urbanecm>	 Lucas_WMDE: fyi mwscript doesn't do much more than `sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php`. sometimes that's useful, so just informing :).
[10:29:29] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[10:31:33] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[10:32:27] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:33:25] <wikibugs>	 (Merged) jenkins-bot: elasticsearch: Simplify routine to start masters last [software/spicerack] - https://gerrit.wikimedia.org/r/828406 (owner: Ebernhardson)
[10:36:25] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[10:39:23] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[10:41:12] <wikibugs>	 (PS5) Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338)
[10:42:18] <_joe_>	 !log updating php 7.4 on mwdebug1002 to test the new patched packages T316601
[10:42:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:23] <stashbot>	 T316601: PHP Warning: Erroneous data format for unserializing 'Wikimedia\Rdbms\MySQLPrimaryPos' - https://phabricator.wikimedia.org/T316601
[10:44:58] <wikibugs>	 (CR) CI reject: [V: -1] trafficserver: Hide non session cookies during cache lookup [puppet] - https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) (owner: Vgutierrez)
[10:46:00] <wikibugs>	 (PS1) Clément Goubert: ipmi::monitor: Order service after package install [puppet] - https://gerrit.wikimedia.org/r/828494
[10:46:46] <wikibugs>	 (PS6) Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338)
[10:48:36] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37051/console" [puppet] - https://gerrit.wikimedia.org/r/828494 (owner: Clément Goubert)
[10:49:03] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[10:50:31] <wikibugs>	 (CR) CI reject: [V: -1] trafficserver: Hide non session cookies during cache lookup [puppet] - https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) (owner: Vgutierrez)
[10:50:53] <wikibugs>	 (CR) Clément Goubert: ipmi::monitor: Order service after package install [puppet] - https://gerrit.wikimedia.org/r/828494 (owner: Clément Goubert)
[10:51:07] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:57:46] <wikibugs>	 (CR) TsepoThoabala: [C: +1] "I only have +1 rights on this repo" [mediawiki-config] - https://gerrit.wikimedia.org/r/828482 (owner: Phuedx)
[10:59:46] <wikibugs>	 (PS7) Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338)
[10:59:54] <wikibugs>	 (CR) Marostegui: [C: +1] auto_schema: More work on multidc support [software] - https://gerrit.wikimedia.org/r/826522 (owner: Ladsgroup)
[11:00:39] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[11:00:43] <_joe_>	 !log updating php 7.4 packages in wikimedia/buster T316601
[11:00:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:51] <stashbot>	 T316601: PHP Warning: Erroneous data format for unserializing 'Wikimedia\Rdbms\MySQLPrimaryPos' - https://phabricator.wikimedia.org/T316601
[11:01:03] <wikibugs>	 (CR) Ladsgroup: [C: +2] auto_schema: More work on multidc support [software] - https://gerrit.wikimedia.org/r/826522 (owner: Ladsgroup)
[11:01:09] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:01:49] <wikibugs>	 (Merged) jenkins-bot: auto_schema: More work on multidc support [software] - https://gerrit.wikimedia.org/r/826522 (owner: Ladsgroup)
[11:04:09] <vgutierrez>	 !log test trafficserver: Hide non session cookies during cache lookup in cp6016 - T316338
[11:04:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:13] <stashbot>	 T316338: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338
[11:06:35] <wikibugs>	 (PS3) Samtar: InitialiseSettings.php: Enable Realtime Preview on Group 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/828059 (https://phabricator.wikimedia.org/T314828)
[11:07:58] <wikibugs>	 (CR) Volans: [C: +2] "trivial comment removal only, self-merging" [software/pywmflib] - https://gerrit.wikimedia.org/r/828022 (owner: Volans)
[11:09:14] <wikibugs>	 (CR) Volans: [C: +2] "Most fixes are trivial, self-merging to have them released with today's release. Happy to fix anything that might come up later in review." [software/spicerack] - https://gerrit.wikimedia.org/r/828486 (owner: Volans)
[11:09:20] <wikibugs>	 (PS3) Volans: peeringdb: minor fixes [software/spicerack] - https://gerrit.wikimedia.org/r/828486
[11:09:53] <wikibugs>	 (PS1) Hnowlan: image-suggestion: temporarily enable debug logging in prod [deployment-charts] - https://gerrit.wikimedia.org/r/828497 (https://phabricator.wikimedia.org/T313973)
[11:10:01] <wikibugs>	 (CR) Volans: [C: +2] "CHANGELOG only changes, self-merging" [software/spicerack] - https://gerrit.wikimedia.org/r/828487 (owner: Volans)
[11:13:04] <wikibugs>	 (Merged) jenkins-bot: tests: remove unnecessary pylint disable [software/pywmflib] - https://gerrit.wikimedia.org/r/828022 (owner: Volans)
[11:13:59] <icinga-wm>	 RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-08-30 07:31:40 (3454 GiB, +1.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[11:16:18] <wikibugs>	 (PS1) Clément Goubert: P:mediawiki::php: Order wmerrors config and package install [puppet] - https://gerrit.wikimedia.org/r/828500
[11:16:37] <jynus>	 finally it finished!
[11:16:54] <wikibugs>	 (CR) CI reject: [V: -1] P:mediawiki::php: Order wmerrors config and package install [puppet] - https://gerrit.wikimedia.org/r/828500 (owner: Clément Goubert)
[11:17:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf2004.codfw.wmnet
[11:18:04] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37052/console" [puppet] - https://gerrit.wikimedia.org/r/828500 (owner: Clément Goubert)
[11:21:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2004.codfw.wmnet
[11:21:49] <wikibugs>	 (PS8) Vgutierrez: trafficserver: Hide non session cookies during cache lookup [puppet] - https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338)
[11:22:23] <moritzm>	 !log draining ganeti2015 for eventual reimage T311686
[11:22:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:27] <stashbot>	 T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[11:25:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1003.eqiad.wmnet
[11:27:18] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version
[11:27:32] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on gitlab1004.wikimedia.org with reason: upgrade gitlab1004 to new version
[11:27:33] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[11:28:31] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[11:28:35] <wikibugs>	 (03PS1) 10Clément Goubert: C:cpufrequtils: Order install configuration and service [puppet] - 10https://gerrit.wikimedia.org/r/828502
[11:29:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1003.eqiad.wmnet
[11:29:57] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37053/console" [puppet] - 10https://gerrit.wikimedia.org/r/828502 (owner: 10Clément Goubert)
[11:30:49] <wikibugs>	 (03PS2) 10Clément Goubert: P:mediawiki::php: Order wmerrors config and package install [puppet] - 10https://gerrit.wikimedia.org/r/828500
[11:32:01] <wikibugs>	 (03PS2) 10Clément Goubert: C:ipmi::monitor: Order service after package install [puppet] - 10https://gerrit.wikimedia.org/r/828494
[11:32:46] <wikibugs>	 (03PS1) 10Hnowlan: Fix environment in prep stage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/828503 (https://phabricator.wikimedia.org/T312104)
[11:33:19] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[11:34:23] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:37:25] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] api-gateway: custom host overrides in discovery services. [deployment-charts] - 10https://gerrit.wikimedia.org/r/825729 (owner: 10Hnowlan)
[11:38:55] <wikibugs>	 (03PS1) 10Clément Goubert: P:prometheus::nutcracker_exporter: Order service and package [puppet] - 10https://gerrit.wikimedia.org/r/828504
[11:39:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1004.eqiad.wmnet
[11:40:00] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37054/console" [puppet] - 10https://gerrit.wikimedia.org/r/828504 (owner: 10Clément Goubert)
[11:41:21] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: custom host overrides in discovery services. [deployment-charts] - 10https://gerrit.wikimedia.org/r/825729 (owner: 10Hnowlan)
[11:49:12] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host webperf1004.eqiad.wmnet
[11:49:26] <wikibugs>	 (03PS3) 10Volans: CHANGELOG: fix typos and uniform format [software/spicerack] - 10https://gerrit.wikimedia.org/r/828487
[11:49:53] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens14.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:47] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:57:22] <wikibugs>	 (03PS1) 10Clément Goubert: C:httpd::mpm: Remove mod_php* for php7.4 [puppet] - 10https://gerrit.wikimedia.org/r/828507
[11:57:43] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[11:58:01] <marostegui>	 !log Reboot sanitarium hosts, lag will appear on clouddb* hosts
[11:58:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:31] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37055/console" [puppet] - 10https://gerrit.wikimedia.org/r/828507 (owner: 10Clément Goubert)
[12:03:25] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on parse1002 is CRITICAL: Host parse1002 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[12:04:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet
[12:05:01] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[12:05:35] <wikibugs>	 (03PS1) 10Andrew Bogott: Remove backy2 backups from cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/828508 (https://phabricator.wikimedia.org/T316731)
[12:06:11] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[12:06:54] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet
[12:08:00] <wikibugs>	 (03PS1) 10Clément Goubert: C:mediawiki::packages::fonts: Order install and config [puppet] - 10https://gerrit.wikimedia.org/r/828510
[12:08:23] <wikibugs>	 (03PS5) 10Jcrespo: bacula: Setup backup1008, backup2008 as new database backup storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/816120 (https://phabricator.wikimedia.org/T313582)
[12:08:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2002.codfw.wmnet
[12:13:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:mediawiki::packages::fonts: Order install and config [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert)
[12:13:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job k8s-pods in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:16:59] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2001.codfw.wmnet
[12:17:15] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging2002.codfw.wmnet
[12:18:09] <wikibugs>	 (03PS2) 10Clément Goubert: C:mediawiki::packages::fonts: Order install and config [puppet] - 10https://gerrit.wikimedia.org/r/828510
[12:18:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job k8s-pods in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:19:24] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37057/console" [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert)
[12:22:45] <wikibugs>	 (03CR) 10Clément Goubert: "Related puppet log: https://phabricator.wikimedia.org/P33716$1" [puppet] - 10https://gerrit.wikimedia.org/r/828494 (owner: 10Clément Goubert)
[12:23:09] <wikibugs>	 (03CR) 10Clément Goubert: "Related puppet log: https://phabricator.wikimedia.org/P33716$19" [puppet] - 10https://gerrit.wikimedia.org/r/828500 (owner: 10Clément Goubert)
[12:23:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-staging2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:24:12] <wikibugs>	 (CR) Muehlenhoff: C:mediawiki::packages::fonts: Order install and config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/828510 (owner: Clément Goubert)
[12:24:14] <wikibugs>	 (CR) Clément Goubert: P:mediawiki::php: Order wmerrors config and package install (1 comment) [puppet] - https://gerrit.wikimedia.org/r/828500 (owner: Clément Goubert)
[12:25:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1002.eqiad.wmnet
[12:25:56] <wikibugs>	 (CR) Clément Goubert: [V: +1] C:mediawiki::packages::fonts: Order install and config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/828510 (owner: Clément Goubert)
[12:26:42] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "Related puppet log: https://phabricator.wikimedia.org/P33716$19" [puppet] - 10https://gerrit.wikimedia.org/r/828502 (owner: 10Clément Goubert)
[12:27:15] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2002.codfw.wmnet
[12:27:27] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "Related puppet log: https://phabricator.wikimedia.org/P33716$30" [puppet] - 10https://gerrit.wikimedia.org/r/828504 (owner: 10Clément Goubert)
[12:27:55] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "Related puppet log: https://phabricator.wikimedia.org/P33716$35" [puppet] - 10https://gerrit.wikimedia.org/r/828507 (owner: 10Clément Goubert)
[12:28:43] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-staging-ctrl2001.codfw.wmnet with reason: Reboot to pick up kernel 5.10.136 (T316185)
[12:28:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1002.eqiad.wmnet
[12:28:57] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-staging-ctrl2001.codfw.wmnet with reason: Reboot to pick up kernel 5.10.136 (T316185)
[12:28:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: ml-staging2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:29:46] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] bacula: Setup backup1008, backup2008 as new database backup storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/816120 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[12:30:47] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[12:31:25] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:31:45] <wikibugs>	 (03CR) 10Herron: [C: 03+1] graphite: start carbon.service at boot [puppet] - 10https://gerrit.wikimedia.org/r/828471 (https://phabricator.wikimedia.org/T316747) (owner: 10Filippo Giunchedi)
[12:31:55] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:31:57] <wikibugs>	 (03CR) 10Herron: [C: 03+1] graphite: properly shut carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/828472 (https://phabricator.wikimedia.org/T316747) (owner: 10Filippo Giunchedi)
[12:33:15] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Remove backy2 backups from cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/828508 (https://phabricator.wikimedia.org/T316731) (owner: 10Andrew Bogott)
[12:33:21] <wikibugs>	 (03PS2) 10Andrew Bogott: Remove backy2 backups from cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/828508 (https://phabricator.wikimedia.org/T316731)
[12:33:41] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.remove-downtime for ml-staging-ctrl2001.codfw.wmnet
[12:33:41] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ml-staging-ctrl2001.codfw.wmnet
[12:34:25] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-staging-ctrl2002.codfw.wmnet with reason: Reboot to pick up kernel 5.10.136 (T316185)
[12:34:39] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-staging-ctrl2002.codfw.wmnet with reason: Reboot to pick up kernel 5.10.136 (T316185)
[12:35:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet
[12:35:41] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:39:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet
[12:39:19] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.remove-downtime for ml-staging-ctrl2002.codfw.wmnet
[12:39:19] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ml-staging-ctrl2002.codfw.wmnet
[12:39:26] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.dnsdisc.Discovery should not allow pooling active/passive services in both datacenters - https://phabricator.wikimedia.org/T315560 (10Volans) Wasn't that the required behaviour to allow to failover an active/passive service without downtime? A...
[12:43:33] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] image-suggestion: temporarily enable debug logging in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/828497 (https://phabricator.wikimedia.org/T313973) (owner: 10Hnowlan)
[12:52:11] <wikibugs>	 (03CR) 10Vgutierrez: "Tested in cp6016:" [puppet] - 10https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) (owner: 10Vgutierrez)
[12:52:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2002.codfw.wmnet
[12:53:04] <wikibugs>	 (03PS8) 10DCausse: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson)
[12:53:12] <wikibugs>	 (CR) DCausse: cirrus: Handle transition to elasticsearch 7.10 (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/824787 (owner: Ebernhardson)
[12:54:05] <wikibugs>	 (03PS9) 10DCausse: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson)
[12:55:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson)
[12:55:11] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[12:55:42] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Hide non session cookies during cache lookup [puppet] - 10https://gerrit.wikimedia.org/r/828002 (https://phabricator.wikimedia.org/T316338) (owner: 10Vgutierrez)
[12:56:08] <wikibugs>	 (03PS1) 10Jcrespo: bacula: Migrate new database dump long term backups to backup[12]008 [puppet] - 10https://gerrit.wikimedia.org/r/828515 (https://phabricator.wikimedia.org/T313582)
[12:57:22] <vgutierrez>	 !log test trafficserver: Hide non session cookies during cache lookup in drmrs - T316338
[12:57:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:27] <stashbot>	 T316338: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338
[12:57:41] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.dnsdisc.Discovery should not allow pooling active/passive services in both datacenters - https://phabricator.wikimedia.org/T315560 (10JMeybohm) Ah, okay. Makes sense in that case. I think I was assuming pool() would check because depool() does...
[12:58:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2002.codfw.wmnet
[12:59:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow3002.esams.wmnet
[12:59:35] <wikibugs>	 (03PS10) 10DCausse: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824787 (owner: 10Ebernhardson)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220831T1300)
[13:00:05] <jouncebot>	 TheresNoTime and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] <Lucas_WMDE>	 o/
[13:00:12] <urbanecm>	 o/
[13:00:15] <TheresNoTime>	 hai
[13:00:22] <urbanecm>	 does anyone want to deploy, or should i?
[13:00:37] <Lucas_WMDE>	 I can deploy
[13:00:47] <Lucas_WMDE>	 TheresNoTime: what’s your current deployer status? ^^
[13:01:26] <urbanecm>	 she's a deployer, too
[13:01:30] <TheresNoTime>	 I can self-deploy if you'd prefer?
[13:01:40] <Lucas_WMDE>	 sure!
[13:01:46] <Lucas_WMDE>	 and nice \o/
[13:02:03] <TheresNoTime>	 Lucas_WMDE: did you want to do your patch first
[13:02:10] <Lucas_WMDE>	 nah, it’s not important at all
[13:02:19] <Lucas_WMDE>	 just a no-op to simplify the config a tiny bit
[13:02:32] <TheresNoTime>	 ack, will deploy mine :)
[13:03:19] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828059 (https://phabricator.wikimedia.org/T314828) (owner: 10Samtar)
[13:04:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3002.esams.wmnet
[13:04:54] <wikibugs>	 (03Merged) 10jenkins-bot: InitialiseSettings.php: Enable Realtime Preview on Group 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828059 (https://phabricator.wikimedia.org/T314828) (owner: 10Samtar)
[13:05:54] <wikibugs>	 (03PS2) 10Jcrespo: bacula: Migrate new database dump long term backups to backup[12]008 [puppet] - 10https://gerrit.wikimedia.org/r/828515 (https://phabricator.wikimedia.org/T313582)
[13:05:55] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack: IcingaHosts.wait_for_downtimed() does not honor dry_run - https://phabricator.wikimedia.org/T315537 (Volans) a:SLyngshede-WMF Indeed, I can confirm the issue. The problem comes from the somewhat //automagic// dry-run handling in the `@retry` decorato...
[13:07:18] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw
[13:07:29] <TheresNoTime>	 (syncing mine)
[13:10:16] <wikibugs>	 (CR) Muehlenhoff: C:mediawiki::packages::fonts: Order install and config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/828510 (owner: Clément Goubert)
[13:10:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:11:10] <logmsgbot>	 !log samtar@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:828059|InitialiseSettings.php: Enable Realtime Preview on Group 2 (T314828)]] (duration: 03m 54s)
[13:11:14] <stashbot>	 T314828: Enable Realtime preview on group2 - https://phabricator.wikimedia.org/T314828
[13:11:20] <TheresNoTime>	 Lucas_WMDE: all yours :)
[13:11:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:11:26] <Lucas_WMDE>	 ok :)
[13:11:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:11:52] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[13:11:53] <Lucas_WMDE>	 I’ll first verify that I can test this behavior on mwdebug
[13:12:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:12:32] <icinga-wm>	 RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-08-30 07:48:35 (3433 GiB, +1.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[13:12:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4002.ulsfo.wmnet
[13:13:19] <moritzm>	 !log installing zlib security updates on bullseye
[13:13:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:22] <Lucas_WMDE>	 ok, that works
[13:14:38] <Lucas_WMDE>	 if I comment out the assignment that I’m tampering with, then searchEntities.php for “English” returns slightly different results
[13:14:51] <Lucas_WMDE>	 so I should be able to use that to test that the assignments are still effective after the config file changes
[13:14:57] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): Only set WikibaseCirrusSearch settings if wmg globals are set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806872
[13:15:00] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Only set WikibaseCirrusSearch settings if wmg globals are set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806872 (owner: 10Lucas Werkmeister (WMDE))
[13:16:22] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[13:16:41] <wikibugs>	 (03Merged) 10jenkins-bot: Only set WikibaseCirrusSearch settings if wmg globals are set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806872 (owner: 10Lucas Werkmeister (WMDE))
[13:17:15] <Lucas_WMDE>	 looks good on mwdebug, syncing
[13:18:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4002.ulsfo.wmnet
[13:18:23] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): Directly set WikibaseCirrusSearch settings in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806873
[13:18:26] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Directly set WikibaseCirrusSearch settings in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806873 (owner: 10Lucas Werkmeister (WMDE))
[13:18:59] <wikibugs>	 (03PS1) 10Andrew Bogott: P:systemd::timedated: exclude /mnt from accessible paths [puppet] - 10https://gerrit.wikimedia.org/r/828526 (https://phabricator.wikimedia.org/T310643)
[13:19:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow5002.eqsin.wmnet
[13:20:23] <wikibugs>	 (03Merged) 10jenkins-bot: Directly set WikibaseCirrusSearch settings in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806873 (owner: 10Lucas Werkmeister (WMDE))
[13:21:15] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/SearchSettingsForWikibase.php: Config: [[gerrit:806872|Only set WikibaseCirrusSearch settings if wmg globals are set]] (duration: 03m 42s)
[13:22:29] <Lucas_WMDE>	 second change also looks good on mwdebug, syncing
[13:22:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:22:51] <wikibugs>	 (CR) Clément Goubert: [V: +1] C:mediawiki::packages::fonts: Order install and config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/828510 (owner: Clément Goubert)
[13:23:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:23:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:24:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:25:02] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[13:25:56] <icinga-wm>	 PROBLEM - Check systemd state on netflow4002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:26:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5002.eqsin.wmnet
[13:26:45] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:806873|Directly set WikibaseCirrusSearch settings in IS.php]] (1/3) (duration: 03m 47s)
[13:28:10] <icinga-wm>	 PROBLEM - Check systemd state on db2173 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:28:10] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:28:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:29:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:29:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:30:09] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (10fgiunchedi) In Matthew's absence I can confirm that the drives are hot swappable @Jclark-ctr !
[13:30:28] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:30:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:30:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:30:37] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:806873|Directly set WikibaseCirrusSearch settings in IS.php]] (2/3) (duration: 03m 39s)
[13:30:49] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): Remove unused assignments from SearchSettingsForWikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806874
[13:31:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:31:41] <wikibugs>	 (03PS3) 10Clément Goubert: C:mediawiki::packages::fonts: Install fontconfig-config [puppet] - 10https://gerrit.wikimedia.org/r/828510
[13:31:50] <moritzm>	 !log restarting exim on the MXes to pick up zlib update
[13:31:52] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[13:31:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:47] <wikibugs>	 (CR) Clément Goubert: C:mediawiki::packages::fonts: Install fontconfig-config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/828510 (owner: Clément Goubert)
[13:33:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert)
[13:33:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet
[13:34:26] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37062/console" [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert)
[13:34:45] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/SearchSettingsForWikidata.php: Config: [[gerrit:806873|Directly set WikibaseCirrusSearch settings in IS.php]] (3/3) (duration: 03m 42s)
[13:34:52] <icinga-wm>	 PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 111.6 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[13:34:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:35:23] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove unused assignments from SearchSettingsForWikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806874 (owner: 10Lucas Werkmeister (WMDE))
[13:36:31] <wikibugs>	 (03Merged) 10jenkins-bot: Remove unused assignments from SearchSettingsForWikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/806874 (owner: 10Lucas Werkmeister (WMDE))
[13:37:32] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[13:37:32] <Lucas_WMDE>	 checking on mwdebug
[13:37:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet
[13:37:46] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] C:mediawiki::packages::fonts: Install fontconfig-config [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert)
[13:39:38] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:41:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:42:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:42:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:43:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:43:57] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/SearchSettingsForWikibase.php: Config: [[gerrit:806874|Remove unused assignments from SearchSettingsForWikibase.php]] (1/2) (duration: 03m 38s)
[13:44:28] <icinga-wm>	 PROBLEM - k8s requests count to the API on ml-serve-ctrl2001 is CRITICAL: 102.2 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[13:47:06] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[13:47:11] <claime>	 There's an issue with the patch https://gerrit.wikimedia.org/r/828510 I just merged, reverting
[13:47:45] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/SearchSettingsForWikidata.php: Config: [[gerrit:806874|Remove unused assignments from SearchSettingsForWikibase.php]] (2/2) (duration: 03m 33s)
[13:47:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Jclark-ctr) cloudvirt1054  E4 U29   Port 36/37   Cableid    20220045      /   20220041 cloudvirt1055  E4 U30   Port 38/39   Cableid    20220046...
[13:48:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Jclark-ctr)
[13:48:44] <Lucas_WMDE>	 I think I’m done
[13:48:49] <Lucas_WMDE>	 anything else to deploy?
[13:49:14] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:49:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson
[13:49:28] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:49:29] <wikibugs>	 (03PS1) 10Clément Goubert: Revert "C:mediawiki::packages::fonts: Install fontconfig-config" [puppet] - 10https://gerrit.wikimedia.org/r/828536
[13:49:38] <icinga-wm>	 RECOVERY - Check systemd state on db2173 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:49:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:57] <wikibugs>	 (03PS1) 10JMeybohm: image-suggestion: Update to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/828537 (https://phabricator.wikimedia.org/T313973)
[13:51:10] <icinga-wm>	 PROBLEM - Check systemd state on mw1383 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:51:42] <wikibugs>	 (03CR) 10Clément Goubert: "There's an issue with the previous patch, it keeps uninstalling and reinstalling fontconfig-config at each run on parse hosts. I'll hunt " [puppet] - 10https://gerrit.wikimedia.org/r/828536 (owner: 10Clément Goubert)
[13:52:55] <wikibugs>	 (03PS2) 10Clément Goubert: Revert "C:mediawiki::packages::fonts: Install fontconfig-config" [puppet] - 10https://gerrit.wikimedia.org/r/828536
[13:54:28] <icinga-wm>	 PROBLEM - Check systemd state on elastic1080 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:25] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] Revert "C:mediawiki::packages::fonts: Install fontconfig-config" [puppet] - 10https://gerrit.wikimedia.org/r/828536 (owner: 10Clément Goubert)
[14:00:31] <wikibugs>	 (03CR) 10Muehlenhoff: P:systemd::timedated: exclude /mnt from accessible paths (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828526 (https://phabricator.wikimedia.org/T310643) (owner: 10Andrew Bogott)
[14:02:39] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] image-suggestion: Update to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/828537 (https://phabricator.wikimedia.org/T313973) (owner: 10JMeybohm)
[14:03:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:04:17] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] "Reverted in https://gerrit.wikimedia.org/r/c/operations/puppet/+/828536" [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert)
[14:06:31] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ssingh) Downgrading and reimaging drmrs ATS9 hosts cp6008 and cp6016 to ATS8 for a week so that we can have comparative data later when we upgrade all instances to ATS9 in drmrs.
[14:07:01] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10Krinkle)
[14:07:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] C:mediawiki::packages::fonts: Install fontconfig-config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828510 (owner: 10Clément Goubert)
[14:08:13] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6008.drmrs.wmnet with OS buster
[14:08:21] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6008.drmrs.wmnet with OS buster
[14:08:50] <vgutierrez>	 !log deploy trafficserver: Hide non session cookies during cache lookup globally - T316338
[14:08:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:54] <stashbot>	 T316338: strip non session cookies before cache lookup in ATS - https://phabricator.wikimedia.org/T316338
[14:08:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:09:36] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp6008.drmrs.wmnet with OS buster
[14:09:43] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6008.drmrs.wmnet with OS buster executed with errors: - cp6008 (**FAIL...
[14:10:47] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] bacula: Migrate new database dump long term backups to backup[12]008 [puppet] - 10https://gerrit.wikimedia.org/r/828515 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[14:11:12] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[14:11:46] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp6008.drmrs.wmnet,service=ats-tls
[14:11:46] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp6008.drmrs.wmnet,service=ats-be
[14:11:47] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp6008.drmrs.wmnet,service=varnish-fe
[14:13:56] <wikibugs>	 (03PS1) 10Ssingh: hiera: downgrade cp6008 to ATS8 [puppet] - 10https://gerrit.wikimedia.org/r/828540 (https://phabricator.wikimedia.org/T309651)
[14:14:13] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] image-suggestion: Update to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/828537 (https://phabricator.wikimedia.org/T313973) (owner: 10JMeybohm)
[14:14:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Jclark-ctr)
[14:15:03] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: downgrade cp6008 to ATS8 [puppet] - 10https://gerrit.wikimedia.org/r/828540 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[14:15:50] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6008.drmrs.wmnet with OS buster
[14:16:13] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6008.drmrs.wmnet with OS buster
[14:18:26] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[14:18:46] <wikibugs>	 (03PS1) 10Ssingh: hiera: downgrade cp6016 to ATS8 [puppet] - 10https://gerrit.wikimedia.org/r/828543 (https://phabricator.wikimedia.org/T309651)
[14:19:33] <wikibugs>	 (03Merged) 10jenkins-bot: image-suggestion: Update to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/828537 (https://phabricator.wikimedia.org/T313973) (owner: 10JMeybohm)
[14:22:49] <wikibugs>	 (03PS1) 10Clément Goubert: C:mediawiki::packages::fonts: match conf and class ensures [puppet] - 10https://gerrit.wikimedia.org/r/828544
[14:22:59] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] image-suggestion: temporarily enable debug logging in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/828497 (https://phabricator.wikimedia.org/T313973) (owner: 10Hnowlan)
[14:23:03] <wikibugs>	 (03PS2) 10Hnowlan: image-suggestion: temporarily enable debug logging in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/828497 (https://phabricator.wikimedia.org/T313973)
[14:23:26] <wikibugs>	 (03PS2) 10Clément Goubert: C:mediawiki::packages::fonts: match conf and class ensures [puppet] - 10https://gerrit.wikimedia.org/r/828544
[14:24:10] <wikibugs>	 (03PS1) 10JMeybohm: Merge cert-manager/sample-external-issuer@55b043b [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/828545 (https://phabricator.wikimedia.org/T310486)
[14:26:05] <wikibugs>	 (03PS3) 10Clément Goubert: C:mediawiki::packages::fonts: match conf and class ensures [puppet] - 10https://gerrit.wikimedia.org/r/828544
[14:27:23] <wikibugs>	 (03PS2) 10JMeybohm: Merge cert-manager/sample-external-issuer@55b043b [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/828545 (https://phabricator.wikimedia.org/T310486)
[14:28:16] <wikibugs>	 (03CR) 10Andrew Bogott: P:systemd::timedated: exclude /mnt from accessible paths (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828526 (https://phabricator.wikimedia.org/T310643) (owner: 10Andrew Bogott)
[14:28:28] <wikibugs>	 (03PS2) 10Andrew Bogott: P:systemd::timedated: exclude /mnt from accessible paths [puppet] - 10https://gerrit.wikimedia.org/r/828526 (https://phabricator.wikimedia.org/T310643)
[14:29:11] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37064/console" [puppet] - 10https://gerrit.wikimedia.org/r/828544 (owner: 10Clément Goubert)
[14:33:25] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] image-suggestion: temporarily enable debug logging in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/828497 (https://phabricator.wikimedia.org/T313973) (owner: 10Hnowlan)
[14:37:10] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6008.drmrs.wmnet with reason: host reimage
[14:37:43] <wikibugs>	 (03Merged) 10jenkins-bot: image-suggestion: temporarily enable debug logging in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/828497 (https://phabricator.wikimedia.org/T313973) (owner: 10Hnowlan)
[14:40:46] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6008.drmrs.wmnet with reason: host reimage
[14:41:06] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-codfw
[14:42:27] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply
[14:42:33] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM. Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/828471 (https://phabricator.wikimedia.org/T316747) (owner: 10Filippo Giunchedi)
[14:42:52] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply
[14:43:17] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply
[14:43:26] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM. Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/828472 (https://phabricator.wikimedia.org/T316747) (owner: 10Filippo Giunchedi)
[14:43:58] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply
[14:44:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data Engineering Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Jclark-ctr) druid1009  A5  U06 druid1010  B5  U13  druid1011  D6  U37
[14:45:27] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] C:mediawiki::packages::fonts: match conf and class ensures [puppet] - 10https://gerrit.wikimedia.org/r/828544 (owner: 10Clément Goubert)
[14:47:00] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/image-suggestion: apply
[14:47:45] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply
[14:48:39] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] C:mediawiki::packages::fonts: match conf and class ensures [puppet] - 10https://gerrit.wikimedia.org/r/828544 (owner: 10Clément Goubert)
[14:48:51] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply
[14:49:26] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply
[14:50:00] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-serve-ctrl2002.codfw.wmnet with reason: Reboot to pick up kernel 5.10.136 (T316185)
[14:50:14] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-serve-ctrl2002.codfw.wmnet with reason: Reboot to pick up kernel 5.10.136 (T316185)
[14:54:08] <icinga-wm>	 RECOVERY - k8s requests count to the API on ml-serve-ctrl2001 is OK: (C)100 ge (W)50 ge 32.26 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[14:54:20] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:54:20] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:56:14] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:56:14] <icinga-wm>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:56:37] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.remove-downtime for ml-serve-ctrl2002.codfw.wmnet
[14:56:38] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ml-serve-ctrl2002.codfw.wmnet
[14:56:44] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-serve-ctrl2001.codfw.wmnet with reason: Reboot to pick up kernel 5.10.136 (T316185)
[14:56:57] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-serve-ctrl2001.codfw.wmnet with reason: Reboot to pick up kernel 5.10.136 (T316185)
[15:00:38] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6008.drmrs.wmnet with OS buster
[15:00:47] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6008.drmrs.wmnet with OS buster completed: - cp6008 (**WARN**)   - Dow...
[15:01:03] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.remove-downtime for ml-serve-ctrl2001.codfw.wmnet
[15:01:04] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ml-serve-ctrl2001.codfw.wmnet
[15:04:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:04:04] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp6008.drmrs.wmnet,service=ats-tls
[15:04:04] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp6008.drmrs.wmnet,service=ats-be
[15:04:05] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp6008.drmrs.wmnet,service=varnish-fe
[15:05:49] <wikibugs>	 (03PS1) 10Jcrespo: bacula: Add new hosts backup1009 & backup2009 as new storage servers [puppet] - 10https://gerrit.wikimedia.org/r/828559 (https://phabricator.wikimedia.org/T313582)
[15:06:24] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp6016.drmrs.wmnet,service=ats-tls
[15:06:24] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp6016.drmrs.wmnet,service=ats-be
[15:06:24] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp6016.drmrs.wmnet,service=varnish-fe
[15:06:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] bacula: Add new hosts backup1009 & backup2009 as new storage servers [puppet] - 10https://gerrit.wikimedia.org/r/828559 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[15:06:42] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: downgrade cp6016 to ATS8 [puppet] - 10https://gerrit.wikimedia.org/r/828543 (https://phabricator.wikimedia.org/T309651) (owner: 10Ssingh)
[15:07:28] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp6016.drmrs.wmnet with OS buster
[15:07:36] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6016.drmrs.wmnet with OS buster
[15:07:52] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v3.2.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/828560
[15:08:30] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "changelog for new release, self-merging" [software/spicerack] - 10https://gerrit.wikimedia.org/r/828560 (owner: 10Volans)
[15:08:35] <wikibugs>	 (03PS2) 10Jcrespo: bacula: Add new hosts backup1009 & backup2009 as new storage servers [puppet] - 10https://gerrit.wikimedia.org/r/828559 (https://phabricator.wikimedia.org/T313582)
[15:09:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:09:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] bacula: Add new hosts backup1009 & backup2009 as new storage servers [puppet] - 10https://gerrit.wikimedia.org/r/828559 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[15:11:32] <wikibugs>	 (03PS3) 10Jcrespo: bacula: Add new hosts backup1009 & backup2009 as new storage servers [puppet] - 10https://gerrit.wikimedia.org/r/828559 (https://phabricator.wikimedia.org/T313582)
[15:14:10] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[15:15:46] <wikibugs>	 (03PS1) 10Vgutierrez: trafficserver: Replace session cookies with Token=1 iff V:C isn't there [puppet] - 10https://gerrit.wikimedia.org/r/828564 (https://phabricator.wikimedia.org/T316338)
[15:17:34] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v3.2.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/828560 (owner: 10Volans)
[15:24:01] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] bacula: Add new hosts backup1009 & backup2009 as new storage servers [puppet] - 10https://gerrit.wikimedia.org/r/828559 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[15:27:34] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[15:27:42] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp6016.drmrs.wmnet with reason: host reimage
[15:28:20] <wikibugs>	 (03PS3) 10Jelto: sre.gitlab.reboot-runner: add cookbook to restart gitlab-runners [cookbooks] - 10https://gerrit.wikimedia.org/r/827456 (https://phabricator.wikimedia.org/T295481)
[15:31:43] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp6016.drmrs.wmnet with reason: host reimage
[15:32:46] <wikibugs>	 (03PS1) 10Volans: Upstream release v3.2.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/828568
[15:35:04] <wikibugs>	 10SRE, 10Image-Suggestions, 10Structured-Data-Backlog (Current Work): [M] Schedule image suggestions notifications - https://phabricator.wikimedia.org/T300024 (10matthiasmullie) 05Open→03Resolved
[15:35:44] <wikibugs>	 (03PS1) 10Ahmon Dancy: Revert "Turn mw_releases into a list" [puppet] - 10https://gerrit.wikimedia.org/r/828586
[15:36:28] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[15:37:36] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:37:50] <wikibugs>	 (03PS3) 10Clément Goubert: C:ipmi::monitor: Order service after package install [puppet] - 10https://gerrit.wikimedia.org/r/828494
[15:39:32] <wikibugs>	 (03Abandoned) 10Vgutierrez: trafficserver: Replace session cookies with Token=1 iff V:C isn't there [puppet] - 10https://gerrit.wikimedia.org/r/828564 (https://phabricator.wikimedia.org/T316338) (owner: 10Vgutierrez)
[15:46:06] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Upstream release v3.2.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/828568 (owner: 10Volans)
[15:52:14] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[15:54:09] <wikibugs>	 (03Merged) 10jenkins-bot: Upstream release v3.2.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/828568 (owner: 10Volans)
[15:54:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on ganeti2015.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[15:54:31] <stashbot>	 T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[15:54:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti2015.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[15:55:35] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6016.drmrs.wmnet with OS buster
[15:55:44] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Package and deploy ATS 9.1.3 - https://phabricator.wikimedia.org/T309651 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6016.drmrs.wmnet with OS buster completed: - cp6016 (**WARN**)   - Dow...
[15:56:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:56:59] <_joe_>	 !log updated php 7.4 in all of production T316691
[15:57:01] <wikibugs>	 (03CR) 10Muehlenhoff: "This looks fine, but please don't merge yet. This is a fleet-wide available service and I need to first doublecheck that we don't run into" [puppet] - 10https://gerrit.wikimedia.org/r/828526 (https://phabricator.wikimedia.org/T310643) (owner: 10Andrew Bogott)
[15:57:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:04] <stashbot>	 T316691: Vue: reimplement LanguageSelector clear strategy - https://phabricator.wikimedia.org/T316691
[15:57:16] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[15:58:46] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[16:00:26] <volans>	 !log uploaded spicerack_3.2.1 to apt.wikimedia.org bullseye-wikimedia
[16:00:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:36] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp6016.drmrs.wmnet,service=ats-tls
[16:00:37] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp6016.drmrs.wmnet,service=ats-be
[16:00:37] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp6016.drmrs.wmnet,service=varnish-fe
[16:07:47] <wikibugs>	 (03PS1) 10Jcrespo: Migrate production backups from backup1001 to backup1009 [puppet] - 10https://gerrit.wikimedia.org/r/828575 (https://phabricator.wikimedia.org/T313582)
[16:08:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Migrate production backups from backup1001 to backup1009 [puppet] - 10https://gerrit.wikimedia.org/r/828575 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[16:08:34] <wikibugs>	 (03PS2) 10Jcrespo: bacula: Migrate production backups from backup1001 to backup1009 [puppet] - 10https://gerrit.wikimedia.org/r/828575 (https://phabricator.wikimedia.org/T313582)
[16:09:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] bacula: Migrate production backups from backup1001 to backup1009 [puppet] - 10https://gerrit.wikimedia.org/r/828575 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[16:09:34] <wikibugs>	 (03PS3) 10Jcrespo: bacula: Migrate production backups from backup1001 to backup1009 [puppet] - 10https://gerrit.wikimedia.org/r/828575 (https://phabricator.wikimedia.org/T313582)
[16:11:16] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 04-1] "From Joe's latest explanation, we won't be able to know the namespace names in advance. That means key lookup is not required, we will nee" [puppet] - 10https://gerrit.wikimedia.org/r/828586 (owner: 10Ahmon Dancy)
[16:19:00] <wikibugs>	 (03PS4) 10Jcrespo: bacula: Migrate production backups from backup1001 to backup1009 [puppet] - 10https://gerrit.wikimedia.org/r/828575 (https://phabricator.wikimedia.org/T313582)
[16:21:43] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:30:52] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] bacula: Migrate production backups from backup1001 to backup1009 [puppet] - 10https://gerrit.wikimedia.org/r/828575 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[16:34:17] <wikibugs>	 (03Abandoned) 10Ahmon Dancy: Revert "Turn mw_releases into a list" [puppet] - 10https://gerrit.wikimedia.org/r/828586 (owner: 10Ahmon Dancy)
[16:35:56] <wikibugs>	 (03PS1) 10Ahmon Dancy: Revert comment change [puppet] - 10https://gerrit.wikimedia.org/r/828583
[16:37:45] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:43:30] <wikibugs>	 (03PS1) 10Jcrespo: bacula: Fix old references to old pools [puppet] - 10https://gerrit.wikimedia.org/r/828584 (https://phabricator.wikimedia.org/T313582)
[16:44:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] bacula: Fix old references to old pools [puppet] - 10https://gerrit.wikimedia.org/r/828584 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[16:47:11] <wikibugs>	 (03PS1) 10CDanis: dbctl: python 3.10 & x2 section [software/conftool] - 10https://gerrit.wikimedia.org/r/828585
[16:47:13] <wikibugs>	 (03PS1) 10CDanis: dbctl: Add omit_replicas_in_mwconfig section attribute [software/conftool] - 10https://gerrit.wikimedia.org/r/828606 (https://phabricator.wikimedia.org/T316482)
[16:47:55] <wikibugs>	 (03PS2) 10Jcrespo: bacula: Fix old references to old pools [puppet] - 10https://gerrit.wikimedia.org/r/828584 (https://phabricator.wikimedia.org/T313582)
[16:48:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:48:52] <wikibugs>	 (03CR) 10CDanis: "Pre-review -- will add docs once you think the patch is sound." [software/conftool] - 10https://gerrit.wikimedia.org/r/828606 (https://phabricator.wikimedia.org/T316482) (owner: 10CDanis)
[16:49:25] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[16:50:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:52:01] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:52:17] <wikibugs>	 (PS3) Jcrespo: bacula: Fix old references to old pools [puppet] - https://gerrit.wikimedia.org/r/828584 (https://phabricator.wikimedia.org/T313582)
[16:52:41] <volans>	 !log installing spicerack 3.2.1 on cumin2002
[16:52:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:11] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[16:56:17] <wikibugs>	 (CR) Jcrespo: [C: +2] bacula: Fix old references to old pools [puppet] - https://gerrit.wikimedia.org/r/828584 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[16:58:35] <wikibugs>	 (PS1) Volans: sre.hardware.upgrade-firmware: sort drivers files [cookbooks] - https://gerrit.wikimedia.org/r/828609
[17:04:05] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.64.48.153:9042 on restbase1033 is CRITICAL: connect to address 10.64.48.153 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[17:04:21] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.151:9042 on restbase1033 is CRITICAL: connect to address 10.64.48.151 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[17:04:39] <icinga-wm>	 PROBLEM - cassandra-b service on restbase1033 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:04:45] <icinga-wm>	 PROBLEM - cassandra-c service on restbase1033 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:04:53] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.48.152:9042 on restbase1033 is CRITICAL: connect to address 10.64.48.152 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[17:04:57] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.48.152:7001 on restbase1033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[17:05:09] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.48.153:7001 on restbase1033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[17:06:14] <volans>	 !log installing spicerack 3.2.1 on cumin1001
[17:06:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:17] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/conftool] - https://gerrit.wikimedia.org/r/828585 (owner: CDanis)
[17:07:48] <wikibugs>	 (CR) CDanis: [C: +2] dbctl: python 3.10 & x2 section [software/conftool] - https://gerrit.wikimedia.org/r/828585 (owner: CDanis)
[17:08:57] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[17:10:05] <wikibugs>	 (Merged) jenkins-bot: dbctl: python 3.10 & x2 section [software/conftool] - https://gerrit.wikimedia.org/r/828585 (owner: CDanis)
[17:17:28] <wikibugs>	 (PS1) Jcrespo: bacula: Fix backups with custom jobdefaults [puppet] - https://gerrit.wikimedia.org/r/828610 (https://phabricator.wikimedia.org/T313582)
[17:29:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:33:05] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[17:39:36] <wikibugs>	 (CR) Jcrespo: [C: +2] bacula: Fix backups with custom jobdefaults [puppet] - https://gerrit.wikimedia.org/r/828610 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[17:43:09] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[17:47:59] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[17:54:23] <wikibugs>	 (PS1) DDesouza: Deploy Research Incentive Survey to enwiki on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/828613 (https://phabricator.wikimedia.org/T316464)
[17:54:51] <dduvall>	 jouncebot: now
[17:54:51] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 5 minute(s)
[17:58:17] <wikibugs>	 (PS1) DDesouza: Deploy Research Incentive Survey to idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/828614 (https://phabricator.wikimedia.org/T316466)
[18:00:05] <jouncebot>	 dduvall and hashar: Dear deployers, time to do the Train log triage with CPT deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220831T1800).
[18:00:05] <jouncebot>	 dduvall and hashar: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220831T1800).
[18:06:57] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[18:08:06] <wikibugs>	 (PS1) Bernard Wang: Remove Vector grid config [mediawiki-config] - https://gerrit.wikimedia.org/r/828616
[18:08:29] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:08:43] <wikibugs>	 (PS2) Bernard Wang: Remove Vector grid config [mediawiki-config] - https://gerrit.wikimedia.org/r/828616 (https://phabricator.wikimedia.org/T313559)
[18:12:06] <wikibugs>	 (CR) Jdlrobson: [C: +1] "This is unused config so can be removed." [mediawiki-config] - https://gerrit.wikimedia.org/r/828616 (https://phabricator.wikimedia.org/T313559) (owner: Bernard Wang)
[18:18:31] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.39.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/828620 (https://phabricator.wikimedia.org/T314188)
[18:18:33] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group1 wikis to 1.39.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/828620 (https://phabricator.wikimedia.org/T314188) (owner: TrainBranchBot)
[18:19:22] <wikibugs>	 (Abandoned) AOkoth: gitlab: copy ssh host keys for failover [puppet] - https://gerrit.wikimedia.org/r/820163 (https://phabricator.wikimedia.org/T296713) (owner: AOkoth)
[18:19:35] <wikibugs>	 (Abandoned) AOkoth: vrts: create /opt/otrs folder [puppet] - https://gerrit.wikimedia.org/r/828078 (owner: AOkoth)
[18:19:39] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.39.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/828620 (https://phabricator.wikimedia.org/T314188) (owner: TrainBranchBot)
[18:21:47] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:22:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:23:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:23:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:24:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:24:16] <logmsgbot>	 !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.27  refs T314188
[18:24:20] <stashbot>	 T314188: 1.39.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T314188
[18:27:54] <logmsgbot>	 !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.27  refs T314188 (duration: 03m 37s)
[18:29:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:30:09] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[18:33:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:33:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:33:32] <wikibugs>	 (PS11) Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/824787
[18:35:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T314041)', diff saved to https://phabricator.wikimedia.org/P33721 and previous config saved to /var/cache/conftool/dbconfig/20220831-183513-ladsgroup.json
[18:35:19] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[18:36:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:38:19] <wikibugs>	 (PS1) Dzahn: prometheus: fix/invert comments about matching in blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/828622
[18:39:02] <wikibugs>	 SRE-OnFire, Beta-Cluster-Infrastructure, Release-Engineering-Team, Wikimedia-Incident: Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 (Zabe)
[18:39:50] <wikibugs>	 (PS2) Dzahn: prometheus: fix/invert comments about matching in blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/828622
[18:40:21] <wikibugs>	 (CR) Dzahn: [C: +2] prometheus::blackbox::http: add/edit parameter comments (1 comment) [puppet] - https://gerrit.wikimedia.org/r/807176 (owner: Dzahn)
[18:46:23] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.151:9042 on restbase1033 is OK: TCP OK - 0.000 second response time on 10.64.48.151 port 9042 https://phabricator.wikimedia.org/T93886
[18:46:58] <wikibugs>	 (PS1) Andrew Bogott: Refactor profile::wmcs::backup_glance_images to run on a dedicated backup host [puppet] - https://gerrit.wikimedia.org/r/828623 (https://phabricator.wikimedia.org/T316738)
[18:47:33] <wikibugs>	 (CR) CI reject: [V: -1] Refactor profile::wmcs::backup_glance_images to run on a dedicated backup host [puppet] - https://gerrit.wikimedia.org/r/828623 (https://phabricator.wikimedia.org/T316738) (owner: Andrew Bogott)
[18:50:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P33722 and previous config saved to /var/cache/conftool/dbconfig/20220831-185020-ladsgroup.json
[18:52:23] <wikibugs>	 (PS2) Andrew Bogott: Refactor profile::wmcs::backup_glance_images to run on a dedicated backup host [puppet] - https://gerrit.wikimedia.org/r/828623 (https://phabricator.wikimedia.org/T316738)
[18:52:34] <wikibugs>	 (CR) Dzahn: [C: +1] "This looks reasonable to me. Compiler output is a bit hard to diff but production networks for install servers makes sense to me and if an" [puppet] - https://gerrit.wikimedia.org/r/827964 (https://phabricator.wikimedia.org/T265864) (owner: Ayounsi)
[18:56:04] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719
[18:56:09] <stashbot>	 T316719: Upgrade codfw cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T316719
[18:56:35] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (Jclark-ctr) Thanks  will be done within a hour
[18:56:56] <wikibugs>	 (CR) Dzahn: [C: -1] "so Gerrit thinks it's my turn here and it's in my attention set and that's why it keeps showing it to as my todo / it's waiting for me. He" [puppet] - https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: Brennen Bearnes)
[18:58:51] <wikibugs>	 (PS3) Ryan Kemper: Relax elasticsearch master node detection [puppet] - https://gerrit.wikimedia.org/r/828403 (owner: DCausse)
[18:59:03] <wikibugs>	 (CR) Dzahn: [C: -1] "ack:) I consider it stalled but don't want to abandon it. Gerrit thinks it's "my turn" so I am replying to get it out of that attention se" [puppet] - https://gerrit.wikimedia.org/r/824412 (owner: Jbond)
[19:00:53] <wikibugs>	 (CR) Dzahn: "ACK, currently on-ice." [puppet] - https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: Jbond)
[19:01:21] <wikibugs>	 (PS3) Andrew Bogott: Refactor profile::wmcs::backup_glance_images to run on a dedicated backup host [puppet] - https://gerrit.wikimedia.org/r/828623 (https://phabricator.wikimedia.org/T316738)
[19:02:14] <wikibugs>	 (PS2) Dzahn: Revert "install_server: change partman config for gitlab" [puppet] - https://gerrit.wikimedia.org/r/827578 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[19:02:26] <wikibugs>	 (PS3) Dzahn: Revert "install_server: change partman config for gitlab" [puppet] - https://gerrit.wikimedia.org/r/827578 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[19:02:29] <wikibugs>	 (PS4) Andrew Bogott: Move profile::wmcs::backup_glance_images from cloudcontrols to backup servers [puppet] - https://gerrit.wikimedia.org/r/828623 (https://phabricator.wikimedia.org/T316738)
[19:02:32] <wikibugs>	 (CR) Dzahn: [C: +2] Revert "install_server: change partman config for gitlab" [puppet] - https://gerrit.wikimedia.org/r/827578 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[19:03:04] <wikibugs>	 (CR) Andrew Bogott: "https://puppet-compiler.wmflabs.org/pcc-worker1002/37071/" [puppet] - https://gerrit.wikimedia.org/r/828623 (https://phabricator.wikimedia.org/T316738) (owner: Andrew Bogott)
[19:05:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P33723 and previous config saved to /var/cache/conftool/dbconfig/20220831-190526-ladsgroup.json
[19:06:02] <wikibugs>	 (PS4) Ryan Kemper: Relax elasticsearch master node detection [puppet] - https://gerrit.wikimedia.org/r/828403 (https://phabricator.wikimedia.org/T308676) (owner: DCausse)
[19:06:08] <wikibugs>	 (CR) Ryan Kemper: [C: +2] Relax elasticsearch master node detection [puppet] - https://gerrit.wikimedia.org/r/828403 (https://phabricator.wikimedia.org/T308676) (owner: DCausse)
[19:06:10] <wikibugs>	 (CR) Ryan Kemper: [V: +2 C: +2] Relax elasticsearch master node detection [puppet] - https://gerrit.wikimedia.org/r/828403 (https://phabricator.wikimedia.org/T308676) (owner: DCausse)
[19:07:26] <mutante>	 squeezes in between the merges, puppetmaster can be fast nowadays
[19:08:39] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:12:35] <icinga-wm>	 PROBLEM - SSH on mw1311.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:15:44] <mutante>	 !log gitlab: reimaging gitlab2003 with cookbook after reverting partman change and comment on gerrit:827578 T274463
[19:15:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:51] <stashbot>	 T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463
[19:16:27] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye
[19:18:16] <logmsgbot>	 !log dzahn@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host gitlab2003.wikimedia.org with OS bullseye
[19:19:23] <mutante>	 reimaging failed again
[19:20:10] <mutante>	 Ctrl+c pressed..whoops
[19:20:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T314041)', diff saved to https://phabricator.wikimedia.org/P33724 and previous config saved to /var/cache/conftool/dbconfig/20220831-192032-ladsgroup.json
[19:20:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[19:20:39] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[19:20:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[19:21:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[19:21:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[19:21:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T314041)', diff saved to https://phabricator.wikimedia.org/P33725 and previous config saved to /var/cache/conftool/dbconfig/20220831-192120-ladsgroup.json
[19:21:30] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye
[19:21:48] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw es7 cluster upgrade - ryankemper@cumin2002 - T316719
[19:21:53] <stashbot>	 T316719: Upgrade codfw cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T316719
[19:22:27] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 0 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[19:24:05] <ryankemper>	 ^ Merged a fix for that flapping but didn't manually run puppet, so I expect it to resolve within 10 mins or so
[19:24:10] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Jclark-ctr) What is the status on this one? it has been sitting for a while
[19:24:58] <mutante>	 ryankemper: ack, thanks for that
[19:25:05] <mutante>	 figured it was the maintenance
[19:25:22] <mutante>	 because codfw only
[19:27:34] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[19:29:05] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[19:30:43] <ryankemper>	 !log T316719 Rolling upgrade operation complete; all of elastic codfw is now on `7.10.2`. Next week our related cirrus changes will go out with the mediawiki deploy train in `1.39.0-wmf.28`
[19:30:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:49] <stashbot>	 T316719: Upgrade codfw cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T316719
[19:34:50] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[19:37:20] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[19:37:24] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage
[19:41:03] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage
[19:43:58] <wikibugs>	 (PS1) Jcrespo: bacula: Fix job and restore defaults to use the new production pool [puppet] - https://gerrit.wikimedia.org/r/828629 (https://phabricator.wikimedia.org/T313582)
[19:46:33] <wikibugs>	 (CR) Jcrespo: [C: +2] bacula: Fix job and restore defaults to use the new production pool [puppet] - https://gerrit.wikimedia.org/r/828629 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[19:50:01] <wikibugs>	 (PS1) Ebernhardson: admin: Update my home directory [puppet] - https://gerrit.wikimedia.org/r/828630
[19:53:22] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (gerrit1001), No backups: 109 (an-master1002, ...), Fresh: 5 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[19:56:40] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2003.wikimedia.org with OS bullseye
[19:57:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220831T2000).
[20:00:05] <jouncebot>	 ebernhardson and danisztls: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:11] <urbanecm>	 o/
[20:00:13] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (Jclark-ctr) Replaced 2 failed drives
[20:00:18] <urbanecm>	 I can deploy today
[20:00:20] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (Jclark-ctr) Open→Resolved
[20:01:09] <ebernhardson>	 \o
[20:01:38] <ebernhardson>	 urbanecm: my patch should have no visible change, it's prep for next week's train
[20:01:49] <urbanecm>	 ack
[20:02:19] <urbanecm>	 is there any order the files should be synced in?
[20:02:32] <urbanecm>	 also, feel free to self-deploy if you want, looks like danisztls's not around today
[20:02:54] <ebernhardson>	 urbanecm: sure i can deploy it
[20:03:10] <urbanecm>	 go ahead then :)
[20:05:21] <danisztls>	 sry, I'm late
[20:05:49] <urbanecm>	 no worries
[20:05:57] <ebernhardson>	 urbanecm: actually i'm going to delay mine till tomorrow, i notice dcausse changed a part of it and i'm not sure which way is correct, will have to check with him and ship tomorrow
[20:06:04] <urbanecm>	 okay, sounds good!
[20:06:10] <urbanecm>	 so only danisztls's patch then
[20:06:11] <ebernhardson>	 well, i think my way is correct and he thinks his is, we should agree first :)
[20:06:20] <urbanecm>	 yep, sounds like a good idea
[20:06:27] <urbanecm>	 taking over the window
[20:06:31] <wikibugs>	 (PS2) Urbanecm: Deploy Research Incentive Survey to enwiki on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/828613 (https://phabricator.wikimedia.org/T316464) (owner: DDesouza)
[20:06:47] <urbanecm>	 danisztls: your patch has zero coverage, is that intentional?
[20:06:52] <danisztls>	 urbanecm: yes
[20:06:54] <urbanecm>	 okay
[20:07:00] <wikibugs>	 (CR) Urbanecm: [C: +2] Deploy Research Incentive Survey to enwiki on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/828613 (https://phabricator.wikimedia.org/T316464) (owner: DDesouza)
[20:07:01] <danisztls>	 urbanecm: will trigger via parameter
[20:07:06] <urbanecm>	 makes sense
[20:07:08] <urbanecm>	 shipping :)
[20:07:45] <wikibugs>	 (Merged) jenkins-bot: Deploy Research Incentive Survey to enwiki on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/828613 (https://phabricator.wikimedia.org/T316464) (owner: DDesouza)
[20:08:24] <denisse|m>	 Hello team, I'll reboot the netmon1003 instance in 30 minutes for a kernel update.
[20:08:55] <logmsgbot>	 !log bking@cumin1001 conftool action : get/pooled; selector: dnsdisc=wdqs,name=codfw
[20:09:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops: ps1-e4-eqiad alerts - https://phabricator.wikimedia.org/T314027 (Jclark-ctr) ps2-e4  had a failed network card. started rma for card.     in meantime swapped card from  unconfigured pdu  to verify fixed
[20:10:03] <urbanecm>	 danisztls: should be deployed to beta soon. anything else to deploy today?
[20:10:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops: ps1-e4-eqiad alerts - https://phabricator.wikimedia.org/T314027 (Jclark-ctr) Open→Resolved
[20:10:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (Jclark-ctr)
[20:10:43] <danisztls>	 urbanecm: just this patch, thanks!
[20:10:49] <urbanecm>	 okay, then we're done :)
[20:10:56] <urbanecm>	 (with deployment, i mean :D)
[20:11:04] <icinga-wm>	 ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: All failures: 1 (gerrit1001), No backups: 109 (an-master1002, ...), Fresh: 5 jobs Jcrespo known issue - backups are being refactored https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[20:13:02] <icinga-wm>	 RECOVERY - SSH on mw1311.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:13:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:13:14] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/826867 (owner: Muehlenhoff)
[20:14:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:14:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:15:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:16:22] <icinga-wm>	 PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:23:27] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye
[20:24:38] <wikibugs>	 (CR) Dzahn: "ah, cool, I suggested creating this yesterday without realizing the patch for it already existed" [puppet] - https://gerrit.wikimedia.org/r/826867 (owner: Muehlenhoff)
[20:25:43] <wikibugs>	 (CR) Dzahn: "I would have expected that we set a specific UID/GID that we reserve in admin module. Like for librenms and phd. Not needed here?" [puppet] - https://gerrit.wikimedia.org/r/826867 (owner: Muehlenhoff)
[20:31:24] <denisse|m>	 !log rebooting netmon1003 for a kernel upgrade
[20:31:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:51] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.upgrade-and-reboot
[20:36:58] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:37:10] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:38:12] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (Andrew) @Jclark-ctr These are blocked on a variety of tech decisions; no action needed in the DC for now. Thanks for checking in!
[20:38:21] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.upgrade-and-reboot (exit_code=0)
[20:39:50] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage
[20:40:33] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[20:41:30] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:43:31] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage
[20:52:16] <wikibugs>	 (PS12) Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/824787
[20:52:18] <wikibugs>	 (CR) Ebernhardson: cirrus: Handle transition to elasticsearch 7.10 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/824787 (owner: Ebernhardson)
[20:56:18] <wikibugs>	 (CR) Ori: [C: +1] "'     => $body_regex_not_matches," [puppet] - https://gerrit.wikimedia.org/r/828622 (owner: Dzahn)
[20:56:49] <wikibugs>	 (CR) Ori: [C: +1] prometheus: fix/invert comments about matching in blackbox monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/828622 (owner: Dzahn)
[20:57:51] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2003.wikimedia.org with OS bullseye
[20:58:05] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@94b160c]: drop_old_data: Add new required param --allowed-interval
[20:58:37] <wikibugs>	 SRE, Cloud-Services, Infrastructure-Foundations, netops, Patch-For-Review: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (Andrew) Open→Resolved
[21:00:12] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@94b160c]: drop_old_data: Add new required param --allowed-interval (duration: 02m 07s)
[21:02:07] <wikibugs>	 (CR) Dzahn: [C: +2] Revert "install_server: change partman config for gitlab" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/827578 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[21:04:09] <wikibugs>	 (CR) Dzahn: prometheus: fix/invert comments about matching in blackbox monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/828622 (owner: Dzahn)
[21:05:41] <wikibugs>	 (CR) Muehlenhoff: rancid: Switch to systemd::sysuser (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826867 (owner: Muehlenhoff)
[21:17:08] <icinga-wm>	 RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:20:23] <wikibugs>	 (CR) Dzahn: [C: +1] rancid: Switch to systemd::sysuser (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826867 (owner: Muehlenhoff)
[21:20:33] <jinxer-wm>	 (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder  - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[21:20:46] <wikibugs>	 (CR) Dzahn: [C: +2] prometheus: fix/invert comments about matching in blackbox monitoring [puppet] - https://gerrit.wikimedia.org/r/828622 (owner: Dzahn)
[21:21:03] <wikibugs>	 (CR) Dzahn: [C: +2] "just the comments for now to avoid wrong docs" [puppet] - https://gerrit.wikimedia.org/r/828622 (owner: Dzahn)
[21:27:43] <wikibugs>	 (CR) Dzahn: [C: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37072/netmon1003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/826867 (owner: Muehlenhoff)
[21:29:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:29:46] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on etherpad1003.eqiad.wmnet with reason: kernel upgrade
[21:30:13] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on etherpad1003.eqiad.wmnet with reason: kernel upgrade
[21:30:14] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on etherpad1003.eqiad.wmnet with reason: kernel upgrade
[21:30:18] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on etherpad1003.eqiad.wmnet with reason: kernel upgrade
[21:32:53] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1028.eqiad.wmnet: Restart to apply new certificates (T316697) - eevans@cumin1001
[21:32:58] <stashbot>	 T316697: Replace expiring Cassandra SSL certificates - https://phabricator.wikimedia.org/T316697
[21:34:12] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.0.209:7001 on restbase1028 is OK: SSL OK - Certificate restbase1028-a valid until 2024-08-30 21:25:17 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[21:36:52] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.0.210:7001 on restbase1028 is OK: SSL OK - Certificate restbase1028-b valid until 2024-08-30 21:25:20 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[21:38:38] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.0.211:7001 on restbase1028 is OK: SSL OK - Certificate restbase1028-c valid until 2024-08-30 21:25:22 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[21:40:41] <ebernhardson>	 !log run search index creation for guwwiktionary
[21:40:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:10] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:41:31] <ebernhardson>	 !log run search index creation for bjnwiktionary
[21:41:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:19] <ebernhardson>	 !log run search index creation for pcmwiki
[21:42:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:28] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1028.eqiad.wmnet: Restart to apply new certificates (T316697) - eevans@cumin1001
[21:42:32] <stashbot>	 T316697: Replace expiring Cassandra SSL certificates - https://phabricator.wikimedia.org/T316697
[21:42:40] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1029.eqiad.wmnet: Restart to apply new certificates (T316697) - eevans@cumin1001
[21:44:10] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.16.180:7001 on restbase1029 is OK: SSL OK - Certificate restbase1029-a valid until 2024-08-30 21:39:09 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[21:46:10] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.16.181:7001 on restbase1029 is OK: SSL OK - Certificate restbase1029-b valid until 2024-08-30 21:39:11 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[21:48:25] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1030.eqiad.wmnet: Restart to apply new certificates (T316697) - eevans@cumin1001
[21:48:31] <stashbot>	 T316697: Replace expiring Cassandra SSL certificates - https://phabricator.wikimedia.org/T316697
[21:48:40] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.16.182:7001 on restbase1029 is OK: SSL OK - Certificate restbase1029-c valid until 2024-08-30 21:39:14 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[21:52:18] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.234:7001 on restbase1030 is OK: SSL OK - Certificate restbase1030-a valid until 2024-08-30 21:39:16 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[21:52:36] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1029.eqiad.wmnet: Restart to apply new certificates (T316697) - eevans@cumin1001
[21:53:04] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.48.235:7001 on restbase1030 is OK: SSL OK - Certificate restbase1030-b valid until 2024-08-30 21:39:18 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[21:55:22] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.48.236:7001 on restbase1030 is OK: SSL OK - Certificate restbase1030-c valid until 2024-08-30 21:39:21 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[21:55:42] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service daniel_zahn https://phabricator.wikimedia.org/T310395 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:55:46] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service daniel_zahn https://phabricator.wikimedia.org/T310395 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:56:37] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on parse1002 is CRITICAL: Host parse1002 is not in mediawiki-installation dsh group daniel_zahn https://phabricator.wikimedia.org/T312638 https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[21:58:20] <mutante>	 !log mw1383 start php7.2-fpm_check_restart.service
[21:58:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:08] <mutante>	 !log etherpad (etherpad1003) - rebooting for maintenance
[21:59:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:08] <icinga-wm>	 RECOVERY - Check systemd state on mw1383 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:00:13] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1030.eqiad.wmnet: Restart to apply new certificates (T316697) - eevans@cumin1001
[22:00:17] <stashbot>	 T316697: Replace expiring Cassandra SSL certificates - https://phabricator.wikimedia.org/T316697
[22:30:25] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:41:29] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:58:48] <wikibugs>	 (PS5) Krinkle: Remove references to the 'electron' service [mediawiki-config] - https://gerrit.wikimedia.org/r/634935 (owner: Giuseppe Lavagetto)
[23:02:06] <wikibugs>	 (CR) Krinkle: [C: +2] Remove references to the 'electron' service [mediawiki-config] - https://gerrit.wikimedia.org/r/634935 (owner: Giuseppe Lavagetto)
[23:03:51] <wikibugs>	 (Merged) jenkins-bot: Remove references to the 'electron' service [mediawiki-config] - https://gerrit.wikimedia.org/r/634935 (owner: Giuseppe Lavagetto)
[23:06:01] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[23:06:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:07:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:07:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:08:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:12:10] <Krinkle>	 !log krinkle@deploy1002 Change /srv/mediawiki-staging/private to remove wmgElectronSecret
[23:12:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:13] <wikibugs>	 (PS1) Dduvall: phabricator: Reintroduce script to ensure correct config ownership/perms [puppet] - https://gerrit.wikimedia.org/r/828654 (https://phabricator.wikimedia.org/T313953)
[23:12:37] <wikibugs>	 (PS4) Krinkle: Remove reference to unreachable eventlogging-processor service [mediawiki-config] - https://gerrit.wikimedia.org/r/822666 (https://phabricator.wikimedia.org/T238230)
[23:13:17] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[23:13:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:13:41] <logmsgbot>	 !log krinkle@deploy1002 Synchronized wmf-config/: Ibdac0a (duration: 03m 44s)
[23:14:09] <wikibugs>	 (CR) Krinkle: [C: +2] Remove reference to unreachable eventlogging-processor service [mediawiki-config] - https://gerrit.wikimedia.org/r/822666 (https://phabricator.wikimedia.org/T238230) (owner: Krinkle)
[23:14:09] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:14:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:14:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:15:16] <wikibugs>	 (Merged) jenkins-bot: Remove reference to unreachable eventlogging-processor service [mediawiki-config] - https://gerrit.wikimedia.org/r/822666 (https://phabricator.wikimedia.org/T238230) (owner: Krinkle)
[23:15:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:16:22] <wikibugs>	 (CR) CI reject: [V: -1] phabricator: Reintroduce script to ensure correct config ownership/perms [puppet] - https://gerrit.wikimedia.org/r/828654 (https://phabricator.wikimedia.org/T313953) (owner: Dduvall)
[23:17:34] <logmsgbot>	 !log krinkle@deploy1002 Synchronized private/: (no justification provided) (duration: 03m 42s)
[23:18:11] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[23:20:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[23:21:09] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:21:29] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:21:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[23:21:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[23:22:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[23:25:42] <wikibugs>	 (PS1) Dduvall: Run all puppetized deploy scripts as checks [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/828655 (https://phabricator.wikimedia.org/T313953)
[23:27:34] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[23:29:55] <wikibugs>	 (PS2) Dduvall: Run all puppetized deploy scripts as checks [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/828655 (https://phabricator.wikimedia.org/T313953)
[23:31:29] <logmsgbot>	 !log krinkle@deploy1002 Synchronized wmf-config/: I493b5e4662 (duration: 03m 43s)
[23:35:59] <wikibugs>	 (PS2) Dduvall: phabricator: Reintroduce script to ensure correct config ownership/perms [puppet] - https://gerrit.wikimedia.org/r/828654 (https://phabricator.wikimedia.org/T313953)
[23:36:30] <wikibugs>	 (CR) Dduvall: [V: +1] "Successfully tested in devtools." [puppet] - https://gerrit.wikimedia.org/r/828654 (https://phabricator.wikimedia.org/T313953) (owner: Dduvall)
[23:46:54] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[23:52:04] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:52:40] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:54:43] <wikibugs>	 (CR) Krinkle: [C: +1] "Test fixture LGTM. In particular, absence of replicas in the pool, but fine to keep in hostname map indeed." [software/conftool] - https://gerrit.wikimedia.org/r/828606 (https://phabricator.wikimedia.org/T316482) (owner: CDanis)
[23:56:44] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[23:57:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:57:45] <wikibugs>	 SRE-swift-storage, Beta-Cluster-Infrastructure: Upgrade deployment-prep Swift cluster to Debian Buster or newer - https://phabricator.wikimedia.org/T298253 (Zabe)