[00:04:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:09:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:20:39] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:21:31] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:23:29] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [00:24:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:29:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:30:47] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:32:33] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:32:33] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:39:34] (PS1) Legoktm: mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) [00:40:31] (CR) CI reject: [V: -1] mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: Legoktm) [00:42:33] (PS2) Legoktm: mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) [00:43:13] (PS3) Legoktm: mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) [00:44:08] (CR) CI reject: [V: -1] mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: Legoktm) [00:44:33] (PS4) Legoktm: mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - 
https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) [00:45:41] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35823/console" [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: Legoktm) [00:51:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [00:55:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:00:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:04:19] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:05:39] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:06:31] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:08:47] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:15:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:20:25] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:22:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:24:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:27:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:30:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:35:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:40:27] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:46:15] SRE, Traffic: fawiki user reports getting 503 errors with message "upstream connect error or disconnect before headers" - https://phabricator.wikimedia.org/T310450 (Bugreporter) [01:51:47] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:59:29] RECOVERY - MariaDB Replica Lag: s1 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:10:05] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:15:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance [02:15:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance [02:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:15:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T298560)', diff saved to https://phabricator.wikimedia.org/P29629 and previous config saved to /var/cache/conftool/dbconfig/20220613-021511-ladsgroup.json [02:15:14] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:15:14] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [02:15:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:21:35] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:25:41] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:28:35] (PS1) Legoktm: WIP: Add profile::mediawiki::sharded_periodic_job [puppet] - https://gerrit.wikimedia.org/r/804800 [02:29:46] (CR) CI reject: [V: -1] WIP: Add profile::mediawiki::sharded_periodic_job [puppet] - https://gerrit.wikimedia.org/r/804800 (owner: Legoktm) [02:30:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:31:40] (PS2) Legoktm: WIP: Add profile::mediawiki::sharded_periodic_job [puppet] - https://gerrit.wikimedia.org/r/804800 [02:36:14] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35824/console" [puppet] - 
https://gerrit.wikimedia.org/r/804800 (owner: Legoktm) [02:37:01] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:38:46] (PS3) Legoktm: WIP: Add profile::mediawiki::sharded_periodic_job [puppet] - https://gerrit.wikimedia.org/r/804800 [02:40:06] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35825/console" [puppet] - https://gerrit.wikimedia.org/r/804800 (owner: Legoktm) [02:42:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:47:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:53:41] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:54:12] (PS4) Legoktm: WIP: Add profile::mediawiki::sharded_periodic_job [puppet] - https://gerrit.wikimedia.org/r/804800 [02:56:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:58:20] (PS5) Legoktm: WIP: Add profile::mediawiki::sharded_periodic_job [puppet] - https://gerrit.wikimedia.org/r/804800 [02:59:45] (CR) 
Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35827/console" [puppet] - https://gerrit.wikimedia.org/r/804800 (owner: Legoktm) [03:01:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:03:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:04:26] (PS6) Legoktm: WIP: Add profile::mediawiki::sharded_periodic_job [puppet] - https://gerrit.wikimedia.org/r/804800 [03:07:13] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35828/console" [puppet] - https://gerrit.wikimedia.org/r/804800 (owner: Legoktm) [03:08:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:09:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:12:07] (PS7) Legoktm: Add profile::mediawiki::sharded_periodic_job [puppet] - https://gerrit.wikimedia.org/r/804800 [03:14:18] (ProbeDown) resolved: 
Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:16:52] (PS5) Legoktm: mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) [03:17:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:17:58] (CR) CI reject: [V: -1] mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) (owner: Legoktm) [03:19:10] (PS6) Legoktm: mediawiki: Split updateSpecialPages.php job to be per-shard [puppet] - https://gerrit.wikimedia.org/r/804788 (https://phabricator.wikimedia.org/T307314) [03:22:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:23:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:23:45] (CR) Legoktm: "The follow-up patch demonstrates the usefulness of this refactor." 
[puppet] - https://gerrit.wikimedia.org/r/804800 (owner: Legoktm) [03:28:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:29:25] SRE, SRE-swift-storage, Community-Tech, MediaWiki-Parser, and 4 others: Show SVGs in page language if available - https://phabricator.wikimedia.org/T205040 (Winston_Sung) [03:32:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:37:19] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:40:49] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:41:03] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:45:49] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:50:48] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:50:54] down? [03:51:06] upstream connect error or disconnect/reset before headers. reset reason: overflow [03:51:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [03:52:18] (ProbeDown) firing: (5) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:52:18] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:52:35] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [03:52:35] (FrontendUnavailable) firing: 
HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [03:52:41] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:53:11] PROBLEM - Apache HTTP on mw1441 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:53:11] PROBLEM - Apache HTTP on mw1455 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:53:11] PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:53:11] PROBLEM - Apache HTTP on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:53:14] took too long to log into klaxon, pages fired on their own [03:53:39] PROBLEM - Apache HTTP on mw1368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:53:39] PROBLEM - Apache HTTP on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:53:47] PROBLEM - Apache HTTP on mw1416 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:53:47] PROBLEM - Apache HTTP on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:53:47] PROBLEM - Apache HTTP on 
mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:53:47] PROBLEM - Apache HTTP on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:53:49] PROBLEM - Apache HTTP on mw1325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:53:49] PROBLEM - Apache HTTP on mw1433 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:53:55] PROBLEM - Apache HTTP on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:53:57] PROBLEM - Apache HTTP on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [03:54:09] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 120 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:54:11] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [03:54:17] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:55:11] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.6575 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [03:55:21] RECOVERY - Apache HTTP on mw1441 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:55:21] RECOVERY - Apache HTTP on mw1455 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:55:21] RECOVERY - Apache HTTP on mw1384 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:55:21] RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:55:25] [being investigated] [03:55:51] RECOVERY - Apache HTTP on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:55:51] RECOVERY - Apache HTTP on mw1368 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:55:57] RECOVERY - Apache HTTP on mw1416 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:55:57] RECOVERY - Apache HTTP on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:55:57] RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:55:57] RECOVERY - Apache HTTP on mw1322 is OK: HTTP 
OK: HTTP/1.1 302 Found - 545 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:55:59] RECOVERY - Apache HTTP on mw1325 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:56:01] RECOVERY - Apache HTTP on mw1433 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:56:07] RECOVERY - Apache HTTP on mw1370 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:56:07] RECOVERY - Apache HTTP on mw1367 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [03:56:18] looks up from here now, a few people in discord reporting the same [03:56:31] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [03:57:18] (ProbeDown) resolved: (23) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:57:19] (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:57:21] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:57:31] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: All metrics within thresholds. 
https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [03:57:35] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [03:57:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [03:58:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:58:45] SRE, SRE-swift-storage, Community-Tech, MediaWiki-Parser, and 3 others: Show SVGs in page view language for language variants if available - https://phabricator.wikimedia.org/T310453 (Winston_Sung) [03:58:49] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:59:17] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [04:01:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [04:02:33] (ProbeDown) resolved: (23) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:06:50] everything's back up as far as we can see, continuing to stabilize some things but speak up if you're still having trouble accessing anything <3 [04:14:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:16:36] SRE, Traffic, Wikimedia-Incident: Unable to view all Wikimedia projects - https://phabricator.wikimedia.org/T310431 (Liz) This happened again in the past 15 minutes and lasted about 4 or 5 minutes. 
[04:18:59] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:19:03] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:20:39] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:21:40] SRE, SRE-swift-storage, Community-Tech, MediaWiki-Parser, and 4 others: Show SVGs in page view language for language variants if available - https://phabricator.wikimedia.org/T310453 (PatchDemoBot) Test wiki **created** on [[ https://patchdemo.wmflabs.org | Patch demo ]] by Winston Sung using pat...
[04:22:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:23:55] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 25.97 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [04:24:37] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 47.33 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [04:24:53] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 54.55 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [04:25:47] !log thumbor2006 - host down - attempting powercycle via DRAC console [04:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:26:55] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [04:27:11] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 91.84 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [04:28:35] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 87.65 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [04:29:25] !log thumbor2004 - attempted powercycle via DRAC console [04:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:29:45] (PS3) KartikMistry: Update nodejs -> node command [deployment-charts] - https://gerrit.wikimedia.org/r/804256 [04:32:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:32:27] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=thumbor2004.codfw.wmnet [04:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:34:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:34:48] SRE: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (Dzahn) [04:35:08] SRE: thumbor2004 is down - 
https://phabricator.wikimedia.org/T310455 (Dzahn) 04:32 <+logmsgbot> !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=thumbor2004.codfw.wmnet [04:35:26] SRE: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (Dzahn) [04:35:37] (CR) KartikMistry: [C: +2] Update nodejs -> node command [deployment-charts] - https://gerrit.wikimedia.org/r/804256 (owner: KartikMistry) [04:35:45] ACKNOWLEDGEMENT - Host thumbor2004 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T310455 [04:37:31] SRE, ops-codfw, Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (Dzahn) [04:40:00] (Merged) jenkins-bot: Update nodejs -> node command [deployment-charts] - https://gerrit.wikimedia.org/r/804256 (owner: KartikMistry) [04:44:56] !log dzahn@cumin2002 conftool action : set/pooled=inactive; selector: dc=codfw,name=thumbor2004.codfw.wmnet [04:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:46:06] SRE, ops-codfw, Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (Dzahn) [04:46:55] (PS1) Legoktm: mediawiki: Disable useless mostlinkedcategories update job [puppet] - https://gerrit.wikimedia.org/r/804803 (https://phabricator.wikimedia.org/T310456) [04:47:30] (PS1) Legoktm: mediawiki: Remove absented mostlinkedcategories job [puppet] - https://gerrit.wikimedia.org/r/804804 [04:49:20] (PS1) Legoktm: Remove misleading "disable" of Special:Mostlinkedcategories [mediawiki-config] - https://gerrit.wikimedia.org/r/804805 (https://phabricator.wikimedia.org/T310456) [04:50:24] * kart_ updating cxserver.. 
[04:50:27] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [04:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:50:41] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [04:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:51:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [04:51:48] SRE, ops-codfw, Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (Dzahn) /admin1-> racadm serveraction powercycle Server power operation successful --- but nothing happens [04:52:13] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35830/console" [puppet] - https://gerrit.wikimedia.org/r/804803 (https://phabricator.wikimedia.org/T310456) (owner: Legoktm) [04:52:55] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:54:06] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [04:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:41] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [04:54:55] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [04:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:54] (PS1) Samwilson: Enable Realtime Preview on cawiki, viwiki, and fawiki 
[mediawiki-config] - https://gerrit.wikimedia.org/r/804806 (https://phabricator.wikimedia.org/T303961) [04:56:34] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [04:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:17] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [04:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:01] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [04:59:19] !log Updated cxserver to 2022-06-08-124326-production + nodejs > node command update (T306995, T309169) [04:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:23] T309169: Set Google as the default translation service when translating to Spanish - https://phabricator.wikimedia.org/T309169 [04:59:23] T306995: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 [05:02:08] SRE, serviceops: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (KartikMistry) Upgrade note: node14 has removed symlink of nodejs -> node command.
[05:05:39] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:06:10] (PS1) Legoktm: mediawiki: Switch sharded_periodic_job to use foreachwikiindblist [puppet] - https://gerrit.wikimedia.org/r/804807 [05:07:30] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35831/console" [puppet] - https://gerrit.wikimedia.org/r/804807 (owner: Legoktm) [05:12:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:13:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [05:13:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [05:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance [05:14:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance [05:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T310011)', diff saved to 
https://phabricator.wikimedia.org/P29633 and previous config saved to /var/cache/conftool/dbconfig/20220613-051407-marostegui.json [05:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:12] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [05:16:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T310011)', diff saved to https://phabricator.wikimedia.org/P29634 and previous config saved to /var/cache/conftool/dbconfig/20220613-051613-marostegui.json [05:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P29635 and previous config saved to /var/cache/conftool/dbconfig/20220613-053118-marostegui.json [05:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:17] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:46:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P29636 and previous config saved to /var/cache/conftool/dbconfig/20220613-054623-marostegui.json [05:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P29637 and previous config saved to /var/cache/conftool/dbconfig/20220613-055557-root.json [05:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: After schema change', 
diff saved to https://phabricator.wikimedia.org/P29638 and previous config saved to /var/cache/conftool/dbconfig/20220613-061101-root.json [06:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:35] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:21:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:23:25] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:26:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P29639 and previous config saved to /var/cache/conftool/dbconfig/20220613-062605-root.json [06:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:43] SRE, ops-eqiad, Infrastructure-Foundations, cloud-services-team (Kanban): hdfs client packages for debian Bullseye - https://phabricator.wikimedia.org/T310451 (ayounsi) [06:29:15] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above 
the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:36:21] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:41:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P29640 and previous config saved to /var/cache/conftool/dbconfig/20220613-064109-root.json [06:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:21] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:00:04] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220613T0700). [07:00:04] TheresNoTime: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:16] (CR) Ayounsi: "I don't know enough the venv internals to suggest a better approach." [software/netbox-deploy] - https://gerrit.wikimedia.org/r/804572 (owner: Jbond) [07:06:59] (CR) Ayounsi: scap: update venv to use the system ca bundle (1 comment) [software/netbox-deploy] - https://gerrit.wikimedia.org/r/804572 (owner: Jbond) [07:11:00] (CR) Slyngshede: [C: +2] C:query_service::deploy::autodeploy remove used autodeploy. 
[puppet] - https://gerrit.wikimedia.org/r/803393 (owner: Slyngshede) [07:11:59] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:16:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:18:09] (PS1) Muehlenhoff: Record removed Kerberos principal [puppet] - https://gerrit.wikimedia.org/r/805078 [07:23:19] (CR) Muehlenhoff: [C: +2] Record removed Kerberos principal [puppet] - https://gerrit.wikimedia.org/r/805078 (owner: Muehlenhoff) [07:24:46] (PS1) Muehlenhoff: Remove LDAP access for dstrine [puppet] - https://gerrit.wikimedia.org/r/805079 [07:28:39] (PS5) Slyngshede: Move more OSM cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673) [07:30:58] (CR) Muehlenhoff: [C: +2] Remove LDAP access for dstrine [puppet] - https://gerrit.wikimedia.org/r/805079 (owner: Muehlenhoff) [07:31:05] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:31:47] (CR) CI reject: [V: -1] Move more OSM cronjobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [07:33:38] (PS6) Slyngshede: Move more OSM cronjobs to systemd timers. 
[puppet] - https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673) [07:38:03] (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35832/console" [puppet] - https://gerrit.wikimedia.org/r/791349 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [07:41:45] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:54:58] !log failover ganeti master in esams to ganeti3003 T308238 [07:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:03] T308238: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 [07:55:29] SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (MoritzMuehlenhoff) [08:00:19] PROBLEM - ganeti-wconfd running on ganeti3001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [08:03:01] PROBLEM - Host mr1-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [08:06:27] (CR) Volans: scap: update venv to use the system ca bundle (1 comment) [software/netbox-deploy] - https://gerrit.wikimedia.org/r/804572 (owner: Jbond) [08:06:39] PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [08:06:39] PROBLEM - Host mr1-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [08:07:06] XioNoX: ^^^ mr1-drmrs [08:07:46] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:13:01] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:14:04] volans: looking [08:15:44] volans: looks like it died, and of course the console server is only reachable through mgmt :) [08:15:53] :/ [08:16:16] thanks to parent/child in netbox, only the relevant things alerted [08:16:22] er, in icinga I mean [08:16:27] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?allunhandledproblems&sortobject=services&sorttype=2&sortoption=3&serviceprops=270336&hostprops=270336 [08:16:42] SRE, ops-codfw, Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (SLyngshede-WMF) p:Triage→Medium [08:20:39] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:22:00] PROBLEM - IPMI Sensor Status on cp6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:22:37] ohhh, did we lose a power feed? [08:22:45] I was about to say the same... 
[08:23:20] PROBLEM - IPMI Sensor Status on lvs6003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:23:23] I guess it's easier to fix than a failed router [08:23:24] yep [08:23:35] asw1-b13-drmrs> show chassis environment [08:23:35] Class Item Status Measurement [08:23:35] Power FPC 0 Power Supply 0 OK 35 degrees C / 95 degrees F [08:23:35] FPC 0 Power Supply 1 Present [08:23:37] from icinga they are all in soft critical [08:24:09] was there a planned maintenance? [08:24:36] I can't see one in the calendar [08:24:42] thanks [08:24:44] PROBLEM - IPMI Sensor Status on dns6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:25:50] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:25:50] PROBLEM - IPMI Sensor Status on cp6007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:27:40] PROBLEM - IPMI Sensor Status on cp6006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:28:06] PROBLEM - IPMI Sensor Status on lvs6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:28:14] PROBLEM - IPMI 
Sensor Status on cp6010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:29:02] Please be advised that all equipment connected to only one feed could lose power. [08:29:02] We advise you to check that your equipment is connected to both feeds provided and/or to automatic source inverters in the case it is only connected to a single feed. [08:29:29] the mrs2 notifications don't go to maint-announce [08:29:30] PROBLEM - IPMI Sensor Status on cp6004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:29:40] XioNoX: and where they go? [08:29:53] volans: noreply-notifications@interxion.com :) [08:29:56] PROBLEM - IPMI Sensor Status on cp6005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:30:02] :/ [08:30:16] at least me, and probably some of DCops [08:30:59] btw: [08:31:05] Time Start: 13 June 2022 09:00 Local time [08:31:05] Time End: 13 June 2022 18:00 Local time [08:31:10] so all day [08:31:27] hopefully less than that [08:32:09] (CR) Jaime Nuche: "This change is ready for review." 
[puppet] - https://gerrit.wikimedia.org/r/804306 (https://phabricator.wikimedia.org/T303559) (owner: Jaime Nuche) [08:33:06] PROBLEM - IPMI Sensor Status on ganeti6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:34:31] SRE, SRE-swift-storage, ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (MatthewVernon) Thanks for the update; I think the ILO needs our local configuration re-applying to it? If so, are you OK to do that, please? [08:34:32] PROBLEM - IPMI Sensor Status on cp6013 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:34:34] PROBLEM - IPMI Sensor Status on cp6011 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:34:42] PROBLEM - IPMI Sensor Status on cp6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:35:48] PROBLEM - IPMI Sensor Status on cp6012 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:35:58] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (MatthewVernon) [08:36:22] PROBLEM - IPMI Sensor Status on lvs6001 is CRITICAL: Sensor 
Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:39:30] PROBLEM - IPMI Sensor Status on ganeti6004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:40:38] PROBLEM - IPMI Sensor Status on cp6014 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:41:18] PROBLEM - IPMI Sensor Status on cp6009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:42:24] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:43:00] PROBLEM - IPMI Sensor Status on cp6003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:46:52] PROBLEM - IPMI Sensor Status on cp6016 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:46:52] PROBLEM - IPMI Sensor Status on cp6008 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:48:18] PROBLEM - IPMI Sensor Status on cp6015 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:51:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:51:28] PROBLEM - IPMI Sensor Status on ganeti6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:55:22] PROBLEM - IPMI Sensor Status on dns6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:58:30] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:03:52] PROBLEM - IPMI Sensor Status on ganeti6003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:05:39] 
(NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:07:11] (03CR) 10David Caro: [C: 03+2] wmcs: Added taskircmail, ircmail and pagetaskircmail routings [puppet] - 10https://gerrit.wikimedia.org/r/802040 (owner: 10David Caro) [09:07:20] !log installing ntfs-3g security updates [09:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:28] (03PS12) 10David Caro: wmcs: Added taskircmail, ircmail and pagetaskircmail routings [puppet] - 10https://gerrit.wikimedia.org/r/802040 [09:11:20] 10ops-drmrs: drmrs 1/2 power feed down due to maintenance - https://phabricator.wikimedia.org/T310470 (10ayounsi) p:05Triage→03High [09:11:47] 10ops-drmrs: drmrs 1/2 power feed down due to maintenance - https://phabricator.wikimedia.org/T310470 (10ayounsi) [09:12:16] !log drain ganeti3001 for firmware update/reimage T308238 [09:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:19] T308238: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 [09:12:46] ACKNOWLEDGEMENT - ps1-b13-drmrs-infeed-load-tower-B-single-phase on ps1-b13-drmrs is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:12:46] ACKNOWLEDGEMENT - ps1-b13-drmrs-infeed-load-tower-A-single-phase on ps1-b13-drmrs is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:12:46] ACKNOWLEDGEMENT - ps1-b12-drmrs-infeed-load-tower-B-single-phase on ps1-b12-drmrs is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:12:46] ACKNOWLEDGEMENT - ps1-b12-drmrs-infeed-load-tower-A-single-phase on ps1-b12-drmrs is CRITICAL: CRITICAL - Plugin timed out while executing system call ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:12:46] ACKNOWLEDGEMENT - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. [09:14:26] ACKNOWLEDGEMENT - IPMI Sensor Status on cp6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:26] ACKNOWLEDGEMENT - IPMI Sensor Status on cp6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:26] ACKNOWLEDGEMENT - IPMI Sensor Status on cp6003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:26] ACKNOWLEDGEMENT - IPMI Sensor Status on cp6004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:26] ACKNOWLEDGEMENT - IPMI Sensor Status on cp6005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:27] ACKNOWLEDGEMENT - IPMI Sensor Status on cp6006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:27] ACKNOWLEDGEMENT - IPMI Sensor Status on cp6007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:28] ACKNOWLEDGEMENT - IPMI Sensor Status on cp6008 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:28] ACKNOWLEDGEMENT - IPMI Sensor Status on cp6009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:29] ACKNOWLEDGEMENT - IPMI Sensor Status on cp6010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:29] ACKNOWLEDGEMENT - IPMI Sensor Status on cp6011 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:30] ACKNOWLEDGEMENT - IPMI Sensor Status on cp6012 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:30] ACKNOWLEDGEMENT - IPMI Sensor Status on cp6013 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:31] ACKNOWLEDGEMENT - IPMI Sensor Status on cp6014 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:31] ACKNOWLEDGEMENT - IPMI Sensor Status on cp6015 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:32] ACKNOWLEDGEMENT - IPMI Sensor Status on cp6016 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:32] ACKNOWLEDGEMENT - IPMI Sensor Status on dns6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:33] ACKNOWLEDGEMENT - IPMI Sensor Status on dns6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:33] ACKNOWLEDGEMENT - IPMI Sensor Status on ganeti6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:34] ACKNOWLEDGEMENT - IPMI Sensor Status on ganeti6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:34] ACKNOWLEDGEMENT - IPMI Sensor Status on ganeti6003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:35] ACKNOWLEDGEMENT - IPMI Sensor Status on ganeti6004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:35] ACKNOWLEDGEMENT - IPMI Sensor Status on lvs6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:36] ACKNOWLEDGEMENT - IPMI Sensor Status on lvs6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:14:36] ACKNOWLEDGEMENT - IPMI Sensor Status on lvs6003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] ayounsi https://phabricator.wikimedia.org/T310470 - The acknowledgement expires at: 2022-06-13 16:00:00. https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [09:16:21] (03PS5) 10Jbond: sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 [09:19:56] (03CR) 10CI reject: [V: 04-1] sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 (owner: 10Jbond) [09:20:15] (03CR) 10Volans: "Do we need a new cookbook? can't we just extend the sre.hosts.dhcp one for this use case?" 
[cookbooks] - https://gerrit.wikimedia.org/r/792251 (owner: Jbond) [09:21:36] (CR) David Caro: wmcs: Added taskircmail, ircmail and pagetaskircmail routings (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802040 (owner: David Caro) [09:25:50] (PS6) Jbond: sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe [cookbooks] - https://gerrit.wikimedia.org/r/792251 [09:29:58] (CR) Jbond: "ready for review" [cookbooks] - https://gerrit.wikimedia.org/r/792251 (owner: Jbond) [09:43:20] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:45:34] (PS7) Jbond: sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe [cookbooks] - https://gerrit.wikimedia.org/r/792251 [09:46:23] (CR) Jbond: sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/792251 (owner: Jbond) [09:48:55] ops-eqiad, DC-Ops: hw troubleshooting: remote IPMI not working for ms-be105[7-8].eqiad.wmnet - https://phabricator.wikimedia.org/T310478 (MatthewVernon) p:Triage→High [09:50:12] ops-eqiad, DC-Ops: hw troubleshooting: remote IPMI not working for ms-be105[7-8].eqiad.wmnet - https://phabricator.wikimedia.org/T310478 (MatthewVernon) [09:50:14] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (MatthewVernon) [09:50:38] (PS1) Slyngshede: LDAP sync.
[puppet] - https://gerrit.wikimedia.org/r/805084 (https://phabricator.wikimedia.org/T310385) [09:51:11] (PS9) David Caro: wmcs: relabel alerts from wmcs cluster with wmcs team [puppet] - https://gerrit.wikimedia.org/r/802074 [09:51:13] (CR) David Caro: wmcs: relabel alerts from wmcs cluster with wmcs team (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802074 (owner: David Caro) [09:51:21] (PS3) David Caro: alertmanager.yml.erb: use facts directly instead of lookupvar [puppet] - https://gerrit.wikimedia.org/r/802489 [09:51:45] SRE, Data-Persistence-Backup, media-backups, Goal, Patch-For-Review: Document media recovery use case proposals and decide their priority - https://phabricator.wikimedia.org/T299764 (jcrespo) > Do whatever is the least effort from your end that still preserves something Thank you a lot, that... [09:52:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:52:05] (CR) CI reject: [V: -1] LDAP sync. [puppet] - https://gerrit.wikimedia.org/r/805084 (https://phabricator.wikimedia.org/T310385) (owner: Slyngshede) [09:52:59] (PS3) Muehlenhoff: Switch idp1001/idp2001 to role(insetup) [puppet] - https://gerrit.wikimedia.org/r/803892 (https://phabricator.wikimedia.org/T308214) [09:53:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [09:53:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [09:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:15] (PS2) Slyngshede: LDAP sync.
[puppet] - https://gerrit.wikimedia.org/r/805084 (https://phabricator.wikimedia.org/T310385) [09:54:23] (CR) Hnowlan: [C: +1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) (owner: Eevans) [09:54:31] (CR) David Caro: [C: +2] alertmanager.yml.erb: use facts directly instead of lookupvar [puppet] - https://gerrit.wikimedia.org/r/802489 (owner: David Caro) [09:54:51] (CR) jenkins-bot: LDAP sync. [puppet] - https://gerrit.wikimedia.org/r/805084 (https://phabricator.wikimedia.org/T310385) (owner: Slyngshede) [09:56:23] (PS3) Slyngshede: LDAP sync. [puppet] - https://gerrit.wikimedia.org/r/805084 (https://phabricator.wikimedia.org/T310385) [09:58:03] (PS2) Jbond: scap: update venv to use the system ca bundle [software/netbox-deploy] - https://gerrit.wikimedia.org/r/804572 [09:58:10] (CR) Jbond: scap: update venv to use the system ca bundle (1 comment) [software/netbox-deploy] - https://gerrit.wikimedia.org/r/804572 (owner: Jbond) [10:01:27] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/805084 (https://phabricator.wikimedia.org/T310385) (owner: Slyngshede) [10:10:58] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/803943 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [10:13:33] !log installing 5.10.120 kernel updates on bullseye hosts [10:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1143.eqiad.wmnet with reason: Maintenance [10:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1143.eqiad.wmnet with reason: Maintenance [10:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:37]
!log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T310011)', diff saved to https://phabricator.wikimedia.org/P29641 and previous config saved to /var/cache/conftool/dbconfig/20220613-101537-marostegui.json [10:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:41] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [10:16:58] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:37:43] 10SRE, 10DBA, 10Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (10Marostegui) [10:37:50] 10SRE, 10DBA, 10Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (10Marostegui) p:05Triage→03High [10:37:55] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:37:58] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:17] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:20] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:12] (03CR) 10Slyngshede: [C: 03+2] LDAP sync. 
[puppet] - https://gerrit.wikimedia.org/r/805084 (https://phabricator.wikimedia.org/T310385) (owner: Slyngshede) [10:41:37] (PS1) Marostegui: dbproxy2*: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/805095 (https://phabricator.wikimedia.org/T310484) [10:43:56] (CR) Marostegui: [C: +2] dbproxy2*: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/805095 (https://phabricator.wikimedia.org/T310484) (owner: Marostegui) [10:44:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T310011)', diff saved to https://phabricator.wikimedia.org/P29642 and previous config saved to /var/cache/conftool/dbconfig/20220613-104449-marostegui.json [10:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:54] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [10:45:28] (PS1) Marostegui: x2 databases: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/805096 (https://phabricator.wikimedia.org/T310485) [10:47:02] (CR) Marostegui: [C: +2] x2 databases: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/805096 (https://phabricator.wikimedia.org/T310485) (owner: Marostegui) [10:50:20] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:23] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:36] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:50:42] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:11] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:22] 10SRE, 10Infrastructure-Foundations: SSH host key verification failures in Ganeti intra node SSH calls after Bullseye update - https://phabricator.wikimedia.org/T309724 (10MoritzMuehlenhoff) [10:51:56] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:05] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:26] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:30] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[10:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:52] (03CR) 10Ayounsi: [C: 03+1] scap: update venv to use the system ca bundle [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/804572 (owner: 10Jbond) [10:56:22] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Dsharpe out of all services on: 609 hosts [10:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Dsharpe out of all services on: 609 hosts [10:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P29643 and previous config saved to /var/cache/conftool/dbconfig/20220613-105954-marostegui.json [10:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:15] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetboard2002.codfw.wmnet [11:00:17] RECOVERY - IPMI Sensor Status on dns6001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:19] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Dsharpe out of all services on: 1219 hosts [11:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:50] RECOVERY - IPMI Sensor Status on dns6002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:01:57] RECOVERY - IPMI Sensor Status on cp6007 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:02:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Dsharpe out of all services on: 1219 hosts [11:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:48] RECOVERY - IPMI Sensor Status on cp6006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:03:59] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2002.codfw.wmnet [11:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:16] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetboard1002.eqiad.wmnet [11:04:16] RECOVERY - IPMI Sensor Status on lvs6002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:20] RECOVERY - IPMI Sensor Status on cp6010 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:04:51] (03PS1) 10MMandere: smokeping: Temp mute smokeping for host lvs6001 [puppet] - 10https://gerrit.wikimedia.org/r/805098 (https://phabricator.wikimedia.org/T310470) [11:05:09] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (10SLyngshede-WMF) a:03elukey [11:05:36] RECOVERY - IPMI Sensor Status on cp6004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:05:50] (03CR) 10Ayounsi: [C: 
03+1] smokeping: Temp mute smokeping for host lvs6001 [puppet] - 10https://gerrit.wikimedia.org/r/805098 (https://phabricator.wikimedia.org/T310470) (owner: 10MMandere) [11:06:00] RECOVERY - Host mr1-drmrs is UP: PING OK - Packet loss = 0%, RTA = 87.46 ms [11:06:02] RECOVERY - IPMI Sensor Status on cp6005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:06:25] (03CR) 10MMandere: [C: 03+2] smokeping: Temp mute smokeping for host lvs6001 [puppet] - 10https://gerrit.wikimedia.org/r/805098 (https://phabricator.wikimedia.org/T310470) (owner: 10MMandere) [11:07:14] !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki2002.codfw.wmnet [11:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:59] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1002.eqiad.wmnet [11:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:17] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host people1003.eqiad.wmnet [11:08:19] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:22] RECOVERY - Host mr1-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 87.56 ms [11:08:22] RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.62 ms [11:08:37] RECOVERY - IPMI Sensor Status on ganeti6003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures 
[11:09:06] RECOVERY - IPMI Sensor Status on ganeti6001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:10:27] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1003.eqiad.wmnet [11:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:31] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:10:36] RECOVERY - IPMI Sensor Status on cp6013 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:10:40] RECOVERY - IPMI Sensor Status on cp6011 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:10:48] RECOVERY - IPMI Sensor Status on cp6001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:10:56] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (10Marostegui) [11:11:30] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host people2002.codfw.wmnet [11:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:52] RECOVERY - IPMI Sensor Status on cp6012 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:12:24] RECOVERY - IPMI Sensor Status on lvs6001 is OK: Sensor 
Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:12:30] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki2002.codfw.wmnet [11:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:55] !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host netboxdb2002.codfw.wmnet [11:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:14:52] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netboxdb2002.codfw.wmnet [11:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P29644 and previous config saved to /var/cache/conftool/dbconfig/20220613-111459-marostegui.json [11:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:12] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2002.codfw.wmnet [11:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:15:34] !log jbond@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host netbox2002.codfw.wmnet [11:15:34] RECOVERY - IPMI Sensor Status on ganeti6004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1131 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29645 and previous config saved to /var/cache/conftool/dbconfig/20220613-111621-root.json [11:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:42] RECOVERY - IPMI Sensor Status on cp6014 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:17:28] RECOVERY - IPMI Sensor Status on cp6009 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:18:14] !log Reboot db1131 for kernel upgrade T310485 [11:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:35] !log jbond@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=netbox,name=codfw [11:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:46] !log Reboot x2 hosts for kernel upgrade T310485 [11:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:04] !log jbond@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=netbox,name=eqiad [11:19:04] RECOVERY - IPMI Sensor Status on cp6003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:19:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:08] !log jbond@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=netbox,name=eqiad [11:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:22] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox2002.codfw.wmnet [11:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:41] !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox1002.eqiad.wmnet [11:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:22:09] RECOVERY - IPMI Sensor Status on cp6008 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:22:09] RECOVERY - IPMI Sensor Status on cp6016 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:23:05] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:23:09] RECOVERY - IPMI Sensor Status on cp6015 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures 
[11:23:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 1%: After ugprading kernel', diff saved to https://phabricator.wikimedia.org/P29646 and previous config saved to /var/cache/conftool/dbconfig/20220613-112356-root.json [11:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:26] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host netbox1002.eqiad.wmnet [11:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:38] !log jbond@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=netbox [11:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:41] PROBLEM - Confd template for /var/lib/gdnsd/discovery-netbox.state on authdns2001 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-netbox.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:24:49] !log jbond@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=netbox,name=codfw [11:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:53] PROBLEM - Confd template for /var/lib/gdnsd/discovery-netbox.state on dns3002 is CRITICAL: Compilation of file /var/lib/gdnsd/discovery-netbox.state is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [11:25:30] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host idp-test1002.wikimedia.org [11:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:43] !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test2002.wikimedia.org [11:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:23] RECOVERY - IPMI Sensor Status on ganeti6002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:27:18] 
(ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:27:39] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1002.wikimedia.org [11:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:51] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host idp2002.wikimedia.org [11:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:55] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test2002.wikimedia.org [11:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:07] RECOVERY - IPMI Sensor Status on cp6002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:28:29] !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host netbox-dev2002.codfw.wmnet [11:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:23] (03PS1) 10Jbond: ido: failover to preform reboot [dns] - 10https://gerrit.wikimedia.org/r/805107 (https://phabricator.wikimedia.org/T310483) [11:29:37] RECOVERY - IPMI Sensor Status on lvs6003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:29:54] (03PS2) 10Jbond: idp: failover to preform reboot [dns] - 10https://gerrit.wikimedia.org/r/805107 (https://phabricator.wikimedia.org/T310483) [11:30:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T310011)', diff saved to 
https://phabricator.wikimedia.org/P29647 and previous config saved to /var/cache/conftool/dbconfig/20220613-113004-marostegui.json [11:30:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:30:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:10] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [11:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:04] (03PS1) 10David Caro: Use our own alert managing [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805108 (https://phabricator.wikimedia.org/T309789) [11:31:13] jbond: are you here? could we merge https://phabricator.wikimedia.org/T301104? 
[11:31:23] Mitar: hi and yes one sec [11:31:42] sorry i missed you friday [11:31:45] (03CR) 10Jbond: [C: 03+2] Add page metadata to Wikibase JSON dumps [puppet] - 10https://gerrit.wikimedia.org/r/802921 (https://phabricator.wikimedia.org/T301104) (owner: 10Mitar) [11:32:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:34:36] Mitar: merged and deployed to all the snapshot machines [11:35:28] (03CR) 10CI reject: [V: 04-1] Use our own alert managing [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805108 (https://phabricator.wikimedia.org/T309789) (owner: 10David Caro) [11:35:31] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host idp2002.wikimedia.org [11:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:41] PROBLEM - Check systemd state on idp2002 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:58] * jbond looking [11:36:08] .. at idp2002 [11:36:17] !log jbond@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host netbox-dev2002.codfw.wmnet [11:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:53] awesome, thanks! 
[11:37:17] np [11:39:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 5%: After ugprading kernel', diff saved to https://phabricator.wikimedia.org/P29648 and previous config saved to /var/cache/conftool/dbconfig/20220613-113900-root.json [11:39:03] RECOVERY - Check systemd state on idp2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:05] (03PS1) 10Marostegui: Revert "dbproxy2*: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/804710 [11:39:19] (03PS1) 10Marostegui: Revert "x2 databases: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/804711 [11:40:31] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy2*: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/804710 (owner: 10Marostegui) [11:42:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:44:04] (03CR) 10Marostegui: [C: 03+2] Revert "x2 databases: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/804711 (owner: 10Marostegui) [11:47:53] 10SRE, 10DBA, 10Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (10Marostegui) Active proxies: ` # for i in m1 m2 m3 m5; do host $i-master | grep alias ;done m1-master.eqiad.wmnet is an alias for dbproxy1012.eqiad.wmnet. m2-master.eqiad.wmnet is an alias for dbproxy10... 
[11:52:19] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1144.eqiad.wmnet with reason: Maintenance [11:52:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1144.eqiad.wmnet with reason: Maintenance [11:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T310011)', diff saved to https://phabricator.wikimedia.org/P29649 and previous config saved to /var/cache/conftool/dbconfig/20220613-115238-marostegui.json [11:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:41] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [11:54:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 10%: After ugprading kernel', diff saved to https://phabricator.wikimedia.org/P29650 and previous config saved to /var/cache/conftool/dbconfig/20220613-115404-root.json [11:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:18] 10SRE, 10DBA, 10Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (10Marostegui) [11:54:24] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/805107 (https://phabricator.wikimedia.org/T310483) (owner: 10Jbond) [11:54:26] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack 
policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:54:31] 10SRE, 10DBA, 10Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (10Marostegui) [11:56:46] 10SRE: an-tool1005 - memcached Connection refused - https://phabricator.wikimedia.org/T309886 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF Memcache was restarted by @elukey on Mon 2022-06-06 06:30:42 UTC [11:58:11] (03PS4) 10Muehlenhoff: Switch idp1001/idp2001 to role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/803892 (https://phabricator.wikimedia.org/T308214) [11:58:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:58:20] (03PS1) 10Marostegui: wmnet: Failover m1 and m2 [dns] - 10https://gerrit.wikimedia.org/r/805114 (https://phabricator.wikimedia.org/T310484) [12:00:57] 10SRE, 10Wikimedia-Mailing-lists, 10Chinese-Sites: Request to create new mailing lists for Chinese Wikipedia Administrators - https://phabricator.wikimedia.org/T310465 (10SLyngshede-WMF) p:05Triage→03Low We just need to clarify if there's an approval process for requesting new mailing lists. I'll try to... 
[12:02:39] looks like drmrs recovered [12:03:19] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:07:29] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti3001.esams.wmnet with reason: Remove from cluster for firmware update and eventual reimage [12:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti3001.esams.wmnet with reason: Remove from cluster for firmware update and eventual reimage [12:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 25%: After ugprading kernel', diff saved to https://phabricator.wikimedia.org/P29651 and previous config saved to /var/cache/conftool/dbconfig/20220613-120907-root.json [12:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:02] 10SRE, 10Wikimedia-Mailing-lists, 10Chinese-Sites: Request to create new mailing lists for Chinese Wikipedia Administrators - https://phabricator.wikimedia.org/T310465 (10SLyngshede-WMF) 05Open→03In progress p:05Low→03High [12:16:24] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:17:35] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH ganeti3001 is removed from the cluster, downtimed and needs the same firmware/NIC updates to enable the reimage to Bullseye. 
[12:19:18] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): hdfs client packages for debian Bullseye - https://phabricator.wikimedia.org/T310451 (10Ottomata) I think the bigtop15 .deb packages can/should just be copied to bullsye? https://apt.wikimedia.org/wikimedia/pool/thirdparty/b... [12:19:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T310011)', diff saved to https://phabricator.wikimedia.org/P29652 and previous config saved to /var/cache/conftool/dbconfig/20220613-121949-marostegui.json [12:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:54] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [12:20:41] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:24:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 50%: After ugprading kernel', diff saved to https://phabricator.wikimedia.org/P29653 and previous config saved to /var/cache/conftool/dbconfig/20220613-122411-root.json [12:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:13] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: remote IPMI not working for ms-be105[7-8].eqiad.wmnet - https://phabricator.wikimedia.org/T310478 (10MatthewVernon) 05Open→03Resolved a:05Cmjohnson→03MatthewVernon This turned out to be an incorrect config section - I've updated https://wikitech.wikim... 
[12:25:16] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [12:25:46] (03CR) 10Ayounsi: [C: 03+1] devices: override default timeout for mgmt routers [homer/public] - 10https://gerrit.wikimedia.org/r/799381 (owner: 10Volans) [12:26:26] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: hw troubleshooting: remote IPMI not working for ms-be105[7-8].eqiad.wmnet - https://phabricator.wikimedia.org/T310478 (10MatthewVernon) [12:27:04] (03CR) 10Ayounsi: "I think this can be abandoned as we're not going with 2 SCAP repositories anymore." [puppet] - 10https://gerrit.wikimedia.org/r/789635 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [12:29:19] (03PS1) 10Btullis: Failover hive to the standby server [dns] - 10https://gerrit.wikimedia.org/r/805119 (https://phabricator.wikimedia.org/T309526) [12:29:45] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1045.eqiad.wmnet [12:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:59] (03CR) 10Jbond: [C: 03+2] idp: failover to preform reboot [dns] - 10https://gerrit.wikimedia.org/r/805107 (https://phabricator.wikimedia.org/T310483) (owner: 10Jbond) [12:30:56] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): hdfs client packages for debian Bullseye - https://phabricator.wikimedia.org/T310451 (10MoritzMuehlenhoff) >>! In T310451#7998402, @Ottomata wrote: > I think the bigtop15 .deb packages can/should just be copied to bullsye? I... 
[12:31:06] (03Abandoned) 10Jbond: O:netbox::standalone: use netbox-next/deploy scap repo [puppet] - 10https://gerrit.wikimedia.org/r/789635 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [12:31:50] (03PS2) 10Btullis: Failover hive to the standby server [dns] - 10https://gerrit.wikimedia.org/r/805119 (https://phabricator.wikimedia.org/T309526) [12:33:14] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): hdfs client packages for debian Bullseye - https://phabricator.wikimedia.org/T310451 (10Ottomata) Ah, okay! [12:33:52] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1045.eqiad.wmnet [12:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P29654 and previous config saved to /var/cache/conftool/dbconfig/20220613-123454-marostegui.json [12:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:31] (03PS1) 10Marostegui: mariadb: Promote db1131 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/805121 (https://phabricator.wikimedia.org/T300471) [12:38:52] (03PS2) 10Marostegui: mariadb: Promote db1131 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/805121 (https://phabricator.wikimedia.org/T300471) [12:39:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 75%: After ugprading kernel', diff saved to https://phabricator.wikimedia.org/P29655 and previous config saved to /var/cache/conftool/dbconfig/20220613-123915-root.json [12:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:28] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover date" [puppet] - 10https://gerrit.wikimedia.org/r/805121 (https://phabricator.wikimedia.org/T300471) (owner: 10Marostegui) [12:42:56] (03CR) 10Btullis: [C: 03+2] 
Failover hive to the standby server [dns] - 10https://gerrit.wikimedia.org/r/805119 (https://phabricator.wikimedia.org/T309526) (owner: 10Btullis) [12:46:17] (03PS3) 10Elukey: admin: add ml-team-admins to ores-admin by default [puppet] - 10https://gerrit.wikimedia.org/r/803457 (https://phabricator.wikimedia.org/T310044) [12:48:31] 10SRE, 10Wikimedia-Mailing-lists, 10Chinese-Sites: Request to create new mailing lists for Chinese Wikipedia Administrators - https://phabricator.wikimedia.org/T310465 (10SLyngshede-WMF) 05In progress→03Resolved a:03SLyngshede-WMF Mailing list have been created, but please check that you have access vi... [12:49:32] (03CR) 10Muehlenhoff: [C: 03+2] Switch idp1001/idp2001 to role(insetup) [puppet] - 10https://gerrit.wikimedia.org/r/803892 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [12:50:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P29657 and previous config saved to /var/cache/conftool/dbconfig/20220613-124959-marostegui.json [12:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:58] (03CR) 10Ayounsi: "I haven't done a deep review of the python side, but the logic sgtm!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: 10Volans) [12:51:01] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host idp1002.wikimedia.org [12:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:05] (03PS1) 10Kosta Harlan: NewcomerTasksStore: update quality gate config when the task queue is set [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/804712 (https://phabricator.wikimedia.org/T309768) [12:51:29] (03CR) 10Kosta Harlan: [C: 03+2] "Backport" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/804712 (https://phabricator.wikimedia.org/T309768) (owner: 10Kosta Harlan) [12:53:10] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp1002.wikimedia.org [12:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1057.eqiad.wmnet with OS bullseye [12:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:08] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1057.eqiad.wmnet with OS bullseye [12:54:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1131 (re)pooling @ 100%: After ugprading kernel', diff saved to https://phabricator.wikimedia.org/P29658 and previous config saved to /var/cache/conftool/dbconfig/20220613-125419-root.json [12:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [12:56:25] (03CR) 10Volans: "reply inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: 10Volans) [12:57:41] (03PS1) 10Jbond: Revert "idp: failover to preform reboot" [dns] - 10https://gerrit.wikimedia.org/r/804713 [12:57:47] (03PS2) 10Jbond: Revert "idp: failover to preform reboot" [dns] - 10https://gerrit.wikimedia.org/r/804713 [12:57:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2001.codfw.wmnet [12:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220613T1300). [13:00:05] kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] o/ [13:00:26] \o [13:00:27] hi kostajh, do you want to self-serve, or should I deploy? [13:00:42] if it's convenient for you to do it, please do. Otherwise I don't mind [13:00:49] (03CR) 10Muehlenhoff: "I don't think we even need specifically fall back, one IDP node is a good as the other. For all past maintenances the failed over server s" [dns] - 10https://gerrit.wikimedia.org/r/804713 (owner: 10Jbond) [13:00:56] o/ [13:00:57] urbanecm: ^ [13:01:07] okay okay. I'll ping you once it's at the debug host kostajh :) [13:01:27] TheresNoTime: you had two patches in the morning window that apparently weren’t deployed, do you want to reschedule them? 
:) [13:01:42] urbanecm: cheers [13:02:35] (03Abandoned) 10Jbond: Revert "idp: failover to preform reboot" [dns] - 10https://gerrit.wikimedia.org/r/804713 (owner: 10Jbond) [13:03:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2001.codfw.wmnet [13:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T310011)', diff saved to https://phabricator.wikimedia.org/P29659 and previous config saved to /var/cache/conftool/dbconfig/20220613-130504-marostegui.json [13:05:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1138.eqiad.wmnet with reason: Maintenance [13:05:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1138.eqiad.wmnet with reason: Maintenance [13:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:11] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [13:05:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T310011)', diff saved to https://phabricator.wikimedia.org/P29660 and previous config saved to /var/cache/conftool/dbconfig/20220613-130512-marostegui.json [13:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:16] (03PS1) 10Jbond: hieradata: netbox1001 to specify netbox1002 as the active server. 
[puppet] - 10https://gerrit.wikimedia.org/r/805125 (https://phabricator.wikimedia.org/T296452) [13:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:06:16] (03CR) 10Urbanecm: [C: 04-1] "svg should be optimized with svgo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800856 (https://phabricator.wikimedia.org/T309431) (owner: 10Samtar) [13:06:42] (03CR) 10Urbanecm: [C: 04-1] "svg should be optimized with svgo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800857 (https://phabricator.wikimedia.org/T309431) (owner: 10Samtar) [13:07:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35834/console" [puppet] - 10https://gerrit.wikimedia.org/r/805125 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:09:22] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/805125 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:10:30] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host karapace1001.eqiad.wmnet [13:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:44] (03CR) 10Jbond: [V: 03+1 C: 03+2] hieradata: netbox1001 to specify netbox1002 as the active server. 
[puppet] - 10https://gerrit.wikimedia.org/r/805125 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:12:24] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 12 hosts with reason: reboots [13:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:28] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host karapace1001.eqiad.wmnet [13:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 12 hosts with reason: reboots [13:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:07] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1001.eqiad.wmnet [13:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:09] urbanecm: no... selenium failed for Minerva :( [13:14:14] :( [13:14:15] (03PS4) 10Samtar: crhwiki: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800856 (https://phabricator.wikimedia.org/T309431) [13:14:36] (03CR) 10CI reject: [V: 04-1] NewcomerTasksStore: update quality gate config when the task queue is set [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/804712 (https://phabricator.wikimedia.org/T309768) (owner: 10Kosta Harlan) [13:14:49] kostajh: since it's selenium, perhaps let's re-run? [13:15:01] urbanecm: yeah, it's a random failure. Or is force merge acceptable in this situation? 
[13:15:20] i try to avoid force merges as much as possible [13:15:33] (03Merged) 10jenkins-bot: NewcomerTasksStore: update quality gate config when the task queue is set [extensions/GrowthExperiments] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/804712 (https://phabricator.wikimedia.org/T309768) (owner: 10Kosta Harlan) [13:15:41] * urbanecm is confused [13:15:42] (03PS4) 10Samtar: ugwiki: Add localized mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800857 (https://phabricator.wikimedia.org/T309431) [13:15:48] oh [13:15:49] (03CR) 10Volans: "Did a first full pass." [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [13:15:53] main failed, gate succeeded [13:16:07] ha [13:16:13] I didn't think that was possible with gerrit, but ok [13:16:49] kostajh: pulled to mwdebug1001. can you check please? [13:17:14] urbanecm: checking [13:17:14] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:18:33] urbanecm: looks good to me [13:18:37] syncing :) [13:20:42] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [13:20:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1001.eqiad.wmnet [13:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:13] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1002.eqiad.wmnet [13:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:09] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1057.eqiad.wmnet with reason: host reimage [13:22:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:56] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.15/extensions/GrowthExperiments/modules/ext.growthExperiments.DataStore/NewcomerTasksStore.js: 67a5352b0bf9f6aa160cc93a42ca22a02aad883a: NewcomerTasksStore: update quality gate config when the task queue is set (T309768) (duration: 03m 41s) [13:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:01] T309768: TypeError: Cannot read properties of undefined (reading 'dailyLimit') - https://phabricator.wikimedia.org/T309768 [13:23:07] kostajh: and done. anything else? [13:23:19] urbanecm: no, thank you very much! [13:23:22] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1002.eqiad.wmnet [13:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:25] happy to help! [13:24:26] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [13:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:11] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1003.eqiad.wmnet [13:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:18] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1057.eqiad.wmnet with reason: host reimage [13:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:10] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [13:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:16] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:27:04] (03CR) 
10Jforrester: "❤️" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/800680 (owner: 10Ladsgroup) [13:27:07] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1003.eqiad.wmnet [13:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb2002.codfw.wmnet [13:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:12] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (10Papaul) p:05Triage→03Medium [13:29:30] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (10Papaul) p:05Triage→03Medium [13:29:48] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10Papaul) p:05Triage→03Medium [13:31:02] (03CR) 10Elukey: [C: 03+2] admin: add ml-team-admins to ores-admin by default [puppet] - 10https://gerrit.wikimedia.org/r/803457 (https://phabricator.wikimedia.org/T310044) (owner: 10Elukey) [13:31:11] 10SRE, 10ops-codfw, 10Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (10Papaul) a:03Papaul [13:31:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2002.codfw.wmnet [13:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:58] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on dse-k8s-worker[1001-1004].eqiad.wmnet with reason: reboots [13:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dse-k8s-worker[1001-1004].eqiad.wmnet with reason: reboots [13:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:29] 10SRE, 
10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (10elukey) 05Open→03Resolved [13:32:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T310011)', diff saved to https://phabricator.wikimedia.org/P29662 and previous config saved to /var/cache/conftool/dbconfig/20220613-133239-marostegui.json [13:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:44] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [13:35:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:00] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [13:39:01] (03PS9) 10Eevans: Configure AQS Cassandra hosts (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) [13:39:18] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:40:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:40:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: cp1089 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T310387 (10SLyngshede-WMF) p:05Triage→03Medium [13:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:38] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1046.eqiad.wmnet [13:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:01] (03PS2) 10David Caro: Use our own alert managing [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805108 (https://phabricator.wikimedia.org/T309789) [13:43:28] (03PS3) 10David Caro: Use our own alert managing [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805108 (https://phabricator.wikimedia.org/T309789) [13:43:59] (03PS1) 10Klausman: ml-staging-codfw: Add override for cert names [deployment-charts] - 10https://gerrit.wikimedia.org/r/805127 (https://phabricator.wikimedia.org/T302195) [13:44:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet [13:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:53] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1046.eqiad.wmnet [13:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:44] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P29663 and previous config saved to /var/cache/conftool/dbconfig/20220613-134744-marostegui.json [13:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:54] (03CR) 10Ori: [C: 03+2] service::docker: refresh service when config file is changed [puppet] - 10https://gerrit.wikimedia.org/r/799420 (owner: 10Ori) [13:48:59] (03PS4) 10Ori: service::docker: refresh service when config file is changed [puppet] - 10https://gerrit.wikimedia.org/r/799420 [13:49:38] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [13:50:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10dcaro) +1 on moving the cloudcephosd hosts, should have no problem as long as it's done one by one. [13:50:44] PROBLEM - Check systemd state on karapace1001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1057.eqiad.wmnet with OS bullseye [13:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:08] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1057.eqiad.wmnet with OS bullseye completed: - ms-be1057 (**PASS**) - Downtim... 
[13:51:21] (03CR) 10MVernon: [C: 03+2] Configure AQS Cassandra hosts (codfw) [puppet] - 10https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) (owner: 10Eevans) [13:52:57] (03CR) 10MVernon: [C: 03+2] Configure AQS Cassandra hosts (codfw) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802604 (https://phabricator.wikimedia.org/T307801) (owner: 10Eevans) [13:53:21] (03CR) 10Elukey: [C: 03+1] "The CI diff looks good to me!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/805127 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:53:26] (03CR) 10CDanis: [C: 03+1] puppetmaster: update private repo pre-commit to error un-staged (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803560 (owner: 10Jbond) [13:54:10] (03CR) 10Klausman: [C: 03+2] ml-staging-codfw: Add override for cert names [deployment-charts] - 10https://gerrit.wikimedia.org/r/805127 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:55:45] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host dumpsdata1007.eqiad.wmnet [13:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [13:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:12] (03Merged) 10jenkins-bot: ml-staging-codfw: Add override for cert names [deployment-charts] - 10https://gerrit.wikimedia.org/r/805127 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [13:58:25] PROBLEM - Host ganeti6003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:58:25] PROBLEM - Host ganeti6001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:58:25] PROBLEM - Host ganeti6004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:58:25] PROBLEM - Host ganeti6002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:58:47] PROBLEM - Host lvs6001.mgmt is DOWN: PING 
CRITICAL - Packet loss = 100% [13:58:47] PROBLEM - Host lvs6002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:58:47] PROBLEM - Host lvs6003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:58:57] PROBLEM - Host asw1-b12-drmrs.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:59:19] ^ expected? [13:59:21] PROBLEM - Host scs-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [13:59:27] PROBLEM - Host dns6001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:59:27] PROBLEM - Host dns6002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:59:35] PROBLEM - Host asw1-b13-drmrs.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:59:41] PROBLEM - Host ps1-b12-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [13:59:45] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 59, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:59:55] (03CR) 10Jcrespo: [C: 03+1] wmnet: Failover m1 and m2 [dns] - 10https://gerrit.wikimedia.org/r/805114 (https://phabricator.wikimedia.org/T310484) (owner: 10Marostegui) [14:00:03] PROBLEM - Host cp6001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:00:09] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:12] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:00:13] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:33] PROBLEM - Router interfaces on mr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.130, interfaces up: 32, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:00:33] RECOVERY - Check systemd state on karapace1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:18] (03CR) 10Ayounsi: Netbox Ganeti sync: add groups support (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: 10Volans) [14:01:31] PROBLEM - Host ps1-b13-drmrs is DOWN: PING CRITICAL - Packet loss = 100% [14:01:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [14:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:06] (03PS1) 10Btullis: Fail back Hive services to an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/805132 (https://phabricator.wikimedia.org/T309526) [14:02:25] PROBLEM - Host cp6002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:02:25] PROBLEM - Host cp6003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:02:25] PROBLEM - Host cp6004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:02:25] PROBLEM - Host cp6005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:02:25] PROBLEM - Host cp6006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:02:26] PROBLEM - Host cp6007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:02:27] PROBLEM - Host cp6008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:02:27] PROBLEM 
- Host cp6009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:02:27] PROBLEM - Host cp6010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:02:28] PROBLEM - Host cp6012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:02:28] PROBLEM - Host cp6011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:02:29] PROBLEM - Host cp6013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:02:29] PROBLEM - Host cp6014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:02:30] PROBLEM - Host cp6015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:02:30] PROBLEM - Host cp6016.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:02:31] PROBLEM - Host cr2-drmrs.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:02:31] PROBLEM - Host cr1-drmrs.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:02:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P29665 and previous config saved to /var/cache/conftool/dbconfig/20220613-140249-marostegui.json [14:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:45] (03CR) 10Btullis: [C: 03+2] Fail back Hive services to an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/805132 (https://phabricator.wikimedia.org/T309526) (owner: 10Btullis) [14:03:46] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:03:54] (03PS2) 10Ori: admin: steal Giuseppe's docker shortcuts [puppet] - 10https://gerrit.wikimedia.org/r/800122 [14:05:24] * TheresNoTime can't *hear* screaming... 
so that must be expected downtime :P [14:05:41] sukhe: this is related to some scheduled maintenance that shouldn't have caused an issue (cc XioNoX) [14:05:50] https://phabricator.wikimedia.org/T310470 (from XioNoX) [14:05:55] thanks jbond [14:06:08] > Since 8am UTC one of the drmrs power feed is down. Only impact is the management router down (and thus the management network). [14:06:13] thanks for the link sukhe :) [14:09:43] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:11:26] (CertManagerCertNotReady) firing: Certificate istio-system/knative-serving is not in a ready state - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady [14:16:49] (03CR) 10Jcrespo: [C: 03+1] mariadb: Promote db1131 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/805121 (https://phabricator.wikimedia.org/T300471) (owner: 10Marostegui) [14:17:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T310011)', diff saved to https://phabricator.wikimedia.org/P29666 and previous config saved to /var/cache/conftool/dbconfig/20220613-141754-marostegui.json [14:17:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1149.eqiad.wmnet with reason: Maintenance [14:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1149.eqiad.wmnet with reason: Maintenance [14:17:58] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [14:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T310011)', diff saved to
https://phabricator.wikimedia.org/P29667 and previous config saved to /var/cache/conftool/dbconfig/20220613-141802-marostegui.json [14:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:46] (JobUnavailable) firing: (2) Reduced availability for job cassandra in analytics@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:46] elukey|klausman is CertManagerCertNotReady on mlstaging expected? Let me know if you need help [14:20:23] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:20:59] PROBLEM - IPMI Sensor Status on cp6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:21:47] RECOVERY - Host ganeti6002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.09 ms [14:21:47] RECOVERY - Host ganeti6003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms [14:21:57] PROBLEM - IPMI Sensor Status on lvs6003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:22:15] RECOVERY - Host lvs6001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.08 ms [14:22:15] RECOVERY - Host lvs6003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.09 ms [14:22:37] jayme: yes, it's expected, I am setting up stuff there now [14:22:37] RECOVERY - Host ganeti6001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.07 ms [14:22:41] RECOVERY - Host dns6001.mgmt 
is UP: PING OK - Packet loss = 0%, RTA = 0.08 ms [14:22:43] PROBLEM - IPMI Sensor Status on dns6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:22:43] ack [14:23:03] RECOVERY - Host cp6001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms [14:23:07] RECOVERY - Host cp6006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms [14:23:35] PROBLEM - IPMI Sensor Status on dns6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:24:59] PROBLEM - IPMI Sensor Status on cp6007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:25:37] RECOVERY - Host cp6014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.08 ms [14:25:41] RECOVERY - Host cp6002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.06 ms [14:25:41] RECOVERY - Host cp6003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.04 ms [14:25:41] RECOVERY - Host cp6005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.04 ms [14:25:41] RECOVERY - Host cp6004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.03 ms [14:25:43] RECOVERY - Host cp6009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms [14:25:43] RECOVERY - Host cp6010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.06 ms [14:25:43] RECOVERY - Host cp6012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.04 ms [14:25:43] RECOVERY - Host cp6013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.06 ms [14:25:43] RECOVERY - Host cp6015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.06 ms [14:25:44] RECOVERY - Host cp6016.mgmt is UP: PING OK - 
Packet loss = 0%, RTA = 0.04 ms [14:25:45] PROBLEM - DPKG on aqs2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:25:45] PROBLEM - AQS root url on aqs2003 is CRITICAL: connect to address 10.192.0.211 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [14:25:46] PROBLEM - AQS root url on aqs2010 is CRITICAL: connect to address 10.192.48.187 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [14:27:35] PROBLEM - IPMI Sensor Status on cp6006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:27:39] urbanecm: ref https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=prev&oldid=1988703&diffmode=source, do you think I could get them on the UTC late deployment? [14:28:05] TheresNoTime: assuming you addressed my -1 from few hours ago, why not? 
:) [14:28:19] PROBLEM - IPMI Sensor Status on lvs6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:28:19] yup, run them through svgo ^^ [14:28:23] PROBLEM - cassandra-a CQL 10.192.0.214:9042 on aqs2001 is CRITICAL: connect to address 10.192.0.214 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:28:23] PROBLEM - cassandra-a CQL 10.192.0.220:9042 on aqs2004 is CRITICAL: connect to address 10.192.0.220 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:28:23] PROBLEM - cassandra-a CQL 10.192.16.186:9042 on aqs2007 is CRITICAL: connect to address 10.192.16.186 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:28:31] PROBLEM - IPMI Sensor Status on cp6010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:29:27] PROBLEM - IPMI Sensor Status on cp6014 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:30:09] PROBLEM - IPMI Sensor Status on cp6004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:30:49] PROBLEM - IPMI Sensor Status on cp6005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:31:03] PROBLEM - 
cassandra-a CQL 10.192.16.183:9042 on aqs2006 is CRITICAL: connect to address 10.192.16.183 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:31:03] PROBLEM - cassandra-a SSL 10.192.0.220:7001 on aqs2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:31:03] PROBLEM - cassandra-a SSL 10.192.0.214:7001 on aqs2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:31:03] PROBLEM - cassandra-a SSL 10.192.16.186:7001 on aqs2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:32:33] btullis: o/ all downtimes expired? [14:32:38] TheresNoTime: great. see you in a few hours in that case :) [14:32:48] (for AQS I mean, the codfw nodes are not up yet right?) 
[14:33:31] PROBLEM - IPMI Sensor Status on cp6009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:33:31] PROBLEM - IPMI Sensor Status on ganeti6003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:33:41] PROBLEM - cassandra-a CQL 10.192.0.218:9042 on aqs2003 is CRITICAL: connect to address 10.192.0.218 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:33:41] PROBLEM - cassandra-a CQL 10.192.16.174:9042 on aqs2005 is CRITICAL: connect to address 10.192.16.174 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:33:41] PROBLEM - cassandra-a SSL 10.192.16.183:7001 on aqs2006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:33:43] PROBLEM - cassandra-a service on aqs2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:34:09] PROBLEM - IPMI Sensor Status on ganeti6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:34:22] (03PS1) 10Klausman: Add inference-staging service IP (10.2.1.58) [puppet] - 10https://gerrit.wikimedia.org/r/805134 [14:34:32] !log klausman@cumin1001 START - Cookbook sre.dns.netbox [14:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:51] PROBLEM - IPMI Sensor Status on cp6013 is CRITICAL: Sensor Type(s) 
Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:35:59] PROBLEM - IPMI Sensor Status on cp6011 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:36:13] PROBLEM - IPMI Sensor Status on cp6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:36:19] PROBLEM - cassandra-b CQL 10.192.0.215:9042 on aqs2001 is CRITICAL: connect to address 10.192.0.215 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:36:19] PROBLEM - cassandra-b CQL 10.192.0.221:9042 on aqs2004 is CRITICAL: connect to address 10.192.0.221 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:36:19] PROBLEM - cassandra-b CQL 10.192.16.187:9042 on aqs2007 is CRITICAL: connect to address 10.192.16.187 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:36:19] PROBLEM - cassandra-a SSL 10.192.16.174:7001 on aqs2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:36:19] PROBLEM - cassandra-a SSL 10.192.0.218:7001 on aqs2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:37:37] PROBLEM - IPMI Sensor Status on cp6012 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:38:21] !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:39] PROBLEM - IPMI Sensor Status on lvs6001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:38:45] (03PS2) 10Klausman: Add nference-staging service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/805134 [14:38:53] PROBLEM - cassandra-b CQL 10.192.16.185:9042 on aqs2006 is CRITICAL: connect to address 10.192.16.185 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:38:53] PROBLEM - cassandra-a CQL 10.192.48.194:9042 on aqs2010 is CRITICAL: connect to address 10.192.48.194 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:38:55] PROBLEM - cassandra-b SSL 10.192.0.215:7001 on aqs2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:38:55] PROBLEM - cassandra-b SSL 10.192.0.221:7001 on aqs2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:38:55] PROBLEM - cassandra-b SSL 10.192.16.187:7001 on aqs2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:39:04] (03PS5) 10Jbond: puppetmaster: update private repo pre-commit to error un-staged [puppet] - 10https://gerrit.wikimedia.org/r/803560 [14:39:17] (03CR) 10Jbond: "fixed thanks" [puppet] - 10https://gerrit.wikimedia.org/r/803560 (owner: 
10Jbond) [14:39:24] (03PS1) 10Klausman: Add inference-staging service IP (10.2.1.58) [dns] - 10https://gerrit.wikimedia.org/r/805135 (https://phabricator.wikimedia.org/T302195) [14:39:49] 10SRE, 10Znuny, 10serviceops, 10Patch-For-Review: refactor OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Arnoldokoth) @Dzahn Yeah, sure. Let me close this now. Thanks. [14:40:02] 10SRE, 10Znuny, 10serviceops, 10Patch-For-Review: refactor OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Arnoldokoth) 05Open→03Resolved [14:40:08] (03PS3) 10Klausman: Add inference-staging service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/805134 [14:41:03] (03PS4) 10Klausman: Add inference-staging service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/805134 [14:41:06] (03PS2) 10Marostegui: wmnet: Failover m1 and m2 [dns] - 10https://gerrit.wikimedia.org/r/805114 (https://phabricator.wikimedia.org/T310484) [14:41:33] PROBLEM - cassandra-b CQL 10.192.0.219:9042 on aqs2003 is CRITICAL: connect to address 10.192.0.219 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:41:33] PROBLEM - cassandra-b CQL 10.192.16.179:9042 on aqs2005 is CRITICAL: connect to address 10.192.16.179 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:41:35] PROBLEM - cassandra-b SSL 10.192.16.185:7001 on aqs2006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:41:35] PROBLEM - cassandra-a SSL 10.192.48.194:7001 on aqs2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:41:35] PROBLEM - cassandra-b service on aqs2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state 
[14:42:03] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01119 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:42:34] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m1 and m2 [dns] - 10https://gerrit.wikimedia.org/r/805114 (https://phabricator.wikimedia.org/T310484) (owner: 10Marostegui) [14:42:35] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:42:53] !log Failover m1 and m2 to a different proxy T310484 [14:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:56] T310484: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 [14:43:16] 10SRE, 10DBA, 10Patch-For-Review, 10Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (10Marostegui) m1 and m2 failed over. 
[14:43:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T310011)', diff saved to https://phabricator.wikimedia.org/P29668 and previous config saved to /var/cache/conftool/dbconfig/20220613-144337-marostegui.json [14:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:41] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [14:43:48] Emperor: I think the puppet failures alert is related to the aqs cassandra issues above, puppet is failing on them [14:43:57] Unable to locate package cassandra-tools-wmf [14:44:00] (03PS1) 10Marostegui: wmnet: Update s6-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/805137 (https://phabricator.wikimedia.org/T300471) [14:44:02] (03PS3) 10Marostegui: mariadb: Promote db1131 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/805121 (https://phabricator.wikimedia.org/T300471) [14:44:09] PROBLEM - AQS root url on aqs2005 is CRITICAL: connect to address 10.192.16.42 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [14:44:09] PROBLEM - cassandra-b SSL 10.192.0.219:7001 on aqs2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:44:09] PROBLEM - cassandra-b SSL 10.192.16.179:7001 on aqs2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:44:14] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover date" [dns] - 10https://gerrit.wikimedia.org/r/805137 (https://phabricator.wikimedia.org/T300471) (owner: 10Marostegui) [14:44:41] PROBLEM - IPMI Sensor Status on cp6016 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS 
Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:44:41] PROBLEM - IPMI Sensor Status on cp6008 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:44:59] PROBLEM - IPMI Sensor Status on cp6015 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:45:27] (03CR) 10Jcrespo: [C: 03+1] wmnet: Update s6-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/805137 (https://phabricator.wikimedia.org/T300471) (owner: 10Marostegui) [14:45:31] PROBLEM - IPMI Sensor Status on cp6003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:46:43] PROBLEM - AQS root url on aqs2007 is CRITICAL: connect to address 10.192.16.169 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [14:46:43] PROBLEM - cassandra-b CQL 10.192.48.195:9042 on aqs2010 is CRITICAL: connect to address 10.192.48.195 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:47:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:47:25] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [14:47:59] PROBLEM - IPMI Sensor Status on ganeti6004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:48:45] PROBLEM - IPMI Sensor Status on ganeti6002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:49:18] (03CR) 10Elukey: "Change looks good, for consistency I recall in the past that people asked to allocated eqiad as well even if not used." [dns] - 10https://gerrit.wikimedia.org/r/805135 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [14:49:21] PROBLEM - AQS root url on aqs2001 is CRITICAL: connect to address 10.192.0.111 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [14:49:23] PROBLEM - AQS root url on aqs2004 is CRITICAL: connect to address 10.192.0.212 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [14:49:23] PROBLEM - cassandra-b SSL 10.192.48.195:7001 on aqs2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:52:08] RECOVERY - Host ganeti6004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.06 ms [14:52:12] RECOVERY - Host lvs6002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.09 ms [14:52:17] (03CR) 10Elukey: [C: 03+1] "Looks good, the change should be a no-op but let's have also another pair of eyes to confirm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/805134 (owner: 10Klausman) [14:52:37] (03PS2) 10Klausman: Add inference-staging service IP (10.2.1.58) [dns] - 10https://gerrit.wikimedia.org/r/805135 (https://phabricator.wikimedia.org/T302195) [14:53:22] (03CR) 10Klausman: Add inference-staging service IP (10.2.1.58) (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/805135 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [14:53:44] (03PS3) 10Klausman: Add inference-staging service IP (10.2.1.58) [dns] - 10https://gerrit.wikimedia.org/r/805135 (https://phabricator.wikimedia.org/T302195) [14:54:15] RECOVERY - Host cp6007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.08 ms [14:54:16] !log klausman@cumin1001 START - Cookbook sre.dns.netbox [14:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:32] RECOVERY - Host cp6008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.04 ms [14:54:32] RECOVERY - Host cp6011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.05 ms [14:55:26] PROBLEM - AQS root url on aqs2012 is CRITICAL: connect to address 10.192.48.189 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [14:56:32] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:58:00] PROBLEM - cassandra-a CQL 10.192.48.192:9042 on aqs2009 is CRITICAL: connect to address 10.192.48.192 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [14:58:38] !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P29669 and 
previous config saved to /var/cache/conftool/dbconfig/20220613-145842-marostegui.json [14:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:08] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:00:37] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1047.eqiad.wmnet [15:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:42] PROBLEM - cassandra-a SSL 10.192.48.192:7001 on aqs2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:01:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:02:35] (03CR) 10Filippo Giunchedi: [C: 03+1] only page for NEL after 5 minutes [alerts] - 10https://gerrit.wikimedia.org/r/804640 (owner: 10CDanis) [15:03:16] PROBLEM - AQS root url on aqs2011 is CRITICAL: connect to address 10.192.48.188 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring [15:03:42] RECOVERY - Host asw1-b12-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 87.45 ms [15:04:04] RECOVERY - IPMI Sensor Status on cp6009 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:04:05] RECOVERY - IPMI Sensor Status on ganeti6003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:04:12] (03CR) 10CDanis: [C: 03+2] only page for NEL after 5 minutes [alerts] - 10https://gerrit.wikimedia.org/r/804640 (owner: 10CDanis) [15:04:15] RECOVERY - Router interfaces on mr1-drmrs is OK: OK: host 185.15.58.130, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:04:30] RECOVERY - Host ps1-b13-drmrs is UP: PING OK - Packet loss = 0%, RTA = 88.13 ms [15:04:42] RECOVERY - IPMI Sensor Status on ganeti6001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:04:42] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1047.eqiad.wmnet [15:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:50] RECOVERY - Host asw1-b13-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 87.41 ms [15:04:55] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [15:05:06] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 60, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:05:28] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:05:44] RECOVERY - Host cr1-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 88.42 ms [15:05:44] RECOVERY - Host cr2-drmrs.mgmt is UP: PING OK - Packet loss = 0%, RTA = 87.47 ms [15:06:18] 
(ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:06:28] RECOVERY - IPMI Sensor Status on cp6013 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:06:34] RECOVERY - IPMI Sensor Status on cp6011 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:06:50] RECOVERY - IPMI Sensor Status on cp6001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:06:54] (03PS1) 10Muehlenhoff: acme_chief: Remove old buster IDP hosts [puppet] - 10https://gerrit.wikimedia.org/r/805140 (https://phabricator.wikimedia.org/T308214) [15:07:29] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (10MoritzMuehlenhoff) [15:08:15] RECOVERY - IPMI Sensor Status on cp6012 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:08:18] RECOVERY - Host scs-drmrs is UP: PING OK - Packet loss = 0%, RTA = 87.60 ms [15:08:20] RECOVERY - Host dns6002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 99.18 ms [15:08:20] RECOVERY - Host ps1-b12-drmrs is UP: PING OK - Packet loss = 0%, RTA = 87.81 ms [15:08:46] (JobUnavailable) firing: (2) Reduced availability for job cassandra in analytics@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:08] (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [15:09:24] RECOVERY - IPMI Sensor Status on lvs6001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:10:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you! LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [15:10:54] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [15:11:54] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T309649) (owner: 10Btullis) [15:12:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host theemin.codfw.wmnet [15:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:35] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [15:13:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P29670 and previous config saved to /var/cache/conftool/dbconfig/20220613-151347-marostegui.json [15:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:57] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Eevans) We may also need to install Buster on these, 
instead of Bullseye (see: https://phabricator.wikimedia.org/T307801#7999033). [15:14:11] (03PS1) 10David Caro: openstack,nova-api-metadata: add harakiri timeout [puppet] - 10https://gerrit.wikimedia.org/r/805166 (https://phabricator.wikimedia.org/T309930) [15:14:33] (03CR) 10David Caro: wmcs: relabel alerts from wmcs cluster with wmcs team (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [15:15:42] RECOVERY - IPMI Sensor Status on cp6008 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:15:42] RECOVERY - IPMI Sensor Status on cp6016 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:15:52] (03PS1) 10Muehlenhoff: Remove secteam-users group [puppet] - 10https://gerrit.wikimedia.org/r/805167 [15:16:02] RECOVERY - IPMI Sensor Status on cp6015 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:16:32] RECOVERY - IPMI Sensor Status on cp6003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:17:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host theemin.codfw.wmnet [15:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:58] (03PS1) 10Muehlenhoff: Remove LDAP access for johnben [puppet] - 10https://gerrit.wikimedia.org/r/805168 [15:19:02] RECOVERY - IPMI Sensor Status on ganeti6004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:19:50] RECOVERY - IPMI Sensor Status on 
ganeti6002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:20:38] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for johnben [puppet] - 10https://gerrit.wikimedia.org/r/805168 (owner: 10Muehlenhoff) [15:21:28] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:21:32] PROBLEM - Host thumbor2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:22:41] RECOVERY - IPMI Sensor Status on lvs6003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:23:03] RECOVERY - IPMI Sensor Status on dns6001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:23:43] RECOVERY - IPMI Sensor Status on dns6002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:24:35] PROBLEM - DNS on cp6016.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.136.128.30 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:25:05] RECOVERY - IPMI Sensor Status on cp6007 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:25:19] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Eevans) [15:25:28] (03CR) 10Ahmon Dancy: [C: 03+1] multiversion: Simplify code and improve documentation (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785308 (owner: 10Krinkle) 
[15:26:16] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Eevans) @Cmjohnson It doesn't look like any of the OS installations succeeded (yet), is it too late to ask for Buster instead? [15:26:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:26:37] RECOVERY - Host thumbor2004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.52 ms [15:27:59] RECOVERY - IPMI Sensor Status on cp6006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:28:11] RECOVERY - Host thumbor2004 is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms [15:28:34] (03CR) 10Muehlenhoff: [C: 03+2] noc: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803943 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:28:45] RECOVERY - IPMI Sensor Status on lvs6002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:28:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T310011)', diff saved to https://phabricator.wikimedia.org/P29671 and previous config saved to /var/cache/conftool/dbconfig/20220613-152852-marostegui.json [15:28:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1148.eqiad.wmnet with reason: Maintenance [15:28:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1148.eqiad.wmnet with reason: Maintenance [15:28:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:58] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [15:28:59] RECOVERY - IPMI Sensor Status on cp6010 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T310011)', diff saved to https://phabricator.wikimedia.org/P29672 and previous config saved to /var/cache/conftool/dbconfig/20220613-152900-marostegui.json [15:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:03] RECOVERY - IPMI Sensor Status on cp6002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:55] RECOVERY - IPMI Sensor Status on cp6014 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:29:58] (03PS2) 10David Caro: openstack,nova-api-metadata: add harakiri timeout [puppet] - 10https://gerrit.wikimedia.org/r/805166 (https://phabricator.wikimedia.org/T309930) [15:30:04] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220613T1530). 
[15:30:37] RECOVERY - IPMI Sensor Status on cp6004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:31:07] (03CR) 10David Caro: "Tested on codfw for validity, not 100% sure it will fix the issues. I might create another check that does an actual request every 5 min t" [puppet] - 10https://gerrit.wikimedia.org/r/805166 (https://phabricator.wikimedia.org/T309930) (owner: 10David Caro) [15:31:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:31:20] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2001.codfw.wmnet with OS buster [15:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:23] RECOVERY - IPMI Sensor Status on cp6005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:34:48] (03CR) 10SBassett: [C: 03+1] "Yeah, as long as nothing else uses this, it should be removed." 
[puppet] - 10https://gerrit.wikimedia.org/r/805167 (owner: 10Muehlenhoff) [15:35:40] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805173 (https://phabricator.wikimedia.org/T128546) [15:36:21] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805173 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:37:52] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805173 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:39:29] PROBLEM - Host thumbor2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:39:58] (03PS3) 10David Caro: openstack,nova-api-metadata: add harakiri timeout [puppet] - 10https://gerrit.wikimedia.org/r/805166 (https://phabricator.wikimedia.org/T309930) [15:40:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:40] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [15:44:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:44:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:31] RECOVERY - Host thumbor2004 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms [15:47:32] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2001.codfw.wmnet with reason: host reimage [15:47:34] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: 
Wikimedia Portals Update: [[gerrit:805173| Bumping portals to master (T128546)]] (duration: 03m 35s) [15:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:39] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:48:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:27] (03PS1) 10Muehlenhoff: Remove tendril leftover [puppet] - 10https://gerrit.wikimedia.org/r/805176 [15:50:14] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2001.codfw.wmnet with reason: host reimage [15:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:02] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:805173| Bumping portals to master (T128546)]] (duration: 03m 27s) [15:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:53:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:56:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:12] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:58:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:58:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:08] (03PS1) 10Clare Ming: Turn off TOC A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805179 (https://phabricator.wikimedia.org/T309683) [16:00:30] (03PS7) 10BCornwall: Traffic: Add PyBal BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) [16:00:55] 10SRE, 10ops-codfw, 10Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (10Dzahn) unfortunately this is purchase date 2016-12-12 .. 
so ...probably can't get it fixed [16:02:09] 10ops-drmrs: drmrs 1/2 power feed down due to maintenance - https://phabricator.wikimedia.org/T310470 (10RobH) all drmrs hosts have gone green in icinga on ipmi checks and mgmt dns (both went red from power removal) [16:02:24] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [16:02:33] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:03:02] (03PS1) 10RobH: Revert "smokeping: Temp mute smokeping for host lvs6001" [puppet] - 10https://gerrit.wikimedia.org/r/805156 [16:03:13] (03PS2) 10RobH: Revert "smokeping: Temp mute smokeping for host lvs6001" [puppet] - 10https://gerrit.wikimedia.org/r/805156 [16:03:34] (03CR) 10BCornwall: [C: 03+2] Traffic: Add PyBal BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [16:03:57] (03CR) 10RobH: [C: 03+2] Revert "smokeping: Temp mute smokeping for host lvs6001" [puppet] - 10https://gerrit.wikimedia.org/r/805156 (owner: 10RobH) [16:04:09] (03PS1) 10Zabe: coal: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805180 (https://phabricator.wikimedia.org/T308013) [16:04:12] (03PS1) 10Zabe: cmd_checklist_runner: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805181 (https://phabricator.wikimedia.org/T308013) [16:04:15] RECOVERY - DNS on cp6016.mgmt is OK: DNS OK: 0.017 seconds response time. 
cp6016.mgmt.drmrs.wmnet returns 10.136.128.30 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:04:18] (03PS1) 10Zabe: cloudnfs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805182 (https://phabricator.wikimedia.org/T308013) [16:04:20] (03PS1) 10Zabe: cloudlib: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805183 (https://phabricator.wikimedia.org/T308013) [16:04:22] (03PS2) 10David Caro: nova: add user to libvirt-qemu [puppet] - 10https://gerrit.wikimedia.org/r/801336 (https://phabricator.wikimedia.org/T309342) [16:04:24] (03PS1) 10Zabe: cinderutils: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805184 (https://phabricator.wikimedia.org/T308013) [16:04:26] (03PS1) 10Zabe: cfssl: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805185 (https://phabricator.wikimedia.org/T308013) [16:04:28] (03PS1) 10Zabe: cergen: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805186 (https://phabricator.wikimedia.org/T308013) [16:04:30] (03PS1) 10Zabe: celery: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805187 (https://phabricator.wikimedia.org/T308013) [16:04:32] (03PS1) 10Zabe: cacheproxy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805188 (https://phabricator.wikimedia.org/T308013) [16:04:34] (03PS1) 10Zabe: burrow: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805189 (https://phabricator.wikimedia.org/T308013) [16:04:36] (03PS1) 10Zabe: bsection: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805190 (https://phabricator.wikimedia.org/T308013) [16:04:38] (03PS1) 10Zabe: bigtop: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805191 (https://phabricator.wikimedia.org/T308013) [16:04:40] (03PS1) 10Zabe: backy2: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805192 (https://phabricator.wikimedia.org/T308013) [16:04:42] (03PS1) 10Zabe: atskafka: Add SPDX headers [puppet] - 
10https://gerrit.wikimedia.org/r/805193 (https://phabricator.wikimedia.org/T308013) [16:04:44] (03PS1) 10Zabe: aqs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805194 (https://phabricator.wikimedia.org/T308013) [16:04:46] (03PS1) 10Zabe: apereo_cas: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805195 (https://phabricator.wikimedia.org/T308013) [16:04:48] (03PS1) 10Zabe: alternatives: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805196 (https://phabricator.wikimedia.org/T308013) [16:04:50] (03PS1) 10Zabe: airflow: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805197 (https://phabricator.wikimedia.org/T308013) [16:05:05] (03CR) 10David Caro: [C: 03+2] openstack,nova-api-metadata: add harakiri timeout [puppet] - 10https://gerrit.wikimedia.org/r/805166 (https://phabricator.wikimedia.org/T309930) (owner: 10David Caro) [16:05:18] (03CR) 10David Caro: [C: 03+2] openstack,nova-api-metadata: add harakiri timeout [puppet] - 10https://gerrit.wikimedia.org/r/805166 (https://phabricator.wikimedia.org/T309930) (owner: 10David Caro) [16:06:17] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:06:19] (03Merged) 10jenkins-bot: Traffic: Add PyBal BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [16:06:43] (03PS5) 10BCornwall: Traffic: Add alert for Varnish child restart [alerts] - 10https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723) [16:07:33] (03PS2) 10Zabe: cmd_checklist_runner: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805181 (https://phabricator.wikimedia.org/T308013) [16:10:16] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) [16:10:34] PROBLEM - Check for VMs leaked by the nova-fullstack test on 
cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [16:11:04] (03PS2) 10Zabe: burrow: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/805189 (https://phabricator.wikimedia.org/T308013) [16:11:43] (03CR) 10David Caro: [C: 03+2] nova: add user to libvirt-qemu [puppet] - 10https://gerrit.wikimedia.org/r/801336 (https://phabricator.wikimedia.org/T309342) (owner: 10David Caro) [16:12:18] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [16:12:25] PROBLEM - IPMI Sensor Status on thumbor2004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:14:37] RECOVERY - DPKG on aqs2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [16:15:51] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:19:31] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2001.codfw.wmnet with OS buster [16:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:41] (03CR) 10CI reject: [V: 04-1] bsection: Add SPDX headers [puppet] - 
10https://gerrit.wikimedia.org/r/805190 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [16:20:41] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:20:45] RECOVERY - cassandra-a SSL 10.192.0.214:7001 on aqs2001 is OK: SSL OK - Certificate aqs2001-a valid until 2024-06-07 14:43:29 +0000 (expires in 724 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:20:45] RECOVERY - cassandra-a service on aqs2001 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:20:57] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [16:21:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:24:02] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Dzahn) A1: serviceops: gitlab2002 is still in state "in setup". While we were going to change that we will hold back until this is done. 
[16:25:05] (03PS2) 10Clare Ming: Turn off TOC A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805179 (https://phabricator.wikimedia.org/T309683) [16:26:19] (03CR) 10Marostegui: [C: 03+1] Remove tendril leftover [puppet] - 10https://gerrit.wikimedia.org/r/805176 (owner: 10Muehlenhoff) [16:28:26] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host aqs2002.codfw.wmnet with OS buster [16:28:28] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging" [puppet] - 10https://gerrit.wikimedia.org/r/805180 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [16:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:57] RECOVERY - cassandra-a CQL 10.192.0.214:9042 on aqs2001 is OK: TCP OK - 0.032 second response time on 10.192.0.214 port 9042 https://phabricator.wikimedia.org/T93886 [16:29:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T310011)', diff saved to https://phabricator.wikimedia.org/P29673 and previous config saved to /var/cache/conftool/dbconfig/20220613-162914-marostegui.json [16:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:20] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [16:31:59] !log Reboot all codfw parsercache hosts T310485 [16:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:09] !log dancy@deploy1002 prep aborted: (duration: 00m 26s) [16:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:15] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti3001.esams.wmnet with OS bullseye [16:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:20] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti3001.esams.wmnet with OS bullseye [16:32:32] !log dbmaint x2@eqiad upgrade and reboot all x2 db hosts T310485 [16:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:41] 10SRE, 10ops-codfw, 10Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (10hnowlan) Thanks for the heads-up @Dzahn. We don't have replacement hardware budgeted because of the planned move to k8s, but we're looking into stopgap options [16:34:41] 10SRE, 10ops-codfw, 10Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (10WDoranWMF) Thanks for this @Dzahn. #platform_engineering are reaching out to #dc-ops to see what our options are. [16:35:00] 10SRE, 10ops-codfw, 10Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (10WDoranWMF) p:05Medium→03High [16:35:04] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) @Dzahn we are not doing rack A1 until maybe the end of the year because we don't have the PDUs yet for that rack; same for A8 [16:36:17] RECOVERY - cassandra-b CQL 10.192.0.215:9042 on aqs2001 is OK: TCP OK - 0.032 second response time on 10.192.0.215 port 9042 https://phabricator.wikimedia.org/T93886 [16:37:08] (03CR) 10JMeybohm: [C: 03+1] "I don't agree on this being a no-op as "inference-staging: [codfw]" is new.
But as that's what the commit message said., +1 from me :)" [puppet] - 10https://gerrit.wikimedia.org/r/805134 (owner: 10Klausman) [16:37:35] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1144.eqiad.wmnet with OS buster [16:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster [16:38:10] (03CR) 10Ssingh: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [16:38:26] !log dancy@deploy1002 prep aborted: (duration: 06m 12s) [16:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:41] RECOVERY - cassandra-b SSL 10.192.0.215:7001 on aqs2001 is OK: SSL OK - Certificate aqs2001-b valid until 2024-06-07 14:43:32 +0000 (expires in 724 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:39:47] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset and Tunilo for Caroline Myrick - https://phabricator.wikimedia.org/T310524 (10CMyrick-WMF) [16:40:25] !log dancy@deploy1002 prep aborted: (duration: 01m 40s) [16:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:40] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset and Tunilo for Caroline Myrick - https://phabricator.wikimedia.org/T310524 (10CMyrick-WMF) [16:41:05] RECOVERY - cassandra-b service on aqs2001 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:42:15] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.0102 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [16:43:11] RECOVERY - IPMI Sensor Status on thumbor2004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:43:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:44:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P29674 and previous config saved to /var/cache/conftool/dbconfig/20220613-164419-marostegui.json [16:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:49] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs2002.codfw.wmnet with reason: host reimage [16:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:11] (03PS1) 10David Caro: openstack: set the nova user groups on virts only [puppet] - 10https://gerrit.wikimedia.org/r/805200 (https://phabricator.wikimedia.org/T309342) [16:47:41] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs2002.codfw.wmnet with reason: host reimage [16:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:49:02] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1144.eqiad.wmnet with reason: 
host reimage [16:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:05] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1145.eqiad.wmnet with OS buster [16:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster [16:49:53] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:50:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:50:26] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35836/console" [puppet] - 10https://gerrit.wikimedia.org/r/805200 (https://phabricator.wikimedia.org/T309342) (owner: 10David Caro) [16:50:50] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti3001.esams.wmnet with reason: host reimage [16:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:13] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10brion) [16:53:09] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1145.eqiad.wmnet with OS buster [16:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:53:42] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1146.eqiad.wmnet with OS buster [16:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster exec... [16:54:01] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1144.eqiad.wmnet with reason: host reimage [16:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster [16:54:28] (03CR) 10David Caro: [V: 03+1 C: 03+2] openstack: set the nova user groups on virts only [puppet] - 10https://gerrit.wikimedia.org/r/805200 (https://phabricator.wikimedia.org/T309342) (owner: 10David Caro) [16:54:31] (03CR) 10David Caro: [V: 03+2 C: 03+2] openstack: set the nova user groups on virts only [puppet] - 10https://gerrit.wikimedia.org/r/805200 (https://phabricator.wikimedia.org/T309342) (owner: 10David Caro) [16:55:04] 10ops-drmrs: drmrs 1/2 power feed down due to maintenance - https://phabricator.wikimedia.org/T310470 (10RobH) all green and maint window end announce sent by drmrs [16:55:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:55:41] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [16:55:42] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti3001.esams.wmnet with reason: host reimage [16:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:13] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1146.eqiad.wmnet with OS buster [16:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:26] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1147.eqiad.wmnet with OS buster [16:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:49] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1048.eqiad.wmnet [16:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:55] RECOVERY - Confd template for /var/lib/gdnsd/discovery-netbox.state on dns3002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:59:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P29675 and previous config saved to /var/cache/conftool/dbconfig/20220613-165925-marostegui.json [16:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster exec... [16:59:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1147.eqiad.wmnet with OS buster [17:00:05] ryankemper: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220613T1700). [17:02:55] (03CR) 10Filippo Giunchedi: [C: 03+1] wmcs: relabel alerts from wmcs cluster with wmcs team (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [17:03:05] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1048.eqiad.wmnet [17:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:37] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1148.eqiad.wmnet with OS buster [17:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1148.eqiad.wmnet with OS buster [17:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:05:43] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host
an-worker1144.eqiad.wmnet with OS buster [17:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1144.eqiad.wmnet with OS buster comp... [17:07:19] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:09:57] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1147.eqiad.wmnet with reason: host reimage [17:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:49] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti3001.esams.wmnet with OS bullseye [17:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:52] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti3001.esams.wmnet with OS bullseye completed: - ganeti3001 (**PASS**) - Downtimed on Icinga/Ale... 
[17:12:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:13:02] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1147.eqiad.wmnet with reason: host reimage [17:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:30] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10RobH) a:05RobH→03MoritzMuehlenhoff ganeti3001 firmware updates bios 2.2.11 to 2.14.2 nic 21.40.22.20 to 21.85.21.92 idrac 3.34.34.34 to 5.10.10.00 Moritz, ganeti3001 firmware updated an... [17:14:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T310011)', diff saved to https://phabricator.wikimedia.org/P29676 and previous config saved to /var/cache/conftool/dbconfig/20220613-171430-marostegui.json [17:14:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1142.eqiad.wmnet with reason: Maintenance [17:14:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1142.eqiad.wmnet with reason: Maintenance [17:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:35] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [17:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T310011)', diff saved to https://phabricator.wikimedia.org/P29677 and previous config saved to /var/cache/conftool/dbconfig/20220613-171438-marostegui.json [17:14:39] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:10] (03PS1) 10Clare Ming: Disable TOC A/B test for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805206 (https://phabricator.wikimedia.org/T309683) [17:16:10] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1148.eqiad.wmnet with reason: host reimage [17:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:37] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 6 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35837/console" [puppet] - 10https://gerrit.wikimedia.org/r/802074 (owner: 10David Caro) [17:18:49] 10SRE, 10ops-codfw, 10Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (10Papaul) 05Open→03Resolved Nothing in the IDRAC log showing any HW issues. I did some firmware upgrade Bios from 2.3.4 to 2.13 IDRAC from 2.63.60.61 to 2.83.83 maybe with the new firmware we can see somet... 
[17:18:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1143.eqiad.wmnet with OS buster [17:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster [17:19:14] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1148.eqiad.wmnet with reason: host reimage [17:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:17] RECOVERY - puppet last run on thumbor2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:21:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:22:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1145.eqiad.wmnet with OS buster [17:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster [17:24:00] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1147.eqiad.wmnet with OS buster [17:24:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:06] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1147.eqiad.wmnet with OS buster comp... [17:26:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:26:55] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=thumbor2004.codfw.wmnet [17:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs2002.codfw.wmnet with OS buster [17:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1146.eqiad.wmnet with OS buster [17:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:01] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster [17:30:18] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1143.eqiad.wmnet with reason: host reimage [17:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:16] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1148.eqiad.wmnet with OS 
buster [17:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:21] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1148.eqiad.wmnet with OS buster comp... [17:33:26] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1143.eqiad.wmnet with reason: host reimage [17:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:24] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1145.eqiad.wmnet with reason: host reimage [17:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:03] (CR) Jdlrobson: [C: +1] Disable TOC A/B test for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/805206 (https://phabricator.wikimedia.org/T309683) (owner: Clare Ming) [17:37:33] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1145.eqiad.wmnet with reason: host reimage [17:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:25] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1146.eqiad.wmnet with reason: host reimage [17:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1146.eqiad.wmnet with reason: host reimage [17:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:07] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1143.eqiad.wmnet with OS buster [17:47:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:13] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1143.eqiad.wmnet with OS buster comp... [17:49:50] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1145.eqiad.wmnet with OS buster [17:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:55] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1145.eqiad.wmnet with OS buster comp... [17:55:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T310011)', diff saved to https://phabricator.wikimedia.org/P29678 and previous config saved to /var/cache/conftool/dbconfig/20220613-175500-marostegui.json [17:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:09] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [17:55:29] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1146.eqiad.wmnet with OS buster [17:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:34] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1146.eqiad.wmnet with OS buster comp... 
[18:02:36] (CR) BCornwall: [C: +2] Traffic: Add alert for Varnish child restart [alerts] - https://gerrit.wikimedia.org/r/804450 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall) [18:05:55] (PS1) Muehlenhoff: Add Brion to contributors [puppet] - https://gerrit.wikimedia.org/r/805214 (https://phabricator.wikimedia.org/T308013) [18:07:30] (CR) Muehlenhoff: [C: +2] Add Brion to contributors [puppet] - https://gerrit.wikimedia.org/r/805214 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [18:10:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P29679 and previous config saved to /var/cache/conftool/dbconfig/20220613-181005-marostegui.json [18:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:42] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1049.eqiad.wmnet [18:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P29680 and previous config saved to /var/cache/conftool/dbconfig/20220613-182510-marostegui.json [18:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:03] SRE, ops-codfw, Thumbor: thumbor2004 is down - https://phabricator.wikimedia.org/T310455 (WDoranWMF) thank you @Papaul. @hnowlan would should review the other machine's state. 
[18:27:14] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (Cmjohnson) Open→Resolved Finally resolved this, had some issues with network ports not being correct [18:40:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T310011)', diff saved to https://phabricator.wikimedia.org/P29681 and previous config saved to /var/cache/conftool/dbconfig/20220613-184015-marostegui.json [18:40:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:40:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:21] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [18:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:51] PROBLEM - Host mc1049 is DOWN: PING CRITICAL - Packet loss = 100% [18:46:35] ^ reboot in progress, weird that it didn't get downtimed? cc arnoldokoth [18:48:05] Yeah, looks like it hasn't come back yet. [18:48:43] oh, so the downtime just expired before it came back up [18:48:56] that's a little weird in itself, it shouldn't take that long [18:49:26] maybe it's running an fsck? [18:49:48] checked console yet, arnoldokoth ? [18:50:02] Checking. 
[18:51:11] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:52:30] SRE, Thumbor, Traffic: Thumbor URLs are too permissive - https://phabricator.wikimedia.org/T310528 (TheDJ) This shouldn't be a problem as long as MediaWiki only generates url fragments that are lowercase (which is what it should be doing). In general, thumbor is a tad more permissive than MediaWiki (... [18:55:20] !log gitlab2002 - rebooting [18:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:01] RECOVERY - Host mc1049 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [19:00:55] \o/ [19:01:10] arnoldokoth: was that you, or did it just come back on its own? [19:01:59] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1049.eqiad.wmnet [19:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1141.eqiad.wmnet with reason: Maintenance [19:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1141.eqiad.wmnet with reason: Maintenance [19:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T310011)', diff saved to https://phabricator.wikimedia.org/P29682 and previous config saved to /var/cache/conftool/dbconfig/20220613-190314-marostegui.json [19:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:19] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [19:04:09] !log gitlab2003 - rebooting [19:04:12] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:48] rzl: i power cycled it from the mgmt console. no idea if that's what fixed it. [19:05:00] cool, sounds like it to me [19:05:22] fwiw, this has happened to me in the past.. every once in a while [19:05:59] like "cookbook asks for reboot but it does not come back and then it seems alright if you powercycle" [19:06:11] afair it was happening with 1 out of 20 mw appservers [19:07:07] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:07:58] !log gerrit2002 - rebooting [19:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:01] (JobUnavailable) firing: Reduced availability for job cassandra in analytics@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:09:19] SRE, LDAP-Access-Requests, Product-Analytics: Requesting access to Superset for Ricardo Baeza-Yates - https://phabricator.wikimedia.org/T310227 (KFrancis) @CDanis There is not an NDA on file. Please provide me with Ricardo Baeza-Yates postal address and I will put the agreement together. Please sen... 
[19:10:35] (PS1) Marcelo1251: Point Wikimedia Enterprise HTML Dumps to trial API features [puppet] - https://gerrit.wikimedia.org/r/805223 (https://phabricator.wikimedia.org/T310075) [19:11:32] !log etherpad - minimal downtime - rebooting etherpad1003 [19:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:06] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on etherpad1003.eqiad.wmnet with reason: kernel upgrade [19:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:09] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on etherpad1003.eqiad.wmnet with reason: kernel upgrade [19:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:27] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:16:27] if there is anyone around able to reimage hosts, who is also bored and looking for something to do, I can help :) [19:16:44] SRE, LDAP-Access-Requests, Product-Analytics: Requesting access to Superset for Ricardo Baeza-Yates - https://phabricator.wikimedia.org/T310227 (leila) @KFrancis I'm not sure if this will make a difference in your recommendation, however, please be aware that Ricardo has signed a contract with WMF an... 
[19:28:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T310011)', diff saved to https://phabricator.wikimedia.org/P29683 and previous config saved to /var/cache/conftool/dbconfig/20220613-192851-marostegui.json [19:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:57] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [19:29:17] SRE, SRE-Access-Requests, LDAP-Access-Requests: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (KFrancis) Hi all, the agreement is out for signatures. Thanks! [19:38:16] SRE, LDAP-Access-Requests, Product-Analytics: Requesting access to Superset for Ricardo Baeza-Yates - https://phabricator.wikimedia.org/T310227 (KFrancis) @leila Thank you! I didn't see Ricardo's name on the contractor list at first, but I checked again and it's there. Thank you for bringing this t... [19:43:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P29684 and previous config saved to /var/cache/conftool/dbconfig/20220613-194356-marostegui.json [19:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P29685 and previous config saved to /var/cache/conftool/dbconfig/20220613-195902-marostegui.json [19:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] RoanKattouw, Urbanecm, and cjming: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220613T2000). 
[20:00:04] TheresNoTime and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] * TheresNoTime is here [20:00:16] * urbanecm waves [20:00:19] o/ [20:00:23] i can deploy [20:00:27] go ahead :) [20:00:51] urbanecm: do those logo patches lgtu? [20:01:11] looks i forgot to +1 [20:01:12] yes :) [20:01:22] (CR) Urbanecm: [C: +1] crhwiki: Add localized mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/800856 (https://phabricator.wikimedia.org/T309431) (owner: Samtar) [20:01:26] (CR) Urbanecm: [C: +1] ugwiki: Add localized mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/800857 (https://phabricator.wikimedia.org/T309431) (owner: Samtar) [20:01:45] (CR) Clare Ming: [C: +2] crhwiki: Add localized mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/800856 (https://phabricator.wikimedia.org/T309431) (owner: Samtar) [20:02:41] (Merged) jenkins-bot: crhwiki: Add localized mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/800856 (https://phabricator.wikimedia.org/T309431) (owner: Samtar) [20:03:56] TheresNoTime: can you check mwdebug1002 for your 1st patch? 
[20:04:01] looking [20:04:41] cjming: lgtm :) [20:04:49] great - syncing [20:06:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:15] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:08:21] !log cjming@deploy1002 Synchronized static/images/mobile/copyright/wikipedia-wordmark-crh.svg: Config: [[gerrit:800856|crhwiki: Add localized mobile wordmark (T309431)]] (duration: 03m 16s) [20:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:24] T309431: Change Wikimedia Wordmark for crhwiki and ugwiki - https://phabricator.wikimedia.org/T309431 [20:08:47] (PS5) Clare Ming: ugwiki: Add localized mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/800857 (https://phabricator.wikimedia.org/T309431) (owner: Samtar) [20:09:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:09:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:29] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:11:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:57] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:800856|crhwiki: Add localized mobile wordmark (T309431)]] (duration: 03m 27s) [20:12:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:09] (CR) Clare Ming: [C: +2] ugwiki: Add localized mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/800857 (https://phabricator.wikimedia.org/T309431) (owner: Samtar) [20:12:23] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1050.eqiad.wmnet [20:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:49] (Merged) jenkins-bot: ugwiki: Add localized mobile wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/800857 (https://phabricator.wikimedia.org/T309431) (owner: Samtar) [20:14:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T310011)', diff saved to https://phabricator.wikimedia.org/P29686 and previous config saved to /var/cache/conftool/dbconfig/20220613-201407-marostegui.json [20:14:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1121.eqiad.wmnet with reason: Maintenance [20:14:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1121.eqiad.wmnet with reason: Maintenance [20:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:12] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [20:14:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [20:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:17] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:20] TheresNoTime: your 1st patch should be live -- 2nd patch is up on mwdebug1002 - can you test? [20:14:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T310011)', diff saved to https://phabricator.wikimedia.org/P29687 and previous config saved to /var/cache/conftool/dbconfig/20220613-201420-marostegui.json [20:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:27] looking [20:15:00] cjming: yup, lgtm as well :) [20:15:11] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:15:12] cool - syncing [20:15:19] ops-eqiad, DC-Ops: Move network on cloudcephosd1021 from cloudsw1-c8-eqiad to cloudsw2-c8-eqiad - https://phabricator.wikimedia.org/T310546 (nskaggs) p:Triage→Low [20:15:21] ops-eqiad, DC-Ops: Move network connections on cloudcephosd1015 from cloudsw1-d5-eqiad to cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T310547 (nskaggs) p:Triage→Low [20:15:57] ops-eqiad, DC-Ops, cloud-services-team (Kanban): Move network connections on cloudcephosd1015 from cloudsw1-d5-eqiad to cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T310547 (nskaggs) [20:16:30] ops-eqiad, DC-Ops, cloud-services-team (Hardware): Move network connections on cloudcephosd1015 from cloudsw1-d5-eqiad to cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T310547 (nskaggs) [20:16:34] ops-eqiad, DC-Ops, cloud-services-team (Hardware): Move network on cloudcephosd1021 from cloudsw1-c8-eqiad to cloudsw2-c8-eqiad - https://phabricator.wikimedia.org/T310546 (nskaggs) [20:16:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:16:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:28] ops-eqiad, DC-Ops, cloud-services-team (Hardware): Move network on cloudcephosd1021 from cloudsw1-c8-eqiad to cloudsw2-c8-eqiad - https://phabricator.wikimedia.org/T310546 (nskaggs) [20:17:34] ops-eqiad, DC-Ops, cloud-services-team (Hardware): Move network connections on cloudcephosd1015 from cloudsw1-d5-eqiad to cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T310547 (nskaggs) [20:17:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:17:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:25] ops-eqiad, DC-Ops, cloud-services-team (Hardware): Recable cloudcephosd1015 from cloudsw1-d5-eqiad to cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T310547 (nskaggs) [20:18:32] ops-eqiad, DC-Ops, cloud-services-team (Hardware): Recable cloudcephosd1021 from cloudsw1-c8-eqiad to cloudsw2-c8-eqiad - https://phabricator.wikimedia.org/T310546 (nskaggs) [20:18:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:57] !log cjming@deploy1002 Synchronized static/images/mobile/copyright/wikipedia-wordmark-ug.svg: Config: [[gerrit:800857|ugwiki: Add localized mobile wordmark (T309431)]] (duration: 03m 36s) [20:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:02] T309431: Change Wikimedia Wordmark for crhwiki and ugwiki - https://phabricator.wikimedia.org/T309431 [20:19:05] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1050.eqiad.wmnet [20:19:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:29] SRE, ops-eqiad, Cloud-Services, DC-Ops, and 2 others: move cloudcephmon1002.eqiad.wmnet from rack B4 to rack D5 - https://phabricator.wikimedia.org/T304096 (nskaggs) [20:19:46] thank you for the deploy cjming :-) [20:19:57] you're welcome! [20:20:08] 2nd patch should be live here shortly [20:20:37] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (nskaggs) Filed T310546 and T310547 to free ports and allow cloudnet1005 and cloudnet1006 connections to cloudsw1*. [20:20:41] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:22:34] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:800857|ugwiki: Add localized mobile wordmark (T309431)]] (duration: 03m 30s) [20:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:42] (CR) Clare Ming: [C: +2] Disable TOC A/B test for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/805206 (https://phabricator.wikimedia.org/T309683) (owner: Clare Ming) [20:23:33] (Merged) jenkins-bot: Disable TOC A/B test for beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/805206 (https://phabricator.wikimedia.org/T309683) (owner: Clare Ming) [20:27:35] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:805206|Disable TOC A/B test for beta cluster (T309683)]] (duration: 03m 29s) [20:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:40] T309683: Turn off table of contents 
A/B test - https://phabricator.wikimedia.org/T309683 [20:28:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:03] !log end of UTC late backport window [20:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:29:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:37] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [20:55:41] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [21:00:04] Reedy, sbassett, Maryum, and manfredi: That opportune time is upon us again. Time for a Weekly Security deployment window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220613T2100). 
[21:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:06:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T310011)', diff saved to https://phabricator.wikimedia.org/P29688 and previous config saved to /var/cache/conftool/dbconfig/20220613-210603-marostegui.json [21:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:10] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [21:21:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P29689 and previous config saved to /var/cache/conftool/dbconfig/20220613-212108-marostegui.json [21:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:39] SRE, ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (Dzahn) ah ACK, ok, in that case we will just move forward as planned. 
Thanks Papaul [21:24:25] (PS3) Ryan Kemper: Revert "elastic: increase recovery time" [cookbooks] - https://gerrit.wikimedia.org/r/784724 (https://phabricator.wikimedia.org/T305994) (owner: Bking) [21:25:05] (PS3) Ryan Kemper: elastic: remove decommissioned hosts in beta [puppet] - https://gerrit.wikimedia.org/r/791666 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [21:25:11] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/791666 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [21:35:09] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS13030/IPv6: Active - Init7, AS13030/IPv4: Active - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:36:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P29690 and previous config saved to /var/cache/conftool/dbconfig/20220613-213613-marostegui.json [21:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:15] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS13030/IPv4: Connect - Init7, AS13030/IPv6: Connect - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:40:37] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS13030/IPv4: Connect - Init7, AS13030/IPv6: Active - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:44:37] !log gitlab-runner1001 - pause from accepting jobs - rebooting [21:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:15] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 159, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:48:40] !log gitlab-runner* - sequentially pausing, rebooting, resuming one by one [21:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:52] 
10SRE, 10SRE-Access-Requests: Requesting access to Analytics for xcollazo - https://phabricator.wikimedia.org/T310555 (10XCollazo-WMF) [21:49:53] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 121, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:51:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T310011)', diff saved to https://phabricator.wikimedia.org/P29691 and previous config saved to /var/cache/conftool/dbconfig/20220613-215118-marostegui.json [21:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:23] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [21:51:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2110.codfw.wmnet with reason: Maintenance [21:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2110.codfw.wmnet with reason: Maintenance [21:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 12 hosts with reason: Maintenance [21:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 12 hosts with reason: Maintenance [21:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:41] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv4: Connect - Init7, AS13030/IPv6: Active - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:53:09] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 
10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:53:49] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv6: Active - Init7, AS13030/IPv4: Connect - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:54:31] PROBLEM - Host gitlab-runner1003 is DOWN: PING CRITICAL - Packet loss = 100% [21:54:34] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Xcollazo - https://phabricator.wikimedia.org/T310385 (10Aklapper) 05Resolved→03Open Not yet fully done per https://wikitech.wikimedia.org/w/index.php?title=SRE%2FLDAP&type=revision&diff=1929377&oldid=1924287 [21:54:47] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:55:32] ACKNOWLEDGEMENT - Host gitlab-runner1003 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn maintenance [21:56:06] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab-runner[1001-1004].eqiad.wmnet with reason: maintenance reboot [21:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:12] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gitlab-runner[1001-1004].eqiad.wmnet with reason: maintenance reboot [21:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:37] RECOVERY - Host gitlab-runner1003 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [22:00:43] RECOVERY - BGP status on cr1-drmrs is OK: BGP OK - up: 23, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:01:51] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 25, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:10:28] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab-runner[2001-2004].codfw.wmnet with reason: maintenance reboot [22:10:30] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:35] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gitlab-runner[2001-2004].codfw.wmnet with reason: maintenance reboot [22:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:38] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics for xcollazo - https://phabricator.wikimedia.org/T310555 (10WDoranWMF) As Xabriel's manager, I approve. [22:15:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance [22:15:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance [22:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T310011)', diff saved to https://phabricator.wikimedia.org/P29692 and previous config saved to /var/cache/conftool/dbconfig/20220613-221522-marostegui.json [22:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:27] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [22:17:38] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:18:17] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:22:38] 10SRE, 10Wikimedia-Mailing-lists, 10Chinese-Sites: Request to create new mailing lists for Chinese Wikipedia Administrators - https://phabricator.wikimedia.org/T310465 (10KirkLU) @SLyngshede-WMF Thank you for doing all 
these for us. [22:25:58] (03PS1) 10BCornwall: Traffic: add varnishkafka delivery error alarms [alerts] - 10https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) [22:30:15] 10SRE, 10Traffic: pontoon.traffic.eqiad1.wikimedia.cloud unable to run puppet agent due to certificate mismatch - https://phabricator.wikimedia.org/T310303 (10BCornwall) @ssingh @KOfori Is there a need/desire to have these three instances around? If so, is there any objection to following the above and termin... [22:30:32] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics for xcollazo - https://phabricator.wikimedia.org/T310555 (10Dzahn) [22:31:17] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics for xcollazo - https://phabricator.wikimedia.org/T310555 (10Dzahn) [22:32:13] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics for xcollazo - https://phabricator.wikimedia.org/T310555 (10Dzahn) Confirming @XCollazo-WMF exists and was introduced in SRE meeting today :) welcome to WMF. Confirmed signature and checked all other boxes. Just one is open for clinic duty. [22:33:57] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Xcollazo - https://phabricator.wikimedia.org/T310385 (10Dzahn) done! added @XCollazo-WMF to https://phabricator.wikimedia.org/tag/wmf-nda/ [22:36:31] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48250 bytes in 5.830 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:37:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 6.563 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:37:52] 10SRE, 10Traffic: pontoon.traffic.eqiad1.wikimedia.cloud unable to run puppet agent due to certificate mismatch - https://phabricator.wikimedia.org/T310303 (10Dzahn) I also saw certificate errors pop up in a different project that uses a local puppetmaster. And we felt like we had not touched anything. 
Did not... [22:38:56] (03PS1) 10BCornwall: Traffic: Reorganize into more, smaller files [alerts] - 10https://gerrit.wikimedia.org/r/805241 [22:40:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T310011)', diff saved to https://phabricator.wikimedia.org/P29693 and previous config saved to /var/cache/conftool/dbconfig/20220613-224014-marostegui.json [22:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:20] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [22:49:19] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 58090 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [22:51:21] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:55:10] 10SRE, 10Traffic: fawiki user reports getting 503 errors with message "upstream connect error or disconnect before headers" - https://phabricator.wikimedia.org/T310450 (10Huji) Likely. But the point about an error message shown which appears to only exist in unit test code is also worth investigating. 
[22:55:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P29694 and previous config saved to /var/cache/conftool/dbconfig/20220613-225519-marostegui.json [22:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:21] 10SRE, 10Shellbox, 10serviceops: Shellbox resource management - https://phabricator.wikimedia.org/T310557 (10RLazarus) [23:03:33] 10SRE, 10Shellbox, 10serviceops: Shellbox resource management - https://phabricator.wikimedia.org/T310557 (10RLazarus) p:05Triage→03Medium [23:09:01] (JobUnavailable) firing: Reduced availability for job cassandra in analytics@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:10:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P29695 and previous config saved to /var/cache/conftool/dbconfig/20220613-231024-marostegui.json [23:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:37] PROBLEM - Check systemd state on gitlab-runner2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens14.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:12:02] ^ that would be me because I rebooted that.. 
not expecting it though [23:12:09] PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:12:55] PROBLEM - BGP status on cr3-knams is CRITICAL: BGP CRITICAL - AS13030/IPv4: Idle - Init7, AS13030/IPv6: Idle - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:14:13] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:16:15] RECOVERY - Check systemd state on gitlab-runner2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:16:30] !log gitlab-runner2001 - systemctl reset-failed to clear alert about failed ifup for ens14 which is actually up. race condition caused by reboot [23:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:21] (03PS1) 10Bartosz Dziewoński: Make new topic tool available as opt-out almost everywhere (phase 4) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805245 (https://phabricator.wikimedia.org/T310392) [23:19:03] 10SRE, 10Shellbox, 10serviceops: Shellbox resource management - https://phabricator.wikimedia.org/T310557 (10Legoktm) Did we determine whether the most recent spike was legitimate user traffic or malicious/DoS? The Abstract Wikipedia team has a proposal somewhere for rendering some fragments async, we could... 
[23:21:19] (03PS11) 10Tim Starling: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) [23:21:31] (03PS10) 10Tim Starling: Clean up scap sequencing workaround for I0cd5dbeab0e6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801836 [23:21:59] 10SRE, 10Shellbox, 10serviceops: Shellbox resource management - https://phabricator.wikimedia.org/T310557 (10Legoktm) Also, one of the Wikisources has some Lua magic that renders each score like 4 times because they're PNGs. I think if we switched to/enabled SVG rendering (T49578) we could cut that down to j... [23:22:06] (03PS2) 10Bartosz Dziewoński: Disable DiscussionTools' visualenhancements feature in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804395 (owner: 10Esanders) [23:22:35] (03CR) 10Bartosz Dziewoński: "Scheduled: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220614T1300" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805245 (https://phabricator.wikimedia.org/T310392) (owner: 10Bartosz Dziewoński) [23:22:38] (03CR) 10Bartosz Dziewoński: "Scheduled: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220614T1300" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804395 (owner: 10Esanders) [23:23:49] RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:24:27] (03CR) 10Tim Starling: [C: 03+2] Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling) [23:24:31] RECOVERY - BGP status on cr3-knams is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:25:30] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T310011)', diff saved to https://phabricator.wikimedia.org/P29696 and previous config saved to /var/cache/conftool/dbconfig/20220613-232529-marostegui.json [23:25:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1147.eqiad.wmnet with reason: Maintenance [23:25:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1147.eqiad.wmnet with reason: Maintenance [23:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:36] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [23:25:38] (03Merged) 10jenkins-bot: Add the master from the primary DC to the secondary DC load arrays [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799685 (https://phabricator.wikimedia.org/T134809) (owner: 10Tim Starling) [23:25:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T310011)', diff saved to https://phabricator.wikimedia.org/P29697 and previous config saved to /var/cache/conftool/dbconfig/20220613-232537-marostegui.json [23:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:35] (03CR) 10Cwhite: [C: 03+1] "LGTM!" 
[alerts] - 10https://gerrit.wikimedia.org/r/803902 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [23:29:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:30:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:54] !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: T134809 g 799685 codfw master DBs (duration: 03m 30s) [23:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:58] T134809: App servers <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 [23:31:04] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/804593 (https://phabricator.wikimedia.org/T309649) (owner: 10Btullis) [23:31:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:11] !log tstarling@deploy1002 Synchronized wmf-config/etcd.php: T134809 g 799685 codfw master DBs (duration: 03m 36s) [23:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:32] (03CR) 10Tim Starling: [C: 03+2] Clean up scap sequencing workaround for I0cd5dbeab0e6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801836 (owner: 10Tim Starling) [23:39:03] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (10KFrancis) -Confirming the NDA has been signed. 
Please proceed with the access request. Thanks! [23:39:20] (03Merged) 10jenkins-bot: Clean up scap sequencing workaround for I0cd5dbeab0e6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801836 (owner: 10Tim Starling) [23:45:26] !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: T134809 g 801836 remove variable wmgDbconfigFromEtcd (duration: 03m 26s) [23:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:32] T134809: App servers <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 [23:46:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [23:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [23:47:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [23:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:08] (03PS2) 10Legoktm: mediawiki: Disable useless mostlinkedcategories update job [puppet] - 10https://gerrit.wikimedia.org/r/804803 (https://phabricator.wikimedia.org/T310456) [23:49:10] (03PS2) 10Legoktm: mediawiki: Remove absented mostlinkedcategories job [puppet] - 10https://gerrit.wikimedia.org/r/804804 [23:50:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [23:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T310011)', diff saved to https://phabricator.wikimedia.org/P29698 and previous config saved to /var/cache/conftool/dbconfig/20220613-235053-marostegui.json [23:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:56] 
T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [23:52:48] (03CR) 10Legoktm: [C: 03+2] mediawiki: Disable useless mostlinkedcategories update job [puppet] - 10https://gerrit.wikimedia.org/r/804803 (https://phabricator.wikimedia.org/T310456) (owner: 10Legoktm) [23:56:36] (03CR) 10Legoktm: [C: 03+2] mediawiki: Remove absented mostlinkedcategories job [puppet] - 10https://gerrit.wikimedia.org/r/804804 (owner: 10Legoktm) [23:57:38] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook