[00:06:45] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:13:13] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:20:38] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:22:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [00:23:05] (PS1) Andrew Bogott: Another attempt to get clouddumps hosts past partman [puppet] - https://gerrit.wikimedia.org/r/804741 [00:25:24] (CR) Andrew Bogott: [C: +2] Another attempt to get clouddumps hosts past partman [puppet] - https://gerrit.wikimedia.org/r/804741 (owner: Andrew Bogott) [00:27:12] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye [00:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [00:34:59] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:43:54] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye [00:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:11] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:05:38] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:07:53] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:16:58] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host clouddumps1001.wikimedia.org with OS bullseye [01:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:17:25] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye [01:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:22:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [01:22:41] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host clouddumps1001.wikimedia.org with OS bullseye [01:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:24:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:43] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye [01:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:31:55] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:32:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [01:36:07] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:43:09] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:43:09] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 
2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage [01:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:46:15] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage [01:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:51:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [01:56:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [01:59:03] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddumps1001.wikimedia.org with OS bullseye [01:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:59:19] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:02:13] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:29:18] (PS1) Andrew Bogott: Possible fixes to hwraid-seconddev.cfg (don't use DOS partitions) [puppet] - https://gerrit.wikimedia.org/r/804744 [02:32:43] (PS2) Andrew Bogott: Possible fixes to 
hwraid-seconddev.cfg (don't use DOS partitions) [puppet] - https://gerrit.wikimedia.org/r/804744 [02:33:06] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:34:48] (CR) Andrew Bogott: [C: +2] Possible fixes to hwraid-seconddev.cfg (don't use DOS partitions) [puppet] - https://gerrit.wikimedia.org/r/804744 (owner: Andrew Bogott) [02:37:52] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1002.wikimedia.org with OS bullseye [02:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:20] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:44:09] (PS1) Andrew Bogott: clouddumps partman: qualify path to hwraid-seconddev [puppet] - https://gerrit.wikimedia.org/r/804746 [02:47:55] (CR) Andrew Bogott: [C: +2] clouddumps partman: qualify path to hwraid-seconddev [puppet] - https://gerrit.wikimedia.org/r/804746 (owner: Andrew Bogott) [02:48:54] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1002.wikimedia.org with OS bullseye [02:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:56:50] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:58:50] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1002.wikimedia.org with OS bullseye [02:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:58:56] !log andrew@cumin1001 END 
(FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1002.wikimedia.org with OS bullseye [02:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:02:13] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1002.wikimedia.org with OS bullseye [03:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:02:19] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1002.wikimedia.org with OS bullseye [03:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:02:30] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:02:34] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1002.wikimedia.org with OS bullseye [03:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:13:32] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at pat [03:13:32] ences[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [03:19:34] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected v [03:19:34] path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [03:20:39] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1002.wikimedia.org with OS bullseye [03:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:07] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1002.wikimedia.org with OS bullseye [03:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:54] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host clouddumps1002.wikimedia.org with OS bullseye [03:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:29:50] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:35:00] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1002.wikimedia.org with OS bullseye [03:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:38:28] PROBLEM - MariaDB Replica Lag: s1 #page on db1099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 37367.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:01:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [04:03:44] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:06:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [04:14:27] db1099 seems to have expired downtime. it's depooled already [04:15:02] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:15:08] re-setting downtime until Monday am [04:20:38] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:20:54] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:29:02] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1002.wikimedia.org with OS bullseye [04:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:48:42] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) 
must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:05:38] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:11:18] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:23:06] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:36:56] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:01:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298560)', diff saved to https://phabricator.wikimedia.org/P29625 and previous config saved to /var/cache/conftool/dbconfig/20220612-060125-ladsgroup.json [06:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:30] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [06:01:41] RECOVERY - MariaDB Replica Lag: s1 #page on db1099 is OK: OK slave_sql_lag Replication lag: 0.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:04:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [06:06:22] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:09:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [06:11:30] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:16:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P29626 and previous config saved to /var/cache/conftool/dbconfig/20220612-061630-ladsgroup.json [06:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:23:14] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:24:40] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: 
Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:31:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P29627 and previous config saved to /var/cache/conftool/dbconfig/20220612-063135-ladsgroup.json [06:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:26] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:46:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298560)', diff saved to https://phabricator.wikimedia.org/P29628 and previous config saved to /var/cache/conftool/dbconfig/20220612-064640-ladsgroup.json [06:46:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [06:46:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [06:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:45] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [06:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220612T0700) [07:06:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:31:28] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:35:18] (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:35:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:36:18] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:36:35] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [07:36:35] (FrontendUnavailable) firing: 
HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [07:36:40] PROBLEM - Apache HTTP on mw1333 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:36:40] PROBLEM - Apache HTTP on mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:36:42] PROBLEM - Apache HTTP on mw1431 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:36:54] PROBLEM - Apache HTTP on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:36:56] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [07:36:58] PROBLEM - Apache HTTP on mw1395 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:36:58] PROBLEM - Apache HTTP on mw1332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:36:58] PROBLEM - Apache HTTP on mw1420 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:36:58] PROBLEM - Apache HTTP on mw1325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:36:58] PROBLEM - Apache HTTP on mw1329 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [07:37:00] PROBLEM - Apache HTTP on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:00] PROBLEM - Apache HTTP on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:06] PROBLEM - Apache HTTP on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:06] PROBLEM - Apache HTTP on mw1364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:06] PROBLEM - Apache HTTP on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:08] PROBLEM - Apache HTTP on mw1455 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:08] PROBLEM - Apache HTTP on mw1433 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:10] PROBLEM - Apache HTTP on mw1403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:12] PROBLEM - Apache HTTP on mw1407 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:12] PROBLEM - Apache HTTP on mw1419 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:12] PROBLEM - Apache HTTP on mw1368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:12] PROBLEM - Apache HTTP on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:12] PROBLEM - Apache HTTP on mw1429 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:14] PROBLEM - Apache HTTP on mw1434 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:14] PROBLEM - Apache HTTP on mw1436 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:14] PROBLEM - Apache HTTP on mw1442 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:14] PROBLEM - Apache HTTP on mw1415 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:17] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:37:30] PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:32] PROBLEM - Apache HTTP on mw1319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:40] PROBLEM - Apache HTTP on mw1409 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:40] PROBLEM - Apache HTTP on mw1393 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:40] PROBLEM - Apache HTTP on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:42] PROBLEM - Apache HTTP on mw1330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [07:37:42] PROBLEM - Apache HTTP on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:46] PROBLEM - Apache HTTP on mw1456 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:46] PROBLEM - Apache HTTP on mw1417 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:46] PROBLEM - Apache HTTP on mw1451 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:56] PROBLEM - Apache HTTP on mw1331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:37:58] PROBLEM - Apache HTTP on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:00] PROBLEM - Apache HTTP on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:00] PROBLEM - Apache HTTP on mw1405 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:02] PROBLEM - Apache HTTP on mw1430 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:02] PROBLEM - Apache HTTP on mw1432 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:04] PROBLEM - Apache HTTP on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:04] PROBLEM - Apache HTTP on mw1414 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:04] PROBLEM - Apache HTTP on mw1326 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:10] PROBLEM - Apache HTTP on mw1397 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:10] PROBLEM - Apache HTTP on mw1452 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:10] PROBLEM - Apache HTTP on mw1399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:20] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:38:24] PROBLEM - Apache HTTP on mw1351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:30] PROBLEM - Apache HTTP on mw1389 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:30] PROBLEM - Apache HTTP on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:30] PROBLEM - Apache HTTP on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:34] PROBLEM - Apache HTTP on mw1418 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:34] PROBLEM - Apache HTTP on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:38] PROBLEM - Apache HTTP on mw1411 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:38] PROBLEM - Apache HTTP on mw1391 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:40] PROBLEM - Apache HTTP on mw1441 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:40] PROBLEM - Apache HTTP on mw1453 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:40] PROBLEM - Apache HTTP on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:40] PROBLEM - Apache HTTP on mw1387 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:44] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is CRITICAL: 0.9726 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [07:38:44] PROBLEM - Apache HTTP on mw1454 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:46] PROBLEM - Apache HTTP on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:48] PROBLEM - Apache HTTP on mw1401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:50] PROBLEM - Apache HTTP on mw1416 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:52] PROBLEM - Apache HTTP on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:38:54] PROBLEM - Apache HTTP on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [07:38:54] PROBLEM - Apache HTTP on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:39:16] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [07:39:18] PROBLEM - Apache HTTP on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:39:20] PROBLEM - Apache HTTP on mw1435 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:39:22] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:39:26] RECOVERY - Apache HTTP on mw1455 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 7.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:26] RECOVERY - Apache HTTP on mw1433 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 7.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:26] RECOVERY - Apache HTTP on mw1403 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 5.931 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:26] PROBLEM - Apache HTTP on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:39:28] RECOVERY - Apache HTTP on mw1429 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 5.177 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [07:39:28] RECOVERY - Apache HTTP on mw1419 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 5.202 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:28] RECOVERY - Apache HTTP on mw1407 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 5.494 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:28] RECOVERY - Apache HTTP on mw1365 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.025 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:28] RECOVERY - Apache HTTP on mw1436 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 3.902 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:29] RECOVERY - Apache HTTP on mw1434 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 4.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:29] RECOVERY - Apache HTTP on mw1415 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 4.157 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:30] RECOVERY - Apache HTTP on mw1442 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 4.420 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:30] RECOVERY - Apache HTTP on mw1368 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 7.396 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:36] PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:39:42] RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:44] RECOVERY - Apache HTTP on mw1319 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.483 second response 
time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:52] RECOVERY - Apache HTTP on mw1393 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:52] RECOVERY - Apache HTTP on mw1409 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:52] RECOVERY - Apache HTTP on mw1370 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.058 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:54] RECOVERY - Apache HTTP on mw1330 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:54] RECOVERY - Apache HTTP on mw1322 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:56] RECOVERY - Apache HTTP on mw1456 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:56] RECOVERY - Apache HTTP on mw1417 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:39:58] RECOVERY - Apache HTTP on mw1451 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:08] RECOVERY - Apache HTTP on mw1331 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:08] RECOVERY - Apache HTTP on mw1352 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:10] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.7258 gt 0.3 
https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [07:40:10] RECOVERY - Apache HTTP on mw1327 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:12] RECOVERY - Apache HTTP on mw1405 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:12] RECOVERY - Apache HTTP on mw1430 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:12] RECOVERY - Apache HTTP on mw1432 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:16] RECOVERY - Apache HTTP on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:16] RECOVERY - Apache HTTP on mw1414 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:16] RECOVERY - Apache HTTP on mw1326 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.080 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:18] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:40:22] RECOVERY - Apache HTTP on mw1397 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:22] RECOVERY - Apache HTTP on mw1399 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.047 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [07:40:22] RECOVERY - Apache HTTP on mw1452 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:34] RECOVERY - Apache HTTP on mw1351 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:40] RECOVERY - Apache HTTP on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:40] RECOVERY - Apache HTTP on mw1389 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:40] RECOVERY - Apache HTTP on mw1366 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:46] RECOVERY - Apache HTTP on mw1413 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:46] RECOVERY - Apache HTTP on mw1418 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:50] RECOVERY - Apache HTTP on mw1411 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:50] RECOVERY - Apache HTTP on mw1391 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:50] RECOVERY - Apache HTTP on mw1441 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:50] RECOVERY - Apache HTTP on mw1367 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [07:40:50] RECOVERY - Apache HTTP on mw1387 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:51] RECOVERY - Apache HTTP on mw1453 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:56] RECOVERY - Apache HTTP on mw1454 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:40:58] RECOVERY - Apache HTTP on mw1324 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:00] RECOVERY - Apache HTTP on mw1401 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:00] RECOVERY - Apache HTTP on mw1416 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:04] RECOVERY - Apache HTTP on mw1323 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:04] RECOVERY - Apache HTTP on mw1349 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:04] RECOVERY - Apache HTTP on mw1354 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:12] RECOVERY - Apache HTTP on mw1373 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:12] RECOVERY - Apache HTTP on mw1333 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.073 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [07:41:14] RECOVERY - Apache HTTP on mw1431 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:18] (ProbeDown) resolved: (25) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:41:26] RECOVERY - Apache HTTP on mw1320 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:28] RECOVERY - Apache HTTP on mw1395 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:28] RECOVERY - Apache HTTP on mw1332 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:28] RECOVERY - Apache HTTP on mw1355 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:30] RECOVERY - Apache HTTP on mw1420 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:30] RECOVERY - Apache HTTP on mw1325 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:30] RECOVERY - Apache HTTP on mw1329 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:32] RECOVERY - Apache HTTP on mw1435 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers 
[07:41:32] RECOVERY - Apache HTTP on mw1321 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:32] RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:35] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [07:41:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [07:41:38] RECOVERY - Apache HTTP on mw1350 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:38] RECOVERY - Apache HTTP on mw1364 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:38] RECOVERY - Apache HTTP on mw1371 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:38] RECOVERY - Apache HTTP on mw1384 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:41:42] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:42:17] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:42:30] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [07:43:02] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:43:16] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:43:24] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [07:43:56] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [07:45:18] (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:45:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:05:58] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:08:48] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:09:02] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 33.98 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:10:07] getting "upstream connect error or disconnect/reset before headers. 
reset reason: overflow" errors right now (UK/Europe) [08:10:53] NotASpy: thank you, we're investigating [08:11:19] (ProbeDown) firing: (15) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:11:19] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:11:22] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 13.93 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:11:30] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:11:32] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [08:11:35] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - 
https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [08:11:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [08:11:46] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 36.99 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:11:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:12:02] PROBLEM - Apache HTTP on mw1454 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:04] PROBLEM - Apache HTTP on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:06] PROBLEM - Apache HTTP on mw1416 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:10] PROBLEM - Apache HTTP on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:10] PROBLEM - Apache HTTP on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:10] PROBLEM - Apache HTTP on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:16] PROBLEM - Apache HTTP on 
mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:16] PROBLEM - Apache HTTP on mw1333 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:18] PROBLEM - Apache HTTP on mw1431 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:27] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:12:30] PROBLEM - Apache HTTP on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:32] PROBLEM - Apache HTTP on mw1395 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:32] PROBLEM - Apache HTTP on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:32] PROBLEM - Apache HTTP on mw1332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:34] PROBLEM - Apache HTTP on mw1420 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:34] PROBLEM - Apache HTTP on mw1329 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:34] PROBLEM - Apache HTTP on mw1325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:36] PROBLEM - Apache HTTP on mw1435 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [08:12:36] PROBLEM - Apache HTTP on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:36] PROBLEM - Apache HTTP on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:42] PROBLEM - Apache HTTP on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:42] PROBLEM - Apache HTTP on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:42] PROBLEM - Apache HTTP on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:42] PROBLEM - Apache HTTP on mw1364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:44] PROBLEM - Apache HTTP on mw1455 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:44] PROBLEM - Apache HTTP on mw1433 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:46] PROBLEM - Apache HTTP on mw1403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:48] PROBLEM - Apache HTTP on mw1385 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:48] PROBLEM - Apache HTTP on mw1429 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:48] PROBLEM - Apache HTTP on mw1407 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:48] PROBLEM - Apache HTTP on mw1419 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:48] PROBLEM - Apache HTTP on mw1368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:49] PROBLEM - Apache HTTP on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:50] PROBLEM - Apache HTTP on mw1434 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:50] PROBLEM - Apache HTTP on mw1436 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:50] PROBLEM - Apache HTTP on mw1415 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:51] PROBLEM - Apache HTTP on mw1442 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:12:58] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:13:06] * Asartea pokes their head in [08:13:08] PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:10] PROBLEM - Apache HTTP on mw1319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:13] Is Wikimedia down for anybody else? 
[08:13:18] PROBLEM - Apache HTTP on mw1409 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:18] PROBLEM - Apache HTTP on mw1393 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:18] PROBLEM - Apache HTTP on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:20] PROBLEM - Apache HTTP on mw1330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:20] PROBLEM - Apache HTTP on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:22] PROBLEM - Apache HTTP on mw1417 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:22] PROBLEM - Apache HTTP on mw1456 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:24] PROBLEM - Apache HTTP on mw1451 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:29] I think the above spam gives a hint Asartea [08:13:34] PROBLEM - Apache HTTP on mw1331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:34] PROBLEM - Apache HTTP on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:36] Also please move to another channel [08:13:38] PROBLEM - Apache HTTP on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:40] PROBLEM - Apache HTTP on mw1405 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:40] PROBLEM - Apache HTTP on 
mw1430 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:40] PROBLEM - Apache HTTP on mw1432 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:44] PROBLEM - Apache HTTP on mw1414 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:44] PROBLEM - Apache HTTP on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:44] PROBLEM - Apache HTTP on mw1326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:48] PROBLEM - Apache HTTP on mw1452 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:48] PROBLEM - Apache HTTP on mw1397 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:48] PROBLEM - Apache HTTP on mw1399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:13:59] PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 233 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Phabricator [08:14:00] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 23.4 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:14:02] PROBLEM - Apache HTTP on mw1351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:14:06] PROBLEM - Apache HTTP on mw1389 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [08:14:06] PROBLEM - Apache HTTP on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:14:06] PROBLEM - Apache HTTP on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:14:10] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [08:14:14] PROBLEM - Apache HTTP on mw1418 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:14:14] PROBLEM - Apache HTTP on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:14:16] PROBLEM - Apache HTTP on mw1391 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:14:16] PROBLEM - Apache HTTP on mw1411 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:14:20] PROBLEM - Apache HTTP on mw1441 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:14:20] PROBLEM - Apache HTTP on mw1387 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:14:20] PROBLEM - Apache HTTP on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:14:20] PROBLEM - Apache HTTP on mw1453 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:14:22] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [08:14:28] PROBLEM - Apache HTTP on mw1401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:14:28] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1433.eqiad.wmnet, mw1365.eqiad.wmnet, mw1419.eqiad.wmnet, mw1442.eqiad.wmnet, mw1434.eqiad.wmnet, mw1366.eqiad.wmnet, mw1432.eqiad.wmnet, mw1349.eqiad.wmnet, mw1384.eqiad.wmnet, mw1387.eqiad.wmnet, mw1430.eqiad.wmnet, mw1415.eqiad.wmnet, mw1405.eqiad.wmnet, mw1329.eqiad.wmnet, mw1320.eqiad.wmnet, mw1399.eqiad.wmnet, mw [08:14:28] ad.wmnet, mw1420.eqiad.wmnet, mw1333.eqiad.wmnet, mw1393.eqiad.wmnet, mw1454.eqiad.wmnet, mw1372.eqiad.wmnet, mw1370.eqiad.wmnet, mw1389.eqiad.wmnet, mw1395.eqiad.wmnet, mw1397.eqiad.wmnet, mw1325.eqiad.wmnet, mw1385.eqiad.wmnet, mw1436.eqiad.wmnet, mw1369.eqiad.wmnet, mw1367.eqiad.wmnet, mw1409.eqiad.wmnet, mw1455.eqiad.wmnet, mw1326.eqiad.wmnet, mw1332.eqiad.wmnet, mw1452.eqiad.wmnet, mw1414.eqiad.wmnet, mw1417.eqiad.wmnet, mw1371.eqiad [08:14:28] mw1453.eqiad.wmnet, mw1322.eqiad.wmnet, mw1355.eqiad.wmnet, mw1323.eqiad.wmnet, mw1327.eqiad.wmnet, mw1413.eqiad.wmnet, mw1456.eqiad.wmnet, mw1351.eqiad.wmnet, mw1391.eqiad.wmnet, mw135 https://wikitech.wikimedia.org/wiki/PyBal [08:14:38] PROBLEM - Debmonitor Health Check on debmonitor.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor [08:14:44] PROBLEM - 
proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a respo [08:14:44] received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [08:14:44] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1389.eqiad.wmnet, mw1414.eqiad.wmnet, mw1417.eqiad.wmnet, mw1371.eqiad.wmnet, mw1365.eqiad.wmnet, mw1455.eqiad.wmnet, mw1453.eqiad.wmnet, mw1442.eqiad.wmnet, mw1323.eqiad.wmnet, mw1434.eqiad.wmnet, mw1432.eqiad.wmnet, mw1385.eqiad.wmnet, mw1349.eqiad.wmnet, mw1384.eqiad.wmnet, mw1327.eqiad.wmnet, mw1387.eqiad.wmnet, mw [08:14:44] ad.wmnet, mw1430.eqiad.wmnet, mw1351.eqiad.wmnet, mw1409.eqiad.wmnet, mw1405.eqiad.wmnet, mw1329.eqiad.wmnet, mw1352.eqiad.wmnet, mw1441.eqiad.wmnet, mw1326.eqiad.wmnet, mw1435.eqiad.wmnet, mw1420.eqiad.wmnet, mw1454.eqiad.wmnet, mw1431.eqiad.wmnet, mw1319.eqiad.wmnet, mw1407.eqiad.wmnet, mw1366.eqiad.wmnet, mw1324.eqiad.wmnet, mw1372.eqiad.wmnet, mw1391.eqiad.wmnet, mw1370.eqiad.wmnet, mw1429.eqiad.wmnet, mw1451.eqiad.wmnet, mw1331.eqiad [08:14:44] mw1418.eqiad.wmnet, mw1321.eqiad.wmnet, mw1401.eqiad.wmnet, mw1325.eqiad.wmnet, mw1373.eqiad.wmnet, mw1411.eqiad.wmnet, mw1369.eqiad.wmnet, mw1367.eqiad.wmnet, mw1399.eqiad.wmnet, mw141 https://wikitech.wikimedia.org/wiki/PyBal [08:14:52] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, 
cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are ma [08:14:52] n but pooled: testlb6_443: Servers cp5009.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:14:52] PROBLEM - PyBal backends health check on lvs5001 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp500 [08:14:52] wmnet are marked down but pooled: testlb6_443: Servers cp5009.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wi [08:15:04] PROBLEM - PyBal backends health check on lvs3007 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3058.esams.wmnet, cp3062.esams.wmnet, cp3052.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: textlb_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3054.esams.wmnet, cp3058.esams.wmnet, cp3062.esams.wmnet, cp3052.esams.wmnet, 
cp3056.esams.wmnet are marked down but pooled [08:15:04] 6_443: Servers cp3050.esams.wmnet, cp3054.esams.wmnet, cp3062.esams.wmnet, cp3064.esams.wmnet, cp3058.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3050.esams.wmnet, cp3054.esams.wmnet, cp3062.esams.wmnet, cp3058.esams.wmnet, cp3052.esams.wmnet, cp3056.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:15:20] PROBLEM - PHP7 rendering on mw1430 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:15:24] PROBLEM - PHP7 rendering on mw1442 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:15:24] PROBLEM - PHP7 rendering on mw1352 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:15:26] PROBLEM - PHP7 rendering on mw1452 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:15:36] PROBLEM - PHP7 rendering on mw1414 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:15:38] PROBLEM - PHP7 rendering on mw1370 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:15:42] PROBLEM - PHP7 rendering on mw1367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:15:42] PROBLEM - PHP7 rendering on mw1329 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:15:44] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad 
on alert1001 is CRITICAL: 0.9726 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [08:15:46] PROBLEM - PHP7 rendering on mw1397 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:15:46] PROBLEM - PHP7 rendering on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:15:48] PROBLEM - PHP7 rendering on mw1431 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:15:48] PROBLEM - PHP7 rendering on mw1384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:15:48] PROBLEM - PHP7 rendering on mw1433 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:15:52] PROBLEM - PHP7 rendering on mw1405 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:00] PROBLEM - PHP7 rendering on mw1332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:00] PROBLEM - PHP7 rendering on mw1331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:06] PROBLEM - PHP7 rendering on mw1415 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:06] PROBLEM - PHP7 rendering on mw1432 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:06] 
PROBLEM - PHP7 rendering on mw1395 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:06] PROBLEM - PHP7 rendering on mw1368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:08] PROBLEM - PHP7 rendering on mw1401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:08] PROBLEM - PHP7 rendering on mw1385 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:08] PROBLEM - PHP7 rendering on mw1434 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:12] PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3054.esams.wmnet, cp3062.esams.wmnet, cp3058.esams.wmnet, cp3052.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: textlb_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3054.esams.wmnet, cp3062.esams.wmnet, cp3064.esams.wmnet, cp3058.esams.wmnet, cp3052.esams.wmnet, cp305 [08:16:12] wmnet are marked down but pooled: testlb6_443: Servers cp3050.esams.wmnet, cp3054.esams.wmnet, cp3058.esams.wmnet, cp3062.esams.wmnet, cp3064.esams.wmnet, cp3056.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3054.esams.wmnet, cp3062.esams.wmnet, cp3064.esams.wmnet, cp3052.esams.wmnet, cp3056.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:16:12] PROBLEM - PHP7 rendering on mw1407 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:12] 
PROBLEM - PHP7 rendering on mw1411 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:16] PROBLEM - PHP7 rendering on mw1389 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:22] PROBLEM - PHP7 rendering on mw1456 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:22] PROBLEM - PHP7 rendering on mw1451 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:28] PROBLEM - PHP7 rendering on mw1441 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:30] PROBLEM - PHP7 rendering on mw1387 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:30] PROBLEM - PHP7 rendering on mw1391 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:30] PROBLEM - PHP7 rendering on mw1453 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:30] PROBLEM - PHP7 rendering on mw1333 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:36] PROBLEM - PHP7 rendering on mw1417 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:36] PROBLEM - PHP7 rendering on mw1399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:36] PROBLEM - PHP7 rendering on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:36] PROBLEM - PHP7 rendering on mw1365 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:36] PROBLEM - PHP7 rendering on mw1354 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:40] PROBLEM - PHP7 rendering on mw1435 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:40] RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:40] PROBLEM - PHP7 rendering on mw1429 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:45] (JobUnavailable) firing: Reduced availability for job swagger_check_restbase_eqsin in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:16:50] PROBLEM - PHP7 rendering on mw1325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:52] PROBLEM - PHP7 rendering on mw1373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:52] PROBLEM - PHP7 rendering on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:52] PROBLEM - PHP7 rendering on mw1351 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:54] PROBLEM - PHP7 rendering on mw1419 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:54] PROBLEM - PHP7 rendering on mw1416 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:58] PROBLEM - PHP7 rendering on mw1330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:58] PROBLEM - PHP7 rendering on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:58] PROBLEM - PHP7 rendering on mw1319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:16:58] PROBLEM - PHP7 rendering on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:17:04] PROBLEM - PHP7 rendering on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:17:08] PROBLEM - PHP7 rendering on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:17:10] PROBLEM - PHP7 rendering on mw1418 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:17:10] PROBLEM - PHP7 rendering on mw1403 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:17:10] PROBLEM - PHP7 rendering on mw1436 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:17:10] PROBLEM - PHP7 rendering on mw1366 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:17:12] PROBLEM - PHP7 rendering on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:17:12] PROBLEM - PHP7 rendering on mw1454 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:17:18] (ProbeDown) firing: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:17:26] PROBLEM - PHP7 rendering on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:17:26] PROBLEM - PHP7 rendering on mw1349 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:17:26] PROBLEM - PHP7 rendering on mw1355 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:17:30] PROBLEM - PHP7 rendering on mw1326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:17:34] PROBLEM - PHP7 rendering on mw1393 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:17:40] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1075.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled [08:17:40] 6_443: Servers cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1081.eqiad.wmnet, cp1083.eqiad.wmnet, cp1085.eqiad.wmnet, cp1087.eqiad.wmnet, cp1075.eqiad.wmnet, cp1079.eqiad.wmnet, cp1077.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:17:40] PROBLEM - PHP7 rendering on mw1420 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:17:40] PROBLEM - PHP7 rendering on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:17:42] PROBLEM - PHP7 rendering on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:17:42] PROBLEM - PHP7 rendering on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:17:58] PROBLEM - PHP7 rendering on mw1364 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:18:00] PROBLEM - PHP7 rendering on mw1455 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:18:04] PROBLEM - check updates on en.planet.wikimedia.org on en.planet.wikimedia.org is CRITICAL: CRITICAL - exception while fetching the URL. 503 Server Error: Service Unavailable for url: https://en.planet.wikimedia.org/ https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [08:18:06] PROBLEM - PHP7 rendering on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:18:14] PROBLEM - PHP7 rendering on mw1409 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:18:16] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:18:30] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [08:18:41] PROBLEM - wiki content on commons #page on commons.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string Picture of the day not found on https://commons.wikimedia.org:443/wiki/Main_Page - 233 bytes in 0.004 second response time https://phabricator.wikimedia.org/project/view/1118/ [08:18:44] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:18:50] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [08:18:51] RECOVERY - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 39581 bytes in 6.462 second response time https://wikitech.wikimedia.org/wiki/Phabricator [08:19:08] 10SRE, 10Traffic, 10Wikimedia-Incident: Unable to view all Wikimedia projects - https://phabricator.wikimedia.org/T310431 (10Bugreporter) See also: https://grafana.wikimedia.org/d/000000170/wikidata-edits [08:19:14] RECOVERY - Debmonitor Health Check on debmonitor.wikimedia.org is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 1634 bytes in 0.157 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [08:19:22] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [08:19:36] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:19:36] RECOVERY - PyBal backends health check on lvs5001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:19:52] RECOVERY - Apache HTTP on mw1433 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 9.795 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:20:02] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:20:22] RECOVERY - PHP7 rendering on mw1455 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 
8.524 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:20:26] RECOVERY - check updates on en.planet.wikimedia.org on en.planet.wikimedia.org is OK: OK - Website content is current (1049 = 86400) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [08:20:50] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 70.98 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:21:45] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:21:50] (JobUnavailable) firing: (2) Reduced availability for job swagger_check_restbase_eqsin in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:21:57] PROBLEM - MariaDB Replica SQL: s8 #page on db1111 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:22:06] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [08:22:46] RECOVERY - Apache HTTP on mw1409 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 9.791 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:22:50] RECOVERY - MegaRAID on analytics1068 is OK: OK: 
optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:22:50] RECOVERY - PHP7 rendering on mw1397 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 9.754 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:22:52] RECOVERY - Apache HTTP on mw1451 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 8.965 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:23:06] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [08:23:10] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 91.28 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:23:14] RECOVERY - Apache HTTP on mw1326 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 9.823 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:23:16] RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:23:16] RECOVERY - PHP7 rendering on mw1411 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 9.819 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:23:25] RECOVERY - wiki content on commons #page on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 191183 bytes in 0.013 second response time https://phabricator.wikimedia.org/project/view/1118/ [08:23:42] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [08:23:45] PROBLEM - MariaDB Replica IO: s8 #page on db1111 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:24:16] PROBLEM - 
Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.6129 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [08:24:27] RECOVERY - MariaDB Replica SQL: s8 #page on db1111 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:25:07] 10SRE, 10Traffic, 10Wikimedia-Incident: Unable to view all Wikimedia projects - https://phabricator.wikimedia.org/T310431 (10GhostInTheMachine) If you report this error to the Wikimedia System Administrators, please include the details below. Request from 2.98.121.99 via cp3064 cp3064, Varnish XID 552143256... [08:25:36] PROBLEM - Apache HTTP on mw1433 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:25:50] (JobUnavailable) resolved: Reduced availability for job swagger_check_restbase_esams in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:25:58] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [08:25:58] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 73.87 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:26:04] RECOVERY - PHP7 rendering on mw1399 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 8.139 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:04] RECOVERY - Apache HTTP on mw1413 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 8.483 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:04] RECOVERY - PHP7 rendering on mw1417 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 8.491 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:04] RECOVERY - Apache HTTP on mw1418 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 8.800 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:04] RECOVERY - Apache HTTP on mw1391 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 7.165 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:05] RECOVERY - Apache HTTP on mw1441 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.214 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:05] RECOVERY - PHP7 rendering on mw1327 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 9.186 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:06] RECOVERY - Apache HTTP on mw1453 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.780 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [08:26:06] RECOVERY - PHP7 rendering on mw1429 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 6.776 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:07] RECOVERY - Apache HTTP on mw1387 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 6.831 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:07] RECOVERY - Apache HTTP on mw1411 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 7.996 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:08] RECOVERY - PHP7 rendering on mw1435 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 6.882 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:08] RECOVERY - MariaDB Replica IO: s8 #page on db1111 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:26:09] RECOVERY - Apache HTTP on mw1454 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 3.224 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:10] RECOVERY - Apache HTTP on mw1401 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 1.194 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:12] RECOVERY - Apache HTTP on mw1416 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.879 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:14] RECOVERY - PHP7 rendering on mw1325 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 2.467 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:14] RECOVERY - Apache HTTP on mw1324 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 4.947 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:16] RECOVERY - Apache HTTP on mw1349 is OK: HTTP OK: HTTP/1.1 302 
Found - 547 bytes in 1.968 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:16] RECOVERY - Apache HTTP on mw1354 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 1.967 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:16] RECOVERY - PHP7 rendering on mw1373 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 2.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:16] RECOVERY - PHP7 rendering on mw1416 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:16] RECOVERY - PHP7 rendering on mw1419 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:17] RECOVERY - PHP7 rendering on mw1321 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 2.248 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:17] RECOVERY - Apache HTTP on mw1323 is OK: HTTP OK: HTTP/1.1 302 Found - 547 bytes in 2.605 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:18] RECOVERY - PHP7 rendering on mw1351 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 2.763 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:18] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:26:20] RECOVERY - PHP7 rendering on mw1322 is OK: HTTP OK: HTTP/1.1 302 Found - 560 bytes in 0.564 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:20] RECOVERY - PHP7 rendering on mw1323 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 1.186 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:20] RECOVERY - PHP7 rendering on mw1330 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 1.515 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:20] RECOVERY - PHP7 rendering on mw1319 is OK: HTTP OK: HTTP/1.1 302 Found - 561 bytes in 1.597 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:22] RECOVERY - Apache HTTP on mw1373 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.247 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:22] RECOVERY - Apache HTTP on mw1333 is OK: HTTP OK: HTTP/1.1 302 Found - 546 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:24] RECOVERY - Apache HTTP on mw1431 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:24] RECOVERY - PHP7 rendering on mw1371 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:30] RECOVERY - PHP7 rendering on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:30] RECOVERY - PHP7 rendering on mw1403 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:30] RECOVERY - PHP7 rendering on mw1366 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:30] RECOVERY - PHP7 rendering on mw1418 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.054 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:30] RECOVERY - PHP7 rendering on mw1436 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:34] RECOVERY - PHP7 rendering on mw1413 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:34] RECOVERY - PHP7 rendering on mw1454 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:34] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:26:34] RECOVERY - Apache HTTP on mw1320 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:38] RECOVERY - Apache HTTP on mw1355 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:38] RECOVERY - Apache HTTP on mw1332 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:38] RECOVERY - Apache HTTP on mw1395 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:38] RECOVERY - Apache HTTP on mw1420 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:38] RECOVERY - Apache HTTP on mw1325 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:39] RECOVERY - Apache HTTP on mw1329 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 
0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:40] RECOVERY - Apache HTTP on mw1321 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:40] RECOVERY - Apache HTTP on mw1435 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:40] RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:48] RECOVERY - Apache HTTP on mw1384 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:48] RECOVERY - Apache HTTP on mw1350 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:48] RECOVERY - PHP7 rendering on mw1349 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:48] RECOVERY - PHP7 rendering on mw1355 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:48] RECOVERY - PHP7 rendering on mw1324 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:50] RECOVERY - Apache HTTP on mw1455 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:50] RECOVERY - Apache HTTP on mw1364 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:50] RECOVERY - Apache HTTP on mw1371 is OK: HTTP OK: 
HTTP/1.1 302 Found - 545 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:50] RECOVERY - Apache HTTP on mw1403 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:51] RECOVERY - PHP7 rendering on mw1326 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:26:54] RECOVERY - PyBal backends health check on lvs3007 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:26:56] RECOVERY - Apache HTTP on mw1429 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:56] RECOVERY - Apache HTTP on mw1385 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:56] RECOVERY - Apache HTTP on mw1407 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:56] RECOVERY - Apache HTTP on mw1434 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:56] RECOVERY - Apache HTTP on mw1419 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:57] RECOVERY - Apache HTTP on mw1368 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:57] RECOVERY - Apache HTTP on mw1365 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:58] RECOVERY - Apache HTTP on mw1442 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response 
time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:58] RECOVERY - Apache HTTP on mw1436 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:59] RECOVERY - Apache HTTP on mw1415 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:26:59] RECOVERY - PHP7 rendering on mw1393 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:02] RECOVERY - PHP7 rendering on mw1420 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:02] RECOVERY - PHP7 rendering on mw1430 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:02] RECOVERY - PHP7 rendering on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:04] RECOVERY - PHP7 rendering on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:04] RECOVERY - PHP7 rendering on mw1320 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:06] RECOVERY - PHP7 rendering on mw1442 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:06] RECOVERY - PHP7 rendering on mw1352 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.043 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:08] RECOVERY - PHP7 rendering on mw1452 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:14] RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:27:18] RECOVERY - Apache HTTP on mw1319 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:27:20] RECOVERY - PHP7 rendering on mw1414 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:20] RECOVERY - PHP7 rendering on mw1364 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:22] RECOVERY - PHP7 rendering on mw1370 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:23] 10SRE, 10Traffic, 10Wikimedia-Incident: Unable to view all Wikimedia projects - https://phabricator.wikimedia.org/T310431 (10MdsShakil) Looks like it's okay at the moment, I can see everything [08:27:24] RECOVERY - Apache HTTP on mw1370 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:27:24] RECOVERY - PHP7 rendering on mw1367 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:24] RECOVERY - Apache HTTP on mw1393 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.056 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [08:27:24] RECOVERY - PHP7 rendering on mw1329 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:28] RECOVERY - Apache HTTP on mw1330 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:27:28] RECOVERY - PHP7 rendering on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:28] RECOVERY - PHP7 rendering on mw1350 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:28] RECOVERY - Apache HTTP on mw1322 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:27:30] RECOVERY - Apache HTTP on mw1456 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:27:30] RECOVERY - Apache HTTP on mw1417 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:27:30] RECOVERY - PHP7 rendering on mw1431 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:30] RECOVERY - PHP7 rendering on mw1433 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:30] RECOVERY - PHP7 rendering on mw1384 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering 
[08:27:34] RECOVERY - PHP7 rendering on mw1409 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:36] RECOVERY - PHP7 rendering on mw1405 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:42] RECOVERY - Apache HTTP on mw1331 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:27:42] RECOVERY - Apache HTTP on mw1352 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:27:42] RECOVERY - PHP7 rendering on mw1331 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:42] RECOVERY - Apache HTTP on mw1327 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:27:42] RECOVERY - PHP7 rendering on mw1332 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:44] RECOVERY - Apache HTTP on mw1432 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:27:44] RECOVERY - Apache HTTP on mw1405 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:27:44] RECOVERY - Apache HTTP on mw1430 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:27:50] RECOVERY - PHP7 rendering on mw1395 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.029 second 
response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:50] RECOVERY - PHP7 rendering on mw1368 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:50] RECOVERY - PHP7 rendering on mw1415 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:50] RECOVERY - PHP7 rendering on mw1432 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:50] RECOVERY - Apache HTTP on mw1433 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:27:51] RECOVERY - Apache HTTP on mw1414 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:27:51] RECOVERY - Apache HTTP on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:27:52] RECOVERY - PHP7 rendering on mw1385 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:52] RECOVERY - PHP7 rendering on mw1401 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:53] RECOVERY - PHP7 rendering on mw1434 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:54] RECOVERY - PHP7 rendering on mw1407 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.046 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:27:56] RECOVERY - Apache HTTP on mw1452 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:27:56] RECOVERY - Apache HTTP on mw1397 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:27:56] RECOVERY - Apache HTTP on mw1399 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:28:00] RECOVERY - PHP7 rendering on mw1389 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:28:04] RECOVERY - PHP7 rendering on mw1456 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:28:04] RECOVERY - PHP7 rendering on mw1451 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:28:06] RECOVERY - Apache HTTP on mw1351 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:28:10] RECOVERY - PHP7 rendering on mw1441 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:28:12] RECOVERY - Apache HTTP on mw1389 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:28:12] RECOVERY - Apache HTTP on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:28:12] RECOVERY - Apache HTTP on 
mw1366 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:28:14] RECOVERY - PHP7 rendering on mw1453 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:28:14] RECOVERY - PHP7 rendering on mw1391 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:28:14] RECOVERY - PHP7 rendering on mw1387 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:28:14] RECOVERY - PHP7 rendering on mw1333 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:28:20] RECOVERY - PHP7 rendering on mw1354 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:28:20] RECOVERY - PHP7 rendering on mw1365 is OK: HTTP OK: HTTP/1.1 302 Found - 559 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:28:22] RECOVERY - Apache HTTP on mw1367 is OK: HTTP OK: HTTP/1.1 302 Found - 545 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:29:00] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.03226 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [08:29:30] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:29:34] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:29:54] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.0137 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [08:30:16] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 11.76 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:30:16] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 19.31 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:30:26] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [08:30:38] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 49.43 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:31:19] (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:31:19] (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:31:35] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [08:31:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [08:32:23] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:32:46] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 8.978 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:35:04] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [08:35:04] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:35:14] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 73.27 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:35:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [08:37:08] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:37:10] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 76.78 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [08:40:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [08:44:30] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:51:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:51:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:05:39] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:07:31] 10SRE, 10Traffic, 10Wikimedia-Incident: Unable to view all Wikimedia projects - 
https://phabricator.wikimedia.org/T310431 (fgiunchedi) p:Unbreak!→Medium Thanks folks, there was indeed widespread unavailability to all sites. We're back now so I'm lowering the severity, there will be followups as well [09:08:33] (CR) JMeybohm: [C: +1] service::docker: refresh service when config file is changed [puppet] - https://gerrit.wikimedia.org/r/799420 (owner: Ori) [09:11:12] SRE, Traffic, Wikimedia-Incident: Unable to view all Wikimedia projects - https://phabricator.wikimedia.org/T310431 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff The immediate issue has been resolved, closing. There are some actionables, but rather sub tasks to existing tasks and... [09:14:38] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.
returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for [09:14:38] ource sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [09:30:44] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:04:22] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected v [10:04:22] path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [10:08:46] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at pat [10:08:46] ences[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [10:48:30] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for [10:48:30] ource sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [11:00:22] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:34:16] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:13:54] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:19:28] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring 
[12:20:38] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:42:58] PROBLEM - Host thumbor2004 is DOWN: PING CRITICAL - Packet loss = 100% [12:48:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:50:19] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:51:04] * godog sighs [12:51:16] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:51:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [12:51:49] it is already recovered btw [12:53:09] cpu spike in shellbox [12:53:19] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:55:18] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:04:40] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:05:39] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:15:10] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:16:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Revision table maint [13:16:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Revision table maint [13:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:50] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to 
target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given sour [13:42:50] ons) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [13:44:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:47:28] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. 
returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for [13:47:28] ource sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [13:49:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:51:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:56:19] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:01:00] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:21:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:22:53] (03PS1) 10Andrew Bogott: clouddumps: remove partman/hwraid-seconddev.cfg from partman request [puppet] - 
10https://gerrit.wikimedia.org/r/804758 [14:24:45] (03CR) 10Andrew Bogott: [C: 03+2] clouddumps: remove partman/hwraid-seconddev.cfg from partman request [puppet] - 10https://gerrit.wikimedia.org/r/804758 (owner: 10Andrew Bogott) [14:25:34] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1002.wikimedia.org with OS bullseye [14:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:27:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:34:52] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:36:33] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:36:44] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 55.21 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [14:36:48] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddumps1002.wikimedia.org with reason: host reimage [14:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:06] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 71.93 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [14:39:56] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddumps1002.wikimedia.org with reason: host reimage [14:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:56] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:47:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:52:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:52:29] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddumps1002.wikimedia.org with OS bullseye [14:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:57:33] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:21:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:26:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:35:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:40:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:54:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:55:17] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) is CRITICAL: Test Machine translate an HTML fragment using TestClient, adapt the links to target language wiki. returned the unexpected status 500 (expecting: 200): /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for [15:55:17] ource sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [15:59:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:15:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:16:03] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:18:51] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:20:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:20:39] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:28:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:33:33] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:38:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:45:09] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:48:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes 
(http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:48:57] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [16:51:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [16:53:23] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [17:05:39] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:17:13] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:18:45] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:37:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:42:18] (ProbeDown) resolved: Service thumbor:8800 has 
failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:46:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:47:33] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:54:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:56:47] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.299 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:03:08] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1002.wikimedia.org with OS bullseye [18:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:14:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:14:34] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddumps1002.wikimedia.org with reason: host reimage [18:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:08] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddumps1002.wikimedia.org with reason: host reimage [18:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:26:04] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:29:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:31:23] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddumps1002.wikimedia.org with OS bullseye [18:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[18:39:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:40:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:44:33] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:51:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:56:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:04:19] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:05:24] (03CR) 10Andrew Bogott: [C: 03+2] put clouddumps100[12] into service 
[puppet] - https://gerrit.wikimedia.org/r/802600 (https://phabricator.wikimedia.org/T309346) (owner: Andrew Bogott) [19:05:30] (PS5) Andrew Bogott: put clouddumps100[12] into service [puppet] - https://gerrit.wikimedia.org/r/802600 (https://phabricator.wikimedia.org/T309346) [19:07:41] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:09:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:13:30] (PS1) Andrew Bogott: clouddumps100x: create hiera host files [puppet] - https://gerrit.wikimedia.org/r/804767 [19:14:42] (CR) Andrew Bogott: [C: +2] clouddumps100x: create hiera host files [puppet] - https://gerrit.wikimedia.org/r/804767 (owner: Andrew Bogott) [19:22:57] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:27:17] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:30:01] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:32:29] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:53:07]
RECOVERY - MariaDB Replica Lag: s1 on clouddb1013 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:02:41] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) is WARNING: Test Suggest target section titles for given source sections responds with unexpected value at path /References[0] = {type: Buffer, data: [82, 101, 102, 101, 114, 101, 110, 99, 105, 97, 115]} https://wikitech.wikimedia.org/wiki/CX [20:05:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:10:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:12:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:17:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:20:39] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance 
cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:23:07] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:23:15] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (Andrew) Open→Resolved I have these hosts partitioned now (sdb by hand) so closing this task. Thanks for your help papaul! [20:26:07] SRE, ops-eqiad, Infrastructure-Foundations, netops, cloud-services-team (Kanban): hdfs client packages for debian Bullseye - https://phabricator.wikimedia.org/T310451 (Andrew) [20:28:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:33:33] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:38:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:45:45] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:46:18] (ProbeDown) firing: Service
thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:46:45] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:51:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1004:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [20:51:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:58:53] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:05:39] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:07:19] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:09:19] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:12:19] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:13:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:17:33] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:23:19] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:33:19] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[21:34:43] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:36:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:41:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:58:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:00:09] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:03:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:08:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:13:18] (ProbeDown) resolved: Service thumbor:8800 has failed 
probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:30:05] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:35:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:39:43] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:40:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:47:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:51:01] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:52:18] (ProbeDown) resolved: Service thumbor:8800 has 
failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:12:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:17:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:20:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:24:59] PROBLEM - MegaRAID on analytics1068 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:25:33] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:40:51] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - 
free space: /srv 59311 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [23:58:57] RECOVERY - MegaRAID on analytics1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring