[01:25:40] FIRING: [2x] LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[01:30:40] RESOLVED: [2x] LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[08:40:18] Hi again, I know there was a change to thanos retention, but disk usage doesn't seem to have been impacted, and is still growing - https://grafana.wikimedia.org/goto/of1PBnGHg?orgId=1
[09:06:57] I notice in the past SAL entries suggesting further action to delete older blocks e.g. https://phabricator.wikimedia.org/T351927#10170614
[09:07:27] ...because the compactor isn't running automatically ATM, so needs doing manually
[09:08:33] So can you delete these older blocks, please?
[13:51:25] Emperor: bucket cleanup job is running now
[13:55:29] 👍
[14:49:28] Emperor: Cleanup finished - looks like we freed up about 1TB which might not be enough?
[14:51:53] cwhite: Mmm, the disks are still full enough to be alerting :(
[14:59:17] Ok. Instructions left for us were to keep whittling down 5m retention to a lower bound of "a few weeks". I'll assume we can bring 5m to 3w retention. After that, we'll have to start dipping into 1h and raw data retention.
[15:03:13] I really am hoping the new hardware will be here soon :-/
[15:04:07] Let's see what it's like once you've run the cleanup again?
[15:47:43] I'm not sure it's making much of a dent. :/
[15:48:33] I wonder if we should trim up the raw retention instead. What do you think?
[15:49:24] What's weird is after setting the 5m retention down another 5 weeks, the bucket cleanup command doesn't remove much.
[16:01:43] FIRING: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-mw-accesslog-metrics - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[16:06:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in logging-eqiad for group benthos-mw-accesslog-metrics - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-mw-accesslog-metrics - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[17:32:48] * kamila_ ^ not sure what the spike was about but looks happy now, will follow up if it happens again
[17:34:31] kamila_: thanks!
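
For reference, a minimal sketch of the manual retention-and-cleanup pass discussed above, using the upstream "thanos tools bucket" subcommands. This is an assumption-laden illustration, not the exact procedure used here: the objstore config path and the 1h/raw retention values are placeholders, only the "bring 5m down to 3w" figure comes from the conversation, and the delete-delay override is an assumption.

    #!/usr/bin/env python3
    """Sketch: apply reduced retention, then delete the marked blocks by hand,
    since the compactor (which would normally do the deletion) is not running.
    Paths and most retention values below are placeholders, not production settings."""
    import subprocess

    OBJSTORE_CONFIG = "/etc/thanos/objstore.yaml"  # placeholder path


    def run(args):
        """Echo a command, then run it, failing loudly on error."""
        print("+", " ".join(args))
        subprocess.run(args, check=True)


    # Step 1: mark blocks outside the retention window for deletion.
    # "5m resolution down to 3w" is from the conversation; 1h and raw are placeholders
    # (0d disables retention for that resolution, i.e. keep everything).
    run([
        "thanos", "tools", "bucket", "retention",
        f"--objstore.config-file={OBJSTORE_CONFIG}",
        "--retention.resolution-5m=3w",
        "--retention.resolution-1h=0d",   # placeholder
        "--retention.resolution-raw=0d",  # placeholder
    ])

    # Step 2: delete blocks carrying a deletion mark (the "bucket cleanup job").
    run([
        "thanos", "tools", "bucket", "cleanup",
        f"--objstore.config-file={OBJSTORE_CONFIG}",
        "--delete-delay=0s",  # assumption: delete immediately instead of after the default delay
    ])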