[08:30:03] FIRING: ErrorBudgetBurn: logging - logstash-availability - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[08:40:03] RESOLVED: ErrorBudgetBurn: logging - logstash-availability - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:18:03] FIRING: ErrorBudgetBurn: logging - logstash-availability - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:28:03] RESOLVED: ErrorBudgetBurn: logging - logstash-availability - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[09:30:17] oh I see, we're using promtool version 2.48.1 → we're compiling the binary ourselves, which makes total sense to me! how do we update it? I'm on a somewhat more recent version of promtool, and while checking what the latest release is I noticed it brings some very helpful features: https://github.com/prometheus/prometheus/pull/15196. Happy to send a CR with the necessary changes if it helps!
[10:22:11] thanos looks to be growing by about 1T/day (if I read https://grafana.wikimedia.org/goto/aHDp444Hg?orgId=1 correctly); the automatic & gradual addition of capacity from the two new backends is easing the load on the over-full disks just a bit faster than they're filling up again - we're down to only a couple of disks over the WARN threshold
[10:23:31] if that continues we'll need more capacity early next FY
[10:24:29] (and that's after having trimmed quite a lot of older metrics)
[11:12:27] interesting, definitely unexpected as far as I'm aware
[11:13:26] I've been watching it because it's been a bit tight in terms of getting the new backends into service before thanos ran out of space and needed to trim more metrics again
[11:17:12] indeed
[12:56:43] FIRING: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest-sampled-live-franz - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[16:56:43] FIRING: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest-sampled-live-franz - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[17:48:00] hello o11y friends - do we have any mechanism to detect and/or deny-list prom metrics with high-cardinality labels (e.g., randomized IDs)? I encountered a case where this may already be happening, and wanted to see if there's any meta-monitoring to identify when it does
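
No existing meta-monitoring tool is named in this log; as a point of reference, a minimal sketch of the kind of check being asked about could poll Prometheus's TSDB status API, which reports the top metrics by series count and the top label names by distinct-value count. The host URL and the series threshold below are hypothetical placeholders, not anything deployed in production.

#!/usr/bin/env python3
"""Rough sketch: flag high-cardinality metrics/labels via Prometheus's TSDB status API.

Assumptions (not from the log): PROM_URL and the 50k-series threshold are
arbitrary examples chosen for illustration only.
"""
import json
import urllib.request

PROM_URL = "http://prometheus.example.org:9090"  # hypothetical host
SERIES_THRESHOLD = 50_000                        # arbitrary example threshold


def tsdb_status(base_url: str) -> dict:
    # /api/v1/status/tsdb reports the top-10 metrics by series count and the
    # top-10 label names by distinct-value count (recent Prometheus versions).
    with urllib.request.urlopen(f"{base_url}/api/v1/status/tsdb") as resp:
        return json.load(resp)["data"]


def main() -> None:
    status = tsdb_status(PROM_URL)
    print(f"total head series: {status['headStats']['numSeries']}")
    for entry in status["seriesCountByMetricName"]:
        flag = "  <-- over threshold" if entry["value"] > SERIES_THRESHOLD else ""
        print(f"{entry['name']}: {entry['value']} series{flag}")
    for entry in status["labelValueCountByLabelName"]:
        print(f"label {entry['name']!r}: {entry['value']} distinct values")


if __name__ == "__main__":
    main()
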
[19:22:15] swfrench-wmf: Good questions. We graph the number of time series tracked, but aren't alerting on it. Could you point me to which metrics you found?
[19:23:04] cwhite: thanks! function_orchestrator_function_execute_count and function_orchestrator_function_duration_milliseconds are the ones immediately on my radar
[19:24:32] function_orchestrator_function_implementation_error_count "should" be doing the same thing (i.e., similarly using a request ID as a label), but spot-checking /metrics on one service instance, it does not appear(?) to be
[19:31:52] I see it now. I'll investigate further what we can do. Thanks for letting us know, and please do hit me up with any more you find!
[19:33:19] cwhite: great, thanks! FYI, I'm opening a phab task for the owning team to remove these (there are a couple of other labels with similar properties). would you like me to cc you?
[19:35:07] Please do! I'd call it a high-priority issue because it may partially explain why we're running up against full disks for Thanos.
[19:35:47] I see another concerning label, `implementationZID`
[19:36:04] yeah, that's the other one
[19:36:07] and `Z7_function_identity`
[19:36:40] these are all "effectively unbounded", so yeah, concerning :)
[20:56:43] FIRING: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest-sampled-live-franz - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
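
Related to the /metrics spot-checking mentioned above, here is a small stdlib-only sketch that scrapes a single exposition endpoint and counts distinct values per (metric, label) pair, so request-ID-style labels like the ones discussed stand out. The target URL and the 100-distinct-values cutoff are illustrative assumptions, and the parser is deliberately simplistic compared to a real client library.

#!/usr/bin/env python3
"""Rough sketch: spot-check one /metrics endpoint for high-cardinality labels.

Assumptions (not from the log): METRICS_URL and the cutoff of 100 distinct
values are illustrative placeholders.
"""
import re
import urllib.request
from collections import defaultdict

METRICS_URL = "http://localhost:9090/metrics"  # hypothetical target
CUTOFF = 100                                   # arbitrary example cutoff

# Matches "metric_name{label="value",...} ..." lines in the text exposition format.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)\{(.*)\}\s')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:\\.|[^"\\])*)"')


def label_cardinality(url: str) -> dict:
    """Map (metric, label) -> set of observed label values for one scrape."""
    values = defaultdict(set)
    with urllib.request.urlopen(url) as resp:
        for raw in resp.read().decode("utf-8").splitlines():
            if raw.startswith("#"):   # skip HELP/TYPE comment lines
                continue
            match = SAMPLE_RE.match(raw)
            if not match:
                continue              # unlabeled or unparseable sample, nothing to count
            metric, labels = match.groups()
            for name, value in LABEL_RE.findall(labels):
                values[(metric, name)].add(value)
    return values


def main() -> None:
    ranked = sorted(label_cardinality(METRICS_URL).items(),
                    key=lambda kv: -len(kv[1]))
    for (metric, label), vals in ranked:
        marker = "  <-- likely unbounded?" if len(vals) > CUTOFF else ""
        print(f"{metric} {{{label}}}: {len(vals)} distinct values{marker}")


if __name__ == "__main__":
    main()
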