[11:25:43] FIRING: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest-sampled-live-franz - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[11:30:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest-sampled-live-franz - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[11:33:08] herron: o/
[11:33:11] hellloooo
[11:33:44] I was chatting with ML about the new SLO dashboards in pyrra, and I noticed that we are still suffering from the gaps https://phabricator.wikimedia.org/T352756
[11:34:12] for example, if you select "liftwing-revert-risk-lang-agnostic-availability" in slo.w.o and then expand the window to 4w you can see the problem
[11:34:41] it also happens with others that I am randomly selecting (all lift wing based ones)
[14:05:08] elukey: hiiiiii
[14:05:49] great timing, I wanted to check in w/ you about these as well, was just looking at them yesterday
[14:08:14] yeah I see the gap as well in the requests/errors metrics, it seems the ruler is getting stuck, I'm not sure why yet
[14:09:57] :(
[14:10:04] I think it is the same issue repeating
[14:10:16] but it must be in the recording rules
[14:14:35] yeah I think so too
[14:15:12] I'll spend some time on it this week, do you have any ideas in the meantime?
[14:29:29] questions more than ideas - IIRC when I checked the metrics behind the recording rules showing gaps I noticed that they were good (namely no missing datapoints)
[14:30:02] what component is responsible for the recording rules generation? Is it thanos, or prometheus?
[14:35:23] thanos rule for these
[14:36:41] so question -- how does liftwing articlequality look to you in pyrra?
[14:37:15] because this uses the raw metrics istio_requests_total / istio_request_duration_milliseconds_bucket vs the recording rules used by the others that were just onboarded
[14:38:52] the error budget calculation shows a big gap of days, but the rest looks good
[14:38:59] both latency and availability slos
[14:39:21] how can it use the raw metrics? Can pyrra skip the recording rules?
[14:39:53] pyrra makes recording rules of its own behind the scenes
[14:40:20] yep yep, but you mentioned the raw metrics above, this is why I asked
[14:40:35] I also found https://github.com/thanos-io/thanos/issues/896 that may be interesting
[14:40:49] old issue, but they mention index-cache-size and chunk-pool-size tuning for thanos
[14:41:26] ahh I got ya, I mean in terms of the pyrra configs - the recently onboarded lift wing slos are using the :increase5m recording rule metrics in the pyrra config vs the raw metrics
[14:41:52] ahhh okok
[14:41:57] interesting, yeah worth a shot too
[14:42:27] but lemme get one thing - if you use :increase5m in the pyrra config, does it apply a recording rule over that? (so rec rule over rec rule)?
[14:42:33] basically the recently onboarded ones are just a forklift from grizzly, but I think it'd make sense to adjust them
[14:42:40] yes it does
[14:42:46] ah wow
[14:46:30] it feels strange to apply a recording rule over another one, consistency-wise (speaking from a very ignorant point of view though)
[14:51:45] yeah let's undo that, I'll work on switching these to use the raw metrics in the pyrra config and see how that goes
[19:05:49] TIL `t a` as a keyboard shortcut in Grafana
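
For reference on the "rec rule over rec rule" question at 14:42:27: a `:increase5m`-style rule evaluated by Thanos Rule would look roughly like the sketch below. This is an illustration only, not the actual Wikimedia rule; the group name, interval, and label set are assumptions, and only the istio_requests_total metric name and the :increase5m naming come from the discussion above.

```yaml
# Sketch of a ":increase5m" recording rule as evaluated by Thanos Rule.
# Group name, interval, and label set are illustrative assumptions.
groups:
  - name: slo_istio_aggregations
    interval: 1m
    rules:
      - record: istio_requests_total:increase5m
        expr: >-
          sum by (destination_canonical_service, response_code) (
            increase(istio_requests_total[5m])
          )
```

Pyrra then generates its own burn-rate recording rules on top of whatever metric the SLO config references, which is where the recording-rule-over-recording-rule layering comes from.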
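
And a rough sketch of the adjustment discussed at 14:51:45, i.e. pointing the pyrra config at the raw istio metric instead of the :increase5m one, assuming the upstream pyrra.dev/v1alpha1 ServiceLevelObjective format; the namespace, target, and label matchers (including "example-service") are placeholders, not the real Lift Wing definitions.

```yaml
# Sketch only: SLO defined against the raw istio metric rather than the
# :increase5m recording rule. Namespace, target, and labels are placeholders.
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: liftwing-revert-risk-lang-agnostic-availability
  namespace: example-ns
spec:
  target: "99"
  window: 4w
  indicator:
    ratio:
      errors:
        metric: istio_requests_total{destination_canonical_service="example-service", response_code=~"5.."}
      total:
        metric: istio_requests_total{destination_canonical_service="example-service"}
```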