[03:52:43] FIRING: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest-sampled-live-franz - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[03:57:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest-sampled-live-franz - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[13:14:35] I was checking titan2001 for thanos-compact and noticed the network has been fully utilized since yesterday, after moving pyrra to raw metric for liftwing :( I think we need to revert, cc elukey herron
[13:18:29] +1 totally
[13:19:02] it seems that the 12w rec rule that pyrra creates is too aggressive with the istio time series
[13:19:11] and we reduced their cardinality a lot, sigh
[13:21:10] yeah it is a bummer
[13:25:59] ok, "fully" is not quite correct, the port is at 2gbit
[13:26:25] https://librenms.wikimedia.org/device/device=311/tab=port/port=33632/ with a few discards, so probably ok in the sense that it isn't an outage
[13:27:43] this is the impact e.g. on titan2001 https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=titan2001&var-datasource=thanos&var-cluster=titan&from=1733925826488&to=1733955017945
[13:45:18] what if we bring titan to 10g and at the same time try to further reduce cardinality?
[13:46:27] the first part is already the case so we're good there, the second we could try for sure
[13:51:20] ah my bad, misread it as a 2g link. ok yeah sounds good
[13:55:08] what I don't know offhand is how hard it's going to be to reduce cardinality and how much effect it'll have, vs having fewer datapoints
[18:40:17] is it possible to have our toolforge-based services' metrics pulled into one of your prometheus instances? i'm not sure which of the instances would make the most sense for this https://wikitech.wikimedia.org/wiki/Prometheus#Instances
[18:45:11] derenrich: I think Toolforge has its own Prometheus instance. https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Monitoring_for_Toolforge
[18:45:32] Do you know if there's a reason why a different instance is required?
[18:46:00] denisse: yeah unfortunately i tried adding us to their instance and i was told no (see this ticket https://gerrit.wikimedia.org/r/c/operations/puppet/+/1039850)
[18:47:05] they give some reasons why it's a bad choice in the ticket
[19:08:02] derenrich: Understood, then I think that the `cloud` Prometheus instance may be a good fit for it, but I'd like to see what others in the team think about that.
[20:13:57] denisse: ok thanks. what would be a good way to move this forward? a meeting or a ticket?
[21:16:03] derenrich: I think that opening a ticket with some context regarding the issue would be a good first step to provide visibility and awareness of the situation.
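
For the cardinality discussion above, one quick way to gauge how many istio series the pyrra-generated 12w recording rule has to churn through is to count the active series via the Prometheus HTTP API. A minimal Python sketch, assuming a reachable query endpoint and an istio_ metric prefix (both placeholders, not the actual production setup):

    import requests

    # Hypothetical endpoint; substitute the real Prometheus/Thanos query URL.
    PROM_URL = "http://localhost:9090/api/v1/query"

    def series_count(selector: str) -> int:
        """Count active series matching a PromQL label selector."""
        resp = requests.get(PROM_URL, params={"query": f"count({selector})"}, timeout=30)
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        return int(result[0]["value"][1]) if result else 0

    if __name__ == "__main__":
        # The istio_ prefix is an assumption; adjust it to whatever metrics
        # the pyrra-generated recording rules actually read.
        print("istio series:", series_count('{__name__=~"istio_.*"}'))

Running the same count before and after a cardinality-reduction change gives a rough estimate of how much the recording rule's input shrinks.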
[23:07:10] FIRING: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:09:15] RESOLVED: SystemdUnitFailed: statograph_post.service on alert1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed