[02:51:43] FIRING: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[02:56:43] RESOLVED: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[07:50:55] Does anyone know why there's no metrics data available for citoid prior to June? https://grafana-rw.wikimedia.org/d/NJkCVermz/citoid?forceLogin=&from=1707119350039&orgId=1&refresh=5m&to=1722844150039&var-dc=codfw+prometheus%2Fk8s&var-service=citoid Are there backups of this anywhere?
[07:55:53] mvolz: the reason is that those dashboards use the "prometheus k8s" datasource, which has shorter-term data; for long-term data the dashboards need to be adapted to use the "thanos" datasource
[07:57:19] mvolz: the easiest short-term fix, if you are interested in a single panel, is to hit "explore" on the panel and then change the datasource to "thanos"
[07:57:40] mvolz: and add {site="eqiad"} (for example) in the metric labels to pick a site
[08:00:52] ah
[08:01:27] yeah I switched it to k8s because I wanted it by data centre
[08:01:41] didn't realise it was shorter!
[08:02:08] tnx!
[08:02:18] mvolz: sure np, thanks for reaching out
[08:02:55] FWIW you can still do per-site; it needs to be passed in the metric labels in the query, but other than that it is functionally the same as what you are doing now
[11:14:12] FIRING: ThanosQueryHttpRequestQueryRangeErrorRateHigh: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh
[11:19:12] RESOLVED: ThanosQueryHttpRequestQueryRangeErrorRateHigh: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh
[12:53:07] <_joe_> cwhite: I'd appreciate a sanity check on the format for some ECS logs - see https://gitlab.wikimedia.org/repos/sre/conftool/-/blob/audit_log/conftool/audit.py?ref_type=heads#L42-55
[12:53:17] <_joe_> if you think I should add more metadata or anything
[15:50:30] _joe_: Looks good! There are two more event categorization fields that may interest you: `event.type` and `event.category`. Adding these completes categorization ("what it is"), but there's no hard requirement to include them. :)
[15:51:03] <_joe_> yeah that's easy to add if we want
[15:51:12] <_joe_> thanks
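
A minimal sketch of the per-site query idea from the 07:57:19-08:02:55 exchange, assuming a Prometheus-compatible Thanos query endpoint; the endpoint URL and the citoid metric name below are placeholders, not values taken from the log. The point it illustrates is the one made above: the query stays the same, and a {site="..."} label matcher picks the data centre instead of a separate per-site datasource.

```python
import requests

# Hedged sketch: query a Prometheus-compatible HTTP API (as Thanos exposes) and
# pin a single site with a label matcher. URL and metric name are placeholders.
THANOS_QUERY_URL = "https://thanos-query.example.org/api/v1/query"


def query_per_site(metric: str, site: str) -> list:
    """Run an instant query for `metric`, restricted to one site via a label matcher."""
    promql = f'sum(rate({metric}{{site="{site}"}}[5m]))'
    resp = requests.get(THANOS_QUERY_URL, params={"query": promql}, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]["result"]


if __name__ == "__main__":
    # e.g. request rate for a hypothetical citoid metric, pinned to eqiad
    print(query_per_site("citoid_request_duration_seconds_count", "eqiad"))
```

In Grafana the equivalent is exactly what was suggested in the chat: keep the panel's query, switch its datasource to "thanos", and add the site matcher to the metric selector.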
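
A minimal, hypothetical sketch of the optional ECS categorization fields mentioned at 15:50:30, written as a plain-dict ECS document. None of the field values are taken from conftool's audit.py; the `event.category` and `event.type` values shown ("configuration", "change") are simply examples of values ECS allows.

```python
import datetime
import json

# Hypothetical ECS audit event: all values are illustrative placeholders. It only
# shows where the optional categorization fields event.category and event.type
# would sit alongside the rest of the event object.
audit_event = {
    "@timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "ecs": {"version": "1.11.0"},
    "message": "conftool object modified",  # placeholder message
    "event": {
        "kind": "event",
        "action": "update",                 # placeholder action name
        "category": ["configuration"],      # "what it is": an ECS allowed category
        "type": ["change"],                 # completes the categorization
    },
    "user": {"name": "example-operator"},   # placeholder user
}

print(json.dumps(audit_event, indent=2))
```

As noted in the chat, both fields are optional; the surrounding fields are only there to make the sketch self-contained.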