[09:17:00] Hello! Is it possible for a Prometheus alert to notify WMCS and netops? For example by passing a list to https://gerrit.wikimedia.org/r/c/operations/alerts/+/1126030/8/team-netops/bgp.yaml#32 ?
[09:23:26] XioNoX: not a list per se, though for example collab did set up a new team in modules/alertmanager/templates/alertmanager.yml.erb (team: 'collaboration-services-releng') which could work depending on the case, an alternative is to keep one team and then route alerts to a team1+team2 receiver in alertmanager
[09:24:03] or actually a receiver: for team2 + continue
[09:24:37] I tend to prefer solution #2
[09:33:33] godog: noted, thx, is there doc on how to do it?
[09:39:22] XioNoX: not yet no, something similar to this though https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/alertmanager/templates/alertmanager.yml.erb#216
[09:39:54] except that you'd be matching on sth else, the general structure/idea is the same
[09:41:04] cool, thx!
[09:44:14] XioNoX: np, FWIW for cases like these and similar what we could do is have sth like a 'scope' label in the alert, e.g. scope: cloud
[09:45:00] godog: I was also wondering about a scope "network"
[09:45:28] but then I worry it becomes too complex with some stuff being borderline between different scopes
[09:45:36] and the possible confusion between team netops and scope network
[09:46:34] yeah fair enough
[10:13:54] godog: I might need your help for the alert that we just received: "problem = prometheus "ops" at http://127.0.0.1:9900/ops has "gnmi_bgp_neighbor_session_state" metric with "instance" label but there are no series matching {instance=~"cloudsw.*"} in the last 1w"
[10:20:48] XioNoX: ah yes, the problem is that on pops there's no cloudsw, the easiest solution is to split the cloud alerts into a separate file then use # deploy-site: eqiad, codfw in addition to # deploy-tag: ops
[10:29:43] godog: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1126944
[10:31:37] XioNoX: yes exactly, LGTM
[10:50:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:51:43] FIRING: BenthosKafkaConsumerLag: Too many messages in jumbo-eqiad for group benthos-webrequest-sampled-live-franz - TODO - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=jumbo-eqiad&var-datasource=eqiad%20prometheus/ops&var-consumer_group=benthos-webrequest-sampled-live-franz - https://alerts.wikimedia.org/?q=alertname%3DBenthosKafkaConsumerLag
[14:50:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
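
The "receiver for team2 + continue" approach godog prefers at 09:24 could look roughly like the sketch below. This is a minimal, hypothetical Alertmanager route, not the actual production alertmanager.yml.erb; the receiver names (wmcs, netops) and the matched labels are assumptions for illustration only.

```yaml
# Hypothetical sketch of routing one alert to two teams with `continue`.
route:
  routes:
    # Deliver matching alerts to a WMCS receiver first...
    - match:
        team: netops
        scope: cloud          # assumed label, see the 'scope' idea below
      receiver: wmcs          # assumed receiver name
      continue: true          # ...then keep evaluating further routes
    # ...so the regular netops route still fires for the same alert.
    - match:
        team: netops
      receiver: netops        # assumed receiver name
```

With `continue: true` on the first route, a single alert is delivered to both receivers instead of stopping at the first match, which is what makes the single-team alert reach a second audience.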
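
The 'scope' label idea from 09:44 would live on the alert rule itself. A hedged sketch of a Prometheus rule carrying both a team and a scope label is below; the alert name, expression, and threshold are assumptions, only the metric name and labels come from the conversation.

```yaml
# Hypothetical sketch: one team label plus a scope label for routing.
groups:
  - name: bgp
    rules:
      - alert: BGPSessionDown                         # assumed alert name
        expr: gnmi_bgp_neighbor_session_state != 6    # assumed expression (6 = Established)
        for: 5m
        labels:
          team: netops
          scope: cloud        # Alertmanager could route on this in addition to team
          severity: critical
        annotations:
          summary: "BGP session down on {{ $labels.instance }}"
```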
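
For the 10:20 fix (splitting the cloudsw checks so they only deploy at sites that actually have cloudsw series), the separate file might start like this. This is a hypothetical sketch, not the content of change 1126944; the file name, alert name, and expression are assumptions, while the deploy-tag/deploy-site comments are the ones named in the conversation.

```yaml
# Hypothetical team-netops/bgp_cloud.yaml (assumed file name)
# deploy-tag: ops
# deploy-site: eqiad, codfw
groups:
  - name: cloud_bgp
    rules:
      - alert: CloudBGPSessionDown                    # assumed alert name
        expr: gnmi_bgp_neighbor_session_state{instance=~"cloudsw.*"} != 6   # assumed expression
        for: 5m
        labels:
          team: netops
          severity: critical
```

Restricting the file to eqiad and codfw avoids the "no series matching {instance=~"cloudsw.*"}" check failing on the pops, where those devices do not exist.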