[00:50:27] FIRING: [2x] SystemdUnitFailed: curator_actions_cluster_wide.service on logstash1026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:50:27] FIRING: [2x] SystemdUnitFailed: curator_actions_cluster_wide.service on logstash1026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:50:56] RESOLVED: [2x] SystemdUnitFailed: curator_actions_cluster_wide.service on logstash1026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:10:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:15:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:20:40] RESOLVED: [2x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:39:10] I have noticed a gap in MediaWiki log count on https://grafana.wikimedia.org/d/000000102/mediawiki-production-logging?orgId=1&from=1726122923270&to=1726126286428
[11:39:49] that is between 6:56 UTC and 7:14, but I do see logs in logstash. So I guess some scraping failed to work this morning :)
[11:39:58] the metric is either `log_mediawiki_servergroup_level_channel_doc_count` or `log_mediawiki_level_channel_doc_count`
[11:41:58] then I can see them in https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors so it is not really a concern
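To pin down whether the 6:56–7:14 gap is missing metric data or just delayed documents, one option is to query the series directly over that window. Below is a minimal sketch in Python, assuming a reachable Prometheus/Thanos query endpoint (THANOS_URL is a placeholder, not the real host) and the `log_mediawiki_level_channel_doc_count` metric mentioned above:

```python
# Minimal sketch (not the production tooling): pull the raw series for the
# 06:56-07:14 UTC window from a Prometheus/Thanos HTTP API and report which
# minutes have no samples at all vs. samples that are merely low.
# THANOS_URL is a hypothetical placeholder; the metric name comes from the
# chat above.
import requests

THANOS_URL = "https://thanos-query.example.org"  # hypothetical endpoint
QUERY = "sum(log_mediawiki_level_channel_doc_count)"
START = 1726124160  # 2024-09-12 06:56:00 UTC
END = 1726125240    # 2024-09-12 07:14:00 UTC

resp = requests.get(
    f"{THANOS_URL}/api/v1/query_range",
    params={"query": QUERY, "start": START, "end": END, "step": "60s"},
    timeout=30,
)
resp.raise_for_status()
series = resp.json()["data"]["result"]

# Flatten the returned matrix into {evaluation timestamp: value}.
samples = {int(float(ts)): val for s in series for ts, val in s["values"]}
for t in range(START, END + 1, 60):
    print(t, samples.get(t, "NO SAMPLE"))
```

Minutes with no evaluation point at all would suggest a gap on the scraping/metrics side, while present-but-lower values would be consistent with the documents simply arriving late (matching the Kafka consumer lag earlier that morning).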
[14:18:52] FIRING: ThanosRuleSenderIsFailingAlerts: Thanos Rule is failing to send alerts to alertmanager. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleSenderIsFailingAlerts
[14:19:06] ^ looking.
[14:23:52] RESOLVED: ThanosRuleSenderIsFailingAlerts: Thanos Rule is failing to send alerts to alertmanager. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleSenderIsFailingAlerts
[14:25:37] I hope the irony of the alert is not lost on (collective) you
[14:56:08] heh
[15:13:05] that was in jest, though on a serious note things are working as expected: it was a partial failure of some (not all) alertmanagers during the alert failover, hence the alert
[17:30:58] o/ we're investigating a weird request logged in logstash: https://logstash.wikimedia.org/goto/f5dff72f669ce0fecad15aa8e8b46022
[17:31:24] dcausse: Looking...
[17:31:56] it appears to be the same PHP request, but the logs suggest that it started at 4:51 and failed two hours later at 7:00
[17:32:28] I doubt that's possible. Is the timestamp assigned by logstash, and could logstash have been late in ingesting these log lines?
[17:35:10] same request_id, same k8s pod (so it can't be a job, I guess)
[17:36:43] The timestamp shown there seems to correspond to the one contained in the object Logstash received, so I don't think it has to do with Logstash ingesting those logs late.
[17:38:14] denisse: ok, thanks! I guess we'll ping service-ops to see if they can make some sense of these log lines
[19:16:54] dcausse: oh, I'm not 100% sure, but I think that jobqueue jobs receive the reqIds of the requests that enqueued them
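One way to sanity-check the reqId-inheritance idea against the "same pod" observation at 17:35:10 is to list every event carrying that request_id and compare their timestamps, channels, and pods. A rough sketch, with the caveat that the endpoint, index pattern, and field names (request_id, channel, kubernetes.pod_name) are assumptions rather than the actual production mapping:

```python
# Minimal sketch (assumed schema, not the real one): fetch every event sharing
# the suspect reqId from the Logstash Elasticsearch/OpenSearch backend and print
# when it was logged and from which pod/channel, to see whether the 4:51-7:00
# span is one long-lived request or a web request plus something reusing its reqId.
import requests

LOGSTASH_ES = "https://logstash-backend.example.org:9200"  # hypothetical endpoint
INDEX = "logstash-*"                                       # assumed index pattern
REQUEST_ID = "REPLACE_WITH_SUSPECT_REQID"

query = {
    "size": 500,
    "sort": [{"@timestamp": "asc"}],
    "query": {"term": {"request_id": REQUEST_ID}},
    "_source": ["@timestamp", "channel", "kubernetes.pod_name", "message"],
}

resp = requests.post(f"{LOGSTASH_ES}/{INDEX}/_search", json=query, timeout=30)
resp.raise_for_status()

for hit in resp.json()["hits"]["hits"]:
    src = hit["_source"]
    pod = src.get("kubernetes", {}).get("pod_name")
    print(src.get("@timestamp"), pod, src.get("channel"),
          str(src.get("message", ""))[:80])
```

If the reqId had been inherited by a jobqueue job (the suggestion at 19:16:54), the later entries would typically come from a different pod or channel; the same pod throughout is what makes that explanation less likely here.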