[00:50:27] FIRING: [2x] SystemdUnitFailed: curator_actions_cluster_wide.service on logstash1026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:50:27] FIRING: [2x] SystemdUnitFailed: curator_actions_cluster_wide.service on logstash1026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:50:56] RESOLVED: [2x] SystemdUnitFailed: curator_actions_cluster_wide.service on logstash1026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:40] FIRING: LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:10:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:15:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:20:40] RESOLVED: [2x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:39:10] I have noticed a gap in MediaWiki log count on https://grafana.wikimedia.org/d/000000102/mediawiki-production-logging?orgId=1&from=1726122923270&to=1726126286428
[11:39:49] that is between 6:56 UTC and 7:14, but I do see logs in logstash. So I guess some scraping failed to work this morning :)
[11:39:58] the metric is either `log_mediawiki_servergroup_level_channel_doc_count` or `log_mediawiki_level_channel_doc_count`
[11:41:58] then I can see them in https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors so it is not really a concern
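To pin down whether the 6:56–7:14 gap is missing metric data or just delayed documents, one option is to query the series directly over that window. Below is a minimal sketch in Python, assuming a reachable Prometheus/Thanos query endpoint (THANOS_URL is a placeholder, not the real host) and the `log_mediawiki_level_channel_doc_count` metric mentioned above:

```python
# Minimal sketch (not the production tooling): pull the raw series for the
# 06:56-07:14 UTC window from a Prometheus/Thanos HTTP API and report which
# minutes have no samples at all vs. samples that are merely low.
# THANOS_URL is a hypothetical placeholder; the metric name comes from the
# chat above.
import requests

THANOS_URL = "https://thanos-query.example.org"  # hypothetical endpoint
QUERY = "sum(log_mediawiki_level_channel_doc_count)"
START = 1726124160  # 2024-09-12 06:56:00 UTC
END = 1726125240    # 2024-09-12 07:14:00 UTC

resp = requests.get(
    f"{THANOS_URL}/api/v1/query_range",
    params={"query": QUERY, "start": START, "end": END, "step": "60s"},
    timeout=30,
)
resp.raise_for_status()
series = resp.json()["data"]["result"]

# Flatten the returned matrix into {evaluation timestamp: value}.
samples = {int(float(ts)): val for s in series for ts, val in s["values"]}
for t in range(START, END + 1, 60):
    print(t, samples.get(t, "NO SAMPLE"))
```

Minutes with no evaluation point at all would suggest a gap on the scraping/metrics side, while present-but-lower values would be consistent with the documents simply arriving late (matching the Kafka consumer lag earlier that morning).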
[14:18:52] FIRING: ThanosRuleSenderIsFailingAlerts: Thanos Rule is failing to send alerts to alertmanager. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleSenderIsFailingAlerts
[14:19:06] ^ looking.
[14:23:52] RESOLVED: ThanosRuleSenderIsFailingAlerts: Thanos Rule is failing to send alerts to alertmanager. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleSenderIsFailingAlerts
[14:25:37] I hope the irony of the alert is not lost on (collective) you
[14:56:08] heh
[15:13:05] that was in jest, though on a serious note things are working as expected: it was a partial failure of some (not all) alertmanagers during the alert failover, hence the alert
[17:30:58] o/ we're investigating a weird request logged in logstash: https://logstash.wikimedia.org/goto/f5dff72f669ce0fecad15aa8e8b46022
[17:31:24] dcausse: Looking...
[17:31:56] it appears to be the same PHP request, but the logs suggest that it started at 4:51 and failed two hours later at 7:00
[17:32:28] I doubt that's possible. Is the timestamp assigned by logstash, and could logstash have been late in ingesting these log lines?
[17:35:10] same request_id, same k8s pod (so it can't be a job, I guess)
[17:36:43] The timestamp shown there seems to correspond to the one contained in the object Logstash received, so I don't think it has to do with Logstash ingesting those logs late.
[17:38:14] denisse: ok, thanks! I guess we'll ping service-ops to see if they can make some sense of these log lines
[19:16:54] dcausse: oh, I'm not 100% sure, but I think that jobqueue jobs receive the reqIds of the requests that enqueued them
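One way to sanity-check the reqId-inheritance idea against the "same pod" observation at 17:35:10 is to list every event carrying that request_id and compare their timestamps, channels, and pods. A rough sketch, with the caveat that the endpoint, index pattern, and field names (request_id, channel, kubernetes.pod_name) are assumptions rather than the actual production mapping:

```python
# Minimal sketch (assumed schema, not the real one): fetch every event sharing
# the suspect reqId from the Logstash Elasticsearch/OpenSearch backend and print
# when it was logged and from which pod/channel, to see whether the 4:51-7:00
# span is one long-lived request or a web request plus something reusing its reqId.
import requests

LOGSTASH_ES = "https://logstash-backend.example.org:9200"  # hypothetical endpoint
INDEX = "logstash-*"                                       # assumed index pattern
REQUEST_ID = "REPLACE_WITH_SUSPECT_REQID"

query = {
    "size": 500,
    "sort": [{"@timestamp": "asc"}],
    "query": {"term": {"request_id": REQUEST_ID}},
    "_source": ["@timestamp", "channel", "kubernetes.pod_name", "message"],
}

resp = requests.post(f"{LOGSTASH_ES}/{INDEX}/_search", json=query, timeout=30)
resp.raise_for_status()

for hit in resp.json()["hits"]["hits"]:
    src = hit["_source"]
    pod = src.get("kubernetes", {}).get("pod_name")
    print(src.get("@timestamp"), pod, src.get("channel"),
          str(src.get("message", ""))[:80])
```

If the reqId had been inherited by a jobqueue job (the suggestion at 19:16:54), the later entries would typically come from a different pod or channel; the same pod throughout is what makes that explanation less likely here.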