[08:07:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[08:12:40] FIRING: [2x] LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[08:17:40] RESOLVED: [2x] LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[08:18:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[08:23:40] RESOLVED: [2x] LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[12:44:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:45:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[12:50:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[12:59:40] FIRING: [3x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:04:40] FIRING: [4x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:58:29] inflatador: I added elasticsearch-curator_5.8.5-1~wmf5+deb11u1 to component:thirdparty/opensearch1. lmk if that doesn't resolve the issue
[14:05:15] cwhite: it's all good, we installed it directly using the apt component, just wanted to make sure y'all didn't have a time bomb
[14:08:34] Ah, got it. We stopped updating the opensearch1 component after we moved to opensearch2.
[14:39:40] FIRING: [4x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:44:27] * cwhite looking
[14:49:40] FIRING: [4x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:50:54] cwhite: I was taking a look as well... while checking the DLQ, I noticed many messages like 'Limit of total fields [2048] has been exceeded'
[15:03:17] mobileapps is logging query parameters as fields.
[15:03:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[15:08:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[15:28:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[15:29:50] cwhite: Could the outputs from mw-scripts also be contributing?
[15:32:54] tappof: which field do you think is contributing?
[15:38:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[15:39:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[15:40:44] tappof: There are several mw-script fields that look suspicious to me: auxiliary_text, source_text, category, and template
[15:44:40] FIRING: [4x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:54:50] cwhite: honestly I had a look at template too, but I don't really know how to debug it... however, if I'm not mistaken, running this query https://logstash.wikimedia.org/goto/5e7dfcc410f63d34f96b39ba7c08744c and the same filter without mw-script gives me more or less the same result... that's why I'm thinking about mw-script
[15:56:17] That query shows the symptoms of the problem. `Limit of total fields [2048] has been exceeded` means that log event is a victim of index field exhaustion.
[15:57:49] Slow ingest is often caused by abnormally large events. The difficult part is locating which stream is producing those events.
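For anyone retracing this kind of triage later: a minimal diagnostic sketch in Python (not the team's actual tooling; the endpoint and index name are placeholders) that pulls the index mapping and counts mapped fields per top-level prefix. Grouping by prefix is one way to narrow down which stream is eating the 2048-field budget; the count is approximate because object containers also consume mapping slots.

```python
# Sketch: count mapped fields in a Logstash/OpenSearch index, grouped by top-level prefix.
# OPENSEARCH and INDEX are placeholder assumptions, not real production values.
import collections
import requests

OPENSEARCH = "http://localhost:9200"            # assumption: cluster endpoint
INDEX = "logstash-example-2024.01.01"           # assumption: example daily index name

def leaf_fields(properties, prefix=""):
    """Recursively yield mapped field paths, including multi-fields like .keyword."""
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "properties" in spec:                # object field: recurse into its children
            yield from leaf_fields(spec["properties"], path + ".")
        else:
            yield path
            for sub in spec.get("fields", {}):  # multi-fields also count toward the limit
                yield f"{path}.{sub}"

mapping = requests.get(f"{OPENSEARCH}/{INDEX}/_mapping", timeout=30).json()
props = mapping[INDEX]["mappings"].get("properties", {})

fields = list(leaf_fields(props))
by_prefix = collections.Counter(f.split(".")[0] for f in fields)

print(f"total mapped fields: {len(fields)}")
for prefix, count in by_prefix.most_common(15):
    print(f"{count:6d}  {prefix}")
```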
[15:59:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[16:01:30] ^^ Is there any way to make those logstash failure tickets fire less often? This is kind of a me problem because I have highlights on the word 'elastic', but they do seem to fire and clear a lot
[16:01:39] errr.. failure alerts, that is
[16:02:33] * bd808 fails saving throw vs the "fix the problem" answer
[16:03:01] you need a natural 20 for that one ;P
[16:19:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[16:24:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[16:26:11] if this last patch doesn't work, I'm going to target all of DumpIndex.php
[16:27:16] The curve on the graph now seems to have a negative slope, cwhite
[16:28:05] I see that for codfw, but not eqiad yet.
[16:33:03] cwhite: a question... auxiliary_text, source_text, category, and template are arrays. I looked at the pod logs directly on deploy1003 using kubectl, and I was under the impression they would count as a single field in OpenSearch, unless they're parsed as nested objects (which I don't think is the case)...
[16:35:40] So the question is: how do these fields contribute to exceeding the field count limit?
[16:37:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[16:37:48] AFAIK, these arrays aren't responsible for exceeding the field count limit. The first patch removing `request.query` was to address the field-explosion problem.
[16:39:05] These arrays from mw-script are likely contributors to indexing back pressure due to their size.
[16:39:40] FIRING: [3x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:42:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[16:46:12] cwhite: OK, thank you. One last thing (I think): it's still not clear to me why I'm seeing "Limit of total fields [2048] has been exceeded" for mw-script in the DLQ...
[16:47:12] Given that the fields we're "cutting off" don't count toward the field limit...
[16:47:35] The reason for that error message is that the index itself has run out of fields to assign. Once this happens, any event attempting to declare a new field exceeds the field limit and gets dead-lettered.
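To make the mechanics above concrete, here is a small self-contained sketch against a throwaway local cluster (the URL, index name, and the tiny field limit are assumptions for the demo, not production values): an array of scalars claims only one mapping slot, while a bag of distinct keys such as logged query parameters claims one slot per key (plus .keyword multi-fields), and once index.mapping.total_fields.limit is reached any document that would declare another field is rejected, which is what then lands in the DLQ.

```python
# Sketch: reproduce the "Limit of total fields" rejection on a scratch index.
import json
import requests

ES = "http://localhost:9200"   # assumption: throwaway local OpenSearch/Elasticsearch
IDX = "field-limit-demo"

requests.delete(f"{ES}/{IDX}")  # ignore a 404 on the first run
requests.put(f"{ES}/{IDX}", json={
    "settings": {"index": {"mapping": {"total_fields": {"limit": 5}}}}
})

# 1) An array of scalars only creates one mapped field ("category" plus its
#    dynamically-added .keyword sub-field), so it fits under the tiny limit.
r = requests.post(f"{ES}/{IDX}/_doc", json={"category": ["a", "b", "c", "d"]})
print("array doc:", r.status_code)            # expected: 201

# 2) A bag of distinct keys claims one mapping slot per key and blows the limit.
r = requests.post(f"{ES}/{IDX}/_doc", json={
    "query": {"action": "x", "format": "y", "titles": "z", "prop": "w", "uselang": "v"}
})
print("query-params doc:", r.status_code)     # expected: 400
# The body should contain "Limit of total fields [5] has been exceeded".
print(json.dumps(r.json(), indent=2)[:400])
```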
[16:47:57] aaaaaaaaaaaaaaaaaaaaaahhhh
[16:47:59] okok
[16:48:07] clear, thank you
[16:51:27] > means that log event is a victim of index field exhaustion.
[16:51:31] It was clear here as well, but it took me a while to fully digest
[18:09:40] RESOLVED: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[18:10:10] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[18:15:10] RESOLVED: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
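As an aside on the LogstashKafkaConsumerLag alerts above, a rough sketch of what they measure: committed consumer-group offsets versus partition high watermarks. The broker and topic names below are placeholders; only the group name comes from the alerts, and this is not the alerting query itself.

```python
# Sketch: per-partition consumer-group lag via confluent-kafka (read-only).
from confluent_kafka import Consumer, TopicPartition

BROKER = "kafka-logging.example:9092"   # assumption: placeholder broker address
GROUP = "logstash7-codfw"               # group name taken from the alerts
TOPIC = "logging-example-topic"         # assumption: placeholder topic name

c = Consumer({
    "bootstrap.servers": BROKER,
    "group.id": GROUP,
    "enable.auto.commit": False,        # never commit offsets from this script
})

partitions = c.list_topics(TOPIC, timeout=10).topics[TOPIC].partitions
tps = [TopicPartition(TOPIC, p) for p in partitions]

total_lag = 0
for tp in c.committed(tps, timeout=10):
    _low, high = c.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else 0   # negative means no committed offset yet
    lag = max(high - committed, 0)
    total_lag += lag
    print(f"partition {tp.partition}: committed={committed} high={high} lag={lag}")

print(f"total lag for group {GROUP}: {total_lag}")
c.close()
```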