[08:07:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[08:12:40] FIRING: [2x] LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[08:17:40] RESOLVED: [2x] LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[08:18:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[08:23:40] RESOLVED: [2x] LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[12:44:40] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:45:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[12:50:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[12:59:40] FIRING: [3x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:04:40] FIRING: [4x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[13:58:29] inflatador: I added elasticsearch-curator_5.8.5-1~wmf5+deb11u1 to component:thirdparty/opensearch1. lmk if that doesn't resolve the issue
[14:05:15] cwhite: it's all good, we installed it directly using the apt component, just wanted to make sure y'all didn't have a time bomb
[14:08:34] Ah, got it. We stopped updating the opensearch1 component after we moved to opensearch2.
[14:39:40] FIRING: [4x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:44:27] * cwhite looking
[14:49:40] FIRING: [4x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:50:54] cwhite: I was taking a look as well... while checking the DLQ, I noticed many messages like 'Limit of total fields [2048] has been exceeded'
[15:03:17] mobileapps is logging query parameters as fields.
[15:03:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[15:08:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[15:28:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[15:29:50] cwhite: Could the outputs from mw-scripts also be contributing?
[15:32:54] tappof: which field do you think is contributing?
[15:38:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[15:39:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[15:40:44] tappof: There are several mw-script fields that look suspicious to me: auxiliary_text, source_text, category, and template
[15:44:40] FIRING: [4x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:54:50] cwhite: honestly I had a look at template too, but I don't really know how to debug it... however, if I'm not mistaken, running this query https://logstash.wikimedia.org/goto/5e7dfcc410f63d34f96b39ba7c08744c and the same filter without mw-script gives me more or less the same result... that's why I'm thinking about mw-script
[15:56:17] That query shows the symptoms of the problem. `Limit of total fields [2048] has been exceeded` means that log event is a victim of index field exhaustion.
[15:57:49] Slow ingest is often caused by abnormally large events. The difficult part is locating which stream is producing those events.
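For anyone retracing this kind of triage later: a minimal diagnostic sketch in Python (not the team's actual tooling; the endpoint and index name are placeholders) that pulls the index mapping and counts mapped fields per top-level prefix. Grouping by prefix is one way to narrow down which stream is eating the 2048-field budget; the count is approximate because object containers also consume mapping slots.

```python
# Sketch: count mapped fields in a Logstash/OpenSearch index, grouped by top-level prefix.
# OPENSEARCH and INDEX are placeholder assumptions, not real production values.
import collections
import requests

OPENSEARCH = "http://localhost:9200"            # assumption: cluster endpoint
INDEX = "logstash-example-2024.01.01"           # assumption: example daily index name

def leaf_fields(properties, prefix=""):
    """Recursively yield mapped field paths, including multi-fields like .keyword."""
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "properties" in spec:                # object field: recurse into its children
            yield from leaf_fields(spec["properties"], path + ".")
        else:
            yield path
            for sub in spec.get("fields", {}):  # multi-fields also count toward the limit
                yield f"{path}.{sub}"

mapping = requests.get(f"{OPENSEARCH}/{INDEX}/_mapping", timeout=30).json()
props = mapping[INDEX]["mappings"].get("properties", {})

fields = list(leaf_fields(props))
by_prefix = collections.Counter(f.split(".")[0] for f in fields)

print(f"total mapped fields: {len(fields)}")
for prefix, count in by_prefix.most_common(15):
    print(f"{count:6d}  {prefix}")
```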
[15:59:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[16:01:30] ^^ Is there any way to make those logstash failure tickets fire less often? This is kind of a me problem because I have highlights on the word 'elastic', but they do seem to fire and clear a lot
[16:01:39] errr.. failure alerts, that is
[16:02:33] * bd808 fails saving throw vs the "fix the problem" answer
[16:03:01] you need a natural 20 for that one ;P
[16:19:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[16:24:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[16:26:11] if this last patch doesn't work, I'm going to target all of DumpIndex.php
[16:27:16] The curve on the graph now seems to have a negative slope, cwhite
[16:28:05] I see that for codfw, but not eqiad yet.
[16:33:03] cwhite: a question... auxiliary_text, source_text, category, and template are arrays. I looked at the pod logs directly on deploy1003 using kubectl, and I was under the impression they would count as a single field in OpenSearch, unless they're parsed as nested objects (which I don't think is the case)...
[16:35:40] So the question is: how do these fields contribute to exceeding the field count limit?
[16:37:40] FIRING: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[16:37:48] AFAIK, these arrays aren't responsible for exceeding the field count limit. The first patch removing `request.query` was to address the field-explosion problem.
[16:39:05] These arrays from mw-script are likely contributors to indexing back pressure due to their size.
[16:39:40] FIRING: [3x] LogstashKafkaConsumerLag: Too many messages in logging-codfw for group logstash7-eqiad - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:42:40] RESOLVED: LogstashIndexingFailures: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[16:46:12] cwhite: OK, thank you. One last thing (I think): it's still not clear to me why I'm seeing "Limit of total fields [2048] has been exceeded" for mw-script in the DLQ...
[16:47:12] Given that the fields we're "cutting off" don't count toward the field limit...
[16:47:35] The reason for that error message is that the index itself has run out of fields to assign. Once this happens, any event attempting to declare a new field exceeds the field limit and gets dead-lettered.
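To make the mechanics above concrete, here is a small self-contained sketch against a throwaway local cluster (the URL, index name, and the tiny field limit are assumptions for the demo, not production values): an array of scalars claims only one mapping slot, while a bag of distinct keys such as logged query parameters claims one slot per key (plus .keyword multi-fields), and once index.mapping.total_fields.limit is reached any document that would declare another field is rejected, which is what then lands in the DLQ.

```python
# Sketch: reproduce the "Limit of total fields" rejection on a scratch index.
import json
import requests

ES = "http://localhost:9200"   # assumption: throwaway local OpenSearch/Elasticsearch
IDX = "field-limit-demo"

requests.delete(f"{ES}/{IDX}")  # ignore a 404 on the first run
requests.put(f"{ES}/{IDX}", json={
    "settings": {"index": {"mapping": {"total_fields": {"limit": 5}}}}
})

# 1) An array of scalars only creates one mapped field ("category" plus its
#    dynamically-added .keyword sub-field), so it fits under the tiny limit.
r = requests.post(f"{ES}/{IDX}/_doc", json={"category": ["a", "b", "c", "d"]})
print("array doc:", r.status_code)            # expected: 201

# 2) A bag of distinct keys claims one mapping slot per key and blows the limit.
r = requests.post(f"{ES}/{IDX}/_doc", json={
    "query": {"action": "x", "format": "y", "titles": "z", "prop": "w", "uselang": "v"}
})
print("query-params doc:", r.status_code)     # expected: 400
# The body should contain "Limit of total fields [5] has been exceeded".
print(json.dumps(r.json(), indent=2)[:400])
```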
[16:47:57] aaaaaaaaaaaaaaaaaaaaaahhhh
[16:47:59] okok
[16:48:07] clear, thank you
[16:51:27] > means that log event is a victim of index field exhaustion.
[16:51:31] It was clear here as well, but it took me a while to fully digest
[18:09:40] RESOLVED: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[18:10:10] FIRING: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[18:15:10] RESOLVED: [2x] LogstashKafkaConsumerLag: Too many messages in logging-eqiad for group logstash7-codfw - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
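As an aside on the LogstashKafkaConsumerLag alerts above, a rough sketch of what they measure: committed consumer-group offsets versus partition high watermarks. The broker and topic names below are placeholders; only the group name comes from the alerts, and this is not the alerting query itself.

```python
# Sketch: per-partition consumer-group lag via confluent-kafka (read-only).
from confluent_kafka import Consumer, TopicPartition

BROKER = "kafka-logging.example:9092"   # assumption: placeholder broker address
GROUP = "logstash7-codfw"               # group name taken from the alerts
TOPIC = "logging-example-topic"         # assumption: placeholder topic name

c = Consumer({
    "bootstrap.servers": BROKER,
    "group.id": GROUP,
    "enable.auto.commit": False,        # never commit offsets from this script
})

partitions = c.list_topics(TOPIC, timeout=10).topics[TOPIC].partitions
tps = [TopicPartition(TOPIC, p) for p in partitions]

total_lag = 0
for tp in c.committed(tps, timeout=10):
    _low, high = c.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else 0   # negative means no committed offset yet
    lag = max(high - committed, 0)
    total_lag += lag
    print(f"partition {tp.partition}: committed={committed} high={high} lag={lag}")

print(f"total lag for group {GROUP}: {total_lag}")
c.close()
```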