[14:10:45] Hello ServiceOps, are you aware of any changes to wikikube around 1600 UTC on the 17th? I'm tracking down an opensearch-ipoid latency issue with kostajh that started around that time, ref https://w.wiki/J$Eq or https://wikimedia.slack.com/archives/C055QGPTC69/p1773864576149609
[14:14:05] inflatador: there was a mediawiki deploy window 1500-1600 UTC where several patches were deployed, and some maintenance jobs were also run against some database tables; could that be connected?
[14:17:04] cdanis I'm not 100% sure, but it seems unlikely. I don't think iPoid relies on any datastores besides OpenSearch. What we've seen since that time is increased P95 and P99 latency and more connection timeouts
[14:21:28] The ipoid pods have churned since then, but the problem remains. When this happened in the past, it was due to 1Gbps k8s workers. We just got rid of all our 1Gbps workers, but it's possible we still have a pod on a misbehaving worker; running that down now
[14:32:53] hmm, wasn't the start time actually more like 11:25Z that day?
[14:33:47] the closest thing it correlates with in SAL is a dse-k8s-worker reimage cookbook invocation
[14:43:38] I was hopeful that maybe we could see this with a bit more clarity in envoy metrics, but it looks like this was set up to circumvent the mesh entirely [0]. I'm wondering whether that's actually technically necessary for some reason ...
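[Editor's note: the P95/P99 figures discussed above come from a cumulative Prometheus histogram (mediawiki_IPReputation_ipoid_data_lookup_time_bucket, mentioned below). For readers unfamiliar with how tail quantiles are estimated from such buckets, here is a minimal Python sketch of the linear interpolation that PromQL's histogram_quantile() performs; the bucket boundaries and counts are made up for illustration.]

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative Prometheus-style buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, the last bound being float('inf'). This mirrors the linear
    interpolation done by PromQL's histogram_quantile().
    """
    total = buckets[-1][1]
    if total == 0:
        return float('nan')
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                # Quantile falls in the open-ended bucket: return the
                # largest finite bound, as Prometheus does.
                return prev_bound
            # Linear interpolation within the bucket.
            return prev_bound + (bound - prev_bound) * \
                (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Hypothetical lookup-time buckets (seconds): most requests are fast,
# but a small tail of slow ones drags p99 far above p95.
buckets = [(0.05, 900), (0.1, 970), (0.25, 985), (0.5, 990),
           (1.0, 995), (float('inf'), 1000)]
print(histogram_quantile(0.95, buckets))  # ~0.086s, in the 0.05-0.1s bucket
print(histogram_quantile(0.99, buckets))  # 0.5s, dominated by the slow tail
```

This is also why a latency excursion shows up disproportionately in p99: a handful of slow requests lands in the high buckets and moves the interpolated tail quantile, while the median barely changes.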
[14:43:38] [0] https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/3ff486473879abe7eafcf61e60f09784ee5071eb/wmf-config/ProductionServices.php#78
[14:48:43] in a bit more detail, the issue with this being the mediawiki metric (mediawiki_IPReputation_ipoid_data_lookup_time_bucket) is that we don't actually see _where_ the latency excursion happened - i.e., down to the specific mediawiki / worker node
[14:49:19] it's all obscured by the statsd exporters, which are shared across a given mediawiki service
[14:51:46] what I _can_ say, though, is that when one of these excursions happens, it's reflected across most / all statsd exporters simultaneously, which I think suggests the client (mediawiki) that experienced it also has some spread (though there's a lot of hand-waving there in terms of connection spread)
[14:53:26] (and to use the word "connection" here is also misleading, since it's UDP)
[14:57:36] Forgive my ignorance, but what is connecting over UDP?
[14:58:50] re: mesh avoidance, we are working on a solution for that, ref T419289
[15:00:19] inflatador: apologies, what I mean is that mediawiki (the opensearch client) is sending statsd (UDP) to the exporter, and that's where we actually see the latency exported as a prometheus metric. that layer of indirection loses information about where the latency excursion happened (i.e., the specific mediawiki pod)
[15:00:47] Ah, OK, that makes sense
[15:01:05] And we don't have the normal envoy telemetry because the opensearch pods are currently terminating TLS at the pod level
[15:04:19] per https://w.wiki/J$MB, the issues don't seem correlated with increased search rate or latency
[15:04:38] so, looking at T419289, I'm not sure that blocks using the mesh on the _client_ side (i.e., mediawiki)
[15:04:47] maybe something goofy on the return side?
[15:05:16] Interesting, does that mean we could enable the mesh without terminating TLS at Envoy?
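[Editor's note: to make the "connection is misleading, it's UDP" point concrete - a statsd metric is a single fire-and-forget UDP datagram, with no handshake and no delivery guarantee. The sketch below sends one statsd-format timing sample to a local receiver standing in for the shared exporter; the metric name and value are made up for illustration.]

```python
import socket

# Local receiver standing in for the shared statsd exporter.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(('127.0.0.1', 0))   # OS picks a free port
addr = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# A timing sample in statsd wire format: <name>:<value>|ms
payload = b'MediaWiki.IPReputation.ipoid_data_lookup_time:142|ms'
sender.sendto(payload, addr)      # no connection, no handshake, no ack

data, source = receiver.recvfrom(4096)
print(data.decode())
# The exporter only sees the datagram's source address; once it
# aggregates samples from many clients into one Prometheus histogram,
# the per-client (per-mediawiki-pod) breakdown is gone - which is the
# layer of indirection described above.
sender.close()
receiver.close()
```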
[15:06:12] so, I _think_ T419289 is about having envoy as the external-facing TLS terminator in the opensearch pod. is that correct?
[15:06:20] yes
[15:06:58] cool, that should be orthogonal to whether mediawiki, as a client, proxies requests to this service via its local envoy
[15:07:28] the way this is set up at the moment has it hitting the discovery service directly, which leaves us flying blind
[15:08:39] inflatador: one thing I just noticed - the port is 30443, so presumably this is using k8s ingress in the dse-k8s cluster. is it possible that something is intermittently hammering your istio envoys since the 17th?
[15:09:44] swfrench-wmf we've checked but nothing jumps out at me so far, ref https://w.wiki/J$Mz
[15:10:15] that being said, we do have a new service called "queryhammer" so maybe I should check when that was deployed ;)
[15:11:09] it's not deployed ATM
[15:12:03] so, some of these are ... maybe ooming? e.g., https://grafana.wikimedia.org/goto/ffghyv2j7lkw0b?orgId=1
[15:12:36] not widespread enough to explain things, but curious
[15:14:27] blackbox probes for the k8s services (including opensearch-ipoid) don't show a change in latency https://w.wiki/J$NP
[15:15:08] in any case, I need to step away for a bit. probably the most useful thing for the moment visibility-wise would be if the T&S folks reconsider their decision in [0] - i.e., set up the ipoid extension to use envoy rather than hitting the discovery service directly
[15:15:08] [0] https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/3ff486473879abe7eafcf61e60f09784ee5071eb/wmf-config/ProductionServices.php#78
[15:17:27] no worries. And I'm not sure if mesh is an option yet. I ported the OpenSearch chart from upstream and I may not have added all the pieces. I'll check
[15:22:40] a quick diff of a service that (presumably) supports mesh ( https://w.wiki/J$Pd ) vs. the OpenSearch cluster chart ( https://w.wiki/J$Pk ) tells me we'd need to update the chart to offer OpenSearch over the mesh
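[Editor's note: the "maybe ooming?" check above can be done mechanically by scanning container statuses for OOMKilled terminations. The sketch below does this over data shaped like the `.items` list of `kubectl get pods -o json` (only the fields used here); the pod names and counts are hypothetical.]

```python
def find_oom_kills(pods):
    """Return (pod, container, restart_count) tuples for containers whose
    last termination reason was OOMKilled. `pods` mirrors the .items list
    of `kubectl get pods -o json`, reduced to the fields used here."""
    hits = []
    for pod in pods:
        for cs in pod.get('status', {}).get('containerStatuses', []):
            term = cs.get('lastState', {}).get('terminated', {})
            if term.get('reason') == 'OOMKilled':
                hits.append((pod['metadata']['name'], cs['name'],
                             cs.get('restartCount', 0)))
    return hits

# Hypothetical pod list for illustration: one container was OOM-killed
# and restarted, the other is healthy.
pods = [
    {'metadata': {'name': 'opensearch-ipoid-0'},
     'status': {'containerStatuses': [
         {'name': 'opensearch', 'restartCount': 3,
          'lastState': {'terminated': {'reason': 'OOMKilled'}}}]}},
    {'metadata': {'name': 'opensearch-ipoid-1'},
     'status': {'containerStatuses': [
         {'name': 'opensearch', 'restartCount': 0, 'lastState': {}}]}},
]
print(find_oom_kills(pods))  # → [('opensearch-ipoid-0', 'opensearch', 3)]
```

As noted in the log, a few OOM kills would not be widespread enough on their own to explain the latency pattern, but they are cheap to rule out.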
[15:23:58] inflatador: looking at logstash, errors are coming mainly (only?) from wikikube-worker1350.eqiad.wmnet (https://logstash.wikimedia.org/goto/b92daac72cf91b7eabb4f39bb0356b97)
[15:31:17] inflatador: sorry, scratch that - wrong logstash query, my bad
[15:31:45] Yeah, I was wondering about that; it seems like if you look at individual entries, there are different hosts involved
[15:32:51] the only thing I can see is that both the latency issues and the errors are from mw@eqiad, not mw@codfw
[15:33:26] That's interesting, since iPoid is only active in eqiad at the moment
[15:34:40] yes, strange: mw@codfw -> opensearch-ipoid@eqiad is generally slower (as expected) but does not seem to suffer from the same bad p99 seen with mw@eqiad -> opensearch-ipoid@eqiad
[15:40:06] which dashboard are you looking at for MW -> OpenSearch latency?
[15:43:24] inflatador: unfortunately the ipoid dashboard blends mw@eqiad & mw@codfw, so I used this: https://w.wiki/J$T6
[15:44:00] Nice, thanks
[15:44:33] you see the same split in the logstash errors if you filter codfw vs eqiad hosts
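[Editor's note: the key observation here is that splitting tail latency by the client's datacenter isolates the problem - mw@codfw pays a steady cross-DC penalty but has a clean tail, while mw@eqiad is fast at the median with a bad p99. A minimal sketch of that per-DC split over raw samples; the sample values are fabricated to illustrate the pattern, not taken from the dashboards linked above.]

```python
import math
from collections import defaultdict

def p99(samples):
    """Nearest-rank p99 of a list of latency samples (milliseconds)."""
    s = sorted(samples)
    return s[math.ceil(0.99 * len(s)) - 1]

def tail_latency_by_dc(samples):
    """Group (source_dc, latency_ms) samples and return p99 per DC."""
    by_dc = defaultdict(list)
    for dc, ms in samples:
        by_dc[dc].append(ms)
    return {dc: p99(v) for dc, v in by_dc.items()}

# Fabricated samples illustrating the observed pattern: mw@codfw is
# uniformly slower (cross-DC hop) but has no bad tail, while mw@eqiad
# is fast at the median with occasional large excursions.
samples = [('eqiad', 20)] * 97 + [('eqiad', 900)] * 3 \
        + [('codfw', 55)] * 99 + [('codfw', 70)]
print(tail_latency_by_dc(samples))  # → {'eqiad': 900, 'codfw': 55}
```

A blended dashboard averages these two populations together, which is exactly why the per-DC view linked in the log was needed to see the split.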