[07:47:27] o/
[07:58:45] o/ I’ll catch up on emails…
[08:00:41] o/
[10:33:59] sigh... the SUP producer job is stuck in codfw failing with "java.lang.IllegalArgumentException: Missing property '/CirrusSearchConcreteNamespaceMap'"
[10:34:40] "similar" to what happened this weekend... seems like a wiki is not working as expected
[10:54:27] :(
[13:12:49] going to ship https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/173 to unblock the pipeline
[13:14:02] but it seems like Gergo found the reason for the failure: the wiki user used by the SUP got blocked by an admin, which caused the API to respond with something that the SUP did not understand
[13:23:13] actually Gergo did fix the user and updates are flowing again, so the fix above is no longer urgent
[13:29:13] dcausse ack
[13:29:46] shoot. I was monitoring SUP last week, but did not catch that issue :(
[13:34:07] gmodena: it started to fail this weekend; I applied a quick fix to the "consumer" flink job, but it got worse this morning, affecting the "producer" job
[13:34:42] ah! It sensed you came back :(
[13:36:32] dcausse hope you had a good time last week!
[13:36:48] yes, was nice, thanks! :)
[13:36:58] taking a quick break
[13:37:20] break++. ttyl.
[13:37:25] dcausse Trey314159 gehel just FYI: I added a draft doc for T385972 in Search's google drive. Hopefully you have been notified, holler if not :)
[13:37:26] T385972: Deploy and test new MLR models - https://phabricator.wikimedia.org/T385972
[14:15:46] looks like we had some more search latency issues over the weekend...checking
[14:18:26] gmodena: I can't seem to find the doc you're mentioning...
[14:18:57] correction, I found it! https://docs.google.com/document/d/1WAhryEpEQDFjkc9FaNSY5rtkYSyHtU_8CgWNm2ExuAA/edit?tab=t.0
[14:22:16] dcausse thanks for looking into the search latency stuff...I'm gonna ban elastic1066 and reboot it unless you think it's a bad idea
[14:29:09] looks like we're getting alerts for Cloudelastic weighted tags as well
[14:34:50] gmodena: I didn't get a notification and I couldn't find it... but gehel provided a link, so I have it now
[14:37:21] gehel Trey314159 ack!
[14:40:33] inflatador: ack, not sure it's something specifically related to elastic1066 or just that it has many "hot" shards
[14:40:54] banning/unbanning will certainly shuffle the shards around
[14:42:47] dcausse do you think there is room to increase the search queue value, or would you prefer that I just add the relforge hosts back into the eqiad cluster (or any other suggestions)?
[14:44:06] inflatador: the rejection is just a symptom of the system being too slow, increasing it won't help
[14:45:57] dcausse ACK, what do you think is the bottleneck? I'm not seeing any high load/memory usage etc. on any of the hosts, am I just missing it?
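A quick way to follow up on the per-node rejection question above is the standard _cat thread pool API; this is only a sketch (host and port are assumptions, any node of the affected cluster should answer):

```bash
# Per-node search thread pool stats; a node whose "rejected" count is much
# higher than its peers (e.g. elastic1066) is a hint that it hosts hotter shards.
# Host/port are assumptions; point this at any node of the cluster being checked.
curl -s 'http://localhost:9200/_cat/thread_pool/search?v&h=node_name,active,queue,rejected,completed'
```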
[14:46:53] if the latency issues are due to a couple of nodes hosting too many hot shards we should try to find a way to monitor this; elastic1066 rejecting a lot more than the others could be an indication of this
[14:48:30] inflatador: well, the poolcounter & threadpools are here to make sure that the system does not get overloaded by limiting concurrency
[14:50:56] dcausse ACK, and when the search queue is full, that means it's not doing enough work to clear the queue...but I would expect that to translate into some kind of resource limitation
[14:57:41] inflatador: the limiting factor is probably the size of the thread pool, which limits the concurrency; if the system is already slow running just a few queries then increasing that size won't help
[14:58:32] if on the other hand the problem is "traffic" related, where we see a higher number of complex queries to run, then it's another issue and we need to identify this traffic pattern and limit it
[14:59:19] given it's always happening at the busiest time of the day I'm suspecting the former, but I don't have any proof for this
[14:59:40] I'll get a CR started for monitoring rejections per node
[15:01:37] inflatador: I think we already have that in elasticsearch_thread_pool_rejected_count
[15:03:11] \o
[15:03:17] o/
[15:04:31] Welcome back!
[15:04:35] welcome back! :)
[15:06:38] random guess: it's often IO related; particularly if it has to hit disk too much, queries slow down by a large factor, which reduces throughput
[15:08:52] meh, replaced my phone and the codes didn't switch over....gotta go power up the old one :P
[15:09:13] Another random guess: individual queries for busy shards are maxing out single-threaded performance?
[15:11:06] it's happened in the past, but that was due to some analysis chain problems
[15:11:17] in theory i guess regex could if the timeouts aren't working
[15:11:20] elastic1066 rejections happened in bulk for two latency spike instances: 2025-02-28T20:00 and 2025-03-01T20:00, but not the other instances
[15:12:00] ah OK, that was what I was trying to solve for with the write queue, but I guess that's not much of a factor
[15:12:21] it's hardly visible in the host overview https://grafana-rw.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=elastic1066&var-datasource=thanos&var-cluster=elasticsearch&from=now-7d&to=now
[15:13:31] I'm also not seeing the thread pool rejections alert? Checked https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/icinga/manifests/monitor/elasticsearch . We do have the pool counter rejection alerts, but AFAIK they never actually fired
[15:16:33] inflatador: I'm not sure we ever added an alert on elastic thread pool rejections
[15:17:21] for poolcounter there must be one; that it did not fire perhaps means that we never actually rejected enough during those spikes?
[15:20:41] yea pool counter rejections, at least on 3/1, were pretty minor. When it's bad it rejects tons, this was 30-50 qps
[15:21:12] True. Which is good! But I didn't look at rejections on individual hosts, I think that would've helped https://phabricator.wikimedia.org/P73928
[15:22:13] hmm, node comparison dashboard is still graphite :(
[15:22:53] damn, I thought I got all of them ;( . I'll make a task for that
[15:26:46] seeing some threads in com.o19s.es.ltr.feature.store.index.Caches.cacheLoad, in the elastic cache, updating its LRU info
[15:30:58] curious, i guess we did have trouble with that cache some time ago but it's been nice since it was fixed
[15:33:44] yes... I'm not sure how to interpret that in hotthreads tho... I would expect updating the LRU map to be quite fast, but it seems like I see it being captured waiting for a lock a couple of times
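For context, the hot threads being discussed here can also be sampled live rather than from the periodic log dumps; a minimal sketch using the stock _nodes/hot_threads API (host/port are assumptions):

```bash
# Grab a live hot threads sample from every node; useful to see whether frames
# like Caches.cacheLoad keep showing up outside the periodic log dumps.
# Host/port are assumptions; the query parameters are the stock hot_threads options.
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=10&ignore_idle_threads=true'
```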
[15:35:56] grep "com.o19s.es.ltr.feature.store.index.Caches.cacheLoad" /var/log/elasticsearch/elasticsearch_hot_threads-production-search-eqiad.log.1 | wc -l -> 162
[15:36:12] grep "ContextIndexSearcher.java:199" /var/log/elasticsearch/elasticsearch_hot_threads-production-search-eqiad.log.1 | wc -l -> 496
[15:37:13] did we add churn metrics to that cache?
[15:37:26] i remember we thought about it, but not sure if it was ever done
[15:37:26] there are some metrics, looking
[15:38:42] it's just a count and a mem usage
[15:38:58] no eviction info
[15:40:29] i guess if mem usage was mostly full we could guess it needs a size increase
[15:42:36] ah no, there's a nicer api: localhost:9200/_ltr/_stats
[15:45:49] highest misscount I see is 1788 vs 39887511 hits (.0044% miss rate)
[15:46:35] doesn't seem that high, but we should probably get it collected and graphed somewhere. I forget how long it takes to load a model, but i wouldn't expect more than a second or so
[15:47:05] so I'm not sure why cacheLoad was captured 162 times in hotthreads vs 496 for ContextIndexSearcher.java:199, which is supposedly the busy part of the search query
[15:52:55] not sure if i'll make the meeting...auth app on my phone is refusing to generate codes for ebernhardson@wikimedia, or to let me export it into the new phone :S
[15:53:28] We can invite ya as a guest
[15:53:39] ah, yea that will work
[15:54:03] i wonder if it's refusing because there's no sim card...but then i have to find that tiny tool to pop out the sim card
[16:00:31] oh! dumb new security rules...i never set a pin on the old phone and it won't export without a pin. but now it crashes when exporting :P
[16:01:09] :(
[16:01:41] someone drop me a link? i can't log into calendar :P
[16:02:00] ebernhardson: https://meet.google.com/eki-rafx-cxi?authuser=0
[16:52:07] gmodena: whenever you have a chance could you publish an updated version of the opensearch image (https://www.mediawiki.org/wiki/MediaWiki-Docker/Configuration_recipes/OpenSearch)
[16:52:52] it should have a typo fixed in analysis-stconvert; this might be useful for Trey and the tests he plans to run
[17:08:22] ebernhardson: random things before I forget: I moved cindy to cirrus-integ4.search.eqiad1.wikimedia.cloud because I messed up cirrus-integ03 so badly and I could not get it to work because of bullseye; it's running bookworm now
[17:09:27] haven't deleted the cirrus-integ03 instance yet in case you had things to recover there
[17:12:16] dcausse: sounds good, it needs to be moved forward to new versions regularly anyways
[17:12:58] looks reasonable, and now my bash history should remember the right host. thanks!
[18:09:02] workout/lunch, back in ~1h
[18:10:15] dcausse shoot. I published 1.3.20-1, but did not update the docs (https://hub.docker.com/repository/docker/gmodena/cirrussearch-opensearch-image/general)
[18:11:28] updated. Thanks for the heads up!
[18:12:50] gmodena:
[18:12:52] thanks!
[18:13:07] np
[18:13:23] it's the version I'm using now btw, and so far so good
[18:15:17] dcausse don't know if you saw it already - but Trey314159 reviewed and +2ed the articlecountry CR. It should be rolled out this week. T386068
[18:15:18] T386068: Implement articlecountry a new CirrusSearch keyword - https://phabricator.wikimedia.org/T386068
[18:15:36] nice!
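Circling back to the LTR cache numbers above: the _ltr/_stats endpoint dcausse quoted can simply be dumped and pretty-printed; the exact field layout depends on the ltr plugin version, so this is only a sketch (host/port assumed):

```bash
# Dump the LTR plugin stats (cache sizes, hit/miss counters, ...) and
# pretty-print them; field paths vary by plugin version, so none are
# assumed here beyond the endpoint itself.
curl -s 'http://localhost:9200/_ltr/_stats' | jq .
```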
[18:16:44] relatedly, regarding the backfill of articlecountry tags: waiting for fawiki to finish, then I'll bump the ratelimit from 20 to 60 tags/sec
[18:16:57] ack
[18:17:45] dcausse the etherpad you left me for SUP was super useful. Should we move it to wikitech?
[18:19:30] gmodena: oh sure, https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater could be a good place for this?
[18:19:42] sounds good!
[18:20:55] I might bump the parallelism of consumer-cloudelastic from 2 to 3 to align it with prod; I think I kept it at 2 because it was just fine, but now it feels odd to have different throughput settings for it
[18:24:14] sounds reasonable
[18:29:14] if anyone has a sec: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1124166 I'll deploy this and re-enable the saneitizer everywhere now that the Cirrus_Streaming_Updater user is fixed
[18:33:03] looking
[18:33:54] dcausse: +1'd, i suppose i'm not sure if we might run into quota limits somewhere, but that will be quickly obvious
[18:34:17] oh right, haven't checked that yet
[18:34:31] esp. regarding backfills
[18:36:38] values-backfill.yaml should reset the parallelism to 2 so hopefully this one will stay at 2
[18:37:22] but overall that's an additional 4cpus + 6Gb in eqiad...
[18:40:54] not sure where to find the default quota per namespace... seeing resourcequota: in values/common.yaml but it's 90cpus & 100Gb, seems like a lot
[18:41:30] hmm, yea seems like plenty
[18:45:35] not sure what happens to the saneitizer state when being disabled/enabled, might start from scratch I suppose?
[18:46:39] dcausse: did you try kubectl get resourcequota -n ?
[18:47:02] gmodena: no, did not even know that existed :)
[18:47:03] dcausse: it should continue where it left off, i changed it a while ago to use a "noop-loop" when disabled so it's still in the graph, still in state, just not doing anything
[18:47:04] will try
[18:47:16] ebernhardson: oh nice
[18:47:17] A quick check on cirrus-streaming-updater confirms the values you reported from values/common.yaml
[18:47:40] gmodena: cool, then we should be good, thanks for checking
[18:47:58] dcausse np :)
[18:53:45] meh, helm-lint taking forever in zuul gate-and-submit...
[19:01:30] Trey314159 I took a stab at some of the visual/readability improvements you proposed for the T385972 notebooks. https://people.wikimedia.org/~gmodena/search/mlr/ab/2025-02/1/
[19:01:31] T385972: Deploy and test new MLR models - https://phabricator.wikimedia.org/T385972
[19:02:08] (same data as in top level)
[19:03:08] ebernhardson when you are settled back in, would you like to touch base on some MLR things? FYI: we deployed 20241122, and performed a new experiment.
[19:03:51] and while we migrated the airflow scheduler to k8s, we had to tweak mjolnir a bit
[19:04:20] there should be a paper trail, but happy to have a chat sometime :)
[19:04:37] gmodena: sure, i'm mostly reading things this morning. looked through commits in gitlab, reading the ja analysis report
[19:04:48] no particular plans yet, so i have time :)
[19:05:12] got the login working, have access to normal things
[19:05:47] ebernhardson ack!
[19:06:41] dinner
[19:06:47] mjolnir was kinda the first airflow thing we did...so it's a bit odd, partially from lack of understanding, and partially because it was initially designed to use oozie
[19:07:33] i did see the problems with feature selection, i've wondered if we really even need that step these days. We should have a few years of selection history, could probably determine a static set of features
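For reference, the namespace quota check dcausse and gmodena discussed earlier (around 18:46) would look roughly like this; the namespace name comes from the conversation, everything else is stock kubectl:

```bash
# List the ResourceQuota objects in the SUP namespace, then show the
# used-vs-hard breakdown per resource. Namespace taken from the conversation;
# kubeconfig/context selection is left out.
kubectl get resourcequota -n cirrus-streaming-updater
kubectl describe resourcequota -n cirrus-streaming-updater
```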
[19:08:02] but it did vary between wikis, so maybe it is worthwhile? hard to say :P
[19:18:48] that whole weird thing in mjolnir.kafka.client might be able to go away too, but it hasn't seemed worthwhile to reengineer. That was originally built because we couldn't open the firewall between elastic and hadoop at the time
[19:19:41] gmodena: I'll take a look at the updates to the report.
[19:19:44] Relatedly, I'm wondering if we should run another A/B test on Japanese with the same models. It would be a good data point for understanding whether the differences come from the models, or if we just got wildly unlucky on the bucketing of users, which would be very interesting to discover!
[19:32:22] hm... not sure the saneitizer came back online... https://grafana-rw.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-6h&to=now&viewPanel=35 is still flat ...
[19:34:30] hmm, indeed that's not doing anything :( will poke around
[19:34:51] the flink UI reports Source: saneitizer -> saneitizer-fetch -> saneitizer-collect -> FINISHED
[19:35:02] hmm, it should never finish :S
[19:35:30] end time 2025-03-03 19:45:45, should be when I deployed it I suppose
[19:38:12] hmm, my clock says it's 19:38 now?
[19:38:23] i wonder if i'm on daylight savings wonkiness :P
[19:38:36] that's the right time in UTC :)
[19:40:06] not sure how many others do this, but I did end up setting my work computer to UTC. Makes it easier to read graphs ;)
[19:41:29] i set my irc time to UTC, keeps my system clock normal but i have a reference
[19:41:49] for saneitizer, indeed the logs look like it shuts down just after startup, but i don't see any logs about why :S
[19:52:08] suspect i'm missing logs somewhere...at a minimum i thought there should be some info logging as it starts up
[19:52:17] probably just forgetting :P
[20:07:24] ryankemper (or anyone else), PR for repurposing the relforge hosts back to elastic: https://phabricator.wikimedia.org/T387782
[20:27:43] inflatador: seems reasonable, might have an issue with elasticsearch when removing half the masters; i think you have to do https://www.elastic.co/guide/en/elasticsearch/reference/7.10/add-elasticsearch-nodes.html#modules-discovery-removing-nodes
[20:38:39] ebernhardson ACK, we've been bitten by this one in the past. I'll double-check your link, was gonna follow https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Replacing_master-eligibles
[20:40:01] I didn't know about this voting API, very cool
[20:42:25] i've never used that api before either :)
[20:43:37] assuming it works, this'll be easier than one-offing hosts. I'll give it a try!
[20:44:51] hmm, didn't realize before but saneitizer is stopped everywhere, not just cloudelastic
[20:48:48] `curl -s http://0:9400/_cluster/state | jq .'metadata.cluster_coordination'` to check the current voting config exclusions, FWiW
[20:55:55] hmm. I excluded relforge1007 from the alpha cluster, but it didn't lose its master-eligible status according to /_cat/nodes ...that's implied by the docs, but still a little surprising
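For the record, the exclusion step described here maps onto two calls from the ES 7.x voting configuration API linked above; a minimal sketch (host/port copied from the check command above, so treat them as an assumption about which cluster is targeted):

```bash
# Exclude a master-eligible node from voting before repurposing it
# (relforge1007 is the node under discussion; host/port as in the check above).
curl -s -XPOST 'http://0:9400/_cluster/voting_config_exclusions?node_names=relforge1007&timeout=5m'
# Once the node has been removed or reconfigured, clear the exclusion list:
curl -s -XDELETE 'http://0:9400/_cluster/voting_config_exclusions'
```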
[21:01:43] i would guess /_cat/nodes isn't referencing the full available state, simply saying what the node is configured as
[21:03:43] I'm suspicious in SUP that we aren't capturing application logs somehow; at a minimum the initial setup in ConsumerGraphFactory should log something on the jobmanager, and the taskmanager should log something when receiving the state from flink, but I'm not seeing either log :S
[21:04:40] maybe i forgot where to find these things, but the log4j-console.properties implies they should be in `kubectl logs ...`
[21:04:56] * inflatador vaguely remembers looking that stuff up in Logstash
[21:06:38] I guess if it's not in kubectl logs, it probably won't be in logstash
[21:06:40] i checked there under `App Logs (Kubernetes)` but it doesn't have app logs either
[21:06:53] yea the pipeline iirc is from kubectl logs into logstash
[21:08:12] makes sense. Going down to one master in alpha is triggering some stack traces, but it's because of some one-off indices I created to test mixed cluster behavior
[21:24:32] hmm, well actually we certainly get logs from our other bits like RouteFetchFailures, so logging clearly works ... just not finding relevant saneitizer logs :S
[21:38:36] restarting consumer-cloudelastic just to ensure it has fresh logs with a known start time
[21:38:50] helmfile diff looks clean, just the restartNonce changing
[21:50:38] well, found the log in the job manager, it confirms it's configured as unbounded. doesn't explain why it finishes without doing anything :S
[23:15:36] meh, added some logs...but still no closer to an answer :( Only additional info i suppose is that state is empty
[23:19:46] well, i have one suspicion i guess. SanitySourceSplitEnumerator::addReader should result in a log, but nothing came out, suggesting it wasn't run. But i'm not clear how that could happen :S
[23:47:35] Anyone from search/WDQS team attending the Istanbul hackathon?
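On the missing SUP application logs above, a quick way to double-check what actually reaches `kubectl logs` might look like this; the namespace comes from the conversation and the pod name placeholder is hypothetical:

```bash
# Find the flink jobmanager/taskmanager pods for the SUP release, then grep
# their stdout for saneitizer-related lines. Namespace taken from the
# conversation; <pod-name> is a placeholder for whatever `get pods` shows.
kubectl get pods -n cirrus-streaming-updater
kubectl logs -n cirrus-streaming-updater <pod-name> | grep -i saneitizer
```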