[12:43:53] <inflatador>	 <o/
[12:55:02] <JolanW>	 @inflatador Hiya!
[13:02:08] <inflatador>	 Greetings!
[13:14:30] <ebernhardson>	 \o
[13:43:13] <ebernhardson>	 hmm, can't load https://en.wikipedia.org/wiki/Special:ApiSandbox
[13:43:34] <ebernhardson>	 in my logged in browser it fails, in incognito the spinner simply keeps spinning
[13:50:19] <jakob_WMDE>	 ebernhardson: thanks for the quick review yesterday! could you take another look at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseCirrusSearch/+/1164448/5..6 - those scoring weights in particular are not something where I feel like I know what I'm doing.
[13:51:54] <ebernhardson>	 jakob_WMDE: doh, i did look i just forgot to finish the review. thanks for the reminder. It does look reasonable as is, and indeed the weights are not easy to set by hand, but in this case we can mostly guess
[13:52:49] <ebernhardson>	 the primary way to set weights is to export a bunch of queries and their scoring information into a program and then sweep (via coordinate ascent) over some possible values to find what works...but thats kinda involved :)
[13:54:13] <ebernhardson>	 iirc with wikidata the top level weights with the constant score are more about creating a hierarchy of results, with the higher constant score typically beating the lower score, and the other scoring variations having more influence on sorting within a single level of the hierarchy
[13:59:49] <jakob_WMDE>	 ahh ok, I had a feeling that doing this properly would require something more sophisticated than picking a number that looks like it would work :D
[14:01:22] <jakob_WMDE>	 EntityFullTextQueryBuilder apparently got away with using the same weights for the stem fields and the .plain one for a few years now
[14:02:09] <ebernhardson>	 jakob_WMDE: interesting, conceptually i suppose the question is,  are stemmed results as good as exact matches, and should we prefer exact matches? Sadly in search though there are never easy answers, only different use cases :)
[14:02:47] <ebernhardson>	 at a general level i like them separate because it allows tuning, but indeed the results are probably going to be fairly similar
[14:02:55] <jakob_WMDE>	 my gut feeling is that exact matches should be slightly preferred, which is why I liked your suggestion on the patch
[14:03:41] <jakob_WMDE>	 do you think tuning these weights properly is something we should look into, or is that something best left to you experts? :)
[14:04:56] <ebernhardson>	 jakob_WMDE: if you're quite interested we could probably work out running it, but it's only ever been run by the author in the past. Doc's are minimal, and we've changed from elasticsearch to opensearch since the last run
[14:05:29] <ebernhardson>	 jakob_WMDE: i guess i'm saying, if it sounds fun we could, but otherwise better to perhaps have a ticket and get that into the pipeline in the next month or two
[14:06:12] <ebernhardson>	 might be useful to get some new tuning in there anyways, i think it's been a couple years and we have the new `mul` fields as well since then
[14:06:40] <jakob_WMDE>	 ebernhardson: sounds reasonable. we also haven't heard from any users that the order of search results was bad, but maybe we'll get more feedback once we advertise those REST routes officially
[14:07:53] <ebernhardson>	 jakob_WMDE: +2'd the relevant patches, all looks like it should work
[14:08:16] <jakob_WMDE>	 ebernhardson: great, thanks for the help!
[14:24:35] <ebernhardson>	 ls
[14:34:18] <inflatador>	 new plugins pkg is applied on relforge/cloudelastic, moving on to prod
[14:36:57] <ebernhardson>	 nice!
[14:41:48] <ebernhardson>	 was poking at the code for that wikidata explain optimization...i'm midly scared of the Makefile that i wrote for it :P https://gerrit.wikimedia.org/r/c/wikimedia/discovery/relevanceForge/+/472077/19/Makefile.tf_autocomplete
[16:50:10] <inflatador>	 I'm restarting cirrussearch CODFW at the moment. This will also apply the row-rack awareness change we made in T391392 (the current cookbook logic won't work until row/rack awareness is applied, so I'm doing it manually)
[16:50:10] <stashbot>	 T391392: Use profile::netbox::host instead of regex.yaml for Cirrussearch rack/row awareness - https://phabricator.wikimedia.org/T391392
[16:53:58] <inflatador>	 apparently if you try to restart a service that doesn't exist, it'll set off systemd unit failure alerts ;( . I guess I need to clear the failed omega units from the psi hosts ;)
[17:09:26] <ebernhardson>	 sadly, i've also done that before :)
[17:30:27] <inflatador>	 lunch, back in ~40
[18:32:37] <Trey314159>	 Ugh.. thanks to pfischer for recognizing the problem with my docker image—I'd upgraded on autopilot—and thanks to gmodena for creating an arm image for opensearch. My sample load time is down from 5m to 37s, which is about where I expected (30-35s would have been my guess). Thanks to inflatador for looking at it with me yesterday and opening T398461 so we might get both up-to-date and built-for-arm in the same image.
[18:32:38] <stashbot>	 T398461: Attempt to build multi-arch cirrussearch-opensearch docker image - https://phabricator.wikimedia.org/T398461
[18:34:10] <Trey314159>	 I do still have the problem that uncompressing the plugins over the existing directory—which I think might be because I erase something necessary—but I can just insert sudachi and I'm good to go!
[18:34:43] <ebernhardson>	 nice!
[18:36:27] <inflatador>	 {◕ ◡ ◕}
[19:19:50] <gmodena>	 Trey314159 happy to have helped! 
[19:34:14] <ryankemper>	 lunch
[20:11:13] <inflatador>	 3 out of 4 CODFW rows are restarted, waiting for the main cluster to settle down before doing row D
[21:40:06] <inflatador>	 ryankemper looks like I got out of taking my son to climbing class...starting on CODFW row D now
[21:40:19] <ryankemper>	 ack
[22:00:32] <inflatador>	 OK CODFW is completely done. We still need to roll-restart EQIAD before T397227 can be closed
[22:00:33] <stashbot>	 T397227: Build and deploy OpenSearch plugins package for updated regex search - https://phabricator.wikimedia.org/T397227
[22:03:41] <ryankemper>	 inflatador: nice!
[22:44:55] <ryankemper>	 meh, wdqs2009 caused a p4ge for `elevated 5xx errors from wdqs2009.codfw.wmnet in eqsin`
[22:45:03] <ryankemper>	 wdqs2009 is our legacy full graph host
[22:45:39] <ryankemper>	 i expect the host to lock up quite frequently, so will need to find a way to shut some of these pages off. really don't want this host causing that kind of noise
[22:52:21] <inflatador>	 ryankemper dunno if it would solve the problem, but maybe making it a non-paging service in Puppet would help?
[22:53:18] <ryankemper>	 inflatador: it's `page: false` in service.yaml
[22:54:54] <ryankemper>	 ATS has the (reasonable, but in this case annoying) assumption that its backends shouldn't have multiple 5xxs per second
[22:55:26] <ryankemper>	 a bit torn on what to do. the last thing I want to do is spend time spelunking through logs and banning whoever's slamming us so hard for a service that we don't make any guarantees on
[22:55:36] <inflatador>	 ACK, I agree
[22:56:00] <ryankemper>	 but doesn't seem like there's an easy "don't make noise about this service" button for the traffic monitoring
[22:56:11] <inflatador>	 I don't love doing it, but I'd consider disabling all alerts and waiting for the user to scream
[22:56:32] <ryankemper>	 well I would love to but we're hitting alerts like https://github.com/wikimedia/operations-alerts/blob/master/team-sre/cdn.yaml#L5
[22:57:14] <inflatador>	 On the other hand, if we can make some easy changes and keep the host from getting pegged all the time, maybe that's the way to go. If we need to do work to avoid more work, maybe that's the answer ;)
[22:57:56] <inflatador>	 maybe there's some nginx magic we can do to keep it from emitting 5xx errors
[22:58:38] <ryankemper>	 it's possibly going in the wrong direction but the `wdqs` lvs pool isn't yet torn down, and there's still 2 other hosts with the full graph
[22:58:58] <ryankemper>	 if ichanged backend.yaml instead of pointing to wdqs2009 to just point to the wdqs full lvs cluster
[22:59:07] <ryankemper>	 then we'd have 3x the number of hosts
[22:59:19] <ryankemper>	 but there's a risk that whatever's slamming us would topple 3 hosts as well...although I somewhat doubt it
[22:59:40] <inflatador>	 worth a try, assuming the other hosts still have the right data
[22:59:42] <ryankemper>	 actually the best hack I can think of is just to have a systemd timer to restart blazegraph every 5 minutes
[22:59:51] <ryankemper>	 that might do enough to prevent any pages
[22:59:59] <inflatador>	 Yeah I was gonna say, a sloppy mitigation like that could help too
[23:00:35] <ryankemper>	 that's probably the best place to start
[23:01:39] <inflatador>	 I gotta step away to cook dinner, but if you need a review or something feel free to text me anytime today or tomorrow, my # is on officewiki
[23:27:57] <ryankemper>	 swfrench pointed out there's an exclusion list in that linked alert so for the timebeing I have merged https://gerrit.wikimedia.org/r/c/operations/alerts/+/1166016 which will hopefully prevent further pages (at least for that specific issue)
[23:38:21] <inflatador>	 excellent