[10:09:04] lunch
[13:05:40] \o
[13:06:16] o/
[13:12:51] We have a request from netops to reimage 2 of the relforge servers, LMK if/when this is possible. We can do one-at-a-time without breaking the OpenSearch cluster (I think ;p ) ref T421718
[13:44:29] T421718: Search Platform: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T421718
[13:46:03] if we're confident that the current docker compose setup & existing docker volumes will survive I'm fine, otherwise I'd prefer to postpone this if possible to wait a bit more
[13:47:13] cindy still intermittently failing, but on different things :S
[13:48:07] the ticket makes it sound non-urgent, so I'm cool w/ waiting. ebernhardson any thoughts re: how hard it would be to put stuff back if we reimage?
[13:55:28] :/
[14:03:03] inflatador: on the prod hosts? In theory they just join the cluster and sync shards
[14:03:03] inflatador: needs to skip pairing sorry
[14:04:06] ebernhardson: this is relforge I think, but there I disabled replication to save space, so the re-image would have to keep the docker volumes
[14:04:07] dcausse NP
[14:05:02] ebernhardson yeah, I was talking about relforge. If it's not in an easily-reproducible state, then I can just punt this for a bit, maybe a month or two?
[14:06:08] inflatador: once T419397 & T419409 are done I should no longer need relforge in its current shape
[14:06:09] T419397: Get search results for different embedding models from semantic search - https://phabricator.wikimedia.org/T419397
[14:06:09] T419409: Get search results from semantic search using MIRACL benchmark dataset - https://phabricator.wikimedia.org/T419409
[14:07:19] dcausse np, this doesn't seem urgent. I'll ask again once we finish all the other reimages. That'll probably be a month or so
[14:11:10] thanks!
[14:14:25] hmm, well to be fair i guess cindy failed the same test three times in a row from an npm dependency update. curious
[14:29:59] weird, I think I rebased it this morning so it might well be related to these new deps
[14:39:33] maybe it's a mw update, i have a patch from yesterday that fails now too
[17:00:20] dinner
[18:15:14] the failure is from the PdfHandler extension; reverting the commit 'Replace global shell functions with Shell class' avoids the problem
[18:15:18] i guess we need some additional configuration
[18:42:31] no additional configuration, PdfHandler is just broken :P That change double-escapes the string, patch up
[18:54:53] looks like cloudelastic is red from my reimage...taking a look now
[18:58:37] hmm, it does have the downside of having fewer replicas
[18:59:01] whole lotta orphan aliases, I wonder if the automation that's supposed to clean that up is running
[18:59:36] seeing green and yellow now, so no data loss hopefully
[18:59:47] yeah, I just manually deleted all the orphan aliases
[18:59:53] That's all it was, no real data
[18:59:54] hmm, orphan aliases is curious, we usually only have aliases for prod wikis
[19:00:10] and usually wikis don't go away, they just get closed and stay in that state
[19:00:46] Maybe I'm using the wrong phrase...it's the indices that have no data and no aliases, from failed reimages
[19:00:52] err...failed reindexes
[19:01:13] oh, ok. I think we are supposed to clean them up automagically these days
[19:02:03] There were 27 of 'em, so maybe it doesn't run on cloudelastic or something?
[19:02:47] maybe it's something with our detection, there is an extra check before deleting the index that it's not "live"; if it might be a live index we leave it there
[19:03:26] it looks like that's just fetching the index aliases and considering any index with an alias to be live and undeletable. not sure
[19:03:34] not sure why the reindexes wouldn't be deleted then
[19:04:54] I also noticed our reimages hang waiting for all icinga checks to clear; not sure why that is, but it doesn't seem like a good idea to prolong reimages when things are in a bad state
[19:14:33] hmm, looks like Puppet can't handle IP changes either, a manual restart of ferm is required ;(
[21:26:39] ebernhardson do you happen to know if there are any reindexes in flight? Unlike cloudelastic, I see only a few unaliased indices in prod, mostly on psi/omega: https://etherpad.wikimedia.org/p/unaliased . I can hold off deleting if it'll break anything
[21:27:59] also, where does the auto-cleanup automation live? I'm guessing it's an airflow DAG?
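The "live index" detection discussed around 19:02–19:03 can be sketched roughly as follows. This is a hypothetical Python illustration, not the actual cleanup automation: it assumes input shaped like an OpenSearch/Elasticsearch `GET /_aliases` response and treats any index with at least one alias as live and undeletable, leaving only unaliased indices (e.g. leftovers from failed reindexes) as cleanup candidates. The function and index names are made up for the example.

```python
# Hypothetical sketch of the alias-based liveness check described in the
# log: an index with any alias is "live" and kept; unaliased indices are
# cleanup candidates. Input mirrors the shape of a `GET /_aliases`
# response: {index_name: {"aliases": {alias_name: {...}, ...}}}.

def find_unaliased_indices(aliases_response):
    """Return index names with no aliases, i.e. deletion candidates."""
    return sorted(
        index
        for index, body in aliases_response.items()
        if not body.get("aliases")
    )

# Made-up example data: one live index, two leftovers from a failed reindex.
example = {
    "enwiki_content_1699999999": {"aliases": {"enwiki_content": {}}},
    "testwiki_general_1700000001": {"aliases": {}},
    "testwiki_general_1700000002": {"aliases": {}},
}
print(find_unaliased_indices(example))
```

As noted in the log, this check alone would not explain why the 27 leftover indices on cloudelastic survived, since they had no aliases and so would not be considered live under this rule.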
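The PdfHandler breakage at 18:42:31 is blamed on double-escaping. As a minimal illustration of that bug class (in Python, not the actual PdfHandler PHP code), quoting an argument that is already shell-quoted turns the quote characters into literal data, so the command receives the wrong value:

```python
import shlex
import subprocess

# Illustration of the double-escaping bug class: shell-quoting a value
# that has already been quoted makes the quotes part of the argument.
path = "file name.pdf"

quoted_once = shlex.quote(path)          # correct quoting
quoted_twice = shlex.quote(quoted_once)  # quotes become literal data

# Echo each version through a shell to see what the command receives.
once = subprocess.run("echo " + quoted_once, shell=True,
                      capture_output=True, text=True).stdout.strip()
twice = subprocess.run("echo " + quoted_twice, shell=True,
                       capture_output=True, text=True).stdout.strip()

print(once)   # file name.pdf
print(twice)  # 'file name.pdf'  (literal quotes reach the command)
```

This matches the symptom described: a command built from an already-escaped string fails until the redundant escaping layer is removed.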