[11:44:39] Indeed, sorry for that wording, don't take it as an inhibition to continue doing workarounds as needed, that is always welcome and encouraged (even when not supported, maybe especially when not supported). In case it was lost, the point was that we should make the support expectations clear, as that is something we have not been very good at historically and it has hurt (and is hurting) both users and the team, so for any publicized feature/service we should add a "this is supported / this is not supported, might break" as needed.
[11:59:51] blancadesal: are you around? the fix for replica_cnf is blocking some other fixes on the envvars-api side, would be good to get it fixed as soon as possible
[12:04:37] I see you are OOO xd, ignore
[14:11:22] andrewbogott: I see some placeholders in https://wikitech.wikimedia.org/wiki/Incidents/2024-06-12_WMCS_toolforge_k8s_control_plane are you planning on updating those or should I?
[14:12:16] If you are happy to do it please do :)
[14:12:31] ok
[14:13:07] thanks!
[15:35:56] * arturo offline
[15:39:47] gtg, be back in a bit
[23:15:07] bd808: do you know if the elasticsearch clustering model presumes that every node has all the data, or if it does some kind of fancy replica counting?
[23:15:24] (I was expecting it to just be exact copies but it seems to be occupying different amounts of disk space on different nodes)
[23:26:26] andrewbogott: it should all be replica based per index. You could decide to make an index that only exists on one node or you could set the count such that there are multiple copies of a given index on each node.
[23:27:33] OK. So when I replace nodes some data will be lost, but that's because the client elected to have the data easily lost
[23:27:36] andrewbogott: https://bd808-test.toolforge.org/elastic7.php is one way to look at the cluster as an entity.
[23:28:09] uhhhh.... data loss shouldn't be required
[23:28:09] But it also sounds like I need to stop one node and then wait a while for it to sync before I stop the next one.
[23:28:46] If an index only exists on one node, how will that be saved when that node is lost? Can I drain ahead of time?
[23:28:49] you should add new nodes and then taint the old nodes so that the cluster moves data from the old to the new
[23:29:15] ah, ok! I started to read the cluster guide but got overwhelmed :) I will search for 'taint'
[23:29:22] But also probably won't do this until tomorrow anyway
[23:29:27] you can always increase the replica count to put a copy on each node too if there is disk space
[23:31:30] wait, but I thought replica count was a thing determined by the client?
[23:31:42] I wonder if there are good notes left from past iterations of this dance?
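The point bd808 makes here (replica count is a per-index setting that can be raised so a copy lands on every node, disk space permitting) can be checked and changed directly against the cluster's REST API. A minimal sketch, assuming an Elasticsearch 7.x endpoint at http://localhost:9200 and the Python requests library; the endpoint URL and the index name "sal" are hypothetical:

```python
# Sketch: inspect per-index primary/replica counts, then raise the replica
# count on one index. Endpoint and index name are assumptions, not taken
# from the chat above.
import requests

ES = "http://localhost:9200"  # hypothetical Elasticsearch endpoint

# List indices with their primary shard count, replica count, and store size.
print(requests.get(f"{ES}/_cat/indices?v&h=index,pri,rep,store.size").text)

# Replica count is an attribute of each index: raising it asks the cluster to
# keep more copies, which only succeeds if enough nodes and disk space exist.
requests.put(
    f"{ES}/sal/_settings",  # "sal" is an illustrative index name
    json={"index": {"number_of_replicas": 2}},
)
```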
[23:32:16] andrewbogott: replica count is an attribute of each index
[23:34:05] T236606 was the last bit rebuild
[23:34:06] T236606: Rebuild Toolforge elasticsearch cluster with Stretch or Buster - https://phabricator.wikimedia.org/T236606
[23:34:09] *big
[23:34:15] that page you linked makes it look like the new nodes are behaving ok (although they aren't ready for prime time because of haproxy and vip and such)
[23:34:51] yeah, it looks like 4-6 are joined to the cluster for sure
[23:35:23] so this should be simpler than the migration in that ticket (which seems to involve an entirely fresh cluster)
[23:36:56] the es versions ended up not being compatible last time so Jason had to do a dump and load cycle to populate the new nodes
[23:37:12] https://phabricator.wikimedia.org/T236606#5901064
[23:38:20] * andrewbogott nods
[23:38:29] We didn't have the service name layer in the old cluster either so there was a per-client config change to move to that
[23:38:38] this looks like what I want: https://opster.com/guides/elasticsearch/operations/elasticsearch-remove-node/
[23:40:43] yeah, that seems generally what you would want. Ideally you would find a way to keep the new replicas from landing on old nodes that will be decommed too between steps 4 & 5
[23:41:48] That would be more efficient but it seems to only take a few minutes to resync in any case.
[23:42:01] At least, that's how long it took the new nodes to go green.
[23:43:21] it looks like the biggest shards we have are less than 2GB and most are much smaller
[23:43:37] Assuming pcc is happy with that haproxy check, can I ping you to look on when I cut things over tomorrow?
[23:43:55] (mostly I want someone who has actually used ES to say "it's still working" a few times during the process)
[23:44:18] * bd808 squints at calendar
[23:44:58] My morning is pretty booked with meetings, but I can try to help test, yeah
[23:45:33] I have some meetings in the morning too, we can find a time on the fly.
[23:45:34] thanks!
[23:45:41] https://sal.toolforge.org/ and https://bash.toolforge.org/ will also tell you if those indexes are working
[23:46:10] oh yeah? if bash loads then all is well?
[23:46:16] oh and https://csp-report.toolforge.org/
[23:46:18] that's a quick test
[23:48:03] the cluster could be working with a particular index somehow lost, but yeah the cluster needs to be working for bash to show you quips
[23:48:38] cool
[23:48:52] ok, going to go cook, I will bother you tomorrow. Thank you!
[23:52:56] https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html has lots of data on tuning allocation
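The drain-before-decommission approach discussed above (the opster guide plus bd808's suggestion to keep new replicas off the outgoing nodes) corresponds to Elasticsearch's allocation filtering. A minimal sketch, under the same assumptions as the earlier snippet (hypothetical endpoint, made-up node IPs, requests library), which excludes the old nodes from shard allocation and then polls cluster health until everything has relocated and the cluster is green:

```python
# Sketch: move shards off nodes that will be decommissioned, then wait for
# the cluster to report green with no relocating shards. Endpoint and IPs
# are assumptions for illustration only.
import time
import requests

ES = "http://localhost:9200"          # hypothetical Elasticsearch endpoint
OLD_NODE_IPS = "10.0.0.1,10.0.0.2"    # hypothetical IPs of the nodes being removed

# Exclude the old nodes from allocation so the cluster relocates their shards.
requests.put(
    f"{ES}/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.exclude._ip": OLD_NODE_IPS}},
)

# Poll cluster health until relocation finishes and the cluster is green.
while True:
    health = requests.get(f"{ES}/_cluster/health").json()
    print(health["status"],
          "relocating:", health["relocating_shards"],
          "unassigned:", health["unassigned_shards"])
    if health["status"] == "green" and health["relocating_shards"] == 0:
        break
    time.sleep(30)
```

Once health is green with nothing relocating, the excluded nodes hold no shards and can be stopped one at a time without data loss.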