[05:50:07] ebernhardson: (for tomorrow) we'll need to do a bit of work to forecast elastic expansion needs, see https://phabricator.wikimedia.org/T379079#10298203 & https://phabricator.wikimedia.org/T379079#10298414 . Last we left off somewhat arbitrarily requesting 10% expansion for next year to stay ahead of the demand curve. I'm hoping we can cool off in subsequent years after that, maybe steady 5% growth or something?
[10:40:31] errand+lunch
[14:19:37] ^^ based on our previous discussions, I don't think we need to grow Elastic. It seems we're overprovisioned as-is
[14:20:58] o/
[14:30:21] gehel we should have the forecast stuff by EoD today. Can you review https://phabricator.wikimedia.org/T379079 and let us know if we're missing anything? I'm going to rework my portion to be more like Ryan's, since it seems more relevant for Finance
[14:35:03] I suppressed those morelike alerts; if I can help w/ resharding LMK (ref T379002)
[14:35:04] T379002: Consider resharding cebwiki_content - https://phabricator.wikimedia.org/T379002
[14:57:19] \o
[15:17:06] .o/
[15:18:35] o/
[17:03:09] workout, back in ~40
[17:11:37] * ebernhardson ponders how exactly things would be structured moving relforge to opensearch. I guess renaming is in order
[17:12:37] thinking profile::elasticsearch::relforge -> profile::opensearch::cirrus::relforge, profile::elasticsearch::cirrus -> profile::opensearch::cirrus::server
[17:13:03] have to figure out how much the opensearch puppet has diverged though...
[17:16:29] Did some napkin math on query volume increase for elasticsearch over the last few years. Seems like we're averaging about 3% growth
[17:16:42] https://www.irccloud.com/pastebin/ZA8LZX7D/elastic_query_volume_growth.txt
[17:52:17] dinner
[17:56:45] ebernhardson yeah, I was thinking of that too...do we want to call it something other than search, elasticsearch, or opensearch? Hmmm
[18:06:27] inflatador: well, i was thinking since it integrates with the existing profile::opensearch, it would make sense to put ours as a namespace under it
[18:07:10] or at least, i'm expecting to, since profile::opensearch::server is a copy of the old profile::elasticsearch
[18:07:20] fork
[18:08:48] ebernhardson yeah, makes sense. We should at least start there
[18:10:01] i'm mildly dreading if this needs tls certs work...haven't yet found where the o11y bits handle that
[18:11:01] it's mandatory for the transport layer apparently, so they must do something somewhere :)
[18:11:13] acme-chief has made things a lot less painful. I think we will need TLS certs, but it should be more or less plug and play...at least based on what I've done in my homelab
[18:11:40] oh nice, hopefully
[18:12:00] You can set "insecure=true" or something like that and run w/out TLS completely, but I don't think we'll have to...at least I hope not ;)
[18:13:01] since we will be rolling a restart from elastic, i'm thinking the safest will be an initial deploy with rest security disabled, then turning it on in a cluster restart later
[18:13:13] i guess i would have to test how that works, but without having looked it seems safer
[18:14:36] yeah, that makes sense. o11y had to do the same type of migration, so we should probably touch base at some point
[18:15:24] randomly guessing, we can probably start sending the credentials early and elasticsearch will ignore them. probably. maybe it can start with them on.
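(Re: the 17:16 napkin math above, a minimal sketch of how that average could be computed. The yearly volumes here are hypothetical placeholders, since the contents of the linked pastebin aren't reproduced in this log:)

```python
# Hypothetical yearly query totals -- the real numbers are in the linked
# pastebin (elastic_query_volume_growth.txt) and aren't shown here.
yearly_queries = {
    2021: 9.0e9,
    2022: 9.3e9,
    2023: 9.6e9,
    2024: 9.8e9,
}

years = sorted(yearly_queries)
start, end = yearly_queries[years[0]], yearly_queries[years[-1]]
span = years[-1] - years[0]

# Compound annual growth rate over the span.
cagr = (end / start) ** (1 / span) - 1
print(f"average yearly growth: {cagr:.1%}")  # ~2.9% with these placeholder numbers
```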
[18:21:19] opensearch guide to selecting permissions to grant: Add permissions as you encounter errors
[18:22:26] re: credentials, I thought you could do transport layer TLS without actually enabling any kind of auth or RBAC stuff. I definitely need to brush up on this stuff though
[18:24:21] oh you certainly can, they are separate. i'm just thinking about both due to cloudelastic initially
[18:25:51] Oh yeah, are you thinking of using auth roles instead of the current situation with read-only ports?
[18:26:20] Looking at https://opensearch.org/docs/latest/security/configuration/configuration/ I'd be inclined to use mTLS for auth, but that's always my choice ;P
[18:26:58] inflatador: yea, while migrating the profile for cirrus over to opensearch i noticed we don't need tlsproxy anymore of course, but cloudelastic would still need to pull it in. But there are probably better (but more work) solutions
[18:28:03] maybe it wouldn't be too much work on the cirrus/sup side, mostly config, but there are questions to answer around how to manage users
[18:29:33] Yeah, was thinking about envoy too...seems like we'd want it for TLS termination on the REST API. even though we could terminate TLS directly at opensearch, I think we'd lose some metrics if we did that?
[18:29:47] hmm, yea actually we would
[18:31:01] I wonder if the OS config is flexible enough to do TLS on the cluster port and cleartext on the REST port
[18:31:24] yea, it says somewhere that tls is required on transport and optional on rest
[18:32:50] * inflatador really needs to get the OS playground back up and running again
[18:33:00] I guess that's what you're doing on relforge
[18:33:23] i suppose i'm using relforge as an excuse to figure out how our existing puppet migrates to opensearch
[18:33:46] not all of it, but at least get started :)
[18:42:20] thanks for stepping in front of that bullet ;). I dunno how bonkers you wanna get, but based on memory used, it does look like you could spin up an OS instance to run alongside ES on both hosts
[18:43:18] maybe assigning both roles to the host at the same time, using different ports?
[18:43:37] not going to do that :P We want opensearch to take over the node and load the same state
[18:45:07] should be able to rolling-restart into opensearch; thinking to test that at least partially on relforge by doing 1 node and then waiting a bit while poking around before doing the second
[18:46:19] yeah, we also need to think about a rollback plan. Maybe take a snapshot first?
[18:47:14] hmm, yea, can't hurt
[18:47:29] not that we care about relforge, just thinking general procedure
[18:48:26] on the regular clusters, i dunno, we've never done that before. We do a single cluster at a time and assume if something was lost we could snapshot from another cluster
[18:49:39] no, we definitely haven't done it before. I guess when we do it for real, we'd be shutting off one DC
[18:50:14] And if things go wrong, we might not want to snapshot the only prod DC
[18:50:50] depending on the performance impact of snapshots...can't remember how disruptive they were
[18:53:42] i don't think it was that disruptive, and iirc it's rate limited
[18:55:34] ACK, maybe it's overkill then
[18:56:13] https://wikitech.wikimedia.org/wiki/Search/S3_Plugin_Enable#Overloading_LVS_when_creating_a_snapshot yup, rate-limited as you said
[18:57:00] once we have Ceph up and running we can cut around LVS
[18:58:29] lunch, back in ~40
[19:34:32] back
[19:57:38] appointment, back in ~90
[21:27:14] back
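(Re: the mTLS-for-auth idea and the rolling restart above, a rough sketch of a client-cert-authenticated REST call against an OpenSearch node, e.g. gating each node of a rolling restart on the cluster going green. The host, cert paths, and timeout are placeholders, not actual production config:)

```python
import requests

# Placeholder host and cert paths, not actual WMF values.
HOST = "https://relforge1009.eqiad.wmnet:9200"
CLIENT_CERT = ("/etc/ssl/localcerts/client.crt", "/etc/ssl/localcerts/client.key")
CA_BUNDLE = "/etc/ssl/certs/ca-certificates.crt"

# _cluster/health can block server-side until the requested status is
# reached; a timeout before green returns HTTP 408, which raises below.
resp = requests.get(
    f"{HOST}/_cluster/health",
    params={"wait_for_status": "green", "timeout": "300s"},
    cert=CLIENT_CERT,   # client certificate presented for mTLS auth
    verify=CA_BUNDLE,
)
resp.raise_for_status()
print(resp.json()["status"])
```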
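(And on the snapshot-as-rollback question: the rate limiting mentioned at 18:53 is a repository-level setting, so the throttle is fixed when the repository is registered. A sketch with a hypothetical repository name and bucket; the real setup is the S3 plugin described on the wikitech page linked above:)

```python
import requests

HOST = "https://relforge1009.eqiad.wmnet:9200"  # placeholder, as in the previous sketch
# (cert/verify arguments as in the previous sketch, omitted here for brevity)

# max_snapshot_bytes_per_sec is a standard per-node repository setting
# (40mb/s by default), which is what keeps snapshots relatively non-disruptive.
body = {
    "type": "s3",
    "settings": {
        "bucket": "example-rollback-bucket",  # hypothetical bucket name
        "max_snapshot_bytes_per_sec": "40mb",
        "max_restore_bytes_per_sec": "40mb",
    },
}

resp = requests.put(f"{HOST}/_snapshot/rollback_repo", json=body)
resp.raise_for_status()
```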