[08:26:51] <_joe_> moritzm: having to reinstall the laptop with ubuntu (only distro, apart from RHEL, officially supported by lenovo...) I decided to start fresh and use wmf-sre-laptop, that resulted in a few patches :P [08:27:28] <_joe_> in particular - I've moved git-review to 'Recommends:' [08:27:54] <_joe_> I hate git review and I have an alias in git to push to the refs/for/master instead [08:28:35] <_joe_> there's also one patch to add a functionality to auto-update the ssh config whenever one updates the package [08:30:26] cool, will check these late [09:41:40] thank you, I wish we could update the 'how to use git' docs to allow one to easier see the path for git-review or plain git instead of plain git just not bing covered [09:41:52] as a non git review user [09:43:43] apergos: use gitlab :-P [09:44:09] wow the trolls are awake early today I see, and in good form too! ;-D [09:44:40] :) [16:46:29] db1077 is running out of space, but this is a testing host, so don't worry too much (alerts disabled) [17:11:54] just purged its binlogs [17:19:54] shdubsh: can you link me to the ECS labels field? [17:20:00] looking for it in https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html [17:21:29] AH FOUND IT [17:21:30] in base [17:23:07] ottomata: Yes, they nest their documentation which can make it a bit harder to find fields. Because we need the option to extend ECS, we host docs here which (IMO) are a bit easier to search: https://doc.wikimedia.org/ecs/ [17:24:14] cool, thanks, just commented on https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/639529 to use labels [17:35:02] someone working while they are on vacation [17:35:11] I'm lookin at you, maro stegui [17:50:45] effie: multi dc meeting with SRE/CPT/perf settled last month on using dc-local redis for ChronologyProtector. And CPT are also leaning toward Redis for CentralAuth tokens long term. Is the new task meant to explore a different direction for those? [17:51:26] for our redis needs that do not need to be multi-dc [17:51:35] we have the rdb* hosts per datacenter [17:52:23] this task is about the redis instances living within the memcached clusted [17:53:49] Right. So we want to merge those clusters. Ok [17:53:56] not merge [17:54:14] we want to eliminate one of them [17:54:19] and keep the other one [17:54:46] now, as we mentioned in the meeting we had about ChronologyProtector [17:54:47] Right. That's what I mean [17:55:02] if you do to not need any specific redis features [17:55:33] or there are serious reasons, it is preferable to use memcached, we already have a large cluster, mcrouter has been working well for us [17:55:57] and it is fault tolerant, on our rdb cluster, if one rdb host goes down [17:56:12] we have to manually elevate a secondary server to a master [17:56:30] and there will be some data loss in the meantime [17:56:36] it is not fault tolerant [17:58:30] Afaik very little is specific to anything in that area. Everything is possible :) [17:58:46] what do you mean ? [17:59:42] If given time and priority we can move almost most any key value store to something else. Our uses are generally simple [18:00:11] then, if this has not been implemented yet [18:00:43] you should consider using memcached [18:01:10] No, it is implemented. That's not what I meant [18:01:16] ah [18:01:18] sorry [18:01:20] :) [18:01:28] I'm saying it's not special, and that nothing else is special either [18:01:57] But also it's good enough and we need to paying attention to other fires [18:02:37] But moving around etc seems fine. I can ignore this mostly then. [18:38:42] Does anyone have a good handle on puppet class parameters? I've set some parameters for a specific class but it looks like the defaults are still getting taken (context incoming...) [18:39:01] At a high level, `icinga::monitor::elasticsearch::base_checks` is ultimately where the variables I care about end up getting used for real. `icinga::monitor::elasticsearch::base_checks` sets defaults that I don't want for my use case [18:39:22] so I've initialized the variables to the value I want in `profile::icinga`, which gets passed through to `icinga::monitor::elasticsearch::cirrus_cluster_checks`, which should then call `icinga::monitor::elasticsearch::base_checks` (code links incoming...) [18:39:51] My recent patch added these variables: https://github.com/wikimedia/puppet/blob/bfb5060cf85ef281d45d0dffd61b2f1de54b8db7/modules/profile/manifests/icinga.pp#L29-L32 which should get passed through from here: https://github.com/wikimedia/puppet/blob/bfb5060cf85ef281d45d0dffd61b2f1de54b8db7/modules/profile/manifests/icinga.pp#L72-L78 [18:40:32] Passed through to https://github.com/wikimedia/puppet/blob/bfb5060cf85ef281d45d0dffd61b2f1de54b8db7/modules/icinga/manifests/monitor/elasticsearch/cirrus_cluster_checks.pp#L2-L6 which feeds into here https://github.com/wikimedia/puppet/blob/bfb5060cf85ef281d45d0dffd61b2f1de54b8db7/modules/icinga/manifests/monitor/elasticsearch/cirrus_cluster_checks.pp#L13-L22 [18:42:26] So ultimately I'm trying to get those variables passed through to `icinga::monitor::elasticsearch::base_checks`, but the behavior I'm seeing makes me think it's still using these defaults instead: https://github.com/wikimedia/puppet/blob/bfb5060cf85ef281d45d0dffd61b2f1de54b8db7/modules/icinga/manifests/monitor/elasticsearch/base_checks.pp#L7-L8 [18:43:26] So I guess to close out my line of questioning, given how I defined the variables in the initialization of `profile::icinga`, is my understanding correct that those variables end up being passed into `icinga::monitor::elasticsearch::cirrus_cluster_checks` and thus should be passed through to `icinga::monitor::elasticsearch::base_checks`? [20:37:28] ryankemper: what is your goal wrt those parameters? Redefine defaults? [20:59:32] shdubsh: So, basically I want to pass in my own values for a specific codepath but not mutate the defaults incase they're relied upon elsewhere [20:59:58] Basically `icinga::monitor::elasticsearch::base_check` is used generically by anything using elasticsearch (not just cirrus), so logstash etc in addition to cirrus [21:01:20] cirrus is broken up into two codepaths, one that is an alert for cloudelastic and another is an alert that lives on the two `alert` hosts and checks `cirrus->codfw` and `cirrus->eqiad` [21:02:23] The former, cloudelastic I've already got working with this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/634391 but the alerts for codfw and eqiad were still going [21:03:14] So to fix codfw and eqiad I made this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/636811, and that patch didn't fix the issue for codfw/eqiad which led to my above line of questioning [21:04:15] ah, I think Hiera may be the way to go here. Do you know the roles which you want to apply the new values? [21:10:07] Hmm, if we took a role approach it'd probably be whatever role includes `profile::icinga`, because `profile::elasticsearch::cirrus` isn't actually used by the icinga instances running the checks for eqiad/codfw cirrus [21:10:39] In this case though we have the file `modules/icinga/manifests/monitor/elasticsearch/cirrus_cluster_checks.pp` which handles the part I care about and nothing else [21:11:05] But I believe we don't want to avoid defining parameters in classes directly and want to do it at the profile level (IIRC) [21:11:37] Which is why with the current approach I was/am trying to change `profile::icinga` to have the values I want (which may not be the right way to do it to be clear) [21:12:00] Basically the flow looks like `profile::icinga` -> `icinga::monitor::elasticsearch::cirrus_cluster_checks` -> `icinga::monitor::elasticsearch::base_checks` [21:16:49] s/don't want to avoid/want to avoid [21:24:11] I think I follow. Passing it through like that should be fine. So you're seeing (for example) $shard_size_warning is still 50 when you expect it to be 80 from the default_value from profile::icinga? [21:36:18] shdubsh: Exactly [21:37:35] So I'm wondering if it's just a simple misunderstanding of puppet and the way I'm overriding it is wrong...like I'm imagining the earliest a variable gets set is the value it carries [21:40:24] I think my next step is to look at `monitoring::service`, find where the nagios service lives and verify which values I'm seeing [21:41:20] Since this is on the codepath that *doesn't* use nrpe: https://github.com/wikimedia/puppet/blob/5cfe03af70c6d0c4e1d413e394569df568e7fd0b/modules/icinga/manifests/monitor/elasticsearch/base_checks.pp#L35-L40 [21:42:26] ryankemper: that's where I was going to go next. Which hosts do you suspect aren't getting the right value? [21:43:47] AFAICT, check_elasticsearch_shard_size is warning 80 and crit 100 in all cases [21:53:19] shdubsh: It was warning on `search.svc.eqiad.wmnet` (which actually runs on the icinga servers) previously but I just checked back now and it's actually in an `Unknown` state: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search.svc.eqiad.wmnet&service=ElasticSearch+shard+size+check+-+9243 [21:53:43] Totally should have re-checked first :x it's weird though because I forced a re-check last friday and it stayed alerting [21:54:13] So now I need to figure out how `UNKNOWN - int() argument must be a string, a bytes-like object or a number, not 'NoneType'` is happening [21:54:59] shdubsh: Okay well I think I can actually figure it out from here, since now something's properly broken :P [21:58:23] right on :)