[14:16:03] I'm trying to run local puppet tests with docker, mimicking as close as possible what CI does, has anyone been there before and/or been successful?
[14:33:17] local puppet compileR?
[14:33:50] i tried this when i joined https://gerrit.wikimedia.org/r/c/operations/puppet/+/477583
[14:34:05] if you download the facts mostly works
[14:35:32] not the compiler no, the rake tests
[14:35:49] ack
[14:39:06] <_joe_> godog, cdanis: I will remove server_name from the labels in mtail, but it's a pretty important piece of data
[14:40:56] _joe_: sure, and it should be aggregated somewhere, but prometheus might not be the place
[14:41:06] i think if you have a small handful of server_names you know you care about, put them in there
[14:41:52] <_joe_> cdanis: Ideally I would group them by db section
[14:42:17] <_joe_> but yeah in the future when we use an mtail version that's less archaic I can think about it
[14:45:47] _joe_: SGTM
[14:46:31] don't know if it could be useful as a rough grouping by "language" + "project" or how easy/useful it is
[14:47:43] db section is sensible
[14:48:05] but hard to do as-is
[15:26:36] <_joe_> yeah let's not
[15:27:48] _joe_: do you have feelings on https://gerrit.wikimedia.org/r/c/operations/puppet/+/525560 btw?
[15:28:09] <_joe_> cdanis: feelings on a CR?
[15:28:36] yes
[15:28:47] <_joe_> :P
[15:28:55] <_joe_> well the etcd client is so broken and useless
[15:29:07] <_joe_> why do you want to use it?
[15:29:29] directory listings mostly
[15:29:37] <_joe_> ok
[15:29:58] <_joe_> I'd rather write a curl wrapper tbh, but that's ok as well
[15:30:17] <_joe_> the advantage is it will be equally bad once we move to etcd3
[15:30:26] the client?
yeah
[15:31:03] godog: bundle install --without system_tests development --path=${BUNDLE_PATH:-.bundle} && bundle exec rake
[15:31:09] no docker up to now
[15:34:52] akosiaris: nice, I tried before with bundler on a buster system but with average luck/results, in https://phabricator.wikimedia.org/T208566
[15:35:06] which meant using ruby 2.5, another whole rabbit hole
[15:37:08] tried with rbenv on buster and thus ruby 2.3.3, which also didn't work after a couple of tries, another rabbit hole that I'm not feeling like falling into
[15:37:24] hence docker, which seems the sanest option ATM
[15:37:45] godog: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/524525/
[15:38:01] met the same issue. I did not know about the task but I did a PCC and looked fine so I merged
[15:38:36] oh nice, I'm trying again
[15:40:58] heh, the puppetmasters were out of sync when I puppet-merged
[15:41:09] on puppetmaster1001: ------------------------------
[15:41:12] Chris Danis: install etcd-client on cumin hosts (98f6c17164)
[15:41:14] Merge these changes? (yes/no)? yes
[15:41:33] I got three of these in the output for other puppetmasters though: WARNING: Revision range includes commits from multiple committers!
[15:42:18] hmm
[15:42:39] * jijiki turns on the bat-signal
[15:42:44] probably related to bblack's failed puppet-merges he mentioned in #-operations a few minutes ago
[15:50:23] akosiaris: thanks, looks like I can run rake as is now, still not getting all spec tasks listed but looks like a different problem
[15:50:47] godog: blame the rake task generator
[15:50:53] anyway I don't know if it is anything to be concerned about but I wanted to mention it somewhere it wouldn't get lost
[15:51:10] tl;dr is that if you edit a file in a module then spec tasks get enabled
[15:51:20] weird tricks to speed up the build
[15:51:30] archived my puppet-merge output in https://phabricator.wikimedia.org/P8806
[15:52:07] am I missing something dumb about icinga config changes taking effect, etc? :
[15:52:36] I merged this up: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/525566/3/modules/profile/manifests/bird/anycast_monitoring.pp
[15:52:56] it looks like that was what I merged on other puppetmasters; anycast_monitoring.pp is the other modified file
[15:53:12] I suspect there was just a race condition here between a few other merges?
[15:53:12] and it's been agented to icinga1001, and the "recdns.anycast.wmnet" hostname no longer shows up in the filesystem (yes there was a race, but either way we seem past it)
[15:53:41] root@icinga1001:/etc/nagios# grep anycast.wmnet -r .
[15:53:41] root@icinga1001:/etc/nagios# grep 10.3.0.1 -r .
[15:53:41] ./nagios_service.cfg: host_name 10.3.0.1
[15:53:42] ...
[15:53:52] and I know icinga's been reloaded by puppet since those changes were on disk
[15:54:16] yet when I try to confirm the check execution looks right manually, via stracing the executions...
[15:54:35] root@icinga1001:/etc/nagios# strace -ff -p 2610 -e execve -s 1024 2>&1|egrep --line-buffered 'anycast|10\.3\.0\.1'
[15:54:44] [pid 7892] execve("/usr/lib/nagios/plugins/check_dns", ["/usr/lib/nagios/plugins/check_dns", "-H", "www.wikipedia.org", "-s", "recdns.anycast.wmnet"].
[15:55:01] did the icinga restart fail?
[15:55:05] it's still executing the check with the old value for the -s argument, which no longer exists in config
[15:55:16] does icinga restart fail the puppet run?
[15:55:33] mm, I think if the icinga reload fails it will fail the puppet run
[15:55:40] I've been assuming if I see puppet agent runs going through green after changes, it should be reloaded
[15:55:52] (it's definitely not truly-restarted though, the PID hasn't changed since before I started all of this)
[15:56:27] e.g. recently in its puppet.log
[15:56:30] Jul 25 15:41:11 icinga1001 puppet-agent[46094]: (/Stage[main]/Icinga/Systemd::Service[icinga]/Service[icinga]) Triggered 'refresh' from 2 events
[15:56:34] from someone else's icinga changes
[15:56:49] (after mine)
[15:57:30] but the process has been up like 8 days, so surely others would've noticed if reload no longer really reloads by now
[15:59:07] ehm
[15:59:10] ehhhmmmm
[15:59:33] Jul 25 14:32:44 icinga1001 icinga[66682]: errors in config! ... failed!
[15:59:35] Jul 25 14:56:21 icinga1001 icinga[70243]: errors in config! ... failed!
[15:59:37] Jul 25 15:32:42 icinga1001 icinga[183404]: errors in config! ... failed!
[15:59:39] Jul 25 15:41:11 icinga1001 icinga[107319]: errors in config! ... failed!
[15:59:41] it has been failing to reload for a while
[15:59:43] I guess puppet runs don't fail on reload failures then heh
[15:59:47] Jul 25 15:41:11 icinga1001 icinga[107319]: Error: Could not find any contact matching 'accraze' (config file '/etc/icinga/objects/contactgroups.cfg', starting on line 62)
[16:00:09] broken for two hours
[16:00:20] and because of various unfortunate kludges there is no monitoring of this
[16:00:32] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/525143/
[16:00:39] (is the breaking change)
[16:00:48] I suspect it is missing in the private repo
[16:01:10] I am going to revert that for now.
[16:01:26] thanks!
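
Editorial sketch: the failed reloads above were only found by reading the log by hand. Assuming the syslog line format shown in the pasted output, something like the following could pull the failure timestamps out of a log. The `failed_reloads` helper and the regex are hypothetical illustrations, not part of any existing Icinga or Puppet tooling.

```python
import re
from typing import Iterable, List

# Assumed format, taken from the lines pasted in the chat:
#   "Jul 25 14:32:44 icinga1001 icinga[66682]: errors in config! ... failed!"
FAILED_RELOAD = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s[\d:]{8})\s\S+\sicinga\[\d+\]:\serrors in config!"
)


def failed_reloads(lines: Iterable[str]) -> List[str]:
    """Return the timestamps of lines that report a failed config check."""
    found: List[str] = []
    for line in lines:
        match = FAILED_RELOAD.match(line)
        if match:
            found.append(match.group("when"))
    return found


# Sample lines copied from the chat; the puppet-agent line must not match.
sample = [
    "Jul 25 14:32:44 icinga1001 icinga[66682]: errors in config! ... failed!",
    "Jul 25 15:41:11 icinga1001 puppet-agent[46094]: "
    "(/Stage[main]/Icinga/Systemd::Service[icinga]/Service[icinga]) "
    "Triggered 'refresh' from 2 events",
    "Jul 25 15:41:11 icinga1001 icinga[107319]: errors in config! ... failed!",
]

print(failed_reloads(sample))
```

This only matches the summary line, not the detailed "Error: Could not find any contact" line, so each failed reload is counted once.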
[16:02:17] ah dammit, my mistake
[16:02:25] I see the alert for icinga correctness fail btw
[16:02:26] * akosiaris and I actually thought about it and double checked
[16:02:42] and then I missed the private repo
[16:02:47] Jul 25 14:32:44 icinga1001 icinga[66682]: Error: Could not find any contact matching 'accraze' (config file '/etc/icinga/objects/contactgroups.cfg', starting on line 62)
[16:02:49] yeah I didn't look for an alert, I just kinda assumed if a puppet-triggered icinga reload failed, puppet would fail
[16:02:49] oops
[16:02:52] bad assumption!
[16:03:04] 12:02:15 Line 7: Line should be <=100 characters
[16:03:06] 😡
[16:03:19] cdanis: lemme fix the accraze thing
[16:03:30] akosiaris: ok if you are doing it the proper way, I'll hold off
[16:03:32] :)
[16:03:51] I figured making changes in the private repo would require contact info I do not have
[16:04:15] interesting, the 100 chars commitmsg check
[16:04:28] yeah it's because I pasted the whole error
[16:04:40] (it's about the commit message)
[16:04:49] oh you said that, I just can't read
[16:04:52] I generally manually limit myself to 50 chars on the title line and wrap at 66 elsewhere in the commitmsg if possible, because of some advice I read somewhere long-forgotten
[16:05:03] I believe "50 chars on the title line" is enforced by the check as well
[16:05:12] (about all the ways people read/view/parse/display commit msg stuff and where/how it will wrap badly or not fit)
[16:07:12] volans: doesn't meta-monitoring usually catch this ?
[16:07:40] I thought it usually manifested in a weird way, like the secondary icinga host exporting nonsense data, but that it still manifested
[16:07:43] cdanis: sorry was doing other stuff, TL;DR?
[16:07:55] volans: icinga config reloads failing for the past hour and a half and we did not notice
[16:08:13] there is an icinga alert for that if the config is broken
[16:08:34] but the systemd-level refresh for puppet always succeeds?
[16:08:40] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=Check+correctness+of+the+icinga+configuration
[16:08:51] ah yes
[16:08:53] okay then
[16:09:24] now seems fixed on 1001
[16:09:31] yeah akosiaris was fixing the issue
[16:09:33] I run 'sudo icinga -v /etc/icinga/icinga.cfg'
[16:09:38] it uses a traditional initscript, and does have a check_config function that looks like it should've failed
[16:09:38] when this happens
[16:12:43] ok so I guess we've reduced this to the previously unsolved problem of "we don't notice criticals"
[16:13:45] well that and "why doesn't puppet agent fail on the config reload for icinga.service?"
[16:14:14] systemd uses /etc/init.d/icinga , and the reload() there uses check_config which sure seems like it should cause a bad exit code
[16:14:43] it should be fixed now
[16:15:14] re: we don't notice criticals: -ops chan is very spammy, and I doubt most of us are habitually staring at a refresh of the criticals list in the web UI either
[16:16:07] maybe we could add to the spam and help solve the problem, by adding some recurring colored IRC output for outstanding criticals older than X?
[16:16:55] e.g. a big red hard-to-miss message that icinga-wm repeats every 5 minutes saying "There are 23 outstanding unacked/undowntimed criticals older than 60s", or something
[16:17:00] I'm wondering if I broke the bad exit code with my nsca kludge
[16:17:22] does anyone know what systemd will do when there are multiple ExecReload commands and only some of them fail?
[16:17:42] by default I think it fails if any fail
[16:18:37] the docs say:
[16:18:38] If more than one command is specified, the commands are invoked sequentially in the order they appear in the unit file. If one of the commands fails (and is not prefixed with "-"), other lines are not executed, and the unit is considered failed.
[16:18:43] there's the @-+! and !! thing, right ?
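
Editorial sketch: the rule quoted from the systemd docs can be modeled in a few lines of plain Python. This is a toy model of the quoted sentence only, not systemd's implementation. `run_exec_lines`, `fake_runner` and the command labels are hypothetical names chosen for illustration, and only the "-" prefix is handled.

```python
from typing import Callable, List, Tuple


def run_exec_lines(
    lines: List[str], runner: Callable[[str], int]
) -> Tuple[bool, List[str]]:
    """Run commands in order; stop at the first failure unless the failing
    command is prefixed with "-".

    Returns (unit_failed, commands_actually_run).
    """
    executed: List[str] = []
    for line in lines:
        ignore_failure = line.startswith("-")
        command = line[1:] if ignore_failure else line
        executed.append(command)
        status = runner(command)
        if status != 0 and not ignore_failure:
            # Remaining lines are skipped and the unit counts as failed.
            return True, executed
    return False, executed


def fake_runner(command: str) -> int:
    # Hypothetical stand-in for running a process: only the reload step
    # exits non-zero here.
    return 1 if command == "reload-icinga" else 0


# Placeholder labels for three sequential reload steps.
reload_lines = ["pause-nsca", "reload-icinga", "resume-nsca"]

print(run_exec_lines(reload_lines, fake_runner))
print(run_exec_lines(["pause-nsca", "-reload-icinga", "resume-nsca"], fake_runner))
```

In this model, a failing middle step without the "-" prefix marks the unit failed and skips the last step, while the prefixed variant runs all three and reports success.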
[16:18:50] (under ExecStart, which ExecReload seems to inherit language from)
[16:19:16] cdanis: but also, at least on icinga1001, systemd says it's using /etc/init.d/icinga, not any unit file that might contain an ExecReload?
[16:19:16] of which only the - seems to be pertinent to this
[16:19:30] so yeah, prefix it with - to ignore the bad return code I guess
[16:20:01] bblack: that's interesting...
[16:20:04] Process: 57047 ExecReload=/usr/bin/killall -CONT nsca (code=exited, status=0/SUCCESS)
[16:20:06] Process: 54809 ExecReload=/etc/init.d/icinga reload (code=exited, status=0/SUCCESS)
[16:20:08] Process: 54791 ExecReload=/usr/bin/killall -STOP nsca (code=exited, status=0/SUCCESS)
[16:20:39] well also
[16:21:01] the systemd::service definition references a systemd_template, but has the reload set to use the initscript explicitly, at the puppet level
[16:21:30] [above in modules/icinga/manifests/init.pp]
[16:21:52] root@icinga1001:/etc/nagios# systemctl status icinga.service
[16:21:52] ● icinga.service - LSB: icinga host/service/network monitoring and management system Loaded: loaded (/etc/init.d/icinga; static; vendor preset: enabled)
[16:22:04] ^ regardless, systemd says the initscript is defining the service
[16:27:24] unrelated to the above, but related to the check_dns stuff I'm staring at...
[16:28:11] icinga executes check_dns a ton (aside from the obvious dns service checks, we also have the base module defining a check_dns to check up on the resolution of every single .mgmt. hostname)
[16:28:34] and the check_dns binary executes nslookup as its engine :P
[16:28:49] (nslookup is ancient and awful, and why is a program executing another program, etc?)
[16:29:04] e.g.
[16:29:07] [pid 89728] execve("/usr/lib/nagios/plugins/check_dns", ["/usr/lib/nagios/plugins/check_dns", "-H", "www.wikipedia.org", "-s", "10.3.0.1"], [/* 5 vars */]
[16:29:10] [pid 89731] execve("/usr/bin/nslookup", ["/usr/bin/nslookup", "-sil", "www.wikipedia.org", "10.3.0.1"], [/* 1 var */]) = 0
[16:29:48] we should, somewhere on our back burner, implement a better check_dns that uses some library to do it instead of shelling out to nslookup
[16:30:13] (or alternatively, remove the .mgmt. check_dns if they're not good value propositions, killing most check_dns executions)
[16:31:57] btw there's like 281 unackd WARNINGs in icinga right now
[16:32:10] unsurprising :/
[16:32:12] oh it's mass puppet disables on the mw fleet, nevermind
[16:33:44] bblack: I can postpone what I am doing if it helps you
[16:33:50] I have not merged yet
[16:34:01] jijiki: no it's fine, I was just noticing the large count of warnings in general
[16:34:25] alright, it will take a while though because I will gradually enable puppet
[16:34:43] <_joe_> cdanis: I see that we have mtail 3.x in production now
[16:34:46] <_joe_> from backports
[16:34:49] and now that I'm back out of my monitoring sidetrack, time to turn on some LVS + anycast recdns.
[16:34:52] <_joe_> do you know anything about it?
[16:35:51] <_joe_> that differs from the mtail we're using to run tests in CI
[16:36:00] <_joe_> as no one updated the puppet ci image
[16:36:09] <_joe_> else, tests wouldn't work
[16:36:21] sigh
[16:36:24] I do not know anything about this _joe_ sorry
[16:36:28] <_joe_> anyways, a problem for tomorrow morning
[16:36:42] <_joe_> cdanis: I'll sync with godog on that
[16:36:57] looks like it was j.bond https://tools.wmflabs.org/sal/production?p=0&q=mtail&d=
[16:41:01] <_joe_> yeah https://phabricator.wikimedia.org/T225604 has the gory details
[16:41:05] anyway bblack I'm not sure what's up with the systemd / init interaction here and I have no time to dig into it today
[16:41:13] <_joe_> basically there is a huge tech debt that needs to be paid now
[16:41:34] oh right no logfd and no json support?
[16:41:39] now I remember
[16:42:16] <_joe_> basically we're using 4 different versions right now, including 3 different rc for 3.0.0
[16:42:40] <_joe_> and behaviour varies so much between rcs that rc5 works with varnishmtail and rc24 doesn't
[16:42:44] 🤦
[16:43:00] <_joe_> also we're blind with tests, basically
[16:43:20] <_joe_> so I think I have a proposal for those, but it needs someone to work on this actively
[16:44:57] do you have time to file a task?