[03:44:22] jbond42: it looks like compiler1003 runs with some kind of delay/lag regarding changes and as a result it shows DIFFs instead of NOOPs due to old commits
[06:41:39] <_joe_> vgutierrez: console output or GTFO
[06:41:43] <_joe_> :P
[06:43:37] morning to you too joe
[06:48:12] <_joe_> seriously though, can I see the console output of that compiler run?
[06:48:33] <_joe_> if the code doesn't get updated, that should show an error somewhere
[07:07:58] so the pcc run is this one: https://puppet-compiler.wmflabs.org/compiler1003/19186/
[07:08:14] the expected output is a NOOP on cp4027-4031 and DIFF on cp4032
[07:08:41] we get DIFF in the first 5 nodes cause it's taking into account r.lazarus ssh key change
[07:08:55] and that's a commit from last night
[07:10:37] you're right as usual
[07:10:40] [ 2019-10-31T03:42:46 ] INFO: Refreshing the common repos from upstream if needed
[07:10:40] error: insufficient permission for adding an object to repository database .git/objects
[07:15:53] <_joe_> vgutierrez: add another beer to the tap for the delirium cafe
[07:15:59] <_joe_> :P
[07:16:16] dunno if I'd be functional enough to be at delirium cafe on Friday night
[07:16:21] <_joe_> I know that was a low blow
[07:16:29] <_joe_> oh you're doing the red-eye thing?
[09:08:42] looking no vgutierrez
[09:09:24] looking *now* vg.utierrez
[09:10:02] <3
[09:15:59] vgutierrez: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/19187/console
[09:18:33] sure now is a NOOP everywhere cause that change has been merged a few hours ago
[09:18:46] but yeah, it looks like it's refreshing properly :)
[09:18:57] thx <3
[09:19:21] cool and np
[09:19:26] I hope you'll join _joe_ for beers at the delirium cafe
[12:58:06] _joe_: with puppet run_ci_locally.sh -- is it your experience that 'profile' spec tests always fail?
[12:58:31] <_joe_> cdanis: no
[12:59:05] <_joe_> cdanis: I think that's a regression, care to paste the errors?
[13:00:23] ohhhh
[13:00:29] nginx is still a submodule isn't it :|
[13:05:20] okay nevermind, it all works after git submodule update
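A minimal sketch of that fix, assuming the nginx submodule lives at modules/nginx inside a local operations/puppet checkout (the submodule path, checkout directory, and script location are assumptions; run_ci_locally.sh is the wrapper named above):

    # in a local clone of operations/puppet (directory name is illustrative)
    cd operations-puppet
    # initialise any submodules that were never checked out locally;
    # an empty modules/nginx is enough to break the 'profile' spec tests
    git submodule update --init --recursive
    # re-run the local CI wrapper mentioned above (path within the repo may differ)
    ./run_ci_locally.sh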
[13:12:16] oh right
[13:12:22] did you update the nginx submodule?
[13:12:27] (or someone did?)
[13:12:34] I think I never had it checked out in the first place :)
[13:12:45] I have a forgotten pair of outstanding changes to unsubmodule it, but they'll need fixups if anything has changed there recently
[13:13:04] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/521323/
[13:15:16] refreshing those
[13:15:29] (and no, nothing has changed since ~1y ago in the submodule, still)
[13:18:59] jenkins still gives a -1 to the first of the two commits, but passes on the second. I assume this is just an artifact of this method of trying to avoid breakage on puppet-merge, etc with the temporary move to "environments"
[13:20:18] I think we discussed this all before back in august and someone ok'd it informally, but then I got distracted and never merged it, or something
[13:40:55] bblack: i happened to create a patch set for that today as well https://gerrit.wikimedia.org/r/c/operations/puppet/+/547500
[13:41:27] and yes the -1 is because the fixtures.yaml file points to nginx in modules and not environment/production/modules
[13:41:30] https://gerrit.wikimedia.org/r/c/operations/puppet/+/547500
[13:43:59] the second one's message is still showing a different module from before :)
[13:44:54] FYI there is now a read-only git-based interface for puppet/hiera config in CloudVPS https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/master
[13:44:59] cc ema
[13:45:40] oh also still has .gitconfig/.gitreview
[13:45:50] err .gitignore/.gitreview
[13:47:19] happy to abandon mine and go with yours
[13:50:52] either way. mine lacks the ticket ref too, so something needs updating
[13:51:29] fixing that
[13:53:07] heh bug doesn't care about leading whitespace, in my log copy
[13:53:22] there we go
[13:55:08] i have +1ed
[13:55:51] going...
[13:58:07] \o/
[13:58:43] seems to have come through my "git pull" locally ok too
[13:58:57] I figured there would still be a minor mess there even if puppet-merge's thing worked out ok, guess not!
[13:59:42] yay :D
[14:00:09] I have a grafana/prometheus issue- https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1119&var-port=9104 only loads some of the metrics, as if one of the prometheus servers was failing
[14:00:52] ^do you see the same if you hit grafana refresh button?
[14:02:35] doing the followup stuff in the nginx repo itself (replace contents with a readme about the move, etc)
[14:03:17] whoa
[14:03:24] I got "upstream connect error or disconnect/reset before headers. reset reason: connection termination" when loading that
[14:03:31] refreshing fixed, but still, odd
[14:04:11] it looks like one of the prometheis is missing time series after 11:53?
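The curl check a few lines below ("works on p1004, not on 3") is what narrows this down; a minimal sketch of that kind of comparison, assuming the two eqiad backends are prometheus1003 and prometheus1004 and that they are reachable over ssh (hostnames inferred from the log):

    # fetch the mysqld exporter from each prometheus backend to see which one
    # cannot scrape db1119; a 200 from one and a timeout from the other points
    # at a firewall/connectivity problem rather than at the exporter itself
    for p in prometheus1003 prometheus1004; do
        echo "== $p =="
        ssh "$p.eqiad.wmnet" \
            "curl -s --max-time 5 -o /dev/null -w '%{http_code}\n' http://db1119.eqiad.wmnet:9104/metrics"
    done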
[14:04:31] ok, so not only me, I will investigate
[14:06:29] same, refresh clears up the issue
[14:12:35] curl db1119.eqiad.wmnet:9104/metrics works on p1004, not on 3, but there is connectivity- maybe some firewall issue
[14:13:43] there are missing or bad rules on the firewall after the restart
[14:18:31] db1119 is missing a firewall rule to allow prometheus1003 to hit dest port 9104
[14:18:39] it has it for p1004
[14:18:46] that's odd
[14:19:05] https://phabricator.wikimedia.org/P9510
[14:19:15] indeed
[14:19:27] potential reasons: puppet commit bug (good)
[14:19:35] puppet master corruption (bad)
[14:20:07] I will depool the host again before doing any firewall or puppet operation
[14:22:39] re: unsubmoduling - I already pushed this to the now-dead nginx submodule repo: https://gerrit.wikimedia.org/r/#/c/operations/puppet/nginx/+/547536/
[14:22:59] and now I have these two posted up: https://gerrit.wikimedia.org/r/#/c/integration/config/+/547539/ + https://gerrit.wikimedia.org/r/#/c/integration/config/+/547540/
[14:23:36] the first seems right based on previous examples, the latter seems right to me, but could break something (some of the other submodules that died a while back were never cleaned up fully in integration/config, but I'm pretty sure now they're all dead and we can do all of those things)
[14:24:03] <_joe_> bblack: I can take a look if you want
[14:24:54] if you have time, it's just cleanup (but who knows, it might slightly speed up jenkins if it has one less git command to run and one less network connection to make, etc)
[14:27:47] /etc/ferm/conf.d/10_prometheus-mysqld-exporter seems ok, so not a puppet issue
[14:29:43] reloading ferm created the right rule
[14:31:10] maybe a 1 in a million case where ferm init fails partially because of some network/dns issue
[14:31:33] a bit spooky
[14:31:41] yeah
[14:31:46] because it is a subtle failure
[14:31:50] and i wonder if we should have an alert on prometheus's rate of scrape failures
[14:32:03] we have some dashboards
[14:32:11] there are some subtle things there to get right
[14:32:12] but they are not too verbose
[14:33:49] was it related to a reboot, or just random at runtime (the ferm issue)?
[14:33:59] I've seen some racing at reboot time on some hosts, but never randomly while up
[14:34:01] it was on reboot
[14:34:12] yeah, but when that happened, ferm failed to start
[14:34:24] this time, ferm started, but missed a single rule
[14:34:24] yeah I think there's some dependency issues there (maybe not so much puppet deps, as system-level deps) about ordering of boot stuff
[14:34:38] yes, not starting is not a problem because we have alerting for that
[14:34:46] either that or it's just so many dns lookups all at once and the software doesn't retry a failure? dunno
[14:34:47] a partial load is a bit devilish
[14:34:57] yeah, the rule does a dns resolution
[14:35:14] so *maybe* the dns failed and silently skipped over that ip
[14:36:13] well I've found problem #1. /usr/sbin/ferm is a perl script :)
[14:36:30] "Oct 31 12:45:34 db1119 systemd[1]: Starting ferm firewall configuration..." shows no errors
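A minimal sketch of the checks and fix described above, run on db1119 (the ferm config path is from the log; the exact reload invocation is an assumption):

    # the generated ferm config itself looked fine...
    grep -n prometheus /etc/ferm/conf.d/10_prometheus-mysqld-exporter

    # ...but the loaded ruleset was missing the prometheus1003 -> 9104 entry
    sudo iptables -S | grep 9104

    # re-evaluating the config regenerated the missing rule
    sudo systemctl reload ferm      # or: sudo service ferm reload
    sudo iptables -S | grep 9104    # both prometheus hosts should show up now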
[14:37:01] yeah I'm looking into how @resolve works
[14:37:24] don't worry too much, this could be one of those things that never happen again
[14:37:37] at least I have never seen it
[14:37:52] lol found the source of the fact that there's no error reported
[14:37:56] my $query = $resolver->search($hostname, $type);
[14:37:56] unless ($query) {
[14:37:56] if (!$resolver->errorstring ||
[14:37:56] $resolver->errorstring eq 'NOERROR' ||
[14:37:56] $resolver->errorstring eq 'NXDOMAIN') {
[14:37:59] # skip NOERROR/NXDOMAINs, i.e. don't error out but return nothing
[14:38:02] next;
[14:38:04] } else {
[14:38:07] error("DNS query for '$hostname' failed: " . $resolver->errorstring);
[14:38:10] so a todo
[14:38:12] }
[14:38:15] nice that there's an error case that just skips along happily :P
[14:38:29] like, if I put that hostname in my firewall config, I probably care about all errors to resolve it :P
[14:38:31] improve error logging of ferm
[14:38:41] indeed
[14:38:57] to be fair it is not that huge of an issue
[14:39:09] if it is an important service, it will have monitoring
[14:39:30] if it was, like this one, a non-critical one, well, it is non-critical
[14:39:56] please don't let me bikeshed you
[14:40:11] I may create a ticket, however
[14:41:58] yeah please do
[14:42:11] we should at least add some logging of those cases, so we can see an error output next time and take it from there
[14:42:46] FWIW, the default settings of Net::DNS::Resolver, as best I understand them, still should have worked right anyways though (by default they have long timeouts and do attempt to retry transient failures, etc)
[14:42:52] (and they do read the system's /etc/resolv.conf)
[14:43:43] the only part of that I question a little, is it's possible N:D:R parses the options line from resolv.conf as well:
[14:43:46] options timeout:1 attempts:3
[14:44:10] and maybe the aggressive timeout is an issue, for this case in Perl, during bootup with heavy contention, etc
[14:46:24] do we have a moritz tag? XD
[14:50:03] thanks both for the help
[15:22:02] successfully nerdsniped, but I'm done looking at this for today now :)
[15:22:18] https://phabricator.wikimedia.org/T237020#5623531 :P
[18:48:51] when you get the "meta lint" error and manage to "Whoops! It looks like puppet-lint has encountered an error that it doesn't know how to handle. Please open an issue at https://github.com/rodjek/puppet-lint." ;)
[19:16:48] bblack, we've had problems in the past with Net::DNS code returning bogus NXDOMAINs :/
[19:17:32] or was it a nil return with errorstring NOERROR
[19:17:36] something along these lines
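A minimal sketch for poking at that question on a host, assuming the stock Net::DNS::Resolver accessors and an example hostname (prometheus1003.eqiad.wmnet); it only exercises the resolver the same way ferm's @resolve does, it is not ferm's actual code path:

    # print the timing/retry settings Net::DNS::Resolver actually picked up
    # (i.e. whether the resolv.conf 'options timeout:1 attempts:3' line is
    # honoured), then do a lookup and report the errorstring on failure
    perl -MNet::DNS -e '
        my $r = Net::DNS::Resolver->new;    # reads /etc/resolv.conf by default
        printf "retrans=%ds retry=%d nameservers=%s\n",
               $r->retrans, $r->retry, join(",", $r->nameservers);
        my $q = $r->search("prometheus1003.eqiad.wmnet", "A");   # example name
        if ($q) { print $_->address, "\n" for grep { $_->can("address") } $q->answer }
        else    { print "no answer, errorstring=", $r->errorstring, "\n" }
    '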