[08:35:26] <_joe_> jbond42: do we have more puppetmaster reimages planned?
[08:36:32] <_joe_> I ask because I intend to upload a new confd package to stretch today, but I don't want it to be installed on new reimaged servers from the get-go
[08:49:58] _joe_: unfortunately the reimage of puppetmaster1001 didn't go ahead yesterday as puppetmaster2001 died when i moved the CA endpoints to it. I plan to troubleshoot today and hope to retry on monday https://phabricator.wikimedia.org/T234315#5541479
[08:51:05] <_joe_> ok
[08:51:26] <_joe_> config-master is going to 2001 anyways, right?
[08:52:00] it is yes do you want me to leave it there ?
[08:52:08] <_joe_> yes please
[08:52:13] ok will do
[08:52:18] <_joe_> so I can maybe test the new package on 1001
[08:52:53] yes i wont be touching it really today
[10:05:07] -lOogDtprSe.iLsfxC
[10:05:49] is that a password?
[10:05:55] it is not! they are rsync options
[10:06:49] I wouldn't be surprised if it was a megacli option either
[10:07:19] :-P
[10:08:01] <_joe_> arturo: my preferred is netstat, where you can combine options to form proper italian words
[10:08:15] -putan? xD
[10:08:18] <_joe_> one of my standard invocations was `netstat -polenta`
[10:08:23] Hahah I always use -putan
[10:08:25] lol
[10:08:34] <_joe_> gross
[10:08:48] <_joe_> polenta though, is awesome
[10:09:03] <_joe_> (it's a bit redundant, but great for memorization)
[10:09:10] * arturo tries the polenta one
[10:10:47] you can issue `ss -polenta` too
[10:11:13] <_joe_> I should get used to use ss by default
[10:11:45] <_joe_> there is also the mildly gross netstat -rutto
[10:12:00] <_joe_> (rutto = burp in italian)
[10:12:14] xd
[10:15:32] netstat -tuna or in Spanish -atun
[10:27:06] lol for the spanish, my default one is -tunap
[10:41:09] <_joe_> I converted to polenta
[12:23:41] do you remember what was the usual suspect when prometheus-node-exporter is reporting this bogus number?
[12:23:45] # HELP puppet_agent_last_run Timestamp of last run
[12:23:45] # TYPE puppet_agent_last_run gauge
[12:23:45] puppet_agent_last_run 1.569479391e+09
[14:26:53] maybe better over here re: puppet merge policy
[14:27:19] so the ticket https://phabricator.wikimedia.org/T224033 sounds like it's building consensus to switch to rebase-if-necc and leave everything else alone, as a first step that really doesn't lose much.
[14:27:49] I'm digging around in https://phabricator.wikimedia.org/T166888 too, because I recall someone having some kind of relatively-good argument against rebase-if-necc, but I don't recall what it was...
[14:29:01] https://grafana.wikimedia.org/d/000000321/zuul?panelId=13&fullscreen&orgId=1&from=now-30d&to=now-5m <- pulled from that ticket, gives a good view of the V+2 wait times
[14:31:12] the subthread about rebase-if-necc in the older ticket: https://phabricator.wikimedia.org/T166888#3311162
[14:33:20] it seems like one of the points of un-clarity was whether jenkins actually re-executes after a rebase-when-necc
[14:33:24] it sounds like it might not
[14:34:47] (so jenkins gives a V+2 to change abcd with parent a123, but another commit wins the merge race, and then gerrit will auto-rebase the nonconflicting (according to gerrit) change and let it merge, but not re-run jenkins after the necessary rebase?)
[14:35:31] which, as some of the arguments in the later ticket state, may not actually be any less safe than what we do today
[14:35:44] but it does open a loophole where CI *could've* potentially caught an issue, but now fails to even look.
[14:36:13] bblack: are you sure that chart includes queue time?
[14:36:22] I suspect it is execution time but not queue-waiting time
[14:36:32] there were two charts in the original thread, one's a 404 now
[14:36:55] but this one was "CI processing time" described in https://phabricator.wikimedia.org/T166888#3331797
[14:37:05] it seems to include all the meta-time
[14:37:08] hm
[14:38:49] anyways, I won't be here at monday's meeting, but perhaps this would be a good short topic there? Let everyone read/stew/argue today and tomorrow first, and if there's a rough consensus for merge-if-necc (sounds like there might be), take that decision at the meeting?
[14:39:36] I won't be able to attend but I think I've made my position clear, +1
[14:39:55] maybe email ops-private noting you'd like to discuss it, otherwise I wouldn't expect most people to have thought about it beforehand
[14:40:10] this issue has dragged on way too long IMHO. We need to either execute on the simple plan (switch to rebase-if-necc), or someone needs to come up with a clear and achievable other improvement in its place, or we kill it and stop thinking about it and decide what we have today is as good as it gets.
[14:40:32] <_joe_> no it's not including the wait time
[14:41:12] https://grafana.wikimedia.org/d/000000321/zuul?panelId=27&fullscreen&orgId=1&from=now-30d&to=now-5m
[14:41:16] this looks more like my experience
[14:41:16] <_joe_> I will never surrender to the idea I have to rebase a change 4 times, and wait CI 4 times, because my change on etcd should be rebased on top of your change to logstash
[14:41:28] to be clear, yeah I don't think monday's the place for extended discussion, but if everyone can gather their thoughts and do some IRC debating or whatever beforehand (or email), we can maybe get to a state of quick consensus in the meeting.
[14:41:48] also it's happened many times in the past two months that the max time has been over 20m
[14:42:04] a little better since 9/21, dunno why, but still
[14:42:38] long CI execution is of course a separate sub-topic, but it exacerbates the rebase/merge-hell problem and it will never be perfect
[14:43:08] <_joe_> bblack: I think I said it multiple times - the weapon against accidental bad merges by gerrit is puppet-merge imho
[14:43:16] the status quo encourages people to V+2
[14:43:20] that's the worst possible world
[14:43:33] I try really hard not to use the V+2 except when emergency time really is of the essence
[14:43:38] <_joe_> also, if our CI was so meaningful
[14:43:52] <_joe_> bblack: I do it out of exasperation when there are too many people racing
[14:44:09] (on the other hand, I'm a horrible self C+2 offender, and that's a whole separate conversation about fixing that)
[14:44:39] <_joe_> bblack: that part should really be "when it is ok to self-c+2"
[14:44:50] I also try hard to not V+2 but I've succumbed to temptation multiple times in the past month, usually when a change has already been V+2'd by Jenkins, but I had to rebase, and now it's taken >5 minutes
[14:45:24] <_joe_> because it's pretty clear that if I want to try to raise the opcache usage on one appserver I should be able to do it
[14:45:27] <_joe_> via puppet
[14:45:42] _joe_: yeah, right now it's kind of "personal judgement", but without some kind of stated standards, it's easy for that to drift and be misapplied :)
[14:45:47] <_joe_> or even to change the number of runners on all appserver I guess.
[14:46:02] <_joe_> nah I was never given a standard besides "your best judgement"
[14:46:11] that's what I mean
[14:46:18] <_joe_> also sometimes I can wait weeks for someone to catch up with a review
[14:46:35] maybe we should at least write something down about what good judgement normally looks like, and clear cases where it would almost never be good judgement to self+2, etc
[14:46:35] <_joe_> who knows envoy well enough to challenge/review the config I wrote?
[14:46:45] heh that is its own problem
[14:46:50] (but related ofc)
[14:46:59] and that last point, yeah, that's the case for me sometimes too
[14:47:14] <_joe_> cdanis: strictly related, we don't have a bus factor of 2 on 80% of the projects
[14:47:21] but even then, you could argue that someone else could at least take a cursory look for obvious dumb typos and such, independent of not understanding the deeper meaning.
[14:47:32] <_joe_> I can at least review your puppet coding yes
[14:47:48] bblack: there's also "please review this and ask questions liberally, so you can learn"
[14:47:52] <_joe_> speaking of which
[14:47:54] but ofc that is a lot of time
[14:48:05] cdanis: yes, very true, good point on the mentoring value
[14:48:24] I think it's something in the habit we should be doing more of
[14:48:36] I work a lot better when I'm working with someone and I don't think I'm exactly an outlier in that regard
[14:49:08] yeah there's an inverse of https://en.wikipedia.org/wiki/Rubber_duck_debugging in here somewhere
[14:49:25] we normally think of that as the junior coder going to ask someone senior and then realizing the answer without needing an explicit answer
[14:49:47] but the same effect works in reverse, in that the senior person, by just mentally bouncing ideas off the junior person, realizes more of their own mistakes, etc
[14:49:57] i think more reviewers sounds great but there also needs to be more guidance on how to add as a reviewer and more importantly there needs to be some expectation set on how long it takes for a review
[14:50:00] bblack: while I agree in principle, in practice I think that it boils down to be impractical, so basically self +2 judgement is the same judgement of running root commands in prod, you need to know what you're doing
[14:50:04] and that's part of the job
[14:50:14] for dumb typos and such CI should take care of it
[14:50:19] maybe
[14:50:41] volans: for Puppet yes, for a lot of config files not in practice
[14:50:52] but CI won't in practice. it has no idea what the correct keyword was for that nginx config setting in some files/foo.conf, for instance
[14:50:53] cdanis: why? it could
[14:51:13] vim highlight has that concept, I'm sure there are many config linters that we could use to check that
[14:51:24] it's not tractable in general
[14:51:37] we run different versions, the keyword is brand-new and only works on the package we have on buster, etc...
[14:51:59] sure sure, but would a non subject matter expert know it?
[14:52:02] probably not
[14:52:27] but they can figure it out
[14:52:49] they can go google nginx docs, and find this keyword is only in 1.9.11+, and then question whether we're running that and on what base OSes, etc...
[14:53:06] given sufficient time and motivation to really review a change instead of rubber-stamping it
[14:53:11] which gets back to some of the bus-factor issues, of course
[14:53:53] I'm usually for always having a review for any change, but have also committed self-+2 recently, so guilty as charged
[14:55:12] I do it all the time. I suspect mostly with good judgement, but I'm sure not always.
[14:55:41] it would be interesting to see how many self+2 changes failed vs ones which were reviewed. however most change management systems allow for documentation only change requests. which is equivalent to +2 so i dont think they should be removed completely however guidance would be useful. im sure i have abused it
[14:56:07] anyways, the conversation seems to be sprawling out into fractal dimensions of sub-conversations :)
[14:56:10] *equivalent to self+2
[14:56:27] I'll try to summarize the points about rebase-if-necc and send an ops-private email hoping a consensus decision emerges for monday.
[14:56:50] thanks!
[15:05:35] also my rubber duck debugging link didn't really capture what I was trying to say there, but it was the quickest link
[15:05:59] there's a variation on that where someone kept a teddy bear on their desk, and when people would come to them with questions, they'd tell them to ask the bear first
[15:06:19] the rubber duck link sounds more like a version of it with only one person involved, not two.
[17:34:57] bblack: Not necessarily today, but we could use a review of https://gerrit.wikimedia.org/r/#/c/operations/dns/+/540148/ (feel free to foist it off on someone else if you think they're better suited)
[17:40:31] andrewbogott: that looks fairly straightforward, there's nothing special about deploying new zonefiles really, and jenkins likes it.
[17:40:56] I'm not really looking any deeper than that. I mean, the targets of the CNAMEs exist, and the rest is really up to you re: the contents, within reasonably-sane boundaries :)
[17:41:01] bblack: great. As long as you're not worried about it breaking other zones we'll go ahead and deploy and see what happens.
[17:41:06] thanks
[17:41:59] did you want me to deploy or you are?
[17:42:27] we'll do it
[17:42:27] thanks
[19:46:50] heh, I'm pretty sure the 'global' Varnish availability %age is just wrong
[20:04:52] mmmhmm. yep, they are https://grafana.wikimedia.org/d/uiBRF5hWk/xxx-cdanis-frontend-traffic-copy?panelId=3&fullscreen&orgId=1&from=1570101355987&to=1570110394094
[20:11:30] we clearly need to improve our "incorrect pseudo-global sum-of-ratios-with-different-denominators" metric. I'll put it in an OKR :)
[20:12:06] lol
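For readers wondering why combining per-site ratios with different denominators goes wrong, here is a minimal sketch. The site names and request counts are made up for illustration and are not taken from the dashboard linked above; the log does not say how the real 'global' panel is computed, so this only shows the general pitfall.

```python
from fractions import Fraction

# Hypothetical per-site counters: (successful_requests, total_requests).
# These numbers are invented for the example.
sites = {
    "site_a": (999_000, 1_000_000),  # busy site, 99.9% available
    "site_b": (50, 100),             # tiny site, 50% available
}


def mean_of_ratios(counters):
    """Average the per-site ratios, ignoring how much traffic each site serves."""
    ratios = [Fraction(ok, total) for ok, total in counters.values()]
    return sum(ratios) / len(ratios)


def ratio_of_sums(counters):
    """Divide total successes by total requests, so each request counts once."""
    ok = sum(o for o, _ in counters.values())
    total = sum(t for _, t in counters.values())
    return Fraction(ok, total)


naive = mean_of_ratios(sites)
weighted = ratio_of_sums(sites)

print(f"mean of ratios: {float(naive):.4%}")
print(f"ratio of sums:  {float(weighted):.4%}")
```

In this made-up case the tiny site drags the unweighted figure far below what users actually experienced, which is one way a "global" percentage can end up looking wrong.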