[09:03:29] herron, shdubsh, godog - just ran puppet on logstash1007, another OOM for ES :(
[10:29:28] is https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=puppetdb1002&service=Ensure+hosts+are+not+performing+a+change+on+every+puppet+run getting out of hand?
[10:35:40] a little :D
[10:38:03] it is strange since mc1029 is listed in there, but I don't see puppet behaving weirdly
[10:38:44] mmm, confirmed by https://puppetboard.wikimedia.org/node/mc1029.eqiad.wmnet
[10:38:52] maybe the alert is a little stale, forcing a refresh
[10:41:29] <_joe_> XioNoX: no it's not... the clouddb and the kubestages have been there since well before Christmas
[10:41:42] <_joe_> I assumed the kubestages would be fixed by akosiaris and jayme in due time
[10:43:41] I think the alarm is probably getting too broad; some hosts are downtimed etc., so they should be excluded (there is a task about it), but in general we shouldn't keep puppet disabled on hosts for so long in my opinion
[10:43:54] and the alarm should be reviewed by all teams
[10:44:18] not only by people on clinic duty or reviewing icinga alarms
[10:44:37] (the task: https://phabricator.wikimedia.org/T268211)
[10:44:38] puppet disabled ==> puppet half broken
[10:45:02] uh, I personally missed that (for kubestage). Will add a task to fix it
[10:45:17] yeah, I'm asking as it's the first time I've seen that many hosts listed in there
[10:45:22] XioNoX: the forced re-run cleared out a lot of hosts
[10:45:35] cool, thx :)
[10:45:37] I have pinged cloud a few times in relation to the clouddb servers
[10:45:54] <_joe_> me too
[10:46:12] kubestage hosts only report this: Notice: /Stage[main]/Ferm/Service[ferm]/ensure: ensure changed 'stopped' to 'running' (corrective), but not on every puppet run. Interesting
[10:46:50] akosiaris: is that related to the docker patch you were working on?
[10:47:03] *dynamic ferm rules patch
[10:47:18] jbond42: no, docker does not do anything networking-related on kubernetes hosts
[10:47:36] thankfully... docker networking sucks
[10:47:55] is calico not doing something similar which perhaps needs handling in a similar manner?
[10:48:31] never mind, I see this: ignored_chain_prefix = ('DOCKER', 'cali-', 'KUBE-')
[10:48:45] I'll have a look today and see if I can spot anything obvious
[10:49:22] interesting, in 4 subsequent runs only the 1st and the 4th triggered that, so something clearly changes in between runs
[10:54:30] akosiaris: created https://phabricator.wikimedia.org/T271702 for that
[10:57:34] akosiaris@kubestage2001:~$ sudo /usr/local/sbin/ferm-status -v
[10:57:34] ipv6:
[10:57:34] --A FORWARD --jump ACCEPT --match mark
[10:57:34] ipv4:
[10:57:34] --A FORWARD --jump ACCEPT --match mark
[10:57:47] jayme: thanks
[10:58:07] that's the diff ^ that ferm-status says it notices
[10:58:41] 147 13602 ACCEPT all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:S93hcgKJrXEqnTfs */ /* Policy explicitly accepted packet. */ mark match 0x10000/0x10000
[10:58:46] and that's the matching rule I guess
[10:59:15] it would have been nicer if it was in its own chain ...
[11:00:28] and there we go: https://github.com/rancher/rancher/issues/28840
[11:01:50] akosiaris: as it's not in its own chain I guess I could add logic to filter based on comments as well
[11:05:17] jbond42: we could do that. I am just trying to understand why it shows up now. This is pretty new it seems, and IIRC it's about being able to address services without endpoints
[11:05:27] trying to figure out if it's avoidable
[11:06:36] akosiaris: ok, I'll leave it with you for a bit. I'm on the ticket, please ping if you want me to make a change to ferm-status
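A rough sketch of what comment-based filtering could look like, shown as plain shell against the live ruleset rather than an actual change to ferm-status (the 'cali:' comment prefix comes from the rule quoted above; the exact commands are illustrative only):

    # calico tags its rules with iptables comments beginning with "cali:"
    $ sudo iptables -L FORWARD -v -n --line-numbers | grep 'cali:'
    # dropping such rules before diffing against ferm's expected state, in the same
    # spirit as the existing ignored_chain_prefix tuple, could look like:
    $ sudo iptables-save -t filter | grep -v -- '--comment "cali:'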
[15:25:52] who can grant me access to the packaging project on horizon?
[15:35:47] vgutierrez: tool or VPS project?
[15:36:25] I'd say VPS.. I need access to builder-envoy-03.packaging.eqiad.wmflabs
[15:37:05] vgutierrez: that sounds like something jayme has worked with
[15:37:28] vgutierrez: any of these people should be able to, I think: https://ldap.toolforge.org/group/project-packaging
[15:41:26] vgutierrez: looking
[15:45:37] vgutierrez: should be done
[15:47:34] hey ottomata - are those old schemas of meta being kept for historical purposes, or can they be deleted at a later date?
[15:49:43] tabbycat: we have no real plan to delete them; instead we are edit-protecting them with a notice
[15:50:20] Very good
[16:06:11] thanks jayme!
[16:40:43] jynus: heyas, I updated https://phabricator.wikimedia.org/T271230 but really with info, no decision
[16:40:51] likely cool to gather quotes for both solutions and make the decision then
[16:41:04] i.e. backup host expansion
[16:42:03] thanks robh, I just wanted to know if you felt strongly against any of those
[16:42:36] yeah, the shelf for eqiad is problematic for onsite space, so I actually like going with more density with the r740xd2, but not strongly enough to recommend it over the other without comparison quotes
[16:42:38] so I will fill in and base the decision on quotes, if it is not too hard
[16:42:46] but I totally get what you were going for in reducing my workload there in quotation and it's appreciated!
[16:42:52] yeah, that
[16:42:58] and making dcops onsite happy too
[16:43:13] as it could entail lots of hw movement that is unnecessary
[16:43:22] Yeah, for codfw going shelf is no big deal, but with eqiad being so close to capacity it's likely worth my time to quote both solutions
[16:43:28] will fill in, we have templates for both
[16:43:32] awesome
[16:43:37] and will send back to you soon
[16:44:17] assigning to me for now
[16:44:31] ack
[16:44:33] templates here == past requests
[16:44:55] thanks for the quick response, and sorry for the delay, I was off
[16:47:51] no worries, the DBA team still gets tasks back to me faster than others this quarter ;D
[16:48:19] well, we had the delay of the media backups last quarter, but had a good reason
[16:48:46] hopefully we can go over that this Q (either for capex or opex)
[17:23:01] klausman: I took the liberty of bolding that item for you in the agenda :)
[17:23:28] Of course.
[17:25:05] klausman: sorry I forgot to ping you back today, not sure if you'll be around after the meeting or I'll ping you tomorrow, my bad
[17:25:30] klausman: thanks for calling that out btw, it's a good note
[17:26:07] volans: go ahead, I have some time
[17:26:56] basically I wanted to know if the host rebooted into PXE and got reinstalled after d-i but before the first puppet run, or after the first puppet run during the reboot issued by the cookbook (I guess the latter, but looks weird to me)
[17:27:05] rzl: it's a bit of a corollary of "Always include the shell prompt ($ or #) to indicate whether a command should be run as root or not" (it also has the effect of disabling wholesale copy&paste of commands)
[17:27:48] yeah, and right up there with always using `${FOO?}` instead of `$FOO` in docs
[17:27:56] volans: I *think* the sequence was reinstall, reinstall, failed puppet run (i.e. no firstrun in between two reimages), but I am not 100% sure.
[17:28:12] (so that if I forget to set the variable, it errors instead of doing something I almost certainly don't want)
[17:28:36] Ack. My fave: "rm -rf $HOME/"
[17:28:57] womp womp
[17:29:02] A while back the Steam uninstaller had such a bug and definitely wiped out homedirs before being fixed.
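A minimal illustration of the difference, using echo rather than an actual rm (FOO is just a placeholder variable here):

    $ unset FOO
    $ echo "rm -rf $FOO/home"      # the unset variable silently expands to nothing: prints "rm -rf /home"
    $ echo "rm -rf ${FOO?}/home"   # the shell aborts with an error instead of running the command at all
    # (in a non-interactive script, the shell exits entirely at that point)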
[17:29:37] klausman: ack thx, I'll look at the logs in more depth
[17:35:47] I could use a hand with an acme-chief issue if/when someone has some time. I can see things getting written to DNS during the challenge/response, but LE is nevertheless not satisfied.
[17:35:47] https://phabricator.wikimedia.org/T260835#6736903
[17:37:12] vgutierrez, possibly? ^
[17:44:19] hmm, what TTL are you using for those TXT records?
[17:45:41] we had issues with some of those in production till we set the TTL to 0
[17:50:01] vgutierrez: is that not something managed by the acme script? I may be misunderstanding what you mean
[17:50:47] horizon shows 'Time To Live - ', which I guess means it's inherited from the zone...
[17:51:00] and the zone has a TTL of 3600
[17:51:19] that's handled by the DNS script that acme-chief triggers to fulfill the challenges
[17:51:27] ok, will look at that...
[17:51:31] horizon has a custom one
[17:51:35] (although that script has worked in many many other places already)
[17:52:03] I don't know if the TTL is set by the script or if it is DNS-zone dependent
[17:52:15] That's why I'm bringing it up now
[17:52:25] yep, I'm digging
[17:52:48] If that doesn't help I'll be happy to dig further
[17:54:59] BTW.. a quick check with my laptop says that the toolserver.org NS records are WMF production DNS servers
[17:55:37] That would explain why fulfilling the challenges on WMCS DNS servers doesn't work from LE's point of view
[17:58:37] vgutierrez: https://gerrit.wikimedia.org/r/c/operations/puppet/+/655476 solves that TTL thing
[17:58:44] But I assume the other issue is the real one
[17:58:45] I'd hold that
[18:00:40] so… designate thinks that it manages toolserver.org, but I guess those records are invisible to the outside world since our registrar refers things to prod
[18:01:08] guess I need to send an email to markmonitor :/
[18:10:57] vgutierrez: I've requested to have the domain transferred to Designate. I'll bug you again if that doesn't resolve everything :)
[18:11:07] Do you think that TTL change is a good idea in any case?
[18:12:26] Conceptually yes... but I can't review it right now
[18:14:50] that's fine, it's certainly not urgent
[18:14:55] I'll add you as a reviewer and let it sit
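Both checks discussed above can be reproduced with dig: _acme-challenge is the standard DNS-01 label, and the second column of the answer section is the TTL that Let's Encrypt's resolvers would see (what actually comes back depends on which servers answer for the zone):

    $ dig +short NS toolserver.org                            # who the outside world is delegated to
    $ dig +noall +answer TXT _acme-challenge.toolserver.org   # the challenge record and its TTL, as seen publicly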