[10:25:15] yay!
[10:25:22] https://www.irccloud.com/pastebin/nD6asLKH/
[10:28:15] 🌈
[10:30:05] good time for lunch
[10:30:08] * dcaro lunch
[11:27:45] topranks: quick review here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1080267
[11:29:07] arturo: +1
[11:29:13] thanks!
[11:29:16] makes sense to do it that way, much more flexible
[11:30:37] it took me some spare cycles to think about how to implement it
[12:02:41] arturo: puppet is still failing on enc-1.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud (has not worked actually since the last time I checked), should it? Or are you still working on getting the network setup?
[12:10:09] dcaro: the network should be stable today
[12:10:17] I just ran puppet on a few other VMs, and it works
[12:10:33] https://www.irccloud.com/pastebin/2J4Y3R8H/
[12:10:44] that enc one and cloudinfra-cloudvps-puppetserver-1.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud fail :/ (getting emails every day)
[12:10:55] let me check
[12:12:16] dcaro:
[12:12:18] https://www.irccloud.com/pastebin/Tl6Nr5v2/
[12:12:54] might it have gotten fixed today?
[12:12:59] https://www.irccloud.com/pastebin/m8uYCrEs/
[12:13:34] maybe!
[12:13:45] I see the enc-1 performs a change on every agent run
[12:13:56] https://www.irccloud.com/pastebin/GNgJZ5O7/
[12:14:21] not sure why it tries to run confd?
[12:14:24] keyholder is not armed
[12:16:01] I don't know where this secret is https://www.irccloud.com/pastebin/1wWV2IPF/
[12:16:15] should be in the puppet repo I think?
[12:16:24] (in the private repo in puppet)
[12:16:35] not sure though, maybe in pws
[12:16:57] in pws I don't see it
[12:19:14] * arturo mumbles something about puppet git tree ownership
[12:20:50] ok found it
[12:20:51] alias git=pgit
[12:21:09] and your tree ownership issues will be gone (unless someone already did something)
[12:23:01] ok, so last puppet issue is confd won't start because
[12:23:19] Sep 28 02:01:53 enc-1 confd[1560444]: 2024-09-28T02:01:53Z enc-1 /usr/bin/confd[1560444]: INFO SRV record set to _etcd-client-ssl._tcp.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud
[12:23:19] Sep 28 02:01:53 enc-1 confd[1560444]: 2024-09-28T02:01:53Z enc-1 /usr/bin/confd[1560444]: FATAL Cannot get nodes from SRV records lookup _etcd-client-ssl._tcp.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud on 172.20.254.1:53: no such host
[12:23:39] I wonder if I have deleted that record myself
[12:25:26] maybe when adding them to tofu?
[12:26:20] do we have a similar record in eqiad1?
[12:28:51] I'd say no, confd is failing there too
[12:29:26] ok, so maybe some puppet change made it a requirement for the record to be present
[12:29:36] or maybe we don't need confd at all
[12:30:04] I'll stop my research here, need to focus on something else
[12:54:08] ack
[15:00:05] did the cloudcumin server die?
[15:09:22] 👀
[15:09:44] dcaro: ongoing outage, see -sre
[15:09:55] ack
[15:09:57] sorry, -operations
[15:11:25] yep, xd, just realized
[15:11:37] (wondered for a bit how tomcat was related though)
[15:37:25] cloudcumin was rebooted and now I cannot use the tofu cookbook? it fails with no actual error
[15:39:06] log
[15:39:08] https://www.irccloud.com/pastebin/YEJbKDl9/
[15:40:20] I have the feeling of this being unnecessarily hard to debug
[15:43:07] turns out, keyholder needed to be rearmed
[15:43:12] and yet, it will not tell you the output of the command xd
[15:43:47] maybe we can add an alert of sorts for keyholder not being armed?
[15:44:00] apparently there is one such alert for prod hosts
[15:44:11] I ran the puppet agent hoping for it to report something too, but it didn't
[15:44:33] hmm, there's a couple 'nrpe timeout' alerts though xd
[15:45:45] ok, I just saw them
[15:46:33] not sure they are very useful though, ferm and ntp sync
[15:46:46] (though nrpe timeout usually means nrpe itself is not up and running)
[16:03:35] * arturo offline
[16:50:20] dcaro: there might be something wrong with how we merge the api docs. We have these pointers to a ref in the merged docs: `"$ref": "#/components/parameters/JobId"` but not the actual reference. In the jobs api openapi.yaml however, we do have the ref so the problem is not there.
[16:51:30] I will investigate
[16:52:26] (restish won't work in lima-kilo unless this is fixed)
[16:53:05] nice!
[16:53:16] oh, the functional tests are broken in tools xd, I'm fixing it
[16:54:14] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/558
[17:17:35] blancadesal: yep, I think that the merge logic does not expect a parameters subpath, just added it, let's see
[17:18:30] To be fair, we might want to use just the schemas subpath
[17:18:34] like all the others
[18:06:25] dcaro: I'll look into it if you don't have the time
[18:35:14] blancadesal: got https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/45, but will not be able to test it, feel free to take it over :)
[18:36:21] ok! thank you :)
[18:37:19] cya!
[18:37:31] * dcaro off
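
Note on the `$ref` issue discussed from 16:50 onward: a minimal sketch of how a merge that only carries over `components/schemas` leaves `#/components/parameters/JobId` dangling, and how merging every `components` subsection avoids it. This is not the api-gateway merge code; `merge_components`, `find_dangling_refs`, and the sample `jobs_api` document are hypothetical names and data made up for illustration.

```python
from typing import Any


def merge_components(specs: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Merge every `components` subsection (schemas, parameters, ...) of all specs."""
    merged: dict[str, dict[str, Any]] = {}
    for spec in specs:
        for section, entries in spec.get("components", {}).items():
            merged.setdefault(section, {}).update(entries)
    return merged


def find_dangling_refs(node: Any, components: dict[str, dict[str, Any]]) -> set[str]:
    """Return local `#/components/...` refs under `node` that point at nothing."""
    dangling: set[str] = set()
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith("#/components/"):
            parts = ref.split("/")[2:]  # e.g. ["parameters", "JobId"]
            if len(parts) != 2 or parts[1] not in components.get(parts[0], {}):
                dangling.add(ref)
        for value in node.values():
            dangling |= find_dangling_refs(value, components)
    elif isinstance(node, list):
        for item in node:
            dangling |= find_dangling_refs(item, components)
    return dangling


# Hypothetical stand-in for the jobs API document (not the real openapi.yaml).
jobs_api = {
    "paths": {
        "/jobs/{id}": {
            "get": {"parameters": [{"$ref": "#/components/parameters/JobId"}]}
        }
    },
    "components": {
        "parameters": {
            "JobId": {
                "name": "id",
                "in": "path",
                "required": True,
                "schema": {"type": "string"},
            }
        },
        "schemas": {"Job": {"type": "object"}},
    },
}

# A merge that only knows about `schemas` drops the parameter definition.
schemas_only = {"schemas": jobs_api["components"]["schemas"]}
# Merging all subsections keeps it.
full = merge_components([jobs_api])
```

Running `find_dangling_refs(jobs_api["paths"], schemas_only)` reports the `JobId` ref as unresolved, while the same call against `full` reports none. A check like this could run after the merge step to catch the problem before a client such as restish hits it.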