[00:57:30] Rook: for T379746, if you need specifics to secure PAWS, I can provide more info privately.
[00:57:30] T379746: Cleanup miners - https://phabricator.wikimedia.org/T379746
[00:58:44] No, no need for specifics. I was wondering whether there was some way to block an IP. We can add the security tag to that ticket if you think it is appropriate
[01:01:24] I can block IPs, which I have done for the proxies/web hosts. The largest group appears to be a normal telecom provider, though, with other legitimate users.
[01:03:14] Reviewing at this rate isn't sustainable though. Something is likely going to need to be done from the PAWS/cloud side.
[09:03:51] morning
[09:40:26] o/
[09:49:56] o/
[11:05:33] quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/595 <- it's preventing running the tests in lima-kilo
[11:05:36] (without manual changes)
[11:09:35] dhinus: thanks, addressed your comments
[11:10:00] dcaro: thanks, approved
[11:15:19] I'm trying to figure out what's the best answer to this email, or in general to cloud vps users who want to experiment with prometheus metrics
[11:15:23] https://lists.wikimedia.org/hyperkitty/list/cloud@lists.wikimedia.org/thread/3SAMOJSJZBH64M3WPQJXXIUACKJPMBJA/
[11:15:53] we have some docs here suggesting it's ok-ish to push some custom metrics to the metricsinfra prometheus https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Metricsinfra_Prometheus
[11:16:42] but I feel like we need more docs on how to actually do it, or how to set up a custom prometheus instance in a project (if that makes sense)
[11:17:02] dhinus: yeah, same feeling here regarding using metricsinfra prometheus
[11:17:20] and also same, about additional docs
[11:18:07] I guess the "easy" answer is pointing them to the -cloud IRC channel for help :D
[11:19:01] but maybe I'll try to figure out how this example mentioned in wikitech is working https://libraryupgrader2.wmcloud.org/metrics
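For readers new to the /metrics convention mentioned above, here is a minimal sketch of what such an endpoint could return, assuming the standard Prometheus text exposition format. The helper names `escape_label_value` and `render_gauge`, the metric name and the label are made up for illustration; this is not how libraryupgrader2 or metricsinfra are implemented.

```ruby
# Escape a label value for the Prometheus text exposition format:
# backslash, double quote and newline must be escaped.
def escape_label_value(value)
  value.to_s.gsub(/[\\"\n]/) { |char| char == "\n" ? '\\n' : "\\#{char}" }
end

# Render one gauge sample together with its HELP and TYPE lines.
# All names passed in are the caller's choice; nothing here is WMCS-specific.
def render_gauge(name, help, value, labels = {})
  label_text = labels.map { |key, val| %(#{key}="#{escape_label_value(val)}") }.join(',')
  series = label_text.empty? ? name : "#{name}{#{label_text}}"
  "# HELP #{name} #{help}\n# TYPE #{name} gauge\n#{series} #{value}\n"
end

# Hypothetical metric for a tool running on a Cloud VPS VM.
puts render_gauge('demo_queue_length', 'Jobs waiting in the queue.', 3, { 'project' => 'demo' })
```

A web application would serve this text at /metrics. Whether any Prometheus instance actually scrapes it depends on that instance's scrape configuration.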
[11:20:09] does prometheus.wmcloud.org scrape the /metrics URL on _all_ cloudvps vms?
[11:21:57] I don't think so, both suggestions would be part of the metricsinfra project service (unfinished), allowing users to define alerts/metrics etc.
[11:22:11] we would probably want to do some thinking on the offering we want to give
[11:25:43] the config for the scrapes is in the prometheusconfig DB, that the metricsinfra controller uses
[11:25:49] (the scrapes table)
[11:26:38] I don't see that one though, looking
[11:27:00] found it
[11:28:45] https://www.irccloud.com/pastebin/A0T0GaLY/
[11:32:08] nice one, thanks
[11:37:16] this is the epic task about allowing project admins to configure custom scrape targets: T284993
[11:37:17] T284993: Enable self-service Prometheus configuration management for project administrators - https://phabricator.wikimedia.org/T284993
[11:49:40] yep, I think that's the one yes
[11:49:45] quick review: https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/62
[11:49:54] I've added some info to https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Metricsinfra_Prometheus
[11:50:40] thanks
[11:50:52] (for the docs)
[11:50:59] dcaro: approved the MR
[11:51:19] and the review :)
[11:54:47] oh, another quick one https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/596
[11:54:55] (so when I deploy the fix it's actually tested for)
[12:00:58] gtg for lunch, will deploy the fix https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/63 after (anyone feel free to release before that)
[12:01:12] +1d
[14:49:42] I got a few quick reviews here https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests
[14:52:43] and https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests?label_name%5B%5D=Needs+review
[15:06:07] what is ntp-04 in project cloudinfra?
"No Puppet resources found on instance ntp-04 on project cloudinfra" [15:07:16] dhinus: as far as I know all our VMs use ntp-03/04 time servers to sync clocks. [15:07:22] So that probably matters [15:07:52] looking [15:08:09] "Failed to open TCP connection to puppetmaster.cloudinfra.wmflabs.org:8140 (getaddrinfo: Temporary failure in name resolution)" [15:08:55] I'd check resolv.conf for starters, and then reboot it :) [15:09:03] btw, confirmed that those servers still matter: [15:09:07] [Time] [15:09:07] Servers=ntp-03.cloudinfra.eqiad1.wikimedia.cloud ntp-04.cloudinfra.eqiad1.wikimedia.cloud [15:09:16] (from a randomly selected VM) [15:09:42] name resolution is broken for any host [15:10:11] the nameserver is missing from /etc/resolv.conf [15:10:30] ntp-03 has "nameserver 172.20.255.1", in ntp-04 that line is missing [15:10:39] I'll try adding manually, then re-running puppet [15:10:46] that's interesting, has puppet been broken there for a year? [15:10:54] Easy to fix, but mysterious! [15:11:20] only broken for 960 minutes, apparently [15:11:22] :) [15:11:52] I would like to think that resolv.conf doesn't just randomly degrade :( [15:12:11] that line was removed by puppet itself, it's logged [15:12:23] "Applying configuration version '(4ac6bd9d7f) Eevans - Update corto puppetization'" [15:13:19] well the previous puppet run was at the same commit, and it worked [15:13:31] then it somehow decided that line had to go... [15:13:46] so probably a temporary hiera lookup failure... [15:15:35] looks likely [15:19:25] I'm going to try to make a safety net for this since it seems very bad [15:19:38] (although to be honest puppet should fail entirely if there's a hiera failure...) 
[15:20:16] the template is shared with prod so I'm surprised this hasn't happened before
[15:23:17] I'm not finding any related task in phab, I'll open one for posterity
[15:25:56] oh great, tell me the # and I'll attach this patch
[15:28:34] T379927
[15:28:35] T379927: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927
[15:28:50] I marked it as "resolved", but feel free to reopen and attach the patch
[15:31:10] heh, as always I want to cc the person who worked on this code last and of course it's jbond all the way down
[15:31:32] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1091249
[15:32:16] haha
[15:32:21] * andrewbogott seeks pre-meeting breakfast
[15:38:02] andrewbogott: i don't think that will fix the issue. It's not obvious from the task, but is it possible that the nameservers variable contained ip addresses for the current host?
[15:38:08] i ask as the template has the following
[15:38:17] `<% @_nameservers.reject{|ns| [@facts['ip'], @facts['ip6']].include?(ns) }.each do |nameserver| -%>`
[15:42:01] originally added in: https://github.com/wikimedia/operations-puppet/commit/60f6abd82915524d3db11f59e5aabbce5e42d78d
[15:43:39] jbond, you are kibo for the new millennium
[15:44:19] So we could move that logic that removes localhost up into the .pp file and then check...
[16:14:29] andrewbogott: it won't remove localhost. it will remove the primary IP address
[16:15:33] I'd also suggest chatting to _joe_ about why they added the check initially.
however if the server is a dns server then i think it also makes sense to use localhost and *not* $facts['ip']
[16:16:47] i also worked with sukhe to add some special handling for the production DNS/ntp servers so it's also worth having a chat with them to see what we did there (assuming we finished it)
[16:17:30] * jbond doesn't get the kibo reference
[16:31:55] <_joe_> lol jbond you're too young
[16:32:10] <_joe_> kibo was a fading legend when I joined newsgroups
[16:32:37] <_joe_> he was some guy that ran some bot looking for mentions of him in any newsgroup, and he'd show up if mentioned
[16:32:54] <_joe_> what was the task?
[16:33:18] ahh i see, nice to know i'm still too young for some things lol
[16:33:27] this was the commit https://github.com/wikimedia/operations-puppet/commit/60f6abd82915524d3db11f59e5aabbce5e42d78d
[16:33:42] specifically "Remove all the $nameservers_override from the node definitions and add those to per-site, per-role hiera"
[16:34:05] and this line https://github.com/wikimedia/operations-puppet/commit/60f6abd82915524d3db11f59e5aabbce5e42d78d#diff-bb184c1bf60b3bafdb7cd2a60fe65b836f647fe25a3bf5227d26f48f1ff0e38bR9
[16:34:53] the line has since changed but is mostly the same https://github.com/wikimedia/operations-puppet/blob/production/modules/resolvconf/templates/resolv.conf.erb#L10
[16:35:47] sorry, wrong comment, i meant this one: "exclude the IP of the current node from the list to avoid self-dependencies"
[16:36:05] <_joe_> yes so
[16:36:12] <_joe_> this was originally done with the overrides
[16:36:21] <_joe_> we didn't want a node being installed
[16:36:39] <_joe_> having itself as a nameserver, when the nameserver software was still unconfigured
[16:36:48] <_joe_> it causes all sorts of issues ofc
[16:37:05] <_joe_> so unless you modify /etc/resolv.conf as the absolute last thing in puppet
[16:37:23] <_joe_> or at least after the local dns server is set up
[16:37:30] <_joe_> things might fail randomly to resolve
[16:37:34] <_joe_> does
this make sense?
[16:37:41] ack yes, that makes complete sense
[16:39:09] in that case andrewbogott i would speak with sukhe, i was helping them solve this when they were rebuilding the dns servers
[16:39:36] i can't remember if we finished it off but whatever we did there should work for you
[16:39:58] and i think if you just remove the line you will hit the issue _joe_ describes above when rebuilding servers
[16:40:07] thanks _joe_ :D
[16:43:36] * arturo offline
[16:49:46] _joe_: the larger context is that we had a VM (not a nameserver) randomly remove its own resolver and I'm trying to prevent that from happening again with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1091249. Since it only happened once the safest approach may be to just do nothing as that patch was already a stab in the dark -- no idea why its nameserver list came up empty.
[16:50:26] <_joe_> yeah not sure either, we can put a consistency check in there
[16:54:51] andrewbogott: from the task it looked like the server was an ntp server. in production the NTP servers are the DNS servers. so are you sure it wasn't a nameserver?
[16:55:49] yeah, they're different boxes in wmcs. Let me make sure there's not a name server installed there by accident...
[16:56:50] doesn't look like it.
[16:58:09] ok one fix would be to move the filter that rejects $facts['ip'] out of the template and into puppet. then you can do your patch. i'll comment on the CR
[16:59:13] thanks!
[17:02:22] * jbond done
[17:33:25] <_joe_> andrewbogott: please ping me for a review before merging
[17:35:00] will do
[17:36:18] <_joe_> FTR, ofc there's a wikipedia page
[17:36:19] <_joe_> https://en.wikipedia.org/wiki/James_%22Kibo%22_Parry
[17:36:54] :) thanks
[17:50:54] _joe_: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1091249, not at all time critical
[18:13:51] jbond: nice seeing you around :)
[18:14:00] * dcaro off
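To illustrate the template filter quoted in the discussion, here is a minimal plain-Ruby sketch of the same `reject` expression together with the kind of consistency check that was suggested. The method names `usable_nameservers` and `checked_nameservers` are hypothetical, the host address 172.16.0.5 is made up, and this is not the code in the gerrit patch. It shows only one way the list can come out empty; the chat did not establish the actual cause on ntp-04.

```ruby
# Same filter as the template: drop any nameserver equal to the host's own
# primary IPv4/IPv6 address (facts 'ip' and 'ip6').
def usable_nameservers(nameservers, facts)
  nameservers.reject { |ns| [facts['ip'], facts['ip6']].include?(ns) }
end

# Hypothetical consistency check: fail loudly instead of rendering a
# resolv.conf with no "nameserver" lines.
def checked_nameservers(nameservers, facts)
  remaining = usable_nameservers(nameservers, facts)
  raise ArgumentError, 'no nameservers left after filtering' if remaining.empty?
  remaining
end

facts = { 'ip' => '172.16.0.5', 'ip6' => nil } # made-up host address
p usable_nameservers(['172.20.255.1', '172.16.0.5'], facts) # → ["172.20.255.1"]
p usable_nameservers(['172.16.0.5'], facts)                 # → []
```

Running the filter before rendering, as in `checked_nameservers`, lets the catalog fail rather than silently write a file without resolvers, while the self-exclusion that _joe_ described for freshly installed DNS servers stays in place.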