[00:08:17] are you aware that some prometheus hosts (but only some, not all) are trying to talk to puppetserver1001 but get connection refused from it, causing other puppet errors?
[00:08:45] other hosts using the same prometheus::ops role do not have this issue
[00:17:15] ah, nevermind. this is a race condition while a puppet change is applied. it fixes itself on the next run.
[00:17:54] double-checking all 9 hosts, but you can basically ignore this
[00:19:42] in an unrelated matter: puppet on prometheus::ops machines is very slow, and it seems the reason is that /srv/prometheus is recursively managed by puppet and has about 20k files in it. puppet says the recommended soft limit is 1000 and warns that going beyond that can make it slow.
[00:40:44] Thanks for taking a look at the 1st issue you mentioned.
[00:40:54] As for the 2nd one, it's a known issue: https://phabricator.wikimedia.org/T351643
[00:41:17] denisse: ah, good to know it's already known. alright!
[00:41:49] I deployed that change that replaced ferm::service with firewall::service. That did NOT change anything about the actual firewall rules.
[00:42:03] it makes it possible though to switch the firewall provider in the future
[00:42:16] That's pretty nice, thanks for the patch! :)
[00:42:39] which would then give you some new features that I'm happy to chat about
[00:43:03] today though it was only that one config file gets renamed from - to _ in the name.. duh
[00:43:20] which means the service was restarted
[00:44:10] That would allow us to move from iptables to nftables, right?
[00:44:46] that is correct
[00:45:33] nftables can do things like, for example: if a host opens more than X connections in the last Y minutes, then drop packets for Z minutes
[00:45:39] so throttling
[00:46:18] in iptables this was only possible by loading some non-default modules
[00:47:38] firewall::service is generic and can use either ferm or nftables.
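The throttling pattern described above ("more than X connections in Y minutes, drop for Z minutes") maps onto an nftables dynamic set. A minimal sketch, assuming nftables as the provider; the table name, port, rates, and timeout are all illustrative, not from any real WMF ruleset:

```
# Hypothetical nftables config: sources opening new connections too fast
# get added to a dynamic set and dropped while their entry lives.
table inet throttle {
    set offenders {
        type ipv4_addr
        flags dynamic
        timeout 5m      # the "Z minutes" an offender stays blocked
    }
    chain input {
        type filter hook input priority 0; policy accept;
        # "X connections in Y": over 10 new connections per minute
        tcp dport 443 ct state new add @offenders { ip saddr limit rate over 10/minute }
        ip saddr @offenders drop
    }
}
```

In iptables the equivalent needs the non-default `recent` or `hashlimit` match modules, which is what the message above alludes to; in nftables the dynamic-set mechanism is built in.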
you are still using ferm just like before, but now it's not hardcoded anymore
[00:48:47] Thanks for implementing the change, do you know if there's an estimated date for us to switch to nftables?
[00:49:45] there is no such date. I'm just doing it for all services of my team and randomly for prometheus::ops because I wanted to show it to you
[00:50:17] infra foundations wants us to switch though, that is clear from the code
[00:50:55] Do you happen to have documentation on the steps required to make the change?
[00:51:08] I'll talk with the o11y team about this on Wednesday.
[00:51:32] I don't have a wikitech page. I should make one.
[00:51:46] though it's a one-liner: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1059152/2/hieradata/role/common/durum.yaml
[00:51:57] Oh, good! I'll take a look.
[00:52:03] that is.. ONCE all uses of ferm::service have been replaced
[00:52:30] the one way to find out is to upload a change like the above and not merge it, just compile it
[00:52:59] if it compiles then it should be all good to go
[00:53:24] I can offer to show it on a Google Meet shared screen
[00:54:54] Thanks for offering, let's coordinate for that! :D
[00:55:06] I've created this task to track the firewall migration of o11y services: https://phabricator.wikimedia.org/T372845
[00:55:10] :)
[00:55:43] so the first step would be to grep whether there are any other "ferm::service" uses in the code, or whether that was all
[11:56:50] we would like to decommission a mapreduce job that we used to alert on. If we stop generating metrics, the alert will keep firing. What's the recommended way to deal with this scenario? Is there a way to drain historical prometheus metrics? There's some more context in this phab: https://phabricator.wikimedia.org/T372456#10076765
[12:27:45] gmodena: IIRC those metrics come from gobblin via pushgateway?
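The grep step mentioned above can be sketched like this. The throwaway directory layout and the manifest contents are made up for illustration; they mimic a puppet module tree but are not the real operations/puppet repo:

```shell
#!/bin/sh
# Build a throwaway puppet-style tree containing one leftover ferm::service use.
tmp=$(mktemp -d)
mkdir -p "$tmp/modules/prometheus/manifests"
cat > "$tmp/modules/prometheus/manifests/server.pp" <<'EOF'
ferm::service { 'prometheus-web':
    proto => 'tcp',
    port  => 9090,
}
EOF

# The actual check: every file listed here still declares ferm::service
# and needs migrating to firewall::service before switching providers.
leftovers=$(grep -rln 'ferm::service' "$tmp/modules")
echo "$leftovers"
```

An empty result would mean the role is ready for the one-line hiera switch; any hits mean more migration patches first, which is why compiling (not merging) a trial change is suggested as the definitive check.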
if so then cleaning up the metric from the pushgateway will do the trick
[12:28:31] can't remember off the top of my head if gobblin takes care of deleting metrics from pushgateway
[12:36:09] godog that's gobblin indeed. AFAIK it does not take care of deleting metrics from pushgateway though
[12:36:12] ^ btullis
[12:45:43] gmodena: ok thanks for confirming
[12:54:43] to be clear I think all that's needed in this case for the actual deletion is curl -XDELETE http://prometheus-pushgateway.discovery.wmnet/metrics/job/webrequest_frontend_rc0 gmodena
[12:55:07] I'll let you folks decide if/when that's appropriate
[12:56:26] godog ack. Does that require SRE powers, or access to specific hosts?
[12:56:47] I'll f/up with btullis & team. Thanks for the pointer godog
[12:56:54] gmodena: no special powers needed, no
[12:56:57] sure np
[12:58:41] Great, thanks. I'll take care of that.
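For reference, a before/after check around the deletion suggested above; the host and job name are taken from the conversation, and the grep pattern is only a rough sanity check (pushgateway serves everything it currently holds on its /metrics endpoint):

```
# See what the pushgateway currently holds for this job
# (expect gobblin's pushed metrics before the delete, nothing after):
curl -s http://prometheus-pushgateway.discovery.wmnet/metrics | grep 'webrequest_frontend_rc0'

# Delete the whole metric group for the job, as suggested above:
curl -s -X DELETE http://prometheus-pushgateway.discovery.wmnet/metrics/job/webrequest_frontend_rc0
```

Note this only clears the pushgateway's current exposition; samples already scraped into Prometheus remain in the TSDB until retention expires, so the stale-metric alert stops firing once the scraped series goes stale rather than retroactively.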