[10:48:54] I have created ticket T371355 out of caution, we should keep this on the radar
[10:48:57] T371355: toolforge: maintain-kubeusers: review & correct kubernetes templated resource names - https://phabricator.wikimedia.org/T371355
[15:00:10] dhinus: https://grafana.wikimedia.org/d/UUmLqqX4k/wmcs-openstack-api-latency?orgId=1&refresh=30s is using the haproxy stats from the cloudlbs, you might be able to add/get more stuff from them
[15:01:57] dcaro: thanks!
[15:59:39] dcaro: that link was very useful because it led me to discover that the "sudden increase" is actually a switch from cloudlb1001 to cloudlb1002 :P
[16:00:19] nice :)
[16:16:28] but that doesn't match what I see on the receiving side behind the load balancers :/
[16:22:55] hmm, on the receiving side you should see the aggregated traffic of both
[16:23:13] exactly, but I see a big change starting June 12th
[16:23:19] and very little traffic before
[16:23:39] these are 2 different metrics though, the haproxy metric and the "host overview" metric for the backend
[16:24:06] it's in the throughput that you see the bump, right?
[16:24:16] yes
[16:24:27] https://phabricator.wikimedia.org/T367778#9906482
[16:24:45] sorry, wrong link, the right one is: https://phabricator.wikimedia.org/T367778#10028027
[16:25:17] well actually both are useful, I'm posting a new comment to recap
[16:59:38] looking at the traffic at the cloudlb level for 10-06-2024 does not show much
[16:59:41] https://usercontent.irccloud-cdn.com/file/szfFhPal/image.png
[17:00:04] there are some weird spikes though
[17:00:46] might be a resolution issue :/
[17:01:12] yes, I just found there's some artifact due to using "irate" instead of "rate"
[17:01:41] ack
[17:02:40] so the traffic actually decreased rather than increased :)
[17:02:46] but it became more spiky
[17:06:37] I summarized my latest findings in https://phabricator.wikimedia.org/T367778#10028841
[17:07:28] wait, I pasted the wrong graph
[17:09:58] ok, fixed
[17:10:56] I'm logging off for today, I have a plan for some tests I want to do tomorrow (depooling the host and seeing how the graphs change)
[17:14:25] 👍 nice
[17:21:52] andrewbogott: I'm adding some NFS alerts (very basic, just checking that the service is up and running), sorry for the noise xd
[17:22:18] nfs alerts seem good!
[17:39:32] When such alerts do trigger, where do I find the code that makes them run? I have one way of doing it on tf infra test, but that doesn't seem to be what is happening with paws (PawsPawsNFSDown in particular)
[17:46:52] they are defined in the metricsinfra database
[17:47:06] https://wikitech.wikimedia.org/wiki/Metricsinfra
[17:47:52] Oh, that. So everything is defined there? Or most things?
[17:48:52] anything coming from cloudvps directly; anything paws/toolforge specific should be coming from there
[17:49:46] meaning they would have their own metricsinfra servers to manage that?
[17:50:39] no, all the cloudvps alerts are managed by the same cloudmetrics servers; they might be pulling data from project-specific prometheus servers, but the alerts live in the shared one
[17:55:03] Oh, that makes sense. Thank you
[18:11:40] np :)
[18:11:42] * dcaro off
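
A note on the cloudlb1001 → cloudlb1002 switch (15:59) and the depool test planned for the next day (17:10): one way to make a failover or depool visible is to plot per-host throughput side by side. A minimal PromQL sketch, assuming standard node_exporter metrics; the instance label values and regex here are illustrative assumptions, not the actual labels on the cloudmetrics Prometheus:

```
# Per-host network throughput; depooling one cloudlb should show
# traffic moving from one instance to the other.
sum by (instance) (
  rate(node_network_receive_bytes_total{instance=~"cloudlb100[12].*", device!="lo"}[5m])
)
```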
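On the "irate" vs "rate" artifact found at 17:01: irate() extrapolates from only the two most recent samples in the range, so short bursts render as tall spikes at coarse scrape resolution, while rate() averages over the whole window. A minimal sketch of the two queries, assuming the standard haproxy exporter metric name (the exact metric and labels used by the cloudlb dashboards are an assumption here):

```
# Spiky: irate() uses only the last two samples in the 5m window,
# amplifying short bursts.
irate(haproxy_backend_bytes_in_total[5m])

# Smoother: rate() averages across the full 5m window, which is
# usually more representative for throughput trends.
rate(haproxy_backend_bytes_in_total[5m])
```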
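On the NFS alerts and metricsinfra (17:21–17:50): on Cloud VPS these alerts live as rows in the metricsinfra database (see the wikitech link above) rather than in rule files, but the equivalent logic written as an ordinary Prometheus alerting rule would look roughly like the sketch below. The alert name, metric, and threshold are illustrative assumptions, not the actual PAWS configuration, and node_systemd_unit_state is only exported when node_exporter's systemd collector is enabled:

```
groups:
  - name: nfs
    rules:
      # Hypothetical sketch: fire when the NFS server systemd unit has
      # not been in the "active" state for 5 minutes.
      - alert: PawsNFSDown
        expr: node_systemd_unit_state{name="nfs-server.service", state="active"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "NFS server is not running on {{ $labels.instance }}"
```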