[09:08:26] morning
[09:09:41] morning
[09:24:32] o/
[11:48:36] greetings, I'm traveling back home today though I took another look at T412506 and the easiest thing for now would be to bump scrape_timeout for the kube-state-metrics job
[11:48:36] T412506: Investigation into ToolforgeKubernetesNodeNotReady 2025-12-12 page - https://phabricator.wikimedia.org/T412506
[11:49:16] in case someone has cycles to do it today, if not I'll do it tomorrow
[12:07:56] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1094 upgrades kube-state-metrics
[13:06:27] neat! LGTM
[13:09:30] !1095 LGTM too
[13:10:49] modulo the uninstall + reinstall
[13:26:36] > You should run a new pipeline, because the target branch has changed for this merge request.
[13:29:06] /o\
[13:39:17] deploying the kube-state-metrics sharding patch, there might be a slight blip in the availability of those metrics
[13:39:38] ack
[13:45:51] (done)
[13:46:01] taavi: I didn't realize in this case individual pods would be scraped, meaning effectively metrics are tripled unless we do aggregation everywhere
[13:46:25] maybe that's fine though
[13:47:48] godog: basically the sharding mode in kube-state-metrics means that each pod is only handling 1/n of the resources, so the total metric count stays the same: https://prod-misc-upload.public.object.majava.org/taavi/YmgFhhEhcV7Bo.png
[13:50:14] ok sweet!
[13:52:38] gtg
[15:39:06] dhinus: are you planning to merge the tools-db/nfs alerting patches soon?
[15:54:54] taavi: sorry, didn't see the ping. yes, but I'm double checking a few things
[15:54:59] I can also do that tomorrow
[15:55:52] are you ok with my suggestion here? https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/52#note_178286
[16:02:36] dhinus: yeah, that's fine I think
[16:14:47] I pushed that change, but I messed up the rebase :)
[16:16:28] actually I think it's fine but gitlab's "Compare with previous version" shows the changes coming from the rebase, which is confusing
[16:18:18] taavi: can you double check and approve the MRs?
[16:19:22] ship it
[16:19:28] thanks!
[16:20:46] ha, didn't think that the failing pipeline prevents merge.. I'll try merging the 2nd MR in the stack first
[16:25:09] that worked, 1st and 2nd merged, re-running the pipeline for the 3rd
[16:26:32] uh, I just found an error in the 3rd one, fixing it
[16:30:11] fixed (the mountpoint was wrong)
[16:30:24] https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/54
[16:48:03] I can see the new alerts in both https://prometheus.svc.toolforge.org/tools/alerts and https://prometheus.svc.beta.toolforge.org/tools/alerts
[16:51:21] I will delete the corresponding alerts from the mariadb table in metricsinfra-controller-2
[16:58:55] I'm stopping toolsdb replication for 5 mins to make sure the alert is firing as expected
[17:05:50] the alert has fired as expected, I can see it in alerts.wikimedia.org and in #-feed, plus I received it via email
[17:07:10] replication restarted, the alert has cleared
[17:07:58] approved !54
[17:08:21] taavi: thanks, merging that one too
[17:26:16] dhinus: there's an AlertLintingProblem alert that just popped up
[17:26:22] yep, seen it
[17:27:32] maybe pint doesn't like my "smart" way of disabling the page in toolsbeta by using "mountpoint=/srv/tools"
[17:27:57] do you know where the check is running so I can see the output?
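For context on the rule pint ends up flagging below: filtering on mountpoint="/srv/tools" (mentioned at 17:27:32) means the alert can only ever match series on hosts that actually mount the tools NFS share, so on toolsbeta the selector matches nothing. A rough sketch of such a rule is shown here; the alert name, threshold, and "for" duration are illustrative assumptions, not the actual contents of !54.

    groups:
      - name: nfs
        rules:
          - alert: ToolsNFSLowDiskSpace   # hypothetical name, for illustration only
            # The mountpoint filter restricts the rule to hosts mounting the tools
            # NFS share; on toolsbeta no series carry this mountpoint, which is
            # what the pint linter complains about.
            expr: node_filesystem_avail_bytes{mountpoint="/srv/tools"} / node_filesystem_size_bytes{mountpoint="/srv/tools"} < 0.10
            for: 15m
            labels:
              severity: page
            annotations:
              description: "Less than 10% free space on the tools NFS share (/srv/tools)"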
[17:28:51] ok, I can see the error here https://prometheus.svc.beta.toolforge.org/tools/graph?g0.expr=pint_problem+%3E+0&g0.tab=1
[17:28:56] it's a systemd service running on the prometheus node
[17:29:10] 'prometheus "tools" at http://127.0.0.1:9902/tools has "node_filesystem_avail_bytes" metric with "mountpoint" label but there are no series matching'
[17:29:23] that's the same as you get when you open the description label on karma
[17:29:38] ah nice, didn't see it
[17:30:13] now, the only fix I can think of is splitting that single alert to a separate file with deploy-tag: project-tools
[17:30:47] unless I can add a deploy-tag selectively to a single rule inside the file
[17:31:00] nope
[17:32:38] or maybe adding "or on() vector(0)" to the expr?
[17:32:59] which one do you think is cleaner?
[17:34:42] I am not a fan of the vector() hack, so probably the former? (or just rely on the existing hack that makes paging toolsbeta alerts not actually paging, and have it deployed to both)
[17:37:58] hmm, what is the existing hack?
[17:38:54] the one that just rewrites the severity label before sending the alert to AM
[17:40:09] that rings a bell but I forgot that it existed... where is it stored?
[17:40:18] I think I'll just use that
[17:44:08] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/toolforge/prometheus.pp#586
[17:44:58] thx
[17:53:15] taavi: https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/55
[17:54:34] CI failing?
[17:55:04] odd error, restarting
[17:55:47] fixed
[17:57:14] approved
[17:57:30] thx
[18:13:50] the pint alert has cleared
[18:14:02] I'll call it a day :)
[18:14:13] * dhinus off
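The "existing hack" settled on above (17:38:54) is alert relabelling that rewrites the severity label on toolsbeta alerts before they are sent to Alertmanager; the authoritative definition is the puppet code linked at 17:44:08. A minimal sketch of what such a stanza can look like in prometheus.yml follows; the project/severity label names and values are assumptions, not the puppet-managed config.

    alerting:
      alert_relabel_configs:
        # Downgrade severity=page to severity=warning for alerts coming from
        # toolsbeta, so the same rule file can be deployed to both tools and
        # toolsbeta without toolsbeta ever actually paging.
        # Label names and values here are assumptions for illustration.
        - source_labels: [project, severity]
          regex: 'toolsbeta;page'
          action: replace
          target_label: severity
          replacement: 'warning'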