[09:08:26] morning
[09:09:41] morning
[09:24:32] o/
[11:48:36] greetings, I'm traveling back home today though I took another look at T412506 and the easiest thing for now would be to bump scrape_timeout for the kube-state-metrics job
[11:48:36] T412506: Investigation into ToolforgeKubernetesNodeNotReady 2025-12-12 page - https://phabricator.wikimedia.org/T412506
[11:49:16] in case someone has cycles to do it today, if not I'll do it tomorrow
[12:07:56] https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1094 upgrades kube-state-metrics
[13:06:27] neat! LGTM
[13:09:30] !1095 LGTM too
[13:10:49] modulo the uninstall + reinstall
[13:26:36] > You should run a new pipeline, because the target branch has changed for this merge request.
[13:29:06] /o\
[13:39:17] deploying the kube-state-metrics sharding patch, there might be a slight blip in the availability of those metrics
[13:39:38] ack
[13:45:51] (done)
[13:46:01] taavi: I didn't realize in this case individual pods would be scraped, meaning effectively metrics are tripled unless we do aggregation everywhere
[13:46:25] maybe that's fine though
[13:47:48] godog: basically the sharding mode in kube-state-metrics means that each pod is only handling 1/n of the resources, so the total metric count stays the same: https://prod-misc-upload.public.object.majava.org/taavi/YmgFhhEhcV7Bo.png
[13:50:14] ok sweet!
[13:52:38] gtg
[15:39:06] dhinus: are you planning to merge the tools-db/nfs alerting patches soon?
[15:54:54] taavi: sorry, didn't see the ping. yes, but I'm double checking a few things
[15:54:59] I can also do that tomorrow
[15:55:52] are you ok with my suggestion here? https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/52#note_178286
[16:02:36] dhinus: yeah, that's fine I think
[16:14:47] I pushed that change, but I messed up the rebase :)
[16:16:28] actually I think it's fine but gitlab's "Compare with previous version" shows the changes coming from the rebase, which is confusing
[16:18:18] taavi: can you double check and approve the MRs?
[16:19:22] ship it
[16:19:28] thanks!
[16:20:46] ha, didn't think that the failing pipeline prevents merge.. I'll try merging the 2nd MR in the stack first
[16:25:09] that worked, 1st and 2nd merged, re-running the pipeline for the 3rd
[16:26:32] uh, I just found an error in the 3rd one, fixing it
[16:30:11] fixed (the mountpoint was wrong)
[16:30:24] https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/54
[16:48:03] I can see the new alerts in both https://prometheus.svc.toolforge.org/tools/alerts and https://prometheus.svc.beta.toolforge.org/tools/alerts
[16:51:21] I will delete the corresponding alerts from the mariadb table in metricsinfra-controller-2
[16:58:55] I'm stopping toolsdb replication for 5 mins to make sure the alert is firing as expected
[17:05:50] the alert has fired as expected, I can see it in alerts.wikimedia.org and in #-feed, plus I received it via email
[17:07:10] replication restarted, the alert has cleared
[17:07:58] approved !54
[17:08:21] taavi: thanks, merging that one too
[17:26:16] dhinus: there's an AlertLintingProblem alert that just popped up
[17:26:22] yep, seen it
[17:27:32] maybe pint doesn't like my "smart" way of disabling the page in toolsbeta by using "mountpoint=/srv/tools"
[17:27:57] do you know where the check is running so I can see the output?
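For context on the rule pint ends up flagging below: filtering on mountpoint="/srv/tools" (mentioned at 17:27:32) means the alert can only ever match series on hosts that actually mount the tools NFS share, so on toolsbeta the selector matches nothing. A rough sketch of such a rule is shown here; the alert name, threshold, and "for" duration are illustrative assumptions, not the actual contents of !54.

    groups:
      - name: nfs
        rules:
          - alert: ToolsNFSLowDiskSpace   # hypothetical name, for illustration only
            # The mountpoint filter restricts the rule to hosts mounting the tools
            # NFS share; on toolsbeta no series carry this mountpoint, which is
            # what the pint linter complains about.
            expr: node_filesystem_avail_bytes{mountpoint="/srv/tools"} / node_filesystem_size_bytes{mountpoint="/srv/tools"} < 0.10
            for: 15m
            labels:
              severity: page
            annotations:
              description: "Less than 10% free space on the tools NFS share (/srv/tools)"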
[17:28:51] ok, I can see the error here https://prometheus.svc.beta.toolforge.org/tools/graph?g0.expr=pint_problem+%3E+0&g0.tab=1
[17:28:56] it's a systemd service running on the prometheus node
[17:29:10] 'prometheus "tools" at http://127.0.0.1:9902/tools has "node_filesystem_avail_bytes" metric with "mountpoint" label but there are no series matching'
[17:29:23] that's the same as you get when you open the description label on karma
[17:29:38] ah nice, didn't see it
[17:30:13] now, the only fix I can think of is splitting that single alert to a separate file with deploy-tag: project-tools
[17:30:47] unless I can add a deploy-tag selectively to a single rule inside the file
[17:31:00] nope
[17:32:38] or maybe adding "or on() vector(0)" to the expr?
[17:32:59] which one do you think is cleaner?
[17:34:42] I am not a fan of the vector() hack, so probably the former? (or just rely on the existing hack that makes paging toolsbeta alerts not actually paging, and have it deployed to both)
[17:37:58] hmm, what is the existing hack?
[17:38:54] the one that just rewrites the severity label before sending the alert to AM
[17:40:09] that rings a bell but I forgot that it existed... where is it stored?
[17:40:18] I think I'll just use that
[17:44:08] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/toolforge/prometheus.pp#586
[17:44:58] thx
[17:53:15] taavi: https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/55
[17:54:34] CI failing?
[17:55:04] odd error, restarting
[17:55:47] fixed
[17:57:14] approved
[17:57:30] thx
[18:13:50] the pint alert has cleared
[18:14:02] I'll call it a day :)
[18:14:13] * dhinus off
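The "existing hack" settled on above (17:38:54) is alert relabelling that rewrites the severity label on toolsbeta alerts before they are sent to Alertmanager; the authoritative definition is the puppet code linked at 17:44:08. A minimal sketch of what such a stanza can look like in prometheus.yml follows; the project/severity label names and values are assumptions, not the puppet-managed config.

    alerting:
      alert_relabel_configs:
        # Downgrade severity=page to severity=warning for alerts coming from
        # toolsbeta, so the same rule file can be deployed to both tools and
        # toolsbeta without toolsbeta ever actually paging.
        # Label names and values here are assumptions for illustration.
        - source_labels: [project, severity]
          regex: 'toolsbeta;page'
          action: replace
          target_label: severity
          replacement: 'warning'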