[06:51:22] greetings
[06:54:52] reading backscroll, I was wondering re: rate-limit http status code back to the client, ATM is 500 did I get it right? IMHO should be 429
[06:55:18] i.e. nothing necessarily wrong server side but rather client side
[07:25:33] godog: i was considering that too, but if the issue is that the tool is in general getting more traffic than it can handle (and not that that specific client is sending too much traffic), then a 503 felt more appropriate
[07:38:00] yeah I can see the argument both ways
[07:40:15] FWIW I don't really feel strongly either way
[08:03:57] morning
[08:36:47] morning!
[09:27:54] this should help detect stuck workers https://gerrit.wikimedia.org/r/c/operations/puppet/+/1192070
[09:29:25] checking
[09:29:26] I'm testing it, but the idea is to check for the stuck lsof, and only then flag the processes/worker as stuck on nfs
[09:32:35] just tested on an nfs worker that's not stuck, did a couple fixes
[09:35:04] lgtm overall
[09:35:29] ack, I'll test and fix anything that comes up and put up for review again
[09:44:11] dcaro: also FWIW feel free to add me as a reviewer to gerrit, the notifications hit my inbox in that case and I see them
[09:44:23] 👍
[09:44:38] sadly I wasn't able to do the same yet with gitlab, the emails carry a header and gmail doesn't index all headers
[09:44:55] it does index gerrit headers though, that's how I'm able to filter messages for me to my inbox
[09:47:42] I mean technically possible with the help of google apps scripts
[09:48:04] "X-GitLab-NotificationReason: mentioned" for the curious
[09:57:50] I try to use that on my mutt config, but I have not yet found an effective workflow (too many emails)
[09:59:25] fyi. you might see a fleeting puppet failure alert or systemd timer error on toolsbeta/tools, it's me testing
[10:12:26] changed the graphs already, getting some data :)
[10:12:27] https://grafana-rw.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&var-cluster=tools&from=now-1h&to=now&timezone=utc&var-cluster_datasource=P8433460076D33992&forceLogin=true
[10:12:39] now we have also stuck per-tool
[10:12:39] https://grafana-rw.wmcloud.org/d/RFhIBshHz/global-tools-stats?orgId=1&from=now-6h&to=now&timezone=utc
[10:13:25] neat
[10:14:00] very nice!
[10:16:38] leaving, I won't be around irc/laptop so much in the next few days and I'll catch up later
[10:18:25] ack, cya!
[10:34:02] this updates the current alert (warning only) https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/41
[10:36:36] and the cookbook side for rebooting the workers https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1192090
[11:05:54] * dcaro lunch
[16:27:49] * dhinus off
[16:45:16] * dcaro off
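
Note on the 429 vs 503 exchange at the top of the log: the two positions differ only in who is considered responsible for the rejection. A minimal Python sketch of that distinction follows. The function name, the `client_over_quota` flag and the 30 second default are hypothetical and only illustrate the argument; this is not how the Toolforge rate limit is actually implemented.

```python
from http import HTTPStatus


def rate_limit_response(client_over_quota: bool, retry_after_s: int = 30):
    """Pick a status for a request rejected by a rate limiter (illustrative only).

    client_over_quota=True  -> this caller exceeded its own limit: 429
    client_over_quota=False -> the tool as a whole is saturated:   503
    """
    status = (
        HTTPStatus.TOO_MANY_REQUESTS
        if client_over_quota
        else HTTPStatus.SERVICE_UNAVAILABLE
    )
    # Retry-After is allowed on both 429 and 503 responses.
    headers = {"Retry-After": str(retry_after_s)}
    return status.value, status.phrase, headers
```

With a per-client limit the first case applies and 429 tells the caller to slow down; with a global limit there is no single offending client, which is the argument made above for 503.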