[07:05:42] * dhinus paged ToolforgeWebHighErrorRate: High 5xx rate on Toolforge 
[07:07:45] <dhinus>	 the metric peaked at 45% and is going down, now it's around 30%
[07:14:25] <dhinus>	 the grafana haproxy dashboard shows a spike in sessions, but it's well below the 2k limit
[07:29:55] <dhinus>	 interesting: the frontend shows a decrease in 2xx and a corresponding increase in 5xx, whereas the backend shows only a _decrease_ in 2xx responses, but not a corresponding increase in 5xx.
[07:30:07] <dhinus>	  https://usercontent.irccloud-cdn.com/file/6DWWIUZ5/Screenshot%202025-09-28%20at%2009.27.07.png
[07:41:17] <dcaro>	 dhinus: heyo
[07:41:41] <dcaro>	 that might mean that the haproxy was not able to get to k8s ingress and returning 504
[07:41:52] <dcaro>	 anything I can do to help?
[07:42:20] <dhinus>	 I'm looking at the haproxy logs but I'm not seeing anything useful
[07:44:04] <dhinus>	 some tools are working fine, but we're getting the first user reports T405848
[07:44:05] <stashbot>	 T405848: Tool is returning 500 - https://phabricator.wikimedia.org/T405848
[07:45:09] <dhinus>	 uh the homepage for that tool works fine, only a subpage returns 500
[07:46:09] <dcaro>	 hmm, I see flashes of journald getting into D state
[07:46:30] <dhinus>	 which host?
[07:46:41] <dcaro>	 tools-k8s-haproxy-5
[07:46:55] <dcaro>	 but short bursts only, the load is not too high (~0.5)
[07:47:02] <dhinus>	 I was expecting Phaultfinder to create a task but it didn't, I'll create one manually
[07:48:53] <dcaro>	 there's spikes of PSI, though I don't think it should be enough to make things fail
[07:48:57] <dcaro>	 https://usercontent.irccloud-cdn.com/file/4YO2et7g/image.png
[07:49:16] <dcaro>	 mostly iowaiting thought
[07:52:16] <dhinus>	 T405850
[07:52:17] <stashbot>	 T405850: 2025-09-28 ToolforgeWebHighErrorRate: High 5xx rate on Toolforge web services - https://phabricator.wikimedia.org/T405850
[07:54:19] <dhinus>	 is there a way to see the error rates per-tool?
[07:55:03] <dcaro>	 not sure, on the ingress side yes, but might not be getting there, let me check
[07:55:30] <dcaro>	 this is what's reported by the ingress in k8s https://grafana-rw.wmcloud.org/d/RFhIBshHz/global-tools-stats?orgId=1&from=now-6h&to=now&timezone=browser
[07:56:30] <dcaro>	 I noticed there that tool-ftl had a big spike of 500s, though others do not
[07:57:31] <dhinus>	 tool-geohack saw a drop in reqs
[07:57:53] <dhinus>	 if you take that out of the graph, things look more or less normal
[07:57:55] <dcaro>	 yep, there's a big drop in requests
[07:58:04] <dcaro>	 https://usercontent.irccloud-cdn.com/file/IC759MIe/image.png
[07:58:17] <dcaro>	 that's 500s only (they are way less than any other status code)
[07:58:53] <dhinus>	 this matches what we're seeing in the haproxy backend graphs: short spike of 5xx, big drop in 2xx
[07:58:54] <dcaro>	 so it seems that some stuff was not really getting to the ingresses, though there's small spikes on the ingress side too
[07:59:30] <dhinus>	 I wonder if the missing geohack requests are getting terminated with 5xx at the haproxy layer, and not reaching the ingress
[08:00:24] <dhinus>	 in the haproxy logs I'm seeing lots of "k8s-ingress-https/<NOSRV>"
[08:00:41] <dcaro>	 maybe haproxy flags backends as failed and then it just replies with 500, so connections don't get to the backend
[08:01:04] <dcaro>	 (and thus the same shape of traffic, but multipiled in the frontend side)
[08:01:23] <dcaro>	 https://usercontent.irccloud-cdn.com/file/AnNkAHdq/image.png
[08:04:45] <dcaro>	 this might be interesting
[08:04:48] <dcaro>	 https://usercontent.irccloud-cdn.com/file/41aFSxWh/image.png
[08:04:57] <dcaro>	 the aggregated cpu for all the k8s workers
[08:06:00] <dcaro>	 there's a spike in D state process in toolforge workers too
[08:06:03] <dcaro>	 https://usercontent.irccloud-cdn.com/file/SxaI82td/image.png
[08:08:43] <dcaro>	 from the console of tools-k8s-worker-nfs-1
[08:08:43] <dcaro>	 [Sun Sep 28 06:49:12 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
[08:09:30] <dhinus>	 it's not triggering the "tools stuck in D state" alert though
[08:09:36] <dhinus>	 but we can try restarting that host
[08:09:37] <dcaro>	 yep :/
[08:09:47] <dcaro>	 let me try yes
[08:09:58] <dcaro>	 there were a few that just went up and down though
[08:10:01] <dcaro>	 (ex. nfs-40)
[08:10:17] <dhinus>	 can you restart it while I keep looking at other things?
[08:10:22] <dcaro>	 yep
[08:15:55] <dcaro>	 oh, replica_cnf is failing to create two user's replica files (unrelated)
[08:16:47] <dhinus>	 the restart seems to have cleared the alerts
[08:25:38] <dhinus>	 I think the alert is gonna fire again because the metric is still > 0.25
[08:26:15] <dcaro>	 yep, I'm not sure the restart did much, but the trend was just getting around there
[08:28:10] * dhinus paged again
[08:32:31] <dcaro>	 I think that there's a few tools being crawled that are stuck on nfs, for example, catfood is getting crawled, and returing 500s, running on tools-k8s-worker-nfs-19, let me double check if the process there is stuck but it's probably
[08:33:03] <dcaro>	 yep, it's stuck
[08:34:46] <taavi>	 morning, can I help?
[08:35:34] <dhinus>	 o/ haproxy is returning 503 for at least one tool (geohack)
[08:35:53] <dcaro>	 I'll reboot a bunch more of the workers
[08:35:57] <dhinus>	 requests are not getting to nginx-ingress (I think)
[08:36:43] <dhinus>	 taavi: is there a way to check why haproxy is returning 503? from the stats interface maybe? the logs don't show much
[08:38:30] <dcaro>	 I think most of the requests blocked by haproxy are request-denied
[08:38:51] <dcaro>	 https://usercontent.irccloud-cdn.com/file/eZP520Fq/image.png
[08:39:03] <dcaro>	 that would mean that it's pre-empting the clients in a way I think
[08:39:39] <dcaro>	 that'd be the rate limiting acting I htink
[08:40:01] <dhinus>	 https://geohack.toolforge.org/ shows "The tool you are trying to access is currently receiving more traffic than it can handle."
[08:40:04] <taavi>	 error requests are currently logged into /var/log/haproxy.log, which has a four-letter state code described at https://wikitech.wikimedia.org/wiki/HAProxy/session_states
[08:40:16] <taavi>	 most of those are 'PR--', aka being denied by access control rules or similar
[08:44:10] <dhinus>	 is haproxy rate-limiting based on hostname? can we see stats of those rates?
[08:44:30] <taavi>	 I guess one option would be to raise https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/toolforge/k8s/haproxy/k8s-ingress.cfg.erb#75 to match the frontend maxconn limit, then we could change the alert to be based on haproxy_backend_http_responses_total (instead of the frontend
[08:44:30] <taavi>	 equivalent) which would ignore the per-tool rate limiting
[08:44:53] <taavi>	 dhinus: not without https://gerrit.wikimedia.org/r/c/operations/puppet/+/1191662 which I was planning to merge on monday
[08:46:48] <dcaro>	 I guess it's the `denied` counter in the frontend stats, though it does not separate per tool no
[08:46:52] <dcaro>	 https://usercontent.irccloud-cdn.com/file/nToXUijx/image.png
[08:48:11] <dhinus>	 so basically this is rate limiting working as intended, and we should not even alert
[08:48:54] <dcaro>	 there was also the nfs event
[08:49:01] <dcaro>	 (still rebooting nodes)
[08:49:30] <dhinus>	 do we have any proof the nfs event caused a visible increase in toolforge errors?
[08:49:56] <dcaro>	 spikes in 
[08:49:59] <dcaro>	 https://usercontent.irccloud-cdn.com/file/Zpig7dqn/image.png
[08:50:01] <dcaro>	 that's on the ingress side
[08:50:11] <dcaro>	 (after haproxy)
[08:50:17] <dhinus>	 oh yes the spikes were NFS probably. then after the spikes, most of the 25% request errors is caused by the rate limiting on geohack
[08:50:33] <dhinus>	 the initial spike was close to 50%
[08:50:41] <dcaro>	 maybe the increase in traffic made nfs go nuts and then combined
[08:50:54] <taavi>	 dhinus: at least some part of it. there was a brief spike of backend error rates, but the rate limiting is working as expected to prevent a more wide-spread outage
[08:51:03] <dcaro>	 yep
[08:51:24] <dcaro>	 this might give a better view
[08:51:26] <dcaro>	 https://usercontent.irccloud-cdn.com/file/9MYgVUne/image.png
[08:52:10] <dcaro>	 I actually think that the rate limiting caused the general ingress 500s to lower xd
[08:52:23] <dcaro>	 (the higher metric there), but it spiked avyhow
[08:54:31] <dcaro>	 this is the aggregated 24h
[08:54:34] <dcaro>	 https://usercontent.irccloud-cdn.com/file/1OQMgkyQ/image.png
[08:54:45] <dcaro>	 so yep, I don't think it actually made much difference this morning
[08:54:53] <dcaro>	 (though it did happen)
[08:56:15] <dcaro>	 request denied dropped
[08:56:20] <dcaro>	 anyone changed anything?
[08:56:41] <dcaro>	 https://usercontent.irccloud-cdn.com/file/z9w8QSWZ/image.png
[08:57:10] <taavi>	 someone turned their crawler off?
[08:57:15] <dcaro>	 xd
[08:58:00] <dhinus>	 I have to go offline shortly, dcaro taavi please add more details to T405850 if you have time
[08:58:01] <stashbot>	 T405850: 2025-09-28 ToolforgeWebHighErrorRate: High 5xx rate on Toolforge web services - https://phabricator.wikimedia.org/T405850
[08:58:11] <dcaro>	 hmm... actually, the shape af backend errors is similar to the shape of the frontend errors
[08:58:12] <dhinus>	 taavi: probably after the ping in -cloud :P
[08:58:14] <dcaro>	 https://usercontent.irccloud-cdn.com/file/6irZaYPy/image.png
[08:58:20] <dcaro>	 there was an event yesteday night it seems too
[08:59:08] <dcaro>	 (that's 24h)
[08:59:31] <dcaro>	 I have to go too though, I'll paste the graphs later today to the task
[08:59:36] <dhinus>	 thanks!
[08:59:38] <taavi>	 same, I'll look at monday at moving the alert to be based on backend metrics so that this rate-limiting doesn't affect it
[08:59:57] <dhinus>	 thanks both! enjoy your sunday :)