[07:05:42] * dhinus paged ToolforgeWebHighErrorRate: High 5xx rate on Toolforge
[07:07:45] the metric peaked at 45% and is going down, now it's around 30%
[07:14:25] the grafana haproxy dashboard shows a spike in sessions, but it's well below the 2k limit
[07:29:55] interesting: the frontend shows a decrease in 2xx and a corresponding increase in 5xx, whereas the backend shows only a _decrease_ in 2xx responses, but not a corresponding increase in 5xx.
[07:30:07] https://usercontent.irccloud-cdn.com/file/6DWWIUZ5/Screenshot%202025-09-28%20at%2009.27.07.png
[07:41:17] dhinus: heyo
[07:41:41] that might mean that haproxy was not able to get to the k8s ingress and was returning 504
[07:41:52] anything I can do to help?
[07:42:20] I'm looking at the haproxy logs but I'm not seeing anything useful
[07:44:04] some tools are working fine, but we're getting the first user reports T405848
[07:44:05] T405848: Tool is returning 500 - https://phabricator.wikimedia.org/T405848
[07:45:09] uh the homepage for that tool works fine, only a subpage returns 500
[07:46:09] hmm, I see flashes of journald getting into D state
[07:46:30] which host?
[07:46:41] tools-k8s-haproxy-5
[07:46:55] but short bursts only, the load is not too high (~0.5)
[07:47:02] I was expecting Phaultfinder to create a task but it didn't, I'll create one manually
[07:48:53] there are spikes of PSI, though I don't think they should be enough to make things fail
[07:48:57] https://usercontent.irccloud-cdn.com/file/4YO2et7g/image.png
[07:49:16] mostly iowait though
[07:52:16] T405850
[07:52:17] T405850: 2025-09-28 ToolforgeWebHighErrorRate: High 5xx rate on Toolforge web services - https://phabricator.wikimedia.org/T405850
[07:54:19] is there a way to see the error rates per-tool?
[07:55:03] not sure, on the ingress side yes, but requests might not be getting there, let me check
[07:55:30] this is what's reported by the ingress in k8s https://grafana-rw.wmcloud.org/d/RFhIBshHz/global-tools-stats?orgId=1&from=now-6h&to=now&timezone=browser
[07:56:30] I noticed there that tool-ftl had a big spike of 500s, though others do not
[07:57:31] tool-geohack saw a drop in reqs
[07:57:53] if you take that out of the graph, things look more or less normal
[07:57:55] yep, there's a big drop in requests
[07:58:04] https://usercontent.irccloud-cdn.com/file/IC759MIe/image.png
[07:58:17] that's 500s only (they are far fewer than any other status code)
[07:58:53] this matches what we're seeing in the haproxy backend graphs: short spike of 5xx, big drop in 2xx
[07:58:54] so it seems that some stuff was not really getting to the ingresses, though there are small spikes on the ingress side too
[07:59:30] I wonder if the missing geohack requests are getting terminated with 5xx at the haproxy layer, and not reaching the ingress
[08:00:24] in the haproxy logs I'm seeing lots of "k8s-ingress-https/"
[08:00:41] maybe haproxy flags backends as failed and then it just replies with 500, so connections don't get to the backend
[08:01:04] (and thus the same shape of traffic, but multiplied on the frontend side)
[08:01:23] https://usercontent.irccloud-cdn.com/file/AnNkAHdq/image.png
[08:04:45] this might be interesting
[08:04:48] https://usercontent.irccloud-cdn.com/file/41aFSxWh/image.png
[08:04:57] the aggregated cpu for all the k8s workers
[08:06:00] there's a spike in D state processes on toolforge workers too
[08:06:03] https://usercontent.irccloud-cdn.com/file/SxaI82td/image.png
[08:08:43] from the console of tools-k8s-worker-nfs-1
[08:08:43] [Sun Sep 28 06:49:12 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
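For reference, a minimal sketch of listing processes stuck in the D (uninterruptible sleep) state on a node, the usual symptom of a hung NFS mount seen above; it reads /proc directly, is generic Linux rather than anything Toolforge-specific, and is not the actual check behind the "tools stuck in D state" alert:

```python
#!/usr/bin/env python3
"""List processes in D (uninterruptible sleep) state by reading /proc.

Illustrative sketch only, not the check used by the real alert.
"""
import glob

for stat_path in glob.glob("/proc/[0-9]*/stat"):
    try:
        with open(stat_path) as f:
            data = f.read()
    except OSError:
        continue  # the process exited while we were scanning
    # /proc/<pid>/stat looks like "pid (comm) state ..."; comm may contain
    # spaces, so locate the state letter via the last closing parenthesis.
    pid = data.split()[0]
    comm = data[data.index("(") + 1 : data.rindex(")")]
    state = data[data.rindex(")") + 2]
    if state == "D":
        print(f"{pid}\t{comm}")
```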
[08:09:30] it's not triggering the "tools stuck in D state" alert though
[08:09:36] but we can try restarting that host
[08:09:37] yep :/
[08:09:47] let me try, yes
[08:09:58] there were a few that just went up and down though
[08:10:01] (ex. nfs-40)
[08:10:17] can you restart it while I keep looking at other things?
[08:10:22] yep
[08:15:55] oh, replica_cnf is failing to create two users' replica files (unrelated)
[08:16:47] the restart seems to have cleared the alerts
[08:25:38] I think the alert is gonna fire again because the metric is still > 0.25
[08:26:15] yep, I'm not sure the restart did much, but the trend was already heading that way
[08:28:10] * dhinus paged again
[08:32:31] I think there are a few tools being crawled that are stuck on nfs, for example catfood is getting crawled and returning 500s, running on tools-k8s-worker-nfs-19, let me double check if the process there is stuck, but it probably is
[08:33:03] yep, it's stuck
[08:34:46] morning, can I help?
[08:35:34] o/ haproxy is returning 503 for at least one tool (geohack)
[08:35:53] I'll reboot a bunch more of the workers
[08:35:57] requests are not getting to nginx-ingress (I think)
[08:36:43] taavi: is there a way to check why haproxy is returning 503? from the stats interface maybe? the logs don't show much
[08:38:30] I think most of the requests blocked by haproxy are request-denied
[08:38:51] https://usercontent.irccloud-cdn.com/file/eZP520Fq/image.png
[08:39:03] that would mean that it's pre-empting the clients in a way, I think
[08:39:39] that'd be the rate limiting kicking in, I think
[08:40:01] https://geohack.toolforge.org/ shows "The tool you are trying to access is currently receiving more traffic than it can handle."
[08:40:04] error requests are currently logged to /var/log/haproxy.log with a four-letter state code, described at https://wikitech.wikimedia.org/wiki/HAProxy/session_states
[08:40:16] most of those are 'PR--', aka being denied by access control rules or similar
[08:44:10] is haproxy rate-limiting based on hostname? can we see stats of those rates?
[08:44:30] I guess one option would be to raise https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/toolforge/k8s/haproxy/k8s-ingress.cfg.erb#75 to match the frontend maxconn limit, then we could change the alert to be based on haproxy_backend_http_responses_total (instead of the frontend equivalent) which would ignore the per-tool rate limiting
[08:44:53] dhinus: not without https://gerrit.wikimedia.org/r/c/operations/puppet/+/1191662 which I was planning to merge on Monday
[08:46:48] I guess it's the `denied` counter in the frontend stats, though it does not separate per tool, no
[08:46:52] https://usercontent.irccloud-cdn.com/file/nToXUijx/image.png
[08:48:11] so basically this is rate limiting working as intended, and we should not even alert
[08:48:54] there was also the nfs event
[08:49:01] (still rebooting nodes)
[08:49:30] do we have any proof the nfs event caused a visible increase in toolforge errors?
[08:49:56] spikes in
[08:49:59] https://usercontent.irccloud-cdn.com/file/Zpig7dqn/image.png
[08:50:01] that's on the ingress side
[08:50:11] (after haproxy)
[08:50:17] oh yes the spikes were NFS probably. then after the spikes, most of the 25% request errors are caused by the rate limiting on geohack
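As an aside on the four-letter session-state codes mentioned at 08:40, here is a rough sketch of tallying termination states from /var/log/haproxy.log, assuming the default HAProxy HTTP log format (field positions differ if the log format is customised); 'PR--' entries are the ones denied by the proxy itself, such as the rate limiting discussed above:

```python
#!/usr/bin/env python3
"""Tally HAProxy termination-state codes from an HTTP-format log.

Assumes the default httplog layout, where the termination state sits five
whitespace-separated fields after the Tq/Tw/Tc/Tr/Tt timers field.
"""
import collections
import re

TIMERS = re.compile(r"^-?\d+/-?\d+/-?\d+/-?\d+/\d+$")  # e.g. 10/0/30/69/109

counts = collections.Counter()
with open("/var/log/haproxy.log") as log:
    for line in log:
        fields = line.split()
        for i, field in enumerate(fields):
            if TIMERS.match(field) and len(fields) > i + 5:
                # after the timers come: status, bytes, request cookie,
                # response cookie, then the termination state
                counts[fields[i + 5]] += 1
                break

for state, n in counts.most_common():
    print(f"{n:8d}  {state}")
```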
[08:50:33] the initial spike was close to 50%
[08:50:41] maybe the increase in traffic made nfs go nuts and then the two combined
[08:50:54] dhinus: at least some part of it. there was a brief spike of backend error rates, but the rate limiting is working as expected to prevent a more widespread outage
[08:51:03] yep
[08:51:24] this might give a better view
[08:51:26] https://usercontent.irccloud-cdn.com/file/9MYgVUne/image.png
[08:52:10] I actually think that the rate limiting caused the general ingress 500s to go down xd
[08:52:23] (the higher metric there), but it spiked anyhow
[08:54:31] this is the aggregated 24h
[08:54:34] https://usercontent.irccloud-cdn.com/file/1OQMgkyQ/image.png
[08:54:45] so yep, I don't think it actually made much difference this morning
[08:54:53] (though it did happen)
[08:56:15] request-denied dropped
[08:56:20] did anyone change anything?
[08:56:41] https://usercontent.irccloud-cdn.com/file/z9w8QSWZ/image.png
[08:57:10] someone turned their crawler off?
[08:57:15] xd
[08:58:00] I have to go offline shortly, dcaro taavi please add more details to T405850 if you have time
[08:58:01] T405850: 2025-09-28 ToolforgeWebHighErrorRate: High 5xx rate on Toolforge web services - https://phabricator.wikimedia.org/T405850
[08:58:11] hmm... actually, the shape of backend errors is similar to the shape of the frontend errors
[08:58:12] taavi: probably after the ping in -cloud :P
[08:58:14] https://usercontent.irccloud-cdn.com/file/6irZaYPy/image.png
[08:58:20] there was an event yesterday night too, it seems
[08:59:08] (that's 24h)
[08:59:31] I have to go too though, I'll paste the graphs to the task later today
[08:59:36] thanks!
[08:59:38] same, I'll look on Monday at moving the alert to be based on backend metrics so that this rate limiting doesn't affect it
[08:59:57] thanks both! enjoy your Sunday :)
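To illustrate the backend-metric idea floated at 08:44 and 08:59, here is a hedged sketch of the ratio one could alert on instead, queried through the standard Prometheus HTTP API; the Prometheus URL is a placeholder, and the exact label set on haproxy_backend_http_responses_total (the `code` label is assumed here) depends on the exporter in use:

```python
#!/usr/bin/env python3
"""Compute the backend-side 5xx ratio from Prometheus.

Sketch only: counting responses at the haproxy backend ignores requests
denied at the frontend by per-tool rate limiting.
"""
import requests

PROM_URL = "http://prometheus.example.org"  # placeholder, not the real host

QUERY = (
    'sum(rate(haproxy_backend_http_responses_total{code="5xx"}[5m]))'
    " / "
    "sum(rate(haproxy_backend_http_responses_total[5m]))"
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    ratio = float(result[0]["value"][1])
    print(f"backend 5xx ratio over the last 5m: {ratio:.1%}")
else:
    print("no data returned for the query")
```

An alert built on this ratio (e.g. firing above 0.25, the threshold mentioned at 08:25) would track genuine backend failures rather than intentional request denials.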