[07:40:01] greetings
[08:28:13] morning
[08:30:03] minor haproxy annoyance: `http-response set-header` does not affect error pages at all
[08:30:25] :/
[08:30:53] taavi godog opinions about increasing the maxconn again (see discussion in -cloud)?
[08:32:49] if I'm reading the graph at https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/infra-k8s-haproxy correctly, there was a spike to 6k sessions last night?
[08:33:30] I'm not opposed to it if the next layers won't get overloaded, but I also have a feeling just increasing the limits won't make the problem go away
[08:34:23] yeah I'm also doubtful. do you know where the graph that dcar.o posted to T405280 is?
[08:34:24] T405280: [infra,haproxy,ingress] 2025-09-23 Ingress hitting the backend session limit and started replying with 5xxs - https://phabricator.wikimedia.org/T405280
[08:34:31] it seems to be a different one from the one I linked above
[08:35:12] ah I can see the grafana query at the bottom of the image
[08:35:45] but only for the request graphs, not for the sessions graph
[08:36:40] I don't think we have a session graph per tool
[08:37:11] I was looking for the graph titled "Server - Current number of active sessions" I see in the phab task; I think I found it under "By Server Sessions" in the same dashboard
[08:37:57] the shape is similar to "Sessions" under "Basic General Info" but the numbers are different, I'll check the definitions
[08:39:13] dhinus: checking
[08:39:46] one uses "haproxy_server_current_sessions", the other uses "haproxy_backend_current_sessions"
[08:40:42] haproxy_backend_current_sessions shows a clear flatline when it reaches 2k, which only happened for a couple of hours last night
[08:41:04] sorry, I meant haproxy_server_current_sessions shows the flatline
[08:41:13] haproxy_backend_current_sessions grows up to 6k+
[08:43:44] my understanding so far is that the k8s-ingress-http backend was in some kind of trouble and haproxy kept either queuing or opening new sessions for incoming requests (via the k8s-ingress-https frontend)
[08:43:50] does that check out?
[08:45:20] that does not explain backend_current_sessions growing, I think?
[08:45:50] also the flatline looks very much like rate limiting happening (I think the value that was recently increased by dcar.o)
[08:45:58] but I'm still missing some parts of the picture
[08:46:19] I'll post the two graphs to the phab task
[08:47:55] yes, definitely the flatline is from the limits that were recently increased
[08:51:23] looking at the front nginx logs I don't see any major changes to request patterns
[08:52:25] pasted the graphs at T405280
[08:52:26] T405280: [infra,haproxy,ingress] 2025-09-23 Ingress hitting the backend session limit and started replying with 5xxs - https://phabricator.wikimedia.org/T405280
[08:54:36] the average response time from ingress-nginx (as measured by haproxy in haproxy_server_http_response_time_average_seconds) starts rising at ~21:30 UTC. again, this is something where it'd be really useful to have per-tool stats
[08:56:45] indeed
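A minimal sketch of pulling the two session metrics compared above straight from the Prometheus HTTP API. The endpoint URL is a placeholder and the exact label names depend on which haproxy exporter is scraped, so treat the queries as illustrative rather than the dashboard's actual definitions.

```python
# Illustrative only: PROMETHEUS_URL is a placeholder, not a real WMCS endpoint,
# and label names (e.g. "server") vary between haproxy exporters.
import requests

PROMETHEUS_URL = "http://prometheus.example.org"  # placeholder

def instant_query(expr: str) -> list[dict]:
    """Run an instant query and return the result vector."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": expr},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Aggregate sessions haproxy holds towards the backend (the series that grew to 6k+):
backend_total = instant_query("sum(haproxy_backend_current_sessions)")

# Per-backend-server sessions, where the per-server 2k cap shows up as a flatline:
per_server = instant_query("sum by (server) (haproxy_server_current_sessions)")

for sample in per_server:
    print(sample["metric"].get("server", "?"), sample["value"][1])
```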
[08:57:35] for my own education, the current flow for *.toolforge.org is haproxy (off k8s) -> k8s ingress (nginx) -> ???
[08:57:52] is that correct?
[08:58:13] front nginx (which we want to remove) -> haproxy (off k8s) -> ingress-nginx (in k8s) -> tool pods
[08:58:58] ah got it, thank you taavi
[08:59:48] how does ingress-nginx map/route requests to tool pods?
[09:00:57] any objections to changing https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/infra-k8s-haproxy to default to UTC as opposed to browser time?
[09:01:04] got bitten again just now
[09:01:37] the ingress controller generates an absolutely massive nginx config file from kubernetes Ingress objects, which matches on the host header and then forwards traffic to the correct kubernetes Service (at which point it is kube-proxy's problem)
[09:01:38] yes please
[09:01:41] godog: please do, I think I did in a couple of dashboards but we have many that default to local time (I think for no reason other than it was the default when creating the dashboard)
[09:02:12] ack, doing
[09:02:35] taavi: got it, thank you for explaining
[09:03:06] re: ingress-nginx you can find more info at https://kubernetes.github.io/ingress-nginx/how-it-works/ (warning: not the most readable docs :P)
[09:03:29] cheers, will check that out
[09:10:56] heyo :), the graph I show in the task is manually made (the query is in the screenshot), but I did make this one that uses the same metric, though it slices it per response code and gets the top ones (https://grafana.wmcloud.org/d/RFhIBshHz/global-tools-stats?orgId=1&from=now-6h&to=now&timezone=browser)
[09:12:11] dhinus: I think the difference between 'haproxy_server_current_sessions' and 'haproxy_backend_current_sessions' is that one is the generic one (the bundled-up max of 8k) and the other is the per-backend-server one (with a 2k max per server)
[09:13:04] dcaro: ah that makes sense, thank you!
[09:20:55] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1191657
[09:27:02] nice, +1d
[09:28:22] taavi: btw yesterday I started reviewing your tofu-managed sg for haproxy, but then I got distracted. I'll try to complete the review today, I need a bit more time
[09:31:03] I added an 'availability' row to infra-k8s-haproxy since we're paging on it now, feel free to change at will though
[09:31:42] e.g. last night's session exhaustion https://grafana.wmcloud.org/goto/4NOihn3Ng?orgId=1
[09:35:09] also I noticed that https://nginx.org/en/docs/http/ngx_http_limit_conn_module.html documents a caveat where "[a] connection is counted only if it has a request being processed by the server and the whole request header has already been read."
[09:35:47] I don't have proper data to show that's the case, but based on a quick glance at last night's logs I'd have expected the limiter we put in place last time to have quick-rejected a lot more requests
[09:36:17] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1191583 proposes moving that limit to haproxy
[09:46:42] LGTM
[09:47:14] * godog bbiab
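Taavi's explanation above (ingress-nginx compiling kubernetes Ingress objects into one large nginx config that matches on the host header and hands traffic to a Service) can be poked at with the official kubernetes Python client. A rough sketch, assuming a kubeconfig with read access to Ingress objects; this is not a piece of WMCS tooling.

```python
# Rough sketch, not WMCS tooling: list the host -> Service mappings that
# ingress-nginx turns into its nginx config. Assumes the official `kubernetes`
# Python client and a kubeconfig with read access to Ingress objects.
from kubernetes import client, config

def dump_ingress_routes() -> None:
    config.load_kube_config()  # use config.load_incluster_config() inside a pod
    networking = client.NetworkingV1Api()
    for ing in networking.list_ingress_for_all_namespaces().items:
        for rule in ing.spec.rules or []:
            paths = rule.http.paths if rule.http else []
            for path in paths:
                svc = path.backend.service
                if svc is None:  # skip non-Service (resource) backends
                    continue
                port = svc.port.number or svc.port.name
                print(f"{rule.host or '*'}{path.path or '/'} -> "
                      f"{ing.metadata.namespace}/{svc.name}:{port}")

if __name__ == "__main__":
    dump_ingress_routes()
```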
[12:54:09] sorry I'm constantly nagging for reviews for something today.. but it seems like fixing T404072 uncovered a maintain-kubeusers bug (T405728), fix is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1191675/
[12:54:10] T404072: disable_tool fails to archive mpic-alpha-demo - https://phabricator.wikimedia.org/T404072
[12:54:11] T405728: JobUnavailable Reduced availability for job maintain_dbusers_eqiad in cloud@eqiad - https://phabricator.wikimedia.org/T405728
[12:55:10] LGTM
[12:55:30] also there are some toolforge harbor quota requests I don't feel qualified to review: T405643 T405644 T405645, in case someone familiar with that is around today
[12:55:31] T405643: Request increased build quota for cluebotng-review Toolforge tool - https://phabricator.wikimedia.org/T405643
[12:55:31] T405644: Request increased build quota for cluebotng-monitoring Toolforge tool - https://phabricator.wikimedia.org/T405644
[12:55:32] T405645: Request increased build quota for cluebotng Toolforge tool - https://phabricator.wikimedia.org/T405645
[12:57:02] I would probably wait for dcar.o for those quotas
[13:49:44] PSA: I added an example and a note here about using "openstack object save" to download a file from an object storage bucket to a cloudcontrol https://wikitech.wikimedia.org/wiki/Help:Object_storage_user_guide#OpenStack_CLI
[13:49:57] including a funny quirk where by default the file gets saved in an unexpected location :P
[14:04:12] unrelated, I created T405742 because the tofu pipelines keep on failing very frequently
[14:04:12] T405742: tofu-provisioning: Failed to install provider - https://phabricator.wikimedia.org/T405742
[14:08:10] speaking of which, https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/pipelines/136939 eventually succeeded, I'll run the noop apply
[14:08:40] great
[14:09:48] I'm logging off a bit early, have a good weekend!
[14:15:52] btw looks like puppet CI is failing for some wmcs mypy checks?
[14:15:54] 14:14:40 wmcs: commands[3]> mypy --check-untyped-defs modules/profile/files/wmcs
[14:15:56] 14:14:46 modules/profile/files/wmcs/services/maintain_dbusers/maintain_dbusers.py:598: error: No overload variant of "__getitem__" of "tuple" matches argument type "str" [call-overload]
[15:50:47] cdanis: hmmm, that same file passed earlier today in https://integration.wikimedia.org/ci/job/operations-puppet-tests-bullseye/17739/console, something must've changed in the environment since then
[15:51:01] taavi: so we did just rebuild the puppet CI docker image, but with a minimal change
[15:52:22] https://gerrit.wikimedia.org/r/1191701
[15:52:36] cdanis: it seems to have made tox dump a bunch of dependency versions, somehow https://phabricator.wikimedia.org/P83474
[15:52:38] not sure if we changed the mypy version or maybe typeshed or something?
[15:52:49] hmm
[16:00:16] cdanis: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1191719/
[16:01:02] nice, thanks!
[16:01:13] watching jenkins to see the wmcs run
[16:02:30] 16:02:09 wmcs: commands[3]> mypy --check-untyped-defs modules/profile/files/wmcs
[16:02:31] 16:02:17 Success: no issues found in 17 source files
[16:02:33] 16:02:17 wmcs: OK ✔ in 39.33 seconds
[16:06:42] cdanis: merged. of course mypy didn't prevent me from making a type error in that exact file earlier today, but it did cause this :/
[16:06:52] mypy 😔
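For context on the CI paste above: mypy's `No overload variant of "__getitem__" of "tuple" matches argument type "str"` message is what you get when a value annotated (or inferred) as a tuple is indexed with a string key. The snippet below is a hypothetical reconstruction of that class of mistake, not the actual code at maintain_dbusers.py:598.

```python
# Hypothetical illustration of the error class; not the real maintain_dbusers code.
from __future__ import annotations

from typing import Any

def pick(row: tuple[Any, ...], column: str) -> Any:
    # mypy rejects this line: tuples are indexed by int or slice, so no
    # overload of tuple.__getitem__ accepts a str key ([call-overload]).
    return row[column]

def pick_fixed(row: dict[str, Any], column: str) -> Any:
    # Annotating the row as a mapping (e.g. a dict-style DB cursor result)
    # makes string indexing type-check cleanly.
    return row[column]
```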