[09:15:44] arnaudb/jelto: do you know what's the current maxRequestsPerConnection value in jetty for gerrit?
[09:16:19] the main difference between a regular client and ATS from gerrit PoV is that ATS will try to re-use connections as much as possible
[09:16:46] so I'm wondering if we are hitting some kind of limit on jetty that triggers a connection close from their side
[09:22:09] I was not able to find any setting in https://gerrit-review.googlesource.com/Documentation/config-gerrit.html and in our config regarding this setting, but I'm still digging
[09:23:22] we could try disabling keep alive on ATS as a temporary fix
[09:23:36] if that fixes it we know that's the culprit
[09:25:19] if that's possible without major problems on the performance side for other services it's worth trying.
[09:25:19] However there are also tons of config parameters we should check in gerrit config and apache (which sits in front of gerrit). I'm still digging in our config but it seems a lot of the ssh related settings are also relevant for cloning over https
[09:34:35] sure, I'll prepare a CR
[09:37:41] vgutierrez: there are also some apache settings like MaxKeepAliveRequests 100 or KeepAliveTimeout 5 which we might have to tweak? if apache is sitting in front of gerrit that could also be the reason for the closed connections?
[09:38:20] jelto: yes.. but dunno if you wanna tweak those while having the origin servers exposed to the Internet
[09:39:06] ack yes, probably just for a short test
[09:40:05] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1239888
[09:40:44] relevant doc is https://docs.trafficserver.apache.org/en/9.2.x/admin-guide/files/records.config.en.html#proxy-config-http-server-session-sharing-match
[09:42:31] fabfur / slyngs could I get a traffic sanity check on that CR?
[09:43:14] that's you fabfur..
I forgot s.lyngs is OoO :)
[09:59:33] let me check
[10:02:56] I think the CR is ok, if it's the only way to circumvent this (completely disabling connection reuse)
[10:04:02] well.. it's disabling connection re-use across clients
[10:04:34] jelto: I got a successful checkout with --jobs=16 in cp1110 after applying this
[10:05:07] thank you! I'll also try locally and on wmcs machines and monitor the metrics/logstash
[10:05:24] jelto: puppet still needs to apply the change globally
[10:05:36] second checkout in a row
[10:05:44] I'll trigger a puppet run on A:cp-text
[10:06:00] fabfur: this is a workaround, the fix is configuring apache/jetty properly
[10:06:09] okay, I'll wait until it's enabled fleet-wide
[10:23:44] so far I was not able to produce 502s, just 429s
[10:23:44] I can also post a short update in T417536 in a sec
[10:23:45] T417536: Investigate gerrit 5xx responses - https://phabricator.wikimedia.org/T417536
[10:26:14] jelto: yeah.. 429 with --jobs=16 is kinda expected
[10:26:37] yes
[10:28:56] and that shouldn't be a problem from the CI infrastructure anyways
[10:29:54] yes all of my submodule updates from wmcs were successful so far. I'll post a short update
[10:34:02] Traffic, collaboration-services, Gerrit, ci-test-error (WMF-deployed Build Failure), Patch-For-Review: Investigate gerrit 5xx responses - https://phabricator.wikimedia.org/T417536#11622215 (Jelto) @Vgutierrez suggested this issue might be related to the aggressive connection reuse from ats. G...
[10:35:24] same thing here, submodule update is triggering 429 but no 502
[10:35:46] cool
[10:35:52] let's see if CI is happier now
[11:21:17] Traffic, collaboration-services, Gerrit, ci-test-error (WMF-deployed Build Failure), Patch-For-Review: Investigate gerrit 5xx responses - https://phabricator.wikimedia.org/T417536#11622345 (Jelto) p:Unbreak!→High No more 502s appeared so far, I'll reduce the severity. Feel free to bu...
[11:43:50] vgutierrez: do you happen to know anything about https://gerrit.wikimedia.org/r/c/operations/puppet/+/1228583 and its revert?
[11:44:31] I just realized that it was reverted and we now run with rp_filter = 2 on all interfaces by default (on trixie)
[11:44:39] (instead of 1)
[12:00:13] Traffic, Infrastructure-Foundations, ServiceOps new, SRE: Trixie switches rp_filter from strict (1) to loose (2) for all interfaces - https://phabricator.wikimedia.org/T417632#11622455 (JMeybohm) /link {T352956}
[12:01:42] jayme we needed to revert it initially cause it broke tcp-proxy instances running trixie
[12:02:01] culprit was a missing hiera key on tcp-proxy puppetization
[12:02:33] got as much from the revert commit message. But it was unclear to me if there was/is any follow up
[12:03:03] I just realized that it got reverted while double checking things after the ipip deploy in k8s staging
[12:03:12] (one of the nodes runs trixie)
[12:04:53] vgutierrez: is that fixed for tcp-proxy instances? Should we just re-apply the initial patch from alex or use a different approach?
[12:06:39] yes, it's been fixed
[12:07:28] okidoke, I'll resend the patch
[12:07:55] oh...you did already
[12:08:01] hmm...I need coffee
[12:08:24] why didn't it stick then...
[12:10:27] Traffic, Infrastructure-Foundations, ServiceOps new, SRE: Trixie switches rp_filter from strict (1) to loose (2) for all interfaces - https://phabricator.wikimedia.org/T417632#11622481 (JMeybohm) I seem to lack caffeine, the revert was already reverted at: https://gerrit.wikimedia.org/r/c/operati...
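For reference, the rp_filter values discussed above (1 = strict, 2 = loose) are exposed per interface under /proc/sys/net/ipv4/conf/. The following is a minimal sketch for checking what a Linux host actually runs with, not anything taken from the puppet patches; the helper names are hypothetical. It relies on the kernel rule that the higher of conf/all and conf/<interface> is the value used for source validation on that interface.

```python
from pathlib import Path

# Meaning of each rp_filter value, per the Linux kernel ip-sysctl documentation.
RP_FILTER_MODES = {0: "off", 1: "strict", 2: "loose"}


def read_rp_filter(conf_dir="/proc/sys/net/ipv4/conf"):
    """Return {name: configured rp_filter value} for every entry in conf_dir.

    Returns an empty dict when conf_dir does not exist (e.g. non-Linux hosts).
    """
    base = Path(conf_dir)
    if not base.is_dir():
        return {}
    values = {}
    for entry in sorted(base.iterdir()):
        rp_file = entry / "rp_filter"
        if rp_file.is_file():
            values[entry.name] = int(rp_file.read_text().strip())
    return values


def effective_rp_filter(values):
    """Apply the kernel rule: max(conf/all, conf/<iface>) is used per interface."""
    global_value = values.get("all", 0)
    return {
        iface: max(global_value, value)
        for iface, value in values.items()
        if iface not in ("all", "default")  # pseudo-entries, not real interfaces
    }


if __name__ == "__main__":
    for iface, value in effective_rp_filter(read_rp_filter()).items():
        print(f"{iface}: {value} ({RP_FILTER_MODES.get(value, 'unknown')})")
```

Because of the max() rule, a host with conf/all set to 2 reports loose mode on every interface even where the per-interface value is still 1, which matches the "on all interfaces by default" observation above.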
[12:19:33] Traffic, Infrastructure-Foundations, ServiceOps new, SRE: Trixie switches rp_filter from strict (1) to loose (2) for all interfaces - https://phabricator.wikimedia.org/T417632#11622496 (JMeybohm)
[13:02:57] vgutierrez: apart from the rp_filter settings my understanding is that I can merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1237280/2 and run the migrate-service-ipip cookbook after for 'k8s-ingress-staging', right?
[13:03:27] netops, Infrastructure-Foundations: magru: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416442#11622614 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a08d08c2-5950-4fed-a93a-c3c2670c6e3e) set by ayounsi@cumin1003 for 2:00:00 on 2 host(s) and their services wi...
[13:03:43] that would also run the basic test of checking for a response to an encapsulated SYN packet
[13:11:25] netops, Infrastructure-Foundations: magru: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416442#11622638 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7f778211-928f-4b3a-b850-c48eecb6889b) set by ayounsi@cumin1003 for 2:00:00 on 2 host(s) and their services wi...
[13:13:58] netops, Infrastructure-Foundations: magru: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416442#11622661 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=99c5c739-235a-4c04-8c28-d8b0ef056232) set by ayounsi@cumin1003 for 2:00:00 on 40 host(s) and their services w...
[13:22:03] vgutierrez: would you mind taking a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1239936
[13:52:48] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11622821 (Jclark-ctr) Open→Resolved
[13:54:20] jayme: allow me to run some tests first please
[13:54:34] jayme: so those realservers should be able to handle IPIP traffic?
[14:01:00] jayme ^^ \o/ https://www.irccloud.com/pastebin/Ux4miAQ6/
[14:08:42] netops, Infrastructure-Foundations: magru: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416442#11622925 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=349bbdfb-74d0-421c-9a4c-07e586c71db9) set by ayounsi@cumin1003 for 2:00:00 on 3 host(s) and their services wi...
[14:11:09] Traffic, collaboration-services, Gerrit, ci-test-error (WMF-deployed Build Failure), Patch-For-Review: Investigate gerrit 5xx responses - https://phabricator.wikimedia.org/T417536#11622959 (hashar) That is great thank you :-] I am wondering though why reusing connections leads to errors. Ap...
[14:12:57] netops, Infrastructure-Foundations: magru: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416442#11622966 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=709d9f76-c776-4510-a7e0-1c4545cf4710) set by ayounsi@cumin1003 for 2:00:00 on 3 host(s) and their services wi...
[14:14:43] FIRING: HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp7008 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=magru&var-instance=cp7008 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[14:14:55] ^ magru depooled
[14:16:13] vgutierrez: yes and thanks :)
[14:17:43] moritzm: how can I force a refresh of apt repos in build2001 after uploading a new package to trixie-wikimedia?
[14:19:02] vgutierrez: so gtg for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1237280/2 and the migrate-service-ipip cookbook?
[14:19:29] jayme: hmm not sure if the cookbook will be happy with a k8s backed service
[14:19:43] RESOLVED: [8x] HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp7002 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[14:19:46] vgutierrez: after running the reprepro include command it'll be immediately available, only needs an "apt-get update" on the receiving client
[14:20:15] so I'm getting pbuilder-satisfydepends-dummy : Depends: golang-github-florianl-go-tc-dev which is a virtual package and is not provided by any available package
[14:20:49] but
[14:20:52] vgutierrez@apt1002:~$ sudo -i reprepro -C main list trixie-wikimedia |grep florian
[14:20:52] trixie-wikimedia|main|amd64: golang-github-florianl-go-tc-dev 0.4.7-1
[14:20:52] trixie-wikimedia|main|i386: golang-github-florianl-go-tc-dev 0.4.7-1
[14:21:17] which host are you trying to install it on? will have a look shortly
[14:21:26] I'm trying to build a package using it
[14:22:55] using a debian container with trixie I can install the package as expected
[14:23:04] that's why I thought something was missing on build2001
[14:23:34] do you use WIKIMEDIA=yes or DIST=foo-wikimedia? otherwise it doesn't add our WMF-specific component to the apt source for pbuilder
[14:23:49] vgutierrez@build2001:~/tcp-mss-clamper$ GBP_PBUILDER_DIST=trixie WIKIMEDIA=yes ARCH=amd64 DIST=trixie GIT_PBUILDER_AUTOCONF=no BACKPORTS=yes gbp buildpackage -jauto -us -uc -sa --git-builder=git-pbuilder --git-ignore-branch
[14:24:14] hmmm wait.. that DIST=trixie must be the culprit then
[14:24:47] try replacing it with DIST=trixie-wikimedia, it should be all that's needed
[14:26:11] vgutierrez: but the cookbook is required to run after the change?
I might be lacking context here and I did not find any guidance apart from the comments on T352956 which do say the cookbook needs to run
[14:26:11] T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956
[14:26:44] Traffic, collaboration-services, Gerrit, ci-test-error (WMF-deployed Build Failure), Patch-For-Review: Investigate gerrit 5xx responses - https://phabricator.wikimedia.org/T417536#11623052 (hashar) Jelto asked how we can find whether CI jobs are still failing. Jenkins store the console outpu...
[14:26:57] moritzm: nope.. no luck
[14:27:24] jayme: so the cookbook guides you through the process and validates that everything is OK
[14:27:34] jayme: but technically it's not needed
[14:27:47] you just need to merge the change, run puppet on the load balancers and restart pybal
[14:31:09] vgutierrez: ah, okay. Ofc. I would still like to use the cookbook since we'll have to do this for a bunch of services in prod. I'll give it a try
[14:31:31] jayme: ack, so give it a try and let me know
[14:31:45] vgutierrez: the cookbook asks to set the scheduler to mh right away but the patch alex prepared does not do so (keeps wrr). Do you know why?
[14:32:09] alex wanted to do that incrementally
[14:32:46] having a closer look at what's failing there
[14:33:01] jayme: we would need scheduler_flag: mh-port too BTW
[14:33:08] FIRING: SLOMetricAbsent: trafficserver-combined - https://slo.wikimedia.org/?search=trafficserver-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:33:12] given it's an internal service
[14:33:49] vgutierrez: yeah, the cookbook says so too. sorry
[14:34:06] ack, it's been a while since I last ran that :D
[14:34:25] so switching that later is just a matter of changing those options and restarting pybal again I suppose?
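For readers unfamiliar with the mh scheduler mentioned above: in IPVS, mh is the Maglev hashing scheduler, and the mh-port flag adds the source port to the hash input. The sketch below is a simplified, unweighted version of the Maglev table-population algorithm, written only to illustrate the idea. The hash functions, the table size of 13, and the backend names are assumptions for the example and do not reflect what IPVS, pybal, or liberica actually use.

```python
import hashlib


def _hash(value, salt):
    """Stable 64-bit hash; SHA-256 is an arbitrary choice for this sketch."""
    digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def _is_prime(n):
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def maglev_table(backends, size=13):
    """Build an unweighted Maglev lookup table of `size` slots (size must be prime)."""
    if not backends or not _is_prime(size) or size < len(backends):
        raise ValueError("need at least one backend and a prime size >= len(backends)")
    # Each backend walks its own permutation: (offset + j * skip) mod size.
    offsets = [_hash(b, "offset") % size for b in backends]
    skips = [_hash(b, "skip") % (size - 1) + 1 for b in backends]
    next_idx = [0] * len(backends)
    table = [None] * size
    filled = 0
    while True:
        # Backends take turns claiming their next preferred free slot.
        for i, backend in enumerate(backends):
            slot = (offsets[i] + next_idx[i] * skips[i]) % size
            while table[slot] is not None:
                next_idx[i] += 1
                slot = (offsets[i] + next_idx[i] * skips[i]) % size
            table[slot] = backend
            next_idx[i] += 1
            filled += 1
            if filled == size:
                return table


def lookup(table, src_ip, src_port, use_port=False):
    """Pick a backend for a flow; use_port=True mimics the idea behind mh-port."""
    key = f"{src_ip}:{src_port}" if use_port else src_ip
    return table[_hash(key, "flow") % len(table)]


# Hypothetical backend names, for illustration only.
table = maglev_table(["kube1", "kube2", "kube3"])
print(table)
print(lookup(table, "10.0.0.1", 40000, use_port=True))
```

Without the port in the key, every connection from one source address lands on the same backend, which is presumably why mh-port matters for an internal service whose clients are few.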
[14:36:24] indeed
[14:36:29] cool
[14:36:49] that's required cause in liberica there is no option of not using maglev
[14:36:58] it's the only supported scheduler
[14:37:04] yes, that's understood
[14:38:08] FIRING: [2x] SLOMetricAbsent: haproxy-combined - https://slo.wikimedia.org/?search=haproxy-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:40:35] vgutierrez: btw the ipip syn test is done by the cookbook when run in --dry-run
[14:40:56] yeah.. I didn't implement a dry-run mode of that
[14:41:05] which is good :)
[14:41:23] for my usecase at least
[14:43:08] RESOLVED: [2x] SLOMetricAbsent: haproxy-combined - https://slo.wikimedia.org/?search=haproxy-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:54:00] vgutierrez: I now realize that we will have to do this for the kubemaster/apiserver services as well and we probably never talked about that. Since that's not running in containers it won't have the lowered MTU there which means we need clamping on k8s masters or we have to reduce the MTU there (on the physical interface) as well :/
[14:54:47] amazing :D
[14:55:00] unless you wanna clamp those endpoints of course
[14:56:02] I would like not to for the same reasons ... networking wise the apiservers are as complex as regular workers (iptables mess, calico,...)
[14:56:16] yep
[14:56:26] so it looks like you need to go down the same path
[14:56:56] yes :/
[14:57:30] love it
[14:57:37] all alex's fault ofc :p
[14:57:50] of course
[15:01:21] btw. have you ever talked to other teams owning k8s clusters that have low-traffic services configured?
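The MTU concern raised at 14:54 comes down to header arithmetic: IPIP wraps each packet in an extra outer IPv4 header, so the encapsulated packet has less room than the link MTU allows. A rough worked example follows, assuming a 1500-byte Ethernet MTU, IPv4-in-IPv4 encapsulation, and headers without options. These figures are illustrative assumptions and not the values configured on the hosts discussed here.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
TCP_HEADER = 20   # bytes, TCP header without options


def ipip_inner_mtu(link_mtu):
    """MTU left for the inner packet after adding one outer IPv4 header (IPIP)."""
    return link_mtu - IPV4_HEADER


def tcp_mss(mtu):
    """Largest TCP payload that fits in a single IPv4 packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


link_mtu = 1500  # assumption: standard Ethernet MTU
inner_mtu = ipip_inner_mtu(link_mtu)

print(inner_mtu)          # 1480: room left for the encapsulated packet
print(tcp_mss(link_mtu))  # 1460: MSS a host advertises with a plain 1500 MTU
print(tcp_mss(inner_mtu)) # 1440: MSS that still fits once IPIP is in the path
```

Either option from the discussion gets a realserver to the smaller figure: lowering the interface MTU makes the host advertise the smaller MSS itself, while clamping rewrites the MSS on SYN packets and leaves the MTU alone.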
[15:02:56] might lead to a longer than expected tail for migrating everything if they are unaware...I'll raise this in the next k8s SIG
[15:03:12] they hate me already anyways for pushing for the k8s upgrade :p
[15:08:33] netops, Infrastructure-Foundations: magru: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416442#11623304 (ayounsi)
[15:08:44] netops, Infrastructure-Foundations: magru: upgrade routers & switches (2026) - https://phabricator.wikimedia.org/T416442#11623307 (ayounsi) Open→Resolved a:ayounsi
[15:08:54] vgutierrez: found the error, you're too bleeding edge and are actually the first to have used a hook on trixie :-) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1239963
[15:09:03] :D
[15:09:16] moritzm: ohh <3 thx
[15:17:44] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11623347 (Joe) >>! In T414805#11612457, @Krinkle wrote: >>>! In T414805#11612140, @gerritbot wrote: >> %%%[mediawiki/extensions/Wik...
[15:18:19] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11623348 (Joe) >>! In T414805#11620283, @Ladsgroup wrote: > FWIW: Out of 9K results in https://global-search.toolforge.org/?q=%5C%...
[15:18:30] jayme: we haven't talked to other k8s cluster owners nope
[15:20:23] W: OpenPGP signature verification failed: http://apt.wikimedia.org/wikimedia trixie-wikimedia InRelease: Sub-process /usr/bin/sqv returned an error code (1), error message is: Missing key B8A2DF05748F9D524A3A2ADE9D392D3FFADF18FB, which is needed to verify signature.
[15:20:23] E: The repository 'http://apt.wikimedia.org/wikimedia trixie-wikimedia InRelease' is not signed.
[15:20:28] moritzm: sorry :(
[15:29:15] yeah, this is also something that is new in trixie, currently working on a fix
[15:29:32] the hook uses apt-key to add the Wikimedia repository key
[15:29:44] yeah.. it's complaining about apt-key not being there
[15:29:48] but apt-key was removed in trixie, I'm currently fixing the hook
[15:41:30] Traffic, Prod-Kubernetes, ServiceOps new, Kubernetes, Patch-For-Review: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#11623456 (JMeybohm)
[15:42:10] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1239974 should fix it, I'll have it reviewed and then it should work tomorrow
[15:42:43] Traffic, Prod-Kubernetes, ServiceOps new, Kubernetes, Patch-For-Review: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#11623463 (JMeybohm)
[15:45:39] moritzm: thx again 🍻
[15:49:25] Traffic, Prod-Kubernetes, ServiceOps new, Kubernetes, Patch-For-Review: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#11623492 (JMeybohm) > What we did not account for is the fact that the kubernetes apiservers are low-traff...
[17:16:14] Traffic, cloud-services-team, Data-Services, Datasets-General-or-Unknown, Patch-For-Review: Move dumps.wikimedia.org HTTP service behind CDN edge - https://phabricator.wikimedia.org/T306550#11624133 (BCornwall) It would be helpful to state the desired goals here: Is this for protections via r...
[17:17:17] Traffic, DC-Ops, ops-eqsin, SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11624162 (RobH)
[17:42:23] Traffic, collaboration-services, Gerrit, ci-test-error (WMF-deployed Build Failure), Patch-For-Review: Investigate gerrit 5xx responses - https://phabricator.wikimedia.org/T417536#11624330 (Vgutierrez) >>!
In T417536#11622959, @hashar wrote: > That is great thank you :-] > > > I am wonderin...
[17:57:45] Traffic, Patch-For-Review: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11624409 (BCornwall)
[18:02:45] hi
[18:03:17] Traffic, Patch-For-Review: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11624456 (BCornwall)
[19:02:18] Traffic, API Platform, MediaWiki-User-login-and-signup, MediaWiki-Platform-Team (Q3 Kanban Board), Patch-For-Review: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007#11624755 (Tgr) Turns out this is harder than it seem...
[19:37:10] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11624885 (CDobbins)
[20:03:08] Traffic, SRE: Anycast ns[01].wikimedia.org for IPv4 - https://phabricator.wikimedia.org/T366193#11625005 (ssingh)
[21:18:39] Traffic, SRE: Anycast ns[01].wikimedia.org for IPv4 - https://phabricator.wikimedia.org/T366193#11625286 (BBlack) Re: anycast catchments, diversity, resilience, etc (some of this is re-treading things said above, but bear with me): The ideal state for anycast authdns is that you have multiple distinct (...
[21:24:25] Traffic, SRE: Anycast ns[01].wikimedia.org for IPv4 - https://phabricator.wikimedia.org/T366193#11625305 (BBlack) While I'm on these esoteric subjects - another bonus thing that some operators do, is place their nameserver *hostnames* in distinct TLDs operated by distinct operators. For example, having...
[21:46:26] Traffic: wmfuniq_experiment_fetcher.py fails to run on Trixie - https://phabricator.wikimedia.org/T417476#11625354 (BCornwall) In progress→Resolved a:BCornwall Fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1239602
[22:01:42] Traffic, Patch-For-Review: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11625383 (BCornwall)