[08:11:39] Traffic, MediaWiki-Platform-Team: Higher rate limit request - https://phabricator.wikimedia.org/T417854#11630850 (Aklapper) @Rtconner: Hi, how is this website related to Wikimedia? Where to find its user data policy or privacy policy? I am unsure if this use is in scope for https://meta.wikimedia.org/wik...
[08:11:52] Traffic, MediaWiki-Platform-Team: Higher OAuth rate limit (tier) request for external website - https://phabricator.wikimedia.org/T417854#11630852 (Aklapper)
[09:54:20] Traffic, Infrastructure-Foundations, ServiceOps new, SRE: Trixie switches rp_filter from strict (1) to loose (2) for all interfaces - https://phabricator.wikimedia.org/T417632#11631170 (MoritzMuehlenhoff) >>! In T417632#11627179, @JMeybohm wrote: > @ayounsi suggested we could remove `linux-sysctl...
[10:05:18] Traffic, Prod-Kubernetes, ServiceOps new, Kubernetes, Patch-For-Review: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#11631206 (Vgutierrez) blast radius is big.. I'm wondering if k8s nodes have workloads not exposed to the I...
[10:09:48] Traffic, Infrastructure-Foundations, ServiceOps new, SRE: Trixie switches rp_filter from strict (1) to loose (2) for all interfaces - https://phabricator.wikimedia.org/T417632#11631212 (JMeybohm) >>! In T417632#11631170, @MoritzMuehlenhoff wrote: > Can't we simply simply override net.ipv4.conf.*....
[10:15:15] Traffic, API Platform, MediaWiki-User-login-and-signup, MediaWiki-Platform-Team (Q3 Kanban Board), Patch-For-Review: Login with `action=login` and bot password does not create a JWT session cookie - https://phabricator.wikimedia.org/T415007#11631215 (OWresch-WMF) p:Triage→High
[10:43:05] Traffic, Infrastructure-Foundations, ServiceOps new, SRE: Trixie switches rp_filter from strict (1) to loose (2) for all interfaces - https://phabricator.wikimedia.org/T417632#11631314 (MoritzMuehlenhoff) Alternatively we could also rebuild linux-base for trixie-wikimedia and drop the rp-filter s...
[10:51:32] netops, Infrastructure-Foundations: codfw: upgrade routers (2026) - https://phabricator.wikimedia.org/T417871 (ayounsi) NEW p:Triage→Low
[10:51:42] netops, Infrastructure-Foundations: codfw: upgrade routers (2026) - https://phabricator.wikimedia.org/T417871#11631340 (ayounsi)
[10:51:43] netops, Traffic, Infrastructure-Foundations: 2026 Junos upgrade - https://phabricator.wikimedia.org/T416444#11631341 (ayounsi)
[10:52:30] netops, Infrastructure-Foundations: codfw: upgrade routers (2026) - https://phabricator.wikimedia.org/T417871#11631342 (ayounsi)
[10:53:39] netops, Infrastructure-Foundations: eqiad: upgrade routers (2026) - https://phabricator.wikimedia.org/T417873 (ayounsi) NEW p:Triage→Low
[10:53:47] netops, Infrastructure-Foundations: eqiad: upgrade routers (2026) - https://phabricator.wikimedia.org/T417873#11631372 (ayounsi)
[10:53:49] netops, Traffic, Infrastructure-Foundations: 2026 Junos upgrade - https://phabricator.wikimedia.org/T416444#11631373 (ayounsi)
[10:54:52] Traffic, Infrastructure-Foundations, ServiceOps new, SRE: Trixie switches rp_filter from strict (1) to loose (2) for all interfaces - https://phabricator.wikimedia.org/T417632#11631374 (JMeybohm) >>! In T417632#11631314, @MoritzMuehlenhoff wrote: > Alternatively we could also rebuild linux-base f...
[11:45:23] dear traffic, since we merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1239651/6/hieradata/common/service.yaml for mw-parsoid
[11:45:38] what are we missing to make pybal not worry about mw-parsoid?
[12:46:08] Traffic, MediaWiki-extensions-OAuth, MediaWiki-Platform-Team: OAuth requests from Zeto app get throttled - https://phabricator.wikimedia.org/T417854#11631675 (Tgr)
[12:47:44] Traffic, MediaWiki-extensions-OAuth, MediaWiki-Platform-Team (Radar): OAuth requests from Zeto app get throttled - https://phabricator.wikimedia.org/T417854#11631679 (Tgr)
[12:49:40] effie: not worry meaning?
[12:51:51] vgutierrez: I am redeploying this service and it is not being used by any production service
[12:53:34] effie: so you'll get some alerts if you are too aggressive with the re-deploy of the service
[12:53:52] vgutierrez: I thought that switching page to false
[12:53:55] would spare us
[12:53:58] but nothing you could avoid
[12:54:39] you won't trigger pages
[12:54:51] but warnings... mainly PybalBackendDown
[12:55:09] ok ok
[12:55:17] I should have it sorted soon hopefully
[12:55:40] that was from a quick check on the alerts repo
[12:55:51] I don't remember right now if we still have some pure icinga alerts for pybal
[12:56:19] oh yes we have
[12:57:01] but it's still a warning
[12:57:03] it should be fine
[13:20:45] Traffic, Infrastructure-Foundations, ServiceOps new, SRE: Trixie switches rp_filter from strict (1) to loose (2) for all interfaces - https://phabricator.wikimedia.org/T417632#11631892 (JMeybohm) Open→Resolved a:JMeybohm A patched package (from trixie-proposed-updates) has been up...
[13:23:23] Traffic, Data-Engineering, MW-Interfaces-Team, MediaWiki-Platform-Team (Radar), OKR-Work: haproxy: capture x-wmf-* headers in webrequest data set - https://phabricator.wikimedia.org/T417864#11631902 (JAllemandou)
[13:24:16] Traffic, Data-Engineering, MW-Interfaces-Team, MediaWiki-Platform-Team (Radar), OKR-Work: haproxy: strip x-wmf-* headers from responses - https://phabricator.wikimedia.org/T417781#11631905 (JAllemandou)
[13:26:43] ok they should be going away soon
[13:29:25] ok sorted
[13:56:43] Traffic, Data-Engineering, MW-Interfaces-Team, MediaWiki-Platform-Team (Radar), OKR-Work: haproxy: capture x-wmf-* headers in webrequest data set - https://phabricator.wikimedia.org/T417864#11632039 (JAllemandou)
[14:21:01] Traffic, Data-Engineering, MW-Interfaces-Team, MediaWiki-Platform-Team (Radar), OKR-Work: haproxy: capture x-wmf-* headers in webrequest data set - https://phabricator.wikimedia.org/T417864#11632139 (daniel)
[14:22:07] Traffic, Data-Engineering, MW-Interfaces-Team, MediaWiki-Platform-Team (Radar), OKR-Work: haproxy: capture x-wmf-* headers in webrequest data set - https://phabricator.wikimedia.org/T417864#11632148 (daniel) Let's postpone capturing x-wmf-user-id. x-wmf-ratelimit-class is the urgent one. Not...
[14:24:30] Traffic, SRE, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11632158 (ssingh) @BBlack and I had a long discussion about this (longer than the usual ones!) and some of it carried over to T366193 where we discussed getting two new /24s for the v4 anycasts....
[14:38:54] Traffic, collaboration-services, Continuous-Integration-Infrastructure, Gerrit, Zuul: Gerrit events not received by Zuul due to TCP Proxy timeout (CI is not triggered for some patches) - https://phabricator.wikimedia.org/T417497#11632249 (hashar) **TLDR** Zuul/CI does not receive event from...
[14:42:01] Traffic, collaboration-services, Gerrit, ci-test-error (WMF-deployed Build Failure), Patch-For-Review: ATS causes git fetches from Gerrit to fail with 502 responses - https://phabricator.wikimedia.org/T417536#11632259 (hashar)
[14:47:34] Traffic, collaboration-services, Gerrit, ci-test-error (WMF-deployed Build Failure), Patch-For-Review: ATS causes git fetches from Gerrit to fail with 502 responses - https://phabricator.wikimedia.org/T417536#11632306 (hashar) Open→Resolved a:Vgutierrez I can confirm the issue...
[14:52:13] Traffic, collaboration-services, Gerrit, Release-Engineering-Team: Debug - ATS causes git fetches from Gerrit to fail with 502 responses - https://phabricator.wikimedia.org/T417897 (ABran-WMF) NEW
[14:56:40] Traffic, Data-Engineering, MW-Interfaces-Team, MediaWiki-Platform-Team (Radar), OKR-Work: haproxy: capture x-wmf-* headers in webrequest data set - https://phabricator.wikimedia.org/T417864#11632377 (Clement_Goubert)
[14:56:43] Traffic, serviceops, Epic, FY2025-26 KR 5.1, OKR-Work: Log rate limits from rest-gateway in webrequests - https://phabricator.wikimedia.org/T414349#11632380 (Clement_Goubert) →Duplicate dup:T417864
[14:59:25] Traffic, Data-Engineering, MW-Interfaces-Team, MediaWiki-Platform-Team (Radar), OKR-Work: haproxy: capture x-wmf-* headers in webrequest data set - https://phabricator.wikimedia.org/T417864#11632385 (Clement_Goubert) In the merged task, I was proposing not creating a new header, but instead a...
[15:01:00] Traffic, collaboration-services, Gerrit, Release-Engineering-Team, Patch-For-Review: Debug - ATS causes git fetches from Gerrit to fail with 502 responses - https://phabricator.wikimedia.org/T417897#11632389 (ABran-WMF) Open→In progress p:Triage→High
[15:19:04] Traffic, collaboration-services, Continuous-Integration-Infrastructure, Gerrit, Zuul: Gerrit events not received by Zuul due to TCP Proxy timeout (CI is not triggered for some patches) - https://phabricator.wikimedia.org/T417497#11632482 (hashar) There are a couple alternatives I have consid...
[15:24:22] arnaudb: vgutierrez: I have added #traffic for Zuul > Gerrit ssh connection being terminated by the TCP Proxy after idling for 30 seconds ( T417497 )
[15:24:23] T417497: Gerrit events not received by Zuul due to TCP Proxy timeout (CI is not triggered for some patches) - https://phabricator.wikimedia.org/T417497
[15:24:47] but I think I have found a workaround which is to have Zuul to connect to the discovery URL which if I understand properly would make it connect directly to the Gerrit host
[15:24:50] thus bypassing the TCP proxy
[15:25:15] and thus no more being subject to the idling timeout :)
[15:25:38] hashar: note Valentin is out for on-call compensation today and tomorrow so he will respond Monday :>
[15:25:46] I think that will work for Zuul, I gotta have a host entry in the ssh known_host though but that is not the end of the day
[15:25:53] ahhh cool thanks sukhe :)
[15:26:06] Traffic, collaboration-services, Continuous-Integration-Infrastructure, Gerrit, Zuul: Gerrit events not received by Zuul due to TCP Proxy timeout (CI is not triggered for some patches) - https://phabricator.wikimedia.org/T417497#11632531 (Dzahn) I think directly connecting via the discovery n...
[15:33:45] Traffic, MediaWiki-extensions-OAuth, MediaWiki-Platform-Team (Radar): OAuth requests from Zeto app get throttled - https://phabricator.wikimedia.org/T417854#11632584 (Rtconner) Turns out this was all caused by the other issue : https://phabricator.wikimedia.org/T417839 Sorry. This can be closed. I w...
[15:33:56] Traffic, MediaWiki-extensions-OAuth, MediaWiki-Platform-Team (Radar): OAuth requests from Zeto app get throttled - https://phabricator.wikimedia.org/T417854#11632588 (Rtconner) Open→Resolved
[16:10:13] hashar: what protocol is zuul connecting via? ssh?
[16:11:11] Traffic, OKR-Work, Test Kitchen (Experiment Platform Sprint 20): Test the impact of incremental increase in traffic for cache splitting experiments - https://phabricator.wikimedia.org/T407570#11632767 (KReid-WMF)
[16:11:15] cdanis: yeah
[16:11:18] cdanis: yes over SSH
[16:11:21] using Paramiko
[16:12:04] yesterday I was imagining using Split-horizon DNS to have the Zuul server to receive the Gerrit IP instead of the TCP proxy
[16:12:16] but that can be achieved by using gerrit.discovery.wmnet instead
[16:12:22] hmmm using Paramiko, one simple workaround is `-o ServerAliveInterval=22` with regular ssh
[16:12:44] there is https://docs.paramiko.org/en/stable/api/transport.html#paramiko.transport.Transport.set_keepalive but that's a code change
[16:12:45] this way Zuul will connect directly to it, bypassing the TCP proxy and we are back to known state :)
[16:12:59] with ssh keepalive I've had ssh connections stay open via proxy for many hours
[16:13:11] I thought about the keep alive, but Zuul+paramiko are all end of life and I don't think I can even update them (short of hacking the python files)
[16:13:19] ack
[16:13:51] and yeah ServerAliveInterval=22 works great which led me to remember Zuul uses Paramiko (grrrr)
[16:13:52] using `gerrit.discovery.wmnet` is fine IMO
[16:13:59] yup
[16:14:03] it was intended as the internal endpoint :)
[16:14:12] and this way we don't have to mungle the HAProxy timeout
[16:14:19] it might be worth changing anyway, but, yeah
[16:14:21] out of curiosity, is there a plan to replace zuul?
[16:14:23] I think 30 seconds is fine, that matches Gerrit's own idle timeout for ssh connections
[16:14:39] ah, we should make it slightly longer then
[16:15:06] (the one in haproxy, I mean)
[16:15:36] for the switch to the discovery URL I will check with sre-collab (I mentioned it in their channel) and I have sent a series of puppet patches in order to be able to switch Zuul to the discovery url
[16:15:58] I gotta puppet compile / check etc with SRE collab
[16:17:00] ack
[16:17:18] does zuul send actual traffic every 30s? is that why it previously worked?
[16:17:55] that is the other way around
[16:18:05] Zuul connects to Gerrit to listen for events
[16:18:17] and Gerrit only emits events when something happens (patchset uploaded, someone commenting etc)
[16:18:26] well I'm wondering how it works against the direct gerrit endpoint, if gerrit also has some idle timeout of 30 seconds
[16:18:38] i am re-verifying that
[16:18:46] there are so many timeouts all over the place ..
[16:19:24] https://gerrit.wikimedia.org/r/Documentation/config-gerrit.html#sshd.idleTimeout
[16:19:24] Time in seconds after which the server automatically terminates idle connections
[16:19:43] we have that set to 3600 seconds (not 30secs as I wrongfully said above)
[16:20:21] Gerrit has another similar one which is to kill off connections that are taking too long to connect which is set at 30s ( https://gerrit.wikimedia.org/r/Documentation/config-gerrit.html#sshd.waitTimeout ). But I don't think those are idling
[16:21:17] makes sense, from gerrit's perspective they aren't idle, they're listening for events
[16:21:29] I think so yeah
[16:22:53] thus we got the TCP Proxy with a 30s idle timeout and the backend Gerrit having a 3600s idle timeout
[16:23:35] and I guess it is fine
[16:23:39] nah we can fix :)
[16:24:25] and the part that is not clear is whether the stream of events is subject to that timeout
[16:25:47] Traffic, OKR-Work, Test Kitchen (Experiment Platform Sprint 20): Test the impact of incremental increase in traffic for cache splitting experiments - https://phabricator.wikimedia.org/T407570#11632834 (dr0ptp4kt) CC @mpopov
[16:30:20] Traffic, collaboration-services, Continuous-Integration-Infrastructure, Gerrit, and 2 others: Gerrit events not received by Zuul due to TCP Proxy timeout (CI is not triggered for some patches) - https://phabricator.wikimedia.org/T417497#11632847 (hashar) Yup I think so. I have crafted three Pupp...
[16:30:36] cdanis: I forgot, there is a plan to replace Zuul and a good part of the infra has been setup
[16:30:41] hashar: jelto: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1240747
[16:30:54] + a lot of cleanup in a 10+ years list of mess. So it is progressing
[16:35:20] Traffic, Prod-Kubernetes, ServiceOps new, Kubernetes, Patch-For-Review: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#11632866 (JMeybohm) >>! In T352956#11631206, @Vgutierrez wrote: > blast radius is big.. I'm wondering if k...
[16:36:54] fwiw I think the above patch is a good idea anyway, but it might be fun to test zuul using tcpproxy with it applied :)
[16:40:54] then I don't know whether it is a good idea to have long timeouts for public facing clients
[16:41:12] then Gerrit had a 3600s timeout for ages so
[16:41:13] I mean, it was the status quo before, right?
[16:41:23] and I think it even had a bug where all the timers were actually * 1000 more
[16:41:28] yeah
[16:41:39] also what you wrote about `timeout tunnel` isn't true -- what was your source on that?
[16:41:39] I replied on your change
[16:41:44] yes, I replied to your reply :)
[16:42:34] AHH
[16:42:35] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1240243/comment/a6afe4b6_c19ec040/
[16:42:59] that was on jelto's patch which changes from listen to frontend
[16:42:59] ah see, I'm using it in the `defaults` section
[16:43:10] yeah
[16:43:13] I mixed it up
[16:43:23] * hashar promises no AI has been used
[16:43:26] oh the open-source haproxy documentation is confusing in this way
[16:43:37] This parameter is specific to backends, but can be specified once for all in
[16:43:39] "defaults" sections. This is in fact one of the easiest solutions not to
[16:43:41] forget about it.
[16:43:47] but it does say that at least :D once you *find* the section where it says that
[16:43:54] and I have Zero experience with HaProxy so I echoed what it had :]
[16:43:58] yeah :)
[16:44:09] there are some people here who refer to haproxy's config language as "VogonScript"
[16:44:19] so don't feel bad about that
[16:45:03] oh I never feel bad for being wrong, that is always a good opportunity to learn a few more things! :b
[16:45:37] indeed
[16:47:59] another thing I have learned earlier this week is "TCP timeout race condition"
[16:48:18] I swear I should have collected all the occurrences of race conditions I have encountered in my career
[16:48:24] AND I STILL FIND NEW ONES
[16:48:37] I learned a horrifying one of those that is (as far as I know) only triggerable from within the kernel
[16:48:47] which makes me wonder how the timeout should be set with the TCP proxy compared to what Gerrit has
[16:49:33] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8dbf060480236877703bff0106fc984576184d11
[16:50:03] as part of digging into T414460
[16:50:04] T414460: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460
[16:51:02] eeek
[16:51:11] I am quite happy to only have to deal with python/php bugs :]
[17:04:30] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11632990 (Ladsgroup) >>! In T414805#11630302, @Krinkle wrote: > > Is this about Swift index size or Thumbor capacity? I am not sug...
[18:04:08] Traffic, MediaWiki-extensions-OAuth, MediaWiki-Platform-Team (Radar): OAuth requests from Zeto app get throttled - https://phabricator.wikimedia.org/T417854#11633337 (Aklapper) →Duplicate dup:T417839
[18:12:44] Traffic, collaboration-services, Continuous-Integration-Infrastructure, Gerrit, and 2 others: Gerrit events not received by Zuul due to TCP Proxy timeout (CI is not triggered for some patches) - https://phabricator.wikimedia.org/T417497#11633379 (hashar) p:Unbreak!→High @Dzahn and I have...
[18:21:07] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11633451 (Aklapper) T417913 might be a potential side effect which I closed too quickly...?
[18:23:06] Traffic, collaboration-services, Continuous-Integration-Infrastructure, Gerrit, and 2 others: Gerrit events not received by Zuul due to TCP Proxy timeout (CI is not triggered for some patches) - https://phabricator.wikimedia.org/T417497#11633466 (Dzahn) The change to the tcp-proxy timeouts have a...
[18:41:25] Traffic, Prod-Kubernetes, ServiceOps new, Kubernetes, Patch-For-Review: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#11633649 (ayounsi) Jumbo frames won't solve the issue. Even if we start using a MTU > 1500 on the server,...
[18:50:13] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, and 2 others: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805#11633671 (Ladsgroup) Actually that'll be fixed in a couple of hours 😅
[20:25:06] Traffic: Redirect techblog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T417940 (CKoerner_WMF) NEW
[20:31:24] Traffic, Technical-Tool-Request, WMF-Legal: New service to shorten wmflabs URLs - https://phabricator.wikimedia.org/T232240#11634275 (Nemoralis) I wonder if we could ask the WMF (#wmf-legal and #traffic, per the Wikitech page) to register a domain for this purpose. [[ https://ge.domains/search?domain...
[20:44:54] brett: if you have a moment I would love a quick review of, https://gerrit.wikimedia.org/r/c/operations/dns/+/1240792
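Note on the 16:12-16:23 keepalive discussion: the reasoning there (a 30s idle timeout on the TCP proxy, 3600s on Gerrit's sshd, and a 22s keepalive that survives both) can be sketched as simple arithmetic. This is only an illustration in Python, not anything Zuul or HAProxy runs; the function names and dictionary keys are hypothetical labels, and it assumes every hop resets its idle timer whenever a keepalive passes through. Whether Gerrit's idle timeout applies to the event stream at all was left open in the chat.

```python
def effective_idle_timeout(idle_timeouts):
    """Return the shortest idle timeout (in seconds) along the connection path."""
    return min(idle_timeouts.values())


def keepalive_is_safe(interval, idle_timeouts):
    """True if a keepalive sent every `interval` seconds beats every idle timer."""
    return interval < effective_idle_timeout(idle_timeouts)


# Timeout values quoted in the chat; the keys are illustrative labels only.
path = {"tcp_proxy": 30, "gerrit_sshd_idle": 3600}

print(effective_idle_timeout(path))  # shortest timer on the path
print(keepalive_is_safe(22, path))   # the ServerAliveInterval=22 case
print(keepalive_is_safe(60, path))   # an interval longer than the proxy timer
```

Connecting through gerrit.discovery.wmnet instead amounts to dropping the proxy entry from the path, which leaves only the 3600s timer.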