[00:10:37] 06Traffic, 10Beta-Cluster-Infrastructure: Puppet agent failure detected on instance deployment-cache-upload08 in project deployment-prep - https://phabricator.wikimedia.org/T427819#11983571 (10bd808) →14Duplicate dup:03T428052 [00:10:38] 06Traffic, 10Beta-Cluster-Infrastructure, 06SRE: Beta cluster haproxy does not support `warn-blocked-traffic-after` keyword - https://phabricator.wikimedia.org/T428052#11983573 (10bd808) [00:10:43] 06Traffic, 10Beta-Cluster-Infrastructure: Puppet agent failure detected on instance deployment-cache-text08 in project deployment-prep - https://phabricator.wikimedia.org/T427813#11983576 (10bd808) →14Duplicate dup:03T428052 [00:10:46] 06Traffic, 10Beta-Cluster-Infrastructure, 06SRE: Beta cluster haproxy does not support `warn-blocked-traffic-after` keyword - https://phabricator.wikimedia.org/T428052#11983578 (10bd808) [00:14:40] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [00:29:40] FIRING: [11x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [00:34:40] FIRING: [14x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [00:49:40] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [00:54:40] RESOLVED: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [01:38:44] 06Traffic, 06Data-Persistence: Move thumbnail caching from upload cluster to 
text - https://phabricator.wikimedia.org/T427465#11983661 (10dr0ptp4kt) Thanks @Ladsgroup, understood on the extra complexity. [08:13:19] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqsin, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11984142 (10cmooney) >>! In T427393#11983173, @BCornwall wrote: > I was advised by @taavi to also update mediawiki-config's `wmf-con... [08:59:12] 10netops, 06Infrastructure-Foundations, 07sre-alert-triage: Alert in need of triage: NetboxAccounting - https://phabricator.wikimedia.org/T428132 (10LSobanski) 03NEW [09:11:56] 10netops, 06Infrastructure-Foundations, 06SRE: Firewall filter blocking traceroute in underlay QFX5120 EVPN - https://phabricator.wikimedia.org/T348120#11984329 (10ayounsi) 05Open→03Resolved a:03ayounsi ` lsw1-a8-codfw> traceroute lo0.lsw1-a2-codfw.codfw.wmnet traceroute to lo0.lsw1-a2-codfw.co... [09:35:30] 06Traffic, 06serviceops-deprecated, 07Upstream: Java fails to install on WMF Debian container - https://phabricator.wikimedia.org/T352350#11984430 (10hashar) This is an issue in the Debian packaging for OpenJDK [[ https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=863199 | Debian #863199 ]] [[ https://bug... [11:43:48] FIRING: PuppetFailure: Puppet has failed on cp6009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:53:48] FIRING: [2x] PuppetFailure: Puppet has failed on cp6009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:15:13] Hi! I merged an ATS change more than 2 hours ago, adding a map for http://kafka.wikimedia.org -> https://kafka-ui.discovery.wmnet:30443. When testing it live in my browser, I get the usual page synonymous to a non working redirection "Welcome! 
The Wikimedia movement is a global community...". I ran a check on all cp hosts via cumin, and it seems [12:15:13] that only 33% of the cp hosts have a working redirection for this domain (see https://phabricator.wikimedia.org/P93867) [12:15:32] Last time something like this happened, we had an unrelated config issue preventing any new config from being reloaded [12:16:15] I ran puppet on all cp hosts. It mostly ran fine, except for a couple of hosts failing with [12:16:15] Error: '/usr/local/sbin/reload-vcl -n frontend -f /etc/varnish/wikimedia_text-frontend.vcl -d 2 -a -s /etc/varnish/wikimedia_misc-frontend.vcl && (rm /var/tmp/reload-vcl-failed-frontend; true)' returned 1 instead of one of [0] [12:18:09] (I've just thought of the fact that some of these cp hosts are cache::image, and not cache::text) [12:18:41] however, cp6009.drmrs.wmnet is a cache::text host, and the redirection is busted on it [12:20:59] hmm, looking just above (and at https://puppetboard.wikimedia.org/report/cp6009.drmrs.wmnet/cbcfc1b854f704d4775fa4fc0fb8ee7d9dc165a9) it appears puppet is currently failing on cp6011, 6009 and 6015 [12:21:49] however, I get the same busted redirection on cp6012, where puppet ran just fine [12:22:19] so, heh, any help would be appreciated, please and thank you [12:25:12] 06Traffic, 10Lift-Wing, 10Semantic Search, 07Essential-Work, 06Machine-Learning-Team (Q4 FY2025-26): Transparent DNS Routing for LiftWing Services (eqiad vs Multi-DC) - https://phabricator.wikimedia.org/T422253#11984950 (10isarantopoulos) [12:27:01] looking back at https://phabricator.wikimedia.org/P93867, it seems that _all_ hosts in codfw, drmrs, eqsin have a busted redirection, and all hosts in eqiad and esams have a working redirection, when it's a mixed bag for magru [12:27:37] I just don't know what to make of it. Could we have requestctl rules impacting specific pops and preventing ATS from reloading its config? 
Seems far fetched but what do I know [12:33:48] FIRING: [2x] PuppetFailure: Puppet has failed on cp6009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:38:27] I'm going to let it be for a while, and see if somehow some cache expires [12:39:01] 10netops, 06Infrastructure-Foundations: Create public vlans in eqiad and codfw - https://phabricator.wikimedia.org/T422043#11985032 (10ayounsi) [12:39:12] 10netops, 06Infrastructure-Foundations: Create public vlans in eqiad and codfw - https://phabricator.wikimedia.org/T422043#11985033 (10ayounsi) [12:41:01] 06Traffic, 10Lift-Wing, 10Semantic Search, 07Essential-Work, 06Machine-Learning-Team (Q4 FY2025-26): Transparent DNS Routing for LiftWing Services (eqiad vs Multi-DC) - https://phabricator.wikimedia.org/T422253#11985035 (10DPogorzelski-WMF) i think the original question was whether we had a way to use on... [13:14:58] brouberol: sorry for the lack of a response, all EU Traffic folks are out on PTO today [13:15:33] I am catching up so let me see what is happening [13:22:39] thanks! [13:23:48] RESOLVED: PuppetFailure: Puppet has failed on cp6009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:26:25] brouberol: so things seemed to have resolved now -- by themselves [13:26:35] looking into what was failing on varnish's side and why [13:26:55] huh intersting [13:27:10] so I guess that I might have to wait for some cache to ttl out? [13:30:11] indeed, I'm now seeing 93% successful redirections, compared to 33% earlier [13:31:27] brouberol: so the VCL was not loaded and that could explain the disparity. so you made your change but it wasn't picked up by Varnish. why that failed, looking [13:32:27] so right now, spot checking seems to give very varied results. 
I can login and then I get a blank page, or I get the proper UI but clicking on a link fails [13:32:30] so I'm going to wait some more [13:32:53] brouberol: you should not have to wait though. there is no cache per se here. [13:32:56] is this https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/2b824d4f18b0cb42359b88ab434e51e177c87d27%5E%21/#F1 your change? [13:32:59] that's it or something else? [13:33:14] yep [13:33:17] that's the one [13:33:33] yeah that can't be the cause, it's just a symptom [13:34:51] brouberol: what's x-cache-status on the cp hosts that dont' work? [13:35:05] let's see [13:36:02] x-cache: cp6013 miss, cp6011 pass [13:36:03] x-cache-status: pass [13:36:18] yeah drmrs text [13:38:01] I bet trafficserver didn't get restarted on the runs where puppet failed because of varnish [13:38:24] ooh, so now puppet runs but does not change the config, and thus ats is left un-restarted? [13:38:38] > sudo cumin -b1 'A:cp-text_drmrs' 'systemctl reload varnish-frontend.service' [13:38:41] running [13:39:48] none of the trafficservers have had their config reloaded since 10:48 UTC at latest [13:41:09] hmm [13:41:15] https://puppetboard.wikimedia.org/report/cp6011.drmrs.wmnet/c5e88f13c2ef708fe09a5f80b846c6e12e4f51a7 no, this says it does, I think [13:41:17] > Jun 04 13:40:45 cp6009 varnish-frontend[2867121]: subprocess.CalledProcessError: Command '['/usr/bin/varnishadm', '-n', 'frontend', 'vcl.use', 'vcl-fea8aaa9-08c5-48b1-a170-0a28b96735f9']' returned non-zero exit status 2. [13:41:21] Service[trafficserver] success [13:41:35] there is something most certainly up with the stale VCLs in drmrs [13:41:37] sukhe this is what I was seeing in pupperboard [13:41:54] which is causing the reload to fail, which is causing the changes to not propagate [13:42:10] Even varnish does not want to work with france [13:42:14] 💀 [13:44:16] I'm seeing 100% of text hosts redirect to idp now? 
[13:44:25] idk [13:44:27] let me check [13:44:59] I'm getting a blank page in firefox. Let me try in a private session, somehow [13:45:17] I don't think it is resolved [13:45:21] same, when I'm passed idp, blank page [13:45:28] > sukhe@cumin1003:~$ sudo cumin 'A:cp-text_drmrs' "varnishadm -n frontend vcl.list | wc -l" [13:45:32] is not painting a good picture [13:45:42] hm [13:45:44] it works for me [13:45:45] the old VCLs are piling up [13:46:21] (2) cp[6011,6013].drmrs.wmnet [13:46:25] ----- OUTPUT for command #1: 'varnishadm -n fr...vcl.list | wc -l' ----- [13:46:29] 11 [13:46:33] trying something er, more drastic I guess [13:46:45] cdanis: are you going through drmrs ? [13:48:16] after enabling chrome debug pages so I could flush my open socket pools to get tunnelencabulator to actually work, yes, and now I *can* repro [13:48:23] cp6012 miss, cp6012 pass [13:48:42] AHA [13:48:43] sukhe@cp6011:~$ sudo varnishadm -n frontend vcl.list [13:48:43] available warm warm 4 vcl-31558fa9-2e1c-477b-be55-ed8fd1749390 <- (1 label) [13:48:46] available label warm 0 wikimedia_misc -> vcl-31558fa9-2e1c-477b-be55-ed8fd1749390 (1 return(vcl)) [13:48:50] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqsin, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11985434 (10cmooney) Actually @BCornwall I'm hoping to test https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1297232 , the go... [13:48:51] brouberol: open up net tools :) the JS 404s in drmrs [13:48:51] yeah a _restart_ fixed it [13:49:02] the stale VCLs are cleared up [13:49:10] I wish I knew how to do it without that but I am not v [13:49:14] probably because the request goes to the standard fallback mediawiki page "Welcome! The Wikimedia movement is a global community etc etc" [13:49:39] GET https://kafka.wikimedia.org/assets/index-ToFdRV4e.js [13:49:41] 404 Not Found [13:49:43] cp6016 hit, cp6012 pass [13:49:45] a 404 hit! 
[13:49:51] which is probably *also* part of the stickiness [13:50:08] hunh drmrs is still c-hashed? [13:50:19] yes, the last site and yes :) [13:50:23] neat [13:50:33] oh wait blblack is around [13:50:43] I'm shift-refreshing and getting more of the site every time lol [13:50:50] blblack: any words of wisdom on how to get varnish to clean up stale and old VCLs, other than a restart? [13:50:55] a restart worked and I can do it worse case but yeah [13:51:11] cdanis: haha, for now I'm still getting blank pages most of the time, or the fallback page [13:51:14] sukhe: I have need to do a rolling restart in drmrs anyway [13:51:20] cdanis: for? [13:51:26] wasn't I gonna push whatsit there [13:51:37] yeah https://gerrit.wikimedia.org/r/c/operations/puppet/+/1297655 [13:52:22] cdanis: sorry, this requires a rolling restart of varnish? [13:52:23] sukhe: my thing can wait 100% byw [13:52:26] *btw [13:52:36] no, it already requires a rolling restart of the haproxy processes, which because haproxy is so smooth is as user-hitless as a reload, but [13:52:38] brouberol: yeah no issues, but we do have a problem if VCL is not reloading [13:52:41] if I'm gonna touch all the hosts anyway [13:52:45] true [13:52:47] cdanis: ok then please go for it +1 [13:52:52] we will need to restart, it fixed the problem [13:52:57] ack [13:53:06] should I preserve perhaps one of the hosts depooled in the bad stae? [13:53:23] cdanis: for analysis you mean? [13:53:26] ye [13:53:38] not to do myself, but to leave for someone else, to be clear [13:54:14] a bit split but one host is fine I guess [13:54:15] depooled [13:54:42] split because -> I went through the backlog and it seems this issue has persisted for a while and the reason is the cascading failure on VCL reload because of stale VCLs [13:54:48] it's only in drmrs and only on ntext [13:54:51] text [13:55:32] hmmmmmb [13:55:43] and drmrs is the only place where we have c-hashing still enabled eh [13:55:46] yeah [13:55:49] hmmmmb. 
[13:56:18] but also on upload though [13:56:46] upload has way way way fewer uri hostnames than text (that will all chash differently right at startup time and require connecting to different ATSen) [13:57:02] idk I can imagine a lot of shenanigans because of that [13:57:12] but that should not affect the VCL reload though I think [13:57:26] idk [13:58:37] naive q: could it have something to do with the size of the VLC, caused by some drmrs-specific requestctl rules? [13:58:41] *vlc [13:58:43] aaah [13:58:46] VCL [13:58:47] dammit [13:59:14] also conceivable [13:59:23] brouberol: it's OK, maybe VLC can also play VCL files [13:59:34] ^^ [13:59:36] tfw no traffic cone emoji in unicode [14:03:44] sukhe: do you have a one-liner to check for stale vcls, or close to one? [14:05:22] cdanis: varnishadm -n frontend vcl.list | wc -l will give you an idea. [14:07:43] so for some of the text hosts, we see a lot of the older VCLs still serving traffic [14:09:09] _____FORMATTED_OUTPUT_____ [14:09:10] cp6009.drmrs.wmnet: 8 [14:09:12] cp6010.drmrs.wmnet: 6 [14:09:14] cp6011.drmrs.wmnet: 2 [14:09:16] cp6012.drmrs.wmnet: 2 [14:09:18] cp6013.drmrs.wmnet: 6 [14:09:20] cp6014.drmrs.wmnet: 2 [14:09:22] cp6015.drmrs.wmnet: 4 [14:09:24] cp6016.drmrs.wmnet: 3 [14:09:26] pretty sure anything != 2 there is an error state [14:09:28] varnishadm -n frontend vcl.list | grep -E "vcl-[0-9a-f]+" | grep -v " 0 " | wc -l [14:14:44] it's more trickier than that though I think [14:15:37] leaving cp6013 depooled [14:16:01] there is the misc one, there is the active one and there are the older ones that are serving no requests (should have been discared) and older ones serving requests (a problem) [14:17:00] thanks [14:17:50] like cp1100: [14:17:53] available label warm 0 wikimedia_misc -> vcl-35e62912-272d-408b-8b02-cc35dd978c1c (3 return(vcl)'s) [14:17:56] available auto warm 2 vcl-796ce60d-1f38-43bc-ad7e-e6ad951f4da9 [14:17:59] available warm warm 0 vcl-2f125fdd-1e5c-4f14-856c-8411140b91b1 [14:18:02] 
available auto warm 0 vcl-eeb1168d-77bf-4dce-95e2-f2917da4fe72 [14:18:05] available warm warm 145 vcl-35e62912-272d-408b-8b02-cc35dd978c1c <- (1 label) [14:18:08] active auto warm 697 vcl-746b92fc-d8c6-4aeb-8207-f8a5cf0546ce [14:18:17] > available auto warm 2 vcl-796ce60d-1f38-43bc-ad7e-e6ad951f4da9 [14:20:37] misc also serves eventstreams doesn't it [14:21:50] cache::alternate_domains: [14:21:58] stream.wikimedia.org? [14:22:08] if yes, then yes [14:22:24] those can be very long-lived [14:22:36] so that’s one possible source [14:22:48] I tried to get something from varnishlog but my varnish foo is failing [14:23:12] stream-intermal.w.o was released recently, that provides an authenticated ui over eventstreams, for all streams [14:23:33] so that might be an additional source of streams causing some piling/delayed reloads [14:25:52] (if that is indeed an/the issue) [14:26:04] well, all the pooled drmrs text nodes should be fixed, now [14:27:09] available label warm 0 wikimedia_misc -> vcl-3be3716f-4751-40fc-9cfe-87a543519c7a (3 return(vcl)'s) [14:27:12] active auto warm 948 vcl-ef399a1d-bc53-4f45-a346-82432dc87d49 [14:27:15] good [14:27:18] available warm warm 28 vcl-c7d3c3dc-ba7f-4351-baeb-7b65189b20b4 [14:27:29] available warm warm 134 vcl-3be3716f-4751-40fc-9cfe-87a543519c7a <- (1 label) [14:28:11] cdanis: note that we need a restart and not a reload though. which means that we will need to space these out 20 mins or so between each restart. [14:28:33] is it nice? no. but I can't figure out any other way right now. 
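A minimal sketch (not from the channel) of the four-way split described at 14:16:01: the misc label, the active VCL, older VCLs serving nothing, and older VCLs still serving requests. It parses the whitespace-collapsed `vcl.list` lines exactly as pasted above for cp1100. Treating the fourth column as an in-use counter is an assumption taken from the discussion, and `parse_vcl_list`, `classify` and the bucket names are hypothetical helpers, not part of any Varnish or WMF tooling.

```python
from typing import Dict, List, NamedTuple


class VclEntry(NamedTuple):
    status: str       # "active" or "available"
    state: str        # "auto", "warm" or "label" in the pasted output
    temperature: str
    busy: int         # fourth column; assumed here to be an in-use counter
    name: str
    labelled: bool    # True when the line carries a "<- (N label)" suffix


def parse_vcl_list(text: str) -> List[VclEntry]:
    """Parse vcl.list lines in the whitespace-collapsed form pasted in the log."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Skip anything that does not look like a vcl.list row.
        if len(fields) < 5 or not fields[3].isdigit():
            continue
        status, state, temperature, busy, name = fields[:5]
        entries.append(
            VclEntry(status, state, temperature, int(busy), name, "<-" in fields[5:])
        )
    return entries


def classify(entries: List[VclEntry]) -> Dict[str, List[str]]:
    """Sort VCL names into the buckets discussed in the channel."""
    buckets: Dict[str, List[str]] = {
        "active": [], "label": [], "label_target": [],
        "stale_idle": [], "stale_busy": [],
    }
    for entry in entries:
        if entry.status == "active":
            buckets["active"].append(entry.name)
        elif entry.state == "label":
            buckets["label"].append(entry.name)
        elif entry.labelled:
            # The VCL a label points at is expected to stay loaded.
            buckets["label_target"].append(entry.name)
        elif entry.busy > 0:
            buckets["stale_busy"].append(entry.name)
        else:
            buckets["stale_idle"].append(entry.name)
    return buckets


# The cp1100 output pasted in the log, one entry per line.
SAMPLE = """\
available label warm 0 wikimedia_misc -> vcl-35e62912-272d-408b-8b02-cc35dd978c1c (3 return(vcl)'s)
available auto warm 2 vcl-796ce60d-1f38-43bc-ad7e-e6ad951f4da9
available warm warm 0 vcl-2f125fdd-1e5c-4f14-856c-8411140b91b1
available auto warm 0 vcl-eeb1168d-77bf-4dce-95e2-f2917da4fe72
available warm warm 145 vcl-35e62912-272d-408b-8b02-cc35dd978c1c <- (1 label)
active auto warm 697 vcl-746b92fc-d8c6-4aeb-8207-f8a5cf0546ce
"""

if __name__ == "__main__":
    for bucket, names in classify(parse_vcl_list(SAMPLE)).items():
        print(f"{bucket}: {names}")
```

Unlike a plain `wc -l`, this keeps the labelled misc VCL out of the stale counts, which is the distinction raised at 14:14:44. Real `varnishadm` output may be column-aligned or formatted differently across Varnish versions, so the field positions would need checking against a live host.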
[14:28:43] a reload is not just doing it [14:29:23] brouberol@cumin1003:~$ sudo cumin 'cp6*' ' curl -s https://kafka.wikimedia.org | grep -q idp' [14:29:23] 68.8% (11/16) of nodes failed to execute [14:29:40] so we will need to do the restart cookbook on A:cp-text_drmrs basically [14:29:45] brouberol: do 'A:cp-text AND A:drmrs' [14:29:54] thanks, I might have been too broad there [14:30:04] you got 8 upload hosts too, where that definitely won't work [14:30:04] unless I am wrong (which is possible) [14:30:09] yeah true [14:30:10] sukhe: well uhhh I already did the restarts 🙃 [14:30:17] oh [14:30:28] spaced out over a few minutes each but not 20 [14:30:39] ok, we have been meaning to stress test it lol [14:30:45] brouberol@cumin1003:~$ sudo cumin 'A:cp-text AND A:drmrs' ' curl -s https://kafka.wikimedia.org | grep -q idp' [14:30:45] 62.5% (5/8) of nodes failed to execute: cp[6009-6010,6012-6014].drmrs.wmnet [14:32:06] I've tried so many times and I can't reproduce [14:32:17] trying as another sample size [14:32:21] 100.0% (8/8) success ratio (>= 100.0% threshold) for command #1: 'curl -s https://....org/ | grep idp'. [14:32:27] lol what a day [14:32:33] and it's 10:30am [14:32:38] haha [14:32:55] wooo, new metric [14:32:56] 87.5% (7/8) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.: cp[6009-6010,6012-6016].drmrs.wmnet [14:33:16] cdanis: did we reload ATS? [14:33:26] Puppet allegedly did [14:33:26] because in the normal flow of operations it should have been reloaded [14:33:36] but perhaps becaues Puppet was borking, it did not [14:44:11] I am AFK for a bit now so will come back to this again [14:52:31] thank you! 
[15:07:38] 06Traffic, 06MW-Interfaces-Team, 06ServiceOps new, 10ServiceOps-SharedInfra, and 4 others: Epic: API Rate Limiting Architecture - https://phabricator.wikimedia.org/T399291#11985926 (10daniel) [15:10:04] topranks: You okay with me merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1297621 today? [15:11:27] brett: yes I'll be around so can validate things are working [15:27:35] topranks: Rolled out fleet-wide! [15:27:47] brett: nice!! [15:45:47] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqsin, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11986028 (10BCornwall) @cmooney I've depooled cp5030. Have fun! [15:55:23] cdanis: did we manually reload ATS on cp-drmrs? asking. [15:55:27] because well I am still confused [15:55:35] sukhe: I thought puppet said it did [15:55:47] ok [15:55:59] yeah I guess by now at least with no failed puppet runs, hmm [15:56:08] https://puppetboard.wikimedia.org/report/cp6011.drmrs.wmnet/c5e88f13c2ef708fe09a5f80b846c6e12e4f51a7 [15:56:12] ctrl-f Service[trafficserver] [15:56:23] that too :( [16:26:19] 06Traffic, 10Diff-blog, 10Technical Blog: Redirect techblog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T417940#11986184 (10Aklapper) @CKoerner_WMF: My impression is that this has been done? If I'm not wrong, does someone plan to * decline or resolve [open tasks](https://phabricat... 
[16:28:21] blblack: once things have settled down, could use your input here [16:28:30] basically, https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/2b824d4f18b0cb42359b88ab434e51e177c87d27%5E%21/#F1 was pushed [16:28:45] and prior to that, it seems like _only_ in cp-text drmrs, there were some stale VCLs [16:29:16] I think that seems to be fixed now at least if I look at the active VCLs and the stale ones not serving any traffci [16:29:28] but the above change for example is still not active on A:cp-text_drmrs and only there [16:29:59] could be it related to something specific with drmrs being the only non single-backend site? [16:30:04] I just don't know what though [16:30:22] 06Traffic, 10Diff-blog, 10Technical Blog: Redirect techblog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T417940#11986198 (10Pppery) Also the top-level page https://techblog.wikimedia.org should redirect to https://diff.wikimedia.org/tech-blog/ instead of https://diff.wikimedia.org/ [16:30:24] so if we do sudo cumin "A:cp-text" "curl -s https://kafka.wikimedia.org | grep -q idp" [16:30:29] only drmrs fails in that for example [16:30:35] oh sorry I've been artficially backscrolled on this channel all day apparently [16:30:42] no worries [16:30:55] heading to a meeting now but yeah, could use your input for sure since we have tried [16:32:31] I'll read some of the backscroll, but just reading the few lines above, my first question is: what's the real observable problem? because generally stale VCLs are kinda-normal and not a big deal. [16:33:05] yeah those are fine now, as I shared, per the output. we have an active one and one for misc. [16:33:18] sukhe: !!! in drmrs you get different results for `curl -s https://kafka.wikimedia.org` than you do for `curl -s https://kafka.wikimedia.org/` [16:34:19] lol. I was getting a different result on every single run. 
(sorry meeting now) [16:34:58] hm [16:37:02] so, I'm still not done reading backscroll, but I'll say this: in general, it would probably be wiser to roll out the ATS backend mapping ahead of the varnish part, rather than doing them in the same commit. [16:37:24] this would particularly impact drmrs more than other sites, because varnish could start routing traffic for that domain to ATSes that don't yet have the backend mapping [16:37:39] meeting, bbiab [17:01:04] on that point above about drmrs being special: in the future when drmrs is finally single-backend, we could probably solve this purely in puppet (by ensuring the change is applied and functional in the local ATS before changing the local varnish). I don't know if the puppetization even attempts this now, so that could be part of the random failure in other sites if not. [17:01:42] but in the world we have today, not splitting the two levels of change (ATS, then Varnish) is always going to cause some kind of problem. [17:03:52] that the puppet runs failed, due to some pile-up of old VCLs, is kinda of a separate but compounding issue, IMHO. [17:04:52] but in general, AFAIK, I'm pretty sure if we have legitimate stale in-use VCLs (e.g. for streams), they don't get new traffic from new connections, only the old ones. [17:05:01] so they shouldn't have been a direct cause of anything [17:07:24] (unless there's something else fishy going on here, where we have browsers hitting streams, and somehow reusing the conn for non-stream future reqs? seems far-fetched, though) [17:08:31] 06Traffic, 06SRE, 13Patch-For-Review: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836#11986310 (10BCornwall) a:05ssingh→03BCornwall [17:27:35] reading [17:27:53] 13:04:52 < blblack> but in general, AFAIK, I'm pretty sure if we have legitimate stale in-use VCLs (e.g. for streams), they don't get [17:28:53] that's confirmed by the output too right? 
the fourth column is the requests (in-flight) being served by varnish using the stale VCL? [17:30:20] so could we tackle the problem in a reverse way then like you hinted: because both were put at the same time, then perhaps varnish was reaching out to different ATSes without the mapping and perhaps cached something there? so could the resolution be to try to clear both the caches (we did just varnish) and start there [17:30:26] yeah, so in general I'm saying that in itself should be OK, and shouldn't cause divergent answers [17:30:54] (the stale stream VCLs) [17:31:51] if you've successfully ensure the new config is actually-deployed in all varnishes + ATSes, and *then* wiped relevant records in all the vsarnishes, that should've been enough to recover from whatever happened. [17:32:36] at least at the CDN layer. obviously the same concerns can potentially go deeper down the stack. In general, "define a new thing" always has to happen sequentially from the inside out. [17:53:52] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06SRE: Moving switches to make space for the refreshed switches. - https://phabricator.wikimedia.org/T428195 (10VRiley-WMF) 03NEW [18:20:02] 10netops, 06Infrastructure-Foundations: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611#11986559 (10cmooney) [18:23:39] back after meetings (!) [18:23:50] yeah so we already did all the above [18:23:55] I wonder what else we are missing [18:29:14] 10netops, 06Infrastructure-Foundations, 13Patch-For-Review: POPs - free up 2xQSFP ports - https://phabricator.wikimedia.org/T424611#11986618 (10cmooney) [18:31:11] sukhe: what's still broken? 
[18:31:52] blblack: the specific mapping change that was pushed out is only being reflected on cp6015 and 16 [18:32:03] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/2b824d4f18b0cb42359b88ab434e51e177c87d27%5E%21/#F1 [18:32:10] not reflected on the other text nodes, only in drmrs [18:33:12] so the other pops are fine, but in drmrs, only 15+16 have the change? [18:33:16] yep [18:33:57] I am going to try something silly, which is to clear ATS' cache [18:34:02] on one of the affected nodes [18:38:58] well keep in mind, the chashing from varnish->ATS there [18:39:08] ouch yeahhhhh [18:39:09] ha [18:39:22] ATS may be broken in one or more nodes (caching or lack of actual config reload), but multiple other nodes may be using it at the varnish level [18:39:27] I am banking too much on the fact that I am getting the same node I guess [18:40:04] do we have an easy tool for wiping just one domain or record in ATS? [18:41:19] I tried looking in the docs but no [18:41:27] or we could purge (maybe a few times in a row to be sure) kafka.wikimedia.org via the script for it, hitting both ATS and varnish [18:41:30] so unfortunately I just cleared all cache on a depooled host [18:41:41] yeah, let me try that [18:42:19] (only reason I'm saying that few times in a row voodoo, is in theory it's all async and a varnish could re-fetch a bad entry from ATS when varnish happens to process the PURGE before ATS does) [18:43:08] sukhe@deploy1003:~$ echo 'https://kafka.wikimedia.org' | mwscript-k8s --attach -- purgeList.php [18:43:12] not much luck [18:43:26] will purgeList treat a bare-domain URI as the whole thing? 
[18:44:00] I don't think so based on what I have done in the past [18:44:05] ha [18:44:05] sukhe@deploy1003:~$ echo 'kafka.wikimedia.org' | mwscript-k8s --attach -- purgeList.php [18:44:07] yeah me either [18:44:10] https://aa.wikipedia.org/wiki/Kafka.wikimedia.org [18:44:11] https://aa.wikipedia.org/w/index.php?title=Kafka.wikimedia.org&action=history [18:44:14] it purged this ^ [18:44:23] which well [18:44:38] where does it get that list from? [18:44:50] no idea tbh [18:44:51] that's... not what I expected, I guess it's been a while since I've used the script [18:45:53] in any case, we *think* that ATS has applied the new mapping rules in all the nodes (the config change), right? [18:46:35] yes because Puppet has run. I will do a manual reload too I guess. [18:48:34] uhhhhhhh [18:48:36] hmm [18:49:05] (8) cp[6009-6016].drmrs.wmnet [18:49:09] ----- OUTPUT for command #1: 'grep kafka /etc/...ver/remap.config' ----- [18:49:13] map http://kafka.wikimedia.org https://kafka-ui.discovery.wmnet:30443 [18:49:17] forcing a manual reload now [18:50:01] FWIW, the "bad" output (generic landing page) on cp6012 varnish-fe is a frontend cache hit, and it came from an ATS cache hit, too [18:50:30] same host? [18:50:49] x-cache: cp6016 hit, cp6012 hit/5 [18:50:50] x-cache-status: hit-front [18:51:13] 6016 ATS + 6012 varnish [18:51:25] yeah same for cp6013 [18:51:44] reloading done, just in case. [18:51:52] try purge with the trailing slash just in case? [18:52:18] by that I mean: echo 'https://kafka.wikimedia.org/' | mwscript-k8s [18:52:21] yeah [18:53:14] sukhe@deploy1003:~$ echo 'https://kafka.wikimedia.org/' | mwscript-k8s --attach -- purgeList.php [18:53:17] so that purge worked [18:53:18] sukhe@deploy1003:~$ echo 'http://kafka.wikimedia.org/' | mwscript-k8s --attach -- purgeList.php [18:53:35] it did? [18:53:38] (2) cp[6015-6016].drmrs.wmnet [18:53:42] ----- OUTPUT for command #1: 'curl -s https://...rg | grep -i idp' ----- [18:53:46]

The document has moved here.

[18:53:50] still two hosts? [18:54:24] yeah, I get the same 6012->6016 path, but right after your purge with the final slash, it flipped to miss/miss [18:54:32] it still serves the "wrong" content, though [18:54:38] (the generic landing page) [18:54:49] and it's hitting again now of course, on its fresh cache entry [18:54:54] let me try something silly I guess. I will purge both varnish and ATS cache in tandem on the depooled host [18:55:06] why? [18:55:23] it's unlikely to work, since it's unlikely to pick itself as the ATS backend [18:55:37] miss/miss, still wrong answer -> something is wrong deeper than ATS [18:55:39] if I force it to I guess? [18:55:53] the only other way then given drmrs is to clear the cache everywhere? [18:56:11] yeah but clearing this specific cache entry didn't help [18:56:22] we saw it flip from hit/hit->bad-content, to miss/miss->bad-content [18:56:31] caching is not the (only) issue [18:58:37] ok [18:58:43] tell me you found something with that OK :) [18:59:20] I mean I am not confident in the caching thing except I don't see what else can it be except something weird that got stuck when we had that stale VCL due to VCL reload failures for a day or so in drmrs [18:59:26] which then in theory the restart should have cleared out for varnish [18:59:56] well, I noticed if I did a fetch directly to the applayer ( https://kafka-ui.discovery.wmnet:30443/ ) from my cumin host, the server header says istio (whereas my bad outputs don't say that) [19:00:09] so I tried: [19:00:12] bblack@cumin1003:~$ sudo cumin 'A:cp' 'curl -vI https://kafka-ui.discovery.wmnet:30443/ 2>&1|grep -i server:' [19:00:26] result is: [19:00:35] (55) cp[2043-2058].codfw.wmnet,cp[5017-5021,5023-5032].eqsin.wmnet,cp[7002,7004,7006,7008,7010,7012,7014,7016].magru.wmnet,cp[4037-4052].ulsfo.wmnet -> server: Apache/2.4.67 (Debian) [19:00:46] (56) cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[3066-3081].esams.wmnet,cp[7001,7003,7005,7007,7009,7011,7013,7015].magru.wmnet 
-> server: istio-envoy [19:00:55] so I'm gonna say there's something wrong at the applayer :P [19:00:58] ha [19:01:32] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqsin, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11986698 (10cmooney) >>! In T427393#11986028, @BCornwall wrote: > @cmooney I've depooled cp5030. Have fun! Thanks! [19:02:37] don't ask me why magru is split like that [19:03:06] maybe that's unrelated but points at some mistake in our geo-maps re: magru IPs or something [19:03:25] I bet the real diff is things that resolve discovery records to eqiad vs codfw for the applayer stuff [19:03:51] or it could be ~random, I donno [19:04:18] seems fairly consistent over time, the split [19:06:17] but, crucially, both the "Apache" and "istio" sets do return a 302 -> IDP [19:06:30] neither of which matches the generic-landing-page bad output [19:06:38] yeah... which is what we see on cp6015 and 16 [19:07:30] yeah from all 111 cp servers, I do seem to consistently get 302 -> IDP [19:07:38] (querying discovery from the cp servers, I mean) [19:07:57] wow [19:08:03] now all cp hosts in drmrs are failing [19:08:03] 14.5% (8/55) of nodes failed to execute command #1: 'curl -s https://...rg | grep -q idp': cp[6009-6016].drmrs.wmnet [19:08:24] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqsin, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11986710 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host cp5030.eqsin.wmnet wi... [19:09:03] I am not sure what changed [19:09:13] other than us calling curl and that somehow changing something (!) 
[19:09:13] sigh [19:09:14] well, probably the purges caused that, indirectly [19:09:25] but we tried after the purge too [19:09:28] even after a while [19:09:44] what I mean is: we may have purged working entries and gotten then recached as newly-broken ones [19:10:10] yeah but why now though? we purged and then we tested it after a while and it was the same [19:10:13] now _suddenly_ it is broken [19:10:21] surely purged did not take that long [19:13:42] when I try ATSes directly, I invariably seem to get the bad content from k8s [19:14:24] e.g from both cp2043 and cp1100, if I execute: [19:14:32] curl -4I http://kafka.wikimedia.org:3128/ --resolve kafka.wikimedia.org:3128:127.0.0.1 [19:14:41] server: mw-web.codfw.main-5d468dcf79-p55qv [19:14:45] last-modified: Mon, 01 Jun 2026 14:20:50 GMT [19:14:47] etag: "ba5d-65331e6be5480" [19:14:50] accept-ranges: bytes [19:14:53] content-length: 47709 [19:14:56] (and the generic landing page content) [19:15:09] X-Cache-Int: cp2043 hit [19:15:21] so it's still cached in ATS ~everywhere, with the bad output [19:17:01] the eqiad one I'm trying has Age: 234 [19:17:23] I think the routing map config change is not actually-effective on many of these ATSes [19:17:26] blblack: but the above should really be [19:17:30] curl -4 https://kafka.wikimedia.org --resolve kafka.wikimedia.org:443:127.0.0.1 [19:17:33] no? 
[19:17:40] that would hit varnish [19:17:44] (well, haproxy->varnish) [19:17:51] I'm hitting ATS directly, as if I am Varnish [19:18:10] yes, true you are testing ATS only hmm [19:18:38] either there's some kind of ATS bug where cache entries become stuck after you change the routing of a given map destination [19:18:43] or the mapping change is not actually in effect [19:20:32] brett: if you are around I'm reimaging cp5030 now, seems to be working fine with the'--move-vlan' flag [19:20:39] if you have a moment could you take a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1297768 [19:28:26] I'm gonna depool cp1100 for a little bit... [19:28:42] topranks: maybe he is AFK, I +1ed. [19:28:47] blblack: ok :) [19:29:01] topranks: looks good, checked netbox [19:29:02] sukhe: thanks <3 [19:29:21] reimage is going fine too [19:29:47] nice! [19:30:58] sorry, was indeed afk! [19:31:02] thanks for looking [19:33:05] sukhe: ok so my port 3128 testing was faulty I think. You have to explicitly provide the Host header too (to avoid it sticking the port there) [19:35:30] blblack: yeah I was confused there too but I am seeing double now so I thought you were trying something I did not understand [19:35:36] I presume something like [19:36:14] curl -H 'Host: kafka.wikimedia.org' http://127.0.0.1:3128 ? 
[19:37:55] yeah and I'm setting XFP: HTTPS too JIC [19:38:09] so, right now what I see is: [19:38:21] sudo cumin A:cp 'curl -4 -I -H "X-Forwarded-Proto: HTTPS" -H "Host: kafka.wikimedia.org" http://kafka.wikimedia.org:3128/ --resolve kafka.wikimedia.org:3128:127.0.0.1 2>&1 | grep -Ei "(location|server):"' ----> [19:38:22] yep +1 [19:38:32] half get server: istio-envoy [19:38:35] location: https://idp.wikimedia.org/login?service=https%3a%2f%2fkafka.wikimedia.org%2f [19:38:42] half get: [19:38:43] server: Apache/2.4.67 (Debian) [19:38:46] location: https://idp.wikimedia.org/login?service=https%3a%2f%2fkafka.wikimedia.org%2f [19:38:54] and the lone exception is cp6016 with: [19:39:00] server: mw-web.eqiad.main-6d967f8996-rz7x2 [19:39:20] stepping out for a bit for pickup, will be back to resume the fun [19:39:28] and probably all of drmrs is chashing this URL to that ATs [19:40:00] [repooled cp1100, don't need it anymore] [19:41:53] so back on cp6016, I did the most manual thing possible to locally purge just that one URI: [19:41:56] bblack@cp6016:~$ telnet localhost 3128 [19:41:58] Trying ::1... [19:42:01] Connected to localhost. [19:42:03] Escape character is '^]'. [19:42:05] HTTP/1.0 200 OK [19:42:08] PURGE / HTTP/1.0 [19:42:10] Host: kafka.wikimedia.org [19:42:20] and now it has the correct content in ATS [19:43:53] so now to clear the frontend cache hits in drmrs [19:47:34] bblack@cumin1003:~$ sudo cumin 'cp6*.drmrs.wmnet' 'curl -4X PURGE --resolve kafka.wikimedia.org:3127:127.0.0.1 -H "Host: kafka.wikimedia.org" -H "X-Forwarded-Proto: HTTPS" http://kafka.wikimedia.org:3127/' [19:48:59] then:: [19:49:02] bblack@cumin1003:~$ sudo cumin 'A:cp' 'curl -I https://kafka.wikimedia.org/ 2>&1 | grep -Ei "(location|server):"' [19:49:20] this now gives "location: https://idp.wikimedia.org/login?service=https%3a%2f%2fkafka.wikimedia.org%2f" for all global cps [19:49:48] (although I'll note the Server line is still mysteriously-inconsistent, but I suppose inconsequential. 
We still get half saying Apache and half saying istio [19:49:51] ) [19:51:09] but a lot of this is just confusion and line-noise trying to sort out a problem after-the-fact [19:51:46] there's some interesting rabbitholes to chase here, especially why purgeList.php didn't fix it at both layers, and whether that problem is actually at some other layer [19:52:27] but regardless, I think all of this would have proceeded more rationally and involved far less debugging if we ensure the sequence of events is always: [19:53:10] 1) Make sure the applayer (from the CDN's PoV) is actually serving the content expected before touching the CDN (this may have been true in this case, I don't know) [19:53:27] 2) Deploy the routing map change in ATS [19:53:33] 3) Purge the new URIs in ATS [19:53:44] 4) Deploy the Varnish-level change [19:53:50] 5) Purge the new URIs in Varnish [19:54:05] (in theory 3+5 are overkill, but if it's cacheable and someone hit it before it was ready....) [19:55:12] arguably the 2->4 dependency can be handled inside puppet, in the single-backend case (which drmrs is still an exception to) [19:55:40] but if puppet fails anywhere, like it did apparently for backlogged stale VCLs, that's a whole separate issue which will screw up the process [19:56:46] I think in some cases what we may have faced here was a scenario where the initial VCL reload failed (due to the stales), and this blocked the ATS change from even rolling out, allowing the bad cache entries to spread to other site-local hosts easier [20:06:00] sukhe blblack: thanks y'all for all the time you did put into this [20:06:11] from my PoV, things are working, the UI is loading just fine now [20:07:11] envoy vs apache: the traffic goes to kafka-ui.discovery.wmnet, which is configured as an active-active record pointing to both dse-k8s clusters' ingress VIP [20:07:45] it then goes to through the usual ingress path --> envoy-tls-terminator container ---> httpd-cas ---> app [20:08:34] httpd-cas is an 
apache webserver configured to display the CAS login page if the requested does not have a proper auth cookie. We use this when the downstream app does not have oidc capabilities (eg turnilo) [20:08:48] so this is why you were seeing some traces of apache [20:09:33] well it's not important, so I don't want to burn too many wetware tokens on it, but it still seems oddly inconsistent that half our world sees an IDP redirect with "Server: Apache" and the other half with "Server: istio" [20:09:35] hence the `| grep idp` I was running earlier in the day. as `curl` didn't send an auth cookie, httpd-cas would return a 301 to idp.w.o [20:09:54] ah, that, TBH, I don't know why [20:10:37] what do you reckon cause drmrs to reload the config? Was it the purge? [20:11:20] it *almost* looks like the Server variance splits on core DCs (all pops hitting codfw see one variant, all pops hitting eqiad see the other)... except then magru sees both variants, depending on whether the server number is even or odd (which may be a whole separate layer of issue) [20:12:18] brouberol: at the end, I had to do a very manual purge of the kafka URI just on cp6016's ATS, then on all drmrs varnish frontends after that (because they were all getting the bad one from cp6016 before) [20:12:45] Was it something I did that caused this? [20:12:49] drmrs is unique in that it still has cross-node caching dependencies, unlike the other sites [20:13:44] it's a constellation of things.... our traffic stack's puppetization, the oddness of drmrs specifically, the pileup of stale VCLs, etc. [20:14:26] but it would have been at least a simpler failure to track and sort out, if we had separated the ATS and Varnish -level configuration changes into two separate commits, and then rolled out the ATS one fully before trying the Varnish one. [20:14:41] but I don't think that's a documented practice anywhere. probably should be. [20:16:12] The varnish change being the `cache: normal` entry in puppet? 
[20:17:32] This probably my 15-20th addition to these mappings and I’ve always bundled them. I’ll split them in the future! [20:17:39] TIL [20:18:25] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqsin, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11986822 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host cp5030.eqsin.wmnet with O... [20:19:34] woo thanks bblack! [20:19:39] reading the scrollback [20:20:26] so the manual purge worked but not the one via the script from the deploy host [20:20:34] yeah [20:20:44] followups in order of importance, IMHO: [20:21:30] brouberol: yeah, most people have typically bundled them and it has worked out. I suspect one of the things that was different in this case (I am still reading the scrollback) is that drmrs was already somewhat broken with stale VCLs that were _not_ discarded [20:21:37] I wonder if some of that had any bearing [20:21:48] in fact, we have a reload-vcl failure going back to two days that got lost in the scrollback (I checked) [20:21:53] 1. Figure out why telnet can do what our purge tool can't (it may be the tool, or something about kafka or purged or who knows) [20:22:38] 2. Recommend splitting the commit and rolling them out in order in our docs or something, at least for now. It's not always necessary, but when things go south it leaves things in a more-consistent state [20:22:42] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqsin, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11986833 (10cmooney) Ok @BCornwall the reimage seemed to work fine with the `--move-vlan` tag. I updated the IPs in hiera so I thin... [20:23:17] 3. Figure out what caused that stale VCL pileup (if we can, at this point), and set up some kind of monitoring for it recurring anywhere in the future. [20:24:30] 4. 
Take a deep look at the puppetization of cacheproxy nodes' wrt to rolling out these changes in a single commit (long run, it's better that way, and we can go back to recommending it once drmrs is single-backend). Does it consistently enforce a dependency ordering that the ATS change happens first (and is fully reloaded and in effect) before applying the VCL change? [20:25:13] 4) I don't think so because that would mean we have a way of coupling those changes (basically I don't see why it would call both unless we explicitly make that relation). but I can check. [20:25:27] 5. consider adding purge to the rollout process of changes like these, at some layer somehow... [20:26:19] re: 4 - I assume when we push an ATS map change, puppet does do a config reload on it. Ditto the VCL change and the VCL reload. The question is whether in a single commit puppet's deps do it in the right order. [20:27:06] blblack: ok to pool cp1100? [20:27:13] I thought I already did [20:27:15] to close this chapter for today [20:27:39] no worries, pooled now [20:27:43] also repooled cp6013 [20:28:12] thanks for the help blblack! we did not want to carry this over to a Friday :) [20:28:41] (noted the above points for discussion) [20:36:50] 06Traffic: Remove Digicert CAA records from most domains - https://phabricator.wikimedia.org/T428093#11986856 (10BCornwall) a:05BCornwall→03Jgreen [20:56:18] 06Traffic, 06DC-Ops, 10ops-eqsin, 06SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11986915 (10BCornwall) We discussed this and the general consensus seemed to be to just decomm the server and wait for the refresh which is happening shortly anyway. @ssingh Is that accurate, and if s... [20:58:31] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqsin, and 2 others: EQSIN: Setup VRRP on both routers for the new subnets - https://phabricator.wikimedia.org/T427393#11986930 (10BCornwall) Indeed, cp5030 is doing well, thanks!
For the remainder of instances, should there be a separate task (or som...
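(Editor's sketch, not part of the channel log.) The per-host check used during the kafka.wikimedia.org debugging above, grepping each cp host's response for `location:` and `server:`, could be written roughly as below. This is a minimal sketch under stated assumptions: the function name `classify_response` and the labels it returns are hypothetical, and it only encodes what the log reports, namely that the expected answer is a redirect to idp.wikimedia.org and that the bad answer carried a `server: mw-web.…` header. It does not fetch anything itself.

```python
from urllib.parse import urlsplit


def classify_response(status, headers):
    """Label one response as seen in the log above (hypothetical helper).

    status  -- integer HTTP status code
    headers -- dict of response headers (any letter case)
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    location = lowered.get("location", "")
    server = lowered.get("server", "")

    # Expected behaviour per the log: a redirect to the IDP login page,
    # regardless of whether Server says Apache or istio-envoy.
    if 300 <= status < 400 and urlsplit(location).hostname == "idp.wikimedia.org":
        return "ok-idp-redirect"

    # Bad behaviour per the log: the generic landing page, served with a
    # Server header naming an mw-web pod.
    if server.startswith("mw-web."):
        return "bad-mw-web-content"

    return "unknown"


if __name__ == "__main__":
    sample = {
        "Server": "istio-envoy",
        "Location": "https://idp.wikimedia.org/login?service=https%3a%2f%2fkafka.wikimedia.org%2f",
    }
    print(classify_response(302, sample))
```

Keeping the classification separate from the fetch means the same function could be applied to output gathered from ATS on port 3128, from the Varnish frontend, or from the public endpoint, which is the layer-by-layer comparison the discussion ended up doing by hand.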