[00:44:10] Traffic, DNS, Operations: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (leila) [01:37:46] Traffic, DNS, Operations: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (Reedy) Are there any subdomains etc? Or does the domain only need to point at `171.64.75.80`? [03:17:21] Traffic, DNS, Operations: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (BBlack) I'm assuming that, for now, the hosting of the web service (and email?) is not moving, just the whois ownership and DNS service? We usually need a fair bit more information tha... [10:12:27] Traffic, Operations: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (TheDJ) Question. https://wikitech.wikimedia.org/wiki/HTTPS/Browser_Recommendations Windows 7: I know it CAN support TLS 1.2, but I can't figure out if Microsoft released a patch to... 
[10:22:32] Traffic, Operations: Monitor and plot TTFB as seen by Varnish frontends - https://phabricator.wikimedia.org/T240180 (ema) Open→Resolved a:ema Done: https://grafana.wikimedia.org/d/7-ZqK8-Wz/varnish-frontend-ttfb-comparison?orgId=1&from=now-15m&to=now [11:10:23] Traffic, Operations, User-jbond: Check if traffic servers need restarting/reloading post CA change - https://phabricator.wikimedia.org/T240330 (jbond) [11:10:36] Traffic, Operations, User-jbond: Check if traffic servers need restarting/reloading post CA change - https://phabricator.wikimedia.org/T240330 (jbond) p:Triage→Normal [11:11:08] Traffic, Operations, User-jbond: Check if traffic servers need restarting/reloading post CA change - https://phabricator.wikimedia.org/T240330 (jbond) [11:13:50] Traffic, Operations, User-jbond: Check if traffic servers need restarting/reloading post CA change - https://phabricator.wikimedia.org/T240330 (jbond) [11:37:47] Traffic, Operations, Pybal, SRE-tools, serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (akosiaris) >>! In T239392#5701649, @akosiaris wrote: > `need to be able to understand the... [11:40:54] Traffic, DNS, Operations, Research: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (jcrespo) a:leila Assigning to @leila as per BBlack and Reedy comments, as there seems to be some additional information required. Please feel free to reassign to the r... [11:43:22] Traffic, Operations, Prod-Kubernetes, Pybal, serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (mark) I agree - it seems that PyBal adds no real value here, because it's essentially load balancing the k8s load balanc... 
[11:59:40] Traffic, Operations, Prod-Kubernetes, Pybal, serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (akosiaris) >>! In T238909#5727644, @mark wrote: > I agree - it seems that PyBal adds no real value here, because it's es... [12:04:06] Traffic, Operations, Prod-Kubernetes, Pybal, serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (mark) >>! In T238909#5727693, @akosiaris wrote: >>>! In T238909#5727644, @mark wrote: >> I agree - it seems that PyBal a... [12:08:16] Traffic, Operations, Prod-Kubernetes, Pybal, serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (akosiaris) >>! In T238909#5727698, @mark wrote: >>>! In T238909#5727693, @akosiaris wrote: > >> True. We could investig... [13:26:12] Traffic, DNS, Operations: redirect wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (Bugreporter) [13:31:06] Traffic, DNS, Operations: redirect wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (Reedy) Or just stop using it completely? ;) [13:32:59] bblack: I think this should be equivalent to the VCL re match? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/556185/ [13:34:44] ema: yeah +1 . The Vary case is a little trickier (but probably also less-likely to be fooled, I don't think we have very many varies in use in practice) [13:34:57] Traffic, DNS, Operations: redirect wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (Ammarpad) May be the question to ask is do we //really// need that? 
[13:35:38] Traffic, Operations, User-jbond: Check if traffic servers need restarting/reloading post CA change - https://phabricator.wikimedia.org/T240330 (jbond) ema, confirmed that the traffic servers did need a restart which has now been performed [13:35:44] Traffic, Operations, User-jbond: Check if traffic servers need restarting/reloading post CA change - https://phabricator.wikimedia.org/T240330 (jbond) Open→Resolved [13:35:56] but just getting some kind of word-barrier on it might be nice [13:36:26] I don't think lua patterns have an equivalent of \b though [13:37:44] there's [%p%s] which should match punctuation/spaces AFAIU [13:37:57] but we also have to match possible start/end as well [13:38:28] IOW the vary contents could be just 'Cookie' [13:38:43] or 'Cookie, Foo', or 'Foo, Cookie', or 'Foo, Cookie, Bar' [13:41:45] Traffic, DNS, Operations: redirect wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (Bugreporter) I don't know any way to get the number of tried resolutions (or views) of a non-existent domain. [13:43:29] the docs on patterns are not awesome, but it sounds like () works like regexen, maybe [13:43:39] I don't see a | though [13:48:58] I don't think there's a | [13:49:43] https://stackoverflow.com/questions/10438358/what-is-the-alternation-operator-in-lua-patterns [13:56:36] yeah [13:57:24] which makes replicating something like this tricky: beresp.http.Vary ~ "(?i)(^|,)\s*Cookie\s*(,|$)" [13:57:40] unless they allow ^$ to be in char classes in their pattern language (I doubt it) [13:57:50] nothing is tricky with an arbitrary amount of ORs! [14:00:18] I wonder if lua [^%w] would match start/end? 
I guess I could test [14:05:10] the lua docs don't even say what some of the metachars do heh [14:05:24] like they call out $ as a special char, but never say why [14:05:49] (but it seems to be the same meaning as regex, while () are apparently not) [14:06:24] https://www.lua.org/pil/20.2.html [14:06:36] > If a pattern begins with a `^´, it will match only at the beginning of the subject string. Similarly, if it ends with a `$´, it will match only at the end of the subject string. [14:07:34] oh that's different than the 20.2 link I googled heh [14:08:53] even in that one, they don't really seem to explain parens [14:09:07] they're listed as special, and they even talk about using %( and such for literal ones [14:09:14] but I don't see a description of what () means [14:11:24] oh they're just captures, so they're only documented in the page about captures :P [14:11:37] they don't seem to group for (foo)? [14:14:24] ooooh meanwhile I found out why setting TS_LUA_CONFIG_HTTP_CACHE_HTTP to 0 did not work in _read_response(): on that side you need ts.http.set_server_resp_no_store(1) [14:14:46] so now we can likely get rid of the header hiding stuff [14:15:25] "did not work" meaning we cached things that shouldn't be cached? 
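The VCL check quoted earlier, `beresp.http.Vary ~ "(?i)(^|,)\s*Cookie\s*(,|$)"`, depends on alternation, which Lua patterns do not have. One way to sidestep that is to avoid a single pattern altogether: split the header value on commas and compare each trimmed token. The following is only a minimal sketch of that idea in Python, not the ATS Lua code under review; the function name `vary_has_cookie` is a hypothetical one chosen for illustration.

```python
def vary_has_cookie(vary):
    """Return True if a Vary header value lists Cookie as one of its tokens.

    Hypothetical helper: mirrors the intent of the VCL regex
    (?i)(^|,)\\s*Cookie\\s*(,|$) without needing alternation.
    """
    if not vary:
        # Missing or empty header: nothing varies on Cookie.
        return False
    # Split the comma-separated list, trim optional whitespace around each
    # token, and compare case-insensitively against the exact word.
    return any(token.strip().lower() == "cookie" for token in vary.split(","))
```

Because whole tokens are compared, values such as `Cookies` or `X-Cookie-Thing` do not match, which is the word-barrier behaviour asked for in the discussion. The same split-and-compare shape should carry over to Lua, though that port is not shown here.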
[14:15:44] oh nevermind, I follow now [14:16:14] when I tested on a labs instance a while back it cached stuff despite setting TS_LUA_CONFIG_HTTP_CACHE_HTTP to 0 in read_response() [14:16:21] right [14:16:35] so I concluded it was broken and hiding headers was the best course of action [14:17:21] so on the other end of this, I couldn't find a way in the do_global_cache_lookup_complete to say "don't use the object you just found" [14:17:27] but I'm guessing there must be one [14:18:04] you can access the fields of the object for the condition, though, I did find that [14:19:49] we'd probably need to find out how many cookie requests for non-vary:cookie responses we're getting [14:20:26] (to see if it's worth paying the price of cache lookup and possibly coalescing vs likelihood of getting a hit) [14:23:21] the price of a cache lookup is assumed cheap, IMHO, otherwise cache hits would be awful :) [14:23:33] but what I mean is something along the lines of: [14:23:45] https://phabricator.wikimedia.org/P9844 [14:24:42] I'm assuming we can still avoid coalescing the same way we do now in e.g. "function pass()", but the hard part is ignoring the found object first. [14:26:11] but in general, there's probably a ton of common objects that don't vary:cookie, which VCL today lets auth'd users get cache hits on [14:26:38] (and in the future, hopefully that trend increases with content composition, etc) [14:26:58] e.g. restbase html outputs for apps? [14:27:07] turns out that hit-for-pass really isn't a bad idea :) [14:27:53] yeah what I'm describing here is basically manual HFP, when used in combination with the response-side stuff you're talking about above [14:28:11] but instead of storing an HFP object, you're just checking the conditions that would've led to the HFP creation, on both ends. [14:28:18] right [14:28:42] (which really even bare HFP doesn't solve this case, it's the combination of HFP and how varnish vary-slots the cache that makes it work) [14:29:06] which... 
hmmm [14:29:16] I assume ATS also does vary-slotting out of the box too [14:30:21] do we do cookie-value-hiding in ATS at all? otherwise ATS's presumed vary-slotting is hurting all of this too [14:31:23] I guess with the default.lua we have today it wouldn't matter [14:31:41] but trying to improve on the stuff I'm talking about above, I think it would start mattering [14:32:01] (and we'd have to do the "replace the session cookie with Token=1" hack to avoid inefficiencies) [14:32:48] dream wishlist: our future URI namespace delineates login-sensitive content by-path [14:33:34] e.g. only en.wikipedia.org/api/v1/sessioned/ cares about session cookies, and /api/v1/anythingelse does not [14:33:49] that would be way easier than trying to deal with Vary outputs on responses [14:34:19] yes my understanding is that we do not need the cookie hiding hack right now as we don't lookup cookie requests (and there's no hfp object to share) [14:34:27] right [14:34:49] and really, replacing it with Token=1 is very specific to true HFP objects [14:35:12] but we might have to temporarily delete it in some cases for cacheable objects [14:35:38] right now, also, the response-side code is only letting cache objects be created by anonymous users [14:36:05] (in the case of a Vary:Cookie'd URI) [14:36:19] so yeah it should be fine in the state it's in now [14:37:13] so it's just my paste-idea that would have to deal with that case, I think [14:37:35] maybe, maybe not [14:37:41] this stuff is hard to reason about [14:39:52] so you can sort of make a mental 2x2 matrix [14:40:12] does the request have a session cookie is one axis, and does the response have a vary:cookie header is the other [14:40:40] right now, in the no/no quadrant, we make cache objects and we hit on them [14:41:02] in the yes/yes quadrant, we neither look for nor store objects [14:41:08] both of those things are Good [14:41:34] in the case of vary:cookie but no session cookie in the request, we make objects and we hit 
them, which is also good [14:42:25] the last quadrant is the one that's not ideal: if there's a session cookie but the URI doesn't return vary:cookie: we'll store the response in a cache slot for future hits by unsessioned requests, but we won't hit on such an object for sessioned clients. [14:42:32] Traffic, DNS, Operations: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (Aklapper) [14:43:03] Traffic, DNS, Operations: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (Aklapper) @Bugreporter: What does a namespace on a wiki have to do with a subdomain? [14:45:51] actually it's even a little bit sillier [14:46:06] for a URI /foo which doesn't Vary:Cookie: [14:46:27] anonymous fetches will hit an existing cache object, and if they miss they'll store an object for future hits by whomever [14:47:05] sessioned fetches will not use an existing cache object, but will store the response as an object for future hits by whomever, regardless of whether there was already such an object [14:47:15] every sessioned request for it replaces an existing object that didn't need replacing :) [14:50:18] ah I hadn't thought about that! [14:52:26] we'll call it pre-refreshing and pretend it's a feature [14:53:25] Traffic, DNS, Operations: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (jcrespo) I am just here doing clinic duty for the #operations tag. #traffic should decide on this ticket, but based on my (limited) understanding of our... 
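The 2x2 matrix walked through above (session cookie on the request versus Vary:Cookie on the response) can be written down as a small truth table. This is only a sketch of the behaviour as described in the conversation, assuming the response is otherwise cacheable; `CacheBehavior` and `current_behavior` are hypothetical names, not identifiers from default.lua.

```python
from typing import NamedTuple


class CacheBehavior(NamedTuple):
    lookup: bool  # may the request be answered from an existing cache object?
    store: bool   # may the (otherwise cacheable) response be stored?


def current_behavior(has_session_cookie: bool, varies_on_cookie: bool) -> CacheBehavior:
    """Model of the behaviour described in the chat, one result per quadrant."""
    # The lookup decision is made at request time, before the response
    # (and therefore its Vary header) is known: sessioned requests skip it.
    lookup = not has_session_cookie
    # Only the sessioned + Vary:Cookie quadrant is kept out of the cache.
    store = not (has_session_cookie and varies_on_cookie)
    return CacheBehavior(lookup, store)
```

The quadrant called "not ideal" is `current_behavior(True, False)`: no lookup, yet the response is stored, so each sessioned request replaces an object it was never allowed to hit.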
[14:55:21] :) [14:55:39] I'm peeking at varnish-fe bereqs to see what kind of URIs/traffic match this condition commonly [14:56:13] (on one cache for like 60s) [14:56:37] ack, I've been using "timeout 60 varnishncsa" very often myself lately :) [14:56:56] bblack: when you have a minute I've a followup question from one of your DNS patches from yesterday [14:57:16] oh I should probably also cut out uncacheable responses in general, since they don't matter to this [14:57:19] volans: ? [15:00:53] it's related to the diffscan email, basically it seems that as a result port 53 is now closed on the host's IPs of the authdns (as opposed to the service IPs) [15:01:00] wanted to check if that's intentional [15:01:25] yes, intentional [15:01:33] also because I think I need to adapt some bits in the dnsdiscovery module of spicerack [15:01:59] that was checking that changes were actually online querying directly the authdns servers [15:02:00] it is trying to audit all the authdns for results of a change? [15:02:05] yes [15:02:11] or just confirm that one did? [15:02:14] I guess now I need to audit the recdns instead [15:02:23] I mean the authdns colocated with the recdns hosts [15:02:26] so different cumin alias [15:02:28] right [15:02:32] and maybe different port? 
[15:02:33] well [15:02:48] A:dns-auth still works (that alias hits all things which contain an authserver) [15:02:53] but yes, use port 5353 [15:03:01] ok, just a port change [15:03:08] that's the per-host port to query authdns directly and not recdns [15:03:33] hmmm now I wonder if I probably missed something there for acme-chief too [15:03:43] I think it has a similarly-zealous check [15:04:43] :-) [15:04:49] I'll take care of the spicerack one [15:05:56] actually I think acme-chief is actually-ok, I think its checks rely on routing and just use the canonical ns[012] IPs [15:07:07] ema: https://phabricator.wikimedia.org/P9845 [15:07:30] for cp1075 for those 60s, it was 171 reqs, so we're ~3/sec rate on a single fe [15:07:42] the cases are basically restbase outputs and resourceloader [15:08:22] those were all reqs where the request had a session cookie, the response lacked vary:cookie, the frontend missed or whatever and asked the backend layer for it, and the frontend stored the object (was cacheable) [15:11:22] Traffic, Operations: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (Ahecht) Currently, the user experience for someone seeing an error message such as the one at https://en.wikipedia.org/sec-warning is quite poor. To find out what they actually have... [15:13:59] bblack: very interesting. I can do a similar check by adding Vary to logging.yaml on ats-be, doing [15:15:58] bblack: how "stable" do you think this port 5353 setting will be? 
[15:16:21] or how temporary :) [15:20:41] as stable as anything else in the puppet repo: until such future time as someone changes it [15:20:58] ok perfect [15:21:25] the reason for it is that the blended-role boxes have recdns+authdns, and only one can own $hostip:53 [15:21:54] the $hostip:53, in both cases, being primarily for monitoring/verification, but one had to move Elsewhere [15:22:00] yep [15:22:18] so for the boxes that have only authdns [15:22:29] we have 53 for the "service" ip and 5353 on the hostip? [15:23:46] right, same as the boxes that have authdns+recdns [15:24:05] nothing is listening on $hostip:53 on the authdns-only boxes right now [15:24:52] and where we do the 53->5353 conversion? [15:25:04] conversion? [15:26:00] right now the puppetization is still due for a lot of cleanup, the "5353" port number definition for the listeners and the monitoring are spread across two different layers of abstraction, like they shouldn't [15:26:09] if you mean where can you find that number in puppet or hieradata or something [15:28:10] sorry I meant from ns0:53 to host:5353 [15:28:26] bblack: how do we know that for all the responses in P9845 the frontend stored the object? [15:28:43] storage !~ transient? [15:28:56] ema: yes (which isn't perfect, but all the cases I looked at it "worked") [15:29:22] volans: there is no conversion? 
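The port split explained above (authdns answers on port 53 of the service IPs, and on port 5353 of each host's own IP, because recdns may own $hostip:53) can be captured in a few lines. This is a minimal sketch assuming the caller already knows whether an address is a service IP; `authdns_endpoint` and `dig_command` are hypothetical helpers, and the addresses in the usage note are documentation-range placeholders, not real servers.

```python
AUTHDNS_SERVICE_PORT = 53     # service IPs (the canonical ns0/ns1/ns2 addresses)
AUTHDNS_PER_HOST_PORT = 5353  # host IP: reaches authdns directly, not recdns


def authdns_endpoint(ip, is_service_ip):
    """Return the (ip, port) pair to query authdns at the given address."""
    port = AUTHDNS_SERVICE_PORT if is_service_ip else AUTHDNS_PER_HOST_PORT
    return (ip, port)


def dig_command(ip, port, name):
    """Build a dig invocation (argument list) for a manual spot check."""
    return ["dig", f"@{ip}", "-p", str(port), name, "+short"]
```

For example, `dig_command(*authdns_endpoint("198.51.100.7", False), "example.org")` yields a per-host query on port 5353, which is the kind of per-server verification the spicerack check needs.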
[15:30:09] volans: https://phabricator.wikimedia.org/P9846 [15:30:25] maybe the live contents of the file make it make more sense (ipv4 loopbacks shown above it for clarity) [15:30:45] I was looking at an authdns* host [15:30:50] so I clearly got it wrong :D [15:30:56] it doesn't matter [15:31:09] the authdns parts are configured identically on authdnsNNNN and dnsNNNN [15:31:12] bblack: https://phabricator.wikimedia.org/P9847 [15:31:14] ok [15:31:36] now I get it [15:31:37] thanks [15:31:43] bblack: this would be 14/sec, but it includes uncacheable stuff (eg: Set-Cookie) [15:32:55] well either way about the rate, RL and RB are cases that matter, presumably [15:33:24] (they actually fit the bill of cacheable, non-varying, potentially-hot, and logged-in sessions want to hit them too) [15:33:44] for some value of "hot" given this is the backend layer [15:34:01] I think there's a quite-long tail of different combinations of RL params [15:34:48] volans: the goal is to only have the combined-role boxes, eventually there won't be auth-only boxes. we'll just re-puppet authdns[12]001 to join the dnsbox set [15:35:02] ack [15:35:13] and 5353 will always be the authdns part of them [15:35:35] for things that want to monitor/verify on a per-server basis and need to speak with authdns rather than recdns, yes [15:35:56] Sounds good, patch coming soon™ [15:41:42] hmmm the only reason I didn't put the combined role on the authdnsNNNN yet, is basically I don't want them joining the anycast pool yet (because of the 1G issues and risks) [15:41:52] sure [15:41:55] it might be simpler just to put them in, and do the hieradata to disable bird advertising [15:43:08] but I donno, there's thinking to do on all related things, maybe best left for after the hw upgrades and the holidays [15:44:47] in the meantime though, I can at least continue doing cleanups [15:44:51] bblack: I must be missing something very obvious re:P9844 ... why do we need to throw away the object if it has Vary:Cookie? 
Isn't it gonna be in a different Vary-slot anyways? [15:46:02] that's a good question [15:47:21] so basically we can assume that if the object has Vary:Cookie, we wouldn't have had a hit in this stanza anyways, right? Because the cookie is present and looking for an alternate vary-slot that doesn't exist (because we don't cache the varied responses on the storing side later) [15:47:26] right? [15:48:28] and if the above is true.... then we actually don't need any code at all on the request side? [15:48:32] surely I'm missing something [15:48:58] I guess we'd lose anti-coalesce though [15:49:07] but we're choosing to disable that globally anyways [15:49:21] so uh, hmmm [15:49:21] the only possible hit we can have for a vary:cookie response to a cookie:xxx request is if the cached object is in vary-slot:xxx right? [15:49:35] I'd assume so, assuming ATS has vary-slotting and it works [15:50:11] leaving aside the fact that now we do not store vary:cookie responses so the hit isn't gonna happen [15:50:21] we wouldn't want to anyways [15:50:31] right [15:51:04] so really we just want an alternate version of pass() called nocoalesce() or whatever [15:51:39] in the case of seeing a session cookie at do read_request() time, we want to ensure coalesce is off (in case we later don't globally disable it), but we do want to let it use a cache hit if such a thing exists [15:51:48] (which would implicitly be a non-vary-cookie cache hit) [15:52:46] the answer to the alternate support question is "yes, up to 3 versions by default": https://docs.trafficserver.apache.org/en/latest/admin-guide/configuration/cache-basics.en.html#caching-http-alternates [15:53:05] lol [15:53:17] anyways, we're effectively not using that, at least for the cookie case [15:53:32] we do have the Vary:AE cases to think about separately, but probably not commonly >3 anyways [15:53:57] and hopefully no Vary:UA which would be insane anyways [15:54:12] well all Vary is basically-insane, from our POV [15:54:22] 
anytime the app wants to Vary, we have to do something special on our end [15:54:35] but the "up to 3" part deserves at least a bit of giggling [15:54:37] (at a bare minimum, sanitize + normalize the possible/expected inputs from the client) [15:56:34] so essentially we're now thinking that we can change pass() into do_not_coalesce() aren't we [15:57:36] well, for this one case [15:57:45] might want to keep pass() as it is for other uses? [15:58:11] Authorization seems more like a pass() kind of thing, I donno if I want to think through how it would not be and mess up [15:58:13] the only other use is reqheader:Authorization, and yes we probably want to keep that [15:58:42] so just extract a do_not_coalesce(), use it from pass() and keep both [15:59:03] leaving aside the fact that Read-While-Writer can't be disabled from Lua as far as I understand (there's no TS_LUA_CONFIG_DISABLE_RWW or similar that I could find) [15:59:06] sometimes complicated things turn out to have very simple answers once you get to the bottom, it seems! [15:59:22] so we can disable "the other coalesce" (the one kicking in before response headers are received) [15:59:37] right, better than nothing here [15:59:51] I kind of assume for the backend case it's likely we'll keep on with the global coalesce kill [16:00:01] this will all get more interesting for the fe case! :) [16:00:30] I've applied the coalesce kill to text@esams only for now BTW [16:00:32] "only" :) [16:01:11] the change in hits we're talking about here may have indirect impact on the responsestart metrics too (at least the hit/miss/pass-split ones) [16:02:21] it could even conceivably make a positive impact in the overall p75 or whatever, but I'm not holding my breath [16:03:20] worth a shot though! 
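The request-side outcome reached above (keep pass() for Authorization, use a separate do_not_coalesce() for session cookies so that a non-varying cache hit is still allowed) amounts to a three-way decision. The sketch below models only that decision in Python; it is not the Lua change itself, `Action` and `request_action` are hypothetical names, and how a session cookie is detected is left to the caller as an assumption.

```python
from enum import Enum


class Action(Enum):
    PASS = "pass"                # bypass the cache entirely (and do not coalesce)
    NO_COALESCE = "no_coalesce"  # cache lookup allowed, request coalescing off
    DEFAULT = "default"          # normal cache lookup and coalescing


def request_action(headers, has_session_cookie):
    """Pick the request-side action for a mapping of request headers.

    has_session_cookie is supplied by the caller; the detection rule is
    outside the scope of this sketch.
    """
    names = {name.lower() for name in headers}
    if "authorization" in names:
        # Authorization stays a full pass, as agreed in the discussion.
        return Action.PASS
    if has_session_cookie:
        # Sessioned requests may still use an existing non-varying object.
        return Action.NO_COALESCE
    return Action.DEFAULT
```

Checking Authorization first keeps the stricter behaviour when both conditions hold, which matches the stated wish not to rethink how Authorization is handled.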
[16:05:02] for that matter it's possible the Token= stuff did too, I didn't try to look [16:05:20] but I think there are some inexact-match cookies floating around, it could've been making passes out of things that shouldn't have been [16:26:32] bblack: let me know if this seems sane (the commit message too!) :) https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/556217 [16:28:26] LGTM [16:29:01] ack then testing that patch and https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/556201/ tomorrow [16:29:07] o/ [16:29:45] cya! :) [16:33:02] Traffic, Operations: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (TheDJ) @Ahecht this check doesn't care about browsers, because their behavior is not consistent. It only cares about which ACTUAL protocol you are using. Doing user-agents checks for... [17:16:33] ugh I had forgotten about the self-restoring wikimedia-lvs-realserver config :P [18:10:50] Traffic, Operations, Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (BBlack) Status: The actual LVS portion of this is now completely removed globally. The IP addresses themselves are also completely unconfigured and removed from service at all the edge sites, but...