[00:22:27] Traffic, DNS, Operations, Toolforge, cloud-services-team (Kanban): Update authoritative nameservers for the toolforge.org domain to point to Designate - https://phabricator.wikimedia.org/T235303 (Krenair)
[01:53:45] Traffic, Operations, Phabricator, Release-Engineering-Team-TODO, and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (Krinkle)
[04:06:07] Traffic, MediaWiki-REST-API, Operations, Parsoid-PHP, and 2 others: Varnish/ATS should not decode URIs for /w/rest.php - https://phabricator.wikimedia.org/T235478 (mobrovac) >>! In T235478#5576774, @ema wrote: > This quarter we will carry on with the conversion of the cache_text cluster from Varn...
[09:23:50] Traffic, Operations, Patch-For-Review: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 (Vgutierrez) As reported to upstream [[ https://github.com/apache/trafficserver/issues/6018#issuecomment-542590620 | here ]]...
[09:54:27] ema, bblack, I'd love your comments on https://phabricator.wikimedia.org/T234887#5578895
[10:00:02] vgutierrez: do we even want POST to be buffered? By default (and in all cases here today, anyway) POST can't be cached. Either way a slow POST will be inefficient, it's just a question of which resources it's inefficient with.
[10:00:22] bblack: so that's the current behaviour with nginx
[10:00:46] ATS's lack of POST buffering results in a higher rate of failed fetches
[10:01:05] (if your layers do no buffering, the waste is in open low-bandwidth connections live-streaming the slow POST upload through. if they buffer fully, you avoid some of those low-bandwidth idle-ish connections on the inside, but you potentially consume huge amounts of RAM at the buffering layer)
[10:01:26] right now nginx buffers in memory up to 64k and up to 100M on disk
[10:01:32] right
[10:01:48] and we have a hack at the OS layer to put the "disk" cache in RAM too
[10:01:52] TS doesn't do that, and it forwards the connection to varnish-fe as soon as possible
[10:02:36] so with an incomplete POST request, nginx stops it and it never hits varnish
[10:03:07] I guess we'd have to game out the various scenarios to make sense of this
[10:03:10] with ats, it hits varnish, and because it's a POST it's passed up to the app layer
[10:03:20] are the failing cases all eventually-incomplete (as in the client is behaving awfully?)
[10:03:23] and it's there up to 63 seconds
[10:03:43] that's the ttfb timeout on varnish-fe
[10:04:00] ttf[response]b?
[10:04:14] it's wrongly reported as a failed fetch but that's a varnish "bug": https://github.com/varnishcache/varnish-cache/issues/2629
[10:05:30] anyways, what I'm driving at in general here is that rather than trying to make ATS behave similarly to how we've historically had nginx configured, maybe we should re-evaluate the whys and see if it's really necessary to buffer
[10:05:50] (vs e.g. adjusting timeouts or interpreting reasonable error results differently)
[10:06:19] sure
[10:07:43] my (perhaps naive) line of thinking at present is that it should be better to not buffer POST uploads. It's too easy for that to exhaust memory buffering in the edge nodes (or slow them down with disk, etc). Most layers should be efficient at handling the increased connection volume from letting it stream through if they're event-based, etc
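For reference, a minimal sketch of the slow-POST case under discussion, assuming nothing about the real setup beyond an HTTPS endpoint that accepts POST (the hostname and path below are placeholders): throttling the upload makes it easy to see which layer is holding the request, since with full buffering the inner layers only see it once the body is complete, while with streaming the connection stays open end to end for the whole upload.

    # create a small dummy request body (~512 KiB)
    dd if=/dev/zero of=/tmp/post-body bs=1k count=512
    # ~50s upload at 10 kB/s; -v shows when each hop starts talking back
    # (test.example.org and the path are placeholders, not the real targets)
    curl -sv -o /dev/null -X POST --limit-rate 10k \
        --data-binary @/tmp/post-body \
        'https://test.example.org/w/api.php'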
[10:08:00] but yes, it will change how we interpret and deal with timeouts at the inner layers (inside of the previous buffering)
[10:09:35] but all my use of "buffering" above means "attempting to store-and-forward the whole POST data chunk", not the kind of small buffering that's inevitable (buffering small chunks before forwarding while still effectively "streaming" the data through, which may be what some related parameters are tuning)
[10:09:59] yeah, I was using it like that as well
[10:12:55] anyways, we can at least tune on this stuff through all of our layers and probably handle things effectively either way (adjusting timeouts and/or conn/thread limits at various layers of ours), but probably the most impactful input here would be knowing if it's going to create a problem for MediaWiki
[10:14:10] it may be the case that the traffic layer doing store+forward of POST (regardless of which part of our stack does it) before starting a txn with mediawiki is an important defense because MW does scale well on slow POST inputs from many parallel clients
[10:14:24] s/does/doesn't/
[10:15:37] (and so, given our rate/parallelism of public POST inputs towards mediawiki, switching from buffering to non-buffering may increase POST parallelism dramatically down there in MW)
[10:16:11] (which is true in all our layers too, but I'm pretty sure we can tune for it and handle it, but maybe php/MW not so much?)
[10:20:01] --- anyways, rewinding a bit to the original problem in the ticket, let me see if I understand the flow correctly
[10:20:58] old: nginx buffers incoming POST bodies (subject to some ??? timeout there for a slow upload), then forwards the whole POST request quickly down through the rest of the layers, no timeout/error expected...
[10:22:20] new: ATS isn't buffering, spools POST body inwards slowly. Varnish streams it inwards as well (v-fe -> v-be|ats-be -> applayer), but while varnish is still waiting for the very slow client to finish up sending the POST body (during which time obviously the applayer can't respond yet), we hit the 63s first_byte_timeout of Varnish, basically saying it gave up waiting on the applayer response to the
[10:22:26] request?
[10:23:31] (if so, this really seems like a varnish bug at its root. it shouldn't be considering POST-upload time in the response time_first_byte timeout. POST-upload should be some other timer and the time_first_byte clock starts after the whole request is sent)
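As a side note, a quick way to confirm where that 63s comes from on a cache host is sketched below; the VCL path, the instance naming and the exact log message text are assumptions rather than checked production values.

    # global default; an explicit .first_byte_timeout on the backend definition
    # in VCL overrides it
    sudo varnishadm param.show first_byte_timeout
    # look for a per-backend setting in the deployed VCL (path is a guess)
    sudo grep -rn first_byte_timeout /etc/varnish/
    # watch the resulting fetch failures live (exact message text may differ by
    # version; add -n <instance> if the frontend varnish runs as a named instance)
    sudo varnishlog -g request -q 'FetchError ~ "first byte timeout"'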
[10:25:28] since there doesn't seem to be such a separate timer, I guess s/bug/design bug/
[10:25:55] varnish also complains when ATS decides to abort the request after the communication against v-fe began, cause ATS stops waiting for a response on that
[10:26:29] the simplest way of reproducing it is with something like: curl -X POST -d "test" -H "Content-Length: 10" using HTTP/2
[10:26:45] so that triggers an HTTP/2 GOAWAY error on the ATS layer
[10:26:55] but the request hits varnish-fe anyways
[10:27:12] so poor varnish waits a minute for the POST data to be completed until it decides to kill the conneciton
[10:27:16] *connection
[10:27:40] and of course, being a POST, you get all that up to eqiad
[10:28:01] hmmm
[10:28:04] https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.10.14/logback?id=AW3JvgVox3rdj6D8rtVx&_g=h@44136fa
[10:28:17] that's an example as viewed by cp1077 from a test request originated on cp5007
[10:28:35] note the time-bereq => ~60 secs
[10:29:36] ok so in that case, I assume what's happening is:
[10:30:38] 1) ATS forwards the basic POST request data as-is, with CL:10 and only 4 data bytes, straight through to varnish and beyond.
[10:31:09] 2) curl signals some kind of end-of-stream with 6 bytes missing, so ATS does the H/2 GOAWAY (client<->ATS are now done)
[10:31:45] 3) .... but inexplicably, ATS maintains the open connection to v-fe at this point, waiting on the response? ... this seems like a (possibly separate) issue.
[10:32:04] (because if ATS dropped the backend side too, I think the varnishes would all give up quickly)
[10:33:32] what's the destination URI you're using for your curl example?
[10:33:44] nevermind, I see it from logstash
[10:33:52] /w/api.php?action=cspreport&format=json&reportonly=1&vgutierrez=1
[10:33:53] yeah
[10:34:48] so.. from my local environment I'd say that on 3 ATS actually closes the connection to the origin server... cause I've seen a RST from ATS side when it gets the response
[10:35:36] but maybe it's worth the test on the labs environment where the setup is more close to production
[10:35:42] s/more close/closer/
[10:37:53] vgutierrez: I assume (I don't remember though) we ended up configuring the ats-tls layer for 1:1 connection mapping (every client conn gets one unique v-fe conn)?
[10:38:00] yes, that's right
[10:39:03] getting rid of the BS at that layer with ats-tls -> v-fe conns will be a really nice win once ats-tls morphs into a caching ats-fe :)
[10:40:18] yup, and at that point 127.0.0.1 traffic can be replaced with UNIX sockets
[10:46:54] I don't think we'll even have any
[10:47:33] it will just be ats-fe (acting as tls + fe cache/logic) chashing to ats-be using the real IPs. 1/N of those will happen to be the local host, but it won't be over loopback.
[10:48:29] anyways, yeah, maybe look at exactly what that behavior is towards v-fe on GOAWAY
[10:48:53] maybe there's some bug on one side or the other that's causing varnish to hang there when it should be immediately stopping as well.
[10:49:33] same thing can be achieved with HTTP/1.1
[10:49:38] it will help understanding what's going on, but I assume the real organic cases aren't straight-up errors like that, but probably slow POST uploads that don't complete before some timer
[10:50:12] with http/1.1 it's triggered like this: "timeout 5 curl --http1.1 -X POST -d "test" -H "Content-Length: 10" ....
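Putting the two reproductions together with the URI visible in the logstash entry (the hostname is a placeholder; the original tests were run against specific cp hosts):

    URI='/w/api.php?action=cspreport&format=json&reportonly=1&vgutierrez=1'
    HOST='test.example.org'   # placeholder for a text-cluster hostname
    # HTTP/2: only 4 body bytes ever arrive although Content-Length claims 10,
    # so ATS ends up answering the client with a GOAWAY
    curl -sv --http2 -X POST -d "test" -H "Content-Length: 10" "https://${HOST}${URI}"
    # HTTP/1.1: same mismatch, with timeout(1) killing curl after 5s so the
    # body never completes
    timeout 5 curl -sv --http1.1 -X POST -d "test" -H "Content-Length: 10" "https://${HOST}${URI}"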
[10:52:04] the big question in my mind is: are we really failing on important requests that previously succeeded? Or are we just exposing corner cases that were hidden before (e.g. nginx timed out or killed the client conn while waiting on data relatively silently, and now it's exposed as a varnish fetcherr, but either way there was never any success involved)
[10:53:30] I'd say it's the latter
[10:53:50] we are now seeing failed fetches for something that nginx handled silently
[10:56:48] it's also an increase in our 5xx stats for something that's most likely a 4xx
[11:02:18] if you look on turnilo for 5xx responses with a request time from 50 to 60 secs you will see cp4027 and cp5007
[11:02:26] the two ats-tls text nodes
[11:07:52] funny enough, in the last 24 hours two esams text nodes are there as well
[11:08:08] interesting, cause ulsfo and eqsin requests won't go via esams
[12:35:13] hey bblack - did you get my ping? I'd like to talk about https://phabricator.wikimedia.org/T233609. We (Reading Web) have a couple of questions regarding what is feasible.
[14:15:43] Traffic, Commons, MediaWiki-File-management, Multimedia, and 2 others: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (Elitre) @Gilles FWIW, I just got: Request from 176.207.117.69 via cp3038 frontend, Varni...
[14:37:20] raynor: every time we've attacked such a problem in the context of "server side", it's been a pretty deep and un-fruitful rabbit-hole.
[14:39:20] I'd second nuria's suggestion that this is better handled in a way that's transparent to caching, rather than trying to serve unique pageviews for the same URI based on some client option.
[14:40:55] (see also all the history on generic A/B testing mechanisms with server-side support in T135762 that was linked early on in the new ticket - TL;DR - it's pretty complex and hacky to support it, and it's hard to make it statistically valid without causing privacy issues, etc)
[14:40:55] T135762: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762
[14:41:29] we're trying to explore all possible options, it's far far future, but for now we want to know what's feasible and what's not
[14:42:06] bblack, what if instead of doing an A/B test we change wgDefaultSkin to something else on one small wiki
[14:42:37] to be clear with examples, there are two basic ways it can work that I'm calling server-side and client-side A/B testing of things (within a wiki, or within all wikis, for the anonymous users).
[14:42:49] context - we're going to work on the Desktop Refresh project, we started an RFC about implementing a new skin, let's call it temporarily Vector2
[14:42:51] The server-side variant looks something like this:
[14:43:41] 1) Some mechanism (e.g. an opt-out button for opt-out, or for A/B testing some random function) sets a client-side cookie for the B-case.
[14:43:58] 2) Varnish splits the HTML caching based on this cookie
[14:44:15] 3) MediaWiki offers differing HTML outputs for the same /wiki/Foo based on this cookie
[14:44:31] the client-side variant looks more like:
[14:45:19] 1) Some mechanism purely on the client side (opt-out: js sets some localstorage/cookie, A/B: js does some random function to self-select). Only the client really knows if it's in or out.
[14:45:33] 2) Varnish doesn't do any splitting of unique outputs under a singular URI, and is unaware.
[14:46:24] 3a) MediaWiki serves up whatever's necessary for both variants from a single URI and the client chooses the thing to use (might work for very small things, e.g. small layout changes or js snippets).
[14:46:28] OR:
[14:46:39] 3b) MediaWiki serves up the varying bits from distinct URIs, and the client-side JS chooses which to load
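To make the contrast concrete, here is a rough from-the-outside check of what the server-side variant implies; the cookie name, hostname and debug headers are hypothetical, not an existing mechanism.

    # same URI, two cookie values; if the cache really splits on the cookie,
    # each variant should show its own cache metadata (Age, X-Cache or similar)
    curl -sI -H 'Cookie: skin-ab=A' 'https://test.example.org/wiki/Foo' | grep -iE '^(age|x-cache)'
    curl -sI -H 'Cookie: skin-ab=B' 'https://test.example.org/wiki/Foo' | grep -iE '^(age|x-cache)'
    # in the client-side variant both requests map to the same cached object,
    # and only the in-browser JS decides which experience to render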
[14:49:18] ----
[14:49:32] bblack, the problem is that we want to overhaul a pretty big chunk of the UI, most probably the client-side A/B test won't provide us what we want
[14:49:52] skins are loaded by resourceloader, no?
[14:50:24] (in which case, client-side JS is already in charge of loading the skin... and surely a local random or opt-out decision can control that?)
[14:50:46] I think that some parts of the UI are loaded on the first request
[14:51:12] also, to make our life easier we might start working on a Vector fork
[14:51:26] it's hard any way you slice it :)
[14:51:37] for logged-in users it shouldn't be a problem, opt-in/opt-out is only changing the user skin
[14:51:54] but having two different kinds of output for the same basic URI also causes lots of pain on our end, and then cf all the problems from the old ticket.
[14:52:04] bblack, out of curiosity, how would our system work if we decide to change wgDefaultSkin on some smaller wiki
[14:52:56] I think that would work (modulo purging that wiki on deploy, probably), but it doesn't make for a good test because of cultural/community splits between wikis (a test that says users of grwiki love it doesn't say anything about enwiki)
[14:54:25] modulo purging - do you mean we have to purge all cached articles for that wiki? and by modulo - like in batches?
[14:54:50] we can purge them all at once, and for a small wiki it's not too impactful
[14:55:29] this is because parts of the UI are embedded in the initial page output itself, as you mentioned before, and it would be a jarring and/or broken experience if that changed based on pre- vs post-change cached outputs as you navigate from page to page
[14:55:59] ok, what about enwiki? if we go with the new-skin approach, that means we might want to change the enwiki skin in like 2yr
[14:56:17] is it feasible? might be a blocker for a new skin.
[14:56:27] in two years we won't have the foundational problems of blended UI+content outputs in the first place, would be my hope
[14:56:48] I'm coincidentally heading into the frontend architecture working group meeting in a couple minutes, which is talking about the long view on these very issues
[14:57:35] the real underlying problem is how enmeshed our content and UI stuff is today, and the goal is to split these things apart on clear interface boundaries such that UI experimentation and improvement (and diversity) has almost nothing to do with content output/caching and such.
[14:57:52] Jon Robson and Jan Drewniak are in that working group if I'm right (both are from our team and will work on the Desktop Refresh project)
[14:57:53] having that in place first makes all of this complexity disappear
[14:58:51] makes sense
[14:58:59] I think if you're looking far out (years), we should attack these things in the right order: fix the foundational bits to make UI work/experimentation/etc easier, instead of hacking on the mess we have today :)
[14:59:19] we're just prepping for a pretty big project, first we're trying to understand how everything works and what the possible blockers/bumps along the way are
[15:00:02] better to find out what can be achieved, instead of hitting the wall after a year of work
[15:00:21] so in short - to summarize the conversation
[15:00:45] in general, for the anonymous-user case in today's world, it tends to open up huge cans of worms if we want to differentiate the html output of e.g. /wiki/Foo (through the caches) based on a randomized A/B or opt-out.
[15:01:02] - it's possible to A/B test anon users, but there are privacy concerns and it makes the caching layer much more complex
[15:01:15] it is possible, but the solution is likely to involve a lot of work, end up looking like a duct-taped mess, and quite likely it will be statistically invalid
[15:01:22] and then yes, the privacy concerns
[15:01:24] - if possible do not do server-side A/B tests, only client-side
[15:02:05] - if we decide to go with a new skin, then after changing the default skin we should purge the varnish cache for that wiki
[15:03:05] and the last question, I think the answer is "no", but I prefer to ask - does this change based on how many wikis we are testing on?
[15:03:33] we need to do all those hacks anyway, so then there's no difference if we A/B test on 10 small wikis or 2 medium ones?
[15:04:33] the only concern is the available cache size, so we do not hit constant cache eviction
[15:04:51] the hacks would be to actually support anonymous A/B within a wiki (or all wikis). Changing the skin of one wiki isn't hacky, and just needs the purge
[15:05:26] using a cookie for A/B tests -> that's hacky. sorry, I should be more clear
[15:05:27] the purging is a matter of impact. Small wikis are no big deal. A sudden purge of all of enwiki might be nearly outage-inducing if we're not careful
[15:05:56] I didn't mean changing the default skin, just using a cookie for opt-in/opt-out
[15:06:08] changing the default skin is more like an "Idea B"
[15:06:31] because as you said - it doesn't give us meaningful data, it will be only a single wiki.
[17:40:20] Domains, Traffic, DNS, Operations, and 2 others: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060 (sbassett) p:Triage→Normal
[17:42:45] HTTPS, Traffic, Operations, Research, and 4 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276 (sbassett)
[17:44:14] Wikimedia-Apache-configuration, Operations, Privacy, Security: Apache 2.4 exposes server status page by default? - https://phabricator.wikimedia.org/T113090 (sbassett)
[23:20:47] Traffic, Operations, ops-esams: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (Papaul) p:Normal→High