[11:26:37] 10Traffic, 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment: Make our docker registry public - https://phabricator.wikimedia.org/T150168#2775898 (10Joe)
[11:27:04] 10Traffic, 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment: Make our docker registry public - https://phabricator.wikimedia.org/T150168#2775912 (10Joe) p:05Triage>03Normal
[11:27:22] 10Traffic, 06Operations, 13Patch-For-Review: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#2775914 (10Danielsberger) Ok, here are the new results for cache sizes between 50GB and 400GB. For now, I only looked at the Filter and Exp admission policies. Disclaimer: The...
[11:35:35] 10Traffic, 06Operations: reimage cp4016 and cp1055 - https://phabricator.wikimedia.org/T149843#2775933 (10ema) 05Open>03Resolved a:03ema The hosts have been reimaged on 2016-11-02.
[11:52:24] 10Traffic, 06Analytics-Kanban, 06Operations: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2775954 (10elukey) Quick summary: Brandon and Ema debugged the upload issue and figured out that it was related to the absence of a...
[12:10:35] 10Traffic, 06Analytics-Kanban, 06Operations: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2775972 (10elukey) From http://book.varnish-software.com/4.0/chapters/Tuning.html: ``` Varnish operates with multiple pools of thread...
[12:53:33] elukey: thread_pool_queue_limit would be an issue if varnish ran out of available threads from thread_pool_max * thread_pools to accept new connections. So in our normal config today, about 16020 concurrent connections.
[12:53:54] (and then the failure, if that were the limit, would be the dropping of incoming connections that were over the limit)
[12:56:03] 10Traffic, 06Operations, 13Patch-For-Review: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187#2776071 (10Danielsberger) [[ https://github.com/dasebe/libvmod-cacheadmission/blob/master/vcl/ExpLRU.vcl | Here ]]'s some VCL/inline C that implements the exp-size admission poli...
[12:57:51] bblack: hi! I was speculating that the waiting list could have played a role in filling up the thread queue, it was the only thing that made a bit of sense from my tests
[12:57:59] ah I see
[12:58:30] but in your test in the ticket, you set the varnish thread params at thread_pool_max=10
[12:59:24] so with the default of 2x pools, and queue_limit of 20, that means after 10+10+20=40 connections open up in parallel, further concurrent connections will be dropped if they arrive before another finishes up.
[12:59:35] and you told apachebench to use 1000 concurrent connections
[12:59:43] so you'd kind of expect that even without a waitlist problem
[13:00:55] oh yes sorry the test is only to show a way to repro consistently
[13:01:09] going off on a different track: I don't think the object in question (the AdvancedSiteNotices case) is being treated as a hit-for-pass properly in the first place
[13:02:03] what I was observing debugging over the weekend is that each time varnishd fetches the object, the logs indicate it goes through our hit-for-pass code and creates an uncacheable 601s object, which further requests should hit on in their lookup and then log a pass-action rather than a miss (miss is how you end up on a waitinglist)
[13:02:33] but even immediately after observing the creation of the uncacheable/hit-for-pass 601s object, a second request to the same varnish logs the lookup result as a "miss"
[13:02:42] that's the mystery at the core of this, I think.
[13:03:57] it will still show up in X-Cache as a "pass", because every time it misses we create a hit-for-pass and flag the request which created the hit-for-pass as "pass", but it's having to create a new hit-for-pass on every miss instead of reusing them like it should
[13:04:06] (last one from the previous track - with ab set to 1k conns in parallel I was expecting a similar behavior from Varnish, but at least with some VSL tag indicating that Varnish dropped the request on purpose. But ReqAcct set to zero bytes transmitted is also a bit confusing)
[13:04:50] does varnish give any indication that the transaction is "done" that can be relied on?
[13:05:06] ReqAcct 95 0 95 0 0 0 ?
[13:05:32] well, if nothing else there is an "End"
[13:05:50] this one afaics says that no bytes are returned/transmitted, but only received... And as far as I can see, a 503 is returned
[13:05:59] yeah End is the only one
[13:06:00] is "End" really in the shmlog, or just something varnishlog itself tacks on the displayed output?
[13:06:35] I think it is in the shmlog, because IIRC vk matches it in the code. Checking
[13:06:56] https://github.com/wikimedia/varnishkafka/blob/master/varnishkafka.c#L1626
[13:07:01] SLT_End should be it
[13:08:01] yeah so, I think in the much-broader picture, perhaps SLT_Begin and SLT_End are your markers of a complete transaction which are universal from vk's perspective
[13:08:19] it's not that a transaction is "incomplete", it's that the completed transaction record lacks expected data
[13:08:51] (which we may want to flag as some kind of error by default, and maybe tease out different scenarios to record them as specific kinds of errors)
[13:09:43] a Resp timestamp is apparently not a guarantee
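For reference, the hit-for-pass creation described at 13:02 is, in Varnish 4 terms, roughly the following idiom (a minimal sketch of the pattern with the 601s TTL from the discussion, not the actual production VCL; the backend declaration is a placeholder just to make it loadable):

```
vcl 4.0;

# hypothetical placeholder backend
backend default { .host = "127.0.0.1"; .port = "8080"; }

sub vcl_backend_response {
    if (beresp.ttl <= 0s) {
        # v4's replacement for v3's return(hit_for_pass): an uncacheable
        # marker object with a 601s lifetime. Later lookups should hit it
        # and take the pass path instead of the miss/waitinglist path.
        set beresp.ttl = 601s;
        set beresp.uncacheable = true;
        return (deliver);
    }
}
```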
[13:39:29] I just tried another on cp4018, and was able to get another repro of course. this one took nearly 10 minutes though
[13:39:36] real 9m51.464s
[13:40:01] frontend side:
[13:40:04] - 208619406 VCL_call c HASH
[13:40:04] - 208619406 VCL_return c lookup
[13:40:04] - 208619406 Timestamp c Waitinglist: 1478525492.308061 591.222750 591.222750
[13:40:07] - 208619406 VCL_call c MISS
[13:40:10] - 208619406 ReqHeader c X-CDIS: miss
[13:40:12] - 208619406 VCL_return c fetch
[13:40:14] - 208619406 Link c bereq 208850458 fetch
[13:40:34] backend side:
[13:40:38] - 208850458 BerespHeader b X-Cache-Int: cp2001 pass, cp4010 pass
[13:40:38] - 208850458 BerespHeader b Connection: keep-alive
[13:40:38] - 208850458 TTL b RFC 0 10 -1 1478525493 1478525493 1478525492 0 0
[13:40:40] - 208850458 VCL_call b BACKEND_RESPONSE
[13:40:43] - 208850458 TTL b VCL 0 10 0 1478525493
[13:40:45] - 208850458 TTL b VCL 0 3600 0 1478525493
[13:40:48] - 208850458 TTL b VCL 601 3600 0 1478525493
[13:40:50] - 208850458 BerespHeader b X-CDIS: pass
[13:40:53] - 208850458 VCL_return b deliver
[13:40:55] - 208850458 Storage b malloc Transient
[13:40:58] (as in, -c and -b of the "frontend" instance in these cases)
[13:41:01] Transient has to be a big hint here
[13:41:12] a 601s hfp shouldn't be going into transient, right?
[13:42:43] one possible issue here (aside from "our related VCL logic is just messed up under v4") is that varnish is making the transient-storage decision based on the immediate beresp TTL rather than the one VCL fixes up in backend_response
[14:19:22] after a chat with fgs on #varnish it seems that my repro case hits https://github.com/varnishcache/varnish-cache/blob/master/bin/varnishd/http1/cache_http1_fsm.c#L179
[14:19:41] with -p debug+=waitinglist I can see the kills logged
[14:19:52] and from what he says the connection in this case is dropped
[14:20:06] so the absence of ReqAcct and Timestamp:Resp is consistent
[14:20:47] now I have no idea if the logs that I am seeing in text are related to waitinglist kills, but it should be something similar I suppose?
[14:21:18] given we know we have a waitinglist problem, seems likely
[14:21:25] bblack: hello :)
[14:21:36] bblack: ok with routing text esams to codfw and starting the esams upgrade?
[14:21:48] ema: no
[14:21:57] ema: we may already be causing subtle problems in ulsfo
[14:22:12] something is not right with hit-for-pass traffic
[14:22:18] ok
[14:29:14] throwing out all other higher-level observations and focusing on the most-puzzling/isolated/reproducible thing
[14:29:31] is the stuff I pasted just above starting with 13:39 < bblack> I just tried another on cp4018, and was able to get another repro of course. this one took nearly 10 minutes though
[14:29:42] I have cp4018 depooled to test on it myself right now
[14:30:40] The "AdvancedSiteNotices" URL that elukey has been linking
[14:31:02] on the -b side of the frontend instance, after fetching it, the log output looks like it creates a hit-for-pass for 601s
[14:31:16] but the next request for the same thing, on the -c side, does lookup->miss, not lookup->pass
[14:31:26] (and then creates yet another hit-for-pass for 601s, supposedly)
[14:31:44] bblack: quick q
[14:31:52] just to confirm that I'm not missing/forgetting something
[14:32:05] lvs1007-lvs1012 do not carry traffic, right?
[14:32:25] they're all configured but not carrying traffic because there are no BGP sessions set up with them at the router level, correct?
[14:32:43] paravoid: correct. lvs100[789] are marked primary and 10-12 as secondary, but they have a parameter set in pybal config from a hieradata just for 7-12, which tells pybal not to talk bgp.
[14:33:06] aah, hiera
[14:33:24] grumble grumble
[14:33:47] where is that?
[14:33:54] even worse, it's in regex.yaml :)
[14:34:00] lvs_temp_bgp_disable: __regex: !ruby/regexp /^lvs10(0[7-9]|1[0-2])\.eqiad\.wmnet$/ lvs::configuration::bgp: no
[14:34:01] argh :)
[14:34:03] found it yeah
[14:34:15] I like the "temp" in there too :)
[14:34:19] heh
[14:34:22] it's been very temporary!
[14:34:31] anyway, I'll ask chris to move those to asw2 now
[14:34:46] their row D links?
[14:34:48] yes
[14:34:51] awesome
[14:35:07] it's been so long I don't even remember the rest of their state
[14:35:20] yeah :(
[14:35:29] we may have disabled some of their ports in other rows during some testing, or who knows what. pretty much have to audit it all again after the D move
[14:36:09] (and almost certainly even if basic network links look ok and no SNMP strangeness, they need reinstalls all over again before thinking about traffic)
[14:36:38] yeah, I wasn't planning to turn them active anyway :)
[14:36:43] ok :)
[14:37:55] we're not done with asw2 either, quite a few steps ahead of us
[14:38:13] but hopefully by Wednesday we can start using e.g. these LVS servers
[14:38:55] I assume we're assuming that whatever the SNMP or port issues were (I think some row D ports we couldn't even get to come up?), it was all related to the old switches?
[14:40:01] yes
[14:40:23] asw2 is an entirely different series of switches, very different JunOS too
[14:41:10] both in terms of series and of age (early R of 12.3 vs. a 2016-02 build of 14.1)
[14:41:20] much much better switches too :)
[15:06:30] I started questioning general TTL logic, after seeing some TTL oddities in the really bad requests
[15:06:54] so on depooled cp4018 where I have total control, I can request a /wiki/whatever that hasn't been seen before by this frontend
[15:07:36] (it's been seen by the backend more than 4 times though, so one-hit-wonder doesn't come into play)
[15:08:00] it's a cache miss in this frontend, and then on the -b side of the frontend, for the fetch->(store+deliver), I see this:
[15:08:06] (just the relevant headers):
[15:08:15] - 3448188 BerespHeader b Date: Mon, 07 Nov 2016 13:37:08 GMT
[15:08:16] - 3448188 BerespHeader b Server: mw1263.eqiad.wmnet
[15:08:27] - 3448188 BerespHeader b Cache-control: s-maxage=1209600, must-revalidate, max-age=0
[15:08:32] - 3448188 BerespHeader b Last-Modified: Mon, 07 Nov 2016 13:06:09 GMT
[15:08:43] - 3448188 BerespHeader b Age: 5000
[15:08:47] - 3448188 BerespHeader b X-Cache-Int: cp2013 miss, cp4009 hit/7
[15:09:02] - 3448188 TTL b RFC 1209600 10 -1 1478530828 1478525828 1478525828 0 1209600
[15:09:05] - 3448188 VCL_call b BACKEND_RESPONSE
[15:09:08] - 3448188 TTL b VCL 91400 10 0 1478525828
[15:09:10] - 3448188 TTL b VCL 91400 3600 0 1478525828
[15:09:13] - 3448188 BereqUnset b Cookie:
[15:09:15] - 3448188 VCL_return b deliver
[15:09:15] so I can tell you what I've parsed from that so far...
[15:09:18] - 3448188 Storage b malloc s0
[15:09:47] first off: 1478525828 == Mon Nov 7 13:37:08 2016, which is obviously from the Date: header
[15:09:57] (it's when varnish first fetched+cached this from mediawiki)
[15:10:21] and 1478530828 == Mon Nov 7 15:00:28 2016, which is the time of my request to this frontend
[15:11:07] the gap between those two numbers is 5000, which is how Age: 5000 was derived by the backend to send to us, and also the same Age: we'd be using here in the frontend
[15:11:36] (and AFAICS, the Last-Modified doesn't come into play at all, but I included it for completeness)
[15:11:50] so moving on to the TTL lines
[15:13:22] "RFC" or "VCL" | TTL | Grace | Keep | Reference time for TTL | Age (incl Age: header value) | Date header | Expires header | Max-Age from Cache-Control header
[15:13:33] ^ is the format/meaning of those fields in the docs
[15:15:08] so the RFC TTL line makes sense: it reflects the cache-control s-maxage as the RFC TTL, default grace, has our request time as the reference time for TTL,
[15:15:29] I don't get why "Age (including header value)" is set to the Date-field there, so I donno wtf
[15:15:39] anyways, moving on to the next two though
[15:16:12] well, this was where my mystery was, but talking out loud solved it
[15:16:13] :)
[15:16:29] ha :)
[15:16:37] so it's using the Date-field value (not the request-time value) as the timestamp reference for the TTL in those lines
[15:16:57] and we cap frontend TTL at 1d. It's interpreting our beresp.ttl = 1d as "now + 1d", but the object is already 5K seconds old
[15:17:06] so that's why the TTL in that line is >1d
[15:17:20] it's 1d + 5000, and the reference time is 5000 in the past, which is equivalent to now+1d
[15:17:24] confusing as fuck
[15:18:36] also, I thought we could do /wiki/Foo?whatever=whateverelse to get unique objects for debugging before, that acted like normal articles...
[15:18:49] now it seems like any time I add an invalid parameter, mediawiki returns no-cache headers :(
[15:19:50] it's probably a more legit way for MW to operate, since cacheability probably depends on parameters, and an unknown parameter might mean something shouldn't be cacheable but we just don't know that in the right places, but still
[15:21:31] in any case, even in isolated (depooled), and with a fresh restart (so the only other queries coming through here are purges and monitoring-queries)
[15:21:43] I still see the same repro on that URL
[15:21:51] - 637102 ReqURL c /w/index.php?title=Template%3AAdvancedSiteNotices%2Fajax&action=render
[15:21:59] - 637102 VCL_return c lookup
[15:21:59] - 637102 VCL_call c MISS
[15:21:59] - 637102 ReqHeader c X-CDIS: miss
[15:22:07] and then on the backend side for the fetch for this:
[15:22:35] - 354940 BerespHeader b Cache-control: private, must-revalidate, max-age=0
[15:22:47] - 354940 TTL b RFC 0 10 -1 1478532042 1478532042 1478532041 0 0
[15:22:50] - 354940 VCL_call b BACKEND_RESPONSE
[15:22:52] - 354940 TTL b VCL 0 10 0 1478532042
[15:22:55] - 354940 TTL b VCL 0 3600 0 1478532042
[15:22:57] - 354940 TTL b VCL 601 3600 0 1478532042
[15:23:00] - 354940 BerespHeader b X-CDIS: pass
[15:23:02] - 354940 VCL_return b deliver
[15:23:05] - 354940 Storage b malloc Transient
[15:23:08] that's almost exactly what I'd expect the first time through, other than the unexpected "Transient"
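Reading the last two TTL lines of that paste against the field order quoted at 15:13:22, the values decode as follows (annotations inferred from that field list, not present in the original log):

```
TTL b RFC 0 10 -1 1478532042 1478532042 1478532041 0 0
#   source=RFC  ttl=0  grace=10  keep=-1  ref=1478532042
#   age-field=1478532042  date=1478532041  expires=0  cc-max-age=0

TTL b VCL 601 3600 0 1478532042
#   source=VCL  ttl=601  grace=3600  keep=0  ref=1478532042
#   (i.e. the 601s hfp TTL set in vcl_backend_response)
```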
[15:23:32] the problem is when I make a second request right after that, the exact same trace on both sides happens. It's a "miss", not a "pass" hitting a created hit-for-pass object
[15:23:58] I think Transient is a clue there. the hfp is being created as some kind of temporary object and not making it to real storage for future requests
[15:24:11] there's apparently a storage hint to transient for hfp https://www.varnish-cache.org/lists/pipermail/varnish-commit/2013-May/009590.html
[15:24:39] yeah but: if (bo->exp.ttl < cache_param->shortlived ||
[15:24:54] "shortlived" is 20s by default and we don't change it
[15:25:08] sorry, 10s
[15:25:13] and the TTL on this object is 601s
[15:25:47] oh wait, I missed the point of that comment
[15:25:50] || pass = 1
[15:25:59] grrr || pass == 1
[15:26:09] yep
[15:26:25] but... we do expect all actual pass-traffic to use transient storage
[15:27:29] for instance, it was a known thing with our VCL in v3 (that we didn't care much about the inefficiency of, vs complexity of fixing it): that traffic that matched our hit-for-pass VCL, when it came back through as a pass (due to hit-for-pass created previously), it also created another hit-for-pass (in temporary storage for this request), pointlessly, because it matched the same conditions.
[15:27:48] but the *first* time through, it should be a "miss", not a "pass", and thus should create an hfp in real storage
[15:29:32] in any case, the code in that email doesn't exist in 4.1 anymore
[15:29:41] at least, not in the same place
[15:30:04] right I was looking for it now in 4.x and couldn't find it
[15:30:18] if (bo->uncacheable || lifetime < cache_param->shortlived)
[15:30:18] storage_hint = TRANSIENT_STORAGE;
[15:30:28] it's in bin/varnishd/cache/cache_fetch.c
[15:30:47] even so...
[15:31:17] so going back to what I knew about varnish3 behavior and "transient", there's two possible confusing meanings of transient I think
[15:31:45] the actual "transient storage", which is by default an unlimited malloc and is "real" storage in some sense, where objects with <10s TTL get held temporarily
[15:32:16] but also if a request is a "pass", you'll see it allocate storage, but it's really storage that's private to that one request and not visible to other requests, and I think that may have also been called transient
[15:32:25] storage_hint = TRANSIENT_STORAGE seems like the former?
[15:32:34] but the behavior we're seeing is more like it's using the latter
[15:33:06] I think they called those anonymous objects or something
[15:33:15] (when creating objects that are local to 1x pass-request)
[15:33:25] but I think varnishlog showed "transient" in that case as well
[15:33:54] still, the logic there in cache_fetch at least indicates that the authors are considering bo->uncacheable as if it were a pass, or something....
[15:34:28] from that snippet it seems that all uncacheable objects get a hint to transient
[15:34:30] but if it was really getting stored to the real transient storage pool, you'd think a followup request right afterwards would use the hit-for-pass
[15:36:03] lol, here's an interesting undocumented behavior. for objects that get the transient hint above:
[15:36:06] if (oc->exp.ttl > cache_param->shortlived)
[15:36:09] oc->exp.ttl = cache_param->shortlived;
[15:36:18] wat
[15:36:18] so not only is 10s the limit for entry, it's also the cap on TTLs of objects that enter into it
[15:36:46] I guess it makes sense to "defend" the unlimited-malloc transient storage against long-life objects that aren't intended to be there
[15:37:06] but why are long-life objects going to transient anyways? why is my 601s "uncacheable" hfp going there?
[15:38:58] maybe the documentation's simple explanation of s/hit_for_pass/beresp.uncacheable/ just isn't true, and beresp.uncacheable isn't really hfp in the V3 sense anymore?
[15:39:27] I did notice a while back, that some Varnish5-specific commits talk about resurrecting hit-for-pass
[15:39:31] (I'm not sure in what sense)
[15:39:55] it's been renamed to hit-for-miss
[15:42:23] fascinating :P
[15:43:05] I wonder why? (why they're creating a new hit-for-miss? is v4 halfway through some transition where traditional hfp just doesn't work like it used to and there is no replacement?)
[15:43:55] related change in v5: https://github.com/varnishcache/varnish-cache/commit/d9b0919109b402d60d5978d8ab5f0fe370699785
[15:44:24] in any case, I can do my 2x requests back-to-back (definitely under 10s total), and I still get misses on both requests (in the sense that lookup returns "miss", and it calls vcl_miss and not vcl_pass)
[15:44:47] so it's not just a case of the hfp being limited to 10s in transient, it's also not an object shared between requests
[15:47:38] > A hit-for-pass object (we still call it like this in the docs, logging and statistics) will now cause a cache-miss for all subsequent requests, so if any backend response qualifies for caching, it will get cached and subsequent requests will be hits.
[15:47:46] https://www.varnish-cache.org/docs/5.0/whats-new/changes-5.0.html#hit-for-pass-is-now-actually-hit-for-miss
[15:47:52] that's for v5 though
[15:48:31] ema: I tried adding a !bereq.uncacheable to our hit-for-pass creator code, but we still went through it on both requests. So it does consider the 2nd request (which is inexplicably a miss) to be !bereq.uncacheable
[15:49:20] I wonder why they're calling it a miss instead of a pass? that screws up our stats, too, as we wanted them to be "pass" so we can statistically count them as the passes that they are (hit-for-pass)
[15:49:28] but that's all in v5 only anyways
[15:50:17] wait so the 2nd request is cacheable?
[15:51:32] yes, because it's a miss. it doesn't find out it's uncacheable until it gets the backend response with the CC:no-cache headers
[15:52:03] which is why it's coalescing and stalling, because it's a normal cacheable miss at the early stage, when it should be a hit-for-pass->pass->uncacheable thing
[15:52:38] what's really puzzling, going back to your last link with the "what's new", is the language he's using there
[15:52:38] I'm with you. The question then is why doesn't it hit the hfp object and pass
[15:53:14] he's talking about hit-for-pass almost as if v4 didn't exist. He's comparing v5 to v3, basically, whereas v4 doesn't even have a hit_for_pass return, it was renamed uncacheable and it's not mentioned...
[15:54:35] 10netops, 06Operations: HTCP purges flood across CODFW - https://phabricator.wikimedia.org/T133387#2776585 (10akosiaris) FTR, this still holds true today. There isn't really any reason it should have been fixed, just noting it.
[15:55:24] ema: another thing I've confirmed: this isn't special to that URL from elukey. I can try other "normal" hit-for-pass cases and they're screwed up the same way
[15:55:58] (HFP creation happens in "malloc Transient", and second requests (even <10s later) do a full miss (which implies undesirable coalesce) and create another one all over again.)
[15:56:40] my confirmation test was a fetch of enwiki Main_page with -H 'Cookie: Token=asdf', which makes varnish believe I'm a logged-in user. So even those are now suffering coalesce when they shouldn't.
[15:57:48] we must have something wrong about how you create hit-for-pass objects in v4
[15:58:19] because surely they still exist, and surely they're vtc-tested in the main repo, and surely the docs are wrong that beresp.ttl=X+beresp.uncacheable == old_hit_for_pass
[16:01:00] bin/varnishtest/tests/r01818.vtc <- seems to imply that hit_for_pass behavior "works" for beresp.ttl=X+beresp.uncacheable=true ...
[16:02:01] well maybe, it's complicated
[16:05:26] r01821.vtc is also interesting
[16:05:41] txresp -bodylen 65535 <- but still ending up in transient
[16:09:30] bblack: I've modified r00425.vtc to use our frontend VCL -> cp1008:~ema/hfp.vtc
[16:10:35] the test passes and running varnishtest with -v it does output XXXX HIT-FOR-PASS as debugging info
[16:18:03] maybe only same session/client?
[16:18:20] I still don't really understand the scope of Transient storage I guess
[16:18:44] I mean, it seems to be just a specially-named storage backend, which happens to use the generic malloc stevedore, and is configured without limits
[16:19:06] you'd expect objects going into it to be normal cache objects and hittable from other future requests, other than the short TTL cap.
[16:19:22] yet we know "transient" is documented as the place anonymous objects (e.g. synthetic responses) store things, too
[16:19:41] we know it's almost certainly where request-scoped temporaries for actual pass-traffic go too
[16:20:30] but why does miss->uncacheable+ttl ("hit-for-pass") end up in transient (well, because of that conditional pasted earlier), and is it also request-scoped and temporary, or supposed to be a "real" object?
[16:21:16] I was trying to see if there was some specific conditional somewhere like "if storage == transient, do not insert into hash for future lookups (and/or let this object die at the end of this request and be private)"
[16:21:19] but haven't found that yet
[16:26:35] bblack: take a look at hfp-2.vtc in cp1008:~ema
[16:26:58] with two clients, the test passes only with a delay at the beginning of client c2
[16:27:39] (same behavior with either our text VCL or stock VCL)
[16:28:35] when it fails they're all miss instead of hfp?
[16:28:53] well that could be a race I guess
[16:29:17] 10netops, 06Operations: Migrate links from cr1-eqiad/cr2-eqiad fpc 5 to fpc 3 - https://phabricator.wikimedia.org/T149196#2776789 (10mark) a:03mark All ports have been moved off of FPC5 on both routers, and all configuration for FPC5 ports has been removed. The Equinix Ashburn IXP port is still awaiting the...
[16:29:31] you'd have to use their sync/sema stuff to ensure c1 creates the hfp before c2 tries to use it
[16:30:05] (but then again, they should coalesce anyways until the hfp gets created...)
[16:33:13] anyways, even if we ignore the whole thing about whether they hit the hfp at all or not
[16:33:39] it still seems clear in the current source code that hfp TTLs would be capped to "shortlived" (10s), which is also bad
[16:33:52] but if I could even get them to hit in under 10s that would be something. they don't in live testing
[16:35:22] mmh, the test fails with c1 timing out on rx (and 2 hfp)
[16:37:49] *** v1 0.0 debug| /bin/sh: 1: exec: varnishd: not found\n
[16:37:54] ^ how are you manually executing varnishtest?
[16:38:06] you need to be root
[16:38:34] lol
[16:57:15] ema: it has to do with Vary
[16:57:44] oh
[16:58:32] I'm re-testing some things...
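The hfp-2.vtc test mentioned at 16:26 isn't shown in the log; a reconstruction of the general shape of such a test (using the X-CDIS marker convention from the VCL discussed above, and sequential clients instead of a delay to avoid the race) might look like:

```
varnishtest "hit-for-pass should be shared between clients"

server s1 -repeat 2 {
    rxreq
    txresp -hdr "Cache-Control: private" -hdr "Vary: Cookie"
} -start

varnish v1 -vcl+backend {
    sub vcl_miss { set req.http.X-CDIS = "miss"; }
    sub vcl_pass { set req.http.X-CDIS = "pass"; }
    sub vcl_backend_response {
        if (beresp.http.Cache-Control ~ "private") {
            set beresp.ttl = 601s;
            set beresp.uncacheable = true;
            return (deliver);
        }
    }
    sub vcl_deliver { set resp.http.X-CDIS = req.http.X-CDIS; }
} -start

client c1 {
    txreq
    rxresp
    expect resp.http.X-CDIS == "miss"
} -run

client c2 {
    # request #2, well inside the 601s hfp window: if the hfp object is
    # really shared, this should be a pass rather than another miss
    txreq
    rxresp
    expect resp.http.X-CDIS == "pass"
} -run
```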
[16:58:57] ok so it's specific to both Vary and our text VCL
[16:59:15] look at ~bblack/bbhfp.vtc
[16:59:27] I copied in the headers from an existing failing response
[16:59:32] (only in the first server response)
[16:59:49] as the file is now, it passes, because it's using stripped-down VCL (just default and a 601s hfp if CC~private)
[17:00:03] but if you comment out varnish-v1 and use the one that includes our text VCL, it fails
[17:00:12] and if you remove the Vary: header from s1 output, it passes again
[17:01:15] with just one client, nice
[17:01:23] so... something about how we're handling Vary in our VCL is mucking things up (or things related to Vary... this may come down to whether the applicable headers are being set to "" instead of undef and such things...)
[17:01:44] bblack: ops meeting
[17:01:54] bleh
[17:01:59] :)
[17:20:36] so we had the workaround earlier for Cookie Vary slots working right
[17:20:55] maybe that hack "fixed" the Cookie case, but doesn't solve the general issue for other Varys
[17:21:12] (which are all borked because of something to do with req vs bereq and/or "" vs under)
[17:21:18] s/under/sundef/
[17:21:32] s/sundef/undef/ ? :)
[17:23:32] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2777040 (10Gilles) >>! In T66214#2772934, @Tgr wrote: > Projects which can get away with being Wikimedia-only (such as the mobile apps) could just use the...
[17:23:32] yeah maybe
[17:24:02] the other Varies in the repro cases being Accept-Encoding and Authorization
[17:24:52] we don't ever set those, or unset them
[17:25:10] in one case we conditionally "pass" on a regex match of the client-sent Authorization, but that's it
[17:26:16] the OAuth one you mean?
[17:27:28] yeah
[17:43:54] it may still be some oddity of our Cookie code
[17:44:22] (perhaps even in the case of no cookies, we're causing some differentiation in what (be)?req.http.Cookie looks like for Vary purposes (undef, empty, real-value))
[17:45:12] but it clearly doesn't affect normal caching of normal cookie-less requests... so it's also something specific to hfp/uncacheable...
[17:46:09] so with Vary: Accept-Encoding,Authorization the test passes
[17:46:18] good call :)
[17:47:27] yeah confirmed the same, and also just "Vary: Cookie" fails
[17:47:40] so it's something specific to the Cookie-handling code + Vary + hfp ....
[17:49:06] let me quickly try reverting https://gerrit.wikimedia.org/r/#/c/319561/2/modules/varnish/templates/text-common.inc.vcl.erb on pinkunicorn
[17:49:15] we'll have to remember: after we get a test like this to pass by fixing whatever up, re-test that 601s hfp lasts > 10s (shortlived)
[17:49:32] ema: pinkunicorn is puppet-disabled and nginx is dead, I was halfway through some other junk there
[17:49:45] but manually maybe, yeah
[17:49:54] bblack: ok yeah I'll do that manually
[17:52:08] bblack: fails anyways with Cookie in Vary
[17:53:00] try commenting out all the cookie-munging BS in text-common?
[17:53:25] (also, might want to validate what builtin.vcl does with Cookie, and whether we're falling through into it in practice anywhere)
[17:53:32] heh right
[17:53:55] (and/or Vary, above)
[17:55:52] ha
[17:56:08] commented lots of Cookie-related stuff in text-common, test passes
[17:56:29] it's a start! but too critical to just gut it
[17:57:14] so another thing I tried once and then decided wasn't necessary, during the last round of v4 cookie debugging
[17:58:12] if (req.restarts == 0) {
[17:59:00] replace that with:
[17:59:11] unset req.http.X-Orig-Cookie;
[17:59:20] if (req.restarts == 0 && req.http.Cookie) {
[17:59:35] err wait, that's not right either
[18:00:02] if (req.restarts == 0) { unset req.http.X-Orig-Cookie; if (req.http.Cookie) { ... the existing stuff ... }; }
[18:00:39] I think this may avoid a case where undefined req.http.Cookie later becomes a defined-to-empty-string X-Orig-Cookie
[18:01:12] so that's definitely the problematic code snippet, removing it makes the test pass (with all other cookie-related code in place)
[18:01:50] my theory hinges on:
[18:02:11] set req.http.X-Orig-Cookie = req.http.Cookie; // when Cookie isn't set at all, this creates X-Orig-Cookie with value ""
[18:02:26] which would cause some bad vary-matching between responses with it set to "" and requests with it undefined
[18:02:30] this works:
[18:02:32] if (req.restarts == 0) {
[18:02:32]     //set req.http.X-Orig-Cookie = req.http.Cookie;
[18:02:32]     unset req.http.X-Orig-Cookie;
[18:02:32]     if (req.http.Cookie ~ "([sS]ession|Token)=") {
[18:02:34]         set req.http.Cookie = "Token=1";
[18:02:37]     } else {
[18:02:39]         unset req.http.Cookie;
[18:02:42]     }
[18:02:44] }
[18:02:52] well yeah
[18:02:58] but it still doesn't do the original job :)
[18:03:15] indeed
[18:03:44] it needs to be like I wrote above, with the original "set req.http.X-Orig-Cookie = req.http.Cookie" as part of "... the existing stuff..."
[18:04:16] (so the inner if/else gets a new layer of indent)
[18:04:30] (and the set above it)
[18:05:05] or I can edit it the way I'm thinking, I just don't want to step on the same file you're editing :)
[18:05:48] bblack: I've edited it the way I think you're thinking and it doesn't pass the test, but please check whether I understood what you mean or not
[18:06:17] the test I'm trying is ~ema/bbhfp.vtc
[18:06:29] no, but you still have it open
[18:06:55] oh no, I just wasn't root anymore
[18:07:17] try that
[18:07:34] yup
[18:07:36] it works
[18:08:04] do you see what I mean, about how if during evaluation, req.http.Cookie is undefined, that would carry through and cause an effect on X-Orig-Cookie at restoration time?
[18:08:54] I think "set req.http.Foo = req.http.Something-Undefined" changed semantics from v3->v4. before it was effectively like "unset req.http.Foo", now it acts like "set req.http.Foo = ''"
[18:09:21] nice, I understand
[18:10:07] probably should do a quick audit of VCL for other cases that might be important
[18:10:45] looking for every set that might have something undefined on the right side of the assignment I guess
[18:11:52] or something like that
[18:12:14] really, I'd like to better understand what VCL does with string undef-vs-"" for boolean and header-set in general
[18:12:27] I feel like I probably can't even answer all possible questions about that accurately
[18:12:41] (and neither do the docs!)
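The x.vtc probe referenced just below isn't shown in the log; a minimal sketch of that kind of test (hypothetical header names, behavior as asserted in the conversation) would look something like:

```
varnishtest "copying an undefined header yields empty-string, which is boolean-true"

server s1 {
    rxreq
    txresp
} -start

varnish v1 -vcl+backend {
    sub vcl_recv {
        # Foo was never sent by the client; the copy creates Bar as ""
        set req.http.Bar = req.http.Foo;
        if (req.http.Bar) {
            set req.http.X-Bar-Is = "true-but-empty";
        } else {
            set req.http.X-Bar-Is = "undef";
        }
    }
    sub vcl_deliver { set resp.http.X-Bar-Is = req.http.X-Bar-Is; }
} -start

client c1 {
    txreq
    rxresp
    # per the behavior observed below, the empty string evaluates true
    expect resp.http.X-Bar-Is == "true-but-empty"
} -run
```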
[18:13:10] maybe we can figure out what the v3 and v4 rules are explicitly and map out what changed
[18:14:15] (or it could be that nothing changed with all of that, but that the Vary-handling for V4 now treats empty-vs-undef distinctly, and in V3 they were the same)
[18:14:44] bblack: see ~ema/x.vtc
[18:15:01] (it passes)
[18:15:51] yeah, so that seems to confirm that "set req.http.X = req.http.Undef" sets X to an empty-string, which evaluates to true in a boolean context, whereas undef evaluates false
[18:15:57] no other explanation would make sense
[18:16:03] let's try that test on v3
[18:16:05] question #2 is whether that works out identically in V4
[18:16:07] err V3 :)
[18:16:45] passes on cp3030
[18:16:47] it seems like a deep language design issue that "X = Y" causes if(X) to choose differently than if(Y)
[18:16:56] but that's neither here nor there I guess, it is what it is
[18:17:22] so, probably it was this way in V3 as well, and the Vary handling changed
[18:17:49] can test with a Vary: Foo, and client requests with the header not-set, set-empty, and set-to-string, and see which share vary cache slots
[18:25:37] behavior there seems the same on v3 and v4
[18:26:25] so what are we missing here
[18:32:48] what if it gets copied to bereq in a funny way?
[18:33:14] we've verified that both in v3 and v4 setting something to an undefined header evaluates to true in a boolean context
[18:33:53] but maybe between copying stuff from req to bereq is somehow related to the problem we're seeing?
[18:35:34] ^that wasn't English, s/between//
[18:39:02] well now I'm back to just trying to repro the behavior with clean (not ours) VCL in v4 and I can't yet
[18:39:31] I've made a mock of what our vcl_recv + vcl_backend_fetch + vcl_backend_response currently does to Cookie, but s/Cookie/Foo/ to avoid interaction with builtins
[18:39:54] and responses with Vary:Foo, and I can't yet repro the failure to use the vary slots for caching properly
[18:40:37] see ~bblack/vary.vtc for where I'm at so far (that passes)
[18:41:07] if this reproduced, it should miss x2 and fail the test
[18:42:00] oh, wait, but we already said this was about hfp specifically...
[18:42:11] I need hfp code in there too heh
[18:42:16] right!
[18:45:44] bblack: s/beresp.http.status/beresp.status/ :)
[18:51:00] ok I'll stop spying your work in progress, I've gotta go now! o/
[18:51:32] ok, cya :)
[19:24:58] ok I've got a "clean" repro (with constructed-from-scratch minimal VCL that emulates our cookie handling, but with s/Cookie/Foo/)
[19:25:10] and I even understand why it's happening on some level, but more digging to do
[19:27:47] so to recap what I think I know now:
[19:28:38] 1. The original Cookie problem we identified, I think was correctly-identified: V4 now varies on bereq.http.Foo when storing an object, while V3 looked at req.http.Foo (this is the critical confusion that mucks up the V3->v4 upgrade)
[19:29:15] 2. The behavior of string undef/""/etc hasn't changed from V3 to V4 (whew!)
[19:30:24] 3. In the case of a cookie-less request, we've always ended up effectively setting Cookie="" in bereq (I think, will confirm shortly), it just didn't affect Vary due to (1) above.
[19:31:28] 4. Our fix the other day (setting bereq.http.Cookie in vcl_backend_response back to the unset|Token=1 values) fixes this for *cacheable* responses (those with real CC: TTLs, etc), which is why it worked in testing for /wiki/Foo and such
[19:31:59] 5. But when you stitch all the VCL together from the various files, our actual relevant sequence in vcl_backend_response is:
[19:33:14] check for TTL<=0s case and emit 601s hfp if applicable (return deliver in that case); hack bereq cookies as in (4); code to do the "special" token=1 607s hfp
[19:34:09] so, basically, the bereq cookies hack isn't being applied *before* the generic hfp code, and thus the generic hfp code gets confused by the ""-vs-undef for Vary-slotting of cookie-free requests (which has always been the case in bereq.http.Cookie, but didn't matter before because V3 used req.http.Cookie to do Vary-processing)
[19:36:24] so, the text-common hack that fixed things earlier (the additional if(req.http.Cookie) conditional) fixes this indirectly, by eliminating the bereq-level ""-vs-undef confusion
[19:37:36] but it would still (I think...?) not quite be right, because the vary-slot will be unnecessarily-multiplied for logged-in users with real session cookies (doesn't break anything, just isn't efficient)
[19:38:23] the "right" fix is to move the vcl_backend_response Cookie-value-hacking back up above hfp
[19:38:52] secondarily, we might also want to do the text-common if(req.http.Cookie) thing, if only to reduce future confusion and avoid sending a pointless empty cookie header to the applayer
[20:23:11] core issue fixed in https://gerrit.wikimedia.org/r/#/c/320259/ (but see also the commit before it that inserted the new per-cluster hooks, and the commit after that does the "secondarily" above)
[20:24:42] while I think this fixes up the Big Problems... I'm still left wondering about the potential shortlived cap on transient response for uncacheable...
[20:28:15] although in manual testing on live cp4018, it seems like the HFP does live longer than 10s
[20:29:31] ok that mystery is solved now too, I just didn't read the code hard enough
[20:30:00] uncacheables (including hfp uncacheables) do get forced to transient storage, but the shortlived ttl cap doesn't apply to any normal case there
[20:30:42] it's some corner/broken case where the storage_hint ends up undefined where it uses transient as a final fallback and caps the ttl to the shortlived ttl
[20:33:06] so....
[20:33:15] 3x good documentation issues to file upstream with varnish:
[20:33:52] 1. That (regardless of v3/v4) "set req.http.Foo = req.http.Undefined" creates Foo with an empty value, rather than making it undefined
[20:34:32] 2. That (also regardless, but I think only from 3.0.4 onwards) HFP objects are allocated from transient storage in general, even if their TTLs are > shortlived. (I donno, maybe this is docced somewhere?)
[20:35:05] 3. That in the V3 -> V4 transition, one should be wary of the fact that Vary-slots in cache storage are set from bereq.http.*, whereas in V3 they were set from req.http.*
[20:35:35] (most people would never hit (3) I imagine, but the few with VCL that has hacks like our session cookie workaround would...)
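To make the ordering fix described at 19:38:23 concrete: a rough sketch of the corrected vcl_backend_response sequence (not the actual change in gerrit 320259; the Token=1 munge and regex are taken from the snippets earlier in the log):

```
sub vcl_backend_response {
    # (1) re-normalize bereq.http.Cookie to the same unset|"Token=1"
    # form that vcl_recv produced, since v4 computes the stored Vary
    # slot from bereq.http.*:
    if (bereq.http.Cookie ~ "([sS]ession|Token)=") {
        set bereq.http.Cookie = "Token=1";
    } else {
        unset bereq.http.Cookie;
    }

    # (2) only after that, create the generic hit-for-pass, so the hfp
    # object lands in the Vary slot that future munged lookups will use:
    if (beresp.ttl <= 0s) {
        set beresp.ttl = 601s;
        set beresp.uncacheable = true;
        return (deliver);
    }
}
```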
[23:10:47] bblack, ema: quick question if you are around
[23:11:05] somewhat :)
[23:11:06] were there any changes in retry behavior in v4?
[23:11:25] we are investigating a burst of retries hitting RB
[23:11:39] responses are persistent 504 bubbled up from Parsoid
[23:11:49] I don't really know yet. Structurally, retry changed a lot. We'd like to think that didn't affect the retry-503-once stuff, but it could have. It's hard to see unless you're really looking for it and emulating failure
[23:12:54] these are 504s, which in the old config would fall in the 5xx bucket
[23:13:29] so, there is no 5xx bucket anymore, not in our explicit VCL anyways
[23:13:37] there's only "retry 503 once"
[23:13:41] Here's the example: https://logstash.wikimedia.org/goto/9eab4d4066c879eafc938d4f9011b577
[23:13:48] (that was standardized across clusters some time ago)
[23:14:07] I see, so that means that 504 should not be retried at all?
[23:14:32] but given how much things changed in the varnish internals with respect to retrying failed requests in general, it wouldn't be shocking if things no longer behave as we expect, and we have to suppress some retries elsewhere
[23:15:06] the reason why we are suspecting that this might be happening at the varnish level is that a) we are seeing two different UAs being retried for the same title, and b) the request IDs are fresh every time
[23:15:18] we're certainly not intentionally retrying a 504 in VCL, varnish may be doing so by some implicit default logic we haven't got VCL or config to work around yet
[23:16:24] what would be the best way to figure out if varnish is retrying these requests?
[23:16:50] I've made a curl request to this problematic URI. Once it times out we'd be able to say what's retrying these requests - clients or Varnish
[23:17:13] actually, Varnish or something else internally vs clients
[23:18:07] good idea, let's see if this sets off another storm of requests
[23:18:09] http://stackoverflow.com/questions/35470333/change-backend-during-retry-in-varnish-4
[23:18:20] ^ this pretty much explains what changed in varnish guts
[23:18:37] I don't think we've adapted our VCL to it very well, if at all, or looked at the default bereq.retries limits, etc
[23:19:59] well, looking at the VCL now, we did adapt it somewhat. but it still doesn't look quite right
[23:32:35] let's see what Pchelolo's testing yields
[23:33:31] I did some deeper digging in our VCL and all the varnish docs as well
[23:34:17] I think I was being pessimistic above. Our adaptation for V4 for this stuff was fairly spot-on, and there is no implicit default behavior (at least, documented behavior) that should cause restarts/retries outside of what we do explicitly in VCL.
[23:34:44] What I did find is we're actually failing to do the retry-503-once behavior under certain conditions where we used to do it, so I'm adding more retry heh.
[23:35:25] V3 it was all one thing. In V4, you have to handle the retry of an explicit 503 separately from an implicit 503, and we were only even attempting to retry the implicit case.
[23:35:52] (where "implicit" == internal 503 due to connectfail, timeout, protocol error, whatever, and "explicit" == the backend actually sent us a legit response with status 503)
[23:38:06] ema: merging https://gerrit.wikimedia.org/r/#/c/320310/ because it seems pretty legit, and likely will reduce the minor 503 spikes we see on e.g. upload be restarts, etc
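A minimal sketch of handling both cases in v4 terms (an illustration of the explicit/implicit split described at 23:35, not the actual patch in gerrit 320310; the backend declaration is a placeholder):

```
vcl 4.0;

# hypothetical placeholder backend
backend default { .host = "127.0.0.1"; .port = "8080"; }

sub vcl_backend_error {
    # "implicit" 503: connect failure, timeout, protocol error, etc.
    # (varnish synthesizes the error; the backend never sent a response)
    if (bereq.retries == 0) {
        return (retry);
    }
}

sub vcl_backend_response {
    # "explicit" 503: the backend really answered with status 503
    if (beresp.status == 503 && bereq.retries == 0) {
        return (retry);
    }
}
```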
[23:43:59] I have to run
[23:44:37] Petr is still testing requests, now with extra parameters to avoid coalescing
[23:47:29] ok
[23:50:15] bblack: So, it seems like it's Varnish retrying. I've made a curl request to 'https://fr.wikipedia.org/api/rest_v1/page/html/Liste_des_plan%C3%A8tes_mineures_non_num%C3%A9rot%C3%A9es_d%C3%A9couvertes_en_2006/123378288?a=b'. The query parameter is meaningless, it's added to avoid varnish merging it with the outstanding request by the Google API crawler. After the 4-minute RESTBase timeout I can see the log entry in RESTBase about the timeout: https://logstash.wikimedia.org/goto/f9bd5924e4337664fb1b12e6ca5e62f7 but the curl request is still pending in my console
[23:50:46] so it seems something is retrying the request in our infrastructure, and the only thing between client and RB is Varnish
[23:53:32] bblack: now I've got even more interesting results: there are now 3 log entries from different RB nodes with timeouts from this request. I've made the request only one time, curl still pending