[02:52:12] Traffic, DNS, Operations, Phabricator, Patch-For-Review: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252#2366860 (MC8) It looks like redirects existed for Bugzilla, so I guess you could call this a regression.
[03:22:27] Traffic, DNS, Operations, Phabricator, Patch-For-Review: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252#2366879 (MZMcBride) >>! In T137252#2362373, @Krenair wrote: > Isn't commons.wikipedia.org just a historical thing? Yeah. >>! I...
[12:12:27] Traffic, Discovery, Operations, Wikidata, Wikidata-Query-Service: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2367569 (Gehel) It seems that at the moment only .js and .css files which are in the js / css directories are processed by filerev. Is there any reason to not...
[12:44:32] bblack: as an alternative solution to the "where to clear errors in varnishapi" question, we could perhaps just clear them in the consumer itself
[12:44:43] and leave varnishapi as it is
[12:45:04] (the whole thing is pretty convoluted I must say)
[12:48:26] otherwise yes, we could clear self.error after VSLQ_dispatch_f https://github.com/xcir/python-varnishapi/blob/master/varnishapi.py#L637
[12:49:59] but that means the callback itself needs to consume it, while at the moment we have been focusing on reading the error after Dispatch https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/files/varnishlog4.py#L80
[12:52:48] bblack: also, re: sending pass straight to the applayer, it sounds like the approach of doing that from the first backend introduces less complexity
[12:53:36] arguably without much of a performance impact (frontend and backend are in the same DC)
[13:02:20] Traffic, Discovery, Operations, Wikidata, and 2 others: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2367721 (Jonas) We can only use filerev for files that are referenced from the html file, because the application itself is not aware of filerev and then will not find...
[13:52:40] ema: yeah I tend to agree on the "don't pass to app from frontend" thing, I think. but in any case, we don't get to that decision until we're past outbound TLS, it's a ways off :)
[13:53:06] on varnishapi, I've lost all mental context on that from last time I looked, I don't even know anymore :)
[13:54:19] bblack: I've hacked varnishlog.py on cp1058 to log errors like this:
[13:54:24] if self.vap.error:
[13:54:24]     sys.stderr.write("Error in execute(): %s\n" % self.vap.error)
[13:54:27]     self.vap.error = ''
[13:55:19] and after lots of requests we end up with:
[13:55:20] Error in execute(): Log overrun
[13:55:20] Error in execute(): Log reacquired
[13:55:40] (last time we got flooded with Log reacquired because we weren't resetting self.vap.error)
[13:56:08] one thing I don't think I've figured out yet, but maybe you know...
[13:57:32] does overrun->reacquire mean "txns happened so fast that we actually lost txns due to limited buffer size and how long we went between reading them"? Or is it just something that happens every time the buffer head cycles from top->bottom in the address space (it wraps, I think)?
[13:57:53] in which case there must be some other error for actually losing txns
[13:59:03] I think the former
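A minimal sketch of the consumer-side approach discussed above (read and reset the error after each Dispatch call, leaving varnishapi itself untouched), assuming xcir's python-varnishapi VarnishLog class with the .Dispatch()/.error interface used in this log; the argument list and sleep interval are only illustrative:

    import sys
    import time

    from varnishapi import VarnishLog  # xcir's python-varnishapi, assumed importable

    def callback(*args, **kwargs):
        pass  # consume transaction data here

    vap = VarnishLog(["-n", "frontend", "-I", "RespHeader:^X-Cache:"])
    while True:
        ret = vap.Dispatch(callback)
        # read (and clear) any error varnishapi recorded during this Dispatch cycle,
        # instead of having varnishapi reset it internally
        if vap.error:
            sys.stderr.write("Error in execute(): %s\n" % vap.error)
            vap.error = ''
        if ret == 0:
            time.sleep(0.01)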
[13:59:22] netops, Labs, Labs-Infrastructure, Operations, and 3 others: Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2367841 (Dzahn)
[13:59:33] completely-unrelated topic, if you need something else to think about: ssl session resumption and forward-secrecy.
[14:00:31] so the background is: if you keep resuming sessions with a browser indefinitely, and it stays open indefinitely, this starts putting a big ugly window on forward secrecy guarantees. as long as the session is alive, breaking the server means you can break older data in this same resumed session from weeks ago...
[14:01:07] and it's a little bit worse than that: even if resumed sessions only last a day, if the session-data itself persists longer on the server, that can be used to go back and break old logged data
[14:02:16] so a browser negotiates session:12345 with us, the session only lasts 1 day, and never gets re-used. but someone logged the traffic. if in the distant future, someone breaks into our server and can steal the old session info, they can decrypt their old saved log.
[14:02:40] (thus it breaks forward-secrecy guarantees if we're keeping them around forever)
[14:03:00] the obvious first advice is "don't ever store that stuff on disk, at least only in memory"
[14:03:07] which we already do and that's not at issue
[14:03:40] so on to the tricky part:
[14:03:46] https://www.imperialviolet.org/2013/06/27/botchingpfs.html
[14:04:28] we don't use tickets, so just paying attention to the sessionid part of the problem, that article (from a very trusted source) says:
[14:04:31] In the case of session IDs, the session information will either be kept in memory or kept on disk, depending on how the server is configured. (See the SSLSessionCache directive in Apache.) The forward secrecy of a connection is bounded by how long the session information is stored at the server. A small, in-memory cache likely has a high turnover
[14:04:37] rate, but a disk-cache could retain that information for a long time. Ideally one would have a medium sized, in-memory cache that turned over once a day or so.
[14:04:55] ugh, horrible fallout from justification :)
[14:05:19] anyways... so he's talking about "keep the session cache small relative to your connection rate, so that old sessions get overwritten in reasonable time and thus wiped out"
[14:05:52] then there's this relatively-high-referenced source, which links to the above and goes into more detail:
[14:05:55] https://timtaubert.de/blog/2014/11/the-sad-state-of-server-side-tls-session-resumption-implementations/
[14:06:16] "You might wonder whether the SSLSessionCacheTimeout directive can be of any help here - unfortunately no. The timeout is only checked when a session ID is given at the start of a TLS connection. It does not cause entries to be purged from the session cache."
[14:06:32] and he says the same thing about nginx's identical ssl_session_timeout
[14:07:22] basically, he's saying (and he's not necessarily contradicting the earlier link, either) that even if you set ssl_session_timeout to a reasonable forward-secret window of time, if your cache is big enough, old session data could live a very long time there and break FS.
[14:07:36] because nothing's purging expired session data regularly from the cache
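A toy model of the behavior that second quote describes (the timeout is only consulted when a client presents a session ID; nothing actively purges the entry), with invented names and numbers, purely to make the forward-secrecy worry concrete; it is not how nginx or OpenSSL are actually implemented:

    import time

    SESSION_TIMEOUT = 300  # stand-in for ssl_session_timeout, in seconds (illustrative)

    class NaiveSessionCache(object):
        """Timeout checked only on attempted resumption; nothing ever purges entries."""

        def __init__(self):
            self.store = {}  # session_id -> (expiry_time, master_secret)

        def add(self, session_id, master_secret):
            self.store[session_id] = (time.time() + SESSION_TIMEOUT, master_secret)

        def resume(self, session_id):
            entry = self.store.get(session_id)
            if entry is None:
                return None
            expiry, secret = entry
            if time.time() > expiry:
                # resumption is refused, but the secret is still sitting in memory:
                # in an oversized cache it stays there until something overwrites it,
                # which is the forward-secrecy window both posts are worried about
                return None
            return secret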
[14:08:17] we (I) looked at all the above quite some time ago when we were first setting up our TLS config, which is why we kept the session cache relatively limited, based on all of this reasoning...
[14:09:33] but when I actually look at current openssl and nginx code, it seems to me that expired sessions *do* get securely purged on a regular basis, via two separate mechanisms:
[14:10:21] 1) On every new session, the first thing the code does (unconditionally!) is look at a queue of session expiries (oldest session at the front of the queue) and remove up to 2 sessions from it if they're expired
[14:10:51] thus if your session rate doesn't suddenly drop off extremely, you'd expect that alone would fix this most of the time.
[14:11:34] 2) OpenSSL itself also automatically searches the whole session cache and purges expired sessions once every 256 accepted connections. there's an option to disable that behavior, but nginx doesn't set the option.
[14:11:58] (1) was in nginx's code
[14:12:32] all the involved code is convoluted to understand, so I could be wrong. but my read of the code is that those blog posts' concerns about timeout expiry not being enough on its own are false.
[14:14:02] the openssl one is easier to look at. in openssl's source/docs, SSL_SESS_CACHE_NO_AUTO_CLEAR is what prevents the "wipe all existing expired sessions every 256th connection". nginx only sets this in one special case, which is a Microsoft bug workaround when all session caching is disabled.
[14:15:27] and the flush function openssl uses for that calls remove_session_cb(), which is the callback nginx sets to clear the nginx side of things
[14:17:26] ngx_ssl_expire_sessions() is the other mechanism, which gets invoked (potentially multiple times if the cache is full of un-expired sessions, but at least once unconditionally to remove 1-2x actually-expired sessions) every time a new session is created in ngx_ssl_new_session()
[14:18:31] and the ssl side of these flush/removal things does securely wipe the memory (zeroing it on deallocation), as you'd expect from a crypto lib handling temporary secrets
[14:18:48] none of this seems new, either. the openssl one on every 256th connection has been there for years.
[14:22:41] my inclination is to believe the code, but believing what I see in the code means contradicting the only decent public blog/information sources on this topic.
[14:23:05] maybe they understand the code deeper than I do, and there's some critical flaw in all these mechanisms that means that the session data commonly doesn't really get purged from memory.
[14:28:56] we can set the caches large and long and work around the problem (if there is a problem) with periodic wipes. we'd splay them out over a week or so, and do the nginx zero-downtime-upgrade-restart thing, which also wipes the old cache and makes a new one. so worst-case no cache storage lasts longer than a week.
[14:29:17] if there's no problem to begin with though, that's pointless churn that interrupts a bunch of valid and not-very-stale sessions at wipe-time.
[14:32:16] maybe systemtap is a good answer here. I could confirm with systemtap on a live server whether the expire/flush functions are getting called at the expected rates compared to new sessions, etc.
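A toy simulation of the two mechanisms described above (drop up to two expired entries from the expiry-ordered queue on every new session, plus a full sweep every 256 accepted connections); it is not the nginx/OpenSSL code, and the rates and timeout are made up:

    import collections

    TIMEOUT = 300            # illustrative session lifetime, seconds
    FULL_SWEEP_EVERY = 256   # the "every 256 accepted connections" sweep described above

    cache = {}                          # session_id -> expiry time
    expiry_queue = collections.deque()  # (expiry time, session_id), oldest first
    accepted = 0
    now = 0.0

    def new_session(sid):
        global accepted
        accepted += 1
        # mechanism 1: on every new session, drop up to 2 expired entries from the queue head
        for _ in range(2):
            if expiry_queue and expiry_queue[0][0] < now:
                _, old = expiry_queue.popleft()
                cache.pop(old, None)
        # mechanism 2: every 256th accepted connection, sweep the whole cache for expired entries
        if accepted % FULL_SWEEP_EVERY == 0:
            for old, exp in list(cache.items()):
                if exp < now:
                    del cache[old]
        cache[sid] = now + TIMEOUT
        expiry_queue.append((now + TIMEOUT, sid))

    # drive it: 50 new sessions/second for an hour; if the mechanisms work as described,
    # expired entries never pile up no matter how large the cache is allowed to grow
    for second in range(3600):
        now = float(second)
        for i in range(50):
            new_session((second, i))

    expired = sum(1 for exp in cache.values() if exp < now)
    print("%d sessions cached, %d of them expired" % (len(cache), expired))

If that reading of the code is right, the expired count stays near zero even though nothing here ever size-limits the cache.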
[14:33:27] that would be cool, yes
[14:33:45] (we could also ask the openssl/nginx folks)
[14:33:53] yeah that too
[14:34:15] I'd rather get a concrete answer from the software, though
[14:34:38] yep, I meant as a first step :)
[14:35:06] even if nginx authors say their code (which implies their use of the openssl APIs) doesn't leak expired secrets over time in a big cache, we're left with "why do these two seemingly reasonable sources say otherwise and nobody's corrected them?"
[14:35:28] right
[14:36:12] so, my understanding of this is that a cache eviction mechanism like LRU with a reasonably-sized cache would solve the problem
[14:36:45] but my brain is still half in varnishlog :)
[14:36:49] so, definitely nothing's tracking the RU part (recentness of use)
[14:37:11] when sessions are stored into the cache, an absolute expiry-time is set at now+N.
[14:37:31] on attempted reuse, if it's not yet past expiry-time, it can be reused, but there's no bump to any kind of last-used timestamp.
[14:38:21] we also know for sure that if the cache is small enough relative to your rate of new sessions, it has to evict the oldest ones (oldest expiry time, it keeps a queue based on that) to make room. so if you make it small enough relative to connection rate, you have no problem.
[14:38:55] the question-mark is all about "if the cache is huge relative to connection-rate, do expired sessions linger in memory virtually-indefinitely because nothing's actively purging them?"
[14:39:27] as for why we can't just tune the size and move on:
[14:39:44] if we make the size too small, we lose client perf to lost opportunities for legit session resumption
[14:39:50] if we make it too big, we face the above question.
[14:40:27] but the session rate at e.g. codfw vs esams, or text vs maps/misc, can be 1 or more orders of magnitude different, and is dynamic with DC-failover, etc
[14:40:57] even if we picked a hand-tuned size per-cluster, codfw-vs-esams (and DC-failover scenarios) make a mess of it
[14:41:20] we're much better off if we can pick a size that's big enough for our worst-case, and set them all that big and forget it.
[14:41:43] but that means if the expired-leak-on-oversized-cache problem exists, we'll hit it at text-codfw
[14:43:20] the next-best workaround if the problem exists is just to wipe each node's cache once a week to be sure, and splay them, as mentioned before. but that's still a notable percentage of session cache wiping in little bumps constantly, and when we wipe a whole cache we pointlessly destroy a ton of not-stale sessions too.
[14:47:51] there's another option: tickets are better than sessionids, and we can implement some complex stuff to rotate ticket secrets and not have a true session cache, in theory.
[14:48:29] the problem we face there is that a significant percentage of clients don't support tickets. so we'd still need a session cache for them. it could be smaller, but we still face the same basic problem on top of sorting out tickets themselves.
[14:49:12] (the clients that don't support tickets include all IE<11, IE11 on Win7, and all iPhone/Mac Safari stuff including the latest)
[14:53:11] (and as far as perf data goes: with old small/conservative sid caches, we were averaging ~22% resume in grafana. the new stuff that's testing much bigger looks like it will average at least 32%, and may grow a bit more over the next day or two).
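A back-of-envelope sketch of the sizing tension just described; all the numbers (new-session rates, per-entry footprint, timeout) are invented for illustration, not measured values:

    def cache_entries_needed(new_sessions_per_sec, timeout_secs):
        # to avoid evicting still-valid sessions, the cache needs roughly
        # one slot per session created during the timeout window
        return new_sessions_per_sec * timeout_secs

    BYTES_PER_ENTRY = 512  # assumed rough per-session footprint in the shared-memory cache

    # hypothetical new-session rates (per second), spanning the kind of
    # order-of-magnitude spread between sites/clusters mentioned above
    rates = {"text-esams": 100, "text-codfw": 10, "misc-codfw": 1}

    for site in sorted(rates):
        entries = cache_entries_needed(rates[site], 24 * 3600)  # e.g. a ~1 day timeout
        print("%-11s ~%9d entries, ~%5.2f GB" % (site, entries, entries * BYTES_PER_ENTRY / 1e9))

Sizing every cluster for the worst case keeps resumption opportunities everywhere, but leaves the low-rate clusters with caches far larger than a day's worth of sessions, which is exactly where the expired-entries-lingering question matters.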
[15:07:33] Traffic, Operations, Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2368049 (Aklapper) All patches seem to be merged. What are the next steps in this task?
[15:14:34] Traffic, Operations, Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2368064 (BBlack) There's still long-term investigation ongoing on the effects of jemalloc tuning and the effect on frontend hitrates (and the latter in conjunction with co...
[15:20:44] uh I think I found the varnishlog issue
[15:21:03] we're not passing any grouping, which means a lot of stuff outside of our filtering gets processed
[15:21:24] hundreds of thousands of things like:
[15:21:31] {'arg': (<__main__.VarnishLog instance at 0x7fa961de1dd0>, {'level': 0L, 'type': 'c', 'reason': 0, 'vxid_parent': 0, 'length': 20L, 'tag': 75L, 'vxid': 513957568, 'data': 'req 512346966 rxreq\x00', 'isbin': 0L}, None), 'kwargs': {}}
[15:23:39] passing -g request, CPU usage of varnishxcache is between 7% and 20% on cp1061
[15:23:46] ok
[15:24:00] so "grouping" doesn't just mean "how are all these things grouped", it also filters down what you receive at all?
[15:24:11] and I guess the default vxid is "everything"
[15:24:19] looks like
[15:24:34] makes sense
[15:25:24] it shouldn't filter things afaik but maybe inspection is way easier with grouping?
[15:25:40] well we actually get different matches
[15:26:19] mmm
[15:27:06] (the mmm is not that I don't trust you, it's just me thinking through a use case :)
[15:28:14] inside the grouping you can decide what to inspect and what not, for example quickly deciding if you need backend req transactions or not, etc.
[15:29:49] so this is basically the easiest way to test varnishapi.py:
[15:29:51] def callback(*arg, **kwargs):
[15:29:51]     print locals()
[15:29:51] if __name__ == "__main__":
[15:29:51]     x = VarnishLog(["-n", "frontend", "-q", "not ReqMethod eq PURGE",
[15:29:51]                     "-g", "request",
[15:29:54]                     "-I", 'RespHeader:^X-Cache:'])
[15:29:56]     while True:
[15:29:59]         ret = x.Dispatch(callback)
[15:30:01]         if ret == 0:
[15:30:04]             time.sleep(.01)
[15:30:15] comment out the -g request part, 100% CPU usage :)
[15:32:01] ahahah
[15:32:10] on a machine receiving the purges like cp1061, that is
[15:32:28] so, does -g request make it survive a blast of high rate reqs without overrun->reacquire?
[15:32:52] (given the current existing sleep-mechanism at 0.01)
[15:34:26] right, it hasn't exploded so far
[15:36:00] I'll add logging to varnishlog.py and request grouping to varnishxcache on cp1061, this way we can easily compare
[15:36:16] sounds good
[15:36:29] also note, varnishkafka already does "-g request" in its code, so this isn't a problem there
[15:36:34] conf.vslq = VSLQ_New(conf.vsl, &vsl_cursor, VSL_g_request, conf.q_flag_query);
[15:36:39] ha!
[15:37:09] arguably we should skip future pain and do this at the varnishlog.py level since that's our code
[15:37:20] default to -g request and have an argument to explicitly change it if necessary
[15:38:06] otherwise we're just hoping all future varnishlog consumers remember to do this, or cargo-cult it from existing ones
[15:38:34] or perhaps this could even be a bug in python-varnishapi, maybe if no grouping is specified it passes a banana instead of VSL_g_request to VSLQ_New
[15:39:12] vxid is the default in other standard tools though. it would probably be "wrong" for varnishapi's default to differ, way down at that level.
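A sketch of the proposal above (make "-g request" the default at the varnishlog.py level, with an explicit way to override it); the helper name and signature are illustrative, not the actual Gerrit change:

    from varnishapi import VarnishLog  # xcir's python-varnishapi wrapper, as used above

    def make_varnishlog(vsl_args, grouping="request"):
        """Build a VarnishLog, adding '-g request' unless the caller already chose a grouping."""
        args = list(vsl_args)
        if "-g" not in args and grouping is not None:
            args += ["-g", grouping]
        return VarnishLog(args)

    # consumers get request grouping by default...
    vlog = make_varnishlog(["-n", "frontend", "-q", "not ReqMethod eq PURGE"])
    # ...but can still ask for something else explicitly
    raw_vlog = make_varnishlog(["-n", "frontend"], grouping="vxid")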
[15:40:24] https://forum.nginx.org/read.php?2,254881,254881#msg-254881
[15:40:41] ^ someone else asked about the ssl_session_timeout and that blog post from Tim Taubert already, back in 2014. no reply heh.
[15:40:52] repost! :)
[15:41:23] you'd think if my read of the code is correct and the blog post is wrong, one of the nginx devs would've responded with something to that effect...
[15:41:39] vk also filters a lot away: https://github.com/wikimedia/varnishkafka/blob/master/varnishkafka.c#L1676
[15:43:57] ema: yeah you're right, before even bothering with testing I should ask the mailing list. if they agree with the blog post there's no point testing/wondering.
[15:46:26] netops, Labs, Labs-Infrastructure, Operations, and 3 others: Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2368164 (hashar) Open>stalled From a quick chat with @mark we don't want hosts in private...
[15:47:25] yeah
[15:52:49] Traffic, Operations, Patch-For-Review: Scripts depending on varnishlog.py maxing out CPU usage on cache_misc - https://phabricator.wikimedia.org/T137114#2368210 (ema) It looks like the culprit might be lack of grouping at the varnishlog level. The following experiment is currently ongoing on cp1061:...
[15:56:30] there are a bunch of overruns already on cp1061
[15:56:37] sudo journalctl --since=today|grep 'execute('
[15:56:50] but not for xcache
[16:00:31] Traffic, Discovery, Operations, Wikidata, and 2 others: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2368243 (Gehel) We should be able to also process at least: * vendor/jquery.uls/css/jquery.uls.css * logo.svg They are unlikely to change frequently but still it would b...
[16:00:56] bblack, elukey: this would be my proposal for dealing with errors in varnishlog https://gerrit.wikimedia.org/r/#/c/293123/
[16:01:23] (it's also how I hacked varnishlog on cp1061)
[16:04:25] ema: my concern (which could be unfounded?) is that in vap.error cases the data may be illegitimate anyways and we shouldn't be issuing a callback at all on that cycle, right?
[16:04:50] or is it always the case that the error is already "handled" and the callback contains good data post-error?
[16:05:43] from my understanding of python-varnishapi, that should be taken care of
[16:06:00] for example, the callback is really called after re-connecting to vsm
[16:06:18] in case of overruns (thus the overrun->reacquire)
[16:14:36] let's see the various cases:
[16:15:00] https://github.com/xcir/python-varnishapi/blob/master/varnishapi.py#L640
[16:16:36] that's the overrun part which is taken care of in __cbMain
[16:17:03] when error is "Log reacquired", that's not even an error, heh
[16:17:21] https://github.com/xcir/python-varnishapi/blob/master/varnishapi.py#L635
[16:17:32] it acutally means we're good to go because we got the log again :)
[16:18:30] all other errors seem to be no-go situations (eg: Can't open log)
[16:20:23] typo of the day: "acutally"
[16:21:51] to recap in a slightly more structured way: the only two errors we're concerned about are Log abandoned and Log overrun
[16:22:49] if they're set, it means that our call to Dispatch (which calls __cbMain) failed, so our callback presumably didn't act on stupid data
[16:23:24] given that it was executed by self.lib.VSLQ_Dispatch, which is the guy returning the error in the first place
[16:33:17] back to the grouping thing
[16:33:23] https://www.varnish-cache.org/docs/4.0/reference/vsl-query.html
[16:33:30] > A query is run on a group of transactions
[16:33:51] does that mean that without grouping it doesn't query?
[16:35:00] Raw: Every log record will make up a transaction of it's own. All data, including non-transactional data will be reported.
[16:36:20] the error stuff seems sane, yeah
[16:36:56] might not want to merge it until -g is fixed in clients though, or we're going to spam some log somewhere
[16:37:16] yep, will put a CR for -g soon
[16:37:41] Traffic, Discovery, Operations, Wikidata, and 2 others: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2368340 (Jonas) >>! In T137238#2368243, @Gehel wrote: > We should be able to also process at least: > * vendor/jquery.uls/css/jquery.uls.css This is fixed. > * main pag...
[16:52:20] pinged nginx-devel on the sessionid thing, we'll see what happens: http://mailman.nginx.org/pipermail/nginx-devel/2016-June/008344.html
[16:56:16] nice
[16:58:31] Traffic, Discovery, Operations, Wikidata, and 2 others: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2368472 (Smalyshev) > * main page (index.html) -> no cache (or very short) I would cache it for the same as below. It doesn't change that much - pretty much once a week n...
[17:13:34] -g request -> https://gerrit.wikimedia.org/r/#/c/293530/
[17:14:29] bblack: good morning! I have a question about x-cache again if you would be so kind to help out again.
[17:15:31] bblack: I'm trying to make sure I fully understand this stuff :) I saw one request with x-cache of "cp3032 miss, cp3033 miss" and another with x-cache of "cp3043 pass"
[17:17:53] bblack: if we focus on that first request… I'd probably find the same request but a couple milliseconds earlier that had x-cache of "cp3033 miss", right?, because then it got forwarded to cp3032 and got recorded again, so then if I were to look I'd probably see it again a couple milliseconds later with x-cache "cp1XXXX [hit|miss], cp3032 miss, cp3033 miss", right?
[17:18:19] bblack: I just want to make sure I'm thinking about this correctly. don't want to make any wrong assumptions.
[17:21:03] bearloga: I don't fully understand what you mean when you mention timing stuff such as milliseconds. I wanted to point out that we have docs about x-cache though: https://wikitech.wikimedia.org/wiki/Varnish#X-Cache
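For reference, a small illustrative parser for X-Cache values like the ones quoted in this conversation, where each entry is a cache host plus its disposition, ordered from the cache furthest from the client to the frontend that answered; this is just a sketch based on the format shown here, not code from the wikitech page:

    def parse_x_cache(value):
        """Split an X-Cache header into (host, disposition) pairs."""
        entries = []
        for part in value.split(","):
            host, _, disposition = part.strip().partition(" ")
            entries.append((host, disposition))
        return entries

    # the rightmost entry is the frontend that talked to the client; a miss at every
    # layer should also show an eqiad backend on the left, which is why
    # "cp3032 miss, cp3033 miss" on its own looked suspicious
    print(parse_x_cache("cp3032 miss, cp3033 miss"))  # [('cp3032', 'miss'), ('cp3033', 'miss')]
    print(parse_x_cache("cp3043 pass"))               # [('cp3043', 'pass')]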
[17:24:07] bearloga: please let us know if the explanation is unclear of course :)
[17:26:36] ema: neat! thanks for the link!
[17:28:11] ema: I'm still trying to make sense of a request with "cp3032 miss, cp3033 miss" because if both esams caching servers have missed, shouldn't I expect to see an eqiad server on the left in that x-cache?
[17:29:40] bearloga: oh, so "cp3032 miss, cp3033 miss" is the full content of x-cache that you got?
[17:29:49] ema: yup!
[17:29:53] funny
[17:29:57] ema: yup!
[17:31:53] bearloga: can you tell me some details about the request? URL, request/response headers, ...
[17:34:34] it could mean a bug in our x-cache implementation, so yeah more details would be good :)
[17:35:14] url: en.wikipedia.org/api/rest_v1/page/summary/Dane_Cook; http status was 503; made from the Android app
[17:36:31] oh this could be yesterday's outage right?
[17:38:00] ema: timestamp is 2016-06-08 21:50:22 if that helps
[17:41:25] perhaps we failed to append the eqiad hostname because of the 503 from the applayer?
[17:48:10] probably
[17:48:29] I found other examples of this in /srv/log/webrequest/archive/5xx.json-20160608.gz
[17:48:35] probably has to do with all the stuff we're recently messing with under v4 for X-Cache via vcl_synth(), etc
[17:48:52] I guess it still doesn't quite work right
[17:49:56] https://phabricator.wikimedia.org/P3230
[17:50:15] bearloga: it's a bug :)
[17:50:40] well, or alternatively, it does "work right", and it was cp3033 that generated the synthetic 503 on its miss
[17:52:01] if the log had the Via header we'd know
[17:52:07] based on the count of varnishes
[17:53:15] oh so maybe cp3033 got a 503 from eqiad?
[17:53:25] well, it definitely did either way
[17:53:36] right :)
[17:53:45] from an eqiad cache I mean
[17:53:59] the question is whether cp1xxx synthesized the 503 and 3033 passed it along, or 3033 synthesized it because it got no response from cp1xxx (or reached connection limits, or whatever)
[17:54:07] that
[17:54:26] in general, there's already a logical hole in current X-Cache/X-CDIS thinking, right?
[17:55:05] because if it's a miss, we set X-CDIS to "miss". then later if we internally synthesize a 503 because the miss-fetch failed, it's still "miss"
[17:55:20] the "int" status is just implicit if no other status was set, which works for e.g. synthetic 301 or 204, but not for synthetic 503.
[17:55:22] (btw, thanks ema and bblack for looking into this!)
[17:55:53] so we need to override "miss" with "int" when we internally fail+synth on the miss-fetch
[17:56:24] (although one could argue that's inaccurate if "miss" is meant to represent the latency of the attempt at a backend, and "int" is not)
[17:56:58] I think overriding miss with int in that case is more-accurate for what we actually care about seeing in logs and graphs, though.
[17:57:56] we could invent a new category for it too, call them "err" if it was miss/pass->synthetic-error, to distinguish from the other cases.
[17:58:08] mmh the way I see it is that we have two levels of detail: 1) did the request go to another layer 2) what happened
[17:59:08] even in (1) there are fine distinctions though. did the request want to go to another layer, but fail immediately without taking the latency hit because we reached the max_conns parameter for that backend, or because the network stack did an immediate failure on connect() for other network-fail reasons?
[17:59:22] vs actually talking to the other layer and timing out a bit.
[17:59:42] right
[18:01:06] so, +1 for overriding miss with int in case of synth 503
[18:01:15] or "err"
[18:01:34] then we can tell it apart from the known internal-only 301/204 cases for "int"
[18:02:03] oh sure, for debugging purposes it's not important (int+503 == err) but for graphs and all it is!
[18:03:14] I've gotta go, see you tomorrow!
[18:03:37] ema: cheers!
[18:04:08] cya ema :)
[18:04:47] the other related bit that needs fixing is that miss->hit_for_pass should override "miss" with "pass", and then we should split passes from misses in graphing based on seeing "pass" from a non-frontend cache.
[18:06:30] (then we'll have a much clearer picture on the cache_text inefficiencies. I expect the hit:miss ratio is really really good, and most of the current "misspass" in the graph is actually pass-traffic under the above definition)
[18:07:00] and then we have some justification and some feedback to go start chasing those down in MW and fixing them, because we shouldn't be pass-mode for ~15% of traffic like we apparently are today.
[18:12:40] Dumb question, but what is miss-for-pass?
[18:14:23] bblack, ema: btw, it is really fun and inspiring to listen to your thinking!
[18:20:15] gehel: it's "hit_for_pass" in varnish3, and in varnish4 it's called beresp.uncacheable instead
[18:20:36] the logical breakdown of the concepts goes like this:
[18:21:05] 1. In an ideal/naive world, everything is cacheable for at least some short time, and there's only hits and misses (which fill in the cache for future hits)
[18:21:34] 2. In the real world, some objects (URLs) are uncacheable, or are uncacheable under certain conditions (e.g. a cacheable wiki page is uncacheable for a logged-in user)
[18:22:45] 3. The other introductory piece of the puzzle is coalesce-on-miss: if many clients request the same object in parallel from the cache, and it's a miss, we don't want to forward many requests to the backend. so varnish makes the backend request for the first one, and makes the other parallel requests wait on the first response before answering them.
[18:23:14] (3 is really important on a lot of levels, it's a huge efficiency/perf win in the net)
[18:23:52] (50 connections come in nearly-simultaneously for "Foo" not in cache. varnish makes one backend request for "foo", and once it has the answer it answers all 50. the first one is a miss and the other 49 are hits)
[18:24:19] when you put 1 + 2 + 3 together in practice, you'll notice logically (or in real-world fallout) a problem:
[18:25:40] 4. If we don't know an object is uncacheable until after we fetch it, it's initially identical to a normal miss. which means coalescing, but after that first fetch we get an uncacheable object, which can't answer the other 49 requests. so because of that they all get serialized, and you've destroyed the performance of hot (high-parallelism) objects that are uncacheable.
[18:27:07] hit-for-pass is the answer to that problem. when we make that first request (no knowledge), and get an uncacheable response, we create a special cache entry that says something like "this object cannot be cached, remember it for 10 minutes". and then the other 49 and all remaining queries for the next 10 minutes proceed in parallel without coalescing, because it's already known the object isn't
[18:27:13] cacheable.
[18:28:31] (there is another simpler case too: where we know based on attributes of the request alone that it's uncacheable. that's simple to deal with and doesn't have coalescing problems. it's when the response tells us it's uncacheable we need hit_for_pass)
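A toy Python model of what bblack just walked through (a miss that turns out uncacheable leaves behind a hit-for-pass marker so later requests skip coalescing); it is not Varnish code, and the URLs, TTLs, and cacheability rule are invented:

    HIT_FOR_PASS_TTL = 600  # "remember this object is uncacheable" window, seconds (illustrative)
    OBJECT_TTL = 300        # normal cache lifetime for cacheable objects (illustrative)

    cache = {}  # url -> ("object" | "hit_for_pass", expiry)

    def fetch_from_backend(url):
        # pretend the backend marks anything under /special as uncacheable
        cacheable = not url.startswith("/special")
        return "body-for-%s" % url, cacheable

    def handle(url, now):
        entry = cache.get(url)
        if entry and entry[1] > now:
            if entry[0] == "hit_for_pass":
                # known-uncacheable: each client goes to the backend in parallel, no coalescing
                fetch_from_backend(url)
                return "pass"
            return "hit"
        # a real miss: the first requester fetches; with coalescing, parallel
        # requesters for the same object would be queued behind this fetch
        _body, cacheable = fetch_from_backend(url)
        if cacheable:
            cache[url] = ("object", now + OBJECT_TTL)
        else:
            # store the special "cannot be cached" entry so later requests skip coalescing
            cache[url] = ("hit_for_pass", now + HIT_FOR_PASS_TTL)
        return "miss"

    # 50 near-simultaneous requests for an uncacheable URL: the first is a miss,
    # the other 49 become passes instead of being serialized behind each other
    print([handle("/special/Foo", now=0.0) for _ in range(50)])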
[20:57:13] Traffic, Operations, Continuous-Integration-Infrastructure (phase-out-gallium), Patch-For-Review: Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2369740 (demon)
[20:57:52] bblack, ema: I also added some of the explanation of hit-for-pass to https://wikitech.wikimedia.org/wiki/Varnish#X-Cache. bblack's explanation was great and the subject is most interesting!
[21:14:40] gehel: thanks!
[21:49:59] bblack: I know I mentioned this yesterday, but SSL session re-use increased by about 20% in the last 24H, which I believe improved first paint by 3-4%. I say "believe" because I don't know when changes went out so I haven't definitively connected the dots, but I don't have a better explanation.
[21:50:43] ori: the big downward spikes in the reuse graphs are when various experiments deployed
[21:50:59] (because the sessionid caches have to be wiped to change their parameters)
[21:51:33] you can see them in the 7d view, they might average out on longer graphs
[21:52:57] yeah, the one on 2016-06-08 07:40 was a major win, it looks like
[21:53:06] ori: there's still a fundamental tradeoff against forward secrecy (well, and general resource-waste), so likely we'll end up scaling back from the current peaks. right now I'm in a phase of finding the limits of what's possible if we ignore other constraints temporarily
[21:53:31] if I know what the maximum possible is, I can then evaluate what the tradeoffs are worth, basically.
[21:54:18] right now we're letting sessionids last up to 25 hours if they aren't evicted just to make room, and each nginx has 4GB of sid cache space (which they initially consume in less than a day, but unsure how much of that is already-useless at that point).
[21:54:58] our normal before was something like 15 minutes and 100MB
[21:55:07] (before the very first experiment, which happened on May 30)
[22:01:25] is the cache shared, or is it local to each instance? if the latter, do we do anything to make request routing sticky?
[22:26:17] sorry had to step out a bit
[22:26:51] ori: it's local to each instance, and LVS persists the clientip->instance mapping (but of course, clients can move IPs during a day, too)
[22:28:01] (so we definitely thrash it on instance depool or dc-failover, but those are relatively rare and we can handle the quick perf bump. all other solutions are tricky atm)
[22:29:47] some of the apparent ceiling probably comes from clients that simply don't persist sids (e.g. bots/scripts/wget/curl/whatever)
[22:30:14] also, the reuse graph is per-request. so if some of the non-sid sources do more reqs/conn than the sid-capable agents, that skews things too.
[22:30:52] hence the experiments. given unreasonably-unlimited sessionid lifetimes and storage space, what does "100% of possible" really look like on that graph?
[22:31:35] I guess it's even trickier than that, to unwrap how the reqs-vs-conns thing plays out
[22:31:42] some example scenarios:
[22:32:16] 1. A modern HTTP/2 browser client like Chrome, first connection of the day: reuse:0, and then that reuse:0 could apply to many many many requests within that very long-lived and coalesced session.
[22:32:54] eventually that user will disconnect and reconnect and have the chance to make many reqs on one conn with reuse:1, but the long idle times on HTTP/2 may already cover a lot of the same ground that sid-reuse would have.
[22:33:51] 2. some kind of awful php bot/script, has no concept of saved sessionids, but spams API requests all day: tons of reuse:0 requests.
[22:34:26] 2. some other kind of awful bot/script, maybe it even knows about sessionids, but its daily pattern is to connect once and spam 10,000 requests over one keepalive connection for some batch processing and then disconnect for the day. they all get reuse:0
[22:34:34] s/2/3/ :)
[22:35:27] there are so many more ways to get lots of reuse:0 that we can't really do much about server-side.
[22:35:46] which is why I suspect the low 30s percentage we keep bumping into is probably close to the max we can get.
[23:06:46] I checked if ngx_http_v2_module provides an embedded variable that identifies a connection (it does not)
[23:07:19] and then the HTTP/2 spec to see if there is a connection identifier that would be a candidate for exposing via an embedded variable (there is not)
[23:07:54] yeah even for http/1, would be nice if there was a request serial# per-connect available in nginx
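The per-request vs. per-connection skew mentioned at 22:30 can be made concrete with a tiny calculation; the client mix and request counts below are invented purely for illustration:

    # (label, clients, connections per client, requests per connection, resumes on reconnect?)
    populations = [
        ("browsers",  1000, 4,    20, True),   # resume the session on every reconnect
        ("bulk bots",   10, 1, 10000, False),  # one long keepalive connection, no resumption
    ]

    resumed_conns = total_conns = resumed_reqs = total_reqs = 0
    for label, clients, conns_per_client, reqs_per_conn, resumes in populations:
        conns = clients * conns_per_client
        total_conns += conns
        total_reqs += conns * reqs_per_conn
        if resumes:
            resumed = clients * (conns_per_client - 1)  # first connection is a full handshake
            resumed_conns += resumed
            resumed_reqs += resumed * reqs_per_conn

    print("per-connection reuse: %.0f%%" % (100.0 * resumed_conns / total_conns))
    print("per-request reuse:    %.0f%%" % (100.0 * resumed_reqs / total_reqs))

With these made-up numbers, most connections resume but only about a third of requests are on resumed sessions, which is the kind of gap a per-connection identifier in nginx would let us measure directly.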