[13:38:26] bblack: max-revalidate should mean that no stale object can be served, right?
[13:38:36] this test passes: https://phabricator.wikimedia.org/P3599
[13:39:18] *must*-revalidate, not max
[13:41:30] yeah I think the RFC meaning is that after the object expires, you can't serve stale content without revalidating (e.g. an IMS check)
[13:41:56] right. That doesn't happen though, the only request received by s1 is the one initiated by c1
[13:42:00] my question mark about that is whether varnish considers its own cache sufficient for validation of an incoming request
[13:42:14] (I think it should)
[13:43:05] if c1 -> v1 -> v2 -> s2 is the scenario, and v2 has a 300-second cache object, but v1 caps at say 10s and it expires, I'd expect v1 to have to check with v2, but v2 to answer direct from cache, even with must-revalidate.
[13:43:39] but the age-reset data was raising questions about whether, due to must-revalidate, v2 decided to ask s2 even though its object was still alive
[13:43:45] but I hope/think it shouldn't
[13:43:53] in the specific case of P3599 there's only one varnish though
[13:43:54] P3599 max-revalidate-grace.vtc - https://phabricator.wikimedia.org/P3599
[13:44:34] so I would have expected c2's request to trigger an IMS from v1 to s1, but that doesn't happen
[13:44:54] *and* c3's request also doesn't end up hitting s1 in any way
[13:45:15] P3599 passes completely?
[13:45:20] yep
[13:45:28] oh also, nothing here is testing IMS, you're giving it nothing to IMS-check against
[13:45:41] oh right
[13:45:42] the server needs to send a Last-Modified field for IMS to be possible
[13:46:40] but still, shouldn't v1 try to fetch the object again after the 6-second delay?
[13:47:55] it might only do a background fetch in that case (serve grace-but-stale content while fetching), but I would've thought the third request would have a fresh object for sure.
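A minimal sketch of the kind of origin definition the discussion above implies a vtc would need: the server must send a Last-Modified validator for an IMS check to be possible at all, and must keep answering requests so the revalidation can actually reach it. This is an illustrative test, not P3599 itself; header values, the `loop` count, and the delay are made up, and the exact `loop` syntax is as I recall it from varnishtest, so treat it as a sketch:

```
varnishtest "must-revalidate with a validator present (sketch)"

# Hypothetical origin: loops so it can also answer the later
# revalidation request, and sends Last-Modified so IMS is possible.
server s1 {
	loop 3 {
		rxreq
		txresp -hdr "Cache-Control: max-age=1, must-revalidate" -hdr "Last-Modified: Thu, 28 Jul 2016 13:42:13 GMT" -body "hello"
	}
} -start

varnish v1 -vcl+backend {} -start

client c1 {
	txreq
	rxresp
	expect resp.status == 200
} -run

# let the object expire; the next request should force v1 back to s1
delay 2

client c2 {
	txreq
	rxresp
	expect resp.status == 200
} -run
```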
[13:48:40] oh, server s1 only does one request and dies, right?
[13:48:50] s1 needs a loop over its rxreq->txresp or something
[13:48:51] oh, note that this passes on v4 but not on v3
[13:49:05] grace keeps working because s1 is unresponsive
[13:50:43] bblack: ha! Yes, that was it
[13:51:43] good test of grace mode on a dead server though :)
[13:51:58] :)
[13:53:32] on the ciphers subject: apparently some old Symbian phones gained forward secrecy from the dhe+3des thing. it's crazy, I wouldn't have thought anything implemented that as its best cipher.
[13:53:46] (and with working 2048-bit dhe, too)
[13:54:02] they're not statistically significant, but still
[13:55:22] and in other ciphers news: I've sampled repeatedly all over the place, and basically AES128-SHA256 and AES128-GCM-SHA256 are used almost exclusively (completely exclusively in the data I've seen) by downgrading evil TLS proxies. All UAs I've logged behind those are capable of much better ciphers.
[13:55:59] AES128-SHA is slightly different. it's also used *mostly* by such proxies, but there are also a small number of legit ancient clients, too.
[13:57:27] given anything implementing the former two can do the latter as well, we could just dump those two minority ciphers and bin this all up under AES128-SHA. their security sucks no matter which one they use, so why bother?
[13:58:16] (sha256 and gcm aren't really helping them much in practice here, the population is too tiny to matter anyways, and they're all evil proxies anyways)
[14:53:22] bblack: in T141373 I mentioned that we would expect to eventually see Age reaching 14d, but I guess that's wrong, we cap the ttl to 7d so we would expect it to reach 7d, right?
[14:53:22] T141373: Age header reset to 0 after 24 hours on varnish frontends - https://phabricator.wikimedia.org/T141373
[15:01:25] ema: well, yes and no. definitely >1d though :)
[15:01:46] bblack: that's for sure
[15:01:46] ema: it's 7d per-layer, but e.g. in ulsfo, the layers are ulsfo->codfw->eqiad
[15:02:04] so it's capping at 7d at each layer separately
[15:02:25] which means we wouldn't expect many objects over 7d really, but some small amount will be
[15:02:43] because right before the object expired at the 7d mark at a lower layer, it got re-fetched into one of the upper layers
[15:02:50] and the s-maxage is longer (14d)
[15:03:09] (ops hangout started)
[15:03:17] s-maxage should be the real limit, but our per-layer caps are going to skew things statistically to be much more likely to be <= 7d I think
[15:05:18] (what we really want, I think, is to stop trying to cap per-layer obj.ttl, and instead cap the real life of the object. the problem is that even just in the Cache-Control case that's more difficult to parse and modify, and obj.ttl can be derived from other things too, like Expires)
[15:06:04] so probably the right answer is to let the backend-most varnish do all the fancy stuff to derive obj.ttl from one of many things, cap that value, and communicate it as s-maxage in Surrogate-Control within our cache layers.
[15:23:23] back on the "strange Windows+Chrome/41 crap" front: I've done more analysis on the IPs and such, and it doesn't seem to be any one shared country-level proxy or anything like that.
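The per-layer cap being discussed could look roughly like this in VCL, here embedded in a vtc so it's self-contained. This is an illustrative sketch of the idea, not the actual production config: the origin's long s-maxage (14d in the discussion) is clamped to 7d in `vcl_backend_response` at each layer, which is why each hop resets the remaining lifetime independently of the object's true age:

```
varnishtest "per-layer ttl cap (sketch)"

# Hypothetical origin sending a 14d s-maxage (1209600s).
server s1 {
	rxreq
	txresp -hdr "Cache-Control: s-maxage=1209600" -body "x"
} -start

# Illustrative per-layer cap: clamp beresp.ttl to 7d (604800s)
# regardless of what Cache-Control/Expires derived it to.
varnish v1 -vcl+backend {
	sub vcl_backend_response {
		if (beresp.ttl > 604800s) {
			set beresp.ttl = 604800s;
		}
	}
} -start

client c1 {
	txreq
	rxresp
	expect resp.status == 200
} -run
```

Since each layer applies this cap to its own copy, an object re-fetched into an upper layer just before expiring below can live past 7d in total, while still being capped to 7d of remaining ttl at every hop, matching the skew described above.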
[15:23:44] and the population is too big to be something dumb like a popular "alternative" packaging of Chrome/41 people are downloading or whatever
[15:24:53] The Microsoft TLS stack changes are still the best guess at a culprit, since they patched SChannel/CryptoAPI at about the right time (meaning these Chrome41/Win clients were already in our population in these numbers before, but they've been behaving differently now due to the updated library)
[15:25:12] there's a slim possibility that it's somehow all interrelated with this as well:
[15:25:19] https://github.com/openssl/openssl/pull/1350
[15:26:05] ^ which, if you chase the tree of links from there, is all about some MS TLS stack implementations having a bug with certain DH pubkey lengths (which is not 1024 vs 2048, it's about whether the encoding of 2048 is padded to 256 bytes or naturally/accidentally ends up 255 bytes)
[15:26:55] but the fact that it's also an MS TLS stack bug, and it was just noted+fixed in openssl's master branch since this problem cropped up in our stats (and thus since the possible target update MS made to their stack recently), and there's a pullreq to bugfix this back in 1.0.2 as well...
[15:27:43] I dunno, there's a lot of "if" in there, but it would be useful to get that bugfix onto our openssl-1.0.2 (it'll be in the next 1.0.2 release probably anyways) and see if it makes this go away.
[15:29:16] the report of that bug on the openssl side was first opened just 14d ago, which is around the time MS was first publicly releasing that patch (but a slow roll to consumer auto-updates, plus slow os/browser restarts, would explain the slight delay before our stats ramp in)
[15:30:39] MS's own bug reports on it go back a bit further, but still
[16:09:10] bblack: trying to get varnish to issue an IMS request (and failing!). This is how the origin server responds: txresp -hdr "Cache-Control: s-maxage=10,must-revalidate,max-age=0" -hdr "Last-Modified: Thu, 28 Jul 2016 13:42:13 GMT"
[16:10:14] ema: that should trigger it to at least issue an IMS. of course server s1 may not know how to respond to that with a 304 on its own...
[16:10:35] it's also possible v3 doesn't do IMS in as many situations as v4 does.
[16:10:56] bblack: I'm trying with v4 for the time being
[16:11:01] hmmm ok
[16:11:05] ha!
[16:11:18] so when you say "at least issue an IMS", are you looking for a 304, or for an outbound IMS header on the req?
[16:11:43] because server s1 is free to ignore IMS on the req and respond with 200 (which it probably will by default)
[16:11:44] outbound IMS, but I managed to get one now. It looks like only grace mode triggers IMS
[16:11:58] yeah that should be correct, only IMS in the grace window
[16:12:08] well...
[16:12:46] actually I'm not sure about the boundary conditions. it might be possible and legal that the object has fully expired (out of grace as well) but not yet been purged/evicted, and still triggers IMS as a legal optimization?
[16:13:10] not sure. What I cannot reproduce is varnish issuing an IMS request with default_grace=0
[16:13:20] I'm not sure whether varnish categorically ignores fully grace-expired objects
[16:13:44] (or still tries to use them for 304-optimization)
[16:14:33] mmh. All of this is really intricate, there are lots of variables influencing lots of other variables, triggering corner cases... :P
[16:16:05] but it's really good to have vtc available and be able to write down a reproducible test
[16:30:27] 10Traffic, 06Operations, 13Patch-For-Review: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181#2505559 (10BBlack) Recapping latest investigations, stats, and changes: 1. We're down to just `DES-CBC3-SHA` and `AES128-SHA` on the non-forward-secret list. Ever...
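The "only IMS in the grace window" observation above can be sketched as a vtc: the first response carries the Last-Modified validator, the object is given an explicit grace window, and the second client arrives after the ttl but within grace, so (on v4) varnish serves the stale copy and refetches conditionally. This is a hypothetical test, not from the actual session; timings, the 304 exchange, and the background-fetch race are all illustrative assumptions:

```
varnishtest "outbound IMS within the grace window (sketch)"

# Origin: first response carries a validator; the second request
# is expected to be the conditional refetch, answered with a 304.
server s1 {
	rxreq
	txresp -hdr "Cache-Control: s-maxage=1,must-revalidate,max-age=0" -hdr "Last-Modified: Thu, 28 Jul 2016 13:42:13 GMT" -body "hello"
	rxreq
	expect req.http.If-Modified-Since == "Thu, 28 Jul 2016 13:42:13 GMT"
	txresp -status 304
} -start

varnish v1 -vcl+backend {
	sub vcl_backend_response {
		# keep the object usable past its ttl so a
		# conditional (IMS) refetch is possible
		set beresp.grace = 60s;
	}
} -start

client c1 {
	txreq
	rxresp
	expect resp.status == 200
} -run

# arrive after the 1s ttl but inside the 60s grace window:
# expect a stale 200 from cache plus a background IMS to s1
delay 2

client c2 {
	txreq
	rxresp
	expect resp.status == 200
} -run
```

With `default_grace=0` and no explicit grace, as noted above, there is no window in which the stale object is eligible, which matches the failure to reproduce the IMS in that configuration.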
[16:40:38] bblack: I think grace-expired objects can be used for IMS requests if within the keep period
[16:42:02] oh right, I forgot about "keep", that doesn't exist in v3
[16:42:05] makes sense!
[16:44:21] bblack: quickly grepping around the v3 source code, I couldn't see any varnish-issued IMS request
[16:44:49] that could answer question 2 in T141373 :)
[16:44:50] T141373: Age header reset to 0 after 24 hours on varnish frontends - https://phabricator.wikimedia.org/T141373
[16:44:52] yeah that's entirely possible
[16:45:04] I wonder what it does with client IMS?
[16:45:15] I mean, it must satisfy them from cache, I think
[16:45:36] maybe that explains a lot of things, if it can satisfy IMS from cache but can't effectively ask a deeper varnish3 for IMS
[16:47:41] https://www.varnish-cache.org/lists/pipermail/varnish-misc/2011-May/020452.html
[16:47:49] long-ago discussion on the first grace/keep implementation
[16:55:25] 10Traffic, 06Operations: Age header reset to 0 after 24 hours on varnish frontends - https://phabricator.wikimedia.org/T141373#2505650 (10ema)
[18:53:47] bblack: I am seeing an apparently stale cache for https://www.mediawiki.org/wiki/HyperSwitch, both when logged in & when logged out
[18:54:07] however, Pchelolo is seeing the latest version of that page
[18:55:01] expected content is https://www.mediawiki.org/w/index.php?title=HyperSwitch&oldid=2204727
[18:55:03] some more info: I'm the one who made the edit, I've been editing in VE, but I made A LOT of switches back and forth to source editing
[18:59:29] sorry, I wasn't actually logged in -- it's only stale when logged out, so a stale CDN response
[19:15:01] it does happen
[19:15:09] have you tried purging it with action=purge?
[19:22:07] bblack: I was worried about it also happening when I was logged in, but I hadn't noticed that the last browser restart had logged me out
[19:23:05] and yes, action=purge does clear it
[20:15:04] 10Traffic, 10Analytics, 06Operations, 06Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2506628 (10ellery) @Nuria, @BBlack I need to clarify that in the example that I gave above, the experiments were not run concurrently, but in sequence.
[20:16:13] 10Traffic, 10Analytics, 06Operations, 06Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2506631 (10ellery) @Nuria I'm confused by your statement that "a bucket will have control and treatment for 1 experiment". I thought that a bucket represents a group of users...
[20:39:01] 10Traffic, 10Analytics, 06Operations, 06Performance-Team: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2506770 (10ellery) Another issue, independent of proper randomization, is that for most use cases, the data produced by the system cannot be used for statistical testing...