[12:32:29] ema: I got the grafana miss/pass split worked out yesterday. At least, better than it was. [12:33:01] I left off the bug/err stuff, so we could still be missing some strange cases through e.g. vcl_backend_error, not sure. but the bulk of the real miss/pass is split in the graph correctly now I think. [12:45:04] bblack,ema: I installed the new vk on cp3008 and it seems to be working fine, even with the new -T option. I am inclined to upload the new version to reprepro and upgrade maps [12:46:06] (I also checked kafka logs via stat1002 for cp3008, all good) [12:54:41] elukey: sounds ok to me :) [12:55:45] all right thanks! [13:13:20] ok just finished maps, tomorrow I'll do misc :) [13:14:20] awesome :) [13:32:45] bblack: nice! [13:49:46] bblack: there is still some 3% of miss/pass currently, shouldn't it be 0? [13:52:04] ema: historical, I think [13:52:26] if you limit the time window to after the change was deployed, it should go away [13:52:44] indeed [13:53:06] it's a bit confusing though that it's listed under "current" [13:54:09] "current"? [13:54:32] oh, yeah [13:54:32] not sure why [13:54:32] yep, in the min/max/avg/current split on the right [13:55:18] I would have to guess that "current" uses the right-most datapoint that's still visible on the graph for that metric [13:55:25] since they often come in async in various cases [13:55:31] anyways, really interesting to see the actual misses [13:55:49] yeah I dug into them a bit in varnishlog on text yesterday. I was honestly expecting "pass" to be higher [13:56:18] a lot of the misses seem legitimate though, unique API queries that aren't actually cacheable, and fetches on real articles that just aren't cached (perhaps purged just before?) [13:56:53] sorry that should have read: "unique API queries that *are* actually cacheable" (but unlikely to ever be hit again due to unique combinations of params) [13:57:20] good candidates for the cache-on-second-hit filter! [13:57:53] :) [13:58:02] how about the 0% pass on upload? [13:58:11] it makes sense [13:58:23] we don't really do POST traffic there, or any other reason to pass [13:58:30] there's no sessions, etc [13:58:53] upload's dataset is both huge enough and updated frequently enough that we expect some level of miss there [13:59:29] oh, no POST? the actual uploads go elsewhere then [13:59:30] (esp again given there is some purge traffic, and images that are purged for a good reason are probably almost always viewed at least a few times shortly after) [13:59:43] right, upload.wm.o is basically a read-only image-output service [13:59:52] should be called download then :) [13:59:58] lol [14:00:06] or "uploaded" :) [14:00:10] right [14:00:20] the uploads happen on commons.wikimedia.org, which is text-cluster, confusingly :) [14:11:12] bblack: I'm back trying to make our upload vcl load on v4 despite the vcl_{synth,backend_error} madness [14:11:14] https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/upload-frontend.inc.vcl.erb#L119 [14:11:35] obj.response became beresp.reason, and we can live with that [14:11:54] however, beresp.reason cannot be set from vcl_synth, only from vcl_backend_error [14:12:03] do you happen to remember how we worked around that? [14:12:21] and at any rate, how important is it to set reason to "ok"? :) [14:13:50] we'll run into the same problem on text-cluster. seems like a common pattern.... [14:14:12] I mean, fundamentally why would any code location be able to set the response code and not the response reason? [14:14:19] they go together [14:14:23] yeah /o\ [14:14:32] there must be something obvious we're missing here [14:14:44] either that or set it up in vcl_recv when the synthetic error code is thrown. [14:14:45] we had to deal with this for errorpage already right? [14:15:16] I don't think so [14:15:28] isn't that why we came up with backend_error_errorpage and synth_errorpage? [14:15:35] but in any case, probably the "right" fix is in vcl_recv, change it to s/CORS/OK/ there [14:15:40] when it emits the 667 [14:18:14] oh no, PEBKAC [14:18:27] I was using beresp instead of resp [14:23:04] set <%= @resp_obj %>.<%= @reason_response %> = "OK"; [14:23:10] this is getting more and more readable
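The resolution above boils down to Varnish 4's split of client and backend scopes: vcl_synth only sees resp.*, vcl_backend_error only sees beresp.*. A minimal VCL 4 sketch of the pattern, assuming the 667/"CORS" synthetic status discussed above; the OPTIONS trigger and reason strings are illustrative, not the production template:

    # vcl_recv emits the synthetic status (667 is the internal CORS marker)
    sub vcl_recv {
        if (req.method == "OPTIONS") {            # illustrative trigger
            return (synth(667, "CORS"));
        }
    }

    # client-side synthetic responses: only resp.* is in scope here
    sub vcl_synth {
        if (resp.status == 667) {
            set resp.status = 200;
            set resp.reason = "OK";               # resp, not beresp
            return (deliver);
        }
    }

    # backend-side errors: only beresp.* is in scope here
    sub vcl_backend_error {
        set beresp.reason = "Backend Error";      # illustrative
        return (deliver);
    }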
[14:33:41] 10Traffic, 10Varnish, 06Operations: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2457902 (10ema) a:03ema [14:42:15] 'bereq.http.Range': cannot be set from method 'cluster_fe_miss' [14:42:17] mmh [14:44:53] makes sense, miss is in the client side, bereq is backend [14:58:53] bblack: ema: I read your discussion right now and...I have NO idea what you 2 are talking about there. There are snippets I understand, but mostly it sounds like gibberish....how in the world do you 2 keep wikimedia running and even improving it? I'm more than impressed! [15:02:31] Oh and btw: So the article for "Adolf_Hitler" is modified less than every 2 days and it still only exists for 24h so the data cache doesn't have it after that. Which is strange....I would have thought that Adolf_Hitler is quite popular with the game "4-5 clicks to Hitler" [15:02:49] yeah [15:03:35] it would be interesting to try to dissect the problem for sure [15:03:46] if it was any kind of natural effect, you wouldn't think it would fall right on the 24h boundary [15:04:02] something explicit is going on, could be related to our insane purge traffic rates [15:04:37] it could also be some poorly-understood interaction with the TTL capping stuff [15:05:26] or it could be some misunderstanding in what's being tested and what the outputs mean, but IIRC from yesterday you were saying this was based on Age: resetting back to a low value, and that seems like a pretty reliable indicator. [15:15:13] Snorri_: I opened a screen session to leave open for a few days, which is doing a raw network sniff on our PURGE traffic and matching only packets that contain the string '/wiki/Adolf_Hitler' [15:15:44] Snorri_: should give us some evidence/direction on why the Age of that article never goes over 24h [15:20:47] actually I'm limiting that further to 'en.wikipedia.org/wiki/Adolf_Hitler' to reduce clutter. already had a match on jawiki heh [15:29:03] bblack: oh yeah, the error makes sense. I was wondering why we're copying req.http.Range into bereq.http.Range [15:30:00] ema: maybe varnish (only 3?) filters it on req->bereq copy by default? it would make sense if (at least in V3) varnish fetched the whole object into cache to satisfy the Range request, as a basic not-so-advanced way of handling it. [15:34:25] <_joe_> bblack, ema did we have another insecure post cutoff yesterday?
<_joe_> I have to report to SoS that, I guess [15:40:45] _joe_: yes [15:41:54] _joe_: the original announcement is here for posterity: https://lists.wikimedia.org/pipermail/wikitech-l/2016-May/085618.html [15:42:40] given I was on vacation just before the final cutoff date (which was yesterday), and we still had a fair bit of outstanding labs traffic, including the unfortunate Merlbot case, we gave labs a partial stay of execution [15:43:12] yesterday this was merged: https://gerrit.wikimedia.org/r/#/c/298336/ [15:43:40] which did the complete insecure-post cutoff as planned for the outside world, but gives labs bots another week, and raises their random-failure rate from 10% to 20%. [15:47:36] <_joe_> ok thanks [15:49:56] <_joe_> what's the merlbot situation? [15:52:40] well, there's a handful of bots, even in labs, that just didn't update and have missed all warnings and errors and emails and attempts to contact them so far. It's kind of a given that the real access cutoff is going to be the wake-up call for some cases, no helping that on our end. [15:53:01] <_joe_> yeah makes sense [15:54:04] Merlbot is one of those, but it's confounded by some additional factors: (1) it's a major bot on dewiki that people care about the actions of (2) the bot's owner/author disappeared from the community for a few months around the same time we started this final insecure-post phaseout and couldn't be reached, apparently due to a serious accident and hospital stay, etc. [15:55:06] but on the other hand, we had already communicated with him directly like a year ago about his bot, before the latest announce->cutoff efforts. there was no solution to fix it back then because it's closed Java source code, and he claims his code will only do HTTPS correctly on Java 8, and we don't support Java 8 on gridengine nodes, etc... [15:55:27] there's been a handful of tickets about the whole gridengine vs java8 thing. We don't have a single OS distribution that can do both things. [15:55:58] he's been back in contact as of a couple weeks ago briefly, but we still don't have a working solution for his bot [15:57:03] https://phabricator.wikimedia.org/T121279 [15:57:13] ^ one of the better related tickets [16:02:12] bblack: you were right, Range does not get copied into bereq (also on v4) [16:03:01] so basically we need to set X-Range in vcl_recv and copy it into bereq in vcl_backend_fetch I guess [16:03:20] ema: yeah. it makes sense. basically if you're not going to do really tricky/optimal things with how a cache handles range, the best simplistic strategy is to cache the whole object and then satisfy the Range request (and other future ones) from it
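A minimal sketch of that X-Range idea, assuming only what's described here (the X-Range header name is from the proposal above; the subs are the standard v4 ones): stash the client's Range header in vcl_recv, since Varnish drops it on the req->bereq copy, then restore it onto the backend request in vcl_backend_fetch.

    sub vcl_recv {
        if (req.http.Range) {
            # Range itself won't survive the req->bereq copy, but a
            # custom header will
            set req.http.X-Range = req.http.Range;
        }
    }

    sub vcl_backend_fetch {
        if (bereq.http.X-Range) {
            # restore the original Range onto the backend request
            set bereq.http.Range = bereq.http.X-Range;
            unset bereq.http.X-Range;
        }
    }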
[16:03:41] <_joe_> bblack: closed source? [16:03:43] <_joe_> wat? [16:03:53] <_joe_> I am pretty sure it's against the tools TOS [16:04:03] _joe_: yeah this gets back into the debate on tool/bot source code. labs policy hasn't historically required open source. [16:04:42] <_joe_> bblack: but now we have java 8 on k8s! [16:04:45] in his recent update https://phabricator.wikimedia.org/T121279#2392449 he said (for the first time) his code is MPL-licensed, but no answer as to where we can download it. [16:04:58] <_joe_> well if we have the jar... [16:05:00] <_joe_> :) [16:05:01] 10Traffic, 06Operations, 06Performance-Team: Split stats/metrics by cache cluster - https://phabricator.wikimedia.org/T109378#2458286 (10fgiunchedi) 05Open>03Invalid we do have these stats now under `varnish.` in graphite and grouped by cluster. Tentatively resolving, unless there are some missing stats. [16:06:10] 10Traffic, 06Operations, 06Performance-Team: Split stats/metrics by cache cluster - https://phabricator.wikimedia.org/T109378#2458304 (10BBlack) Yeah good call. We don't have TLS broken down, but IMHO it's not that important. We did get status codes broken down in https://grafana.wikimedia.org/dashboard/db... [16:07:43] ema: obviously, we're trying to be smarter with how our v3-range stuff works on upload, because we don't want a Range request for someone skipping to the end of a long video to wait on loading up the whole file into varnish. [16:08:31] ema: but it's not necessarily true that VCL alone can fix that problem. If we pass the Range through to the backend, what's Varnish going to actually do with the response? cache each range chunk separately? pass-mode-only? .... [16:08:43] ema: so that's what some of the old Plus range support was dealing with [16:09:22] heh [16:10:30] in any case, I've never fully understood even our current VCL on it [16:11:04] it seems like the frontend passes all range requests (makes sense, let BE deal with that problem), but it does it in two separate places: on seeing Range on the request, and on seeing Content-Range in the response. [16:12:03] then in the backend, we do hash_ignore_busy on high-range requests (so it can still cache, but it won't wait on a concurrent fetch of the whole object that might take a long time to reach this range) [16:12:34] and we pass on those apparently, too [16:12:53] I guess normally pass would imply no busy-wait, but here we're avoiding busy-wait between range-pass and whole-cacheable
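Assembled from the description above, the current behavior sketches out roughly as follows. This is a reconstruction for illustration, not the production templates: the frontend/backend split into two VCLs, the v3 sub names, and the "high range" threshold regex are all assumptions.

    # frontend VCL (v3): pass all range traffic, let the BE layer deal
    sub vcl_recv {
        if (req.http.Range) {
            return (pass);
        }
    }
    sub vcl_fetch {                     # vcl_backend_response in v4
        if (beresp.http.Content-Range) {
            return (hit_for_pass);      # partial response: don't cache it
        }
    }

    # backend VCL (v3): deep-offset requests shouldn't busy-wait on a
    # concurrent whole-object fetch
    sub vcl_recv {
        if (req.http.Range ~ "^bytes=[0-9]{8,}-") {   # hypothetical ~10MB+ threshold
            set req.hash_ignore_busy = true;
            return (pass);
        }
    }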
[16:14:42] we don't have any good reason to avoid caching the large objects the range requests reference [16:14:48] we just want to avoid delays [16:15:01] in an ideal world, I think the behavior we want is this: [16:15:48] 1) Frontend: If the Range request is a miss, make it a pass. Obviously, if it's satisfiable by a fully-cached object already, use that instead. [16:16:49] 2) Backend: On any Range request that's a miss, we want to at least start loading in the whole object. But if the range is very high, there's a good chance there's significant delay waiting for the whole object to load, so do a (pass) on that request (taking an applayer hit to reduce latency). [16:17:39] I think the problematic part is the first part of 2. If we get a high-range miss, and the object hasn't already started loading due to some other whole-object request, how do we trigger loading the whole object into cache, while also passing on this one request at the same time? [16:18:31] 1) is the current behavior already, isn't it? [16:18:41] yes, on v3 [16:19:13] I'm not sure to what degree v4 supports Range at this point, we should check that it does satisfy them from a full cache object [16:19:39] right [16:20:06] I think the "problematic part" above we don't even have under v3. at least, it's not apparent in the VCL. but perhaps the plus patch was giving it to us? [16:22:02] I guess it's time to more-fully understand the old plus patches' behavior [16:22:08] yep [16:22:42] in the meantime, the current VCL is fully ported to v4 with the exception of the bereq.Range part mentioned above [16:22:49] https://gerrit.wikimedia.org/r/#/c/298744/ [16:22:51] ok cool [16:23:16] the simplest approach, that almost certainly "works", is: [16:23:24] noop on v3 so we might want to merge it (if it looks good to you) and then tackle the whole topic of range requests [16:23:28] 1) As before, FE passes on range misses [16:24:20] 2) BE passes on high-range, doesn't pass on low-range, and we hope (test) that v4 will satisfy the low-range request to the user before the rest of the file is done loading in. [16:25:23] in theory that should work fine. if a file is popular for high-range requests (e.g. from skipping forward quickly or bookmarking the end), *somebody* will hit the low range a few times too and get it into cache eventually. [16:27:13] sounds like a plan! [16:29:10] it's not as optimal as it could be, and I suspect v3-plus does better than that in some cases, but I don't see why the above wouldn't fundamentally work and not melt us. [16:35:59] 10netops, 06Operations: Turn up new eqiad-esams wave (Level3) - https://phabricator.wikimedia.org/T136717#2458474 (10faidon) The esams side is apparently ready, we're waiting for the LOA; the eqiad side is being actively worked on. Commit date is the 26th, firm date is the 22nd, optimistic delivery is on the... [16:43:03] ema: looking at the old plus diffs and ignoring unrelated things about storage silos and bans and nukes, etc... what's left seems mostly to be about streaming more than being about range [16:43:35] ema: but there are some minor interactions with range, too. as in, how to support pass-requests with streaming and range, and how to presumably support non-pass streaming+range. [16:43:52] it may be that those basic things are already covered in V4, since it streams by default [16:44:42] ok [16:44:53] the biggest question mark is, from above, testing whether it will satisfy a missed low-range request while the rest of the huge body is still streaming in from the backend. [16:45:26] and I guess whether that applies to other concurrent clients, too [16:45:41] e.g. imagine this sequence on an initially empty cache: [16:46:33] 1) User1 requests range 1-10 of a 1GB object. We hope varnish fetches the whole thing, and sends those 10 bytes to User1 as soon as they arrive from the BE, not after the whole 1GB is into cache. [16:47:25] 2) Halfway through (512MB transferred in from backend), User2 requests range 800-900. We hope varnish satisfies that from the partially-loaded object immediately. [16:49:44] 3) I guess we hope the above is considered a "hit", but if it had been for the final 10 bytes of the object, it's considered a miss. [16:50:27] 4) On misses above a certain bytes-from-the-start, we convert to "pass" and assume that won't be common because someone or something is going to trigger loading it with a lower-range request or a whole file request. [16:51:06] in (3) above, another possibility of course is that the request for the final 10 bytes, arriving when we're still half-done loading, would by default stall on request coalescing.
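Under that plan, the backend-side VCL delta is small; a sketch under the same assumptions as before (hypothetical threshold regex, v4 sub names), with v4's default streaming expected to do the rest:

    # backend VCL (v4 sketch): pass only on deep-offset ranges
    sub vcl_recv {
        if (req.http.Range ~ "^bytes=[0-9]{8,}-") {
            # high range: don't wait on a concurrent whole-object fetch,
            # take the applayer hit instead
            set req.hash_ignore_busy = true;
            return (pass);
        }
        # low ranges fall through to a normal lookup: a miss starts a
        # whole-object fetch, and with streaming the early bytes should
        # reach the client before the full body is in cache -- the
        # behavior to verify in testing, per the scenario above
    }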
[16:56:26] dinner, bbl [19:28:23] 10Traffic, 10netops, 06DC-Ops, 06Operations, and 2 others: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#2459327 (10Cmjohnson) 05Open>03Resolved The servers have been installed and set up. The only issue is the SNMP errors, but that has a separate linked task. resolving t... [22:04:36] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2459936 (10Boshomi) >>! In T132521#2454541, @demon wrote: > Where would such a deprecation be...