[10:31:46] HTTPS, Traffic, MediaWiki-Authentication-and-authorization, MediaWiki-User-login-and-signup: After login, user not logged in when "prefershttps" set to false and "wgSecureLogin" set to true - https://phabricator.wikimedia.org/T149977#2771208 (Aklapper)
[10:58:34] mmh so, logged-in pageviews of an article causing pass of the same article for anon temporarily
[10:58:50] I don't seem to be able to repro this behavior on esams
[10:59:18] what I'm testing now is:
[10:59:33] load a page as anon, it's a hit
[10:59:42] log in, reload the page, it's a pass
[10:59:50] log out, reload the page, it's a hit again
[11:00:21] is that the expected behavior?
[11:20:22] while I try to understand this: ulsfo->codfw https://gerrit.wikimedia.org/r/#/c/319823/ and ulsfo upgrade https://gerrit.wikimedia.org/r/#/c/319824/
[11:36:30] proper caching of URLs in direct-backends seems to work as far as RB request mangling is concerned
[11:37:09] tested on cp1008 with:
[11:37:11] curl -v -H 'Host: wikimedia.org' -H 'X-Forwarded-Proto: https' http://localhost:3128/api/rest_v1/media/math/render/svg/d494a1039a0708e25bfeb9abb059bbefdaf2bcc3
[11:42:53] setting XFP is useless against a backend, it's basically muscle-memory :)
[11:42:59] writing a vtc test for this one
[11:53:43] https://gerrit.wikimedia.org/r/#/c/319829/1/modules/varnish/files/tests/text/07-backend-misspass_mangle-proper-caching.vtc
[12:36:13] <_joe_> ema: logged in users are not cached, yes
[12:36:31] <_joe_> but that should not affect anons
[13:03:18] _joe_: right, I knew that requests from logged in users should always pass
[13:03:32] I was trying to repro this though:
[13:03:33] 23:41 < bblack> 1. We should validate that placing beresp.uncacheable (was: hfp) in a Vary slot at all still works as it did before. A good test of this would be whether logged-in pageview of an article causes miss (well, pass) of the same article for anon temporarily
[13:05:29] ema: yeah what you saw on esams is expected behavior re: logged-in vs anon
[13:05:45] just wanted to ensure codfw is doing it as well
[13:06:12] the bad behavior would be: load a page as anon, it's a hit, log in it's a pass, log out and it remains a pass for anons for 10 minutes
[13:07:00] except there are some exceptions to what logging out really means. probably should use an incognito to simulate "anon" after you've been logged in
[13:07:28] (or curl)
[13:08:06] yeah I've tested that with curl, no bad behavior
[13:08:14] <_joe_> bblack: I have an interesting problem for you :) So, we have a docker registry, which we want to use to serve images to the public, but I have two "exceptional" requests: 1) the backend only speaks https 2) I want to only let HEAD and GET work from the internet
[13:08:24] <_joe_> I think 2) can be achieved at the nginx level
[13:08:31] <_joe_> but 1) is clearly a new thing
[13:09:06] <_joe_> the point being, I don't want anyone to be able to downgrade the connection of docker to http even internally, and still being served some content
[13:09:38] <_joe_> this is not urgent, but I want to know if it's feasible or not; else, I would have to think of something
[13:09:58] <_joe_> here "nginx level" means the local nginx on the registry
[13:10:09] <_joe_> that handles ssl termination and authentication
[13:14:50] what's new about (1)? I think I'm missing some context
[13:15:19] the docker registry which only speaks https is.... an nginx server on some internal host?
[13:15:39] and you want to expose it via cache_misc you mean, or?
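
The expected behavior ema verifies above (anon requests hit, logged-in requests pass, logging out goes back to hitting) comes from the text VCL passing any request that carries a login/session cookie. A minimal sketch of that rule, with a simplified cookie-name match standing in for the real production logic:

    sub vcl_recv {
        # Logged-in traffic is never served from cache: a session or token
        # cookie forces a pass through every layer.  (The cookie-name regex
        # here is illustrative; the production rules are more involved.)
        if (req.http.Cookie ~ "([sS]ession|Token)=") {
            return (pass);
        }
        # Anonymous requests fall through to normal hashing/lookup and can hit.
    }
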
[13:19:00] _joe_: so my best overall guess is: "backend" here means "the docker registry", which already has an nginx on the front of it as part of it? and you want it to expose through cache_misc, and you want it to be https-only both internally (those reaching the backend directly) and externally (cache_misc, which also has to talk to the backend)?
[13:20:16] the GET|HEAD restriction is easy at the varnish level per-service, we just haven't actually bothered to do so in any existing cases (we do it at the whole-cluster level, though)
[13:21:38] the https part is tricky. I don't think there would be a sane way that we could really force the issue internally while still allowing varnish to use regular http.
[13:22:15] we can make it annoying and difficult to use regular http internally, but someone could always look up the magic in puppet and use it (e.g. alternate port and/or requiring some magic header value)
[13:22:51] could have it look up the network source IP and match against a list of cache_misc backend IPs, but that seems annoying and fragile
[13:23:50] <_joe_> bblack: no way to make varnish make an https request, right?
[13:24:36] Traffic, Analytics-Kanban, Operations: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2771447 (elukey) The issue re-appeared again in upload today (early UTC morning time), all concentrated in `ulsfo`. I managed to cap...
[13:25:58] _joe_: that would be one of varnish's primary existential issues :)
[13:26:45] <_joe_> heh, so I remembered correctly
[13:27:10] but you could have the nginx listen e.g. 80, 90, and 443. have 80 unconditionally redirect to https, but 90 allows skipping the redirect if X-Whatever:Magic
[13:27:24] and then internal bypassers at least have to be very explicit that they're doing Something Wrong
[13:28:10] this raises lots of other questions aside from that though, as I don't think we've had an internal https service before
[13:29:05] is this using the puppet cert as a server cert? and how do we configure https internally when it comes to the finer details (e.g. requiring TLSv1.2 + strong ciphers + no DHE, emitting HSTS, etc)
[13:29:57] or are we issuing the cert via some other mechanism? how are the rest of the hosts trusting its path?
[13:30:14] <_joe_> we're using puppet certs with alt names defined if necessary
[13:30:27] <_joe_> but I have to go to lunch, sorry
[13:30:31] I know we talked about that, I just haven't actually seen a working example before
[13:31:00] what's the internal hostname (is it already running w/ https?)
[13:32:13] <_joe_> atm it's darmstadtium.eqiad.wmnet
[13:32:18] <_joe_> will add a CN later
[13:32:21] <_joe_> sorry, bbl
[13:34:43] Server public key is 4096 bit
[13:34:57] ewww, we still need to fix that sometime, esp if we're going to start using those for internal TLS
[13:35:59] ema: did you start on ulsfo things yet?
[13:36:47] bblack: nope, I wanted to check with you first if https://gerrit.wikimedia.org/r/#/c/319829/ covers what you mentioned yesterday about mangled direct requests
[13:37:18] ema: yeah I think so. I think you wouldn't get the hit on the second request if mangling broke vcl_hash
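
On the GET|HEAD point (13:20:16): a per-service method restriction is only a few lines of VCL keyed on the request host. A sketch, using a hypothetical public hostname for the registry service exposed through cache_misc:

    sub vcl_recv {
        # Per-service read-only restriction for the (hypothetical) public
        # registry endpoint; everything else gets a 405.
        if (req.http.Host == "docker-registry.wikimedia.org" &&
            req.method != "GET" && req.method != "HEAD") {
            return (synth(405, "Method Not Allowed"));
        }
    }
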
[13:38:29] cool
[13:38:54] really, now that all of that is confirmed, it seems odd that they moved Vary processing to where they did
[13:39:18] it may have just been an unintended consequence of how the new fe/be code split was done
[13:40:07] the upgrading to v4 doc doesn't mention any of this though
[13:40:45] well it mentions the fe/be code split :)
[13:42:03] apparently it's intentional:
[13:42:06] https://www.varnish-cache.org/lists/pipermail/varnish-commit/2013-May/009628.html
[13:42:28] - if (http_GetHdr(req->http, VSB_data(sbh), &h)) {
[13:42:28] + if (http_GetHdr(bo->bereq, VSB_data(sbh), &h)) {
[13:43:20] seems like wrong-thinking to me. sure it's the backend appserver that's asking for the Vary, and by implication wants to Vary on the headers being sent to it.
[13:43:53] but there are lots of cases like that where varnish/VCL are inserting themselves into that surrogate role and you can't just leave the interpretation of things that way
[13:44:49] it seems more reasonable to favor the other kind of consistency here: if initial lookup at vcl_recv/vcl_hash time to find an object uses the client req.http.* headers, then storage into those vary slots should use the same for consistency.
[13:45:26] > since that's what the backend saw
[13:45:34] that's the reasoning apparently
[13:46:28] yeah but, VCL is in the business of messing with what the backend saw in general :P
[13:46:38] :)
[13:46:39] Traffic, Analytics-Kanban, Operations: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2771535 (elukey) The following request takes ages to complete on `cp400[67]` but it completes very quickly on `cp1099`: ``` curl "h...
[13:46:47] anyways, I've merged the commit routing ulsfo->codfw
[13:47:35] will puppet-salt that to ulsfo hosts and then start upgrading soon if that's OK
[13:51:01] (if you guys are not super busy, not urgent) - https://phabricator.wikimedia.org/T148412#2771535 - Has anything similar happened in the past? (or maybe something trivial that I am missing)
[13:51:18] I tried also in esams and it works fine, seems weird only in ulsfo
[13:51:39] elukey: I'm looking at it
[13:51:57] thanks!
[13:52:44] elukey: how long is "ages" in ulsfo for it to finish?
[13:53:14] not very precise sorry, several seconds
[13:53:22] something like 30/40
[13:53:52] mine's been hung a lot longer than that now, I think
[13:57:38] tried again and I have been waiting several minutes
[13:58:05] (just ctrl+C, time stated 4 minutes)
[13:58:35] ok
[14:09:43] bblack: can I proceed with ulsfo?
[14:11:03] ema: yes
[14:11:49] elukey: the repro went away, before I could find out anything real
[14:12:20] I did manage to eliminate the expiry mailbox queue backlog as a cause
[14:12:32] the double-requrl is almost certainly due to a 503 retry, too
[14:13:10] we're probably seeing an issue where the frontend is willing to retry a 503 twice, and the 503 itself takes very long to return at all, perhaps waiting through various maximum request timeouts through the stack.
[14:13:59] our ttfb is fairly short, but bbt is long I think
[14:14:04] (timeouts)
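
A sketch of the two mechanisms described just above: the frontend retrying a 503 from the layer below, and the backend-fetch timeouts where time-to-first-byte is kept short but the between-bytes timeout is much longer. The backend name and values are illustrative, not the production settings:

    backend cache_local {
        .host = "127.0.0.1";
        .port = "3128";
        .connect_timeout = 3s;
        .first_byte_timeout = 65s;      # ttfb: fairly short
        .between_bytes_timeout = 180s;  # bbt: long, so a stalled response drags on
    }

    sub vcl_backend_response {
        # Retry a 503 once before surfacing it; if the 503 itself only shows up
        # after a timeout expires, the total wait roughly doubles.
        if (beresp.status == 503 && bereq.retries == 0) {
            return (retry);
        }
    }
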
[14:15:05] bblack: I am a bit ignorant about where the expiry mailbox queue backlog can be
[14:16:00] I imagine that it was on the ulsfo varnish frontends? (Maybe only some of them, that explains why I saw tons of failures only for few cp ulsfo hosts)
[14:16:04] elukey: as you should be, it's a deep detail of varnish that nobody should ever have to know about except varnish developers. But we've had issues with it under v4 for cache_upload, and we have mitigations in place too, but I'm just confirming it's not that.
[14:16:28] ah okok
[14:16:44] sorry I misread the sentence about it
[14:16:46] :)
[14:17:21] I feel a philosophical rant about bad software coming on
[14:17:33] but I'm trying to avoid it until it collects itself together a little more succinctly :P
[14:18:17] ok now I re-read the whole thing and got your theory about 503s
[14:19:43] yeah I'm assuming the double-requrl in the log is what we'd normally see when the frontend sees a 503 from the (varnish) backend
[14:19:55] it's configured to retry those once and se if it 503s again, before giving a 503 to the user
[14:19:59] s/se/see/
[14:20:20] but that still usually happens fast
[14:20:55] also, this particular request is uncacheable (the backend returns Cache-Control:no-cache), so every request walks the whole varnish stack and talks to swift
[14:21:34] I think for , it was taking a very long time for those 503s to bubble up the stack, hitting maximum timeouts all along the way.
[14:22:05] cp4007 seems showing the same issue
[14:22:08] if you want to check
[14:22:10] still?
[14:22:27] I had assumed when it poofed on 4006 it was gone everywhere. that's telling heh
[14:22:40] I just killed my curl after 5 mins
[14:22:49] not sure if it is gone now
[14:24:36] ok I'm not going to control-C this one
[14:24:50] and I have a varnishlog on the frontend side at least, that will see it
[14:29:15] ok 2m28s and got cp4007 frontend instance's client-side and backend-side logs
[14:31:57] so on the client side:
[14:31:58] - ReqHeader X-CDIS: miss
[14:31:58] - VCL_return fetch
[14:31:58] - Link bereq 161161619 fetch
[14:31:58] - Timestamp Fetch: 1478269701.861253 148.827785 0.759859
[14:32:13] 148.8 being seconds it took to fetch
[14:32:31] - RespHeader Backend-Timing: D=30574 t=1478269701765589
[14:32:59] ^ that header comes straight from the mediawiki renderer, and claims it only spent 30ms creating the 404 response
[14:33:43] - RespHeader X-Cache-Int: cp1099 miss, cp2017 miss, cp4007 miss
[14:33:50] it went through a seemingly-normal chain of backends
[14:34:34] everything else looks fairly normal
[14:35:08] it just took 148 seconds for the next varnish (4007-backend) to give it a response which was a full-stack miss, which the true origin apache claims was generated in 30ms
[14:35:58] so the Timestamp:Fetch in this case is related to frontend 4007 -> backend 4007
[14:36:35] yes (and implicitly any other hang beneath that)
[14:36:45] going back over the client-side timestamp markers in a bit more detail:
[14:36:53] - Begin req 159933463 rxreq
[14:36:53] - Timestamp Start: 1478269553.033468 0.000000 0.000000
[14:36:53] - Timestamp Req: 1478269553.033468 0.000000 0.000000
[14:37:12] - VCL_return lookup
[14:37:13] - Timestamp Waitinglist: 1478269701.101394 148.067926 148.067926
[14:37:15] - VCL_call MISS
[14:37:22] - Link bereq 161161619 fetch
[14:37:22] - Timestamp Fetch: 1478269701.861253 148.827785 0.759859
[14:37:22] - RespProtocol HTTP/1.1
[14:37:29] there we go
[14:37:32] it was not the fetch
[14:37:43] it spent 148 seconds in the "waitinglist"
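
For readers decoding the Timestamp lines above: each one is "absolute time, seconds since the start of the request, seconds since the previous timestamp", which makes the split obvious:

    Start          1478269553.033468
    Waitinglist    1478269701.101394  -> 148.07 s after Start, all spent queued behind another fetch
    Fetch          1478269701.861253  -> 148.83 s after Start, only 0.76 s after leaving the waitinglist
    Backend-Timing D=30574 microseconds -> ~0.03 s actually spent by the renderer on the 404

i.e. essentially all of the latency is the coalescing wait, not the fetch itself.
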
[14:37:43] so varnish<->varnish?
[14:37:47] no
[14:37:56] oh waitinglist
[14:37:56] inside the frontend
[14:37:59] (just tried the request from cp2017, no delay)
[14:38:08] I don't even remember what that is, but I've seen references to it before
[14:39:00] me too, while skimming through the code trying to find hints about $bug
[14:39:09] ahhahah
[14:39:16] oh waitinglist is coalesce
[14:39:27] what hfp is supposed to solve
[14:39:35] ha
[14:39:57] so in other earlier logging, I was also noting there's a crapload of requests to this specific URL we're debugging coming from facebook IPs with facebook-external-hit UA
[14:40:14] that's likely why we're seeing this specifically in ulsfo: tons and tons of requests for this, which can cause waitinglist pileup
[14:40:34] and facebook hits ulsfo, because silly valley and all
[14:40:59] there's still the deeper question of why these requests coalesce at all and how we should avoid having waitinglist problems with them
[14:41:47] but if a given frontend is getting tons of requests for the same no-cache URL and we're not finding a way to translate those into hit-for-pass behavior, they will waitlist as they all try to coalesce
[14:44:29] on the backend side:
[14:44:31] - TTL RFC 604800 10 -1 1478269702 1478269702 1478269701 0 0
[14:44:34] - VCL_call BACKEND_RESPONSE
[14:44:37] - TTL VCL 86400 10 0 1478269702
[14:44:39] - TTL VCL 0 10 0 1478269702
[14:44:42] - TTL VCL 0 3600 0 1478269702
[14:44:44] - VCL_return deliver
[14:44:47] - Storage malloc s0
[14:45:06] so it's getting set to 0 TTL, which I guess is good considering it said Cache-Control: no-cache
[14:45:16] but why isn't it getting set as hit-for-pass?
[14:47:53] ok, it's a combination of ugly behaviors here
[14:47:57] I think I have a grasp on it now
[14:48:17] so the two notable things about this 404 response from the renderer->swift->varnish:
[14:48:32] 1) It has "Cache-Control: no-cache" and no other expiry/caching related headers
[14:50:52] 2) It has (again, when emitted by Swift) Content-Length: 465 and Content-Type: text/html; charset=utf-8
[14:51:51] however, by the time it reaches this frontend where I have the log, the response from the varnish backend has:
[14:52:03] - BerespHeader Content-Encoding: gzip
[14:52:16] because varnish is gzipping for us, because do_gzip
[14:52:24] - BerespHeader Content-Length: 282
[14:52:55] hmmm nevermind that last part. I thought at some stage content-length got unset on a response
[14:52:58] but it doesn't
[14:53:05] that's another bug :)
[14:53:38] ok so that eliminates some possible issues
[14:55:03] so there's two things I'm looking at now:
[14:55:33] 1) The 4-hit-wonder code. It creates 0-second uncacheables, until the backend hits more, which it never will in this case.
[14:56:18] but that can't be it, because it requires that the local disposition is "miss" rather than "pass" (good in general), and only operates on response status == 200
[14:56:45] 2) The fact that upload seems to lack the code that text has for creating hfps from <= 0s TTLs in general
[14:57:27] 2) seems particularly interesting
[14:58:04] meanwhile, 2 text ulsfo nodes upgraded so far, all bueno
[14:58:15] codfw hitrates have also improved
[14:58:41] yeah the extra traffic from ulsfo will make the codfw backends fill in their dataset quicker
[14:58:51] right
[14:59:03] I was just double-checking, and on text the ordering/interaction between 4-hit-wonder and general hit-for-pass is correct
[14:59:32] what we're missing on upload is the last bit of text's text_common_backend_response
[14:59:53] but we only want the first sub-case, not the second that deals with cookies
[15:00:18] (and as with text, it needs to happen ahead of 4-hit-wonder, and happen at all layers unlike 4-hit-wonder)
[15:01:39] rewinding to why we've never had this general hfp support on upload:
[15:01:47] in general, we don't get uncacheable responses there, at least not often
[15:01:53] apparently 404s are an exception, though :P
[15:02:06] and very-popular 404s with facebook spamming them probably even rarer
[15:02:22] I think swift should probably stop doing cache-control: no-cache on the 404s, but that's neither here nor there
[15:02:33] some variation of the general hfp code should probably get promoted to common VCL
[15:05:48] lol:
[15:05:50] sub wm_common_backend_response {
[15:05:52] ...
[15:06:01] if (beresp.http.Cache-Control ~ "(private|no-cache|no-store)") {
[15:06:04] set beresp.ttl = 0s;
[15:06:07] /* This should be translated into hit_for_pass later */
[15:06:14] ^ that translation only happens on cache_text :P
[15:06:27] nice
[15:12:12] ema: https://gerrit.wikimedia.org/r/#/c/319869/
[15:12:34] the tricky thing I think I verified there is that, when everything's compiled together for a specific cluster (e.g. text and upload)...
[15:13:05] that the 4-hit-wonder code executes after the general hfp code (otherwise 4-hit-wonder's intentional 0s uncacheables would turn into 10 minute hfps)
[15:13:38] it does make me wonder about 4-hit-wonder having some edge-cases, though (separately from all of this)
[15:14:07] the 4-hit-wonder-matching URLs are going to suffer from coalesce->waitinglist
[15:14:14] which is fine in the "normal" case where they're actually-rare
[15:14:42] it's probably awful if a URL suddenly becomes "hundreds of reqs/sec" popular out of nowhere, at least for the first few requests...
[15:14:59] (or on cold cache)
[15:15:28] although you'd think it would fix itself quickly and worst-case our response time is multiplied by 4-5 for that first spam of requests that happens during that window
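
The missing piece being discussed for cache_upload is the generic conversion of uncacheable responses into hit-for-pass objects, so that hot no-cache URLs pass instead of piling up on the waitinglist. A condensed sketch of the idea, not the literal patch in the Gerrit change linked above:

    sub vcl_backend_response {
        # Generic hit-for-pass: if the response ended up with no usable TTL
        # (e.g. after Cache-Control: private/no-cache/no-store got its TTL
        # zeroed by the common code quoted above), create a hit-for-pass object
        # so later requests for the same object pass instead of coalescing.
        # Unlike 4-hit-wonder this runs at every cache layer, and it must run
        # *before* the 4-hit-wonder check, whose intentional 0-second
        # uncacheables would otherwise become long-lived hfps.
        if (beresp.ttl <= 0s && !beresp.uncacheable) {
            set beresp.ttl = 607s;   # odd value on purpose, to make this hfp's origin easy to spot
            set beresp.uncacheable = true;
            return (deliver);
        }
    }
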
[15:16:37] I should probably pick different values for these two 601s cases, too (the whole reason I picked 601 instead of 600 was so it's easy to debug the origin of these HFPs in the VCL)
[15:17:25] bblack: I'd add a comment to the 4-hit-wonder code mentioning that it should execute after the hfp stuff (for our future selves)
[15:17:36] yeah
[15:17:59] we'll probably end up promoting that with a parameter to common code eventually anyways
[15:18:09] (and then screw it up when we move it, without a comment)
[15:18:20] :)
[15:20:40] amended
[15:22:03] looking
[15:22:43] I'll make a ticket about the swift 404->no-cache issue
[15:22:52] there's probably some reason for it we'll have to debate
[15:25:38] amended for a tiny detail (s/601/607/ in the comment)
[15:26:50] Traffic, Operations, media-storage: Swift should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#2771888 (BBlack)
[15:30:22] nice work, thanks a lot!
[15:32:14] X-Cache: cp1099 pass, cp2005 pass, cp4015 pass, cp4006 pass
[15:32:21] ^ what I get now on that 404, seems right-er :)
[15:32:40] it's still blowing through all the layers, but shouldn't suffer from deranged waitinglist issues
[15:37:34] hopefully this weekend I won't get data alerts, thanks!
[15:38:26] well
[15:38:43] there will probably always be edge cases like these that defy all reason
[15:39:02] there's some deeper issue going on here with how we handle log entries that take a long time to complete
[15:39:47] I don't know if that issue is as simple as setting longer timeouts in vk (I doubt it), or about setting different timeouts in varnish for shmlog-related things, or just a general shmlog design issue.
[15:40:21] the main problem that I saw is that I didn't get any VSL tags carrying "this is a VSL timeout or overflow"
[15:40:35] this one we were looking at today, ultimately the log entry itself was fine. it existed, and it completed in some sense.
[15:40:52] it just took it over 2 minutes to do so. in varnishlog output, nothing appears until that point.
[15:41:12] I imagine for a raw shm client, something appears at the beginning, but the record isn't complete until those 2+ minutes expire
[15:41:12] but there must be something that eventually ends up in the two ReqURLs
[15:41:41] *something == some weird stack timeout or routing scenario
[15:41:44] the 2x ReqURLs isn't always going to be the case. I think that's when we hit a 503 from some timeout happening before the waitinglist was done, in a backend.
[15:42:31] this is probably the explanation, but VSL should contain more data rather than (what seems) a truncated log
[15:42:34] what's the actual problem with these requests from vk's perspective? that a shm transaction starts, but never completes in reasonable time?
[15:43:11] I don't even understand this to that level: is the 2 minutes perceived by vk to happen between records for this log entry? or it all comes at once but just lacks expected tags?
[15:43:13] nono only that I don't get half of the tags for a request, like Timestamp:Resp, content len, etc..
[15:43:19] ok
[15:43:38] so I end up with incomplete logs in hadoop
[15:43:44] (and before in kafka)
[15:44:19] vk is now set to wait 1500 seconds for each incomplete transaction and keep 10000 incomplete records before start flushing them
[15:44:25] in the case I logged, I did get Timestamp:Resp and content length, etc, after 2+ minutes
[15:44:41] I think the cases that tripped vk were a different subset
[15:45:16] hopefully it is a subset of the problem solved today :)
[15:45:17] again, probably the waitinglist hold was hit in the (ulsfo) backend varnish, and waited so long that the fe->be fetch timed out and generated a 503, then retried that again and got it again
[15:45:40] I think what's causing your problem for vk isn't the waitinglist or the timeout
[15:45:54] it's 503s that come from connection errors or connection/request timeouts, which will have incomplete log entries
[15:46:47] theoretically even a 503 backend fetch should end up logging a Timestamp:Resp for the main request that originated it
[15:47:54] netops, Operations, ops-eqiad: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2772002 (faidon)
[15:48:29] you'd think
[15:49:06] I think we're looking specifically at the case of a varnish frontend internally-generating a 503 because of connection timeout/error when talking to the varnish backend
[15:49:17] (as opposed to receiving an actual 503 from it)
[15:49:28] ahhh ok this new for me
[15:49:33] *this is
[15:49:57] maybe there is a way to repro it with VTC
[15:50:14] maybe
[15:50:16] * elukey sees ema running away
[15:50:37] or at least, repro it manually by having a test varnishd set up to talk to a "backend" which is just netcat giving no response, to hit a timeout
[15:51:00] it may be a between-bytes timeout rather than ttfb though, so you might have to generate a partial response and then hang
[15:51:31] (e.g. send basic response and some headers, but never finish the headers. or start the content but don't send all the CL bytes expected)
[15:52:22] * elukey takes notes
[16:08:22] 3/6 nodes upgraded so far, we can probably speed up a bit for the next ones
[16:08:46] I meant 4/6 !
[16:19:14] yeah
[16:19:25] I went only a few minutes between on codfw
[16:19:44] maybe 10 for ulsfo?
[16:19:49] (for these last few)
[16:21:27] yeah
[16:27:20] Traffic, Operations, media-storage: Swift should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#2772092 (fgiunchedi) AFAICS it is `thumb_handler.php` from MW generating CC: no-cache on 404s and then proxied back to varnish by swift's `rewrite.py`, e.g. on m...
[16:29:22] the way the frontend hitrate on a single host goes up on a cold cache is really crazy
[16:29:29] from 0% to 60% in just a few seconds
[16:30:06] then getting beyond that takes quite a while
[16:38:11] Traffic, Operations, media-storage: Swift should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#2772108 (BBlack) When I hit a renderer directly, I get: ``` bblack@cp1099:~$ curl "http://rendering.svc.eqiad.wmnet/wikipedia/commons/thumb/6/63/Taissa-Farmiga--...
[16:39:15] godog: well as soon as I said that, I realized that can't be the right URL on the renderer
[16:39:24] but for the record, what *does* swift hit for mw rendering?
[16:42:05] bblack: hehe yeah it is slightly odd for histerical raisins, the url I pasted in phab is what gets hit, http://commons.wikimedia.org/w/thumb_handler.php/<...>
[16:42:47] so rewrite.py needs to rewrite (duh) the host for mw to pick up, plus thumb_handler.php in the url
[16:43:16] but surely not actual commons.wm.o for the TCP connection, right?
[16:43:27] ok
[16:43:29] bblack: cp3041's varnish-be and varnish-be-random are depooled, is that on purpose?
[16:43:42] ema: no, likely another backend-restart fallout
[16:43:48] ok, fixing
[16:44:06] bblack: ah, no that'd be rendering.svc as you figured out
[16:45:11] all text-codfw upgraded to v4 and repooled
[16:45:15] heh
[16:45:17] text-ulsfo
[16:45:20] godog: ok!
[16:47:53] Traffic, Operations, media-storage: Swift should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#2772143 (BBlack) Ignore the above comment, the URL is wrong. What @fgiunchedi pasted is right, you just have to connect to `rendering.svc.eqiad.wmnet` while usi...
[16:49:43] Traffic, ArchCom-RfC, Commons, MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2772146 (GWicke) > In the current examples, I think it's unfortunate that height-constraining isn't considered. Not as a feature that would be available...
[17:02:49] Traffic, ArchCom-RfC, Commons, MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2772169 (GWicke)
[17:09:27] bblack: Hi (I'm back from traveling a lot). A few days ago in T144187 you asked for additional simulations for 100GB and 200GB caches. I just wanted to say, I'm working on it (just takes a while, I have old hardware).
[17:09:27] T144187: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187
[17:10:41] bblack: also, I've started to open-source the simulator code at https://github.com/dasebe/webcachesim so you can run stuff with your own params etc. (I just started doing this an hour ago, so the repo is lacking a few bits and pieces.)
[17:10:51] Traffic, Operations, media-storage: thumb_handler.php should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#2772190 (BBlack)
[17:11:07] Traffic, Operations, media-storage: thumb_handler.php should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#2771888 (BBlack) Title/desc fixed up to not implicate Swift :)
[17:13:36] dberger: awesome :)
[17:16:06] bblack: I was also wondering, are you all set with the current 4-hit-wonder code? I'm not sure I understood some of the edge cases mentioned above. On how many servers is this running atm?
[17:23:40] dberger: yeah I think it's running ok for now. It may not be ideal, but it's better-than-nothing. It's running on all of our big caches (text and upload).
[17:24:31] dberger: also, after I was commenting on the exp(-size/c) method and cutting out the corner cases, I realized it would be easier to just do that with numerical rounding of the result :)
[17:25:24] dberger: we can run that function to decide cache-insertion probability, and then round out towards 0 or 1 at a few significant digits (say <.00001 or >.99999 or whatever) and get the same effect
[17:25:27] bblack: do we know how much of a difference it makes for the hit ratio?
[17:26:02] well, realistically every floating point representation eventually has to do the same rounding, we just probably want to round at fewer digits
[17:26:36] so that there's not this lingering 0.00000001% chance we'll hit 6 months from now where it accidentally tries to insert a 4GB object into a small frontend cache and wreak havoc
[17:27:04] bblack: yes, that makes sense
[17:27:39] but if you play with c and the sizes, you can see it's easy to set reasonable probability limits that enforce that
[17:28:14] bblack: would you be up for some kind of a/b testing, where some servers run no admission control (or the old config), some run the 4-hit-code, and some the exp-probabilistic thing?
[17:29:01] dberger: yeah. we'd have to rig up a custom graphing dashboard to see the results, too (right now it doesn't show down to the per-cache node level, only per-site aggregated)
[17:29:16] e(-1024*1024/32768)
[17:29:16] .00000000000001266416
[17:29:22] bblack: right, and things are probably quite different from site to site
[17:29:56] ^ so there if "c" is 32K, a 1MB object still has a tiny probability of insertion. but if we cap things off at "if x < .000001; x = 0" or some such, we can kill those cases.
[17:31:03] dberger: probably the biggest question is the degree to which these relatively-small variations in total frontend cache size affect the optimal parameters
[17:31:47] dberger: given the large dataset, it may make very little difference between 64G and 200G in the tunables. but if there is a significant effect on the optimal values, then we'll need to at least roughly account for it somehow
[17:31:51] bblack: yes, I'm working on that -- my simulations are rather slow, though. I'm using the whole 2 week trace, though, maybe I can speed it up..
[17:32:30] bblack: ok, I'll try to answer that in a few days. From what I've seen so far, the optimal parameters vary very little between 64GB and 200GB caches
[17:32:33] I figure with 3 data points at 64, 100, 200 for the optimal values, we can at least roughly do it in our VCL templating based on machine properties
[17:32:59] (or decide that ~100G values are "close enough" for all cases and be done with it)
[17:33:26] bblack: alright, I'll simulate those and also put in the rounding thing.
[17:34:09] https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/templates/text-frontend.inc.vcl.erb#L222
[17:34:21] ^ FWIW, that's our VCL for the 4-hit-wonder (identical in the upload cache case, too)
[17:35:21] X-CDIS is telling us the local disposition of the request (it was a miss in this frontend), and X-Cache-Int is telling us whether the backends (with large file storage) hit on it and how many times
[17:37:27] bblack: thanks! When we do the probabilistic policy, we probably need some c-code. Would you put that in a little vmod or put it directly in vcl? (I have used a varnish 4.2 vmod in my own experiments, happy to share it.)
[17:38:18] bblack: the c-code would be simple: parse the size into integer & draw random number based on it
[17:39:27] Traffic, Operations, Patch-For-Review: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2772387 (BBlack) The AES flip doesn't seem to have had any notable influence on performance metrics, either. Now after all the above merges, our server-side ciph...
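
A rough paraphrase of the 4-hit-wonder code behind the link at 17:34 (the .erb template is authoritative; the X-Cache-Int match below in particular is an approximation of the real condition): the frontend only admits an object into its small memory cache once the backend layers report having hit it a few times, and until then it stays a 0-second uncacheable rather than a hit-for-pass, so the decision is re-evaluated on the next miss:

    sub vcl_backend_response {
        # "4-hit-wonder" admission filter, frontend layer only, and only after
        # the generic hit-for-pass conversion has already had its chance.
        if (beresp.status == 200
            && bereq.http.X-CDIS == "miss"
            && beresp.http.X-Cache-Int !~ " hit/([4-9]|[0-9]{2,})$") {
            # Backend layers haven't hit this object ~4+ times yet: don't admit
            # it here, but don't create a hit-for-pass either.
            set beresp.ttl = 0s;
            set beresp.uncacheable = true;
        }
    }
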
[17:41:36] Traffic, Operations, Patch-For-Review: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2772402 (BBlack) Another notable effect was that there's apparently a small subset of clients out there in the world (somewhere in the 0.1% -> 1% ballpark) who im...
[17:43:41] dberger: probably inline C, as varnish has trouble loading new versions of vmod code without restarting the whole server process
[17:45:51] (so it wreaks havoc on our overall management of instances if we keep deploying new independent vmod packages while VCL reloads are ongoing)
[17:47:30] we'd need inline-C for the exp(-size/c) too, to take care of the floating math and then making the probability decision accurately
[17:48:05] (have it return 0 or 1 to use in a simple VCL conditional for insertion vs ttl=0+uncacheable)
[17:49:34] bblack: yes, I was thinking along the same line.
[17:59:18] ema: cp1066 now, checking it
[18:00:19] I guess even 5s isn't enough. maybe sync + 10s will get us through until v4-complete?
[18:03:01] (trying that)
[18:06:05] netops, Operations, ops-eqiad: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2772466 (faidon)
[18:07:43] netops, Operations, ops-eqiad: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724967 (faidon) asw<->asw2 links are done, 4x10G on racks D 2 and D 7 (2 each). The 4x10G links from cr1-eqiad:ae4 to asw-d-eqiad:ae1 have been moved over to asw2-...
[19:39:23] would something like https://gerrit.wikimedia.org/r/#/c/269086/ still be valuable?
[19:39:33] I haven't kept up with the state of the art in varnish instrumentation
[21:07:26] Traffic, ArchCom-RfC, Commons, MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2772934 (Tgr) >>! In T66214#2767815, @Gilles wrote: > These issues have been decoupled. Thumbor is currently set to be a drop-in replacement for image s...
[21:18:39] Traffic, ArchCom-RfC, Commons, MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2772979 (GWicke)
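
Closing the loop on the exp(-size/c) thread (17:43-17:48): the plan there is an inline-C or vmod helper that computes exp(-size/c), rounds anything below ~0.00001 down to 0 and above ~0.99999 up to 1, and hands VCL a simple yes/no. As a pure-VCL stand-in using only vmod_std, a coarse step-function approximation of that policy might look like the following (the sizes, the choice of c and the probabilities are illustrative, not tuned values):

    import std;

    sub vcl_backend_response {
        # Probabilistic admission approximating exp(-size/c) with c ~= 256 KB.
        if (std.integer(beresp.http.Content-Length, 0) > 4194304) {
            # >4 MB: admission probability rounds to 0, never admit to this frontend
            set beresp.ttl = 0s;
            set beresp.uncacheable = true;
        } elsif (std.integer(beresp.http.Content-Length, 0) > 262144
                 && std.random(0.0, 1.0) > 0.37) {
            # around size == c, exp(-1) ~= 0.37: admit roughly a third of the time
            set beresp.ttl = 0s;
            set beresp.uncacheable = true;
        }
    }
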