[10:31:46] HTTPS, Traffic, MediaWiki-Authentication-and-authorization, MediaWiki-User-login-and-signup: After login, user not logged in when "prefershttps" set to false and "wgSecureLogin" set to true - https://phabricator.wikimedia.org/T149977#2771208 (Aklapper)
[10:58:34] mmh so, logged-in pageviews of an article causing pass of the same article for anon temporarily
[10:58:50] I don't seem to be able to repro this behavior on esams
[10:59:18] what I'm testing now is:
[10:59:33] load a page as anon, it's a hit
[10:59:42] log in, reload the page, it's a pass
[10:59:50] log out, reload the page, it's a hit again
[11:00:21] is that the expected behavior?
[11:20:22] while I try to understand this: ulsfo->codfw https://gerrit.wikimedia.org/r/#/c/319823/ and ulsfo upgrade https://gerrit.wikimedia.org/r/#/c/319824/
[11:36:30] proper caching of URLs in direct-backends seems to work as far as RB request mangling is concerned
[11:37:09] tested on cp1008 with:
[11:37:11] curl -v -H 'Host: wikimedia.org' -H 'X-Forwarded-Proto: https' http://localhost:3128/api/rest_v1/media/math/render/svg/d494a1039a0708e25bfeb9abb059bbefdaf2bcc3
[11:42:53] setting XFP is useless against a backend, it's basically muscle-memory :)
[11:42:59] writing a vtc test for this one
[11:53:43] https://gerrit.wikimedia.org/r/#/c/319829/1/modules/varnish/files/tests/text/07-backend-misspass_mangle-proper-caching.vtc
[12:36:13] <_joe_> ema: logged in users are not cached, yes
[12:36:31] <_joe_> but that should not affect anons
[13:03:18] _joe_: right, I knew that requests from logged in users should always pass
[13:03:32] I was trying to repro this though:
[13:03:33] 23:41 < bblack> 1. We should validate that placing beresp.uncacheable (was: hfp) in a Vary slot at all still works as it did before. A good test of this would be whether logged-in pageview of an article causes miss (well, pass) of the same article for anon temporarily
[13:05:29] ema: yeah what you saw on esams is expected behavior re: logged-in vs anon
[13:05:45] just wanted to ensure codfw is doing it as well
[13:06:12] the bad behavior would be: load a page as anon, it's a hit, log in it's a pass, log out and it remains a pass for anons for 10 minutes
[13:07:00] except there are some exceptions to what logging out really means. probably should use an incognito to simulate "anon" after you've been logged in
[13:07:28] (or curl)
[13:08:06] yeah I've tested that with curl, no bad behavior
[13:08:14] <_joe_> bblack: I have an interesting problem for you :) So, we have a docker registry, which we want to use to serve images to the public, but I have two "exceptional" requests: 1) the backend only speaks https 2) I want to only let HEAD and GET work from the internet
[13:08:24] <_joe_> I think 2) can be achieved at the nginx level
[13:08:31] <_joe_> but 1) is clearly a new thing
[13:09:06] <_joe_> the point being, I don't want anyone to be able to downgrade the connection of docker to http even internally, and still being served some content
[13:09:38] <_joe_> this is not urgent, but I want to know if it's feasible or not; else, I would have to think of something
[13:09:58] <_joe_> here "nginx level" means the local nginx on the registry
[13:10:09] <_joe_> that handles ssl termination and authentication
[13:14:50] what's new about (1)? I think I'm missing some context
[13:15:19] the docker registry which only speaks https is.... an nginx server on some internal host?
[13:15:39] and you want to expose it via cache_misc you mean, or?
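
The expected behavior ema verifies above (anon requests hit, logged-in requests pass, logging out goes back to hitting) comes from the text VCL passing any request that carries a login/session cookie. A minimal sketch of that rule, with a simplified cookie-name match standing in for the real production logic:

    sub vcl_recv {
        # Logged-in traffic is never served from cache: a session or token
        # cookie forces a pass through every layer.  (The cookie-name regex
        # here is illustrative; the production rules are more involved.)
        if (req.http.Cookie ~ "([sS]ession|Token)=") {
            return (pass);
        }
        # Anonymous requests fall through to normal hashing/lookup and can hit.
    }
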
[13:19:00] _joe_: so my best overall guess is: "backend" here means "the docker registry", which already has an nginx on the front of it as part of it? and you want it to expose through cache_misc, and you want it to be https-only both internally (those reaching the backend directly) and externally (cache_misc, which also has to talk to the backend)?
[13:20:16] the GET|HEAD restriction is easy at the varnish level per-service, we just haven't actually bothered to do so in any existing cases (we do it at the whole-cluster level, though)
[13:21:38] the https part is tricky. I don't think there would be a sane way that we could really force the issue internally while still allowing varnish to use regular http.
[13:22:15] we can make it annoying and difficult to use regular http internally, but someone could always look up the magic in puppet and use it (e.g. alternate port and/or requiring some magic header value)
[13:22:51] could have it look up the network source IP and match against a list of cache_misc backend IPs, but that seems annoying and fragile
[13:23:50] <_joe_> bblack: no way to make varnish make an https request, right?
[13:24:36] Traffic, Analytics-Kanban, Operations: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2771447 (elukey) The issue re-appeared again in upload today (early UTC morning time), all concentrated in `ulsfo`. I managed to cap...
[13:25:58] _joe_: that would be one of varnish's primary existential issues :)
[13:26:45] <_joe_> heh, so I remembered correctly
[13:27:10] but you could have the nginx listen e.g. 80, 90, and 443. have 80 unconditionally redirect to https, but 90 allows skipping the redirect if X-Whatever:Magic
[13:27:24] and then internal bypassers at least have to be very explicit that they're doing Something Wrong
[13:28:10] this raises lots of other questions aside from that though, as I don't think we've had an internal https service before
[13:29:05] is this using the puppet cert as a server cert? and how do we configure https internally when it comes to the finer details (e.g. requiring TLSv1.2 + strong ciphers + no DHE, emitting HSTS, etc)
[13:29:57] or are we issuing the cert via some other mechanism? how are the rest of the hosts trusting its path?
[13:30:14] <_joe_> we're using puppet certs with alt names defined if necessary
[13:30:27] <_joe_> but I have to go to lunch, sorry
[13:30:31] I know we talked about that, I just haven't actually seen a working example before
[13:31:00] what's the internal hostname (is it already running w/ https?)
[13:32:13] <_joe_> atm it's darmstadtium.eqiad.wmnet
[13:32:18] <_joe_> will add a CN later
[13:32:21] <_joe_> sorry, bbl
[13:34:43] Server public key is 4096 bit
[13:34:57] ewww, we still need to fix that sometime, esp if we're going to start using those for internal TLS
[13:35:59] ema: did you start on ulsfo things yet?
[13:36:47] bblack: nope, I wanted to check with you first if https://gerrit.wikimedia.org/r/#/c/319829/ covers what you mentioned yesterday about mangled direct requests
[13:37:18] ema: yeah I think so. I think you wouldn't get the hit on the second request if mangling broke vcl_hash
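
On the GET|HEAD point (13:20:16): a per-service method restriction is only a few lines of VCL keyed on the request host. A sketch, using a hypothetical public hostname for the registry service exposed through cache_misc:

    sub vcl_recv {
        # Per-service read-only restriction for the (hypothetical) public
        # registry endpoint; everything else gets a 405.
        if (req.http.Host == "docker-registry.wikimedia.org" &&
            req.method != "GET" && req.method != "HEAD") {
            return (synth(405, "Method Not Allowed"));
        }
    }
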
[13:38:29] cool
[13:38:54] really, now that all of that is confirmed, it seems odd that they moved Vary processing to where they did
[13:39:18] it may have just been an unintended consequence of how the new fe/be code split was done
[13:40:07] the upgrading to v4 doc doesn't mention any of this though
[13:40:45] well it mentions the fe/be code split :)
[13:42:03] apparently it's intentional:
[13:42:06] https://www.varnish-cache.org/lists/pipermail/varnish-commit/2013-May/009628.html
[13:42:28] - if (http_GetHdr(req->http, VSB_data(sbh), &h)) {
[13:42:28] + if (http_GetHdr(bo->bereq, VSB_data(sbh), &h)) {
[13:43:20] seems like wrong-thinking to me. sure it's the backend appserver that's asking for the Vary, and by implication wants to Vary on the headers being sent to it.
[13:43:53] but there are lots of cases like that where varnish/VCL are inserting themselves into that surrogate role and you can't just leave the interpretation of things that way
[13:44:49] it seems more reasonable to favor the other kind of consistency here: if initial lookup at vcl_recv/vcl_hash time to find an object uses the client req.http.* headers, then storage into those vary slots should use the same for consistency.
[13:45:26] > since that's what the backend saw
[13:45:34] that's the reasoning apparently
[13:46:28] yeah but, VCL is in the business of messing with what the backend saw in general :P
[13:46:38] :)
[13:46:39] Traffic, Analytics-Kanban, Operations: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2771535 (elukey) The following request takes ages to complete on `cp400[67]` but it completes very quickly on `cp1099`: ``` curl "h...
[13:46:47] anyways, I've merged the commit routing ulsfo->codfw
[13:47:35] will puppet-salt that to ulsfo hosts and then start upgrading soon if that's OK
[13:51:01] (if you guys are not super busy, not urgent) - https://phabricator.wikimedia.org/T148412#2771535 - Has anything similar happened in the past? (or maybe something trivial that I am missing)
[13:51:18] I tried also in esams and it works fine, seems weird only in ulsfo
[13:51:39] elukey: I'm looking at it
[13:51:57] thanks!
[13:52:44] elukey: how long is "ages" in ulsfo for it to finish?
[13:53:14] not very precise sorry, several seconds
[13:53:22] something like 30/40
[13:53:52] mine's been hung a lot longer than that now, I think
[13:57:38] tried again and I have been waiting several minutes
[13:58:05] (just ctrl+C, time stated 4 minutes)
[13:58:35] ok
[14:09:43] bblack: can I proceed with ulsfo?
[14:11:03] ema: yes
[14:11:49] elukey: the repro went away, before I could find out anything real
[14:12:20] I did manage to eliminate the expiry mailbox queue backlog as a cause
[14:12:32] the double-requrl is almost certainly due to a 503 retry, too
[14:13:10] we're probably seeing an issue where the frontend is willing to retry a 503 twice, and the 503 itself takes very long to return at all, perhaps waiting through various maximum request timeouts through the stack.
[14:13:59] our ttfb is fairly short, but bbt is long I think
[14:14:04] (timeouts)
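
A sketch of the two mechanisms described just above: the frontend retrying a 503 from the layer below, and the backend-fetch timeouts where time-to-first-byte is kept short but the between-bytes timeout is much longer. The backend name and values are illustrative, not the production settings:

    backend cache_local {
        .host = "127.0.0.1";
        .port = "3128";
        .connect_timeout = 3s;
        .first_byte_timeout = 65s;      # ttfb: fairly short
        .between_bytes_timeout = 180s;  # bbt: long, so a stalled response drags on
    }

    sub vcl_backend_response {
        # Retry a 503 once before surfacing it; if the 503 itself only shows up
        # after a timeout expires, the total wait roughly doubles.
        if (beresp.status == 503 && bereq.retries == 0) {
            return (retry);
        }
    }
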
[14:15:05] bblack: I am a bit ignorant about where the expiry mailbox queue backlog can be
[14:16:00] I imagine that it was on the ulsfo varnish frontends? (Maybe only some of them, that explains why I saw tons of failures only for few cp ulsfo hosts)
[14:16:04] elukey: as you should be, it's a deep detail of varnish that nobody should ever have to know about except varnish developers. But we've had issues with it under v4 for cache_upload, and we have mitigations in place too, but I'm just confirming it's not that.
[14:16:28] ah okok
[14:16:44] sorry I misread the sentence about it
[14:16:46] :)
[14:17:21] I feel a philosophical rant about bad software coming on
[14:17:33] but I'm trying to avoid it until it collects itself together a little more succinctly :P
[14:18:17] ok now I re-read the whole thing and got your theory about 503s
[14:19:43] yeah I'm assuming the double-requrl in the log is what we'd normally see when the frontend sees a 503 from the (varnish) backend
[14:19:55] it's configured to retry those once and se if it 503s again, before giving a 503 to the user
[14:19:59] s/se/see/
[14:20:20] but that still usually happens fast
[14:20:55] also, this particular request is uncacheable (the backend returns Cache-Control:no-cache), so every request walks the whole varnish stack and talks to swift
[14:21:34] I think for , it was taking a very long time for those 503s to bubble up the stack, hitting maximum timeouts all along the way.
[14:22:05] cp4007 seems showing the same issue
[14:22:08] if you want to check
[14:22:10] still?
[14:22:27] I had assumed when it poofed on 4006 it was gone everywhere. that's telling heh
[14:22:40] I just killed my curl after 5 mins
[14:22:49] not sure if it is gone now
[14:24:36] ok I'm not going to control-C this one
[14:24:50] and I have a varnishlog on the frontend side at least, that will see it
[14:29:15] ok 2m28s and got cp4007 frontend instance's client-side and backend-side logs
[14:31:57] so on the client side:
[14:31:58] - ReqHeader X-CDIS: miss
[14:31:58] - VCL_return fetch
[14:31:58] - Link bereq 161161619 fetch
[14:31:58] - Timestamp Fetch: 1478269701.861253 148.827785 0.759859
[14:32:13] 148.8 being seconds it took to fetch
[14:32:31] - RespHeader Backend-Timing: D=30574 t=1478269701765589
[14:32:59] ^ that header comes straight from the mediawiki renderer, and claims it only spent 30ms creating the 404 response
[14:33:43] - RespHeader X-Cache-Int: cp1099 miss, cp2017 miss, cp4007 miss
[14:33:50] it went through a seemingly-normal chain of backends
[14:34:34] everything else looks fairly normal
[14:35:08] it just took 148 seconds for the next varnish (4007-backend) to give it a response which was a full-stack miss, which the true origin apache claims was generated in 30ms
[14:35:58] so the Timestamp:Fetch in this case is related to frontend 4007 -> backend 4007
[14:36:35] yes (and implicitly any other hang beneath that)
[14:36:45] going back over the client-side timestamp markers in a bit more detail:
[14:36:53] - Begin req 159933463 rxreq
[14:36:53] - Timestamp Start: 1478269553.033468 0.000000 0.000000
[14:36:53] - Timestamp Req: 1478269553.033468 0.000000 0.000000
[14:37:12] - VCL_return lookup
[14:37:13] - Timestamp Waitinglist: 1478269701.101394 148.067926 148.067926
[14:37:15] - VCL_call MISS
[14:37:22] - Link bereq 161161619 fetch
[14:37:22] - Timestamp Fetch: 1478269701.861253 148.827785 0.759859
[14:37:22] - RespProtocol HTTP/1.1
[14:37:29] there we go
[14:37:32] it was not the fetch
[14:37:43] it spent 148 seconds in the "waitinglist"
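
For readers decoding the Timestamp lines above: each one is "absolute time, seconds since the start of the request, seconds since the previous timestamp", which makes the split obvious:

    Start          1478269553.033468
    Waitinglist    1478269701.101394  -> 148.07 s after Start, all spent queued behind another fetch
    Fetch          1478269701.861253  -> 148.83 s after Start, only 0.76 s after leaving the waitinglist
    Backend-Timing D=30574 microseconds -> ~0.03 s actually spent by the renderer on the 404

i.e. essentially all of the latency is the coalescing wait, not the fetch itself.
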
[14:37:43] so varnish<->varnish?
[14:37:47] no
[14:37:56] oh waitinglist
[14:37:56] inside the frontend
[14:37:59] (just tried the request from cp2017, no delay)
[14:38:08] I don't even remember what that is, but I've seen references to it before
[14:39:00] me too, while skimming through the code trying to find hints about $bug
[14:39:09] ahhahah
[14:39:16] oh waitinglist is coalesce
[14:39:27] what hfp is supposed to solve
[14:39:35] ha
[14:39:57] so in other earlier logging, I was also noting there's a crapload of requests to this specific URL we're debugging coming from facebook IPs with facebook-external-hit UA
[14:40:14] that's likely why we're seeing this specifically in ulsfo: tons and tons of requests for this, which can cause waitinglist pileup
[14:40:34] and facebook hits ulsfo, because silly valley and all
[14:40:59] there's still the deeper question of why these requests coalesce at all and how we should avoid having waitinglist problems with them
[14:41:47] but if a given frontend is getting tons of requests for the same no-cache URL and we're not finding a way to translate those into hit-for-pass behavior, they will waitlist as they all try to coalesce
[14:44:29] on the backend side:
[14:44:31] - TTL RFC 604800 10 -1 1478269702 1478269702 1478269701 0 0
[14:44:34] - VCL_call BACKEND_RESPONSE
[14:44:37] - TTL VCL 86400 10 0 1478269702
[14:44:39] - TTL VCL 0 10 0 1478269702
[14:44:42] - TTL VCL 0 3600 0 1478269702
[14:44:44] - VCL_return deliver
[14:44:47] - Storage malloc s0
[14:45:06] so it's getting set to 0 TTL, which I guess is good considering it said Cache-Control: no-cache
[14:45:16] but why isn't it getting set as hit-for-pass?
[14:47:53] ok, it's a combination of ugly behaviors here
[14:47:57] I think I have a grasp on it now
[14:48:17] so the two notable things about this 404 response from the renderer->swift->varnish:
[14:48:32] 1) It has "Cache-Control: no-cache" and no other expiry/caching related headers
[14:50:52] 2) It has (again, when emitted by Swift) Content-Length: 465 and Content-Type: text/html; charset=utf-8
[14:51:51] however, by the time it reaches this frontend where I have the log, the response from the varnish backend has:
[14:52:03] - BerespHeader Content-Encoding: gzip
[14:52:16] because varnish is gzipping for us, because do_gzip
[14:52:24] - BerespHeader Content-Length: 282
[14:52:55] hmmm nevermind that last part. I thought at some stage content-length got unset on a response
[14:52:58] but it doesn't
[14:53:05] that's another bug :)
[14:53:38] ok so that eliminates some possible issues
[14:55:03] so there's two things I'm looking at now:
[14:55:33] 1) The 4-hit-wonder code. It creates 0-second uncacheables, until the backend hits more, which it never will in this case.
[14:56:18] but that can't be it, because it requires that the local disposition is "miss" rather than "pass" (good in general), and only operates on response status == 200
[14:56:45] 2) The fact that upload seems to lack the code that text has for creating hfps from <= 0s TTLs in general
[14:57:27] 2) seems particularly interesting
[14:58:04] meanwhile, 2 text ulsfo nodes upgraded so far, all bueno
[14:58:15] codfw hitrates have also improved
[14:58:41] yeah the extra traffic from ulsfo will make the codfw backends fill in their dataset quicker
[14:58:51] right
[14:59:03] I was just double-checking, and on text the ordering/interaction between 4-hit-wonder and general hit-for-pass is correct
[14:59:32] what we're missing on upload is the last bit of text's text_common_backend_response
[14:59:53] but we only want the first sub-case, not the second that deals with cookies
[15:00:18] (and as with text, it needs to happen ahead of 4-hit-wonder, and happen at all layers unlike 4-hit-wonder)
[15:01:39] rewinding to why we've never had this general hfp support on upload:
[15:01:47] in general, we don't get uncacheable responses there, at least not often
[15:01:53] apparently 404s are an exception, though :P
[15:02:06] and very-popular 404s with facebook spamming them probably even rarer
[15:02:22] I think swift should probably stop doing cache-control: no-cache on the 404s, but that's neither here nor there
[15:02:33] some variation of the general hfp code should probably get promoted to common VCL
[15:05:48] lol:
[15:05:50] sub wm_common_backend_response {
[15:05:52] ...
[15:06:01] if (beresp.http.Cache-Control ~ "(private|no-cache|no-store)") {
[15:06:04] set beresp.ttl = 0s;
[15:06:07] /* This should be translated into hit_for_pass later */
[15:06:14] ^ that translation only happens on cache_text :P
[15:06:27] nice
[15:12:12] ema: https://gerrit.wikimedia.org/r/#/c/319869/
[15:12:34] the tricky thing I think I verified there is that, when everything's compiled together for a specific cluster (e.g. text and upload)...
[15:13:05] that the 4-hit-wonder code executes after the general hfp code (otherwise 4-hit-wonder's intentional 0s uncacheables would turn into 10 minute hfps)
[15:13:38] it does make me wonder about 4-hit-wonder having some edge-cases, though (separately from all of this)
[15:14:07] the 4-hit-wonder-matching URLs are going to suffer from coalesce->waitinglist
[15:14:14] which is fine in the "normal" case where they're actually-rare
[15:14:42] it's probably awful if a URL suddenly becomes "hundreds of reqs/sec" popular out of nowhere, at least for the first few requests...
[15:14:59] (or on cold cache)
[15:15:28] although you'd think it would fix itself quickly and worst-case our response time is multiplied by 4-5 for that first spam of requests that happens during that window
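
The missing piece being discussed for cache_upload is the generic conversion of uncacheable responses into hit-for-pass objects, so that hot no-cache URLs pass instead of piling up on the waitinglist. A condensed sketch of the idea, not the literal patch in the Gerrit change linked above:

    sub vcl_backend_response {
        # Generic hit-for-pass: if the response ended up with no usable TTL
        # (e.g. after Cache-Control: private/no-cache/no-store got its TTL
        # zeroed by the common code quoted above), create a hit-for-pass object
        # so later requests for the same object pass instead of coalescing.
        # Unlike 4-hit-wonder this runs at every cache layer, and it must run
        # *before* the 4-hit-wonder check, whose intentional 0-second
        # uncacheables would otherwise become long-lived hfps.
        if (beresp.ttl <= 0s && !beresp.uncacheable) {
            set beresp.ttl = 607s;   # odd value on purpose, to make this hfp's origin easy to spot
            set beresp.uncacheable = true;
            return (deliver);
        }
    }
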
[15:16:37] I should probably pick different values for these two 601s cases, too (the whole reason I picked 601 instead of 600 was so it's easy to debug the origin of these HFPs in the VCL)
[15:17:25] bblack: I'd add a comment to the 4-hit-wonder code mentioning that it should execute after the hfp stuff (for our future selves)
[15:17:36] yeah
[15:17:59] we'll probably end up promoting that with a parameter to common code eventually anyways
[15:18:09] (and then screw it up when we move it, without a comment)
[15:18:20] :)
[15:20:40] amended
[15:22:03] looking
[15:22:43] I'll make a ticket about the swift 404->no-cache issue
[15:22:52] there's probably some reason for it we'll have to debate
[15:25:38] amended for a tiny detail (s/601/607/ in the comment)
[15:26:50] Traffic, Operations, media-storage: Swift should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#2771888 (BBlack)
[15:30:22] nice work, thanks a lot!
[15:32:14] X-Cache: cp1099 pass, cp2005 pass, cp4015 pass, cp4006 pass
[15:32:21] ^ what I get now on that 404, seems right-er :)
[15:32:40] it's still blowing through all the layers, but shouldn't suffer from deranged waitinglist issues
[15:37:34] hopefully this weekend I won't get data alerts, thanks!
[15:38:26] well
[15:38:43] there will probably always be edge cases like these that defy all reason
[15:39:02] there's some deeper issue going on here with how we handle log entries that take a long time to complete
[15:39:47] I don't know if that issue is as simple as setting longer timeouts in vk (I doubt it), or about setting different timeouts in varnish for shmlog-related things, or just a general shmlog design issue.
[15:40:21] the main problem that I saw is that I didn't get any VSL tags carrying "this is a VSL timeout or overflow"
[15:40:35] this one we were looking at today, ultimately the log entry itself was fine. it existed, and it completed in some sense.
[15:40:52] it just took it over 2 minutes to do so. in varnishlog output, nothing appears until that point.
[15:41:12] I imagine for a raw shm client, something appears at the beginning, but the record isn't complete until those 2+ minutes expire
[15:41:12] but there must be something that eventually ends up in the two ReqURLs
[15:41:41] *something == some weird stack timeout or routing scenario
[15:41:44] the 2x ReqURLs isn't always going to be the case. I think that's when we hit a 503 from some timeout happening before the waitinglist was done, in a backend.
[15:42:31] this is probably the explanation, but VSL should contain more data rather than (what seems) a truncated log
[15:42:34] what's the actual problem with these requests from vk's perspective? that a shm transaction starts, but never completes in reasonable time?
[15:43:11] I don't even understand this to that level: is the 2 minutes perceived by vk to happen between records for this log entry? or it all comes at once but just lacks expected tags?
[15:43:13] nono only that I don't get half of the tags for a request, like Timestamp:Resp, content len, etc..
[15:43:19] ok
[15:43:38] so I end up with incomplete logs in hadoop
[15:43:44] (and before in kafka)
[15:44:19] vk is now set to wait 1500 seconds for each incomplete transaction and keep 10000 incomplete records before start flushing them
[15:44:25] in the case I logged, I did get Timestamp:Resp and content length, etc, after 2+ minutes
[15:44:41] I think the cases that tripped vk were a different subset
[15:45:16] hopefully it is a subset of the problem solved today :)
[15:45:17] again, probably the waitinglist hold was hit in the (ulsfo) backend varnish, and waited so long that the fe->be fetch timed out and generated a 503, then retried that again and got it again
[15:45:40] I think what's causing your problem for vk isn't the waitinglist or the timeout
[15:45:54] it's 503s that come from connection errors or connection/request timeouts, which will have incomplete log entries
[15:46:47] theoretically even a 503 backend fetch should end up logging a Timestamp:Resp for the main request that originated it
[15:47:54] netops, Operations, ops-eqiad: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2772002 (faidon)
[15:48:29] you'd think
[15:49:06] I think we're looking specifically at the case of a varnish frontend internally-generating a 503 because of connection timeout/error when talking to the varnish backend
[15:49:17] (as opposed to receiving an actual 503 from it)
[15:49:28] ahhh ok this new for me
[15:49:33] *this is
[15:49:57] maybe there is a way to repro it with VTC
[15:50:14] maybe
[15:50:16] * elukey sees ema running away
[15:50:37] or at least, repro it manually by having a test varnishd set up to talk to a "backend" which is just netcat giving no response, to hit a timeout
[15:51:00] it may be a between-bytes timeout rather than ttfb though, so you might have to generate a partial response and then hang
[15:51:31] (e.g. send basic response and some headers, but never finish the headers. or start the content but don't send all the CL bytes expected)
[15:52:22] * elukey takes notes
[16:08:22] 3/6 nodes upgraded so far, we can probably speed up a bit for the next ones
[16:08:46] I meant 4/6 !
[16:19:14] yeah
[16:19:25] I went only a few minutes between on codfw
[16:19:44] maybe 10 for ulsfo?
[16:19:49] (for these last few)
[16:21:27] yeah
[16:27:20] Traffic, Operations, media-storage: Swift should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#2772092 (fgiunchedi) AFAICS it is `thumb_handler.php` from MW generating CC: no-cache on 404s and then proxied back to varnish by swift's `rewrite.py`, e.g. on m...
[16:29:22] the way the frontend hitrate on a single host goes up on a cold cache is really crazy
[16:29:29] from 0% to 60% in just a few seconds
[16:30:06] then getting beyond that takes quite a while
[16:38:11] Traffic, Operations, media-storage: Swift should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#2772108 (BBlack) When I hit a renderer directly, I get: ``` bblack@cp1099:~$ curl "http://rendering.svc.eqiad.wmnet/wikipedia/commons/thumb/6/63/Taissa-Farmiga--...
[16:39:15] godog: well as soon as I said that, I realized that can't be the right URL on the renderer
[16:39:24] but for the record, what *does* swift hit for mw rendering?
[16:42:05] bblack: hehe yeah it is slightly odd for histerical raisins, the url I pasted in phab is what gets hit, http://commons.wikimedia.org/w/thumb_handler.php/<...>
[16:42:47] so rewrite.py needs to rewrite (duh) the host for mw to pick up, plus thumb_handler.php in the url
[16:43:16] but surely not actual commons.wm.o for the TCP connection, right?
[16:43:27] ok
[16:43:29] bblack: cp3041's varnish-be and varnish-be-random are depooled, is that on purpose?
[16:43:42] ema: no, likely another backend-restart fallout
[16:43:48] ok, fixing
[16:44:06] bblack: ah, no that'd be rendering.svc as you figured out
[16:45:11] all text-codfw upgraded to v4 and repooled
[16:45:15] heh
[16:45:17] text-ulsfo
[16:45:20] godog: ok!
[16:47:53] Traffic, Operations, media-storage: Swift should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#2772143 (BBlack) Ignore the above comment, the URL is wrong. What @fgiunchedi pasted is right, you just have to connect to `rendering.svc.eqiad.wmnet` while usi...
[16:49:43] Traffic, ArchCom-RfC, Commons, MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2772146 (GWicke) > In the current examples, I think it's unfortunate that height-constraining isn't considered. Not as a feature that would be available...
[17:02:49] Traffic, ArchCom-RfC, Commons, MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2772169 (GWicke)
[17:09:27] bblack: Hi (I'm back from traveling a lot). A few days ago in T144187 you asked for additional simulations for 100GB and 200GB caches. I just wanted to say, I'm working on it (just takes a while, I have old hardware).
[17:09:27] T144187: Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187
[17:10:41] bblack: also, I've started to open-source the simulator code at https://github.com/dasebe/webcachesim so you can run stuff with your own params etc. (I just started doing this an hour ago, so the repo is lacking a few bits and pieces.)
[17:10:51] Traffic, Operations, media-storage: thumb_handler.php should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#2772190 (BBlack)
[17:11:07] Traffic, Operations, media-storage: thumb_handler.php should not set CC:no-cache on renderer 404 responses? - https://phabricator.wikimedia.org/T150022#2771888 (BBlack) Title/desc fixed up to not implicate Swift :)
[17:13:36] dberger: awesome :)
[17:16:06] bblack: I was also wondering, are you all set with the current 4-hit-wonder code? I'm not sure I understood some of the edge cases mentioned above. On how many servers is this running atm?
[17:23:40] dberger: yeah I think it's running ok for now. It may not be ideal, but it's better-than-nothing. It's running on all of our big caches (text and upload).
[17:24:31] dberger: also, after I was commenting on the exp(-size/c) method and cutting out the corner cases, I realized it would be easier to just do that with numerical rounding of the result :)
[17:25:24] dberger: we can run that function to decide cache-insertion probability, and then round out towards 0 or 1 at a few significant digits (say <.00001 or >.99999 or whatever) and get the same effect
[17:25:27] bblack: do we know how much of a difference it makes for the hit ratio?
[17:26:02] well, realistically every floating point representation eventually has to do the same rounding, we just probably want to round at fewer digits
[17:26:36] so that there's not this lingering 0.00000001% chance we'll hit 6 months from now where it accidentally tries to insert a 4GB object into a small frontend cache and wreak havoc
[17:27:04] bblack: yes, that makes sense
[17:27:39] but if you play with c and the sizes, you can see it's easy to set reasonable probability limits that enforce that
[17:28:14] bblack: would you be up for some kind of a/b testing, where some servers run no admission control (or the old config), some run the 4-hit-code, and some the exp-probabilistic thing?
[17:29:01] dberger: yeah. we'd have to rig up a custom graphing dashboard to see the results, too (right now it doesn't show down to the per-cache node level, only per-site aggregated)
[17:29:16] e(-1024*1024/32768)
[17:29:16] .00000000000001266416
[17:29:22] bblack: right, and things are probably quite different from site to site
[17:29:56] ^ so there if "c" is 32K, a 1MB object still has a tiny probability of insertion. but if we cap things off at "if x < .000001; x = 0" or some such, we can kill those cases.
[17:31:03] dberger: probably the biggest question is the degree to which these relatively-small variations in total frontend cache size affect the optimal parameters
[17:31:47] dberger: given the large dataset, it may make very little difference between 64G and 200G in the tunables. but if there is a significant effect on the optimal values, then we'll need to at least roughly account for it somehow
[17:31:51] bblack: yes, I'm working on that -- my simulations are rather slow, though. I'm using the whole 2 week trace, though, maybe I can speed it up..
[17:32:30] bblack: ok, I'll try to answer that in a few days. From what I've seen so far, the optimal parameters vary very little between 64GB and 200GB caches
[17:32:33] I figure with 3 data points at 64, 100, 200 for the optimal values, we can at least roughly do it in our VCL templating based on machine properties
[17:32:59] (or decide that ~100G values are "close enough" for all cases and be done with it)
[17:33:26] bblack: alright, I'll simulate those and also put in the rounding thing.
[17:34:09] https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/templates/text-frontend.inc.vcl.erb#L222
[17:34:21] ^ FWIW, that's our VCL for the 4-hit-wonder (identical in the upload cache case, too)
[17:35:21] X-CDIS is telling us the local disposition of the request (it was a miss in this frontend), and X-Cache-Int is telling us whether the backends (with large file storage) hit on it and how many times
[17:37:27] bblack: thanks! When we do the probabilistic policy, we probably need some c-code. Would you put that in a little vmod or put it directly in vcl? (I have used a varnish 4.2 vmod in my own experiments, happy to share it.)
[17:38:18] bblack: the c-code would be simple: parse the size into integer & draw random number based on it
[17:39:27] Traffic, Operations, Patch-For-Review: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2772387 (BBlack) The AES flip doesn't seem to have had any notable influence on performance metrics, either. Now after all the above merges, our server-side ciph...
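
A rough paraphrase of the 4-hit-wonder code behind the link at 17:34 (the .erb template is authoritative; the X-Cache-Int match below in particular is an approximation of the real condition): the frontend only admits an object into its small memory cache once the backend layers report having hit it a few times, and until then it stays a 0-second uncacheable rather than a hit-for-pass, so the decision is re-evaluated on the next miss:

    sub vcl_backend_response {
        # "4-hit-wonder" admission filter, frontend layer only, and only after
        # the generic hit-for-pass conversion has already had its chance.
        if (beresp.status == 200
            && bereq.http.X-CDIS == "miss"
            && beresp.http.X-Cache-Int !~ " hit/([4-9]|[0-9]{2,})$") {
            # Backend layers haven't hit this object ~4+ times yet: don't admit
            # it here, but don't create a hit-for-pass either.
            set beresp.ttl = 0s;
            set beresp.uncacheable = true;
        }
    }
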
[17:41:36] Traffic, Operations, Patch-For-Review: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2772402 (BBlack) Another notable effect was that there's apparently a small subset of clients out there in the world (somewhere in the 0.1% -> 1% ballpark) who im...
[17:43:41] dberger: probably inline C, as varnish has trouble loading new versions of vmod code without restarting the whole server process
[17:45:51] (so it wreaks havoc on our overall management of instances if we keep deploying new independent vmod packages while VCL reloads are ongoing)
[17:47:30] we'd need inline-C for the exp(-size/c) too, to take care of the floating math and then making the probability decision accurately
[17:48:05] (have it return 0 or 1 to use in a simple VCL conditional for insertion vs ttl=0+uncacheable)
[17:49:34] bblack: yes, I was thinking along the same line.
[17:59:18] ema: cp1066 now, checking it
[18:00:19] I guess even 5s isn't enough. maybe sync + 10s will get us through until v4-complete?
[18:03:01] (trying that)
[18:06:05] netops, Operations, ops-eqiad: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2772466 (faidon)
[18:07:43] netops, Operations, ops-eqiad: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2724967 (faidon) asw<->asw2 links are done, 4x10G on racks D 2 and D 7 (2 each). The 4x10G links from cr1-eqiad:ae4 to asw-d-eqiad:ae1 have been moved over to asw2-...
[19:39:23] would something like https://gerrit.wikimedia.org/r/#/c/269086/ still be valuable?
[19:39:33] I haven't kept up with the state of the art in varnish instrumentation
[21:07:26] Traffic, ArchCom-RfC, Commons, MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2772934 (Tgr) >>! In T66214#2767815, @Gilles wrote: > These issues have been decoupled. Thumbor is currently set to be a drop-in replacement for image s...
[21:18:39] Traffic, ArchCom-RfC, Commons, MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2772979 (GWicke)
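
Closing the loop on the exp(-size/c) thread (17:43-17:48): the plan there is an inline-C or vmod helper that computes exp(-size/c), rounds anything below ~0.00001 down to 0 and above ~0.99999 up to 1, and hands VCL a simple yes/no. As a pure-VCL stand-in using only vmod_std, a coarse step-function approximation of that policy might look like the following (the sizes, the choice of c and the probabilities are illustrative, not tuned values):

    import std;

    sub vcl_backend_response {
        # Probabilistic admission approximating exp(-size/c) with c ~= 256 KB.
        if (std.integer(beresp.http.Content-Length, 0) > 4194304) {
            # >4 MB: admission probability rounds to 0, never admit to this frontend
            set beresp.ttl = 0s;
            set beresp.uncacheable = true;
        } elsif (std.integer(beresp.http.Content-Length, 0) > 262144
                 && std.random(0.0, 1.0) > 0.37) {
            # around size == c, exp(-1) ~= 0.37: admit roughly a third of the time
            set beresp.ttl = 0s;
            set beresp.uncacheable = true;
        }
    }
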