[11:09:26] ema: you there?
[11:09:32] (buongiorno)
[11:10:30] the analytics data consistency alerts are firing for text, and with varnishlog I can see the same mailbox issue that you guys fixed yesterday for upload
[11:14:51] curl "http://localhost/w/index.php?title=Template%3AAdvancedSiteNotices%2Fajax&action=render" --header "Host: id.wikipedia.org" --header 'X-Forwarded-Proto: https' -i for example
[11:15:34] hangs for me on cp4018.ulsfo.wmnet but not on cp2013
[11:21:47] I am using elukey@cp4018:~$ sudo varnishlog -n frontend -T 1500 -L 10000 -g request -q 'not Timestamp:Resp and not HttpGarbage' to spot reqs logged without Timestamp:Resp
[11:24:49] the interesting thing on cp2013 for the above req is "Expires: Thu, 01 Jan 1970 00:00:00 GMT" in the response headers
[12:00:12] I tried to check on hadoop all the faulty requests logged from 3 to 7 AM UTC; I put the results on stat1004:/home/elukey/nov_05_text_faulty_reqs.txt
[12:00:27] very few reqs compared to the overall volume
[12:00:43] probably our alarms are firing because they are too sensible
[12:05:14] *sensitive
[12:06:08] elukey: hey
[12:07:20] elukey: which alerts are firing for text?
[12:08:24] o/
[12:08:45] data consistency checks, the usual requests logged with dt:-
[12:08:52] but they are very few
[12:08:59] probably some weird corner case
[12:09:05] ok
[12:09:20] did you mean s/mailbox/waitinglist/ above?
[12:09:30] yes sorry :)
[12:11:20] alright!
[12:11:34] * ema prepares some lunch
[12:11:38] see you in a bit :)
[12:19:32] o/
[12:29:31] double-checked the numbers: /w/index.php?title=Template%3AAdvancedSiteNotices%2Fajax&action=render for id.w.o seems the most problematic one, affecting at most ~60k requests per hour per text ulsfo host
[12:30:01] will resume the investigation on Monday :)
[12:30:03] thanks!
[13:29:46] seems likely to be waitinglist-related, yes
[13:29:59] fetching that directly from appservers, it's notable that it has:
[13:30:00] < Cache-control: private, must-revalidate, max-age=0
[13:30:55] I wonder if some other aspect of the req or resp is causing it to fail to generate a proper (minutes-long) hit-for-pass/uncacheable
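(For context: the production VCL itself isn't quoted in this log. Below is a minimal sketch of the usual Varnish 4 idiom for turning an uncacheable response like the Cache-control header above into a short-lived hit-for-pass marker. The backend name, port, and exact condition are illustrative assumptions; only the 601s value corresponds to the "TTL VCL 601" log lines pasted later in the discussion.)

    vcl 4.0;

    backend applayer {
        .host = "127.0.0.1";   # illustrative placeholder, not the real backend
        .port = "3128";
    }

    sub vcl_backend_response {
        # When the applayer marks the response uncacheable (e.g.
        # "Cache-control: private, must-revalidate, max-age=0"), store a
        # hit-for-pass marker instead of nothing at all, so that subsequent
        # requests for the same hash skip the waitinglist and go straight
        # to pass while the marker is alive.
        if (beresp.ttl <= 0s || beresp.http.Cache-Control ~ "private") {
            set beresp.ttl = 601s;           # matches the "TTL VCL 601 ..." lines below
            set beresp.uncacheable = true;   # in Varnish 4 this is what creates hit-for-pass
            return (deliver);
        }
    }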
[13:52:59] this one has both CC and Expires; yesterday's had no Expires header IIRC?
[13:53:11] Expires: Thu, 01 Jan 1970 00:00:00 GMT
[14:02:49] cp4018.ulsfo.wmnet:~ema/v.log
[14:03:05] out of 14 requests, 14 seem to be waitinglist-related
[14:03:32] (filtering for ReqURL ~ "title=Template%3AAdvancedSiteNotices%2Fajax&action=render" and RespStatus eq 200)
[14:54:28] bblack: cp1068's backend failed to restart today at 10:44:46 :(
[14:54:41] looks like sync and sleep 10 are not enough
[14:54:57] repooling, varnish.service is running fine now
[15:14:38] yeah
[15:14:56] I did some debugging on related things a long time ago, but I don't remember much of it
[15:15:15] re: fallocate and filesystems, basically the FS layer is "lazy" about accounting for free space.
[15:15:38] I don't think there's a succinct way around it, either, except "wait long enough"
[15:16:00] we could also just make the cron conditional on varnish_version4, since this only seems to affect the persistent ones
[15:16:40] (although, I don't know *why* it only affects persistent. we're wiping and re-allocating in both cases. I hope fallocate is actually working in v4 after our switch from our patches to upstream fallocate support)
[15:18:22] one way to be sure, if this problem weren't unimportant, would be to replace the rm -f /srv/sd*/varnish* with unmount -> mkfs -> mount :)
[15:19:06] (it's really not an awful idea for "file" either, to clear out any fs-level metadata cruft accumulation over time. but probably not worth the hassle)
[15:26:47] ema: re CC/Expires, they'll both end up meaning the same thing, and I think CC takes precedence
[15:27:00] (I think!)
[15:29:18] ema: I made the restart cron conditional on varnish_version4, and salt-wiped the existing cron files on esams+eqiad cache_text
[15:47:27] yeah
[15:51:11] it is a waitinglist problem, the AdvancedSiteNotice thing
[15:51:12] - Timestamp Req: 1478360966.267407 0.000000 0.000000
[15:51:21] - Timestamp Waitinglist: 1478360983.854913 17.587506 17.587506
[15:51:37] ^ that one happened to complete in ~17s; it's the one I got the log on. others went over 2 minutes as before
[15:52:29] - TTL RFC 0 10 -1 1478360984 1478360984 1478360984 0 0
[15:52:29] - VCL_call BACKEND_RESPONSE
[15:52:30] - TTL VCL 0 10 0 1478360984
[15:52:30] - TTL VCL 0 3600 0 1478360984
[15:52:30] - TTL VCL 601 3600 0 1478360984
[15:52:43] and it is hitting our hit-for-pass 601s code, and does come back with x-cache as passpasspasspass
[15:53:14] so it's a little bit of a mystery exactly why it works this way
[15:53:28] some dangerous interaction with 4-hit-wonder, I presume
[15:53:43] but even then, you'd think that would only make it 4x worse than normal hfp'd objects
[15:53:54] (on the rare occasions)
[15:55:17] - VCL_call HASH
[15:55:18] - VCL_return lookup
[15:55:18] - Timestamp Waitinglist: 1478361271.842062 21.061273 21.061273
[15:55:18] - VCL_call MISS
[15:55:18] - ReqHeader X-CDIS: miss
[15:55:28] ^ that's right afterwards, well within the 601s for the hit-for-pass object
[15:55:37] yet still on initial reception it's a lookup->waitinglist->miss
[15:55:47] (which then gets turned into hit-for-pass once we parse the beresp)
[16:00:08] if it makes it to the "TTL VCL 601" line, we know it made it to the hit-for-pass creation
[16:00:21] the question is why the hit-for-pass then doesn't avoid the waitinglist for the next query
[16:03:57] notably, I don't have this problem on backends, only frontends...
[16:05:09] oh nevermind, we do see it in backends, just not as pronounced or common
[16:05:23] it's easy to get ~2s response times though, which is way higher than it should be
[16:06:21] but I guess whatever's happening to make the backend slow down on waitinglist gets multiplied with every layer (biggest in frontends), and the 4-hit-wonder exacerbates the effect further
[16:06:36] I think
[16:16:54] netops, Labs, Labs-Infrastructure, Operations, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2773843 (Reedy)
[16:22:12] ok, I've experimentally shown https://gerrit.wikimedia.org/r/#/c/319997/1 (with manual VCL edits and testing) seems to make things better. We'll see how it goes. It can't hurt in any case.
[16:54:28] bleh, reverted
[16:54:49] it did seem to functionally help this one case, but the results still aren't right, and I don't comprehend why, so best left alone for now
[16:55:38] the logs on the request-in-question still show it doing miss->pass every time (as in, repeat this request over and over through the same frontend, and you see miss -> create 601s hit-for-pass over and over. it never "hits" its own hit-for-pass)
[16:55:46] (or passes, as the case may be)
[16:56:35] and in our overall text-cluster stats, adding that new conditional (to only create new hfp on "miss") converted a sizeable portion of "pass" traffic into "miss", something like 5%
[16:56:56] which I can't explain either. I'm not sure how we would arrive there in states other than miss or existing-pass
[16:57:38] it's maybe possible that the X-CDIS-related code needs updates for v4 in general, the way it hooks into vcl_(pass|miss|hit), etc.
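(The actual change in gerrit 319997 isn't pasted in this log; as a rough, hypothetical approximation of the "only create new hfp on miss" conditional described above, the guard presumably looked something like the sketch below. Treat the header check and the surrounding condition as assumptions, not the real patch.)

    vcl 4.0;

    backend applayer {
        .host = "127.0.0.1";   # illustrative placeholder
        .port = "3128";
    }

    sub vcl_backend_response {
        # Hypothetical approximation of "only create a new hit-for-pass on miss":
        # X-CDIS was stamped on the client side (hit/miss/pass), so a fetch that
        # already came through an existing hit-for-pass (X-CDIS == "pass") would
        # no longer re-create the 601s marker object on every single request.
        if (bereq.http.X-CDIS == "miss" &&
            (beresp.ttl <= 0s || beresp.http.Cache-Control ~ "private")) {
            set beresp.ttl = 601s;
            set beresp.uncacheable = true;
            return (deliver);
        }
    }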
[16:59:25] another oddity: the frontend's bereq logs for this show:
[16:59:49] - VCL_call BACKEND_RESPONSE
[16:59:49] - TTL VCL 0 10 0 1478365078
[16:59:49] - TTL VCL 0 3600 0 1478365078
[16:59:49] - TTL VCL 601 3600 0 1478365078
[16:59:49] - BerespHeader X-CDIS: pass
[16:59:52] - VCL_return deliver
[16:59:54] - Storage malloc Transient
[17:00:19] why is it creating the hit-for-pass in transient storage? the default (which we don't change) is documented as a 10s (ttl+grace+keep) cutoff for transient
[17:00:45] this should have been 4201 seconds for that measure, and thus gone into non-transient malloc
[17:01:29] maybe it's making the transient decision too early for our later VCL mods of the ttls in backend_response?
[17:02:26] another possible explanation is that a "pass" hitting an existing hit-for-pass (which is what this request should've been doing anyway) may always use transient storage, since its response is for this request only
[17:02:43] but if so, why do we see:
[17:02:44] - BereqHeader X-CDIS: miss
[17:03:06] and on the client-facing side:
[17:03:07] - VCL_return lookup
[17:03:07] - VCL_call MISS
[17:03:07] - ReqHeader X-CDIS: miss
[17:03:07] - VCL_return fetch
[17:03:42] http://book.varnish-software.com/4.0/chapters/VCL_Basics.html <- flow graphs here
[17:04:14] hit-for-pass results in recv/lookup should call vcl_pass and not vcl_miss, but we're setting X-CDIS=miss in vcl_miss and X-CDIS=pass in vcl_pass
[17:04:33] maybe the docs are out of date and hit-for-pass comes in via vcl_miss instead now?
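(A minimal sketch of the X-CDIS marker pattern being discussed, not the production VCL: each state subroutine stamps the request with its disposition, and vcl_deliver exposes it. If this Varnish 4 build really routes hit-for-pass lookups through vcl_miss rather than vcl_pass, the request gets labelled "miss" exactly as in the logs above, even though the fetch is effectively a pass. Backend details are placeholders.)

    vcl 4.0;

    backend applayer {
        .host = "127.0.0.1";   # illustrative placeholder
        .port = "3128";
    }

    sub vcl_hit {
        set req.http.X-CDIS = "hit";
    }

    sub vcl_miss {
        # a plain cache miss: the object may be fetched and inserted
        set req.http.X-CDIS = "miss";
    }

    sub vcl_pass {
        # explicit passes, and (per the 4.0 flow graphs) lookups that land
        # on a hit-for-pass object, are expected to arrive here
        set req.http.X-CDIS = "pass";
    }

    sub vcl_deliver {
        # expose the disposition for X-Cache/analytics purposes
        set resp.http.X-CDIS = req.http.X-CDIS;
    }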
[17:26:50] netops, Labs, Labs-Infrastructure, Operations, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2773987 (bd808) What about using https://blueprints.launchpad.net/keystone/+spec/delegated-auth-via-oauth ?
[19:42:25] netops, Labs, Labs-Infrastructure, Operations, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2773828 (AlexMonk-WMF) I'm pretty sure the Nova half of #1 is unnecessary, instances can already hit the nova API (it runs on labnet), the...
[19:45:11] netops, Labs, Labs-Infrastructure, Operations, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2774119 (AlexMonk-WMF) Actually, I'll go further: We should delete any user other than a whitelisted account that managed to get a success...
[20:32:42] netops, Labs, Labs-Infrastructure, Operations, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2774143 (Andrew) >>! In T150092#2773987, @bd808 wrote: > What about using https://blueprints.launchpad.net/keystone/+spec/delegated-auth-v...
[20:33:29] netops, Labs, Labs-Infrastructure, Operations, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2774144 (Andrew) >>! In T150092#2774119, @AlexMonk-WMF wrote: >>>! In T150092#2774103, @AlexMonk-WMF wrote: >> We should probably disallow...
[21:21:50] netops, Labs, Labs-Infrastructure, Operations, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2774171 (AlexMonk-WMF) >>! In T150092#2774144, @Andrew wrote: >>>! In T150092#2774119, @AlexMonk-WMF wrote: >>>>! In T150092#2774103, @Ale...