[11:09:26] ema: you there?
[11:09:32] (buongiorno)
[11:10:30] the analytics data consistency alerts are firing for text, and with varnishlog I can see the same mailbox issue that you guys fixed yesterday for upload
[11:14:51] curl "http://localhost/w/index.php?title=Template%3AAdvancedSiteNotices%2Fajax&action=render" --header "Host: id.wikipedia.org" --header 'X-Forwarded-Proto: https' -i for example
[11:15:34] hangs for me on cp4018.ulsfo.wmnet but not on cp2013
[11:21:47] I am using elukey@cp4018:~$ sudo varnishlog -n frontend -T 1500 -L 10000 -g request -q 'not Timestamp:Resp and not HttpGarbage' to spot reqs logged without Timestamp:Resp
[11:24:49] the interesting thing on cp2013 for the above req is "Expires: Thu, 01 Jan 1970 00:00:00 GMT" in the response headers
[12:00:12] I tried to check on hadoop all the faulty requests logged from 3 to 7 AM UTC; I put the results on stat1004:/home/elukey/nov_05_text_faulty_reqs.txt
[12:00:27] very few reqs compared to the overall volume
[12:00:43] probably our alarms are firing because they are too sensible
[12:05:14] *sensitive
[12:06:08] elukey: hey
[12:07:20] elukey: which alerts are firing for text?
[12:08:24] o/
[12:08:45] data consistency checks, the usual requests logged with dt:-
[12:08:52] but they are very few
[12:08:59] probably some weird corner case
[12:09:05] ok
[12:09:20] did you mean s/mailbox/waitinglist/ above?
[12:09:30] yes sorry :)
[12:11:20] alright!
[12:11:34] * ema prepares some lunch
[12:11:38] see you in a bit :)
[12:19:32] o/
[12:29:31] double-checked the numbers: /w/index.php?title=Template%3AAdvancedSiteNotices%2Fajax&action=render for id.w.o seems the most problematic one, affecting at most ~60k requests per hour per text ulsfo host
[12:30:01] will resume the investigation on Monday :)
[12:30:03] thanks!
[13:29:46] seems likely to be waitinglist-related, yes
[13:29:59] fetching that directly from appservers, it's notable that it has:
[13:30:00] < Cache-control: private, must-revalidate, max-age=0
[13:30:55] I wonder if some other aspect of the req or resp is causing it to fail to generate a proper (minutes-long) hit-for-pass/uncacheable
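(For context: the production VCL itself isn't quoted in this log. Below is a minimal sketch of the usual Varnish 4 idiom for turning an uncacheable response like the Cache-control header above into a short-lived hit-for-pass marker. The backend name, port, and exact condition are illustrative assumptions; only the 601s value corresponds to the "TTL VCL 601" log lines pasted later in the discussion.)

    vcl 4.0;

    backend applayer {
        .host = "127.0.0.1";   # illustrative placeholder, not the real backend
        .port = "3128";
    }

    sub vcl_backend_response {
        # When the applayer marks the response uncacheable (e.g.
        # "Cache-control: private, must-revalidate, max-age=0"), store a
        # hit-for-pass marker instead of nothing at all, so that subsequent
        # requests for the same hash skip the waitinglist and go straight
        # to pass while the marker is alive.
        if (beresp.ttl <= 0s || beresp.http.Cache-Control ~ "private") {
            set beresp.ttl = 601s;           # matches the "TTL VCL 601 ..." lines below
            set beresp.uncacheable = true;   # in Varnish 4 this is what creates hit-for-pass
            return (deliver);
        }
    }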
[13:52:59] this one has both CC and Expires; yesterday's had no Expires header IIRC?
[13:53:11] Expires: Thu, 01 Jan 1970 00:00:00 GMT
[14:02:49] cp4018.ulsfo.wmnet:~ema/v.log
[14:03:05] out of 14 requests, 14 seem to be waitinglist-related
[14:03:32] (filtering for ReqURL ~ "title=Template%3AAdvancedSiteNotices%2Fajax&action=render" and RespStatus eq 200)
[14:54:28] bblack: cp1068's backend failed to restart today at 10:44:46 :(
[14:54:41] looks like sync and sleep 10 are not enough
[14:54:57] repooling, varnish.service is running fine now
[15:14:38] yeah
[15:14:56] I did some debugging on related things a long time ago, but I don't remember much of it
[15:15:15] re: fallocate and filesystems, basically the FS layer is "lazy" about accounting for free space.
[15:15:38] I don't think there's a succinct way around it, either, except "wait long enough"
[15:16:00] we could also just make the cron conditional on varnish_version4, since this only seems to affect the persistent ones
[15:16:40] (although, I don't know *why* it only affects persistent. we're wiping and re-allocating in both cases. I hope fallocate is actually working in v4 after our switch from our patches to upstream fallocate support)
[15:18:22] one way to be sure, if this problem weren't unimportant, would be to replace the rm -f /srv/sd*/varnish* with unmount -> mkfs -> mount :)
[15:19:06] (it's really not an awful idea for "file" either, to clear out any fs-level metadata cruft accumulation over time. but probably not worth the hassle)
[15:26:47] ema: re CC/Expires, they'll both end up meaning the same thing, and I think CC takes precedence
[15:27:00] (I think!)
[15:29:18] ema: I made the restart cron conditional on varnish_version4, and salt-wiped the existing cron files on esams+eqiad cache_text
[15:47:27] yeah
[15:51:11] it is a waitinglist problem, the AdvancedSiteNotice thing
[15:51:12] - Timestamp Req: 1478360966.267407 0.000000 0.000000
[15:51:21] - Timestamp Waitinglist: 1478360983.854913 17.587506 17.587506
[15:51:37] ^ that one happened to complete in ~17s; it's the one I got the log on. others went over 2 minutes as before
[15:52:29] - TTL RFC 0 10 -1 1478360984 1478360984 1478360984 0 0
[15:52:29] - VCL_call BACKEND_RESPONSE
[15:52:30] - TTL VCL 0 10 0 1478360984
[15:52:30] - TTL VCL 0 3600 0 1478360984
[15:52:30] - TTL VCL 601 3600 0 1478360984
[15:52:43] and it is hitting our hit-for-pass 601s code, and does come back with x-cache as passpasspasspass
[15:53:14] so it's a little bit of a mystery exactly why it works this way
[15:53:28] some dangerous interaction with 4-hit-wonder, I presume
[15:53:43] but even then, you'd think that would only make it 4x worse than normal hfp'd objects
[15:53:54] (on the rare occasions)
[15:55:17] - VCL_call HASH
[15:55:18] - VCL_return lookup
[15:55:18] - Timestamp Waitinglist: 1478361271.842062 21.061273 21.061273
[15:55:18] - VCL_call MISS
[15:55:18] - ReqHeader X-CDIS: miss
[15:55:28] ^ that's right afterwards, well within the 601s for the hit-for-pass object
[15:55:37] yet still on initial reception it's a lookup->waitinglist->miss
[15:55:47] (which then gets turned into hit-for-pass once we parse the beresp)
[16:00:08] if it makes it to the "TTL VCL 601" line, we know it made it to the hit-for-pass creation
[16:00:21] the question is why the hit-for-pass then doesn't avoid the waitinglist for the next query
[16:03:57] notably, I don't have this problem on backends, only frontends...
[16:05:09] oh nevermind, we do see it in backends, just not as pronounced or common
[16:05:23] it's easy to get ~2s response times though, which is way higher than it should be
[16:06:21] but I guess whatever's happening to make the backend slow down on waitinglist gets multiplied with every layer (biggest in frontends), and the 4-hit-wonder exacerbates the effect further
[16:06:36] I think
[16:16:54] netops, Labs, Labs-Infrastructure, Operations, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2773843 (Reedy)
[16:22:12] ok, I've experimentally shown https://gerrit.wikimedia.org/r/#/c/319997/1 (with manual VCL edits and testing) seems to make things better. We'll see how it goes. It can't hurt in any case.
[16:54:28] bleh, reverted
[16:54:49] it did seem to functionally help this one case, but the results still aren't right, and I don't comprehend why, so best left alone for now
[16:55:38] the logs on the request-in-question still show it doing miss->pass every time (as in, repeat this request over and over through the same frontend, and you see miss -> create 601s hit-for-pass over and over. it never "hits" its own hit-for-pass)
[16:55:46] (or passes, as the case may be)
[16:56:35] and in our overall text-cluster stats, adding that new conditional (to only create new hfp on "miss") converted a sizeable portion of "pass" traffic into "miss", something like 5%
[16:56:56] which I can't explain either. I'm not sure how we would arrive there in states other than miss or existing-pass
[16:57:38] it's maybe possible that the X-CDIS-related code needs updates for v4 in general, the way it hooks into vcl_(pass|miss|hit), etc.
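(The actual change in gerrit 319997 isn't pasted in this log; as a rough, hypothetical approximation of the "only create new hfp on miss" conditional described above, the guard presumably looked something like the sketch below. Treat the header check and the surrounding condition as assumptions, not the real patch.)

    vcl 4.0;

    backend applayer {
        .host = "127.0.0.1";   # illustrative placeholder
        .port = "3128";
    }

    sub vcl_backend_response {
        # Hypothetical approximation of "only create a new hit-for-pass on miss":
        # X-CDIS was stamped on the client side (hit/miss/pass), so a fetch that
        # already came through an existing hit-for-pass (X-CDIS == "pass") would
        # no longer re-create the 601s marker object on every single request.
        if (bereq.http.X-CDIS == "miss" &&
            (beresp.ttl <= 0s || beresp.http.Cache-Control ~ "private")) {
            set beresp.ttl = 601s;
            set beresp.uncacheable = true;
            return (deliver);
        }
    }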
[16:59:25] another oddity: the frontend's bereq logs for this show:
[16:59:49] - VCL_call BACKEND_RESPONSE
[16:59:49] - TTL VCL 0 10 0 1478365078
[16:59:49] - TTL VCL 0 3600 0 1478365078
[16:59:49] - TTL VCL 601 3600 0 1478365078
[16:59:49] - BerespHeader X-CDIS: pass
[16:59:52] - VCL_return deliver
[16:59:54] - Storage malloc Transient
[17:00:19] why is it creating the hit-for-pass in transient storage? the default (which we don't change) is documented as a 10s (ttl+grace+keep) cutoff for transient
[17:00:45] this should have been 4201 seconds for that measure, and thus gone into non-transient malloc
[17:01:29] maybe it's making the transient decision too early for our later VCL mods of the ttls in backend_response?
[17:02:26] another possible explanation is that a "pass" hitting an existing hit-for-pass (which is what this request should've been doing anyway) may always use transient storage, since its response is for this request only
[17:02:43] but if so, why do we see:
[17:02:44] - BereqHeader X-CDIS: miss
[17:03:06] and on the client-facing side:
[17:03:07] - VCL_return lookup
[17:03:07] - VCL_call MISS
[17:03:07] - ReqHeader X-CDIS: miss
[17:03:07] - VCL_return fetch
[17:03:42] http://book.varnish-software.com/4.0/chapters/VCL_Basics.html <- flow graphs here
[17:04:14] hit-for-pass results in recv/lookup should call vcl_pass and not vcl_miss, but we're setting X-CDIS=miss in vcl_miss and X-CDIS=pass in vcl_pass
[17:04:33] maybe the docs are out of date and hit-for-pass comes in via vcl_miss instead now?
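(A minimal sketch of the X-CDIS marker pattern being discussed, not the production VCL: each state subroutine stamps the request with its disposition, and vcl_deliver exposes it. If this Varnish 4 build really routes hit-for-pass lookups through vcl_miss rather than vcl_pass, the request gets labelled "miss" exactly as in the logs above, even though the fetch is effectively a pass. Backend details are placeholders.)

    vcl 4.0;

    backend applayer {
        .host = "127.0.0.1";   # illustrative placeholder
        .port = "3128";
    }

    sub vcl_hit {
        set req.http.X-CDIS = "hit";
    }

    sub vcl_miss {
        # a plain cache miss: the object may be fetched and inserted
        set req.http.X-CDIS = "miss";
    }

    sub vcl_pass {
        # explicit passes, and (per the 4.0 flow graphs) lookups that land
        # on a hit-for-pass object, are expected to arrive here
        set req.http.X-CDIS = "pass";
    }

    sub vcl_deliver {
        # expose the disposition for X-Cache/analytics purposes
        set resp.http.X-CDIS = req.http.X-CDIS;
    }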
[17:26:50] netops, Labs, Labs-Infrastructure, Operations, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2773987 (bd808) What about using https://blueprints.launchpad.net/keystone/+spec/delegated-auth-via-oauth ?
[19:42:25] netops, Labs, Labs-Infrastructure, Operations, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2773828 (AlexMonk-WMF) I'm pretty sure the Nova half of #1 is unnecessary, instances can already hit the nova API (it runs on labnet), the...
[19:45:11] netops, Labs, Labs-Infrastructure, Operations, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2774119 (AlexMonk-WMF) Actually, I'll go further: We should delete any user other than a whitelisted account that managed to get a success...
[20:32:42] netops, Labs, Labs-Infrastructure, Operations, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2774143 (Andrew) >>! In T150092#2773987, @bd808 wrote: > What about using https://blueprints.launchpad.net/keystone/+spec/delegated-auth-v...
[20:33:29] netops, Labs, Labs-Infrastructure, Operations, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2774144 (Andrew) >>! In T150092#2774119, @AlexMonk-WMF wrote: >>>! In T150092#2774103, @AlexMonk-WMF wrote: >> We should probably disallow...
[21:21:50] netops, Labs, Labs-Infrastructure, Operations, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2774171 (AlexMonk-WMF) >>! In T150092#2774144, @Andrew wrote: >>>! In T150092#2774119, @AlexMonk-WMF wrote: >>>>! In T150092#2774103, @Ale...