[09:58:41] 10Traffic, 10Continuous-Integration-Config, 10Operations, 10Tracking: Add CI to all operations/software/varnish/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180329#3754081 (10hashar)
[10:00:47] 10Traffic, 10Continuous-Integration-Config, 10Operations: Add CI to all operations/software/varnish/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180329#3754081 (10hashar)
[12:33:11] 10Traffic, 10Operations, 10PAWS, 10Pywikibot-Commons, and 2 others: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3754469 (10Chicocvenancio) The problem is solved. I eventually moved it to [a tool](https://tools.wmflabs.org/merge2pdf/)...
[12:58:31] 10Traffic, 10netops, 10Cloud-VPS, 10Operations, 10cloud-services-team (Kanban): Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179#3754514 (10ema) p:05Triage>03Normal
[12:59:47] 10Traffic, 10Continuous-Integration-Config, 10Operations: Add CI to all operations/software/varnish/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180329#3754553 (10ema) p:05Triage>03Normal
[13:08:44] 10Traffic, 10Continuous-Integration-Config, 10Operations: Add CI to all operations/software/varnish/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180329#3754581 (10ema)
[13:13:05] 10Traffic, 10Continuous-Integration-Config, 10Operations: Add CI to all operations/software/varnish/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180329#3754602 (10ema) I've updated the task description with comments about all repos. They're all debian packages with the except...
[13:20:01] ema: I guess I can add the debian-glue job and see what happens? :D
[13:20:14] though it would end up compiling the libvmod using whatever is in apt.wm.o :/
[13:23:31] hashar: wouldn't it use the contents of the git repos? What do you mean by 'using whatever is in apt.wm.o'?
[13:24:32] ema: if I take e.g. operations/software/varnish/libvmod-vslp, it most probably depends on Varnish somehow. pbuilder would end up fulfilling the dependency by downloading varnish from apt.wm.o
[13:24:35] though maybe
[13:25:03] we can have a job that compiles varnish from whatever latest is in git and then compiles the libvmod against that
[13:26:42] mmh no, that should be fine. When we build the VMODs on the package builder host, build-dependencies are resolved by installing packages from apt.wm.o
[13:28:01] so libvmod patches would fail until the varnish package is updated on apt.wm.o
[13:28:12] sounds good enough for now. I will add the few jobs
[13:43:24] re: hfp vs hfm, after some more thinking I believe that a valid approach to choosing which one to use would be: if the response can become cacheable in the future (e.g. because it will stop setting cookies) and it is OK to disable conditional requests, then HFM should be used. Otherwise HFP.
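
To make the 13:43 rule concrete: a minimal VCL sketch of the two options, assuming Varnish 5 syntax (where beresp.uncacheable plus a positive TTL creates hit-for-miss, and hit-for-pass needs an explicit return (pass(...))); the trigger conditions and 600s TTLs here are illustrative placeholders, not the production values:

    import std;

    sub vcl_backend_response {
        if (beresp.ttl <= 0s && beresp.http.Set-Cookie) {
            # May become cacheable later (e.g. once it stops setting
            # cookies), and we accept losing conditional requests:
            # hit-for-miss. In Varnish 5, beresp.uncacheable with a
            # positive TTL creates the HFM object.
            set beresp.ttl = 600s;
            set beresp.uncacheable = true;
            return (deliver);
        }
        if (std.integer(beresp.http.Content-Length, 0) > 1073741824) {
            # Will never be cached here (hypothetical 1GB cutoff):
            # hit-for-pass, which keeps conditional requests working
            # but cannot be purged/banned before it expires.
            return (pass(600s));
        }
    }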
[13:44:50] the reason being that HFM has the advantage of allowing cacheable responses to be cached within the hfm object's TTL, with the drawback of turning all conditional requests into non-conditional ones
[13:49:03] the main rationale for hfm is preventing people from shooting themselves in the foot by creating a hfp object with a very long ttl for particularly popular requests
[13:51:06] (hfp objects can't be purged/banned, so they just stick around for their lifetime no matter what)
[13:53:20] but, a few observations:
[13:53:54] 1) in our scenario I don't think it's very likely that we'd start creating very long-lived hfp objects, unless we screw up a vcl change?
[13:54:23] 2) disabling conditional requests is a pretty hardcore thing to do, especially on cache_upload
[13:55:45] 3) we do make relatively heavy use of conditional requests on text too, given our architecture of layered varnishes and grace/keep
[14:01:49] the disabling of conditionals on a miss is probably general-case desirable, though
[14:02:18] that's what happens on all normal misses, too
[14:02:28] (because otherwise, varnish doesn't get content to cache for future requests)
[14:04:07] 10Traffic, 10Continuous-Integration-Config, 10Operations: Add CI to all operations/software/varnish/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180329#3754739 (10hashar) If `operations/software/varnish/libvmod-header` is obsolete and you are never going to change it later:...
[14:05:57] well for normal misses it is indeed desirable; I'm not sure about hfp/m though, in the sense that there can be popular hfp/m objects for responses that keep on not being cacheable
[14:10:03] right
[14:11:26] the only case I can think of where something is potentially-cacheable and potentially-popular but we'd choose not to cache it would be the frontend cache-hitrate filters, though.
[14:11:51] e.g. the current absolute size cutoff, or the exp() method
[14:12:13] for the abs size cutoff hfp is clearly the answer
[14:13:06] for exp() it's trickier. in theory if it's popular enough it will become cached eventually. but for sufficiently-large objects, "eventually" could be a rather large number of requests.
[14:14:53] and for those cases, the backend storage can/will cache
[14:15:26] but... yeah it would be annoying if the FE was converting a potential client-conditional -> backend-instance-304 into a 200 and full response back out to the user
[14:19:39] https://github.com/varnishcache/varnish-cache/pull/2135#issuecomment-343921170
[14:27:31] just under a year! :)
[14:28:11] anyways, back on the hfm issues...
[14:29:26] the basic reason to use a non-zero TTL HFM (which wasn't possible before HFM existed) is that it gives you hash_ignore_busy-like behavior vs using 0-second uncacheables.
[14:30:17] but otherwise... all the stuff about conditional requests is the same with our current ttl=0+uncacheable "single miss" situations
[14:30:27] (they just have the additional caveat that they stall parallel clients, too)
[14:32:06] so, the safe thing to do (maybe not the most-optimal, but at least the most-predictably-unbreaky) would be to leave all HFP as they are, and only look at our uncacheable+ttl=0 cases (miss-once) and decide whether they should get TTLs (HFM for N seconds to avoid stalling), which is probably almost always desirable in those cases for some reasonable TTL like 10-15 mins.
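
A sketch of the specific change proposed at 14:32, assuming a hypothetical miss-once condition in vcl_backend_response; the 900s TTL is just one example of the "10-15 mins" range mentioned:

    sub vcl_backend_response {
        if (bereq.http.X-Some-Uncacheable-Case) {  # hypothetical condition
            # Before: miss-once, which stalls parallel clients on the
            # waitinglist while each fetch is in flight:
            #   set beresp.ttl = 0s;
            #   set beresp.uncacheable = true;
            # After: same uncacheable marker but a real TTL, so under
            # Varnish 5 this becomes a hit-for-miss object and parallel
            # clients are not serialized for the next 15 minutes:
            set beresp.ttl = 900s;
            set beresp.uncacheable = true;
        }
    }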
[14:32:33] ["leave all HFP as they are" meaning leave them as HFPs, which means changing their syntax under v5]
[14:34:44] right :)
[14:35:10] yes, agreed
[14:37:59] re: actual HFM TTL values: an infinite TTL maximizes the benefits in terms of stall-avoidance. And for objects that aren't low-hit-wonders or way-too-big (etc), a long-TTL HFM would eventually convert to a cache object.
[14:38:49] the constraint in the other direction is that over-long TTLs on HFM objects which are rare (e.g. low-hit-wonders) would accumulate lots of pointless HFM objects in cache storage and data structures, etc.
[14:39:12] I figure the desirable middle-ground here is "long enough to avoid most stalling complications for most clients"
[14:40:28] ... which probably means "considerably higher than the usual transfer time of objects to clients"
[14:41:56] hmmm no, transfer time to the client mattered in v3
[14:42:26] in v4, when N clients are stalled on the same missed object, the transfer from the backend->storage goes full-speed, even if the first client was a slow one.
[14:42:44] (so the second client can still complete the request before the first one)
[14:43:20] so the desirable middle-ground here is "considerably higher than the usual transfer time of the object from the backend to varnishd"
[14:44:05] and lower than "some value which would cause way too many HFMs to accumulate in storage from cold requests that don't really matter"
[14:46:50] even if an individual fetch from ms-fe ran at ~100Mbps (not sure what our parallelism is here vs iface bw), a 10GB file would transfer in 819s if I did my math right.
[14:54:40] we should really try to get some histogram / pNN type of data on response transfer times for cache_upload. or maybe just inferred transfer-rate.
[14:54:51] most are fast, but in looking manually just now there are definitely outliers...
[14:55:02] gotta go out for a bit, bbl
[14:55:21] - BerespHeader Content-Length: 32
[14:55:22] - Timestamp BerespBody: 1510584762.615053 60.466561 0.000130
[14:55:40] ^ I assume those super-short CLs are slow 404s and/or just plain timeouts
[14:55:47] * << BeReq >> 415863569
[14:55:47] - BerespHeader Content-Length: 27599
[14:55:47] - Timestamp BerespBody: 1510584758.507788 1.377139 0.000709
[14:56:04] ^ but that looks different. It looks like 1377ms to transfer 27KB :P
[14:58:05] anyways, I took a few samples of reasonable-looking transfers of larger-ish objects from ms-fe->varnish in eqiad and the avg speed was closer to 25Mbps
[14:58:56] on the other other hand, 10GB+ files are probably rare-to-nonexistent, and we could/should probably put a hard cap on the exp() behavior anyways just to hfp the edge cases away.
[15:00:04] at 25Mbps, 1GB is ~327s.
[15:01:22] ema: so maybe even in the exp case, we put an HFP size filter in front of it to get rid of crazy cases whose admission chances will always have super-huge numbers of leading zeros on the frontend, and use HFM times around 600s?
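
Both transfer-time figures above work out exactly if GB and Mbps are read as binary units (2^30 bytes and 2^20 bits/s), i.e. transfer time t = 8S/R for size S and rate R:

    t = \frac{8S}{R}, \qquad
    t_{10\,\mathrm{GB},\,100\,\mathrm{Mbps}} = \frac{8 \cdot 10 \cdot 2^{30}}{100 \cdot 2^{20}}\,\mathrm{s} = 819.2\,\mathrm{s}, \qquad
    t_{1\,\mathrm{GB},\,25\,\mathrm{Mbps}} = \frac{8 \cdot 2^{30}}{25 \cdot 2^{20}}\,\mathrm{s} \approx 327.7\,\mathrm{s}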
[15:02:19] if(size>1GB) { hfp } else { exp_with_hfm_ttl=600s }
[15:14:28] running some numbers on the exp() calculations: if we had a hypothetical future frontend node with 1024GB of malloc storage, and our current cache_upload statistics feeding the parameters
[15:14:49] a 6MB file is already at an admission probability of ~= 4.784302618e-10
[15:20:13] running with that kind of calculation a little further:
[15:20:55] if the file size is 10MB, and it's requested (from a given frontend) at a rate of 10K/sec for a full 24H, the probability over that whole day of the 10MB file entering cache storage is ~= 2.528588107e-7
[15:22:04] and the weeklong average rate of reqs into an upload frontend today is somewhere closer to 1K/sec overall
[15:22:37] so even 10MB is a very fair cutoff to evade the whole exp() calculation on the frontends
[15:23:53] if(size>10MB) { hfp } else { exp_with_hfm_ttl=67s } ?
[15:24:25] should provide ample margins without hfm bloat in storage, and not affect the practical results of the exp admission policy in any way, even on a hypothetical 1TB mem cache.
[15:27:32] ran the same calcs with a 256K file size just to sanity-check the methodology vs expectations at our present hard size cutoff
[15:29:06] admission probability per-request for a 256K file with a 100GB fe cache size: if it was requested even 1/sec over a whole day, the whole-day odds of admission are .9999999...
[15:30:12] (but that's ~0.07% per request)
[15:31:03] at 1TB it's ~40% per request
[15:31:08] anyways
[15:52:25] ema: https://gerrit.wikimedia.org/r/#/c/391025/ seems like a bugfix, I wonder if it was causing any significant impact
[16:12:47] 10Traffic, 10Operations, 10monitoring, 10Prometheus-metrics-monitoring: authdns prometheus metrics are not available anymore - https://phabricator.wikimedia.org/T180256#3755414 (10fgiunchedi)
[16:40:25] 10Traffic, 10Operations, 10Patch-For-Review: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3755514 (10BBlack) What I really need to dig on this further is an easy way to see a list of recent WP0-abuse-related deletions on various wikis. Am I missing some way to use the deletion l...
[16:49:03] 10HTTPS, 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown: Wikimedia's recent upgrade to nginx v. 1.13.6 breaks older Android HTTP libraries - https://phabricator.wikimedia.org/T180269#3755551 (10BBlack) p:05Triage>03Normal > Unfortunately, there is a known issue with this version of nginx where...
[16:53:04] uhm, I forgot why we need the `|| beresp.http.Content-Length ~ "^[0-9]{9}"` part there
[16:53:17] but yeah it does look like a bugfix indeed!
[17:15:56] I think that code pattern is in case the CL header has a huge number in it that doesn't parse correctly as an integer (in which case we assume the size is very large rather than very small)
[17:16:03] e.g. more digits than can be parsed
[17:16:52] have we actually observed such a case in real traffic or is it more of a defensive measure?
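
The "whole day" odds above follow from a per-request admission probability p over N requests via the complement rule (a reconstruction of the method, not necessarily the exact script used); the 256K sanity check, with p ~= 7e-4 and N = 86400 (one request/sec for a day), comes out as:

    P_{\mathrm{day}} = 1 - (1 - p)^{N} \approx 1 - e^{-pN},
    \qquad 1 - e^{-0.0007 \cdot 86400} = 1 - e^{-60.48} \approx 0.9999\ldots

For the tiny-p cases (the 6MB and 10MB files), the same formula reduces to P_day ~= Np, which is why even 10K reqs/sec for 24H leaves the 10MB file at only ~2.5e-7.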
[17:21:14] defensive I think, against buggy CL header values, and/or varnish stupidly having a 32-bit-ish limit on integer conversions at some point in the past or future
[17:21:26] alright
[17:21:30] (cache_upload does have some rare files past the 2/4G-ish marks)
[17:38:34] 10Traffic, 10Operations, 10Patch-For-Review: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3755746 (10JEumerus) The user-side of deletion logs does not inherently have a search function, unless the specific actions are marked with a tag.
[17:50:00] 10Traffic, 10Operations, 10Patch-For-Review: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3755802 (10BBlack) Err, we should really move the sub-conversation back to T171881. This ticket is more about general reliability problems and/or race-conditions, not about the WP0 abuse s...
[22:23:00] 10Traffic, 10Operations: Change "CP" cookie from subdomain to project level - https://phabricator.wikimedia.org/T180407#3756812 (10Krinkle)
[22:26:34] 10Traffic, 10Operations: Change "CP" cookie from subdomain to project level - https://phabricator.wikimedia.org/T180407#3756828 (10Krinkle)
[23:16:43] 10Traffic, 10Operations, 10ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3757008 (10RobH) Supposedly this was delivered to ulsfo today, but I didn't get any email from UL support. Dropped them an email and will update. If it is onsite, I'll plan to go to ulsfo tomorrow (Tuesda...
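
A sketch of the defensive Content-Length pattern discussed at 16:53-17:21, assuming a frontend size cutoff in vcl_backend_response; the 256KB cutoff, HFP TTL, and surrounding structure are illustrative, not the exact VCL from the gerrit change:

    import std;

    sub vcl_backend_response {
        # Treat the response as "too large" if CL parses above the
        # cutoff, OR if CL has 9+ digits (>= 100,000,000 bytes). The
        # regex guards against CL values too big to parse: std.integer()
        # would return the fallback 0, misclassifying a huge object as
        # tiny, so we assume very large rather than very small.
        if (std.integer(beresp.http.Content-Length, 0) > 262144 ||
            beresp.http.Content-Length ~ "^[0-9]{9}") {
            return (pass(600s));
        }
    }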