[09:58:41] 10Traffic, 10Continuous-Integration-Config, 10Operations, 10Tracking: Add CI to all operations/software/varnish/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180329#3754081 (10hashar)
[10:00:47] 10Traffic, 10Continuous-Integration-Config, 10Operations: Add CI to all operations/software/varnish/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180329#3754081 (10hashar)
[12:33:11] 10Traffic, 10Operations, 10PAWS, 10Pywikibot-Commons, and 2 others: Server error (500) while trying to download files from Commons from PAWS - https://phabricator.wikimedia.org/T178567#3754469 (10Chicocvenancio) The problem is solved. I eventually moved it to [a tool](https://tools.wmflabs.org/merge2pdf/)...
[12:58:31] 10Traffic, 10netops, 10Cloud-VPS, 10Operations, 10cloud-services-team (Kanban): Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179#3754514 (10ema) p:05Triage>03Normal
[12:59:47] 10Traffic, 10Continuous-Integration-Config, 10Operations: Add CI to all operations/software/varnish/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180329#3754553 (10ema) p:05Triage>03Normal
[13:08:44] 10Traffic, 10Continuous-Integration-Config, 10Operations: Add CI to all operations/software/varnish/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180329#3754581 (10ema)
[13:13:05] 10Traffic, 10Continuous-Integration-Config, 10Operations: Add CI to all operations/software/varnish/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180329#3754602 (10ema) I've updated the task description with comments about all repos. They're all debian packages with the except...
[13:20:01] ema: I guess I can add the debian-glue job and see what happens? :D
[13:20:14] though it would end up compiling the libvmod using whatever is in apt.wm.o :/
[13:23:31] hashar: wouldn't it use the contents of the git repos? What do you mean by 'using whatever is in apt.wm.o'?
[13:24:32] ema: if I take e.g. operations/software/varnish/libvmod-vslp, it most probably depends on Varnish somehow. pbuilder would end up fulfilling the dependency by downloading varnish from apt.wm.o
[13:24:35] though maybe
[13:25:03] we can have a job that compiles varnish from whatever latest is in git and then compiles the libvmod against that
[13:26:42] mmh no, that should be fine. When we build the VMODs on the package builder host, build-dependencies are resolved by installing packages from apt.wm.o
[13:28:01] so libvmod patches would fail until the varnish package is updated on apt.wm.o
[13:28:12] sounds good enough for now. I will add the few jobs
[13:43:24] re: hfp vs hfm, after some more thinking I believe that a valid approach to choosing which one to use would be: if the response can become cacheable in the future (e.g. because it will stop setting cookies) and it is OK to disable conditional requests, then HFM should be used. Otherwise HFP.
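
To make the 13:43 rule concrete: a minimal VCL sketch of the two options, assuming Varnish 5 syntax (where beresp.uncacheable plus a positive TTL creates hit-for-miss, and hit-for-pass needs an explicit return (pass(...))); the trigger conditions and 600s TTLs here are illustrative placeholders, not the production values:

    import std;

    sub vcl_backend_response {
        if (beresp.ttl <= 0s && beresp.http.Set-Cookie) {
            # May become cacheable later (e.g. once it stops setting
            # cookies), and we accept losing conditional requests:
            # hit-for-miss. In Varnish 5, beresp.uncacheable with a
            # positive TTL creates the HFM object.
            set beresp.ttl = 600s;
            set beresp.uncacheable = true;
            return (deliver);
        }
        if (std.integer(beresp.http.Content-Length, 0) > 1073741824) {
            # Will never be cached here (hypothetical 1GB cutoff):
            # hit-for-pass, which keeps conditional requests working
            # but cannot be purged/banned before it expires.
            return (pass(600s));
        }
    }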
[13:44:50] the reason being that HFM has the advantage of allowing cacheable responses to be cached within the hfm object's TTL, with the drawback of turning all conditional requests into non-conditional ones
[13:49:03] the main rationale for hfm is preventing people from shooting themselves in the foot by creating a hfp object with a very long ttl for particularly popular requests
[13:51:06] (hfp objects can't be purged/banned, so they just stick around for their lifetime no matter what)
[13:53:20] but, a few observations:
[13:53:54] 1) in our scenario I don't think it's very likely that we'd start creating very long-lived hfp objects, unless we screw up a vcl change?
[13:54:23] 2) disabling conditional requests is a pretty hardcore thing to do, especially on cache_upload
[13:55:45] 3) we do make relatively heavy use of conditional requests on text too, given our architecture of layered varnishes and grace/keep
[14:01:49] the disabling of conditionals on a miss is probably general-case desirable, though
[14:02:18] that's what happens on all normal misses, too
[14:02:28] (because otherwise, varnish doesn't get content to cache for future requests)
[14:04:07] 10Traffic, 10Continuous-Integration-Config, 10Operations: Add CI to all operations/software/varnish/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180329#3754739 (10hashar) If `operations/software/varnish/libvmod-header` is obsolete and you are never going to change it later:...
[14:05:57] well for normal misses it is indeed desirable; I'm not sure about hfp/m though, in the sense that there can be popular hfp/m objects for responses that keep on not being cacheable
[14:10:03] right
[14:11:26] the only case I can think of where something is potentially-cacheable and potentially-popular but we'd choose not to cache it would be the frontend cache-hitrate filters, though.
[14:11:51] e.g. the current absolute size cutoff, or the exp() method
[14:12:13] for the abs size cutoff hfp is clearly the answer
[14:13:06] for exp() it's trickier. in theory if it's popular enough it will become cached eventually. but for sufficiently-large objects, "eventually" could be a rather large number of requests.
[14:14:53] and for those cases, the backend storage can/will cache
[14:15:26] but... yeah it would be annoying if the FE was converting a potential client-conditional -> backend-instance-304 into a 200 and full response back out to the user
[14:19:39] https://github.com/varnishcache/varnish-cache/pull/2135#issuecomment-343921170
[14:27:31] just under a year! :)
[14:28:11] anyways, back on the hfm issues...
[14:29:26] the basic reason to use a non-zero TTL HFM (which wasn't possible before HFM existed) is that it gives you hash_ignore_busy-like behavior vs using 0-second uncacheables.
[14:30:17] but otherwise... all the stuff about conditional requests is the same with our current ttl=0+uncacheable "single miss" situations
[14:30:27] (they just have the additional caveat that they stall parallel clients, too)
[14:32:06] so, the safe thing to do (maybe not the most-optimal, but at least the most-predictably-unbreaky) would be to leave all HFP as they are, and only look at our uncacheable+ttl=0 cases (miss-once) and decide whether they should get TTLs (HFM for N seconds to avoid stalling), which is probably almost always desirable in those cases for some reasonable TTL like 10-15 mins.
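
A sketch of the specific change proposed at 14:32, assuming a hypothetical miss-once condition in vcl_backend_response; the 900s TTL is just one example of the "10-15 mins" range mentioned:

    sub vcl_backend_response {
        if (bereq.http.X-Some-Uncacheable-Case) {  # hypothetical condition
            # Before: miss-once, which stalls parallel clients on the
            # waitinglist while each fetch is in flight:
            #   set beresp.ttl = 0s;
            #   set beresp.uncacheable = true;
            # After: same uncacheable marker but a real TTL, so under
            # Varnish 5 this becomes a hit-for-miss object and parallel
            # clients are not serialized for the next 15 minutes:
            set beresp.ttl = 900s;
            set beresp.uncacheable = true;
        }
    }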
[14:32:33] ["leave all HFP as they are" meaning leave them as HFPs, which means changing their syntax under v5]
[14:34:44] right :)
[14:35:10] yes, agreed
[14:37:59] re: actual HFM TTL values: an infinite TTL maximizes the benefits in terms of stall-avoidance. And for objects that aren't low-hit-wonders or way-too-big (etc), a long-TTL HFM would eventually convert to a cache object.
[14:38:49] the constraint in the other direction is that over-long TTLs on HFM objects which are rare (e.g. low-hit-wonders) would accumulate lots of pointless HFM objects in cache storage and data structures, etc.
[14:39:12] I figure the desirable middle-ground here is "long enough to avoid most stalling complications for most clients"
[14:40:28] ... which probably means "considerably higher than the usual transfer time of objects to clients"
[14:41:56] hmmm no, transfer time to the client mattered in v3
[14:42:26] in v4, when N clients are stalled on the same missed object, the transfer from the backend->storage goes full-speed, even if the first client was a slow one.
[14:42:44] (so the second client can still complete the request before the first one)
[14:43:20] so the desirable middle-ground here is "considerably higher than the usual transfer time of the object from the backend to varnishd"
[14:44:05] and lower than "some value which would cause way too many HFMs to accumulate in storage from cold requests that don't really matter"
[14:46:50] even if an individual fetch from ms-fe ran at ~100Mbps (not sure what our parallelism is here vs iface bw), a 10GB file would transfer in 819s if I did my math right.
[14:54:40] we should really try to get some histogram / pNN type of data on response transfer times for cache_upload. or maybe just inferred transfer-rate.
[14:54:51] most are fast, but in looking manually just now there are definitely outliers...
[14:55:02] gotta go out for a bit, bbl
[14:55:21] - BerespHeader Content-Length: 32
[14:55:22] - Timestamp BerespBody: 1510584762.615053 60.466561 0.000130
[14:55:40] ^ I assume those super-short CLs are slow 404s and/or just plain timeouts
[14:55:47] * << BeReq >> 415863569
[14:55:47] - BerespHeader Content-Length: 27599
[14:55:47] - Timestamp BerespBody: 1510584758.507788 1.377139 0.000709
[14:56:04] ^ but that looks different. It looks like 1377ms to transfer 27KB :P
[14:58:05] anyways, I took a few samples of reasonable-looking transfers of larger-ish objects from ms-fe->varnish in eqiad and the avg speed was closer to 25Mbps
[14:58:56] on the other other hand, 10GB+ files are probably rare-to-nonexistent, and we could/should probably put a hard cap on the exp() behavior anyways just to hfp the edge cases away.
[15:00:04] at 25Mbps, 1GB is ~327s.
[15:01:22] ema: so maybe even in the exp case, we put an HFP size filter in front of it to get rid of crazy cases whose admission chances will always have super-huge numbers of leading zeros on the frontend, and use HFM times around 600s?
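
Both transfer-time figures above work out exactly if GB and Mbps are read as binary units (2^30 bytes and 2^20 bits/s), i.e. transfer time t = 8S/R for size S and rate R:

    t = \frac{8S}{R}, \qquad
    t_{10\,\mathrm{GB},\,100\,\mathrm{Mbps}} = \frac{8 \cdot 10 \cdot 2^{30}}{100 \cdot 2^{20}}\,\mathrm{s} = 819.2\,\mathrm{s}, \qquad
    t_{1\,\mathrm{GB},\,25\,\mathrm{Mbps}} = \frac{8 \cdot 2^{30}}{25 \cdot 2^{20}}\,\mathrm{s} \approx 327.7\,\mathrm{s}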
[15:02:19] if(size>1GB) { hfp } else { exp_with_hfm_ttl=600s }
[15:14:28] running some numbers on the exp() calculations: if we had a hypothetical future frontend node with 1024GB of malloc storage, and our current cache_upload statistics feeding the parameters
[15:14:49] a 6MB file is already at an admission probability of ~= 4.784302618e-10
[15:20:13] running with that kind of calculation a little further:
[15:20:55] if the file size is 10MB, and it's requested (from a given frontend) at a rate of 10K/sec for a full 24H, the probability over that whole day of the 10MB file entering cache storage is ~= 2.528588107e-7
[15:22:04] and the weeklong average rate of reqs into an upload frontend today is somewhere closer to 1K/sec overall
[15:22:37] so even 10MB is a very fair cutoff to evade the whole exp() calculation on the frontends
[15:23:53] if(size>10MB) { hfp } else { exp_with_hfm_ttl=67s } ?
[15:24:25] should provide ample margins without hfm bloat in storage, and not affect the practical results of the exp admission policy in any way, even on a hypothetical 1TB mem cache.
[15:27:32] ran the same calcs with a 256K file size just to sanity-check the methodology vs expectations at our present hard size cutoff
[15:29:06] admission probability per-request for a 256K file with a 100GB fe cache size: if it was requested even 1/sec over a whole day, the whole-day odds of admission are .9999999...
[15:30:12] (but that's ~0.07% per request)
[15:31:03] at 1TB it's ~40% per request
[15:31:08] anyways
[15:52:25] ema: https://gerrit.wikimedia.org/r/#/c/391025/ seems like a bugfix, I wonder if it was causing any significant impact
[16:12:47] 10Traffic, 10Operations, 10monitoring, 10Prometheus-metrics-monitoring: authdns prometheus metrics are not available anymore - https://phabricator.wikimedia.org/T180256#3755414 (10fgiunchedi)
[16:40:25] 10Traffic, 10Operations, 10Patch-For-Review: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3755514 (10BBlack) What I really need to dig on this further is an easy way to see a list of recent WP0-abuse-related deletions on various wikis. Am I missing some way to use the deletion l...
[16:49:03] 10HTTPS, 10Traffic, 10Operations, 10Wikimedia-General-or-Unknown: Wikimedia's recent upgrade to nginx v. 1.13.6 breaks older Android HTTP libraries - https://phabricator.wikimedia.org/T180269#3755551 (10BBlack) p:05Triage>03Normal > Unfortunately, there is a known issue with this version of nginx where...
[16:53:04] uhm, I forgot why we need the `|| beresp.http.Content-Length ~ "^[0-9]{9}"` part there
[16:53:17] but yeah it does look like a bugfix indeed!
[17:15:56] I think that code pattern is in case the CL header has a huge number in it that doesn't parse correctly as an integer (in which case we assume the size is very large rather than very small)
[17:16:03] e.g. more digits than can be parsed
[17:16:52] have we actually observed such a case in real traffic or is it more of a defensive measure?
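
The "whole day" odds above follow from a per-request admission probability p over N requests via the complement rule (a reconstruction of the method, not necessarily the exact script used); the 256K sanity check, with p ~= 7e-4 and N = 86400 (one request/sec for a day), comes out as:

    P_{\mathrm{day}} = 1 - (1 - p)^{N} \approx 1 - e^{-pN},
    \qquad 1 - e^{-0.0007 \cdot 86400} = 1 - e^{-60.48} \approx 0.9999\ldots

For the tiny-p cases (the 6MB and 10MB files), the same formula reduces to P_day ~= Np, which is why even 10K reqs/sec for 24H leaves the 10MB file at only ~2.5e-7.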
[17:21:14] defensive I think, against buggy CL header values, and/or varnish stupidly having a 32-bit-ish limit on integer conversions at some point in the past or future
[17:21:26] alright
[17:21:30] (cache_upload does have some rare files past the 2/4G-ish marks)
[17:38:34] 10Traffic, 10Operations, 10Patch-For-Review: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3755746 (10JEumerus) The user-side of deletion logs does not inherently have a search function, unless the specific actions are marked with a tag.
[17:50:00] 10Traffic, 10Operations, 10Patch-For-Review: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3755802 (10BBlack) Err, we should really move the sub-conversation back to T171881. This ticket is more about general reliability problems and/or race-conditions, not about the WP0 abuse s...
[22:23:00] 10Traffic, 10Operations: Change "CP" cookie from subdomain to project level - https://phabricator.wikimedia.org/T180407#3756812 (10Krinkle)
[22:26:34] 10Traffic, 10Operations: Change "CP" cookie from subdomain to project level - https://phabricator.wikimedia.org/T180407#3756828 (10Krinkle)
[23:16:43] 10Traffic, 10Operations, 10ops-ulsfo: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3757008 (10RobH) Supposedly this was delivered to ulsfo today, but I didn't get any email from UL support. Dropped them an email and will update. If it is onsite, I'll plan to go to ulsfo tomorrow (Tuesda...
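
A sketch of the defensive Content-Length pattern discussed at 16:53-17:21, assuming a frontend size cutoff in vcl_backend_response; the 256KB cutoff, HFP TTL, and surrounding structure are illustrative, not the exact VCL from the gerrit change:

    import std;

    sub vcl_backend_response {
        # Treat the response as "too large" if CL parses above the
        # cutoff, OR if CL has 9+ digits (>= 100,000,000 bytes). The
        # regex guards against CL values too big to parse: std.integer()
        # would return the fallback 0, misclassifying a huge object as
        # tiny, so we assume very large rather than very small.
        if (std.integer(beresp.http.Content-Length, 0) > 262144 ||
            beresp.http.Content-Length ~ "^[0-9]{9}") {
            return (pass(600s));
        }
    }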