[02:37:31] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4083351 (10Krinkle) @Vgutierrez I've seen the conversion to mtail going on, and look forward to using Prometheus in the ResourceLoader dashboards. Howev... [02:46:12] 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#4083365 (10BBlack) Digging into the timestamps of the final entries of various types (for load.php slowlog entr... [07:41:12] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4083535 (10Vgutierrez) >>! In T184942#4081704, @Pchelolo wrote: > - API Summary dashboard - this one uses `varnish.$dc.backends.be_{backend}` metric and... [07:56:47] ema, godog, I've seen that we are using mtail 3.0-rc4 from sid.. as godog pointed out here (https://github.com/wikimedia/puppet/blob/production/modules/mtail/files/programs/varnishbackend.mtail#L6-L7) there is some work missing because 3.0-rc4 has some issues handling floats [07:57:12] good thing is that since then, sid updated to 3.0-rc5 where the issue is already fixed [07:57:50] should I have any special considerations before upgrading in our repo from 3.0-rc4 to 3.0-rc5? [08:12:26] vgutierrez: no I don't think so, should be fine to upgrade, thanks for looking into it! [08:14:41] godog: I'm about to run our mtail tests against mtail 3.0-rc5, to check that everything behaves as we expect [08:16:06] ack [08:26:51] godog: works like a charm :D [08:29:42] vgutierrez: eggcelent! [08:39:47] hmmm shouldn't boron be able to reach debian mirrors? [08:40:19] vgutierrez: hey :) [08:40:27] what is not working? 
[08:40:34] dget http://http.debian.net/debian/pool/main/m/mtail/mtail_3.0.0~rc5-1.dsc [08:40:37] that [08:40:46] mmh [08:40:52] static.debian.org [130.89.148.14] 80 (http) : Connection timed out [08:41:05] have you tried setting the http_proxy environment variable? [08:42:50] learn++ [08:42:52] thx [08:53:03] vgutierrez: FWIW I got a couple of aliases proxy-on/proxy-off in my .bashrc for that, in puppet [08:54:43] ack [08:55:02] is it completely ok to build a DIST=sid package and upload it to our jessie repo? [08:56:30] for things like golang packages that usually don't depend on anything else we do that too yeah [08:56:57] in this particular case even built on stretch-backports and uploaded to jessie would do, I see mtail rc5 is in stretch-backports [09:41:29] 10Traffic, 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4083723 (10fgiunchedi) >>! In T184942#4083351, @Krinkle wrote: > I did find `fmt_inm` as `varnish_resourceloader_inm` via [varnishrls.mtail](https://gith... [10:01:57] godog: thanks for https://gerrit.wikimedia.org/r/#/c/422110/ [10:02:40] godog: I've now checked on cp2006 (which failed rebooting properly without human intervention a few days ago because of memory errors), and there's no sign of errors in prometheus [10:03:24] do they get cleared out upon successful reboot? [10:03:52] I believe so yeah, until the next detection [10:36:43] 10Traffic, 10Operations: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4083947 (10ema) >>! In T190450#4074126, @Krinkle wrote: > I'm not sure why this would've changed recently. I suppose we can try to track it down. However, i... 
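[editor's note] The http_proxy fix and the proxy-on/proxy-off aliases mentioned above could look roughly like the following .bashrc helpers. This is a hedged sketch: the proxy URL is a placeholder (the real webproxy address isn't shown in the log), and the helpers are written as functions named proxy_on/proxy_off since hyphens aren't portable in function names.

```shell
# Hypothetical proxy address -- substitute the real webproxy host/port.
PROXY_URL="http://webproxy.example:8080"

# Route HTTP(S) fetches (dget, curl, etc.) through the proxy.
proxy_on() {
    export http_proxy="$PROXY_URL"
    export https_proxy="$PROXY_URL"
}

# Go back to direct connections.
proxy_off() {
    unset http_proxy https_proxy
}
```

Usage would then be something like `proxy_on && dget http://http.debian.net/debian/pool/main/m/mtail/mtail_3.0.0~rc5-1.dsc`.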
[10:36:53] 10Traffic, 10Operations: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4083949 (10ema) p:05Triage>03Normal [11:13:24] 10Traffic, 10Operations: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4084046 (10zeljkofilipin) >>! In T190450#4081937, @Ragesoss wrote: > It might make sense to patch the gem to return empty string for 404s, which would resto... [11:13:42] 10Traffic, 10Operations, 10User-zeljkofilipin: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4084049 (10zeljkofilipin) [12:22:39] ema: where are we turning CL:0 into 404 at? [12:24:09] (or do I misunderstand the wording in the ticket?) [12:24:26] oh, I do [12:24:42] we don't ever turn a non-404 into a 404. It's just we generate a custom errorpage when it should've been empty body [12:25:03] hmmm [12:32:58] it's a tricky case! [12:33:18] (and it's really quite accidental that over-abuse of gzipping small/zero outputs was preventing this before) [12:34:46] in the general case for all >=400 outputs, we probably can't safely assume unset-CL == CL:0 in VCL. [12:35:01] there's the TE:chunked case, but also there's the http/1.0-style close-delimited case [12:35:20] but regardless of that possible improvement we'd still have the issue in the ticket [12:37:20] perhaps, a more-correct way to handle this would be to not replace output with synthetic if the applayer has already set Content-Type to [12:38:12] or another way of thinking about the problem: the whole reason varnish injects that errorpage is in the name of errorpage standardization [12:38:40] perhaps varnish can assume MediaWiki standardizes those outputs at its own layer, and only needs to do so for its internal outputs. 
[12:39:14] body-free cases that are 5xx might be different, but for 4xx surely we can trust the applayer if the applayer is MW, to set MW-standard error pages where necc [12:55:53] hmmm [13:17:19] interesting case, yes! [13:22:02] 10Traffic, 10Operations, 10User-zeljkofilipin: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4084338 (10BBlack) So, recapping a bit what's already been mentioned above: the proximate cause of behavior change was the AE:gzip... [13:24:53] oh and we should probably do that at the backend(-most?) layer only [13:25:03] yeah see ticket update [13:25:21] which probably could've been written in half the length if my brain was more-fully awake [13:25:56] good morning! :) [13:28:17] 10HTTPS, 10Parsoid, 10VisualEditor: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#4084367 (10Seb35) 05Open>03Invalid I close this task given it is a request of support with no response since 5 months and the domain name expired since at least 3 months. Fee... [13:28:33] bblack, ema do you know any good method to determine bucket sizes for HTTP request TTFB? [13:29:10] you mean like p50/p90/p95? [13:29:20] yup [13:29:39] feeding data into prometheus requires slicing the requests into buckets [13:29:51] oh [13:30:09] that seems wrong somehow, we can't know the statistical buckets while we're gathering the raw stats [13:30:21] bucketing should be happening up in grafana or something [13:31:18] but, I think this is where we're treading into the line of what's prometheus-appropriate and what isn't [13:31:29] I was reading some stats theory on how to do that: http://www.statisticshowto.com/choose-bin-sizes-statistics/ [13:31:57] where would we be feeding ttfb via prometheus anyways? 
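[editor's note] One of the standard bin-size heuristics covered by pages like the one linked above is the Freedman-Diaconis rule (bin width = 2 · IQR · n^(-1/3)). A minimal sketch, with the caveat raised next in the conversation: any buckets derived from a sample bake that sample's distribution into the histogram.

```python
import statistics

def freedman_diaconis_width(sample):
    """Freedman-Diaconis bin width for a list of numeric observations.

    Uses the interquartile range (Q3 - Q1), which is robust to outliers,
    scaled by the cube root of the sample size.
    """
    q1, _, q3 = statistics.quantiles(sample, n=4)  # quartile cut points
    iqr = q3 - q1
    return 2 * iqr / len(sample) ** (1 / 3)
```

For ttfb data this would only describe the sample it was computed from, which is exactly the objection raised below about adverse conditions.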
[13:31:58] I guess I can get some data sample and pick the bin sizes after that [13:32:18] yeah but basing it on a sample is going to cause misleading outputs [13:32:29] things vary a lot under adverse conditions, which is precisely when the graphs are most-useful [13:33:59] but back to the broader point: [13:34:52] prometheus is useful for machine stats that are more-or-less time-stepped raw values (e.g. cpu% sampled every N seconds, or disk write bytes per N interval or per-second but sampled every X seconds) [13:35:07] and I think it's also useful for request-driven stats with discrete states [13:35:29] (e.g. emit 1x stat per http request, but the data is discrete values like http-status-code, or ciphersuite, etc) [13:36:07] but I doubt prometheus is the right tool for a raw value (not a limited discrete set) that's happening on a per-event/request basis rather than time-steps. [13:36:33] that falls more in the realm of an analytics pipeline than a monitoring one [13:36:57] it's a time series database ;) [13:37:10] right [13:37:37] 10Traffic, 10Operations, 10User-zeljkofilipin: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4084401 (10Gilles) Sure, the behavior can be reverted in these ways, but the rationale to have more human-friendly error pages shoul... [13:37:59] once you take a stream of something like "raw ttfb values per request" and try to pre-encode the stat for prometheus somehow into discrete buckets and/or time-steps and let it post-process it... you've lost statistical meaning you'd want to have when evaluating that data. [13:38:24] bblack: right now we are doing exactly that with graphite [13:38:31] well :) [13:38:54] what's the actual case? [13:39:57] (I suspect we're doing lots of statistically-invalid things in places with graphite, good time to clean up!) [13:41:20] so.. 
I'm trying to deprecate python varnish cachestats (T184942) [13:41:20] T184942: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 [13:41:48] and basically I'm trying to finish the varnishbackend.mtail program (https://github.com/wikimedia/puppet/blob/production/modules/mtail/files/programs/varnishbackend.mtail) [13:41:57] that was blocked by a mtail bug [13:42:18] yeah there's a lot in there :) [13:42:28] so for the varnishbackend data, we were recording timings too? [13:42:38] (I can hear ema laughing 1000km away) [13:42:49] indeed [13:43:01] this is a sample line of what it should be parsing "http_status 200 http_method GET backend vcl-root-4d6ee02a-4455-43f6-aebd-f6a5f538b139.be_wdqs_svc_eqiad_wmnet ttfb 0.015081" [13:43:15] I think we should probably stop doing so, and stick to things like status codes and dispositions or whatever [13:43:31] if we need to know about slow backend responses, we can get that from logstash in a different way [13:43:43] or we can set up something that's more analytics-appropriate for that kind of data [13:44:12] or, for basic debugging we could bucket it into discrete chunks that are just good enough for debugging purposes, but not for statistical analysis (e.g. p95 or whatever) [13:45:13] e.g. have ttfb buckets for [<10ms,<100ms,<1s,<10s,<100s,101+] [13:45:37] hmmm we won't have anything greater than 60secs, right? [13:45:56] hypothetically, but that's subject to tuning and bugs, all of which can evolve over time [13:46:02] better to have a bucket it can go in [13:46:18] yup.. 
prometheus requires a "+Inf" bucket [13:46:27] ok so that maps well [13:46:39] just do orders-of-magnitude down from there to ~10ms [13:46:55] and that's enough for debugging problems without getting into the real-stats issues [13:47:31] ack [13:48:31] also [13:48:39] yeah each per-request value stored in prometheus is a non-starter, hence the buckets [13:48:40] I kind of wonder where that ttfb number comes from in Varnish anyways [13:49:06] since whenever the graphite version of that was written, we've come to understand varnish timing output much better than we did before [13:49:06] btw it isn't that statistical meaning is lost, it is that a (tunable) error is introduced with the buckets [13:49:55] I guess that's one way to say it, but the error can be quite large if you can't predict what the buckets should be at any given time. [13:50:45] that's for sure, I see the buckets as a ballpark [13:50:56] * godog does the one-eye-point-at-thumb gesture [13:51:03] statistically-meaningful ones would have to be linear (e.g. 
more like 100ms steps instead of order-of-magnitude steps), and then you either end up with way too many buckets, or a quite limited total range (say 100ms buckets from 0-100s, which already gives you 1K buckets which is crazy) [13:51:40] but then under some adverse condition when someone's finally really wanting to stare at that graph, maybe almost everything is in the final "N->inf" bucket because it's outside the expected range [13:52:14] the 1K buckets probably don't scale well anyways, you're probably really looking/hoping for something more on the order of 10 [13:53:02] yeah 10, 15 tops and mostly for operational purposes, tying back to what you were saying earlier about what's that ttfb really and what performance we should expect from it [13:53:25] for thumbor the "bucket tuning" has been essentially "zooming" into the interesting ranges [13:53:50] or adding resolution, you get the idea [13:54:22] yeah [13:55:00] for this kind of purpose, I like the idea of bucketing exponentially. you get the range coverage in a small bucket count and can see what's important for debugging, but throw all hopes of (bounded-error) statistical utility out the window. [13:55:31] 0.01,0.1,0.5,1,2,4,8,16,32,+Inf? [13:58:32] that seems to switch exponential factors near the bottom [13:58:39] but whatever it's bike-shedding :) [13:58:51] vgutierrez: my suggestion would be to start with sth like that yeah and we can bikesh^Wdiscuss on the code review [13:58:53] I doubt we care much about the distinction between a 2s request and 4s one [13:59:08] godog: ack :D [13:59:44] mostly we'd expect most well-behaved backend requests to be sub-10ms in an ideal world. 
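[editor's note] Putting the pieces above together, a hedged sketch (in Python rather than mtail, purely for illustration) of how the sample input line could be parsed and how a ttfb value maps onto the proposed exponential bucket boundaries. The regex is an assumption based on the one sample line quoted earlier, not the actual mtail program.

```python
import re

# Field layout assumed from the sample line quoted in the log.
LINE_RE = re.compile(
    r"http_status (?P<status>\d+) http_method (?P<method>\S+) "
    r"backend (?P<backend>\S+) ttfb (?P<ttfb>\d+\.\d+)"
)

# The exponential boundaries proposed above, plus the +Inf bucket
# Prometheus histograms require.
BUCKETS = [0.01, 0.1, 0.5, 1, 2, 4, 8, 16, 32, float("inf")]

def parse(line):
    """Return the named fields as a dict of strings, or None on no match."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None

def bucket_for(ttfb):
    """Smallest upper bound the value fits under (Prometheus 'le' semantics)."""
    return next(b for b in BUCKETS if ttfb <= b)

sample = ("http_status 200 http_method GET backend "
          "vcl-root-4d6ee02a-4455-43f6-aebd-f6a5f538b139.be_wdqs_svc_eqiad_wmnet "
          "ttfb 0.015081")
fields = parse(sample)
```

With these boundaries the sample's 0.015081s ttfb lands in the 0.1 bucket, and anything past 32s falls into +Inf, matching the "better to have a bucket it can go in" point above.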
[14:00:01] in our own real world, maybe somewhere sub-1s is a more-reasonable target :P [14:00:22] and then there's going to be the hopefully-rarer slow/timeout cases out in the upper ranges [14:00:33] fwiw one of my rules of thumb to see if prometheus is a good fit to answer a certain question is whether I want the accuracy of a billing/metering system, if I want that then prometheus isn't a good fit [14:00:56] yeah [14:01:31] for that category of things (what I was calling analytics-like), I don't know that we have any good ops-level solution. We have webrequest -> hive/pivot/etc for the obvious case of per-request metrics. [14:01:49] maybe we shouldn't have an ops-level solution though, and that should always be on the analytics side of things [14:01:58] (through something-or-other -> hive and onwards) [14:03:33] I agree we shouldn't have sth like that, meaning if we really need to the data is in hadoop and we can look at it to dig deep [14:04:24] we get away with 5xx in logstash because our elasticsearch cluster can deal with it, but that's a different scenario [14:04:27] sometimes it currently isn't, though (but maybe should be) [14:04:33] notable relevant/recent case: [14:04:47] https://grafana.wikimedia.org/dashboard/db/performance-singapore-caching-center?orgId=1&panelId=1&fullscreen&from=now-7d&to=now [14:05:03] ^ this is coming from graphite and trying to operate on ttfb-like values, which can't be working quite right statistically... [14:06:13] yeah perhaps good to look at the relative change, absolute values I don't know [14:11:15] bblack: grace does seem to apply to hfp objects, https://gerrit.wikimedia.org/r/#/c/421542/ updated [14:12:09] also, I believe it's unnecessary to shorten the hfm ttl? responses that become cacheable would get cached so there should be no bad fallout due to wrong hfm? 
[14:14:25] 10Traffic, 10Operations, 10User-zeljkofilipin: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4084478 (10BBlack) Yeah, I guess I was taking it as implicitly true that it's correct for MW to desire content-free 404s in the case... [14:16:13] ema: the reason to avoid long hfm is just to avoid pointless waste of storage/indices on transient-ish things. We tend to think about the common-hot and corner cases, but not the middling ones. [14:17:05] e.g. a given URL could be fairly-rare (gets accessed let's say once every 30 minutes on average), and is usually cacheable-200, but occasionally returns a 5xx which is going to create an HFM. [14:17:22] there could be a very large long-tail set of such rare-ish URLs [14:18:02] and every time one of them creates an hfm, it basically sits around pointlessly in storage for a full 10 minutes even though nobody's bothering to hit on it. It will probably expire before the next access which gets a 200. [14:18:45] mmh [14:18:48] so if we took the "long hfm is np" argument to the extreme, and we set our hfm ttls to be 24h, you can see how pointless ones would pile up. [14:18:58] we do explicitly exclude 5xx though [14:19:09] no, we explicitly *include* 5xx's [14:19:14] right? [14:19:36] oh right, exclude [14:19:48] yup [14:20:08] 5xx should be such a case, though, we just left it out because of the hfp implications [14:20:13] (back when we had no hfm) [14:20:23] what's the default 5xx behavior? I know it's uncacheable, but... [14:21:25] if it wasn't for the missed opportunities for conditional responses, using hfm for everything would be way simpler... [14:23:42] yeah [14:23:57] so, we could use hfm for 5xx [14:24:15] for < 500, hfp if potentially cacheable, hfm otherwise [14:24:34] s/cacheable/candidate for conditional requests/ [14:24:34] but now I really wonder, what is happening with 5xx now? [14:24:39] is it a one-shot miss? 
[14:25:14] I think so, what would be the alternative? [14:25:26] well I just wonder how that interacts with coalesce, etc [14:25:33] I'm guessing if we have a normally-cacheable output [14:25:57] and it expires and we try to fetch for a new client and get a 5xx, that we're serializing/coalescing clients at that point until we get another 2xx [14:26:32] oh [14:26:43] (which is why it was desirable to do something better for 5xx in that block originally, but then we couldn't because hfp was our only tool) [14:28:18] so for a really transient 5xx it seems desirable to coalesce till we get a 2xx response [14:28:28] a cacheable 2xx response [14:28:43] well [14:29:06] unless "really transient" means "for 30 seconds until the applayer recovers" and we get 10K requests for this very hot object in that time [14:29:27] I guess there's no great answer at that point [14:30:08] either the clients stall up and consume threads in varnish-land, or we flood them through to the applayer and probably make things worse there [14:30:11] right, disabling coalescing and sending the 10K reqs to the recovering applayer could also be problematic [14:31:20] so, 5xx is tricky, and really a separate problem [14:31:55] the other thing that's ugly about that block currently, is the fragile/strange condition on underlying cache hits trying to avoid the edge-cases around expiring/graced objects [14:33:21] the assumption there being that if the underlying cache returned a hit and we consider it uncacheable the object must be expired (but graced) [14:33:23] probably some of the related behaviors should be different in the applayer-facing backend than elsewhere [14:33:24] is that right? [14:33:51] maybe a better way to state the intent of the block: [14:34:11] the point of the block was to catch the truly-persistently-uncacheable cases, and give them reasonable-duration HFP objects to avoid coalescing. 
[14:34:41] but then there are responses which look uncacheable, for which hfp is undesirable because the uncacheability is actually transient, and we're trying to avoid some classes of those: [14:34:53] 1) 5xx's [14:35:31] 2) objects which have ttl<=0, but came from another cache layer which indicated a hit, which means they probably hit on a recently-expired object as a grace-hit while re-fetching. [14:36:41] one of the fundamental issues here is that ttl<=0 is not exactly equal to "has uncacheable headers", which should be indicated by e.g. no-store|private type stuff, not max-age=0 [14:36:52] (or max-age=500 + age=501) [14:37:21] but we've translated the no-cache|no-store|private -type cases to ttl=0 earlier in the same subroutine, and then we're handling it all together down here [14:37:44] all of this needs some re-thinking I think [14:38:15] also we should add another desirable case above: [14:38:25] 3) MediaWiki-Api-Error + 200 OK :P [14:38:45] lol [14:39:52] (a case where we avoid creating hfps) [14:40:29] for cases where 304 is possible, I think we're better off sticking with hfp and trying to deal with these conditions. [14:40:45] we can obviously opt-out of all this mess with a short hfm for non-304-capable responses. [14:41:39] so keep the patch as is and shorten hfm too? [14:42:06] (which rewinds back to the whole subject of how long hfm ttls should be. if we know extremely-long is bad, and that badness gets better the shorter they are, what's the lower limit of reasonability? 
probably "long enough that combined with grace we cover reasonable coalesce windows" aka "long enough to cover a small multiple of a maximum possible fetch-time") [14:44:29] now I find myself re-checking the properties of the MW-Api-Error+200OK case [14:44:39] (which takes a while, my known case is a 60s timeout) [14:46:37] ah for once it's as I hoped [14:46:49] the MW-Api-Error+200 case doesn't emit LM or Etag at least [14:46:55] so it will fall into the hfm case anyways [14:47:35] ok [14:48:09] 10Traffic, 10DNS, 10Mail, 10Operations, 10Patch-For-Review: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065#4084586 (10RobH) Tim emailed about this a couple of weeks ago, and I sent out another email to them regarding this. Hopefully it gets some movement soon. [14:49:08] now I've really twisted this all up in my head though, I'm starting to question whether the conditional should be on the request's conditional properties instead of the response object [14:49:18] (req's INM IMS?) 
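[editor's note] The sizing rule quoted above ("long enough to cover a small multiple of a maximum possible fetch-time") is just arithmetic; a minimal sketch, where the safety factor and the 60s fetch timeout are illustrative values from this conversation, not production settings.

```python
def min_hf_lifetime(max_fetch_time, safety_factor=2):
    """Lower bound on an hfp/hfm lifetime (ttl + grace), per the rule of
    thumb above: a small multiple of the worst-case fetch time, so a
    refetch can complete before the object is fully gone."""
    return safety_factor * max_fetch_time
```

E.g. with a ~60s fetch timeout this suggests a combined ttl+grace of at least a couple of minutes.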
[14:50:01] so going back over that (re hfp vs hfm) [14:50:27] on a normal miss, where we have no previous information about a URL, varnish always strips INM/IMS in order to hope for getting a cacheable body instead of a 304 [14:51:28] hit-for-pass/pass has the desirable property that if req.INM/IMS + resp.ETag/LM line up, the client+applayer can do 304s with each other instead of re-fetching whole objects (either way the cache isn't involved) [14:52:23] hfm is an always-miss sort of scenario, which means always-strip-INM/IMS, denying 304 opportunities between client+applayer while passing through [14:52:42] (assuming the condition persists) [14:52:53] which is why we don't just blanket-convert all hfp cases to hfms [14:53:51] the client would have no basis for req.INM/IMS unless it had previously seen ETag/LM responses for the object [14:55:04] so the fact that we happen to see a response with a lack of ETag/LM (which may well be a transient error + 200 OK in the api case), doesn't necessarily mean that the normal responses don't have them, or that the client isn't looking for it with INM/IMS, which might eventually succeed [14:56:09] on the other hand, it's hard to rely on the client's presence of INM/IMS as an indicator either, it's not reliable. 
[14:56:10] on the other hand, lack of INM/IMS on one request does not mean that the next one also won't be conditional [14:56:15] heh [14:56:28] especially if the hfp/hfm is shared between many clients, some of which may be "new" [14:57:49] I think we really need to distinguish between the two cases: 1) actual uncacheable responses, which will remain uncacheable for a long time 2) transient uncacheable responses, which might turn cacheable soonish [14:58:04] I guess an interesting question we could artificially test would be: if we create an hfm, and then hit it on a future request and generate a true miss, and then observe the applayer responding with ETag and/or LM and try to generate an HFP, does the HFP object get created and replace the HFM for future requests [14:58:43] or does something about the nature of the transient busyobj for hfm make hfp creation not work right in that scenario (upconvert hfm->hfp) [14:59:35] I know hfp doesn't create hfp (we hit an hfp object, and then see the same conditions that cause VCL to try to create another hfp object, I don't think it actually replaces the original one) [15:00:06] but hopefully hfm can upconvert to persistent hfp, that would solve a lot of the mess above. [15:00:28] (the transient response with no LM/Etag can create hfm, and it will get fixed promptly back to HFP when warranted by backend response headers) [15:02:56] this stuff gets complicated very quickly for how little code diff is involved heh [15:08:58] having stats on conditional requests for uncacheable objects would be interesting too [15:09:42] 10Traffic, 10Operations, 10User-zeljkofilipin: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4084655 (10Ragesoss) I think the new behavior isn't a significant problem; it'll only really be an issue for anyone who was relying... 
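[editor's note] A toy model (plain Python, not the actual VCL under discussion) of the decision being converged on above: uncacheable responses that carry validators get hit-for-pass so client/applayer 304s keep working, everything else gets a short hit-for-miss, and the log later confirms (P6905) that an hfm can be upconverted to hfp once validators appear. The TTL values here are illustrative placeholders, not the tuning being debated.

```python
# Illustrative lifetimes only -- the right values are exactly what the
# conversation above is trying to work out.
HFP_TTL = 601
HFM_TTL = 60

def uncacheable_strategy(status, headers):
    """Pick hfp vs hfm for a response deemed uncacheable.

    - 5xx: transient by assumption, so a short hfm (avoid a sticky hfp).
    - Validators present (ETag / Last-Modified): hfp, which passes the
      client's INM/IMS through and preserves client<->applayer 304s.
    - No validators: hfm, which acts like a miss (conditionals stripped)
      but can later upconvert to hfp when a validator-bearing response
      shows up.
    """
    if status >= 500:
        return ("hfm", HFM_TTL)
    if "ETag" in headers or "Last-Modified" in headers:
        return ("hfp", HFP_TTL)
    return ("hfm", HFM_TTL)
```

So the MW-Api-Error+200 case noted earlier (no LM/ETag) falls into the hfm branch, as observed in the log.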
[15:17:33] https://phabricator.wikimedia.org/P6904 [15:18:00] ^ this is more what I'm currently thinking, but there's still some edge-cases to think about mentioned in all the sprawling text above [15:19:44] (and how it blends with the rest of that sub, e.g. the currently early ttl=0 for uncacheable influencing some of the code between for 4xx and ttl-swizzling. [15:19:56] and how it blends with other cluster-specific VCL in the area) [15:21:18] separately, whatever approach we go with here which causes more hfms, we should find the right hack to fix miss/pass stats [15:21:32] (e.g. hook into vcl_hit+vcl_miss in a way that hitting an hfm counts as a "pass") [15:23:10] (the aim being that hit+miss counters should represent algorithmic/resource efficiency of attempts to cache cacheable content, and "pass" should be the bin for things we had no hope of caching) [15:31:27] there are so many other complex pieces of vcl interaction here [15:31:43] like the n-hit-wonder code in text-frontend [15:32:15] and related bits in the upload case [15:32:55] we should look at getting rid of upload-common's: [15:32:56] // Debugging T144257. Don't cache 200 responses with CL:0. [15:32:56] if (beresp.http.Content-Length == "0" && beresp.status == 200) { [15:32:57] T144257: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257 [15:33:12] we're probably long-past that problem in various ways and/or solving the rest of it now [15:34:47] but anyways, the basic things I'm trying to aim at with the strawperson in https://phabricator.wikimedia.org/P6904 are: [15:35:12] 1) Stop confusing truly-nocache-headered outputs with happen-to-be-ttl<=0 outputs, they're different cases. 
2) Assume (but should verify first) that hfm can upgrade to hfp and thus we can rely on resp.ETag/LM detection there to do that job [15:36:09] 3) Use shorter timers in both cases, just long enough with grace to cover avoiding stalls on realistic fetch timeouts [15:37:14] (maybe another thing to verify: do hits through hfp and/or hfm refresh their own timers when they re-invoke the same logic?) [15:37:20] (I think, they don't, but not 100% sure) [15:55:03] hfm -> hfp conversion is a thing https://phabricator.wikimedia.org/P6905 [16:01:22] hfm timer refresh is not a thing https://phabricator.wikimedia.org/P6906 [16:09:29] nice [16:11:16] grace also confirmed to be working w/ hfm, omitting https://phabricator.wikimedia.org/P6906$16 the test fails [16:11:44] (grace is 10s by default) [16:13:15] The other factors in hfp/hfm lifetimes are of course churn and the frequency of stalls [16:13:22] it's a tough problem! :) [16:14:01] even if we set lifetimes long enough to avoid explicit stalls in the common case, we could just be creating lots of churn constantly pointlessly recreating the same hfp/hfm objects over and over, faster than we have to. [16:14:33] I was thinking about the "long enough to avoid stalls" problem, and really it's less about the TTL and more about the grace, right? [16:15:19] whether the hfm/hfp was say 10s or 10m, the real problem (assuming it persists for the long term and keeps refreshing) is what happens when it expires and we begin fetching a new one. [16:15:31] if fetching the replacement takes >grace, we suffer a coalesce impact [16:15:52] I built openssl 1.1.0h packages, the current round of security fixes is negligible for our TLS termination (https://www.openssl.org/news/secadv/20180327.txt) , but there's also the usual ton of other bugfixes. ok to upgrade cp1008 for some tests? 
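[editor's note] The TTL-vs-grace point above can be sketched as a toy timeline model. This is an assumption-laden simplification, not Varnish's actual coalescing logic: once the object expires and a refetch starts, requests that arrive after grace runs out but before the fetch completes are the ones that coalesce.

```python
def stalled_requests(request_times, expiry, grace, fetch_time):
    """Toy model: the object expires at `expiry`, a single refetch starts
    then and completes at expiry + fetch_time. Requests inside the grace
    window get stale grace-hits; requests after grace but before the fetch
    finishes stall behind coalescing. Returns the stalled arrival times."""
    fetch_done = expiry + fetch_time
    return [t for t in request_times
            if expiry + grace <= t < fetch_done]
```

With grace=31 and a fetch that takes ~60s (the failure-point mentioned above), requests landing in the 31-60s window after expiry stall; with a fetch shorter than grace, nothing does.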
[16:16:19] 10Traffic, 10Operations, 10ops-codfw: cp2006, cp2010, cp2017: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4085174 (10ema) [16:16:20] moritzm: please do, will peek at the changes in depth in a bit [16:16:32] ok, will do [16:18:16] 10Traffic, 10Operations, 10ops-codfw: cp2006, cp2010, cp2017: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4076372 (10ema) Same issue on cp2017 today. Host depooled. ``` 6 | Sep-28-2015 | 20:10:59 | ECC Uncorr Err | Memory | Uncorrectable memory error ; OEM Ev... [16:22:37] longer ttl to reduce churn, long-enough-to-fetch grace to avoid coalescing ? [16:23:10] right, grace=31s if we have timeouts that cut off fetches closer to ~60s, is probably a failure-point [16:23:53] I'm not sure how we rationally evaluate the rest of it, re competing concerns for churn and excess object pileup and so-on. [16:25:29] reduced ttl in your proposal should make transient memory usage change somehow [16:25:36] (and I guess also the concern that sometimes an hfp is mistaken and shouldn't have happened, so limit the impact) [16:25:50] what's the transient cutoff again? [16:25:55] heh [16:26:30] on text it's 5G on the frontend [16:26:30] (I'm pretty confident an hfp won't degrade to an hfm, btw) [16:26:41] I mean time [16:26:43] 60s? [16:26:44] and 2G on the backend [16:26:55] ah timewise [16:27:21] shortlived, defaulting to 10s [16:27:34] ok [16:27:48] so even at ~ ttl=90+grace=31, we don't see transient buildup [16:28:12] although really that grace value should be higher as noted above [16:28:26] hmmm [16:28:40] but wasn't there something about hfp always going to transient regardless of ttl? 
[16:28:44] IIRC hfp ends up in transient regardless [16:28:46] not sure about hfm [16:29:19] I'd be surprised if hfm would differ [16:30:06] if (bo->uncacheable || lifetime < cache_param->shortlived) [16:30:06] stv = stv_transient; [16:31:23] tests with the openssl 1.1 package on pinkunicorn look fine, I'll upload to apt.wikimedia.org or does anyone want to run additional tests/reviews? [16:32:11] go ahead! we can control the rollout as we upgrade to double-check [16:33:27] ack [16:37:56] ema: yeah so, if we reduce the total hf[pm] lifetimes (ttl+grace), it should reduce transient memory a bit. Currently it's ~632s, we're aiming lower than that regardless I think. [16:41:08] bblack: cool. Should I amend my CR including your proposal or do you want to submit it as a separate CR? [16:41:53] 10Traffic, 10Analytics, 10Operations: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843#4085283 (10Nuria) [16:43:06] 10Traffic, 10Analytics, 10Operations: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843#4085301 (10Nuria) [16:43:36] 10Traffic, 10Analytics, 10Operations: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843#4085283 (10Nuria) Adding folks from traffic. [16:44:48] 10Traffic, 10Analytics, 10Operations: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843#4085324 (10Nuria) Seems like this is an "unofficial mirror". Looping legal [16:47:08] ema: amend for sure, but I think my proposal still has problems. 
we can keep iterating and thinking a bit though [16:55:55] 10Traffic, 10Analytics, 10Operations: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843#4085283 (10Jdlrobson) Note there is also https://ru.wiki.ng/ doing exactly the same thing (different host) [18:18:06] 10Traffic, 10Discovery, 10Maps, 10Maps-Sprint, and 3 others: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#4085808 (10Pnorman) a:03Pnorman Going to take another look at a couple of things here [22:00:46] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4086547 (10kchapman) @Tgr perhaps I was not as clear as I could have been. The other issue we see is there should be multiple RFCs broken out for that. Per... [22:52:45] 10Traffic, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4086666 (10Tgr) >>! In T66214#4086547, @kchapman wrote: > The other issue we see is there should be multiple RFCs broken out for that. Perhaps that means t... [23:06:02] 10Wikimedia-Apache-configuration, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Cleanup remaining WikipediaMobileFirefoxOS references - https://phabricator.wikimedia.org/T187850#4086715 (10demon) >>! In T187850#4076945, @bd808 wrote: > The cleanup command @chasemp used was `rm -fR /usr/local/lib/...