[10:19:43] HTTPS, Traffic, Operations, Project-Admins: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#2325850 (Danny_B)
[13:27:08] varnishtop is *so* insane
[13:28:17] :)
[13:28:25] not in a good way
[13:28:31] all those PURGEs are crazy
[13:28:38] and requests to ?action=history
[14:07:38] yup!
[14:09:21] https://phabricator.wikimedia.org/T133821 has my best current overview of the bigger-picture situation on PURGE
[14:09:53] it doesn't take into account even-more-recent increases in jobs like htmlCacheUpdate, but it's harder to even notice them now because the jobqueue is mostly ratelimiting them to a flattish rate.
[14:11:30] the explanations for PURGE rate increase listed there don't even try to account for things like user behavior (bad bot?) or wikidata-specifics with updating backlinks more aggressively
[14:11:37] there are lots of known causes without even going to that level
[14:12:43] given that most purges are pointless (it's actually a very small percentage of them that actually purge a live object)
[14:12:57] we could just update vhtcpd to ignore 75% of them randomly or something
[14:13:46] then the severity of loss would more-closely match the punishing workarounds for loss implemented elsewhere, and the world would be back in balance except for the excess raw network traffic :P
[14:14:37] or we could ask the MW Core team to investigate this...
[14:14:52] or I could spend weeks dumpster-diving in layers of PHP code
[14:21:57] anyways, on related notes, I should start moving forward again on TTL/cache-size stuff this week, since we have stats to stare at now and see the effects more-clearly
[14:22:35] there's the cache size bump for text frontends, pulling the per-layer TTL caps down to 7d, and arguably capping off all frontends at 1d at most.
[14:23:11] notably cache_upload with its amazing cache-hit stats has a 1h cap on its frontends
[14:23:41] (which could probably be raised to 1d and avoid some dc-local inter-cache traffic, but that's a whole other topic)
[14:24:55] context on the 7d thing: we've dropped the TTL caps on backend instances from 30d to 21d to 14d over the past few months with no real tragedy. I think 7d is probably the limit at which we really need to sort out how our grace-mode works first, though.
[14:25:08] that gives us a 7 day window to still have something useful in stale caches if we offline a DC for a while, for now.
[14:26:16] (sort out grace before going any lower than 7d with ungraced TTLs, I mean)
[14:26:33] (which implies we'll want to sort out backend healthchecks on LVS'd endpoints)
[14:27:22] but first: upsizing text frontend caches like we did for upload
[14:28:05] (and watching graph effects of the rolling restarts, and how recovery looks long-term afterwards on whether they absorb appreciably more traffic that would otherwise be local-backend or not)
[14:34:38] sorry, catching up -- the perf team could help us on the MW side of purges
[14:35:03] https://phab.wmfusercontent.org/file/data/5tnisdmyx6oacjri7i2y/PHID-FILE-73y2ytqgclzpiqp7zp4p/Screen_Shot_2016-04-07_at_7.47.28_PM.png
[14:35:16] so what happened on Jan 20th or thereabouts?
[14:36:59] paravoid: before that date, basically there was a massive ratelimit on jobqueue-based PURGEs that was implicit, because it only had one worker or something like that
[14:37:10] ah
[14:37:35] this is still pretty insane
[14:37:36] paravoid: so, several increases to the incoming rate of htmlCacheUpdate happened before that date, and on that date ori "fixed" the jobq indefinitely-lagging by giving them more queues/workers/whatever
[14:37:47] at which point the floodgates opened
[14:37:47] did it coalesce before?
[14:37:52] no idea!
[14:37:59] so look at this:
[14:38:00] https://www.wikidata.org/w/index.php?title=Q53805&action=history
[14:38:20] shortly after that, they did put some kind of explicit ratelimiting into the purge jobqueue stuff, which helped flatten out spikes
[14:38:35] which may now be hiding other update-rate issues on the incoming side, too
[14:39:10] for each of those edits, 17 wikis need to get purged
[14:39:18] plus mdot, 34 PURGEs
[14:39:29] I'm still operating slower than normal, I keep screwing up hotkeys from re-learning how my terminal/window-switching and cut/paste work, etc :)
[14:40:07] paravoid: mdot?
[14:40:15] en.wikipedia.org + en.m.wikipedia.org
[14:40:19] (for example)
[14:40:24] these were 138 edits
[14:40:31] so these are 4,692 PURGEs
[14:41:02] at least
[14:41:24] oh, plus the action=history ones
[14:41:35] plus wikidata themselves
[14:41:40] :)
[14:41:47] yeah, this is insane
[14:41:53] 138 edits over what kind of time window?
[14:42:04] 20 minutes I suppose
[14:42:06] https://www.wikidata.org/w/index.php?title=Q53805&offset=&limit=500&action=history
[14:42:17] that's just ONE article of that bot, though
[14:42:22] it's doing that en masse
[14:42:40] so sure, it's operationally-insane in the effect on PURGE
[14:42:40] and some Wikidata articles have different counts of back references than 17
[14:43:01] but from a broader view: this bot is probably doing something wikidata people think is useful
[14:43:16] how *should* we be handling this to let it do its job and not spam purge?
[14:43:31] https://www.wikidata.org/w/index.php?title=Special:Contributions/LandesfilmsammlungBot&offset=&limit=500&target=LandesfilmsammlungBot
[14:43:47] I don't understand why the bot needs to do 138 edits on the same article in a short period of time honestly
[14:44:06] maybe some wikidata peculiarity I don't understand
[14:44:21] well presumably it's not batching them because they need separate descriptions?
[14:44:41] e.g. "Added qualifier: point in time (P585):"
[14:44:41] P585 Somebody subscripted core deployment branches?!??!?!?!!!!???? - https://phabricator.wikimedia.org/P585
[14:45:17] well it could join the descriptions with nice commas I suppose :)
[14:45:20] probably the "right" answers to this kind of problem lie in the jobqueue handling of content updates, in the present
[14:45:42] I think it might be deeper than that
[14:45:49] maybe wikidata doesn't support multiple-field edits
[14:45:50] jobqueue should be de-duplicating this at several levels (for spam-updates to the same real original article, and also to coalesce backref updates)
[14:46:08] de-duplicating this with what time window though
[14:46:21] any time window would help, that's just tuning
[14:46:55] the point is that this bot is doing those operations in too large a time window :)
[14:46:56] ignoring how jobqueue works today, and the special problems of how some update jobs create other update jobs for backrefs, etc...
[14:47:12] possibly because it can't do edits in batches
[14:47:24] at the output side somewhere, jobqueue is emitting a serial stream of PURGEs for given URLs
[14:47:38] so at the output side:
[14:48:28] remind me, why do we need to purge the mdot separately?
[14:48:35] 1) every emission also tracks in a little cache/db the last PURGE emission timestamp for each URL. These should fall out after N-time has passed so it doesn't grow indefinitely.
[14:48:37] isn't it just a variance due to our X-Subdomain rewriting?
[14:49:42] 2) if we're about to emit a PURGE for something still in the cache/db thing (was already purged in the last N-time), hold/coalesce until the entry falls out of the little cache/db index due to N-time passing, then emit a single PURGE and update the DB again.
[14:50:29] so, a purge that's isolated always goes through immediately. if a URL is being PURGEd more often than 1 per N-time, it will get coalesced to only emit once per N-time after the first hit, until it quiets down
[14:50:50] even at N-time = 5 minutes, it would probably cut the rate dramatically
[14:51:35] but then we also know that somewhere in all of the inter-related code (I don't know where), we're intentionally re-emitting purges multiple times spaced out over time, to try to fix up loss/races
[14:51:57] so these 10k purges are not even 10k
[14:52:00] they're possibly 20k
[14:52:01] so we don't want this ratelimiter to affect that. we still want, after all that output-side coalescing, to do the second purge a minute later or whatever
[14:52:10] just for that single wikidata article, just for a 20min time period
[14:52:22] (I think they currently 3x the purges over time for the race/loss-coverage thing, but I have no idea where that code is)
[14:52:29] haha
[14:52:48] paravoid: m-dot is because we combine the mobile+text caches
[14:53:00] so 138 edits * 17 back-referenced wikis * 2 for desktop/mdot * 2 for article/action_history * 3 times for resiliency?
[14:53:13] 28,152
[14:53:30] before, when the caches were separate, the same text-domain purge would affect two different objects (text+mobile) in two different cache clusters. we re-wrote m-dot to desktop at the front edge in mobile.
[14:53:55] now, caches are combined, and we don't rewrite the hostname until we send the request to mediawiki, so that they can cache-differentiate on the real incoming hostname.
[14:54:00] so we need 2x purges
[14:54:10] ah
[14:54:19] we could Vary instead, no?
[14:54:32] probably not the best idea
[14:54:55] how come?
[14:55:18] it only fixes the purge duplication, it makes everything else worse IMHO
[14:55:25] mw already Varies on X-Subdomain anyway right now, right?
[14:55:31] but maybe it's not emitting it on desktop
[14:55:44] well, that's the first of many hints at wrong things
[14:56:07] I think MW routinely emits different Vary headers for the same object, which is horrible. I don't want to start asking them to do more Vary and make the problem worse.
[14:56:14] XKey?
[14:57:04] but also, we shouldn't have this X-Subdomain hack anyways. Varying on it just reinforces it rather than helps get rid of it.
[14:57:19] we only do X-Subdomain because apparently mediawiki can't just differentiate on hostname
[14:57:59] (in other words, MFE should be detecting m-dot hostnames rather than looking at X-Subdomain. That would be sane and normal)
[14:58:35] ema: yes, XKey is the answer to a lot of things when we get there. it at least cuts down the purge duplication.
[14:59:01] but of course it won't have any real effect on over-broad/too-fast purging of content in general.
[14:59:50] so are zerodot pages purged right now? :)
[15:01:35] nope
[15:01:52] but zerodot is supposed to be lightweight/legacy anyways. the vast majority use mdot
[15:02:03] I can see your arguments for MFE supporting Host
[15:02:12] it probably falls into the category of things we could/should remove, but there's no active zero coding team
[15:02:34] but otoh, conceptually, mdot/zerodot *is* a variance of the same page
[15:03:10] not really any more than any other factor
[15:03:21] we could strip the first path element of the URL, put that in a header, and vary on that, too :P
[15:04:03] it's a different hostname, and it emits different content for the same non-hostname portion of the URL
[15:04:21] that the content happens to contain some of the same underlying data is irrelevant for caching
[15:04:41] switching to Vary is just a poor-man's XKey at that point, if we know applayer things about how they're related to each other abstractly
[15:05:06] I can see arguments for both honestly
[15:05:54] Vary: Accept-Encoding is also because different content is emitted for the same URL -- the point is that it's just a slightly different version
[15:06:12] to me the tipping point in that argument is that with a complex application (or worse, several applications!), at least here, it's not generally true that the developers can manage Vary
[15:06:16] so yeah, you could conceptually call it a variance, or not -- it could go either way
[15:06:28] yeah, that's fair enough :)
[15:06:35] they don't get the idea that Vary can't Vary
[15:06:53] haha
[15:06:54] there's no one single place in all our global infrastructure that decides what URLs emit what Varys
[15:07:03] (at the app code level)
[15:07:21] ok, fair :)
[15:07:21] avoiding Vary and sticking to "unique URLs and unique" is simpler
[15:07:53] so according to your graph above
[15:07:57] Vary is useful, but IMHO only for certain kinds of cases, like Accept-Encoding (which we're doing for RB)
[15:08:13] (and normalizing it on input, too)
[15:08:14] PURGEs are essentially as much as, and possibly more than, our legitimate requests?
[15:09:02] paravoid: depends on your perspective. because they're multicasted. on any one cache, even frontend, real requests outnumber PURGEs. but graphs that aggregate across many caches count real requests once and count PURGEs X num_caches.
[15:09:34] right, good point :)
[15:09:47] another sad datapoint, though, is that most PURGEs are for things that aren't even in cache anyways.
[15:10:30] we can't really avoid that easily and efficiently, unless the applayer side had some tracking about whether a given article or URL was ever parsed in the past N days.
[15:10:37] (s/parsed/emitted?)
[15:11:22] and then we'd lose that on some memcache wipe and purge would stop flowing for days or something
[15:11:46] anyways, at least with xkey and jobqueue-stuff fixed, the rate would go down and we'd care less
[15:12:13] I don't think anyone sees it as worth investing in fixing jobqueue purges because the almighty eventbus is replacing it for this purpose.
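A minimal sketch of the per-URL coalescing window described above (hypothetical code and names, not the actual MediaWiki/jobqueue implementation; the 5-minute "N-time" window and the 3x resiliency factor are taken from the discussion, everything else is illustrative):

# Reference arithmetic from the discussion above:
# 138 edits * 17 back-referenced wikis * 2 hostnames (desktop + m-dot)
#           * 2 URLs (article + action=history) * 3 resiliency repeats
assert 138 * 17 * 2 * 2 * 3 == 28152

import time

class PurgeCoalescer:
    """Let isolated PURGEs through immediately; collapse repeats for the
    same URL within a window into a single deferred emission."""

    def __init__(self, window_seconds=300, emit=print):
        self.window = window_seconds   # "N-time" in the discussion
        self.last_emit = {}            # url -> timestamp of last emitted PURGE
        self.pending = {}              # url -> earliest time we may emit again
        self.emit = emit               # callback that actually sends the PURGE

    def purge(self, url, now=None):
        now = time.time() if now is None else now
        last = self.last_emit.get(url)
        if last is None or now - last >= self.window:
            # isolated purge: goes through immediately
            self.last_emit[url] = now
            self.emit(url)
        else:
            # already purged within the window: hold until the entry expires
            self.pending[url] = last + self.window

    def flush(self, now=None):
        # call periodically: emit one PURGE per URL whose hold has expired,
        # and drop last_emit entries older than the window so it can't grow forever
        now = time.time() if now is None else now
        for url, due in list(self.pending.items()):
            if now >= due:
                del self.pending[url]
                self.last_emit[url] = now
                self.emit(url)
        self.last_emit = {u: t for u, t in self.last_emit.items()
                          if now - t < self.window or u in self.pending}

Note this sketch deliberately ignores the intentional delayed purge-repeats mentioned above; those would need to bypass or sit behind the coalescer so the loss-coverage repeats still happen.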
[15:12:58] but if we're going to invest in debugging jobqueue'd purges, there are obvious pointers on what to look at, but I've not gone and found the offending code
[15:13:23] which are these points from the ticket:
[15:13:24] We know some of the causes were code changes trying to fix problems (2) and (3) by sending multiple delayed repeats of every purge request a short time later to try to paper over races and loss, which further multiplies the total rate.
[15:13:28] We suspect that when most of the purging was centralized through the JobQueue somewhere in this timeframe, that this probably also multiplied the purge rate due to JobQueue bugs repeating past purges that were already completed for no good reason.
[15:13:51] looking at those two issues + how/if coalescing works for jobqueue purges would probably lead to some easy reductions
[15:15:36] 1) fix the accidental-duplication bug on jobqueue purges (RB has seen this too and complained). 2) Find the code that does intentional delayed purge-repeats and "fix" it (make sure it only does one extra for now, in reasonable time?)
[15:16:31] 3) ratelimit purge of a given URL somehow before or after the previous point
[15:18:47] if we even knew where the intentional-repeat code was, we could theorize about its behavior better
[15:18:59] for all I know it does 10 total repeats and is responsible for the bulk of the problem
[15:19:13] but I suspect it only does 3
[15:21:28] I have a question about MediaWiki's setting of the Cache-Control header. I have 5 wikis and for 3 of them the header gets set to "s-maxage=10, must-revalidate, max-age=0", leading to intermittent stale Special:RecentChanges data, while for the other 2 wikis it gets set to "no-cache, no-store, max-age=0, must-revalidate" leading to proper RecentChanges results.
[15:22:27] This seems to have to do with the OutputPage.php function sendCacheControl() and whether $this->mEnableClientCache is true or false, but I can't seem to determine what would cause that value to differ.
[15:25:54] intermittently only up to 10 seconds stale? but either way, probably mediawiki developers are who you want to ask about the second part
[15:27:13] No as to the 10 seconds. Like right now I can keep refreshing RecentChanges on one of my wikis and see the latest changes being from May 23, 24, or 25.
[15:27:55] I figured this was a good place to start since I've had excellent help in the past here with respect to caching issues. :)
[15:28:57] By the way, I forgot an important point that I should mention: this is only a problem if no cookies are set for the domain. If any cookies are set, even just the Google Analytics ones, there is no problem.
[15:29:30] But since I've had multiple reports of this and many people do block cookies, I wanted to investigate.
[15:30:16] No matter what, though, for the 2 wikis that don't experience the problem, it doesn't matter if cookies are set or not.
[15:31:37] well cookies are probably preventing caching in general, as is the "no-cache, no-store, max-age=0, must-revalidate" on the working ones
[15:32:27] on the stale views in browsers, is it just stale in browser cache, or is it hitting varnish (and if so, is it a stale 200 or a stale 304)?
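One generic way to check this kind of thing from outside any browser is to fetch the page once, print the caching-related response headers, and then refetch conditionally to see whether a revalidation comes back as a 304. The sketch below is not specific to these wikis; the URL is a placeholder and some of the listed headers may simply not be present:

import urllib.request, urllib.error

# Hypothetical URL; substitute one of the affected wikis' RecentChanges pages.
URL = "https://wiki.example.org/index.php?title=Special:RecentChanges"

def fetch(extra_headers=None):
    req = urllib.request.Request(URL, headers=extra_headers or {})
    try:
        resp = urllib.request.urlopen(req)
        status = resp.status
    except urllib.error.HTTPError as e:
        resp, status = e, e.code          # 304s (and errors) surface here
    print(status)
    for name in ("Cache-Control", "Expires", "Age", "ETag",
                 "Last-Modified", "X-Cache", "Vary"):
        print(f"  {name}: {resp.headers.get(name)}")
    return resp

first = fetch()                            # what the cache/origin hands out
cond = {}
if first.headers.get("ETag"):
    cond["If-None-Match"] = first.headers["ETag"]
if first.headers.get("Last-Modified"):
    cond["If-Modified-Since"] = first.headers["Last-Modified"]
if cond:
    fetch(cond)                            # does a conditional refetch get a 304?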
[15:33:44] note in this channel too:
[15:33:48] !log rolling restart of global text frontend memory caches for upsizing - 15 min spacing, ~8H to completion - T135384
[15:33:48] Not expecting to hear !log here
[15:33:48] T135384: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384
[15:36:56] bblack: Heh, we should change that message to "Wrong channel, master" :p
[15:38:39] or add !log to this channel somehow
[15:38:56] with echo back through to other !log channels
[15:39:09] I don't really know how we structure all that
[15:39:29] you'd think we could do !log locally in smaller channels like this, and I guess only -ops gets all the log-traffic?
[15:41:49] bblack: Hmm, refreshing the working wikis does show a 200 status but refreshing, including forced browser refresh, in the non-working wikis doesn't show it hitting varnish at all as there is nothing coming through varnishlog for the URL
[16:02:48] justinl: so, probably something about that Cache-Control header isn't forcing browsers to revalidate. Although, I would've thought "must-revalidate" would do that.
[16:03:02] justinl: but I have to re-read Cache-Control RFCs all the time, it's confusing
[16:05:47] justinl: a quick read of the RFCs does confirm that must-revalidate should cause the browser to at least hit a 304 in your varnish, if it considers the content stale. we assume the content's considered stale because you have max-age=0 + s-maxage=10, both of which are shorter than your super-stale pages
[16:06:06] justinl: perhaps there's some other indicator that's letting the browser think that page is un-stale, like Expires?
[16:07:06] seems strange that forced-refresh doesn't bypass browser cache
[16:09:19] bblack: I just checked and the Expires header is set for the 2 wikis that work (Thu, 01 Jan 1970 00:00:00 GMT) but Expires isn't set for the other 3. I'm mostly really wondering what would be causing the headers to be different when I don't have any settings of which I'm aware that should cause the wikis to behave differently in this regard.
[16:09:54] In theory I shouldn't have to make any custom changes for this stuff to Just Work Right.
[16:09:56] justinl: that part, I really don't know anything about, that's all on the MediaWiki side of things, that I'm not very well versed in
[16:10:30] That's cool, I can ask on the mediawiki-l list. :)
[16:10:42] I appreciate your help so far :)
[16:10:52] justinl: but regardless, it seems puzzling that: forced-refresh, lack of Expires, must-revalidate, and no maxages longer than 10s, on a very stale page, doesn't result in the browser asking varnish anything
[16:11:45] bblack: Yep. I am only checking with Chrome right now but the reports I've gotten from users indicate they've tried multiple browsers and seen the same behavior
[16:32:52] so far the data on the text frontend restarts (and I've dug somewhat deep) is that the bulk of the frontend hitrate recovers in under a minute
[16:33:14] and thus the ~15-minute spacing pretty much puts all graph effects into the category of "difficult to distinguish from background noise"
[16:33:38] we see a small shift from hit-front to hit-local trailing a cache daemon restart, but it's small
[16:34:28] when varnish-frontend restarts, varnishxcache exits and restarts to reconnect to the shm log. it's only sending stats every 30s, so there's a 30-60 second gap in the sent stats (last from old daemon, first from new)
[16:34:57] Traffic, Varnish, Operations, Performance-Team, and 2 others: Vary mobile HTML by connection speed - https://phabricator.wikimedia.org/T119798#2327099 (Jdlrobson) Should we decline this task @ori? Doesn't feel like this will happen since we'd prefer to give the same experience for all.
[16:35:15] but the hit-front rate recovers to ~85% of its original value by the time the first new stats packet is sent
[16:35:34] (confirmed by stracing varnishxcache's sendto() to graphite, and looping to catch the process-restart)
[16:36:35] it's still a little strange to me, but it's not impossible either if the bulk of our hits are very hot hits and coalescing works right, etc
[16:37:29] under a minute, that's interesting
[16:38:53] it kind of makes sense, if you assume the usual distributions of things. certain objects are probably way way hotter than others. and those very-hot objects are going to coalesce on a single backend miss and then hit for the other stacked up clients.
[16:40:18] all of that probably happens for a very large chunk of traffic (the hottest global objects) in a very short span of time. and since the misses are using a backend cache, it's not like they have to take much perf hit refreshing those objects either.
[16:41:54] e.g. even if 1/3 of our total traffic was a constant spam of requests from various clients for /wiki/Popular, on frontend-restart the first new such request does fe-miss + be-hit, all the rest very briefly queue in the frontend waiting on that miss to the backend to complete, and then return hit from there on in the usual pattern
[16:42:12] so even from the first moment, for the 1/3 of total traffic that URL represents, we'll only see one frontend miss
[16:43:27] if we assume there's some kind of reasonably-expected distribution of object-hotness, a lot of the traffic is going to be concentrated in the top-N URLs and N doesn't have to be very big for those all to refresh into hits on the first miss in the first several seconds.
[16:44:53] stepping away from "what are the effects of frontend restarts/wipes" and back to the level of cache tuning in the long run:
[16:46:01] 1. Increasing frontend memory probably only has a modest effect on frontend hitrate or stack-wide hitrate. It's still nice-to-have, though, because otherwise that memory's wasted buffer-caching fast SSDs for backend instances that nominally don't hit-answer nearly as much traffic.
[16:46:44] 2. We still need all the backend layers to catch the long tail and make the frontend restart have little-to-no impact on apps
[16:47:15] 3. From the long-term view on cp3048 (which was raised to 25% some time ago now), it seems like even 25% is borderline as high as we can go without jemalloc tuning.
[16:47:44] maps+misc are at 50%, but probably never see enough live objects in parallel to really use it all anyways. Should probably drop them back to 25% just for safety though.
[16:48:09] at 25% the virtual memory size of the frontend ends up getting to be a very large fraction of total memory, due to various overhead/fragmentation.
[16:49:12] we could tune that up with jemalloc tuning, but it's probably not worth it. growing it from 25% to say 50% isn't going to buy us some huge increase in hit-front rate; we're clearly already in long-tail territory even in the frontends.
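A toy simulation of the object-hotness argument above, using an assumed Zipf-like popularity distribution over made-up numbers (1M URLs, 500k requests), not measured data. It shows how much traffic the top-N URLs carry and what the hit rate of a just-emptied cache looks like when each object costs only one miss to refill:

import random

N_OBJECTS, N_REQUESTS = 1_000_000, 500_000
weights = [1.0 / r for r in range(1, N_OBJECTS + 1)]   # rank r gets weight 1/r
total = sum(weights)

for top_n in (100, 1_000, 10_000):
    share = sum(weights[:top_n]) / total
    print(f"top {top_n} URLs carry ~{share:.0%} of requests")

random.seed(0)
seen, misses = set(), 0
for obj in random.choices(range(N_OBJECTS), weights=weights, k=N_REQUESTS):
    if obj not in seen:
        misses += 1           # first request after the wipe is the only miss
        seen.add(obj)
print(f"hit rate over the first {N_REQUESTS} requests after a wipe: "
      f"{1 - misses / N_REQUESTS:.0%}")

Under those assumed parameters the hottest few thousand URLs account for well over half the requests, which is consistent with the frontend hit rate recovering within the first stats interval.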
[16:49:58] Wikimedia-Apache-configuration, Patch-For-Review: create pk.wikimedia.org and redirect to wikimediapakistan.org - https://phabricator.wikimedia.org/T56780#2327177 (Dzahn) Hi @80686 i assume you are the Manuel from WMCH, is that right? If so, could you take a look at http://wikimediapakistan.org? It d...
[16:50:19] Wikimedia-Apache-configuration, Patch-For-Review: create pk.wikimedia.org and redirect to wikimediapakistan.org - https://phabricator.wikimedia.org/T56780#2327179 (Dereckson) @80686 The domain wikimediapakistan.org still worked at 11:17 UTC this Wednesday, but doesn't work currently. Furthermore, a SSL...
[16:51:12] in the cp3048 case, it's got 264G of RAM, and at 25% varnish configures frontend mem as 63G, and we end up with a resident size of ~80G, but a virtual memory size of ~192G
[16:51:26] Traffic, Varnish, Operations, Performance-Team, and 2 others: Vary mobile HTML by connection speed - https://phabricator.wikimedia.org/T119798#2327193 (atgo) @jdlrobson - please forgive me if I'm missing the discussion, but what's the rationale for providing the same experience for all? People co...
[16:51:56] the resident size has been stable for quite a while in the ~80G range while the virtual keeps slowly growing so far
[16:52:14] so I assume the excess virtual is mostly lost to jemalloc fragmentation and other assorted overhead
[16:53:30] although now that I say that out loud, it's still puzzling
[16:53:48] frontend malloc should be just malloc, not backed by mmap on FS or anything. and we have no swap configured
[16:54:04] so where's the other ~80G of lost address space if it's not resident in memory?
[16:54:28] sorry, "where's the other ~112G of lost address space" ...
[17:02:07] surely not transient storage?
[17:10:31] there is about 23G in mapped files, so that's part of the virt only
[17:10:45] mostly compiled VCL, we've seen that "leak" before from reloads over time
[17:14:15] ooh
[17:15:30] "figure out if varnish4 still has the problem that old VCL shared object continue to consume memory"
[17:15:44] (from T126206)
[17:15:44] T126206: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206
[17:16:37] Wikimedia-Apache-configuration, Patch-For-Review: create pk.wikimedia.org and redirect to wikimediapakistan.org - https://phabricator.wikimedia.org/T56780#2327270 (Dzahn) LE would be great, if you follow https://letsencrypt.org/getting-started/ with the new "certbot" client it should not take long, it e...
[17:18:44] yeah but even if I take out that 23G of various mapped files, that still leaves us with ~90G of unexplained virt
[17:19:06] which probably means I'm misunderstanding something I'm looking at in various tools and /proc entries, etc
[17:20:38] status file in /proc/nnnn/ is pretty clear though
[17:20:40] VmSize: 206271736 kB
[17:20:47] VmRSS: 89062856 kB
[17:21:01] I got the 23G from grepping things with sizes out of "maps"
[17:23:47] is it possible the "missing" virtual memory that's not resident and not mapped, is just allocations that were never written to and thus not faulted to memory?
[17:27:09] maybe jemalloc keeps growing heap by large chunks (several pages), then using only part of that chunk, then deciding it can't fill it for some frag/size reason and moving on to another chunk, leaving a bunch of pages here and there in the midst of heap that it never writes to, or something.
[17:28:25] that's the only way I can really match up "this must be jemalloc waste" with "if it is, how is this waste (a) from malloc + (b) not resident in memory on (c) a machine with no swapspace"
[17:30:12] so in theory that could be pages which have been allocated but not used right?
[17:32:56] right
[17:33:04] here's another interesting datapoint
[17:33:23] root@cp3048:~# pmap -p 91232|egrep ' [0-9]{6,}K'
[17:33:23] 00007f7fbc000000 11931648K rw--- [ anon ]
[17:33:23] 00007f8296c00000 12500992K rw--- [ anon ]
[17:33:23] 00007f8593800000 7479296K rw--- [ anon ]
[17:33:23] 00007f8761c00000 172986368K rw--- [ anon ]
[17:33:25] 00007fb0a1fe0000 753792K rw--- [ anon ] total 206271736K
[17:33:48] those are all the 100MB+ chunks in the process map
[17:33:58] they add up to nearly the whole VM size
[17:34:45] we've got chunk sizes there of ~12G, ~12.5G, ~7.5G, ~172G, ~753MB
[17:36:40] so given we've clearly got a fully-utilized (in allocation sense) 63G "malloc" cache using jemalloc, the 172G one has to be what jemalloc is backing that with
[17:37:05] it's just wasting 2/3rd of the memory in some way that luckily doesn't make it resident because the pages are never touched
[17:40:09] https://info.varnish-software.com/blog/understanding-varnish-cache-memory-usage
[17:40:16] ^ has some stuff about jemalloc tuning
[17:41:09] the last time we paid attention much to jemalloc, was long ago on precise, when we realized it was using a very-outdated copy of jemalloc (from in-tree, or from debian, don't remember which now), and fixed that
[17:41:29] I think at the time I wanted to try telling the varnish build to use system malloc and then linking in tcmalloc. I don't remember if we ever tried.
[17:42:30] heh from link above:
[17:42:32] The red line represents the total amount of resident memory Varnish Cache is using (approximately 45GB). When we compare this with the storage size (blue line, 40GB), fragmentation waste is right around 10%. We also see very high virtual memory utilization (yellow line), which is mainly comprised of untouched zero pages. This explains why it goes above the physical memory limit (grey line).
[17:42:41] so yes, some of jemalloc's waste comes in the form of untouched pages
[17:46:19] I don't know if it's worth really trying too hard to fix it, though
[17:47:06] we can just keep our total malloc size down to a reasonable size, not care too much about the zero-page waste, and still have more than enough memory to take on all the hottest objects in the frontend
[17:49:30] if it's as simple as setting up malloc.conf like the link says, though, that's not much effort
[17:50:08] even if we don't go digging into cluster-specific object-size distributions, just upping lg_dirty_mult might help
[17:56:47] I reduced the frontend-restart spacing down to 5mins for the remainder. maybe it will let us see a little more statistical impact
[17:57:02] (probably not much, but still, it's nice to explore the safe limits)
[17:57:43] these rolling restarts are with the datacenters interleaved, too. so it's more like: cp4xxx, 5 min sleep, cp3xxx, 5 min sleep, cp2xxx, 5 min sleep, cp1xxx, 5 min sleep, repeat
[17:58:00] we end up still with ~20m between two frontends cycling in a single DC
[17:59:16] (and then with 8x caches for text in each DC, it's only 12.5% affected on each too... but ulsfo only has 6, it's the oddball with fewer total cache machines)
[18:26:56] logged in SAL, but basically on cp3048 I temporarily added:
[18:26:57] Environment=MALLOC_CONF=lg_dirty_mult:6
[18:27:13] and restarted the frontend, and puppet will put it back to normal for future restarts
[18:27:59] the default is "3", the varnish blog was talking about "8". the effect is 2^N:1 ratio allowed, so 3 -> 6 takes us from 8:1 to 64:1, 8 would take us to 256:1
[18:28:14] we'll see if it has some notable effect on resident size or cpu utilization at 6
[18:28:29] (over days to rebuild and exercise the large cache)
[18:28:38] s/resident/virtual/ above
[18:31:14] on the text-frontend restarts, even with the 5m spacing there's no appreciable impact visually on the varnish-caching graphs
[18:31:42] for future restarts (maybe for applying lg_dirty_multi more-broadly), we can try going even lower on the spacing until we see some real impact
[18:57:52] on the object-sizing front: we can roughly estimate that the average body size on text is ~5KB and on upload it's ~20KB
[18:58:26] (based on (s_bodybytes + s_hdrbytes) / s_reqs)
[18:58:53] that's a very very loose approximation, and includes probably a bunch of responses that aren't cache objects (passes, synths for redirect/beacon, etc)
[18:59:08] and includes a bunch of synthesized headers per response that wouldn't be in the object either
[18:59:39] still, it gives us some ballpark notion that the default jemalloc chunk size is probably bigger than it should be
[19:01:30] lg_chunk defaults vary with jemalloc version, but it's either 2MB or 4MB. We could experiment with dropping it conservatively to, say, 256K or 512K.
[19:01:45] (for upload)
[19:02:02] for text, maybe even as low as 64K?
[19:05:15] with an avg body size of ~5K that should be more than enough
[19:05:33] yeah, it's just that there's a lot of iffy-ness in that average body size calculation
[19:06:43] 304 responses with zero side come to mind too heh
[19:06:48] s/side/size/
[19:08:00] right
[19:08:42] I mean perhaps we could just ignore non-2xx responses when calculating the average size?
[19:10:05] I just took it from varnishstat output, which is over all requests
[19:10:38] probably the right way to look at average object size is to only count 2xx responses that aren't synthetic and do have a body length
[19:11:03] we could get that from varnishlog and grafana-ize it :)
[19:11:51] but maybe that's getting too deep for the importance level of this problem
[19:12:32] just dropping it to 512K (or maybe even 1MB) on upload and observing long-term effect on memory stats would be enough to tell us (a) if we're on the right track and (b) if we've gone far enough we don't care anymore
[19:13:06] and text maybe try something in the 128K-ish range similarly
[19:27:27] really this whole notion that average response size is an indicator is tricky to begin with
[19:27:54] even if we assume we filter down to full-response-length cache hits (which is tricky on upload with Range)
[19:28:20] we're still seeing the distribution of cache object sizes in responses, which is different than the distribution of sizes in the cache's storage
[19:29:07] e.g. if your storage typically had 300 objects in it, and the hottest object is 1MB in length and accounts for 10% of all requests...
[19:29:31] that 1MB size counts for 10% of the weight in the avg response size calc, but only ~0.3% of the weight in the average object size in cache
[19:30:16] and probably the latter matters more for jemalloc tuning than the former
[19:34:47] FTR - both our varnish3 and 4 daemons are definitely linking external jemalloc from jessie (good), and jessie's version defaults to lg_chunk:22 (4MB). later jemalloc drops this to 2MB as the default.
[19:36:41] on cp3048 so far (early to tell, but reading tea leaves in top so far) it seems like lg_dirty_multi:6 had a minimal impact (but it's there) on virtual size growth, and avg cpu load went up notably (but not horribly)
[19:37:14] I keep saying lg_dirty_multi, it's lg_dirty_mult and it is configured right in the env var :)
[21:34:47] Traffic, Operations, Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2328291 (BBlack) `lg_dirty_mult:6` doesn't seem to be making a very big impact on its own so far, so I'm changing the cp3048 experiment to `lg_dirty_mult:6,lg_chunk:20` (...
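A back-of-the-envelope check of the request-weighted vs storage-weighted point above, plus the lg_chunk arithmetic from the final comment. The 300-object / 1MB / 10%-of-requests figures come from the example in the discussion; the ~5KB size and 90% share for the remaining objects are assumed purely for illustration:

n_objects = 300
hot_size, hot_req_share = 1_000_000, 0.10     # the one hot 1MB object
cold_size, cold_req_share = 5_000, 0.90       # assumed for the other 299 objects

request_weighted_avg = hot_req_share * hot_size + cold_req_share * cold_size
storage_avg = (hot_size + (n_objects - 1) * cold_size) / n_objects

print(f"request-weighted average response size: {request_weighted_avg / 1000:.1f} KB")
print(f"average object size actually in storage: {storage_avg / 1000:.1f} KB")
print(f"the 1MB object is 1/{n_objects} = {1 / n_objects:.1%} of stored objects")

# jemalloc's lg_chunk is a power-of-two exponent:
for lg_chunk in (22, 20, 17):                 # 4MB (jessie default), 1MB, 128KB
    print(f"lg_chunk:{lg_chunk} -> {2 ** lg_chunk // 1024} KB chunks")

With these assumed numbers the request-weighted average comes out around 100KB while the storage average is under 10KB, which is the gap the discussion is pointing at when sizing lg_chunk.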