[08:01:45] bblack: I determined in my thesis that fairly popular articles are requested ~2 minutes apart. At least that is true for "Batman" and "Wikipedia" as articles. A grace time of 10 minutes should easily be enough to work for most articles that are at least fairly popular
[08:02:49] the question is: do we need a grace dependent on the popularity, or rather something that hits most of the time except for the most unpopular objects?
[08:27:49] Snorri: Sometimes even defining terminology is hard for questions like these heh
[08:30:12] Hehe yeah I guess so. Scaling the grace depending on the popularity might be nice. But setting a higher grace would not have a bad impact on the popular objects, and would increase performance on unpopular ones, right?
[08:30:40] Snorri: We can imagine if we were able to experiment on all possible values of grace "G", and measure the percentage of client fetches of cacheable content which suffer a true miss->fetch penalty (as opposed to getting an immediate cache object)...
[08:31:05] the more you raise G, the fewer miss->fetch penalties we see in the aggregate
[08:31:38] probably it makes a curve shape where, for our dataset and parameters, beyond a G value of X, there are diminishing returns and thus it's not worth it
[08:32:04] and if X also sounds reasonable to developers (in terms of excess effective TTLs), then X is our answer
[08:32:57] whereas if the X value that gives us all the realistic gains we can get is an unacceptably-high value, then we have to find some pragmatic tradeoff with developers who don't want their objects with 10m TTLs doing grace hits 3 hours later...
[08:33:21] I see. The question is... can X be determined mathematically? Is there time to test it extensively? Or do we have to guess based on the data we already have?
[08:33:40] intuitively, though, we don't expect a conflict here. probably X is under an hour, possibly well under.
[08:34:26] it would be nice to determine it mathematically. I'm not sure we have the right historical data to determine it from.
[08:34:42] worst case we take a good guess, and then experimentally move the number up and down a bit and see how our live stats change.
[08:38:32] personally, I find the whole way that varnish deals with this issue backwards, the whole "grace" concept
[08:39:20] to recap briefly: the idea is that if you just had a pure 1h cache object that's very hot, once an hour it expires and suddenly all these hot fetches stall out momentarily while we fetch a new copy from the appserver, then everything speeds up again for another hour.
[08:39:27] sadly I think my mathematical knowledge might not be enough. It looks like some stochastic analysis I do not know (yet), but I do have a friend I could ask about how best to try to determine this X. Maybe he can at least give a hint about which values we would need to determine at least a good X
[08:40:17] so they tack on these extra seconds of "grace" time after the real TTL expires, during which we'll still serve it as a cache hit, but any hit on it during that time triggers a background refresh which should refresh the object into storage before grace runs out completely. thus there is no periodic stall for clients.
[08:41:13] it's a desirable behavior, but now you're going to have to explain grace to the developer, too.
"I know your code said this object was cacheable for exactly 3600 seconds, but actually we could serve it when it's 3900 seconds old, because that's how we solved this perf issue"
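A rough Monte Carlo sketch (Python) of the "sweep G and measure the true-miss fraction" experiment described above. The exponential inter-arrival model, the parameter values, and the function names are illustrative assumptions, not measurements from this discussion; the toy model also assumes a grace hit always refreshes the object in time.

    import random

    def true_miss_fraction(mean_interarrival, ttl, grace, n_requests=100_000, seed=0):
        # Toy model: requests arrive with exponential inter-arrival times.
        # A request is a "true miss" (client stalls on a fetch) if the cached
        # copy is absent or older than ttl + grace; a request in the grace
        # window is served stale and assumed to refresh the object; anything
        # younger than ttl is an ordinary hit.
        rng = random.Random(seed)
        now, fetched_at, true_misses = 0.0, None, 0
        for _ in range(n_requests):
            now += rng.expovariate(1.0 / mean_interarrival)
            age = None if fetched_at is None else now - fetched_at
            if age is None or age > ttl + grace:
                true_misses += 1        # stall: fetch from the applayer
                fetched_at = now
            elif age > ttl:
                fetched_at = now        # grace hit: background refresh, no stall
        return true_misses / n_requests

    # Sweep G for an object requested ~every 2 minutes with a 1h TTL, looking
    # for the knee ("X") where further grace buys little:
    for g in (0, 30, 60, 120, 300, 600):
        print(g, true_miss_fraction(mean_interarrival=120, ttl=3600, grace=g))

In this toy model the true-miss fraction falls off roughly as exp(-G / inter-arrival time), so grace values much larger than the typical request spacing buy little, which is consistent with the intuition above that X is probably well under an hour for popular objects.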
"I know your code said this object was cacheable for exactly 3600 seconds, but actually we could serve it when it's 3900 seconds old, because that's how we solved this perf issue" [08:41:40] it would be better for the "grace" window to be within the TTL [08:42:17] so if you have obj.ttl=3600 and obj.grace=300, any hit after Age=3300 would trigger a background refresh, and the 3600 limit is never violated. [08:42:40] Yeah. It seems like a very convoluted way to try to solve this performance issue [08:42:55] you could call it pre-fetching instead of serving slightly-stale content [08:43:25] we could actually make it work that way, if we manipulate CC/SC headers and such [08:43:49] app sends s-maxage=3600, we desire grace=300, so we re-set s-maxage to 3300 + obj.grace=300 [08:44:32] I think this would be preferable to do. With a way to "pre-fetch" we might also be able to fetch just modified and therefore purged content. [08:44:58] In my data this was the real reason for misses. The TTL expiring happened only with very unpopular objects actually [08:45:25] hello, as FYI the Analytics team deployed the new data consistency checks for the varnishkafka sequence numbers. Now we have one alarm for "holes" in sequence numbers not caused by incomplete records (like dt:-) and one for incomplete records (but with a different threshold). Last step is to make the data available via Pivot (it is already available via Hive on Hadoop) [08:48:26] to be honest. I´m not sure a Grace value will bring a lot of performance. Unpopular objects are kept for a very long time. (with 1 request every 6h I was able to keep something in the cache until the TTL expired!) [08:49:02] Snorri: lack of a grace value certainly has a negative performance impact [08:49:19] we have one today, which is commonly 5 minutes [08:49:24] On the other hand...the more popular an object was, the more often it had a low age as it was purged due to an modification. [08:49:29] well, I think always 5 minutes [08:49:48] yeah the purge issue is a whole other issue [08:49:55] our purging is downright stupid currently [08:50:34] we're purging at a much higher rate than actual edits happen, and a lot of the purges are pointless and/or redundant [08:50:38] bblack: Yeah I do think it helps. I´m just saying in the cases where it really come in handy (the hot objects) a purge is a lot more likely than ttl expiring. At least with the 1d ttl right now [08:50:40] (orders of magnitude higher) [08:51:27] a lot of the purging is driven by inter-linked stuff. someone edits X, and there are all these related things through templates, categories, whatever, that get purged along the way [08:51:47] but in many cases, the fallout of the original change caused no actual content change in the thousands of related things it caused to be purged [08:52:16] or if it did, the difference was not important and the purge could've been soft instead of hard [08:52:37] (a soft purge basically puts the object into grace-mode immediately, but doesn't actually purge out the object) [08:53:14] on top of those sorts of logical issues, purges are then also issued redundantly (intentionally), and then there's apparently bugs that caused accidental reprocessing of purges too [08:53:24] Okay. 
[08:54:16] it is, it's been bitched about for about a year now (when it first started getting bad), but no real progress
[08:54:43] we're purging objects at a rate of ~2000/sec
[08:54:53] actual article content edits are more on the order of ~10/sec
[08:55:14] ...
[08:55:35] there's something seriously wrong there, but it's in complex mediawiki-related things outside of my scope
[08:55:37] Orders of magnitude indeed!
[08:55:40] all I can do is complain
[08:56:15] we know some of the presently-unavoidable multipliers, though
[08:56:26] each purge is issued 3x in an attempt to fix race conditions
[08:56:48] and commonly purges of a single "article" emit 4x purges, for the desktop+mobile variants and ?action=history of both
[08:57:04] so that would turn ~10/sec into ~120/sec
[08:57:09] but not 2000
[08:57:41] <_joe_> bblack: did anyone even understand where those purges are coming from?
[08:58:09] <_joe_> bblack: btw, is it ok to try to clean up some of the purge backlog on commons if I keep in sync with ema when he arrives?
[09:01:01] _joe_: not really. lots of handwaving about template edits (which is probably it), but not a solid explanation as to why it has to be this way. We know from RB folks looking at similar things that many of the generated invalidations result in a re-parse to identical HTML (no-op change/purge)
[09:01:15] I have no idea what you're talking about with the purge backlog on commons?
[09:01:20] I'd love to find out where those purges come from. But with you guys not really knowing how... I feel a bit intimidated
[09:01:33] jobqueue backlog?
[09:01:47] <_joe_> bblack: bots have been removing a very popular category from pages on commons, which created a jobqueue backlog
[09:01:54] ok
[09:01:57] <_joe_> because we do throttle htmlCacheUpdate jobs
[09:02:02] <_joe_> at the mediawiki level
[09:02:20] <_joe_> so to clean that up, I'd need to make periodic runs of runjobs.php
[09:02:31] <_joe_> with --nothrottle
[09:03:16] that's probably not a great idea, to let it run unthrottled. although I don't really know exactly how unthrottled "unthrottled" is
[09:03:41] it won't ultimately hurt varnish too bad, but you may cause a lot of those purges to be ineffective
[09:03:42] <_joe_> that means if I ask it to run 100 jobs, it does it as fast as possible
[09:03:47] <_joe_> we can make some tests
[09:04:06] because the transmission of the purges is lossy, and the odds of loss go way up if you burst them out at high rates
[09:04:09] <_joe_> we definitely need a way to catch up with that backlog in an effective way
[09:04:26] <_joe_> yeah that's why I'm not thinking of doing big batches
[09:04:30] or just wipe it out and let the natural ~1day cache rollover take care of actual content updates
[09:04:42] <_joe_> it's ~1 day now on pages?
[09:04:51] I thought we were talking about commons images?
[09:05:03] <_joe_> no, it's the associated File: pages
[09:05:21] heh
[09:05:38] in any case, the burst->loss thing is on very small scales
[09:05:47] <_joe_> I thought the cache on those was larger
[09:05:52] <_joe_> bblack: sigh
[09:05:53] the two primary ways you're going to lose purges from bursting are socket buffers and vhtcpd overflows
[09:06:19] the outbound socket buffer on whatever sends them (queue runner host?) will drop outbound packets if you send them too fast
[09:06:45] the inbound socket buffers on all the caches will drop them if they come in faster than vhtcpd can read them (but it's C and it tries very hard to read them very quickly)
[09:07:11] <_joe_> yeah I am aware of the theory
[09:07:24] <_joe_> I am not experienced with what rates are actually sustainable
[09:07:24] whatever vhtcpd can slurp, it buffers in host memory, but there is a limit to how much it will buffer as it spools them out to varnish. if the buffer reaches the limit, the whole buffer is wiped clean and restarted to continue picking up new purges
[09:07:34] <_joe_> for text caches, nonetheless
[09:07:48] text caches are the only case that really matters, it's where all the purge problems are
[09:08:00] <_joe_> I know :/
[09:08:03] nobody knows really, but I have some anecdotes
[09:08:51] we wouldn't ever know about socket buffer loss
[09:08:55] <_joe_> I'll take a look at raising the throttling on commonswiki specifically a tiny bit first, and see if that makes a difference
[09:09:08] but we do get to see it when vhtcpd overflows its buffer and self-wipes
[09:09:26] before the jobqueue throttling was put in, it was doing that quite regularly
[09:09:30] <_joe_> I'll talk with ema about this when he's around too
[09:11:08] if you want to see vhtcpd's view of things... well, in theory it's in ganglia, but usually the data's all missing because, well, ganglia
[09:11:15] but you can view it raw as:
[09:11:19] salt -v -t 5 -b 101 -G cluster:cache_text --out=raw cmd.run 'cat /tmp/vhtcpd.stats'
[09:11:30] which gives lines like:
[09:11:33] {'cp1068.eqiad.wmnet': 'start:1479307491 uptime:671129 inpkts_recvd:1629316111 inpkts_sane:1629316111 inpkts_enqueued:1629316111 inpkts_dequeued:1629278151 queue_overflows:0 queue_size:37960 queue_max_size:3288371'}
[09:11:40] vhtcpd updates that file every 15s or so IIRC
[09:11:57] <_joe_> do we collect those metrics?
[09:12:19] yes, in ganglia. except it's usually broken because something or other needed restarting to make ganglia pick up metrics again and nobody ever notices
[09:12:48] but the point is, you can look at the raw outputs in salt and check for rising queue sizes, and especially overflows
[09:12:55] right now none of them have overflows since their last restarts
[09:13:06] <_joe_> bblack: ok, cool
[09:13:06] (except cp1008, but it's a slow crap test box)
[09:13:40] <_joe_> I guess I could've found this on wikitech, and it's late night there
[09:13:51] I doubt it's even there heh
[09:14:18] sounds like something to get going for prometheus -> grafana
[09:14:41] <_joe_> it is: https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging
[09:15:01] <_joe_> a bit hidden, but it was there
[09:15:25] cool
[09:16:20] I'm only here because I'm baking pies heh
[09:16:37] <_joe_> eheh
[09:16:42] <_joe_> pumpkin pie?
[09:16:51] <_joe_> oh man, I want one!
[09:16:55] I keep having to wait on various timers to go off in the pie-making process, and I'm up all night baking pies, and I chat to avoid falling asleep between doing things
[09:17:23] yes, pumpkin pies. I have to make four of them for two different thanksgiving dinners, and I'm very slow at these kinds of things heh
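A small Python sketch for reading the raw /tmp/vhtcpd.stats output shown above and flagging purge loss. The counter names come from the sample line in this log; the function names and the checking logic itself are illustrative assumptions, not an existing tool.

    import re

    def parse_vhtcpd_stats(line):
        # Turn 'start:1479307491 uptime:671129 ... queue_overflows:0 ...'
        # into a dict of integer counters.
        return {k: int(v) for k, v in re.findall(r'(\w+):(\d+)', line)}

    def check(stats):
        # Per the discussion: a nonzero queue_overflows counter means vhtcpd
        # wiped its buffer at least once since its last restart, i.e. purges
        # were lost; otherwise watch for a rising queue_size.
        if stats.get('queue_overflows', 0) > 0:
            return 'overflowed %d time(s): purges were lost' % stats['queue_overflows']
        return 'no overflows; current queue_size=%d' % stats.get('queue_size', 0)

    sample = ('start:1479307491 uptime:671129 inpkts_recvd:1629316111 '
              'inpkts_sane:1629316111 inpkts_enqueued:1629316111 '
              'inpkts_dequeued:1629278151 queue_overflows:0 '
              'queue_size:37960 queue_max_size:3288371')
    print(check(parse_vhtcpd_stats(sample)))   # -> no overflows; current queue_size=37960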
[09:18:04] my average time from start->finish on a pie is ~1.5h this evening
[09:18:09] but most of that is waiting
[09:20:23] this is about the time I wish I had just bought some pies at a store
[09:20:37] but I enjoy the process, too, of trying to figure out how to make a really good one
[09:21:33] the first year I tried, I made and threw away (well, ate some of and then threw away) ~4 pies before I started making ones I liked enough to bring to dinner heh.
[09:25:09] 10Wikimedia-Apache-configuration, 06Operations, 13Patch-For-Review: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176#2820282 (10Joe) 05Open>03Invalid
[09:25:11] well... that is commitment!
[09:27:15] https://goo.gl/photos/6wE1WLvGe6Qvk2yt6
[09:27:24] ^ that's the one that just came out of the oven
[09:27:36] the round bulge in the middle will eventually flatten out as it finishes cooling
[09:29:01] it certainly looks tasty!
[09:29:29] I wouldn't taste it, I'm sick right now. But I bet it is tasty.
[09:29:55] 10Wikimedia-Apache-configuration, 06Operations, 13Patch-For-Review: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176#2820296 (10Joe) This is a huge discussion to have, and would need a ton of auditing. Basically: - We return 200 for *...
[09:31:07] get better!
[09:31:41] <_joe_> Snorri: get better!
[09:31:54] <_joe_> and yeah it looks great AND tasty
[09:35:18] thanks!
[09:57:31] 10Wikimedia-Apache-configuration, 06Operations, 13Patch-For-Review: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176#2820406 (10elukey) 05Invalid>03Open
[10:16:30] 10Wikimedia-Apache-configuration, 06Operations, 13Patch-For-Review: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176#2820445 (10elukey) Re-opening the task after a chat with Joe, let's find a solution for this issue :) What are the sc...
[10:18:27] <_joe_> bblack, ema I am running htmlCacheUpdate for commons on terbium, checking if that causes overflows
[10:20:57] <_joe_> and, boy do we have purge spikes :/
[12:00:32] 10netops, 06Discovery, 06Operations, 06WMDE-Analytics-Engineering, and 3 others: Add firewall exception to get to wdqs*.codfw.wmnet:8888 from analytics cluster - https://phabricator.wikimedia.org/T146474#2820741 (10Addshore) @akosiaris is there any way to get this expedited? (mentioning you as you complete...
[14:09:11] 10Traffic, 06Operations: varnishapi.py AttributeError: VSM_Close - https://phabricator.wikimedia.org/T151561#2821095 (10ema)
[14:09:22] 10Traffic, 06Operations: varnishapi.py AttributeError: VSM_Close - https://phabricator.wikimedia.org/T151561#2821110 (10ema) p:05Triage>03Normal
[14:10:55] 10Traffic, 06Operations: varnishapi.py AttributeError: VSM_Close - https://phabricator.wikimedia.org/T151561#2821095 (10ema)
[14:26:06] 10Traffic, 06Operations: Varnishkafka and related VSM daemons seeing abandoned VSM logs on cp1055 - https://phabricator.wikimedia.org/T151563#2821164 (10elukey)
[14:29:52] 10Traffic, 06Operations: Varnishkafka and related VSM daemons seeing abandoned VSM logs on cp1055 - https://phabricator.wikimedia.org/T151563#2821164 (10ema) We might have to increase workspace_backend: https://github.com/varnishcache/varnish-cache/issues/1990
[14:33:43] 10Traffic, 06Operations: Varnishkafka and related VSM daemons seeing abandoned VSM logs - https://phabricator.wikimedia.org/T151563#2821211 (10elukey)
[15:04:35] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2821257 (10ema)
[15:04:38] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Post Varnish 4 migration cleanup - https://phabricator.wikimedia.org/T150660#2821256 (10ema) 05Open>03Resolved
[15:06:57] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Convert text cluster to Varnish 4 - https://phabricator.wikimedia.org/T131503#2821259 (10ema) 05Open>03Resolved a:03ema
[15:06:59] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2821261 (10ema)
[15:07:39] 10Traffic, 06Operations: Sideways Only-If-Cached on misses at a primary DC - https://phabricator.wikimedia.org/T142841#2821263 (10ema)
[15:07:42] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Install XKey vmod - https://phabricator.wikimedia.org/T122881#2821264 (10ema)
[15:07:45] 10Traffic, 10Varnish, 10MediaWiki-API, 06Operations: Evaluate the feasibility of cache invalidation for the action API - https://phabricator.wikimedia.org/T122867#2821265 (10ema)
[15:07:46] 07HTTPS, 10Traffic, 10Varnish, 06Operations, 05codfw-rollout: Outbound HTTPS for varnish backend instances - https://phabricator.wikimedia.org/T109325#2821267 (10ema)
[15:07:51] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2168822 (10ema) 05Open>03Resolved
[15:09:10] it's taskclosing day :)
[15:17:09] \o/
[15:18:03] 10netops, 06Discovery, 06Operations, 06WMDE-Analytics-Engineering, and 3 others: Add firewall exception to get to wdqs*.codfw.wmnet:8888 from analytics cluster - https://phabricator.wikimedia.org/T146474#2821272 (10akosiaris) 05Open>03Resolved a:03akosiaris This seems to have fallen between the crack...
[15:20:15] 10Traffic, 10Varnish, 06Operations, 13Patch-For-Review: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2821283 (10ema)
[15:20:50] 10netops, 06Discovery, 06Operations, 06WMDE-Analytics-Engineering, and 3 others: Add firewall exception to get to wdqs*.codfw.wmnet:8888 from analytics cluster - https://phabricator.wikimedia.org/T146474#2821284 (10Addshore) Thanks @akosiaris !
[15:22:29] 10Traffic, 06Operations, 13Patch-For-Review: Varnish4 is unexpectedly retrying certain applayer failure cases - https://phabricator.wikimedia.org/T150247#2821290 (10ema) We've proposed a patch introducing a varnishd parameter limiting the number of extrachance retries: https://github.com/varnishcache/varnish...
[15:23:25] 10Traffic, 06Operations, 13Patch-For-Review: varnish-backend: weekly cron restart for all clusters - https://phabricator.wikimedia.org/T149784#2821291 (10ema) 05Open>03Resolved
[15:26:56] 10Traffic, 06Analytics-Kanban, 06Operations: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms - https://phabricator.wikimedia.org/T148412#2821292 (10ema) 05Open>03Resolved >>! In T148412#2781980, @elukey wrote: > Last but not the least, no alarms were fired for uploa...
[18:13:56] 10Traffic, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2821720 (10Gilles) I think it might be just a little misunderstanding. If you're talking about describing to the client what kind of media the original is...
[19:05:25] 10Wikimedia-Apache-configuration, 06Discovery, 06Operations, 07Mobile, 13Patch-For-Review: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#2821887 (10Jdlrobson)
[19:05:56] 10Wikimedia-Apache-configuration, 06Discovery, 06Operations, 07Mobile, 13Patch-For-Review: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#730151 (10Jdlrobson) Given this would probably redirect to https://www.wikipedia.org/ probably something of con...
[20:04:44] ema, did you copy the vmod packages out of experimental in T150660?
[20:04:45] T150660: Post Varnish 4 migration cleanup - https://phabricator.wikimedia.org/T150660