[00:00:15] 10Wikimedia-Apache-configuration, 06Operations: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176#2359804 (10Dzahn) [06:43:54] elukey: \o/ [07:22:17] 10Traffic, 06Mobile-Apps, 06Operations, 13Patch-For-Review: alias /apple-app-site-association and /.well-known/apple-app-site-association - https://phabricator.wikimedia.org/T130647#2360091 (10jcrespo) 05Open>03Resolved a:03jcrespo Thank you https://grafana.wikimedia.org/dashboard/db/varnish-http-err... [08:28:57] 10Traffic, 06Operations: Scripts depending on varnishlog.py maxing out CPU usage on cache_misc - https://phabricator.wikimedia.org/T137114#2360167 (10ema) The issue is reproducible on hosts running vhtcpd (eg: cp1061). Normal CPU usage on hosts *not* running vhtcpd (eg: cp1058). [10:15:39] 10Traffic, 06Operations: Scripts depending on varnishlog.py maxing out CPU usage on cache_misc - https://phabricator.wikimedia.org/T137114#2360352 (10ema) So, besides the fact that we should check why vhtcpd is running on some misc nodes and not on others, this seems to be a scalability issue in the new varnis... [11:51:58] ema: yeah what you're saying about varnishlog.py makes sense [11:52:53] ema: so, vhtcpd running on random misc's is probably an oversight from their reconfiguration long ago from another cache type [11:53:44] ema: and the inefficiency is the same one we battled in varnishkafka I bet: the new varnish API has no coalescing sleep/delay [11:54:25] maps has vhtcpd, but it has a dedicated multicast address and they're not using it yet, so no traffic [11:54:39] the leftover ones on misc are getting the full PURGE blast from text/upload purges [11:54:54] right [11:55:50] in vk, if the varnish callback gave zero events, we do a 10ms sleep before checking again, and that fixed it there [11:55:57] probably varnishlog.py needs something similar [11:56:04] I'm afraid we do that already [11:56:19] I looked for it and didn't see it [11:56:52] oh varnishlog4.py, right [11:57:00] https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/files/varnishlog4.py#L82 [11:57:35] yep [11:57:56] hmmm [11:58:06] that being said, there might be a similar (same?) inefficiency in python-varnishapi [11:58:12] could just be that the python solutions are going to suck without a fast purge filter, too [11:58:29] or that [11:58:49] lunch, bbl [12:00:24] the docs for varnishapi use a loop like ours, but they set their delay at 500ms on zero events heh [12:01:08] (also, I'm surprised python2's time.sleep() uses select for fractional sleeps instead of nanosleep or usleep) [12:01:34] that may be a factor too [12:12:21] I donno, I'd confirm that the sleep is the issue a couple of different ways: 1) see if hacking time.sleep to 0.1 fixes it (which is a non-ideal solution, as shm will likely overflow in that time window on busy nodes) 2) implement a true nanosleep using ctypes and try 10ms sleeps using that instead of time.wait() [12:12:30] err instead of time.sleep() [13:38:45] bblack: looks like sleep is *not* the issue [13:39:20] regardless of the amount of time we sleep in varnishlog.py, CPU usage goes to 100% with > 1k requests per second [13:51:29] yeah...
in strace there's some kind of busywait going on inbetween our sleeps, too [13:51:41] maybe like you said earlier, we're missing the sleep at a deeper abstraction layer [13:56:32] yeah it has something to do with varnishapi's Dispatch logic [13:56:46] def Dispatch(self, cb, priv=None): [13:56:46] i = self.__cbMain(cb, priv) [13:56:46] if i > -2: [13:56:46] return i [13:56:46] if not self.vsm: [13:56:48] return i [13:56:51] self.lib.VSLQ_Flush(self.vslq, VSLQ_dispatch_f(self.__callBack), None) [13:56:54] self.lva.VSLQ_Delete(byref(cast(self.vslq, c_void_p))) [13:56:56] self.vslq = None [13:56:59] ^ that stuff and related [13:57:10] I think the sleep magic needs to be up there, rather than in varnishlog.py [14:00:41] ema: found it [14:00:50] \o/ [14:00:52] well I think [14:02:40] doing some more testing with a live hack on cp3010... [14:03:08] I've put a sleep(0.01) in dispatch and it seems to work [14:03:37] well, my hack is similar-ish, but it breaks things apparently [14:05:35] I'm never seeing varnishxcache actually send stats, which it should do periodically [14:06:02] bblack: that rings a bell, haven't we seen that before? [14:06:27] well we have a time-based thing in the outer script. if there are any stat increments and more than 30s has elapsed, send stats, basically [14:06:40] I think my sleep hack may be preventing that from ever happening [14:06:52] oh wait, I did finally see a send... [14:06:59] ok [14:07:30] I've removed the sleep from varnishlog4.py and moved it to python-varnishapi's Dispatch as a brute test [14:07:45] maybe I'm just impatient heh [14:07:54] the output looks ok at first glance... [14:07:57] and cpu usage is low [14:08:37] anyways [14:09:05] in my hack, in varnishapi.py near line 700 in __callback [14:09:07] while 1: [14:09:07] i = self.lib.VSL_Next(tra.c) [14:09:07] if i < 0: [14:09:07] return (i) [14:09:10] if i == 0: [14:09:12] time.sleep(0.01) [14:09:15] break [14:09:17] I injected the sleep above, and removed the one in varnishlog [14:09:25] the rest is identical, just the sleep there is added [14:09:53] alright I'll also give it a shot [14:10:39] well the effects are still strange [14:11:00] it sent stats at 14:07:32.118736 then 14:09:19.130777 [14:11:24] it's supposed to send them in 30 second batches, and I have a while loop elsewhere ensuring 1/sec real traffic into my test host (not just purge) [14:12:19] mmmh no, that sleep seems to break things [14:12:50] I've tried to inject it, lots of VSL entries get skipped [14:13:11] yup [14:13:44] where was your Dispatch sleep at? [14:13:45] same sleeping in Dispatch, for the record [14:14:13] hmmm ok [14:14:46] but why would it miss *so* many entries by sleeping just 0.01 secs? [14:14:58] we're probably sleeping in the wrong place [14:15:04] yup [14:15:10] does it even notice/log that it overflowed? 
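To make the "why would it miss *so* many entries by sleeping just 0.01 secs?" question above concrete, here is a self-contained toy in plain Python (not the varnish API): a bounded deque stands in for the fixed-size shared-memory log, so when the producer laps the consumer the oldest records are silently dropped, roughly what a VSL overrun does. The buffer size and rates are invented for illustration only.

    import collections
    import threading
    import time

    BUF = collections.deque(maxlen=1000)     # stand-in for the shm log: overwrites oldest when full

    def producer(stop, counter):
        # append ~2000 records/sec, vaguely like a heavy PURGE stream
        while not stop.is_set():
            counter[0] += 1
            BUF.append(counter[0])
            time.sleep(0.0005)

    def consume(duration, sleep_per_record):
        seen = 0
        deadline = time.time() + duration
        while time.time() < deadline:
            if BUF:
                BUF.popleft()
                seen += 1
                if sleep_per_record:
                    time.sleep(0.01)         # sleeping per record caps us at ~100 records/sec
            else:
                time.sleep(0.01)             # sleeping only when idle lets us catch back up
        return seen

    for per_record in (False, True):
        BUF.clear()
        counter = [0]
        stop = threading.Event()
        t = threading.Thread(target=producer, args=(stop, counter))
        t.start()
        seen = consume(2.0, per_record)
        stop.set()
        t.join()
        label = "sleep per record" if per_record else "sleep only when idle"
        print("%s: consumed %d of %d produced" % (label, seen, counter[0]))

With the sleep on the idle path the consumer sees essentially everything; with the same 0.01s sleep taken per record it can only drain about 100 records per second while the ring keeps overwriting itself, which is the "lots of VSL entries get skipped" effect described here.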
[14:15:19] it seems like varnishapi should know when overflow happens [14:15:41] heh, I think it should write that in self.error [14:16:05] but I can't see any statements printing that attribute [14:16:11] also, a sleep-based "fix" plus missed events, when our fix is based on what worked for varnishkafka, which has an open ticket about missed events [14:16:16] seems un-coincidental :) [14:16:22] :) [14:17:02] OK so I think varnishapi.py expects the script importing it to do error reporting [14:17:15] probably, it's just an API mapping [14:17:35] I would expect, design-pattern-wise, that any necessary sleep should be at the varnishlog level alongside the error reporting [14:17:53] which is kinda how we had things already [14:17:55] but still [14:22:09] so the reasoning behind our sleep in varnishlog.execute was: if there is nothing to read, don't keep on calling Dispatch but sleep a little [14:22:23] which is a good idea to avoid wasting CPU cycles [14:22:44] however, now increasing *that* sleep time doesn't change anything in terms of high cpu usage under high load [14:23:19] because the problem is not that we're calling Dispatch too many times for nothing, we're calling it often because there *is* stuff to read [14:25:02] now, if we sleep in varnishlog.execute regardless of the return value of self.vap.Dispatch, we miss pretty much all the events [14:25:57] the real problem here is where the filtering happens and how efficiently, IMHO [14:26:16] it's supposed to be the case with varnishlog that we only get callbacks for matching events, which don't include the PURGE spam [14:26:24] agreed, filtering on !PURGE requests doesn't change anything in terms of CPU usage [14:26:44] in the xcache case at least, we're already effectively filtering PURGE, because PURGE doesn't have X-Cache [14:27:17] the sleep needs to happen before filtering, though [14:27:34] if we sleep on zero events post-filtering, we sleep through too many events and overflow [14:28:11] in other words, the sleep needs to be deeper in the stack (before filtering), and needs to be shorter, probably, if we want to avoid overflows [14:28:15] so even though design-pattern-wise we should sleep in varnishlog, we probably have to sleep in varnishapi [14:28:21] right [14:28:39] even varnishapi probably isn't deep enough, right? [14:28:59] heh [14:31:04] the sleep has to be before tag-based filtering [14:31:25] before any kind of filtering. and really the sleep basically goes away if you're processing a high enough rate of raw events [14:32:17] probably with proper error logging added to the current code, we're getting a lot of overflows already even with the CPU locked up [14:33:04] (if varnishapi is setting those errors properly) [14:33:35] it's probably time to take a fresh look at 4.1's stock varnishncsa and/or varnishlog code [14:33:54] I took a look at varnishncsa and there are no sleeps [14:34:01] I imagine VUT is doing that [14:34:51] varnishlog.py->varnishapi.py skips over VUT I guess? [14:35:06] yep!
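A minimal sketch (not the actual varnishlog4.py) of the pattern argued for above: keep the coalescing sleep and the error reporting together at the varnishlog level, since python-varnishapi only records problems in its error attribute and leaves printing them to the importer. It assumes Dispatch() follows the VSLQ_Dispatch convention of returning 0 when there was nothing to process; self.vap and self.vap.error are the names used elsewhere in this discussion, while the class, the callback attribute and the quit flag are made up for the sketch.

    import sys
    import time

    class VarnishLogConsumer(object):
        """Hypothetical wrapper around a python-varnishapi handle."""

        def __init__(self, vap, callback):
            self.vap = vap            # python-varnishapi object, as used by varnishlog4.py
            self.callback = callback  # the consumer's VSL handler
            self.quit = False

        def execute(self):
            while not self.quit:
                ret = self.vap.Dispatch(self.callback)
                if self.vap.error:    # e.g. overflow / "Log reacquired"
                    sys.stderr.write("varnishapi error: %s\n" % self.vap.error)
                    self.vap.error = ""   # clear only after it has been reported
                if ret == 0:
                    time.sleep(0.01)  # nothing to read: coalesce instead of busy-looping

This only illustrates the design-pattern point made here (sleep and error reporting at the consumer level); as the rest of the conversation shows, the sleep by itself does not explain the CPU usage.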
[14:35:28] VUT.idle_f = &flushout; <- varnishncsa [14:35:44] so they don't even call the output func until some idle event [14:36:03] yes and then VUT_Main is using that [14:36:08] with a 0.01s sleep [14:36:28] i = VSLQ_Dispatch(VUT.vslq, vut_dispatch, NULL); [14:36:28] if (i == 1) [14:36:28] /* Call again */ [14:36:28] continue; [14:36:28] else if (i == 0) { [14:36:30] /* Nothing to do but wait */ [14:36:31] if (VUT.idle_f) { [14:36:33] i = (VUT.idle_f)(); [14:36:35] if (i) [14:36:36] break; [14:36:38] } [14:36:41] VTIM_sleep(0.01); [14:36:43] continue; [14:36:46] } [14:36:48] heh [14:36:51] well anyways [14:38:21] so is VUT exposed in the real libvarnishapi, or not? [14:39:03] not really, we ended up re-inventing the wheel in vk for that reason IIRC [14:40:02] does stock varnishncsa/varnishlog even handle this case correctly? or do they also drop events with 1k/sec PURGE flowing and matching non-PURGE traffic like xcache? [14:41:07] there could also be an issue of sizing the vsm larger for our rate, but either way you'd expect that to just create overflows, not cpu waste [14:43:34] varnishncsa seems to be doing the right thing [14:44:13] I've sent 30k requests and it logged > 32k to disk [14:44:17] with low cpu usage [14:44:27] 1.7k req/s [14:45:57] so in the deeper varnishapi stuff [14:46:14] ultimately the vslq callback is supposed to return 0 if everything's normal [14:46:25] maybe at the python level we're failing to do that and returning some other value? [14:47:23] vsl_callback in varnishxcache does an explicit "return 0" [14:50:40] well still, the first obvious thing is to fix up varnishlog and/or varnishapi python to log error/overflow correctly and make sure they're handled sanely [14:50:49] without that, it's hard to see what happens with other problems [14:50:50] yes [14:59:23] so, after profiling varnishlog.py and varnishlog4.py it turns out that in order to emit 538782 lines of output varnishlog.py called stuff like varnishlog.py:100(_vsl_handler) exactly 538782 times [14:59:54] varnishlog.py, instead, called Dispatch 428721 times to output 5405 lines... [15:00:04] sorry, *varnishlog4.py* [15:03:38] uh, but that's because I was using different filters with varnishlog4 [15:06:08] https://phabricator.wikimedia.org/P3218 [15:06:12] https://phabricator.wikimedia.org/P3217 [15:09:04] so yeah, both versions are doing "the right thing" when it comes to the number of function calls (it is very close to the number of entries processed) [15:09:15] FYI I installed the new vk on cp1046 and it seems to be working fine (tailed /var/cache/vk/etc.. plus watched kafka json messages on stat1002) [15:21:44] also live hacked the config to add the "end:" prefix in the timestamp field and it works fine [15:21:58] I am checking kafka messages and everything looks good [15:38:56] nice [15:50:42] https://people.wikimedia.org/~ema/callgraph-v3.png [15:50:45] https://people.wikimedia.org/~ema/callgraph-v4.png [15:52:24] bblack: those are the callgraphs of varnishlog.py v3 and v4 under load [15:53:16] ema: perhaps it's simply that our shm log size is too small for the rate, we're not handling errors like overflow correctly (or at all), and either because we're not handling them correctly, or just in general, the overflows are what's causing the CPU excess? [15:53:51] e.g.
overflow path may skip the sleep, or just take a long time because it's an exception at the python level, or whatever [15:55:18] I couldn't really get any overflow error from varnishlog.py, it might be that python-varnishapi is swallowing those [15:56:37] what I tried was checking for self.vap.error in varnishlog4.execute [15:56:47] but it's never set [16:00:09] probably, whether or not it's at issue here, we should look at vsl params [16:00:25] we did s/shm_reclen/vsl_reclen/ (which probably isn't enough) [16:01:04] there's also vsl_space default to 80M [16:01:23] and vsl_mask to avoid ever logging certain events, not even sure what the default is there [16:01:40] oh default is Default: -VCL_trace,-WorkThread,-Hash,-VfpAcct [16:02:25] there's some strange interaction between vsl_buffer and vsl_reclen too [16:02:58] interestingly the absolute max for vsl_reclen is 4084 bytes [16:03:29] with 2K as a reasonable URL limit, I could easily see the rest of the records for headers and cookies overflowing the rest. how the hell can 4k max per logged request be sufficient? [16:03:36] (not to mention the internal tracing) [16:04:11] or does varnish let a big request span multiple vsl "record"s? [16:05:03] ok wait, I think I understand the docs now [16:05:26] there is no max on vsl_buffer, which is allocated bufferspace per-thread for VSL writes, and has to fit at least one full record [16:05:51] vsl_reclen's maximum allowed value is vsl_buffer + 12 [16:06:06] sorry, -12 [16:06:21] 4084 is just the example max, given default vsl_buffer of 4K [16:06:32] oh that explains the 4084 [16:07:16] shm_reclen was tricky in varnish3, which is why we couldn't tune it arbitrarily. it could cause overflows of other buffers, etc... [16:07:27] IMHO in v4 we should waste memory on getting these right [16:08:23] set vsl_buffer to something like maybe 64K, and vsl_reclen at ~16K (perhaps 16K - 12, in case that affects efficiency)? [16:09:20] and then bump vsl_space a bunch so it's never an issue [16:09:48] if we bump it a whole lot, we'll have to raise the size of the tmpfs we mount on /var/lib/varnish too [16:10:05] in general we have the RAM for it though [16:10:31] our worst remaining caches have ~96G RAM [16:10:48] should we depool, say, cp1058 and play with the settings? [16:11:52] so /var/lib/varnish, you waste 60 bytes every time you reload VCL, the rest is tmpfs copy of vsl [16:14:38] and tmpfs doesn't waste unused allocation. 
the allocated size of tmpfs just sets a ceiling on what can be allocated [16:15:08] going back a little to what you were saying about filtering: on cp1008, varnishlog.py (v3) is also showing high CPU usage (70/80%) without filters [16:15:26] as in python varnishlog.py -n frontend > /dev/null [16:16:03] so it might also be simply that we're simply handling too many purges and in v4 we don't filter efficiently [16:16:12] yeah [16:16:17] too many 'simply' but you get the point :) [16:16:34] but we can also process them more-efficiently with bigger sleeps though, but that's predicated on not overflowing during sleep [16:16:56] yes [16:17:04] so I'd say as an experiment (aimed at all varnish4's, but starting out manual on one node): [16:17:23] 1) Bump the tmpfs size to 2GB [16:17:38] 2) Bump the frontend vsl_space to 1GB [16:18:27] 3) Bump the backend vsl_space to 256MB [16:18:36] 4) Bump vsl_buffer to 64K and vsl_reclen to 16K-12 on both frontend and backend [16:20:29] with just that in place, we can be pretty sure we're not overflowing single record fitting in vsl_reclen ever, and in the common case vsl_buffer should spool out many requests at a time making it more-efficient (at least 4, but commonly many more). [16:21:40] and even with maximally-sized records (which should be rare!), we'd fit ~64K records in vsl_space on the frontend before it loops [16:22:26] so even if we had sleep(0.1) on zero-available, that can handle a rate of something like 640K/sec [16:22:36] (of maximally-huge records) [16:23:05] at the same time, varnishapi should handle overflows and I'm not sure it does that now :) [16:23:23] then we know for sure that overflow isn't an issue in functional or perf terms, and the only real cost is the wasted 1-2GB of memory, and whatever we waste for 64K vsl_buffer * n_threads [16:23:41] varnishapi should handle overflow, too. fix that first while it's easier to reproduce :) [16:23:49] it should at least log it loudly [16:23:56] exactly! [16:24:25] but then I still think we're best off in varnish4 (where it's safer to do so), raising all the numbers to something like the above. it's worth the lost ram to make sure we never ever lose stuff [16:26:52] (and the vsl_buffer waste is going to make varnishd more efficient, and the vsl_space waste makes longer sleeps for efficiency a non-issue on the receiving side) [16:27:37] default vsl_* + default vut sleep(0.01) are probably not considerate of our se-case here [16:27:51] *use-case [16:38:42] bblack: uh, we do get lots of 'Log reacquired' errors even with sleep(0.01) [16:39:29] https://github.com/xcir/python-varnishapi/blob/master/varnishapi.py#L633 [17:04:57] bblack: https://gerrit.wikimedia.org/r/#/c/293123/ [17:05:49] we could also raise an exception, but that might be logging too loudly :) [17:32:26] bblack: something like this is probably needed as well https://gerrit.wikimedia.org/r/#/c/293132/ [17:35:13] 07HTTPS, 10Traffic, 06Operations, 06WMF-Communications, 07Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2361655 (10Florian) [18:06:41] heh nice [18:07:27] I wonder if the initial "Log reacquired" is what happens on varnishd restart, or on event buffer overflow (as in, ran out of space before it caught up), or the routine wrapping of the buffer space in memory? 
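For reference, the arithmetic behind the "~64K records" and "640K/sec" figures in the sizing experiment above, using the proposed values (these are the numbers under discussion, not current production settings):

    vsl_space_frontend = 1 * 1024**3           # proposed frontend vsl_space: 1GB
    vsl_reclen_max = 16 * 1024 - 12            # proposed per-record ceiling: 16K - 12

    records_before_wrap = vsl_space_frontend // vsl_reclen_max
    print(records_before_wrap)                 # ~65,500 maximally-sized records fit before the ring wraps

    idle_sleep = 0.1                           # even a deliberately long sleep on zero events
    print(records_before_wrap / idle_sleep)    # ~655,000 records/sec before a sleep could let the ring lap us

In practice records are far smaller than the ceiling, so the real headroom is much larger; the point is that with these sizes a generous idle sleep cannot by itself cause overflow.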
[18:09:15] ema: re the latter patch on clearing errors, probably the right place to clear them universally is just after the consumer's callback returns [18:09:37] (in other words, right after the varnishapi consumer's code has had a chance to consume any outstanding error) [18:10:18] I donno though, the error state model is strange in there [18:12:35] so, in trying to sort out the miss-vs-pass stats problem and also think about the eventual "pass directly from the frontend, when it makes sense" problem, I've stumbled on an interesting variant [18:12:54] let's say we want the frontend daemon, when it's "pass", to jump straight to the applayer [18:13:19] and then we make the exception "unless we're passing in the frontend-only, because of a size or range restriction, but might hit in the backends" [18:14:00] but some of those frontend-only passes are hit_for_pass/uncacheable. we could easily set a req.foo to differentiate when we initially pass/hfp, but how would we tell on further pass due to hitting the hfp objects that are frontend-pass only? [18:15:40] oh we have obj.http.header [18:16:08] so we could set beresp.http.foo when setting up hit-for-pass, and then find it again during pass by looking at obj.http.foo [18:16:28] maybe [18:17:05] we could also take the tactic that frontends always talk to the first backend, and any direct access on pass goes from the backend and skip the whole problem [18:17:48] for stats it's similar, we could simply never count frontend-pass as pass, only backend-pass as pass for stats [18:17:52] hmmmm [18:40:48] 10Traffic, 06Discovery, 06Operations, 10Wikidata, 10Wikidata-Query-Service: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2361834 (10Gehel) [19:30:57] 10Traffic, 06Discovery, 06Operations, 10Wikidata, 10Wikidata-Query-Service: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2361834 (10Smalyshev) Yes, after the switch to built GUI we can cache at least the hashed CSS/JS pretty much forever, they'd never change. Non-hashed ones prob... [21:15:33] bblack: quick question... in x-cache headers, I see "pass", "miss" or "hit". Miss is obvious, not sure about pass vs hit [21:16:41] it's quite complicated. even x-cache is hiding a lot of gory distinctions [21:16:54] there's also "int" in addition to miss, pass, and hit currently. [21:17:14] bblack: yep, I saw that as well... [21:17:33] hit means a cache hit in cache storage. there was no need to query a deeper cache server (or the applayer, if already at the last cache server) [21:18:05] int means a locally-generated response from the cache (like VCL emitting its own synthetic 301 redirect, without using a cache object or talking to another server) [21:18:18] miss and pass both mean it had to talk to something deeper (a further cache or the applayer) [21:18:40] the line is read right-to-left. The rightmost is the outermost cache, things to the left are progressively deeper towards the applayer [21:19:20] once you encounter "hit", you have to remember that everything to the left of "hit" is part of the cached object that got hit. it's whether the stuff to the left missed, passed, or hit when that object was first pulled into the hitting cache. [21:20:34] bearloga: is trying to massage some of those data to get some stats and I'm trying to explain what I do understand, but my knowledge has quite a few holes... 
[21:20:40] the distinction between miss and pass is subtle, and it's easy for things we'd like to consider "pass" to end up saying "miss" instead, so it's blurry [21:21:07] bblack: you're making much more sense than I do! [21:21:10] but basically pass means we passed through because something was uncacheable, whereas miss means the object might be possible to cache, it just wasn't in cache [21:21:45] but sometimes things are uncacheable and turn up as a miss in X-Cache. it's something we're still working on being able to get stats and distinction better on. [21:22:13] for now, I'd just consider miss+pass as one lump, as we currently do in https://grafana.wikimedia.org/dashboard/db/varnish-caching [21:23:54] bblack: if pass is something uncacheable, how come we have some x-cache with pass-miss-pass? uncacheable in memory, but cacheable on disk? [21:24:11] we have different rules for mem or disks? [21:25:38] (to be clear - the frontend-most (right-most) cache is the only memory cache, all others are disk caches) [21:25:57] yes, sometimes we have different pass rules on the frontends (mem, only 1) than on the backends (disk, many) [21:26:28] but also, commonly we don't know an object is uncacheable until after we fetch it, and then we cache the fact that it's uncacheable for a little while (~10mins) [21:26:53] bblack: damn, this is complex! [21:26:57] so "pass" could mean we have an explicit VCL rule that tells us (on request reception) that there's no way this request is cacheable [21:27:03] bblack: thanks a lot, I now understand this much better! [21:27:23] "pass" can also mean we hit one of those special cache entries that tells us that we learned from the backend that the response is uncacheable for 10 mins [21:27:43] bblack: thank you so much for explaining! [21:27:47] but we can also "miss" briefly on those while refreshing the knowledge that the backend response lets us know it's uncacheable [21:27:58] so again, for now consider miss/pass a blurry mixed thing :) [21:30:11] bblack: how do you retain your sanity with all that in your head? :P [21:30:40] I don't, I work from an asylum [21:32:07] my girlfriend mostly works in an asylum! [21:32:30] bblack: thanks a lot for taking the time to explain all that! Most interesting! [21:32:35] np! [21:39:13] bblack: yes! thank you for taking the time! super appreciate it and it's really helpful ^_^ [22:41:50] 10Traffic, 06Operations, 06WMF-Legal, 05Security: policy.wikimedia.org SSL vulnerability - https://phabricator.wikimedia.org/T137258#2362614 (10BBlack) [22:43:19] 10Traffic, 06Operations, 06WMF-Legal, 05Security: policy.wikimedia.org SSL vulnerability - https://phabricator.wikimedia.org/T137258#2362614 (10Dzahn) policy.wm runs on https://vip.wordpress.com/ [22:55:54] 10Traffic, 06Operations, 06WMF-Legal, 05Security: policy.wikimedia.org SSL vulnerability - https://phabricator.wikimedia.org/T137258#2362614 (10Krenair) >>! In T137258#2362633, @Dzahn wrote: > policy.wm runs on https://vip.wordpress.com/ It appears to be running on different servers to the blog which is h... [22:58:31] 10Traffic, 06Operations, 06WMF-Legal, 05Security: policy.wikimedia.org SSL vulnerability - https://phabricator.wikimedia.org/T137258#2362661 (10Slaporte) a:03Slaporte I've reported this to wordpress, and I'll update here as they resolve it.
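Going back to the X-Cache discussion above, here is a small helper applying the reading rules as explained (scan right-to-left; the first "hit" means everything further left only describes how that object was originally fetched; lump miss and pass together for now). It assumes the header has already been reduced to bare per-layer verdicts listed left-to-right as they appear in the header; the real header also carries hostnames and hit counts, and the conversation does not spell out how "int" at an inner layer should be counted, so treating it like a terminal verdict here is an assumption.

    def classify_xcache(verdicts):
        """verdicts: per-layer X-Cache verdicts, leftmost = closest to the applayer,
        rightmost = outermost frontend cache, e.g. ["pass", "miss", "pass"]."""
        # walk right-to-left: the first "hit" (or locally-generated "int") we meet
        # is where the response actually came from; anything further left only
        # describes the original fetch of that object, so we stop there
        for v in reversed(verdicts):
            if v == "hit":
                return "hit"
            if v == "int":
                return "int"
        # nothing was served from cache or generated locally: for stats purposes,
        # lump miss and pass together, as the varnish-caching dashboard currently does
        return "miss/pass"

    assert classify_xcache(["pass", "miss", "pass"]) == "miss/pass"
    assert classify_xcache(["miss", "hit", "miss"]) == "hit"
    assert classify_xcache(["int"]) == "int"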