[05:14:48] ema: the high n_vcl is from all the depool/repool stuff. Those are accomplished with vcl reloads too [05:19:31] so, possibly some combination of all the tuning hacks is helping, and the increased FE hitrate [05:20:55] we got a significant bump in FE hitrate from that 1MB limit, and that was just a stab in the dark really. more fine-tuning may raise it more and the hit/2 thing almost certainly will as well [05:21:06] maybe that and lru_interval and nuke_limit hacks are enough [05:24:42] it would be nice to patch varnish for etcd depool/repool without vcl reloads [05:24:59] (in some way that's generically reusable by others with their own setups) [07:49:59] bblack: still stable! \o/ [09:56:53] 10Traffic, 06Operations, 07LDAP: update ldap-[codfw|eqiad].wikimedia.org certificates (expire on 2016-09-20) - https://phabricator.wikimedia.org/T145201#2636310 (10MoritzMuehlenhoff) Ack, we're currently using the internally-managed ldap-labs.codfw|eqiad.wikimedia.org cert which are valid until Nov 28 2017.... [10:14:39] so yeah the hfp 1M change took us from 70% fe hit to 80% [10:14:43] not to shabby :) [10:14:48] *too [10:15:12] and then I think nuke_limit also plays a role [10:17:29] uh, it's happening on cp4005 [10:19:43] bumping nuke_limit changes the cause, but not the 503s [10:21:15] <_joe_> ema: cp4015 'Return code of 134 is out of bounds' [10:21:20] <_joe_> error on the HTTP check [10:21:43] yeah that could be a false positive, it happens sometimes [10:21:55] and it's only 1/3 checks [10:24:02] depooling and restarting varnish on cp4005 [10:24:16] ERROR:conftool:Error when trying to set/pooled=no on cp4005.ulsfo.wmnet [10:24:19] ERROR:conftool:Failure writing to the kvstore: Backend error: Raft Internal Error : etcdserver: request timed out [10:24:36] worked fine on 2nd try [10:26:38] ok I've repooled cp4005 after restarting varnish-be, no more 503s [10:27:00] interestingly, for a very little bit after bumping nuke_limit we still got 503s but without 'Could not get storage' [10:27:08] but that didn't last long at all [10:30:33] next one should be cp4007 I think [10:34:20] man the 5xx reqs/min icinga check is really slow [10:43:13] this time n_lru_nuked was ~27M [10:43:51] so about 2M more than when the bug happened last time [10:52:01] one thing we can try is running an experiment with a much smaller -sfile [10:52:31] and then see if somehow we manage to reproduce this reliably [12:18:24] ema: so you raised nuke_limit, and for a very little bit there were non-lrufail 503, then.... ? did the 503s stop completely or go back to lrufail? [12:24:39] ema: also, probably we should parameterize the nuke_limit/lru_interval stuff. since these hacks aren't necc on frontends and probably have downsides there. [12:25:41] (for the runtime hacks I only did the backends, and I'm guessing FEs haven't restarted since, so it's probably still backend-only in practice right now) [12:31:12] my code understanding of how nuke_limit works is improved now, too. I don't think it's currently useless, whatever that ticket says. 
[12:31:30] the way it seems to work is this: [12:32:09] EXP_NukeOne will (apparently) walk the entire LRU list looking for the first item it can evict (not busy, etc) [12:32:46] the allocating code that needs N bytes to store a new object will invoke EXP_NukeOne up to nuke_limit times and retry e/cache.h:int EXP_NukeOne(struct worker *wrk, struct lru *lru); [12:32:55] grr stupid touchpad [12:33:22] the allocating code that needs N bytes to store a new object will invoke EXP_NukeOne up to nuke_limit times and retry EXP_NukeOne up to nukee/cache.h:int EXP_NukeOne(struct worker *wrk, struct lru *lru); [12:33:53] :p [12:34:14] ok the default settings for synclient on this mac are awful [12:34:35] the allocating code that needs N bytes to store a new object will invoke EXP_NukeOne up to nuke_limit times and retry until it gets the space it needs [12:35:23] so e.g. if it wants to allocate 16KB, and all of the existing objects on the LRU are 2KB each, it's going to have to nuke 8 of them to make space. if nuke_limit was <8, we'd fail to find storage. [12:35:52] so nuke_limit is more about the max object count we'll evict to make room for a single object [12:36:40] but even before nuke_limit is reached, if EXP_NukeOne returns -1 (which means failed to free an object at all), the whole thing immediately fails [12:36:54] and our LRU_Failed log entry is what happens within EXP_NukeOne when it's about to return -1 [12:39:29] so I can see how nuke_limit would help, especially if you have large disparities in object sizes [12:39:54] but ultimately it's not going to fix the LRU_Fail [12:40:12] LRU_Fail implies it walked the entire LRU list and failed to find a single object it could free [12:42:16] but if the lru is holding ~2.7M objects, it seems pretty impossible they're all busy simultaneously and thus un-nukeable [12:42:23] err 27M [12:42:38] which seems to imply that the LRU doesn't contain all the storage objects [12:42:49] maybe there's a bug and some objects are still in storage but no longer in the LRU [12:43:07] or a design-intentional thing where only a subset of the objects are ever in the LRU at any given time [12:47:57] I donno [12:49:57] https://phabricator.wikimedia.org/P4042 [12:50:15] ema: ^ source patch for varnishd that would make LRU_Fail more informative [12:52:58] I don't know what the "flags" will mean if they turn up nonzero, but seems interesting to know. more importantly, if n_objcore in the LRU is close to the total objects in storage, that eliminates "partial/tiny LRU list" theories [12:53:26] leaving us with explaining how ~27M objects are all busy. maybe there's a refcnt leak somewhere and they look artificially busier than they should be? [12:55:10] bblack: so MAIN.n_lru_nuked is the total number of nuked objects [12:56:04] since startup, so it's more an indication of uptime/workload than anything else [13:01:20] and yes, I bumped nuke_limit to 5000 while varnishd was LRU_Failing and captured a few non-LRU_Fail 503s [13:01:27] does n_lru_nuked stop incrementing when the LRU_Failed start happening? [13:01:27] then it quickly went back to LRU_Fail [13:01:43] we can check that in ganglia [13:02:28] in the meanwhile, if you want to take a look at the varnishlog, root@cp4005:varnishlog-backend-503.log are a few captured non-LRU_Fail 503s [13:02:48] while all the 503s captured since yesterday are in ~ema/2016-09-13-backend-503.log [13:03:27] well probably those are just some temporary race? 
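The nuke_limit mechanics described above lend themselves to a small worked example. Below is a minimal, self-contained C sketch -- a toy model, not varnish source: lru_node, toy_lru, nuke_one and try_alloc are invented for illustration, and only the EXP_NukeOne prototype quoted from cache.h above comes from the real tree. It reproduces the 16KB-vs-2KB scenario: the allocating side calls the nuker up to nuke_limit times, each call evicts at most one non-busy object, and if a single call walks the whole list without finding anything evictable the allocation fails immediately regardless of nuke_limit, which is the analogue of the LRU_Failed case.

```c
/*
 * Toy model of varnish's storage allocation vs. LRU nuking, as described
 * in the chat above. Not varnish code; the real entry point is roughly
 *     int EXP_NukeOne(struct worker *wrk, struct lru *lru);
 * which evicts at most one object per call, or returns -1 if it walked
 * the whole LRU without finding anything evictable.
 */
#include <stdio.h>
#include <stdlib.h>

struct lru_node {
    size_t size;
    int busy;               /* refcount held elsewhere -> not evictable */
    struct lru_node *next;
};

struct toy_lru {
    struct lru_node *head;
    size_t free_bytes;      /* space currently available in "storage" */
};

/* Analogue of EXP_NukeOne: evict the first non-busy object, or fail. */
static int nuke_one(struct toy_lru *lru)
{
    struct lru_node **pp;
    for (pp = &lru->head; *pp != NULL; pp = &(*pp)->next) {
        if (!(*pp)->busy) {
            struct lru_node *victim = *pp;
            *pp = victim->next;
            lru->free_bytes += victim->size;
            free(victim);
            return 1;
        }
    }
    return -1;              /* "LRU_Failed": nothing evictable at all */
}

/* Analogue of the allocating side: retry nuking up to nuke_limit times. */
static int try_alloc(struct toy_lru *lru, size_t need, int nuke_limit)
{
    int nukes;
    for (nukes = 0; lru->free_bytes < need; nukes++) {
        if (nukes >= nuke_limit)
            return 0;       /* gave up: hit nuke_limit */
        if (nuke_one(lru) < 0)
            return 0;       /* gave up: LRU walk found nothing evictable */
    }
    lru->free_bytes -= need;
    return 1;
}

int main(void)
{
    /* Storage full of 2KB objects, no free space, and we want 16KB:
     * with nuke_limit=50 this succeeds after 8 nukes, with nuke_limit<8
     * it fails even though plenty of objects are evictable. */
    struct toy_lru lru = { NULL, 0 };
    for (int i = 0; i < 1000; i++) {
        struct lru_node *n = malloc(sizeof *n);
        n->size = 2 * 1024;
        n->busy = 0;
        n->next = lru.head;
        lru.head = n;
    }
    printf("nuke_limit=50: %s\n", try_alloc(&lru, 16 * 1024, 50) ? "ok" : "fail");
    printf("nuke_limit=4:  %s\n", try_alloc(&lru, 16 * 1024, 4) ? "ok" : "fail");

    /* If every remaining object is busy, even a huge nuke_limit can't
     * help: nuke_one() returns -1 right away -> the LRU_Failed case. */
    for (struct lru_node *n = lru.head; n != NULL; n = n->next)
        n->busy = 1;
    printf("all busy:      %s\n", try_alloc(&lru, 16 * 1024, 5000) ? "ok" : "fail");
    return 0;
}
```

Built with e.g. gcc -std=c99, it prints ok / fail / fail for the three cases, which is the distinction being drawn above: nuke_limit only bounds how many objects get evicted per allocation, while LRU_Failed means nothing on the list was evictable at all.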
[13:03:33] I don't know [13:05:13] let's see the event lasted from 10:14 till ~ 10:25 [13:05:14] https://gerrit.wikimedia.org/r/#/c/310540/ [13:05:50] ^ better debug patch, in case queue size != n_objcore, and probably gerrit patch easier to work from than paste :) [13:06:43] bblack: it's a little hard to tell from ganglia whether n_lru_nuked was actually increasing heh [13:07:01] but wait, perhaps it wasn't, look at this [13:07:03] https://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&c=Upload+caches+ulsfo&h=cp4005.ulsfo.wmnet&jr=&js=&v=0&m=varnish.MAIN.n_lru_nuked&vl=N&ti=Number+of+LRU+nuked+objects [13:07:11] I don't really trust our lack of LRU_Cand messages, they may get filtered somewhere/somehow [13:13:03] godog, bblack: do you know where to find the ganglia rrd files for cp4005? [13:13:39] nope [13:14:21] the patch looks good! I would love to be able to do the same with systemtap so that we don't need to restart the daemons though [13:14:49] yeah [13:14:57] that attempts++ looks tough to reproduce in stap though :) [13:15:01] restarted daemons will be the last ones to finally reach it again heh [13:15:25] yes and I think it could take a long time now with higher frontend rate and perhaps the hacks [13:15:39] looking at how objects enter the LRU list in the first place.... [13:16:01] they get there for sfile via EXP_Insert which happens via HSH_Unbusy [13:16:10] ok [13:16:57] but that code is common to malloc/file (and everything else, even persist) [13:17:15] so it seems unlikely something special about "file" is failing to fully populate the LRU list with the stored objects [13:17:25] ema: yeah uranium:/srv/ganglia has those [13:17:31] godog: <3 [13:17:33] but also, when/where/how HSH_Unbusy is called seems to be subject to do_stream status [13:17:56] so that again argues for maybe disabling do_stream by default on upload backends just to see the effect [13:18:35] a related thought in the opposite direction: we left in those do_stream=true VCL conditionals from v3, on the assumption that v4 default do_stream=true makes them basically no-ops for now [13:19:02] * ema shivers [13:19:04] what if they're not? I imagine there are certain conditions under which v4 reverts to non-stream, and our VCL could be overriding that to bad effect [13:19:24] /o\ [13:19:41] let's get rid of those in v4 [13:19:50] it seems like a pretty harmless and easy experiment: make the do_stream=true conditionals conditional v3 blocks [13:19:53] yeah [13:20:12] ok, I've got like 3 running threads :) [13:20:16] let me first check the rrd file [13:20:28] ok, I'm out for a bit [13:20:38] see you later [13:50:28] 10Domains, 10Traffic, 10DNS, 06Operations, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2636714 (10grin) >>! In T144508#2625975, @Ryan_Lane wrote: > I spent years working on a project to make this possible. Before I left we had an en... [13:54:19] uh I got a non-LRU_Fail (but LRU related?) 503 [13:54:34] -- ObjHeader X-Cache-Int: cp1048 miss, cp2026 miss [13:54:34] -- Fetch_Body 2 chunked stream [13:54:34] -- ExpKill LRU_Cand p=0x7f3e71ca7080 f=0x0 r=1 [13:54:34] -- ExpKill LRU x=289764944 [14:11:22] 10Traffic, 10MediaWiki-Cache, 06Operations: Duplicate CdnCacheUpdate on subsequent edits - https://phabricator.wikimedia.org/T145643#2636791 (10hashar) [14:29:42] maybe the debug patch + smaller storage could work for a quicker debug repro [14:29:51] e.g. 
start a backend with 1/10th the normal storage we allocate [14:30:06] but also, that you saw an LRU_Cand at all means they're not filtered..... [14:30:20] so we're very likely going to see attempts=0 on those debug msgs [14:30:56] I wonder what the actual failure cause was in your LRU_Cand one... [14:31:13] it would be nice if varnish's 503s documented more clearly what their rationale was... [14:33:36] merged the do_stream patch in the meanwhile [14:34:16] https://gerrit.wikimedia.org/r/#/c/310551/ [14:57:58] cool [14:58:10] I was browsing through various related google results and varnish blog entries [14:58:36] the MSE blog entry is interesting [14:58:58] (the first one) [14:59:56] basically the first two paras of https://info.varnish-software.com/blog/introducing-varnish-massive-storage-engine [15:00:10] make it sound like they know file isn't workable in this scenario [15:00:34] it's also kind of an interesting revisionist view of history in a few different ways :P [15:01:00] (ignoring persist, and that existing sites were using large disk storage successfully with open source, but can't now) [15:02:29] > From earlier we’ve known that if using the file backend and if Varnish is kept running for a long time, with a nearly full cache, the cache will start to suffer from fragmentation. [15:03:18] yeah but (I think!... ?...) fragmentation is the kind of thing nuke_limit solves, and LRU_Fail isn't fragmentation. LRU_Fail means it could find even one single object that can be evicted. [15:03:32] *couldn't* [15:03:50] oh yeah, for sure it's a bug somewhere there [15:04:26] anyways, if we have new assumptions about whatever varnish can do, we can re-design around them [15:04:26] bblack: I'm parametrizing lru_interval and nuke_limit, but I don't think we've got a [15:04:29] distinction between frontend and backend runtime_params in puppet. OK with me [15:04:32] adding be_runtime_params or do you have other suggestions? [15:04:44] the hard thing is figuring out what these new limits are [15:05:15] ema: yeah just make separate fe_runtime_params and be_runtime_params is probably easiest, even if it's slight duplication [15:05:23] alright [15:05:45] at the r::c::instances level I mean [15:05:55] yup [15:06:36] if fragmentation from small-vs-large obj is the problem but storage size isn't, we could do the inverse of the fe hitrate thing [15:06:47] on the backends, refuse to cache objects < size N [15:06:58] where there's a small overlap maybe between the FE and BE opposing limits [15:07:11] (the overlap probably being near the most-common object size) [15:07:53] 10Traffic, 06Operations, 13Patch-For-Review: Re-balance North American traffic in DNS for codfw - https://phabricator.wikimedia.org/T114659#1702350 (10Dzahn) @bblack This looks like it's resolved since those merges? [15:08:52] 10Traffic, 06Operations, 13Patch-For-Review: Re-balance North American traffic in DNS for codfw - https://phabricator.wikimedia.org/T114659#2636998 (10BBlack) 05Open>03Resolved a:03BBlack Well, there's long term stuff we can do better, but I guess no solid reason to keep this open in the task sense. [15:09:36] but really I think there's a straight up bug in how LRU works in our particular case [15:10:14] and since neither file nor malloc have special LRU code (they use the core code)... 
either their behaviors still manage to trigger a difference, or the difference is simply in the total size [15:10:36] because our FEs have a much higher traffic rate and see the same objects passing through, but never have LRU problems with a much smaller malloc area... [15:11:01] sorry, I keep putting random buffer pastes in my typing heh [15:11:15] watch out! :) [15:11:18] this trip is the first time I've really had to spend time with the macbook open since switching to linux [15:11:36] and the touchpad driver sucks and keeps pasting things and scrolling the screen while I'm typing :P [15:12:13] I may just have to bind an i3 hotkey to enable/disable the touchpad, because I haven't found the right magic settings for synclient [15:15:11] heh, done [15:15:18] i3 is so awesome :) [15:15:52] yeah I'm lost without it [15:16:41] the biggest thing I miss from ez-mode-mac to this new setup is easy wifi stuff [15:16:55] you can get gnome applets in i3 :) [15:17:05] yeah but I don't even install the gnome libs :) [15:17:16] xfce, whatever, applets! [15:17:17] I've avoided all the gnome/kde and other heavyweight stuff completely [15:17:27] yeah I need to find some applet that can manage it [15:17:29] nmcli sucks [15:17:33] y [15:18:51] trying wicd [15:25:19] let's see how many things I'm gonna break with the fe_ be_ split for runtime_params [15:25:22] https://gerrit.wikimedia.org/r/310569 [15:28:11] another thing I probably already brought up yesterday [15:28:20] it's possible this can be solved with a smaller TTL, too [15:28:38] (maybe if TTL is low enough and pro-active eviction is a factor, we never face the same filled-lived-object scenario) [15:28:47] filled-live-objects, I mean [15:29:20] it's also possible this is still somehow related to lock contention on LRUs and blah blah, and that our insane purge rate is a factor [15:29:40] it's looking them all up in the object store after all, even though the vast majority are not for upload.wm.o at all [15:30:12] that's something that could use improving anyways. for now the "easiest" path is the vhtcpd regex filter, if we can figure out the insane levels of meta-meta-meta-escaping/quoting [15:31:23] bblack: could you please take a quick look at https://gerrit.wikimedia.org/r/#/c/310569/? It seems fine according to pcc [15:31:35] (vhtcpd has a regex-filter argument to limit to upload.wm.o on cache_upload, but there's no puppetization to use it. I tried once and got annoyed at how hard it was to quote through puppet->ruby->init->shell->regex or whatever) [15:32:39] LGTM, probably pcc is a better judge than me anyways :) [15:33:20] off to the office... [15:33:47] oh it's 8:30 there! crazy [15:33:55] see you later [16:24:45] 10Wikimedia-Apache-configuration, 06Operations: Apache mod_status metrics only available in ganglia - https://phabricator.wikimedia.org/T141424#2637306 (10elukey) Since we are super fancy let's go directly to Prometheus! [16:56:41] 10Wikimedia-Apache-configuration, 06Operations: Apache mod_status metrics only available in ganglia - https://phabricator.wikimedia.org/T141424#2637489 (10fgiunchedi) @elukey yes! 
See also {T145659} and related [16:59:35] 10Traffic, 06Operations: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2637510 (10ema) [17:00:03] 10Traffic, 06Operations: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#2637525 (10ema) p:05Triage>03High [17:00:15] forgive me for the title ^ [17:08:03] but yeah, I thought this funny bug deserved its own ticket [17:27:22] so, news of today: [17:27:34] - upload do_stream=true on v3 only [17:27:48] - lru_interval and nuke_limit bumps restricted to upload backend [17:28:21] - bug happened only on one machine, documented on T145661 [17:28:22] T145661: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661 [17:28:48] - raising nuke_limit while the issue occurs does not really help [17:31:10] tomorrow I'd like to try converting https://gerrit.wikimedia.org/r/#/c/310540/1/bin/varnishd/cache/cache_expire.c into a stap probe to avoid restarting varnishd and waiting another 6 days for the problem to happen [17:31:25] alternatively, I could just make a much smaller -sfile [18:39:19] 07HTTPS, 10Traffic, 06Operations, 06Performance-Team, and 2 others: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2454047 (10Dzahn) current access log is only about 9% https and a chunk of that is all Catchpoint monitoring. About 32% are from python-requests UA, of which... [18:42:48] 07HTTPS, 10Traffic, 06Operations, 06Performance-Team, and 2 others: HTTPS-only for stream.wikimedia.org - https://phabricator.wikimedia.org/T140128#2637840 (10Dzahn) >>! In T140128#2625078, @AlexMonk-WMF wrote: > Can you filter those access logs down to labs entries only (208.80.155.128 - 208.80.155.255),... [22:18:45] bblack, hey, we're seeing CL:0 again. [22:19:06] bblack, on https://upload.wikimedia.org/wikipedia/commons/thumb/1/18/Bartagame_fcm.jpg/800px-Bartagame_fcm.jpg but only from one cache node? [22:19:34] bblack, > Content-Length: 0 [22:19:37] bblack, > X-Cache: cp1048 hit/2, cp2020 hit/8, cp4007 miss, cp4005 miss [22:19:43] I don't know how to parse that into a data center line again. [22:19:55] We're seeing it on *a lot* of images.
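On the question above of mapping that X-Cache line to datacenters: a minimal C sketch, assuming the WMF cache-host naming convention of the time (cp1xxx = eqiad, cp2xxx = codfw, cp3xxx = esams, cp4xxx = ulsfo) and that entries are listed origin-side first, so the rightmost entry is the cache closest to the client. dc_for_host is an illustrative helper, not an existing tool.

```c
/*
 * Sketch: map an X-Cache header like
 *   "cp1048 hit/2, cp2020 hit/8, cp4007 miss, cp4005 miss"
 * to per-datacenter entries. Assumes the cpNxxx -> DC convention noted
 * above; the rightmost entry is the frontend the client actually hit.
 */
#include <stdio.h>
#include <string.h>

static const char *dc_for_host(const char *host)
{
    if (strncmp(host, "cp1", 3) == 0) return "eqiad";
    if (strncmp(host, "cp2", 3) == 0) return "codfw";
    if (strncmp(host, "cp3", 3) == 0) return "esams";
    if (strncmp(host, "cp4", 3) == 0) return "ulsfo";
    return "unknown";
}

int main(void)
{
    char xcache[] = "cp1048 hit/2, cp2020 hit/8, cp4007 miss, cp4005 miss";
    char *saveptr = NULL;

    /* Walk the comma-separated entries left to right (origin-most first). */
    for (char *tok = strtok_r(xcache, ",", &saveptr); tok != NULL;
         tok = strtok_r(NULL, ",", &saveptr)) {
        while (*tok == ' ')
            tok++;                      /* trim the leading space */
        char host[32], status[32];
        if (sscanf(tok, "%31s %31s", host, status) == 2)
            printf("%-8s %-7s %s\n", host, dc_for_host(host), status);
    }
    return 0;
}
```

For the example header it prints cp1048/eqiad hit/2, cp2020/codfw hit/8, cp4007/ulsfo miss, cp4005/ulsfo miss -- i.e. both ulsfo layers missed and the response came from a codfw backend hit.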