[13:16:16] are we still using varnishncsa?
[13:16:46] I can't find anything in our repo using varnish::logging https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/manifests/logging.pp
[13:17:54] and yeah basically I found another FD leak similar to https://phabricator.wikimedia.org/T135700 but this time caused by varnishncsa (on cp4001)
[13:18:22] so I was wondering whether we can just make sure varnishncsa is stopped or if it makes sense to investigate further
[13:19:16] ema: looks like the last use of it was removed with removal of the last bits of analytics-related udp2log stuff
[13:19:27] f4b9377c931e5688cd0f3b0e749f095c47539cc0 aka https://gerrit.wikimedia.org/r/#/c/259051/
[13:19:36] so yeah we can probably nuke that
[13:19:54] might be interesting to make sure it's not still deployed somewhere as a running service JIC first
[13:20:11] oh, if I kept reading, apparently it is?
[13:21:00] well yeah on some machines it's running, perhaps because of the new v4 packages?
[13:21:05] no idea
[13:21:36] the machines are:
[13:21:40] cp4019.ulsfo.wmnet: 2328
[13:21:40] cp4020.ulsfo.wmnet: 2343
[13:21:40] cp4002.ulsfo.wmnet: 22070
[13:21:40] cp4001.ulsfo.wmnet: 16713
[13:21:49] so perhaps the ulsfo reboot?
[13:22:04] so yeah, I'd say nuke it....
[13:22:25] we have a few related bits to clean up, and then we can do some kind of service-disable in the main varnish module for it
[13:22:51] git grep varnishncsa turns up some other random bits to nuke too
[13:23:03] like the icinga stuff
[13:23:09] (nrpe/nagios stuff, logging::config, initscripts, etc)
[13:23:36] can I just go ahead and remove all of it?
[13:23:52] yeah I think so, none of it's ref'd anywhere in practice
[13:24:01] great
[13:24:18] and then post-removal, do a commit to disable any stock varnishncsa service from the package somewhere appropriate
[13:24:25] and make sure the few running ones are runtime-dead
[13:24:29] alright!
[13:24:51] in other news: https://github.com/varnishcache/varnish-cache/pull/1966
[13:25:59] yeah they've been working hard on cleaning up our issues and several others. the pattern looks like they're gearing up for a new 4.1.x point release
[13:26:20] yay!
[13:49:13] 10Traffic, 06Analytics-Kanban, 06Operations: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2333849 (10elukey) Tried for the first time to query Hadoop data via Hive, so this info will need to be validated, but I run a script to find how many holes...
[13:49:52] bblack: https://gerrit.wikimedia.org/r/#/c/291226/
[14:07:54] 10Traffic, 06Operations, 13Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2333902 (10BBlack) Today cp3048's at 156G virt + 75G resident and looks pretty stable. So it was a definite improvement, but there's still a lot of waste. As suggested ab...
[14:20:21] ema: for the service stop in https://gerrit.wikimedia.org/r/#/c/291238 , maybe add enable => false too so it doesn't start on boot?
[14:24:07] bblack: yep, CR updated
[14:26:39] <_joe_> we don't use varnishncsa?
[14:26:49] <_joe_> I kind of remembered I used it :P
[14:29:15] _joe_: we shouldn't be running the service
[14:30:05] running the program yourself to look at the logs is another story :)
[14:37:19] re: the ttl-reduction stuff, I think the next move is probably to implement Surrogate-Control in VCL (just for our inter-cache purposes) and start using it to align cache lifetimes.
[14:38:00] basically this stuff, roughly: https://gerrit.wikimedia.org/r/291244
[14:38:03] oops wrong paste
[14:38:09] https://phabricator.wikimedia.org/T50835#2267566
[14:38:19] new keybindings? :)
[14:39:24] well, moving back to X-style copypaste + focus has been easy actually, I spent many years like that before the mac :)
[14:39:49] what's tripping me up is that selecting URL text out of Chromium's address bar doesn't set it for middle-button pasting for some reason
[14:39:57] I have to do an explicit copy hotkey over in Chromium
[14:40:50] uh, strange. it should work
[14:41:25] sometimes it does! :)
[14:41:30] I'm not sure what the trigger is
[14:41:47] }
[14:42:04] <_joe_> bblack: yeah both chrome and chromium fuck up the traditional X clipboard
[14:42:25] ah ok... so it works consistently if I actually manually select the text in the address bar (as hold and drag across the text)
[14:42:48] but when you just click left-button once in the address bar, it highlights the whole thing, but that doesn't set it for X-paste
[14:43:18] true!
[14:43:18] http://xkcd.com/1686/
[14:43:26] ah triple-click does it (double picks out a word)
[14:45:03] it's nice being able to only touch the mouse for copypaste
[14:46:59] well except in the browser
[14:47:21] maybe I should try one of the chrome plugins that does vi-like keymappings
[14:47:32] I'm using vimperator on FF and it's not bad
[14:47:53] https://vimium.github.io/
[14:48:17] it is a full-on disaster when you're using someone else's browser though
[14:48:25] heh perfect, it even fixes the earlier problem too
[14:48:28] yy:Copy the current URL to the clipboard
[14:50:08] anyways back on the earlier topic about surrogate-control
[14:50:37] I'd like to aim towards a point where we calculate beresp ttl/grace once at the backend-most varnish, and then set them via Surrogate-Control up the chain
[14:51:21] so even if the applayer Cache-Control says something has a 1 month max-age, if our config says varnish caps the TTL at 7 days + 7 more days of grace or whatever.. the S-C header transmits that to the upper-layer varnishes
[14:52:21] right now our TTL-capping just sets obj.ttl in the local varnish, but if CC allows a month, and our cap is 7 days, a nearly-expired object in a backend can get fetched into another backend and get a fresh 7-day life again.
[14:52:49] so today on text with FE cap at 1d, and BE caps at 7d, the theoretical maximum object life cap across layers is still 22 days in the ulsfo case
[14:53:03] 1d ulsfo-front + 7d ulsfo-back + 7d codfw-back + 7d eqiad-back
[14:54:33] parsing C-C/S-C and setting S-C in native VCL will be a little ugly I think
[14:54:47] but probably doable
[14:55:35] I guess we can rely on varnish's interpretation of Expires+C-C at the backendmost, and then use obj.ttl to set S-C's maxage, and set some standard grace beyond it in stale-while-revalidate, etc...
[14:56:04] and in upper varnishes, have something in vcl_fetch interpret Surrogate-Control to explicitly set obj.ttl + obj.grace
[14:56:23] well I missed a step in there:
[14:56:49] I guess we can rely on varnish's interpretation of Expires+C-C at the backendmost, **then cap obj.ttl at our desired max**, and then use obj.ttl to set S-C's maxage, and set some standard grace beyond it in stale-while-revalidate, etc...
[14:58:37] ugh that doesn't work either I guess without further hacks. A cache object with obj.ttl=7d still has obj.ttl=7d 6 days later.
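(editor's note: as a rough illustration of the backend-most half of the plan above, a minimal Varnish 4 VCL sketch might look like the following. The 7d cap/grace values and the exact Surrogate-Control layout are placeholder assumptions for illustration, not the VCL that was actually deployed.)

    vcl 4.0;

    sub vcl_backend_response {
        # Varnish has already derived beresp.ttl from Expires/Cache-Control;
        # cap it at our own maximum (7d is a placeholder value)
        if (beresp.ttl > 7d) {
            set beresp.ttl = 7d;
        }
        set beresp.grace = 7d;

        # advertise the capped lifetime to the upper-layer varnishes; a
        # DURATION concatenated into a header renders as seconds, e.g. "604800.000"
        set beresp.http.Surrogate-Control = "max-age=" + beresp.ttl
            + ", stale-while-revalidate=" + beresp.grace;
    }

(since the header is only meant for inter-cache use, it would presumably also be unset in vcl_deliver before responses leave the outermost frontend.)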
[14:58:47] so we actually have to look at the Age field too
[14:59:35] I guess on reception in a non-backendmost varnish
[15:00:15] so in those varnishes, if Surrogate-Control max-age is set, we override obj.ttl with S-C:maxage - beresp.Age or something like that
[15:02:26] and then we still need to sort out grace-mode. the long-term goal on the TTL+grace stuff is to get our "normal" TTL down way shorter (say 1d for the whole stack), but have a much longer grace period (say 1w) to cover operational issues (like taking a cache DC offline for a few days without effectively losing all objects)
[15:03:00] the problem with just blindly setting e.g. 1d TTL + 7d grace is the 7d grace would apply all the time, not just in failure cases
[15:04:05] to fix that, we probably need healthchecks of lvs service IPs from varnish
[15:04:34] so that we can adjust request-side grace depending on backend health, whereas the obj.grace sets the overall grace maximum for cache eviction, basically.
[15:05:58] 10Traffic, 06Analytics-Kanban, 06Operations: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2334115 (10elukey) Modified a bit the script to print host and timestamp related to the sequence number right before the hole. Here some snippets of the res...
[15:11:17] don't we do that already with std.healthy(req.backend_hint)?
[15:11:54] well, yeah, but we only define health checks/probes for cache<->cache, not cache<->app (well, except a couple oddball cases on misc)
[15:12:08] oh I see
[15:12:37] we didn't do that because LVS is managing their health and depooling servers, etc
[15:14:40] but still, we could use a health probe there that's more-forgiving just to check for "can't reach the LVS service at all"
[15:15:05] (more-forgiving so it doesn't trip up on some transient failure of just one or a few of LVS's randomized backend servers before LVS corrects via depool)
[15:15:39] arguably we could look at 500/503 for grace at upper layers too, since they can't see the applayer health regardless
[15:16:19] e.g. if a varnish-frontend gets a 503 (well double-503, since we do retry a 503 once), try a stale grace object at that point, since the 503 from the backend could be because the underlying applayer is down
[15:23:47] heads-up traffic team
[15:23:52] I'm draining esams!
[15:24:20] I've been doing most stuff without downtime, but we're about to do even riskier moves now
[15:24:23] so why not
[15:26:12] ok
[15:37:19] 10Traffic, 06Analytics-Kanban, 06Operations: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2334176 (10Nuria) > The loss seems to happen around the hour, but I don't have a good idea about the why (logrotate afaik happens daily). You probably have...
[15:50:54] trying to trim the trailing \t in localssl.erb with partial success
[15:51:05] https://puppet-compiler.wmflabs.org/2964/cp1008.wikimedia.org/
[15:51:39] for some reason ssl_stapling on; keeps on being misaligned, but a bit less than before
[15:51:45] gotta love erb templates
[15:54:17] 10Traffic, 06Analytics-Kanban, 06Operations: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2334236 (10Ottomata) Oh ho ho, check this out. Looking at > `cp1061.eqiad.wmnet 2016-05-27T12:59:59 2514541` ``` ADD JAR /usr/lib/hive-hcatalog...
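(editor's note: the upper-layer half, overriding the TTL with Surrogate-Control max-age minus Age as discussed at 14:56/15:00 above, could be sketched roughly as below. vcl_backend_response is the Varnish 4 name for the vcl_fetch mentioned in the chat; the header layout is assumed to match the sketch further up, and a real implementation would also need to guard against negative results and missing headers.)

    vcl 4.0;
    import std;

    sub vcl_backend_response {
        # non-backendmost layers: if the lower varnish sent Surrogate-Control,
        # trust it instead of re-deriving a fresh TTL from Cache-Control, and
        # subtract Age so the object can't regain a full lifetime at every layer
        if (beresp.http.Surrogate-Control ~ "max-age=[0-9]+" && beresp.http.Age) {
            set beresp.ttl =
                std.duration(regsub(beresp.http.Surrogate-Control,
                    "^.*max-age=([0-9]+).*$", "\1") + "s", 0s)
                - std.duration(beresp.http.Age + "s", 0s);
            set beresp.grace =
                std.duration(regsub(beresp.http.Surrogate-Control,
                    "^.*stale-while-revalidate=([0-9]+).*$", "\1") + "s", 0s);
        }
    }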
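(editor's note: the health-driven grace idea, serving stale only when std.healthy(req.backend_hint) reports the backend down, is essentially the stock Varnish 4 grace pattern; a sketch with placeholder windows follows. It does not cover the double-503 fallback mentioned at 15:16.)

    vcl 4.0;
    import std;

    sub vcl_hit {
        if (obj.ttl >= 0s) {
            # object still within its TTL: normal hit
            return (deliver);
        }
        if (std.healthy(req.backend_hint)) {
            # backend (or its LVS service IP) looks healthy: allow only a
            # short grace window (30s is a placeholder)
            if (obj.ttl + 30s > 0s) {
                return (deliver);
            }
        } else {
            # probe says the backend is down: serve stale for up to the full
            # obj.grace set at fetch time (e.g. the 1w mentioned above)
            if (obj.ttl + obj.grace > 0s) {
                return (deliver);
            }
        }
        # nothing usable: fetch from the backend (return (miss) on 4.1+)
        return (fetch);
    }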
[15:56:30] 07HTTPS, 10Traffic, 06Operations, 07Browser-Support-Firefox: Secure connection failed when attempting to send POST request - https://phabricator.wikimedia.org/T134869#2334257 (10Elvey) Tried it. Disabling http2 as @BBlack /@Thibaut120094 suggested eliminates the error for me too. It seems to occur only if...
[16:25:27] bblack: would it make sense to split the countries that normally are mapped to esams 50% towards eqiad and 50% towards codfw as the second preferred DC in config-geo?
[16:25:52] with esams drained pretty much all the traffic goes to eqiad
[16:44:50] ema: i forget, was misc on varnish 4 for all of april?
[16:47:44] ottomata: nope, we upgraded the first server in may IIRC
[16:47:51] ah ok
[16:48:30] ema: codfw is farther away from the coast, so that would be a perf hit
[16:49:46] ottomata: I just checked on SAL, the first machine was upgraded on 2016-05-10
[16:51:44] paravoid: true, I was more thinking from a DC-load perspective
[16:53:35] hmmm ok, ema which machine was that?
[16:53:58] and when was the last one upgraded? (sorry i think you've told me this before)
[16:54:13] eqiad can handle it, so I don't see it as a big problem personally
[16:54:54] ottomata: cp3007 was the first one to be upgraded
[16:55:05] paravoid: fair enough :)
[16:55:28] ok
[16:58:00] ottomata: then we ran for a while with v4 and downgraded for a few days after the varnish bugs were discovered
[16:59:24] See SAL for the details, I've gotta go now. See you guys tomorrow!
[17:00:45] ok thanks!
[17:54:48] 10Traffic, 06Analytics-Kanban, 06Operations: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2334727 (10Ottomata) Ah, I was incorrect in my previous comment. The dt is the request timestamp, and the sequence number is not generated until the respon...
[19:11:40] bblack: cp3048 has puppet disabled, I see some "unpuppetized" chatter on SAL