[13:16:16] are we still using varnishncsa?
[13:16:46] I can't find anything in our repo using varnish::logging https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/manifests/logging.pp
[13:17:54] and yeah basically I found another FD leak similar to https://phabricator.wikimedia.org/T135700 but this time caused by varnishncsa (on cp4001)
[13:18:22] so I was wondering whether we can just make sure varnishncsa is stopped or if it makes sense to investigate further
[13:19:16] ema: looks like the last use of it was removed with removal of the last bits of analytics-related udp2log stuff
[13:19:27] f4b9377c931e5688cd0f3b0e749f095c47539cc0 aka https://gerrit.wikimedia.org/r/#/c/259051/
[13:19:36] so yeah we can probably nuke that
[13:19:54] might be interesting to make sure it's not still deployed somewhere as a running service JIC first
[13:20:11] oh, if I kept reading, apparently it is?
[13:21:00] well yeah on some machines it's running, perhaps because of the new v4 packages?
[13:21:05] no idea
[13:21:36] the machines are:
[13:21:40] cp4019.ulsfo.wmnet: 2328
[13:21:40] cp4020.ulsfo.wmnet: 2343
[13:21:40] cp4002.ulsfo.wmnet: 22070
[13:21:40] cp4001.ulsfo.wmnet: 16713
[13:21:49] so perhaps the ulsfo reboot?
[13:22:04] so yeah, I'd say nuke it....
[13:22:25] we have a few related bits to clean up, and then we can do some kind of service-disable in the main varnish module for it
[13:22:51] git grep varnishncsa turns up some other random bits to nuke too
[13:23:03] like the icinga stuff
[13:23:09] (nrpe/nagios stuff, logging::config, initscripts, etc)
[13:23:36] can I just go ahead and remove all of it?
[13:23:52] yeah I think so, none of it's ref'd anywhere in practice
[13:24:01] great
[13:24:18] and then post-removal, do a commit to disable any stock varnishncsa service from the package somewhere appropriate
[13:24:25] and make sure the few running ones are runtime-dead
[13:24:29] alright!
[13:24:51] in other news: https://github.com/varnishcache/varnish-cache/pull/1966
[13:25:59] yeah they've been working hard on cleaning up our issues and several others. the pattern looks like they're gearing up for a new 4.1.x point release
[13:26:20] yay!
[13:49:13] 10Traffic, 06Analytics-Kanban, 06Operations: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2333849 (10elukey) Tried for the first time to query Hadoop data via Hive, so this info will need to be validated, but I run a script to find how many holes...
[13:49:52] bblack: https://gerrit.wikimedia.org/r/#/c/291226/
[14:07:54] 10Traffic, 06Operations, 13Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2333902 (10BBlack) Today cp3048's at 156G virt + 75G resident and looks pretty stable. So it was a definite improvement, but there's still a lot of waste. As suggested ab...
[14:20:21] ema: for the service stop in https://gerrit.wikimedia.org/r/#/c/291238 , maybe add enable => false too so it doesn't start on boot?
[14:24:07] bblack: yep, CR updated
[14:26:39] <_joe_> we don't use varnishncsa?
[14:26:49] <_joe_> I kind of remembered I used it :P
[14:29:15] _joe_: we shouldn't be running the service
[14:30:05] running the program yourself to look at the logs is another story :)
[14:37:19] re: the ttl-reduction stuff, I think the next move is probably to implement Surrogate-Control in VCL (just for our inter-cache purposes) and start using it to align cache lifetimes.
[14:38:00] basically this stuff, roughly: https://gerrit.wikimedia.org/r/291244
[14:38:03] oops wrong paste
[14:38:09] https://phabricator.wikimedia.org/T50835#2267566
[14:38:19] new keybindings? :)
[14:39:24] well, moving back to X-style copypaste + focus has been easy actually, I spent many years like that before the mac :)
[14:39:49] what's tripping me up is that selecting URL text out of Chromium's address bar doesn't set it for middle-button pasting for some reason
[14:39:57] I have to do an explicit copy hotkey over in Chromium
[14:40:50] uh, strange. it should work
[14:41:25] sometimes it does! :)
[14:41:30] I'm not sure what the trigger is
[14:41:47] }
[14:42:04] <_joe_> bblack: yeah both chrome and chromium fuck up the traditional X clipboard
[14:42:25] ah ok... so it works consistently if I actually manually select the text in the address bar (as hold and drag across the text)
[14:42:48] but when you just click left-button once in the address bar, it highlights the whole thing, but that doesn't set it for X-paste
[14:43:18] true!
[14:43:18] http://xkcd.com/1686/
[14:43:26] ah triple-click does it (double picks out a word)
[14:45:03] it's nice being able to only touch the mouse for copypaste
[14:46:59] well except in the browser
[14:47:21] maybe I should try one of the chrome plugins that does vi-like keymappings
[14:47:32] I'm using vimperator on FF and it's not bad
[14:47:53] https://vimium.github.io/
[14:48:17] it is a full-on disaster when you're using someone else's browser though
[14:48:25] heh perfect, it even fixes the earlier problem too
[14:48:28] yy:Copy the current URL to the clipboard
[14:50:08] anyways back on the earlier topic about surrogate-control
[14:50:37] I'd like to aim towards a point where we calculate beresp ttl/grace once at the backend-most varnish, and then set them via Surrogate-Control up the chain
[14:51:21] so even if the applayer Cache-Control says something has a 1 month max-age, if our config says varnish caps the TTL at 7 days + 7 more days of grace or whatever.. the S-C header transmits that to the upper-layer varnishes
[14:52:21] right now our TTL-capping just sets obj.ttl in the local varnish, but if CC allows a month, and our cap is 7 days, a nearly-expired object in a backend can get fetched into another backend and get a fresh 7-day life again.
[14:52:49] so today on text with FE cap at 1d, and BE caps at 7d, the theoretical maximum object life cap across layers is still 22 days in the ulsfo case
[14:53:03] 1d ulsfo-front + 7d ulsfo-back + 7d codfw-back + 7d eqiad-back
[14:54:33] parsing C-C/S-C and setting S-C in native VCL will be a little ugly I think
[14:54:47] but probably doable
[14:55:35] I guess we can rely on varnish's interpretation of Expires+C-C at the backendmost, and then use obj.ttl to set S-C's maxage, and set some standard grace beyond it in stale-while-revalidate, etc...
[14:56:04] and in upper varnishes, have something in vcl_fetch interpret Surrogate-Control to explicitly set obj.ttl + obj.grace
[14:56:23] well I missed a step in there:
[14:56:49] I guess we can rely on varnish's interpretation of Expires+C-C at the backendmost, **then cap obj.ttl at our desired max**, and then use obj.ttl to set S-C's maxage, and set some standard grace beyond it in stale-while-revalidate, etc...
[14:58:37] ugh that doesn't work either I guess without further hacks. A cache object with obj.ttl=7d still has obj.ttl=7d 6 days later.
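(editor's note: as a rough illustration of the backend-most half of the plan above, a minimal Varnish 4 VCL sketch might look like the following. The 7d cap/grace values and the exact Surrogate-Control layout are placeholder assumptions for illustration, not the VCL that was actually deployed.)

    vcl 4.0;

    sub vcl_backend_response {
        # Varnish has already derived beresp.ttl from Expires/Cache-Control;
        # cap it at our own maximum (7d is a placeholder value)
        if (beresp.ttl > 7d) {
            set beresp.ttl = 7d;
        }
        set beresp.grace = 7d;

        # advertise the capped lifetime to the upper-layer varnishes; a
        # DURATION concatenated into a header renders as seconds, e.g. "604800.000"
        set beresp.http.Surrogate-Control = "max-age=" + beresp.ttl
            + ", stale-while-revalidate=" + beresp.grace;
    }

(since the header is only meant for inter-cache use, it would presumably also be unset in vcl_deliver before responses leave the outermost frontend.)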
[14:58:47] so we actually have to look at the Age field too
[14:59:35] I guess on reception in a non-backendmost varnish
[15:00:15] so in those varnishes, if Surrogate-Control max-age is set, we override obj.ttl with S-C:maxage - beresp.Age or something like that
[15:02:26] and then we still need to sort out grace-mode. the long-term goal on the TTL+grace stuff is to get our "normal" TTL down way shorter (say 1d for the whole stack), but have a much longer grace period (say 1w) to cover operational issues (like taking a cache DC offline for a few days without effectively losing all objects)
[15:03:00] the problem with just blindly setting e.g. 1d TTL + 7d grace is the 7d grace would apply all the time, not just in failure cases
[15:04:05] to fix that, we probably need healthchecks of lvs service IPs from varnish
[15:04:34] so that we can adjust request-side grace depending on backend health, whereas the obj.grace sets the overall grace maximum for cache eviction, basically.
[15:05:58] 10Traffic, 06Analytics-Kanban, 06Operations: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2334115 (10elukey) Modified a bit the script to print host and timestamp related to the sequence number right before the hole. Here some snippets of the res...
[15:11:17] don't we do that already with std.healthy(req.backend_hint)?
[15:11:54] well, yeah, but we only define health checks/probes for cache<->cache, not cache<->app (well, except a couple oddball cases on misc)
[15:12:08] oh I see
[15:12:37] we didn't do that because LVS is managing their health and depooling servers, etc
[15:14:40] but still, we could use a health probe there that's more-forgiving just to check for "can't reach the LVS service at all"
[15:15:05] (more-forgiving so it doesn't trip up on some transient failure of just one or a few of LVS's randomized backend servers before LVS corrects via depool)
[15:15:39] arguably we could look at 500/503 for grace at upper layers too, since they can't see the applayer health regardless
[15:16:19] e.g. if a varnish-frontend gets a 503 (well double-503, since we do retry a 503 once), try a stale grace object at that point, since the 503 from the backend could be because the underlying applayer is down
[15:23:47] heads-up traffic team
[15:23:52] I'm draining esams!
[15:24:20] I've been doing most stuff without downtime, but we're about to do even riskier moves now
[15:24:23] so why not
[15:26:12] ok
[15:37:19] 10Traffic, 06Analytics-Kanban, 06Operations: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2334176 (10Nuria) > The loss seems to happen around the hour, but I don't have a good idea about the why (logrotate afaik happens daily). You probably have...
[15:50:54] trying to trim the trailing \t in localssl.erb with partial success
[15:51:05] https://puppet-compiler.wmflabs.org/2964/cp1008.wikimedia.org/
[15:51:39] for some reason ssl_stapling on; keeps on being misaligned, but a bit less than before
[15:51:45] gotta love erb templates
[15:54:17] 10Traffic, 06Analytics-Kanban, 06Operations: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2334236 (10Ottomata) Oh ho ho, check this out. Looking at > `cp1061.eqiad.wmnet 2016-05-27T12:59:59 2514541` ``` ADD JAR /usr/lib/hive-hcatalog...
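(editor's note: the upper-layer half, overriding the TTL with Surrogate-Control max-age minus Age as discussed at 14:56/15:00 above, could be sketched roughly as below. vcl_backend_response is the Varnish 4 name for the vcl_fetch mentioned in the chat; the header layout is assumed to match the sketch further up, and a real implementation would also need to guard against negative results and missing headers.)

    vcl 4.0;
    import std;

    sub vcl_backend_response {
        # non-backendmost layers: if the lower varnish sent Surrogate-Control,
        # trust it instead of re-deriving a fresh TTL from Cache-Control, and
        # subtract Age so the object can't regain a full lifetime at every layer
        if (beresp.http.Surrogate-Control ~ "max-age=[0-9]+" && beresp.http.Age) {
            set beresp.ttl =
                std.duration(regsub(beresp.http.Surrogate-Control,
                    "^.*max-age=([0-9]+).*$", "\1") + "s", 0s)
                - std.duration(beresp.http.Age + "s", 0s);
            set beresp.grace =
                std.duration(regsub(beresp.http.Surrogate-Control,
                    "^.*stale-while-revalidate=([0-9]+).*$", "\1") + "s", 0s);
        }
    }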
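(editor's note: the health-driven grace idea, serving stale only when std.healthy(req.backend_hint) reports the backend down, is essentially the stock Varnish 4 grace pattern; a sketch with placeholder windows follows. It does not cover the double-503 fallback mentioned at 15:16.)

    vcl 4.0;
    import std;

    sub vcl_hit {
        if (obj.ttl >= 0s) {
            # object still within its TTL: normal hit
            return (deliver);
        }
        if (std.healthy(req.backend_hint)) {
            # backend (or its LVS service IP) looks healthy: allow only a
            # short grace window (30s is a placeholder)
            if (obj.ttl + 30s > 0s) {
                return (deliver);
            }
        } else {
            # probe says the backend is down: serve stale for up to the full
            # obj.grace set at fetch time (e.g. the 1w mentioned above)
            if (obj.ttl + obj.grace > 0s) {
                return (deliver);
            }
        }
        # nothing usable: fetch from the backend (return (miss) on 4.1+)
        return (fetch);
    }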
[15:56:30] 07HTTPS, 10Traffic, 06Operations, 07Browser-Support-Firefox: Secure connection failed when attempting to send POST request - https://phabricator.wikimedia.org/T134869#2334257 (10Elvey) Tried it. Disabling http2 as @BBlack /@Thibaut120094 suggested eliminates the error for me too. It seems to occur only if...
[16:25:27] bblack: would it make sense to split the countries that normally are mapped to esams 50% towards eqiad and 50% towards codfw as the second preferred DC in config-geo?
[16:25:52] with esams drained pretty much all the traffic goes to eqiad
[16:44:50] ema: i forget, was misc on varnish 4 for all of april?
[16:47:44] ottomata: nope, we upgraded the first server in may IIRC
[16:47:51] ah ok
[16:48:30] ema: codfw is farther away from the coast, so that would be a perf hit
[16:49:46] ottomata: I just checked on SAL, the first machine was upgraded on 2016-05-10
[16:51:44] paravoid: true, I was more thinking from a DC-load perspective
[16:53:35] hmmm ok, ema which machine was that?
[16:53:58] and when was the last one upgraded? (sorry i think you've told me this before)
[16:54:13] eqiad can handle it, so I don't see it as a big problem personally
[16:54:54] ottomata: cp3007 was the first one to be upgraded
[16:55:05] paravoid: fair enough :)
[16:55:28] ok
[16:58:00] ottomata: then we ran for a while with v4 and downgraded for a few days after the varnish bugs were discovered
[16:59:24] See SAL for the details, I've gotta go now. See you guys tomorrow!
[17:00:45] ok thanks!
[17:54:48] 10Traffic, 06Analytics-Kanban, 06Operations: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2334727 (10Ottomata) Ah, I was incorrect in my previous comment. The dt is the request timestamp, and the sequence number is not generated until the respon...
[19:11:40] bblack: cp3048 has puppet disabled, I see some "unpuppetized" chatter on SAL