[00:00:07] but yeah the batch size would have to work for the worst case in any dc
[00:00:12] I'm guessing that there are appservers in codfw for a reason though :)
[00:01:22] I'm sure we will get there
[00:10:04] bd808: https://gist.github.com/atdt/62a09a9102104c43a50b
[00:10:33] i forgot my high school math so i roll the dice to estimate probability instead of plugging numbers into an equation :P
[00:13:38] do we only care about restarts for servers in pybal? Or is that just the only place we care about messing with apache?
[00:14:16] * bd808 has a python snippet that finds the lo:LVS addr
[00:14:38] i think it's OK to restart HHVM on all the jobrunners at once, for example
[00:14:42] because they're not user facing
[00:15:10] so it's not really an issue if all jobrunners aren't processing jobs for 5 seconds
[00:15:10] do we have a way to "greaceful" hhvm?
[00:15:23] * bd808 can't spell
[00:15:31] we do, let me dig up the relevant config vars
[00:15:43] hhvm.server.graceful_shutdown_wait
[00:15:47] let me see the code that consults that
[00:16:45] Our apache config seems to set GracefulShutdownTimeout to 5 seconds which seems a bit short
[00:17:54] yeah, doubling that wouldn't be the worst idea
[00:18:28] bd808: does 10 seem reasonable to you?
[00:18:53] what's our 90% response time?
[00:21:02] we have 95% in graphite: http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1434759637.731&target=varnish.eqiad.backends.ipv4_10_2_2_22.POST.p95&target=varnish.eqiad.backends.ipv4_10_2_2_22.GET.p95&target=varnish.eqiad.backends.ipv4_10_2_2_1.GET.median&target=varnish.eqiad.backends.ipv4_10_2_2_1.POST.p95
[00:21:41] POSTs to the normal (non-API) app servers are slowest and p95 is in the 600-700 range
[00:21:44] ms
[00:22:09] oh
[00:22:16] 5s might be tons then?
[00:23:00] well, p99 is different story: http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1434759768.17&target=varnish.eqiad.backends.ipv4_10_2_2_22.POST.p99&target=varnish.eqiad.backends.ipv4_10_2_2_22.GET.p99&target=varnish.eqiad.backends.ipv4_10_2_2_1.GET.p99&target=varnish.eqiad.backends.ipv4_10_2_2_1.POST.p99
[00:23:35] but we're not going to wait 30 seconds for someone to generate their gigantic pdf
[00:25:00] I guess if we've been using 5s until now we can stick with it and just be mindful of what we see on the frontend
[00:25:30] my dinner guests just showed up so I'll come back to this tomorrow
[00:25:43] thanks for the input
[00:26:15] thanks for the chat!
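The python snippet mentioned at 00:14:16 is not included in the log. A minimal sketch of the same idea, assuming the LVS service address is bound to the loopback interface under a label such as lo:LVS (as it typically is on LVS realservers), might look like the following; the function name and output text are illustrative, not the actual snippet:

    # Hypothetical sketch (not bd808's snippet): find service IPs bound to
    # the loopback interface under a label such as "lo:LVS". A host with no
    # such address is presumably not an LVS realserver behind pybal.
    import subprocess

    def lvs_service_ips(label='lo:LVS'):
        out = subprocess.check_output(
            ['ip', '-o', '-4', 'addr', 'show', 'dev', 'lo']).decode()
        ips = []
        for line in out.splitlines():
            # a labeled line looks roughly like:
            # 1: lo    inet 10.2.2.1/32 scope global lo:LVS\ valid_lft ...
            if label in line:
                ips.append(line.split()[3].split('/')[0])
        return ips

    if __name__ == '__main__':
        ips = lvs_service_ips()
        if ips:
            print('LVS realserver, service IPs: %s' % ', '.join(ips))
        else:
            print('no lo:LVS address found')

This matches the question above about only caring about restarts for servers in pybal: hosts without the labeled loopback address (e.g. non-user-facing ones) could be restarted without coordinating with the load balancer.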
[04:27:02] (#hhvm) porting something from zend->hhvm, whats the best choice for a zend_llist replacement?
[04:27:15] teehee, he's going to port the fast strtr() implementation to hhvm
[04:27:30] i knew that if i dangled it he would
[04:27:51] *evil laughter*
[04:44:29] ori: It looks like HHVM does a graceful shutdown via FastCGIServer::stop() whenever the config is set
[04:44:47] if we graceful apache though I don't think we will need it
[04:45:06] right
[04:45:12] since there will be no route back to the client anyway
[04:45:18] yep
[04:45:29] when we can signal pybal though it would be useful
[04:46:23] I think I've got something ready to test but now I need to piddle with my test vm to make it possible to actually try it out
[04:46:43] and sadly my test vm will need a lot of changes for this
[04:48:35] i know you just LOL now when i say "we'll test in production", but hear me out --
[04:48:51] we can depool 10-20 servers for days at a time
[04:48:55] and we can make them a dsh group
[04:49:21] can we parametrize the dsh group scap deploys to? make it 'mediawiki-installation' by default but make it customizable
[04:49:35] yeah that would be pretty easy
[04:50:00] when gabriel turned off the parsoid extension load on the api cluster was nearly halved IIRC
[04:50:09] we can definitely depool a small group
[04:50:23] k
[04:51:06] thirded
[04:51:12] http://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&c=API+application+servers+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
[04:51:19] ish
[04:51:57] heh
[04:52:15] that yearly cpu graph is interesting
[04:52:45] yeah there you go, hhvm cut load by a lot too
[04:52:52] and we _added_ api servers since then
[04:52:59] * greg-g nods
[04:53:12] I've heard mobile is the future or something
[04:58:21] oh, duh
[04:58:31] there's a whole cluster of depooled app servers
[04:58:33] all of codfw
[04:58:49] problem solved
[05:07:02] sent an email to ops@ asking for a green light
[05:09:54] * greg-g is sad ori didn't say thirded
[05:10:02] heh
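The "customizable dsh group" idea at 04:49:21 could look roughly like the sketch below: a deploy wrapper that targets the mediawiki-installation dsh group by default but can be pointed at a smaller group of depooled test servers. The flag name, group-file path, and plain ssh loop are assumptions for illustration only, not actual scap behaviour:

    # Hypothetical sketch of a deploy wrapper with a customizable dsh group.
    import argparse
    import subprocess

    DEFAULT_GROUP = 'mediawiki-installation'

    def read_dsh_group(name, dsh_dir='/etc/dsh/group'):
        # Read the host list for a dsh group, ignoring blanks and comments.
        with open('%s/%s' % (dsh_dir, name)) as f:
            return [line.strip() for line in f
                    if line.strip() and not line.startswith('#')]

    def run_on_group(group, command):
        # Run `command` on every host in the group, one host at a time.
        for host in read_dsh_group(group):
            print('==> %s' % host)
            subprocess.call(['ssh', host, command])

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--dsh-group', default=DEFAULT_GROUP,
                            help='dsh group to target (default: %(default)s)')
        parser.add_argument('command', help='command to run on each host')
        args = parser.parse_args()
        run_on_group(args.dsh_group, args.command)

With a group made up of the depooled servers discussed above (or the codfw cluster mentioned at 04:58:31), invocation might look like: wrapper.py --dsh-group depooled-test 'some command', while the default keeps normal deploys pointed at mediawiki-installation.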