[00:00:07] but yeah the batch size would have to work for the worst case in any dc
[00:00:12] I'm guessing that there are appservers in codfw for a reason though :)
[00:01:22] I'm sure we will get there
[00:10:04] bd808: https://gist.github.com/atdt/62a09a9102104c43a50b
[00:10:33] i forgot my high school math so i roll the dice to estimate probability instead of plugging numbers into an equation :P
[00:13:38] do we only care about restarts for servers in pybal? Or is that just the only place we care about messing with apache?
[00:14:16] * bd808 has a python snippet that finds the lo:LVS addr
[00:14:38] i think it's OK to restart HHVM on all the jobrunners at once, for example
[00:14:42] because they're not user facing
[00:15:10] so it's not really an issue if all jobrunners aren't processing jobs for 5 seconds
[00:15:10] do we have a way to "greaceful" hhvm?
[00:15:23] * bd808 can't spell
[00:15:31] we do, let me dig up the relevant config vars
[00:15:43] hhvm.server.graceful_shutdown_wait
[00:15:47] let me see the code that consults that
[00:16:45] Our apache config seems to set GracefulShutdownTimeout to 5 seconds which seems a bit short
[00:17:54] yeah, doubling that wouldn't be the worst idea
[00:18:28] bd808: does 10 seem reasonable to you?
[00:18:53] what's our 90% response time?
[00:21:02] we have 95% in graphite: http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1434759637.731&target=varnish.eqiad.backends.ipv4_10_2_2_22.POST.p95&target=varnish.eqiad.backends.ipv4_10_2_2_22.GET.p95&target=varnish.eqiad.backends.ipv4_10_2_2_1.GET.median&target=varnish.eqiad.backends.ipv4_10_2_2_1.POST.p95
[00:21:41] POSTs to the normal (non-API) app servers are slowest and p95 is in the 600-700 range
[00:21:44] ms
[00:22:09] oh
[00:22:16] 5s might be tons then?
[00:23:00] well, p99 is different story: http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1434759768.17&target=varnish.eqiad.backends.ipv4_10_2_2_22.POST.p99&target=varnish.eqiad.backends.ipv4_10_2_2_22.GET.p99&target=varnish.eqiad.backends.ipv4_10_2_2_1.GET.p99&target=varnish.eqiad.backends.ipv4_10_2_2_1.POST.p99
[00:23:35] but we're not going to wait 30 seconds for someone to generate their gigantic pdf
[00:25:00] I guess if we've been using 5s until now we can stick with it and just be mindful of what we see on the frontend
[00:25:30] my dinner guests just showed up so I'll come back to this tomorrow
[00:25:43] thanks for the input
[00:26:15] thanks for the chat!
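The python snippet mentioned at 00:14:16 is not included in the log. A minimal sketch of the same idea, assuming the LVS service address is bound to the loopback interface under a label such as lo:LVS (as it typically is on LVS realservers), might look like the following; the function name and output text are illustrative, not the actual snippet:

    # Hypothetical sketch (not bd808's snippet): find service IPs bound to
    # the loopback interface under a label such as "lo:LVS". A host with no
    # such address is presumably not an LVS realserver behind pybal.
    import subprocess

    def lvs_service_ips(label='lo:LVS'):
        out = subprocess.check_output(
            ['ip', '-o', '-4', 'addr', 'show', 'dev', 'lo']).decode()
        ips = []
        for line in out.splitlines():
            # a labeled line looks roughly like:
            # 1: lo    inet 10.2.2.1/32 scope global lo:LVS\ valid_lft ...
            if label in line:
                ips.append(line.split()[3].split('/')[0])
        return ips

    if __name__ == '__main__':
        ips = lvs_service_ips()
        if ips:
            print('LVS realserver, service IPs: %s' % ', '.join(ips))
        else:
            print('no lo:LVS address found')

This matches the question above about only caring about restarts for servers in pybal: hosts without the labeled loopback address (e.g. non-user-facing ones) could be restarted without coordinating with the load balancer.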
[04:27:02] (#hhvm) porting something from zend->hhvm, whats the best choice for a zend_llist replacement?
[04:27:15] teehee, he's going to port the fast strtr() implementation to hhvm
[04:27:30] i knew that if i dangled it he would
[04:27:51] *evil laughter*
[04:44:29] ori: It looks like HHVM does a graceful shutdown via FastCGIServer::stop() whenever the config is set
[04:44:47] if we graceful apache though I don't think we will need it
[04:45:06] right
[04:45:12] since there will be no route back to the client anyway
[04:45:18] yep
[04:45:29] when we can signal pybal though it would be useful
[04:46:23] I think I've got something ready to test but now I need to piddle with my test vm to make it possible to actually try it out
[04:46:43] and sadly my test vm will need a lot of changes for this
[04:48:35] i know you just LOL now when i say "we'll test in production", but hear me out --
[04:48:51] we can depool 10-20 servers for days at a time
[04:48:55] and we can make them a dsh group
[04:49:21] can we parametrize the dsh group scap deploys to? make it 'mediawiki-installation' by default but make it customizable
[04:49:35] yeah that would be pretty easy
[04:50:00] when gabriel turned off the parsoid extension load on the api cluster was nearly halved IIRC
[04:50:09] we can definitely depool a small group
[04:50:23] k
[04:51:06] thirded
[04:51:12] http://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&c=API+application+servers+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
[04:51:19] ish
[04:51:57] heh
[04:52:15] that yearly cpu graph is interesting
[04:52:45] yeah there you go, hhvm cut load by a lot too
[04:52:52] and we _added_ api servers since then
[04:52:59] * greg-g nods
[04:53:12] I've heard mobile is the future or something
[04:58:21] oh, duh
[04:58:31] there's a whole cluster of depooled app servers
[04:58:33] all of codfw
[04:58:49] problem solved
[05:07:02] sent an email to ops@ asking for a green light
[05:09:54] * greg-g is sad ori didn't say thirded
[05:10:02] heh
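The "customizable dsh group" idea at 04:49:21 could look roughly like the sketch below: a deploy wrapper that targets the mediawiki-installation dsh group by default but can be pointed at a smaller group of depooled test servers. The flag name, group-file path, and plain ssh loop are assumptions for illustration only, not actual scap behaviour:

    # Hypothetical sketch of a deploy wrapper with a customizable dsh group.
    import argparse
    import subprocess

    DEFAULT_GROUP = 'mediawiki-installation'

    def read_dsh_group(name, dsh_dir='/etc/dsh/group'):
        # Read the host list for a dsh group, ignoring blanks and comments.
        with open('%s/%s' % (dsh_dir, name)) as f:
            return [line.strip() for line in f
                    if line.strip() and not line.startswith('#')]

    def run_on_group(group, command):
        # Run `command` on every host in the group, one host at a time.
        for host in read_dsh_group(group):
            print('==> %s' % host)
            subprocess.call(['ssh', host, command])

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--dsh-group', default=DEFAULT_GROUP,
                            help='dsh group to target (default: %(default)s)')
        parser.add_argument('command', help='command to run on each host')
        args = parser.parse_args()
        run_on_group(args.dsh_group, args.command)

With a group made up of the depooled servers discussed above (or the codfw cluster mentioned at 04:58:31), invocation might look like: wrapper.py --dsh-group depooled-test 'some command', while the default keeps normal deploys pointed at mediawiki-installation.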