[23:00:20] Hey TimStarling. If you have time to spare, could you take a look at https://gerrit.wikimedia.org/r/#/c/211320/ and https://gerrit.wikimedia.org/r/#/c/211498/ ? Since March 12, page save time has slowed down by 200-300 ms, because edits are blocked on enqueueing jobs in redis, which is overloaded.
[23:00:40] ok
[23:01:08] that doesn't seem like the right solution
[23:01:29] can't you just make it not be overloaded?
[23:01:44] it's an in-memory database, it shouldn't be necessary to defer writes to it
[23:01:50] It's not simple. Enqueuing a job requires EVALing lua code on redis
[23:02:11] the slow log shows the lua functions often taking >100ms to run
[23:02:37] it's single-threaded, is that including the time it waits for the mutex to become available?
[23:03:29] I'm not sure. Does it use a mutex?
[23:03:50] I mean is that measured by the client or the server?
[23:04:19] measured in redis. you can see it using the 'slow log' command on the server.
[23:05:37] so there's no bug for the overload?
[23:06:03] No, I just caught it last night.
[23:06:08] I'll file one
[23:07:07] I agree that it has to be fixed, and that this is not a full solution, but it's not a hack -- it is quite useful regardless. Some requests flush a lot of stats to statsd as well and I've seen that show up in the xenon stacks too, so that's another example of work that can be done without the user having to wait.
[23:09:05] while I write up the bug, some context: while preparing save time figures for the quarterly report, I noticed a spike: https://phabricator.wikimedia.org/T97378#1288887
[23:09:50] ...which we haven't fixed: http://i.imgur.com/NhnbjBS.png
[23:11:01] ...and which correlates with increased inbound traffic on the redises: http://ganglia.wikimedia.org/latest/graph.php?c=Redis%20eqiad&m=cpu_report&r=year&s=by%20name&hc=4&mc=2&st=1431904208&g=network_report&z=medium
[23:11:45] a lot of this is attributable to restbase and parsoid
[23:12:12] we enqueue two jobs (one in the RestBase extension, one in the Parsoid extension) for each edit
[23:12:48] on rdb1001 at the moment, redis is using ~95% of one CPU
[23:12:55] yep
[23:13:46] aaron has a task for provisioning additional instances on each machine
[23:13:47] so that is quite a severe overload
[23:14:01] the puppet redis module does not make it easy to have multiple instances per host
[23:14:09] yes, I think it's UBN!
[23:14:58] if you look at http://i.imgur.com/NhnbjBS.png , the huge spike was when it was completely overloaded and users started noticing saves taking forever. that qualified as an outage.
[23:17:27] i am considering making an overhaul of the jobqueue be one of the annual goals for the perf team
[23:17:49] it's not what i was hoping to work on but it's in pretty bad shape
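
Context for the EVAL / slow log exchange above: a minimal sketch, assuming a redis server on localhost, of how server-side timings can be read back and how a Lua-scripted enqueue looks from a client. The Lua body, key names, and job payload here are illustrative placeholders using the redis-py client, not the actual JobQueueRedis script.

    # Sketch: inspect redis's own slow log and run a small Lua enqueue via EVAL.
    # The Lua script below is a simplified stand-in for a job-queue push script.
    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)

    # SLOWLOG GET is measured on the server, so it excludes client and network time.
    for entry in r.slowlog_get(10):
        print(entry["id"], entry["duration"], "us:", entry["command"][:80])

    # A toy "enqueue" script: push the job blob and bump a counter atomically.
    # Redis runs Lua single-threaded, so a slow script blocks every other client.
    ENQUEUE_LUA = """
    redis.call('LPUSH', KEYS[1], ARGV[1])
    redis.call('HINCRBY', KEYS[2], 'queued', 1)
    return redis.call('LLEN', KEYS[1])
    """
    enqueue = r.register_script(ENQUEUE_LUA)

    job = json.dumps({"type": "refreshLinks", "page": "Example"})
    queue_len = enqueue(keys=["jobqueue:l-unclaimed", "jobqueue:h-stats"], args=[job])
    print("queue length:", queue_len)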
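The two gerrit changes discussed at the top are about moving the job enqueue (and statsd flush) off the user-facing save path. MediaWiki does this in PHP; the Python sketch below only illustrates the general idea under assumed names (handle_edit, enqueue_job, flush_stats are hypothetical), namely collecting work during the request and running it after the response has been sent, so it should not be read as the actual patch or the DeferredUpdates API.

    # Sketch of the "defer work until after the response" pattern.
    from typing import Callable, List

    class DeferredWork:
        """Collects callables during a request and runs them post-send."""

        def __init__(self) -> None:
            self._updates: List[Callable[[], None]] = []

        def add(self, update: Callable[[], None]) -> None:
            self._updates.append(update)

        def run_all(self) -> None:
            # Called after the response is flushed to the client, so slow
            # redis/statsd writes no longer add to perceived save time.
            for update in self._updates:
                try:
                    update()
                except Exception as exc:
                    print("deferred update failed:", exc)
            self._updates.clear()

    deferred = DeferredWork()

    def enqueue_job(job_type: str, page: str) -> None:
        pass  # push to the redis job queue here

    def flush_stats(metrics: dict) -> None:
        pass  # send counters to statsd here

    def handle_edit(page: str, text: str) -> str:
        # ... write the revision to the primary database ...
        # Queue follow-up work (refreshLinks, Parsoid/RESTBase jobs, stats)
        # instead of blocking the save on redis and statsd round trips.
        deferred.add(lambda: enqueue_job("refreshLinks", page))
        deferred.add(lambda: flush_stats({"edit.save": 1}))
        return "saved"

    response = handle_edit("Example", "new text")
    # ... send `response` to the client, then:
    deferred.run_all()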
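On the "additional instances on each machine" point: a sketch of spreading queue keys across several redis processes on one host by hashing the key, so a single single-threaded redis instance is no longer pinned at ~95% of one CPU. The ports and the partitioning scheme are assumptions for illustration, not the actual puppet or JobQueue configuration.

    # Sketch: partition job-queue traffic across several redis instances on one
    # host. Assumes redis-server processes already listening on these ports.
    import zlib
    import redis

    INSTANCE_PORTS = [6379, 6380, 6381, 6382]
    instances = [redis.Redis(host="localhost", port=p) for p in INSTANCE_PORTS]

    def instance_for(queue_key: str) -> redis.Redis:
        # A stable hash of the queue key picks one instance, so each redis
        # process only sees a fraction of the enqueue traffic.
        return instances[zlib.crc32(queue_key.encode()) % len(instances)]

    def enqueue(queue_key: str, job_blob: bytes) -> None:
        instance_for(queue_key).lpush(queue_key, job_blob)

    enqueue("jobqueue:enwiki:refreshLinks", b'{"page": "Example"}')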