[16:53:45] rfarrand: hello! i see i have an event on my calendar for the lyon travel briefing
[16:53:56] but i'm the only one invited, just want to make sure that's the hangout i should be joining
[16:54:15] nope, see your email - another link was just sent out
[16:54:27] ah, thanks!
[16:54:27] brb, heading up to the 5th floor
[21:00:39] #startmeeting RFC meeting
[21:00:57] no meetbot again?
[21:01:08] Pretty sure someone mentioned that yesterday?
[21:01:26] marktraceur, ^
[21:01:56] did anyone work out what labs instance it is on?
[21:02:23] well there's a meetbot tool
[21:02:29] So I'd imagine one of those instances
[21:02:44] maintained by marktraceur, scfc_de and hashar
[21:03:02] Coren told me last week "Hashar is the "real" maintainer."
[21:03:02] I started it back up
[21:03:14] thanks yuvipanda
[21:03:19] yw
[21:03:22] #startmeeting RFC meeting
[21:03:22] * yuvipanda goes back to lurking
[21:03:22] Meeting started Wed May 6 21:03:22 2015 UTC and is due to finish in 60 minutes. The chair is TimStarling. Information about MeetBot at http://wiki.debian.org/MeetBot.
[21:03:22] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
[21:03:22] The meeting name has been set to 'rfc_meeting'
[21:03:24] I was just repeating what I was told. :-)
[21:03:55] Well we really need a restart-button-pusher rather than the maintainer :p
[21:04:00] #topic Re-evaluate varnish-level request-restart behavior on 5xx | RFC meeting | Topic for #wikimedia-office is: If you're looking for the staff meeting, join the staff channel. | Wikimedia meetings channel | Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
[21:04:13] #link https://phabricator.wikimedia.org/T97206
[21:04:17] dinner, bbiab
[21:05:21] some background: this came out of the investigation after the commons / DjVu related API outage
[21:05:43] where we checked on timeouts and retry behavior throughout the stack
[21:05:55] bblack: you around?
[21:06:12] yeah but slightly pre-occupied
[21:06:18] is that the same outage as the other RFC?
[21:06:24] TimStarling: yes
[21:06:42] one is a more specific one of the other, right
[21:06:45] some user page had massive galleries which were taking a long time to parse?
[21:06:52] mark: yes
[21:07:22] TimStarling: yes, and the problem was made worse by retries
[21:08:14] so parsoid was sending requests to the API via varnish?
[21:08:22] the more general RFC argues for not retrying 503 responses from a server unless that is explicitly allowed by a Retry-After header
[21:08:44] and not retrying other 5xx responses in general
[21:09:16] I think this part might be the least controversial
[21:09:47] bblack: do you think that would work for Varnish too?
[21:10:01] the timing part wouldn't
[21:10:15] yeah, assuming we don't use that
[21:10:32] so, to simplify: don't retry 5xx, period.
[21:11:00] ok let me read back a sec here and get in sync
[21:11:35] any 5xx, on any backend?
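(A minimal sketch of the retry rule proposed above, not code from the RFC itself: never retry a 5xx, with the sole exception of a 503 that explicitly carries a Retry-After header. The should_retry helper is invented for illustration.)

    def should_retry(status: int, headers: dict) -> bool:
        """Sketch only: the '503 plus Retry-After, nothing else' rule."""
        header_names = {name.lower() for name in headers}
        if status == 503 and "retry-after" in header_names:
            return True   # the server explicitly invited a retry
        return False      # any other 5xx (or other failure) is treated as final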
[21:11:37] that sounds extreme
[21:11:45] so, as I argued in related tickets, I don't think retry-after is something we should even be looking at here
[21:11:51] dropping that from the question simplifies things a bit
[21:12:30] i think there are many cases where a backend can't make a reasonable decision about whether a request should be retried or not
[21:12:34] or even a suggestion
[21:12:51] however for the cases where it can, we could introduce a header as a suggestion
[21:12:54] the basic tension in this argument is that retrying failures protects users from intermittent failure, but if every element of our stack recursively does retries while processing requests, they multiply into a storm
[21:13:11] and they did in this case ^
[21:14:00] however, so far we are only talking about the case where an explicit 5xx is returned from the server
[21:14:02] I generally think we should avoid any retries anywhere except (a) if we can really make a strong case for limited retry somewhere deep, in a specific place, because of some deficiency and (b) at the outermost layer in varnish, a very limited retry at the top of the stack to protect the user from intermittent-whatever below
[21:14:15] gwicke: what does that mean?
[21:14:29] varnish has a concept of backend health
[21:14:40] paravoid: for example, that the client connection did not time out before the server responded
[21:14:54] TimStarling: yes, we could argue for no retries and relying on health, too, at the varnish level
[21:14:55] maybe a retry should be done if the first request was sent to an unhealthy backend?
[21:15:09] we have just one backend, the LVS IP
[21:15:17] mmm, true
[21:15:20] but yes, that's the problem there....
[21:15:24] how would the failure mode be communicated back to source?
[21:15:33] gwicke: I still don't understand tbh :)
[21:15:52] paravoid: the client received a response with status 5xx, within the client timeout
[21:15:52] I generally assume we want to treat timeouts + 5xx's approximately similarly
[21:16:12] bblack: lets discuss that separately
[21:16:14] (ditto connrefused)
[21:16:27] they all fall under "this failed and we don't really know why"
[21:16:48] with an explicit server response we know a lot more about the why
[21:17:01] sometimes the failure is resource-specific, sometimes it's the whole service or the network reachability, etc, but it would be rare we'd reliably know the difference in retry logic
[21:17:02] connection refused can always be retried immediately, it is cheap
[21:17:03] an explicit server response
[21:17:06] if there is no response at all, we know nothing
[21:17:08] may just be a 5xx from apache
[21:17:16] simple example:
[21:17:18] stop hhvm
[21:17:31] (on all machines or just one?)
[21:17:33] apache will emit 500s from the inability to talk to the fastcgi socket
[21:17:40] the other idea I had at the time was to alert on % of connection attempts that are retries
[21:17:50] part of the issue was that, with all of the failure and retry, parsoid never sent up any alert at all
[21:17:53] TimStarling: agreed
[21:18:08] I don't know if you can set a header on such failures, there's a relatively high chance that's not going to be possible
[21:18:25] I'm not sure what you'd set it to, even if it was possible
[21:18:29] funny that we have never patched apache
[21:18:34] paravoid: should we depool before restarting instead?
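(A rough sketch of the "alert on the percentage of connection attempts that are retries" idea floated above; the counter names and the 5% threshold are placeholders, not anything deployed.)

    def retries_look_unhealthy(total_attempts: int, retried_attempts: int,
                               threshold: float = 0.05) -> bool:
        """Sketch only: page someone when retries are masking a lot of failures."""
        if total_attempts == 0:
            return False
        return retried_attempts / total_attempts > threshold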
[21:18:42] (if we can't set a header)
[21:18:44] I'm not talking about planned maintenance
[21:19:05] this isn't about procedures, it's about handling this case which will happen regardless
[21:19:10] this will happen if HHVM segfaults
[21:19:13] in this case pybal would depool the service almost immediately
[21:19:25] even in the connrefused case, we don't want to do it x100, and we're not going to pause in the midst of handling a request, either. the chances of connrefused suddenly working 0.01ms later are small enough to not matter.
[21:19:27] but in-flight requests will still be returned as 500s
[21:19:46] in a segfault, HHVM's listening socket will be closed for a short time
[21:19:55] TimStarling: the question then becomes if repeating the same request elsewhere would just segfault another instance
[21:20:05] sure, it may well do
[21:20:08] the ability to pause a request for X seconds would be a good feature for varnish to have
[21:20:15] at the VCL level
[21:20:17] that's a very efficient DoS multiplication if it does
[21:20:39] I don't think anyone is suggesting to restart forever
[21:21:19] (and hence crash all possible backends)
[21:21:38] #info mark: the ability to pause a request for X seconds would be a good feature for varnish to have
[21:21:40] right now we are discussing how we can work around a service that doesn't send the headers that it maybe should
[21:21:41] it will happen anyways, if it's any kind of common URL
[21:21:49] multiple users + reload buttons, etc
[21:22:25] do you think it's possible to teach Apache to send a retry-after header if it detects HHVM issues that are likely temporary?
[21:22:58] I don't necessarily agree it would be good to pause a request mid-varnish. then we'd just stack them up there and kill resources, and retry at a random future time, when we have no information about future availability...
[21:23:17] bblack: well it can do it efficiently
[21:23:20] having a way to distinguish an app-level 5xx that would just produce the same result on retry vs. a temporary failure would be very valuable
[21:23:23] and well bounded etc
[21:23:35] yeah I think varnish is a good place for a queue
[21:23:37] gwicke: that bit is the difference between 500, 502, 503 etc.
[21:23:43] better than having it queue in parsoid or HHVM
[21:23:49] no need for a header for _that_
[21:23:52] bblack: even if we didn't delay, we could at least avoid retrying all 5xxs
[21:23:54] i'm not talking about having a thread just sit there for 10s
[21:24:04] fastcgi issues are probably 502s, possibly 503s
[21:24:06] doubt it's a 500
[21:24:14] right, but it's still a tcp port/state, some context memory, etc
[21:24:24] varnish can handle a *lot* of concurrent connections
[21:24:28] yes, it would need very careful consideration of those resources
[21:24:38] many more than any other service, excepting maybe LVS
[21:25:07] maybe it is 1KB of overhead per connection
[21:25:19] millions of connections could fit into RAM, it's not like the olden days
[21:25:22] fair. that's a big project, though, and might want to be a separate consideration from what we could change about all related things today
[21:25:22] paravoid: 503 is also returned from overloaded systems, in which case we don't want to retry
[21:25:31] agreed, it's not a trivial implementation at all
[21:25:33] you eventually start to fill up the kernel limits, but that takes a while
[21:26:21] gwicke: which systems?
[21:26:22] even if we don't delay, can we narrow down the conditions under which we retry further?
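(A back-of-the-envelope check for the queueing idea, using the roughly 1KB-per-held-connection figure quoted above, which is an estimate rather than a measurement.)

    per_connection_bytes = 1024            # rough estimate from the discussion
    held_requests = 1_000_000
    print(f"{held_requests:,} held requests ~ "
          f"{held_requests * per_connection_bytes / 2**30:.1f} GiB")   # about 1 GiB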
[21:26:49] in the general case with the simple changes we can make today, I think there are a lot of constraints, chiefly: (1) varnish can't hold a single request in a queue anywhere and wait and (2) we can't reliably know if a 5xx/timeout is for a specific resource, or a failure of the whole underlying service
[21:26:59] paravoid: for example, what does Varnish return when it's overloaded?
[21:27:15] 502 or 504?
[21:27:20] depends on your definition of overloaded
[21:27:23] I think I am leaning towards: retry on connection refused, retry on connection timeout, don't retry any other time
[21:27:44] so don't retry when any work has been done
[21:27:47] retry on connection timeout is tricky
[21:27:50] don't retry on server-side abort after request sent (the segfault case)
[21:28:27] why is it tricky?
[21:28:42] the second RFC focuses on that
[21:28:54] also, I'm not confident varnish has the ability to tell the difference between conn-refused and conn-timeout, that may all fall under the same bucket? I'd have to look
[21:29:01] shall we change the topic? it's 28 past anyway
[21:29:18] bblack: IIRC there are separate timers
[21:29:21] but doesn't matter if we stick with Tim's leaning
[21:29:42] #topic Request timeouts and retries | RFC meeting | Topic for #wikimedia-office is: If you're looking for the staff meeting, join the staff channel. | Wikimedia meetings channel | Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/ (Meeting topic: RFC meeting)
[21:29:42] happy to move on
[21:29:45] sounds reasonable to me
[21:29:54] #link https://phabricator.wikimedia.org/T97204
[21:30:17] this RFC talks about the interplay of timeouts and retries
[21:30:35] with emphasis on the timeouts on server and client
[21:30:55] (fwiw, I don't agree with the "don't retry any other time" above)
[21:31:05] a very common problem we see in production is that services have very long request timeouts, much longer than the clients interacting with them are prepared to wait
[21:31:32] (re: paravoid: I tend to agree, we could still do a singular retry of 5xx at the outermost layer, and save a fair chunk of user-visible failures)
[21:31:54] clients often try to ensure a reasonable response time themselves, so need to limit the time they wait for a response
[21:32:41] and there are some classes of errors where retrying on client timeout actually results in a success by hitting another backend
[21:32:46] would be nice if we could do it as a contract throughout the stack
[21:33:04] outermost client sets a budget for any deeper services
[21:33:14] yes, this RFC basically sketches a possible way to coordinate timeouts and retries through the stack
[21:33:32] mark: in an HTTP header?
[21:33:37] that'd be difficult to implement though
[21:33:38] yes, in the request
[21:33:49] "you can take X seconds to do this, after that I want an error"
[21:34:00] starting from a limit on the outside, lets say 5 minutes (common browser timeout), work your way inwards
[21:34:04] would we allow 3rd party clients to set that?
[21:34:10] no, just internal
[21:34:17] yes
[21:34:40] maybe external clients could set a budget lower than the internal one, if they wish. but not larger
[21:34:46] how is that really different from the setter simply timing out the client-side of the request in X seconds of his choice, and the receiver aborting any work progress on lost-connection?
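(A minimal sketch of the request-budget contract being discussed here: the outermost layer starts from roughly the browser timeout and each hop forwards a slightly smaller budget. The X-Request-Timeout header name and the margin are invented for illustration, not part of the RFC.)

    DEFAULT_BUDGET_S = 300.0   # outermost limit, roughly the common browser timeout
    HOP_MARGIN = 0.9           # each hop keeps a little slack so it can still answer in time

    def downstream_budget(headers: dict) -> float:
        """Sketch only: read the budget we were given, pass on a slightly shorter one."""
        incoming = float(headers.get("X-Request-Timeout", DEFAULT_BUDGET_S))
        return incoming * HOP_MARGIN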
[21:34:47] right
[21:34:57] the idea is that for each server - client pair, the server would have a slightly *shorter* timeout than the client making the request
[21:35:19] in theory any part of the stack could decide whether it makes sense to retry or not I guess
[21:35:25] bblack: for one thing, client-side aborts are not implemented in HHVM, afaik
[21:35:26] how much budget is left at any point
[21:35:49] but I'm just brainstorming this on the spot, haven't thought it through yet
[21:36:05] I've written a fair bit about this on phabricator
[21:36:05] mark: that would be even nicer, yes
[21:36:38] my main point was that in an overload condition, a server timeout and subsequent retry is stupid and makes the overload worse
[21:36:38] mark: https://phabricator.wikimedia.org/T97192 proposes something like that for the PHP API
[21:37:06] right
[21:38:07] we'd like to avoid clients making short requests that consume a lot of resources in the backend service
[21:39:00] which means that we need to bound the resource consumption, ideally by designing things so that they finish within a reasonable time, or by timing them out
[21:40:13] this means that some functionality like galleries with many thousands of images might no longer work
[21:41:06] we'd probably want to enforce that in a better way than with timeouts, for example by introducing a limit on the number of images that are allowed per gallery
[21:41:46] I'm not totally averse to the idea of time limits on parsing...
[21:41:50] not as much as I used to be
[21:42:08] you know for scribunto I introduced a CPU time limit
[21:42:22] maybe it makes sense to extend the concept to the rest of the parser
[21:42:53] but we have to make very sure that if the parser exceeds a resource limit, the failure is very very permanent
[21:43:22] with scribunto in particular the failure is saved into the parser cache and so is permanent until the page or module is edited
[21:43:26] yes, we don't want to have clients reload an expensive operation in the hope that it sometimes sneaks just below a timeout
[21:43:30] so instead of a retry-after
[21:43:43] what I keep seeing in these discussions is a "don't-you-dare-retrying" header
[21:43:56] which could just be another 5xx code of course
[21:44:09] 566 :P
[21:44:20] mark: the default according to the HTTP spec is that you should not retry unless Retry-After is set
[21:44:24] I still think retry-after is never going to be the right header to use, FWIW, as that should apply to a whole HTTP server/endpoint, not to a particularly-failing resource within
[21:44:26] for 503
[21:44:57] bblack: imho it's fine to interpret it per request
[21:44:59] gwicke: right, but there are gonna be cases where we'll want to cover that up
[21:44:59] we can make up our own if we want something for one resource
[21:45:52] mark: yeah, but I think we should try hard to separate the cases where we should retry from the ones where we shouldn't
[21:45:57] ditto for 503 in general: it's not the right response if it's likely the problem is with one resource and not the service as a whole
[21:46:06] yeah
[21:46:08] resource-level -> don't retry. service-level -> maybe?
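(A sketch of the "resource-level -> don't retry, service-level -> maybe" rule of thumb just stated; the X-Failure-Scope header is purely hypothetical, nothing like it exists today.)

    def may_retry(status: int, headers: dict) -> bool:
        """Sketch only: retry only explicitly service-wide, transient failures."""
        scope = headers.get("X-Failure-Scope", "unknown")
        if scope == "resource":
            return False   # the same request would fail the same way again
        return status == 503 and scope == "service"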
[21:46:17] it's just generally apache or something else "not in the know" emitting that
[21:46:24] I'm not entirely clear on what failure mode the retry is meant to overcome, it seems like making retry cheap and handling failure as gracefully as possible is better than stages of reattempt
[21:46:40] what failures should we not be handling at the service obfuscation level in lvs I wonder?
[21:46:42] bblack: the problem is to guess right about what is a single service and what isn't
[21:47:09] just because it's a single domain doesn't mean that it's powered by a single service
[21:47:23] gwicke: we're talking HTTP terms and no other here, so it's pretty clear. An http server host:port is one service here.
[21:47:29] chasemp: mostly simple failures such as a quick connection refused or crash on a single instance
[21:47:38] which lvs can't handle by design
[21:47:41] bblack: I don't think that's explicitly defined anywhere
[21:48:01] honestly, this conversation is too theoretical
[21:48:30] can someone give an example for each of the cases of (500, 502, 503) x (retry after, no retry after)?
[21:48:35] I think it's pretty implicit in the RFCs. 503 means "server overload", for the whole server you contacted. it even documents that an alternative is to just refuse connections.
[21:48:37] so, lets make it more practical
[21:48:45] paravoid: same thought here
[21:49:10] how about we start trying to implement the RFC defined server codes a bit more accurately?
[21:49:25] yeah I'd also like that
[21:49:40] we'll also gather the weird cases then
[21:49:45] and can have a more informed discussion
[21:49:51] also unrelatedly to this whole discussion; mediawiki emits 5xxs for 4xxs in various cases
[21:49:52] one main goal of the RFC is to make sure that clients actually receive server responses in time
[21:50:08] we've fixed some of them, but it's still not widely enforced, is my impression
[21:50:30] gwicke: server responses from the bottom-down layer that failed, you mean, right?
[21:50:33] mark: you mean 500 versus 503?
[21:50:41] I guess, yes
[21:50:47] are you on board with setting stricter timeouts in backend services?
[21:50:51] and 4xx when we think it helps
[21:51:03] we probably give 503 a lot when we mean 500
[21:51:07] yes
[21:51:11] yup
[21:51:21] paravoid: yes
[21:51:46] gwicke: ok. why, though? :)
[21:52:10] basically, if the server response comes after the client is gone, there will be no status available to the client
[21:52:28] no, why from the bottom-down?
[21:52:48] what's the benefit of broken down timeouts/budgets vs. just one big timeout at the top of the stack, close to the browser's timeout?
[21:52:54] I don't support an immediate reduction in HHVM timeouts
[21:53:08] paravoid: this is about the timeouts between each client/server pair
[21:53:15] I would support an analysis of the slow-parse log, leading to development of better parser limits
[21:53:35] the total request budget impacts those, as obviously no part of the request processing can take longer than the outermost timeout
[21:53:37] leading to a reduction in the number of parses which exceed the timeout, perhaps eventually leading to a reduced HHVM timeout
[21:53:45] gwicke: still: if a server fails, it should return 500, and a consumer should pass that 500 up the stack as a 500
[21:53:51] not 503
[21:54:18] but what does it matter?
[21:54:25] and I would support a CPU time limit on parse operations, with permanent cached failure
[21:54:32] TimStarling: I agree that we need to get there gradually
[21:54:49] i.e. a 200 response on internal parser limit exhaustion
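(A rough illustration of the "permanent cached failure as a 200" idea above, in the spirit of the existing template-depth and Scribunto limits; the response shape and cache header value are invented, not MediaWiki's actual output.)

    def limit_exceeded_response(title: str, limit_name: str):
        """Sketch only: a 200 whose body carries the error, so caches keep it
        until the page is edited again."""
        body = {"error": {"code": "resource-limit-exceeded",
                          "info": f"Parsing {title} exceeded the {limit_name} limit"}}
        headers = {"Cache-Control": "s-maxage=86400"}   # placeholder TTL
        return 200, headers, body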
[21:54:49] bblack: agreed
[21:55:15] why do you want to recognize a timeout and emit a 500 at the bottom of the stack rather than at the top?
[21:55:30] bblack: the RFC talks about retrying only 503s out of the 5xx responses, and only if Retry-After is set
[21:55:31] TimStarling: I like that, the 200 idea.
[21:55:34] it sounds better in theory but is much harder to implement and I fail to see a huge benefit
[21:55:46] the 500 page that says "request timed out" will probably just look the same :)
[21:55:56] bblack: it's what we do already for template depth limits, scribunto CPU limits, etc.
[21:55:59] TimStarling: why not 500?
[21:56:11] because the 200 will get cached until another edit hits
[21:56:15] up in varnish
[21:56:29] that sounds like a hack
[21:56:39] we can cache 5xx too if we are sure that it's permanent
[21:56:44] we never are, though
[21:57:01] that is indeed tricky with the standard 5xx error codes
[21:57:05] but in this case, we know that for this one resource, we have a resource-consumption problem that won't fix itself until something changes.
[21:57:07] then we shouldn't play with 200 ;)
[21:57:10] hence 200
[21:57:24] error codes in 200 responses are not a strange concept
[21:57:27] in the general case, we can't cache 5xx, but in the "one page takes too long to render" case, we could cache a 200
[21:57:34] if we are sure, we can return 500 and appropriate headers that allow caching
[21:58:01] caching 5xx is always going to be a bad idea. once we're in unknown-error territory, you just don't know the fallout.
[21:58:04] or the scope
[21:58:04] maybe it doesn't fit in with the REST idea but it is certainly used in api.php
[21:58:36] (aside from maybe efficiency hacks of caching a 5xx for a very tiny window of time)
[21:58:45] well, lets table that and get back to the timeout stacking for a moment
[21:58:54] plus, is a humongous request that exhausts server-side limits a server-side error or a client error for trying it :)
[21:59:42] do you all feel that working towards making sure that clients actually receive a response within their time budget makes sense as a general guideline?
[22:00:05] that's too vague
[22:00:13] I'm sure that not all clients agree on or communicate a time budget to us in the first place
[22:00:15] clients being...? users/eyeballs? or internal HTTP clients?
[22:00:22] oops, out of time
[22:00:24] the only problem right now is retries
[22:00:30] paravoid: client as in HTTP client
[22:00:36] then no
[22:00:40] so yes, internal + external
[22:01:09] mark: as an upper bound, five minutes?
[22:01:22] please try to wrap up and make action items
[22:01:26] that's the default HTTP timeout in browsers
[22:01:45] one action item: let's start implementing HTTP error codes more closely to the RFC?
[22:01:47] IMHO we should aim for much less, but we got to start somewhere
[22:01:54] that's a step in the right direction at least
[22:02:11] well, a proper action item is to file a bug to do that, right?
[22:02:12] to the RFC and our conventions
[22:02:19] Define a set of conditions under which retry is better than failure and see how prevalent it is in real life.
[22:02:32] #action mark to file a bug: let's start implementing HTTP error codes more closely to the RFC
[22:02:33] the RFC(s) can be (sometimes intentionally) vague
[22:03:03] well
[22:03:05] can we quickly revisit 0 vs 1 retries of 5xx within the varnish layer of the stack?
[22:03:11] that'd be a set of bugs then, for specific implementations?
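(For the action item on implementing status codes closer to the RFC, a sketch of the mapping the discussion points towards; the failure-kind labels are made up for illustration.)

    STATUS_FOR_FAILURE = {
        "client-error":        400,  # caller's fault; don't surface it as a 5xx
        "app-error":           500,  # this server failed while handling the request
        "bad-upstream-reply":  502,  # a gateway got a bad response from its backend
        "service-overloaded":  503,  # the whole service is overloaded or down
        "upstream-timeout":    504,  # a gateway timed out waiting on its backend
    }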
[22:03:14] or it's part of the RFC
[22:03:18] to make HTTP error codes useful in practice, they have to be timely as well
[22:03:28] I think either is acceptable, I tend to think 1x retry there is a reasonable tradeoff, though
[22:03:34] possibly
[22:03:54] (to cover intermittent from server restarts, etc)
[22:03:55] I remember hearing on the multi-DC RFC that mediawiki can still perform actions with GETs
[22:04:08] but I don't remember the details
[22:04:20] in such a case, retrying a 500 may be bad?
[22:04:24] bblack: IMHO we should try harder to isolate the cases where retries really make sense
[22:04:29] bblack: no we can't revisit anything in this meeting
[22:04:41] but feel free to talk about it in #wikimedia-operations
[22:04:44] feel free to discuss on the rfc/bug too
[22:04:45] by narrowing down the return codes, and revisiting the timeouts
[22:05:25] thanks for coming everyone
[22:05:37] thank you
[22:05:50] next week we will probably talk about: Graphical configuration interface https://phabricator.wikimedia.org/T388
[22:05:59] TimStarling: should we create an action item for reducing timeouts?
[22:06:04] clearly not
[22:06:10] as we didn't have consensus on that yet
[22:06:22] tim was in favor of a gradual process
[22:06:31] oh in the parser
[22:06:33] sure
[22:06:42] that's specific though
[22:06:45] any concrete objections?
[22:06:54] and also because I think action items should be short tasks achievable by people present in the meeting, not major projects for teams
[22:07:09] that's true
[22:07:18] the action item should be to document that in the rfc
[22:07:23] or in a bug indeed
[22:07:25] yep
[22:07:28] #endmeeting
[22:07:29] Meeting ended Wed May 6 22:07:28 2015 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
[22:07:29] Minutes: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-05-06-21.03.html
[22:07:29] Minutes (text): https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-05-06-21.03.txt
[22:07:29] Minutes (wiki): https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-05-06-21.03.wiki
[22:07:29] Log: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-05-06-21.03.log.html
[22:07:34] if we are talking about architectural guidelines, then it's hard to avoid them being cross-cutting
[22:08:11] this one was clearly specific for one service/implementation, and especially one specific resource in it
[22:08:22] so that's not an action item to start doing it for all other http connections yet, no
[22:08:25] I can open tickets for my parser ideas
[22:08:35] needs more discussion
[22:08:56] mark: are you opposed to any timeouts at all?
[22:09:00] i am not
[22:09:13] but that doesn't mean we have consensus on a plan yet
[22:09:13] ok, I'm not sure that anybody is tbh
[22:09:32] there was the question whether it's better than one big timeout
[22:09:35] and that's not resolved
[22:09:52] no convincing arguments heard yet, etc
[22:09:55] it might be worth writing that up
[22:09:57] so let's discuss further
[22:10:01] how the big timeout would work
[22:10:24] feel free :)
[22:10:24] mark: are you interested in doing that?
[22:10:29] definitely not
[22:11:00] or are you talking about the request budget idea?
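(Pulling together the "1x retry at the outermost layer" suggestion and the caveat that some MediaWiki GETs still perform actions, a minimal sketch with invented names; not a varnish implementation.)

    IDEMPOTENT_METHODS = {"GET", "HEAD"}   # per the caveat above, some GETs still have side effects

    def outer_layer_may_retry(method: str, attempt: int, status: int) -> bool:
        """Sketch only: at most one retry, only at the top of the stack,
        only for methods that should be idempotent, only on 5xx."""
        return method in IDEMPOTENT_METHODS and attempt == 0 and status >= 500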
[22:11:04] that's not what I was referring to
[22:11:25] Faidon asked about a global timeout, but didn't say anything more specific
[22:11:28] i think the question above was more, how are lower level timeouts in more specific services better than the status quo of one big timeout (modulo retries)
[22:11:34] about how that would look for backends
[22:11:42] yes
[22:12:13] he didn't hear convincing arguments yet
[22:12:16] the question is basically, do we want clients to receive HTTP error codes?
[22:12:49] doesn't really matter much, especially not if the outer layer makes one
[22:12:53] or do we want to rely on clients timing out before the server they are talking to does
[22:13:11] as long as the outer layer catches that, it's the same to users
[22:13:29] i'm not arguing in favor of either
[22:13:35] i'm just saying, it's not clear yet what is better
[22:13:39] so needs more discussion/thinking
[22:14:04] kk; lets try to narrow down what the alternatives are
[22:14:06] so another way I have seen a similar problem solved is to configure upper limits in the client service, i.e. parsoid. it seems like we are talking about making server side responses our indication of request concurrency when we could define an artificial one we know to be sane
[22:14:08] and alert above it
[22:14:22] that's what tim now wants to do yes
[22:14:31] oh well in that case I support that
[22:14:43] or specific resource limits on parser time etc
[22:14:53] but anyway
[22:15:00] it's past midnight, i've had a long day, i'm going to bed :)
[22:15:04] "hit me until I fall down" is a poor rate limit mechanism usually
[22:15:16] mark: thanks for the discussion, and good night ;)
[22:15:19] nini!
[23:12:02] yuvipanda: http://etherpad.wikimedia.org/p/tool_labs-doc
[23:54:31] spagewmf_: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin