[08:05:16] i think i just maxed out the traffic included in the mobile wifi hotspot i am using and it's on somebody else's contract. got throttled so i am still here but can barely load websites.. gotta figure out how to add more data
[14:29:08] one question about Ganeti capacity for the next year (eqiad) - we (as analytics) will probably need to create some vms (I hope something like 3/5 max) for various things, all miscs with low requirements for CPU/RAM (say around 4/6 vcores and 6/8G of ram per vm) and something like 50/100G max disk space needed
[14:29:36] is it something acceptable or do I need to ask for more capacity on this front?
[14:29:44] for some definition of $low_requirements :-P
[14:30:07] volans: I didn't ask 32 cores with 64G of ram come on :D
[14:30:14] elukey: if it helps, add some -very rough- requirements into the CapEx sheet that you'll share with me
[14:30:16] see https://netbox.wikimedia.org/virtualization/virtual-machines/?sort=-memory
[14:31:25] volans: is that a blame list for analytics? :P
[14:31:30] paravoid: ack will do thanks
[14:32:41] elukey: ahahah, no I was just pointing out that those are towards the max requirements of existing VMs ;)
[14:37:18] volans: okok got it, these analytics people are the worst, they always need capacity, I hate them, etc..
[14:37:45] :D
[14:38:06] jokes aside, noted, will add requirements to the spreadsheet so we'll all be aware
[14:38:21] lol, indeed, and thanks for the headsup
[15:23:45] 4/6 cores and 6/8 GB RAM is low ? heh
[17:02:48] _joe_: so something I wanted to ask -- I think it makes sense to spend some time considering what things we're concerned about saturating
[17:03:09] is it edge cache NICs? is it PHP worker threads? is it memcached NICs?
[17:03:17] <_joe_> yes
[17:03:24] I mean, it's all of the above, but we need to know what's most likely
[17:03:35] <_joe_> i would say the last two
[17:03:36] btw, are the jobrunners part of cluster=appserver?
[17:03:43] <_joe_> in no particular order of priority
[17:03:44] <_joe_> no
[17:03:47] <_joe_> cluster=jobrunner
[17:04:11] ok, in that case (make sure you are logged into grafana) I find https://w.wiki/Lt9 concerning
[17:04:22] even though https://w.wiki/LtA looks fine
[17:04:38] <_joe_> what is that?
[17:05:03] <_joe_> that all appservers have at least 15 idle workers?
[17:05:11] it's the minimum over time
[17:05:24] across all servers, what was the least number available, at a given instant
[17:05:33] <_joe_> cdanis: I think we still use pm = dynamic
[17:05:39] oh
[17:05:42] i thought we had a hard limit configured
[17:05:46] <_joe_> cdanis: we have data bout saturation
[17:05:56] <_joe_> *about
[17:06:35] yeah, I've seen that query: `sum(irate(mediawiki_http_requests_duration_sum{cluster="$cluster",handler!="-",instance="$node:3903"}[5m])) * 100 / sum(phpfpm_processes_total{instance="$node:9180"})`
[17:06:40] still thinking about if I trust it ;)
[17:07:37] <_joe_> why wouldn't you :P
[17:08:09] <_joe_> it's the number of wall clock seconds spent in every request / the number of workers
[17:08:29] <_joe_> for each time interval
[17:08:44] it's a trailing indicator though
[17:09:04] not only is there the 5 minute window in the irate but it's also that requests aren't logged until they've completed
[17:09:12] but yeah i think it's correct aside from that
[17:09:13] <_joe_> you mean the values are shifted by a bit?
sure
[17:09:22] <_joe_> it is just useful to see trends
[17:09:32] <_joe_> not to spot immediate issues as they happen
[17:09:43] <_joe_> but say a server is constantly above 50% there
[17:09:46] <_joe_> you have to worry
[17:09:52] the next thing I want to know is what happened at 15:30 on 4/1
[17:10:00] sorry, Apr 1
[17:10:06] <_joe_> I got it
[17:10:19] <_joe_> probably the switch of the canaries to envoy?
[17:10:24] <_joe_> what do you see?
[17:10:54] <_joe_> no that happened on the 31st sorry
[17:11:21] I see a bunch of appservers going to low idle worker threads (some going to 0)
[17:11:41] https://w.wiki/LtC (and maybe click 'show all timeseries')
[17:11:57] going back to the transitive timeouts/retries thing, I think a possible simple scheme that doesn't have major flaws would be:
[17:12:19] the requesting service sends something like X-Timeout-Remaining: 37 X-Subrequests-Remaining: 10
[17:12:37] and the responding service sends back: X-Subrequests-Consumed: 5
[17:13:06] (time is implicit in the response time from the requestor's pov. if there's an actual timeout with no response from below, you're done and fail upstream).
[17:13:08] <_joe_> bblack: you're reimplementing what envoy already does, more or less :)
[17:13:26] let me know when everything inside our network is behind envoys :)
[17:13:36] <_joe_> we're working on it
[17:13:56] does it pass a subrequest counter transitively around and actually track it?
[17:13:58] <_joe_> we used tls termination as our Trojan horse
[17:14:28] <_joe_> no, but it has a series of headers that are used as control flow between different parts of the mesh
[17:15:07] <_joe_> bblack: it works well if you only make your application talk to envoy, and let envoy deal with the other application's envoy under the hood
[17:15:37] is it transitive though?
[17:15:53] <_joe_> your application more or less says to envoy "give me this info within N seconds or give up", and all the retry logic and circuit breaking happens within there
[17:16:10] <_joe_> but no, it's not necessarily transitive, if I got what you mean
[17:16:30] the X-Subrequests-* header system can be
[17:16:44] (if you don't retry timeouts, which is probably a good idea anyways)
[17:17:36] it prevents some loop in our stack from causing one user request to blossom into 3x service A requests which become 30x service B requests which become 5,478 service C requests which become 504,897 service A requests again, etc
[17:17:59] you set a budget for the total number of internal subrequests that can happen as it fans out, and when the limit is reached, all things just fail fast.
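A minimal sketch of how an application could implement the header scheme bblack outlines above; nothing like this exists in the stack today. The header names (X-Timeout-Remaining, X-Subrequests-Remaining, X-Subrequests-Consumed) come straight from the discussion, while the RequestBudget class, its method names, and the defaults are invented for illustration only:

```python
import time


class BudgetExceeded(Exception):
    """The time or subrequest budget handed to us by our caller is exhausted."""


class RequestBudget:
    """Tracks the budget received on one inbound request and splits it across
    the subrequests made while serving it."""

    def __init__(self, inbound_headers, default_timeout=60.0, default_subreqs=100):
        # At the edge of the stack no upstream has set the headers yet, so we
        # fall back to defaults (100 mirrors the "sane value" suggested below).
        self.deadline = time.monotonic() + float(
            inbound_headers.get("X-Timeout-Remaining", default_timeout))
        self.subreqs_remaining = int(
            inbound_headers.get("X-Subrequests-Remaining", default_subreqs))
        self.initial_subreqs = self.subreqs_remaining

    def headers_for_subrequest(self, subreq_allocation=1):
        """Reserve part of the budget for one outgoing subrequest and return
        the headers to send with it."""
        time_remaining = self.deadline - time.monotonic()
        if time_remaining <= 0 or self.subreqs_remaining < subreq_allocation:
            # Budget exhausted: fail fast instead of fanning out further.
            raise BudgetExceeded("out of time or subrequest budget")
        # The call itself costs 1; the callee may spend the rest of the
        # allocation on its own fan-out.
        self.subreqs_remaining -= subreq_allocation
        return {
            "X-Timeout-Remaining": f"{time_remaining:.3f}",
            "X-Subrequests-Remaining": str(subreq_allocation - 1),
        }

    def settle_subrequest(self, response_headers, subreq_allocation=1):
        """Refund whatever part of the reservation the callee did not use,
        based on the X-Subrequests-Consumed it reports (plus 1 for the call itself)."""
        consumed = int(response_headers.get("X-Subrequests-Consumed", 0)) + 1
        self.subreqs_remaining += max(subreq_allocation - consumed, 0)

    def consumed_header(self):
        """Header reporting back to our own caller how much of its budget we used."""
        return {"X-Subrequests-Consumed": str(self.initial_subreqs - self.subreqs_remaining)}
```

The intended use is one RequestBudget per inbound request: call headers_for_subrequest() before each outgoing call, settle_subrequest() when it returns, and attach consumed_header() to the response sent back upstream (a usage sketch follows the budget-splitting discussion below).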
[17:18:46] (and set it to some sane value your stack is not expected to exceed for a successful request, like say 100)
[17:19:13] <_joe_> or you reason in bulk and decide you can't originate more than x requests from service a to service b overall at the same time
[17:19:16] bblack: Envoy doesn't have propagation like that, but AIUI it does have functionality to say "you can have maximum N requests outstanding and C open connections to service X, globally"
[17:19:36] which I think still does something close enough to the right thing here
[17:19:37] yeah but that doesn't sound reactive to scale
[17:19:54] the wrong thing is getting shot down, in that case, randomly
[17:21:17] <_joe_> (I'm going offline now sorry, I need to rest)
[17:21:21] ok
[17:21:46] so you could reason that service b can only handle 10K concurrency, and have envoy limit the a->b (or *->b) traffic to 10K to save service b
[17:22:29] yes, and it will stop dropping things, and you don't have much control over what (except you can have different limits for different priorities of traffic)
[17:22:37] s/stop drop/start drop/
[17:22:55] but when that limit kicks in, it's going to start killing off random requests. You'll see a 503 (or whatever) for a certain request, but you won't be able to find any problem on a manual retry of that request, because there was nothing wrong with it other than it went over the global max at that moment in time.
[17:23:17] whereas if you make the *offending* requests fail, you can more-easily see why things went off the rails
[17:24:01] I totally agree, but I also want to point out that we'll soon have something that implements a suboptimal-but-better-than-status-quo in-line with most of our subrequests :)
[17:24:14] ok :)
[17:24:53] anyways, sub-requests/calls (inter-service calls) can be a budgeted resource like timeouts, and both can be transited via headers.
[17:25:27] +1
[17:25:47] then if a given service wants to do some policy like "I want to retry service D 5 times with a 1 second timeout", well, maybe I think that's a fundamentally bad idea, but that's up to them really not me....
[17:26:01] ... so long as they do it within the timeout and subrequest budgets handed to them in the inbound headers
[17:26:27] (that counts for 5 subreqs)
[17:26:46] and if a service wants to make parallel subrequests, it's going to have to divide its budget
[17:27:07] (whereas timeout budgets can run concurrently without division, for parallel subreqs)
[17:27:13] something that I'd argue is that the best place to implement both the tracking of timeout/subrequest budget, and the enforcement of it, is in Envoy itself
[17:27:46] yeah, the upside of that is consistency of implementation
[17:27:54] it saves you from having to get all those fiddly details right, including observability/metrics thereof, in N different languages (and then having to reason about any subtly different implementation details)
[17:28:30] but the downside is the application may be less aware of what's going on and less able to fine-tune the use of those budgets in creative ways
[17:30:35] (if the app is ignoring those same headers, anyways)
[17:30:49] who does the logic of breaking up the budget into sub-budgets?
[17:31:03] speaking in the first person as the code of ServiceA
[17:31:32] I mean, the app doesn't _have_ to ignore those headers
[17:31:37] I've been given a single request, and received with it a budget of 37 seconds and 22 subrequests
[17:32:18] I may wish to now call 3 other services in parallel, while doing a series of 5 calls to a 4th service serially, and may choose to allocate 2 subrequests each to those first three calls and 3 each to the others, etc...
[17:32:39] only the application's code has a hope of understanding how to do budgeting
[17:33:57] I think maybe ideally the app is very cognizant of these headers in the case that it is an app that makes other subrequests, just envoy is doing the job of enforcement and observability
[17:34:52] at least timeout enforcement
[17:35:10] with the app budgeting, I guess it's really up to the app to decide it has zero subreqs left and fail
[17:35:54] (I don't think envoy would understand request context for multiple outbound subreqs in this case)
[17:37:11] mmm, it possibly _could_ as long as apps are doing request-id propagation, but yes that is difficult
[17:38:47] yeah but then envoy would also have to save state on reqids
[17:39:21] imagine I'm the specific instance of the ASDF daemon on host asdf1001
[17:39:28] I have many user requests flowing through me
[17:39:51] in response to one of those incoming requests, I now want to serially fire off 3x subrequests to other services, within that one budget context
[17:40:12] now the ASDF daemon is going to make 3x serial outbound connections to envoy (because it's the inter-service mesh routing, one way or another)
[17:40:48] but how does envoy understand that these 3 requests correspond to a specific budget, unless that envoy daemon also has a state table of every relevant request-id it's tracking
[17:41:26] the application already has some request-context object. budgeting is much easier for it.
[17:43:06] (especially if they're different envoy daemons - the ones managing ASDF's influx and outbound)
[17:43:30] (unless we start talking about the envoy mesh storing global state somewhere shared across the network, but that sounds terrifying for scalability)
[17:44:50] anyways, whether or to what degree envoy or the app-code do parts of it, the important thing is having a standard universal mechanism.
[17:45:16] which is defining the meaning of those custom headers
[17:45:41] yeah, agree with all that
[17:45:45] then if we s/envoy/whatever/ 4 years from now because envoy isn't cool anymore, we can retain the meaning through the transition.
[21:23:14] back to appserver thread saturation issues -- were we debugging something on 2020-03-16 around 18:00? https://grafana.wikimedia.org/d/ifM0GzjWk/cdanis-xxx-php-worker-threads?orgId=1&from=1584370800628&to=1584403194855&fullscreen&panelId=2&var-datasource=eqiad%20prometheus%2Fops&var-clusters=appserver
[21:23:39] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1584370800628&to=1584403194855&fullscreen&panelId=9
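To make the budget-splitting idea from the 17:31-17:32 exchange concrete, here is a hypothetical continuation of the earlier RequestBudget sketch: ServiceA divides its 22-subrequest budget between three parallel calls and a serial series of five, while the timeout budget runs concurrently (undivided) for the parallel calls. The service names and the call_service() helper are placeholders for whatever client the application actually uses, not real services or a real API:

```python
from concurrent.futures import ThreadPoolExecutor


def handle_request(inbound_headers, call_service):
    budget = RequestBudget(inbound_headers)  # e.g. 37 seconds / 22 subrequests

    # Three parallel calls: each gets its own slice of the subrequest budget
    # (2 apiece), but they all share the same wall-clock deadline, so the
    # timeout budget is not divided between them.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = []
        for svc in ("service-b", "service-c", "service-d"):  # placeholder names
            headers = budget.headers_for_subrequest(subreq_allocation=2)
            futures.append(pool.submit(call_service, svc, headers))
        for fut in futures:
            budget.settle_subrequest(fut.result().headers, subreq_allocation=2)

    # A series of 5 calls to a fourth service, issued one at a time, so each
    # sees whatever time is actually left on the clock when it starts.
    for _ in range(5):
        resp = call_service("service-e", budget.headers_for_subrequest(subreq_allocation=3))
        budget.settle_subrequest(resp.headers, subreq_allocation=3)

    # Report back to our own caller how much of its budget we burned.
    return budget.consumed_header()
```

If any allocation cannot be covered, headers_for_subrequest() raises and the whole request fails fast, which is the behaviour the scheme is after: the offending fan-out is the thing that fails, rather than some unrelated request hitting a global concurrency cap.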