[00:06:29] Amir1, https://github.com/wiki-ai/ores-wmflabs-deploy/pull/68
[00:07:02] Not necessary for the wmflabs deploy that I'm just about to do
[00:07:11] But would have saved me some headache when testing on staging.
[00:08:43] Starting deploy
[00:11:38] I was afk for coffee
[00:12:23] Workers are deployed.
[00:12:28] I had a hiccup on web-04
[00:12:34] continuing with -03 and -05 only
[00:13:13] I deployed at least ten times to labs, and never ever web-04 acted normally
[00:13:45] I think we even threw it away and rebuilt it
[00:13:52] (but not sure)
[00:16:02] Looks like we're 100% online
[00:16:12] graphite might be a bit weird for a while because I made some changes there.
[00:17:51] grafana acts sooo wierd
[00:17:55] *weird
[00:18:09] https://grafana-admin.wikimedia.org/dashboard/db/ores-labs
[00:18:09] yeah. It's all messed up today. I couldn't get any graphs for ores-labs.
[00:18:20] I use graphite directly
[00:18:23] So I was using graphite.
[00:18:26] https://graphite-labs.wikimedia.org/
[00:19:32] It looks like we're successful
[00:19:34] Time to log
[00:43:58] https://www.youtube.com/watch?v=Pn7cLbK7mo8
[01:04:57] lol
[01:08:36] OK I'm out of here.
[01:08:42] Have a good one Amir1 et al.!
[01:08:52] halfak: you too
[01:09:02] o/ I stay to work a little and I go
[06:35:28] Revision-Scoring-As-A-Service, Wikilabels: Add "info" URL to campaign data so that we can link to campaign page - https://phabricator.wikimedia.org/T139957#2504399 (schana) How is the schema file versioned/managed? Is it intended to be able to be run as-is with the actions being idempotent? Or is it mean...
[06:54:22] Revision-Scoring-As-A-Service-Backlog, ORES, revscoring, Documentation: Add MacOS instructions for installation to README - https://phabricator.wikimedia.org/T139355#2504430 (schana) `pyenchant` is currently broken on OSX (without a manual patch): https://github.com/rfk/pyenchant/issues/45 Shoul...
[08:07:49] Revision-Scoring-As-A-Service, rsaas-articlequality: Migrate wp10 models to gradient boosting. - https://phabricator.wikimedia.org/T141603#2504513 (Ladsgroup)
[09:24:11] (CR) Thiemo Mättig (WMDE): "I love tests. :-)" (2 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/301162 (https://phabricator.wikimedia.org/T140455) (owner: Ladsgroup)
[11:02:14] (PS1) Ladsgroup: Some more CI tests [extensions/ORES] - https://gerrit.wikimedia.org/r/301786 (https://phabricator.wikimedia.org/T140455)
[13:36:28] o/
[13:39:42] hey Amir1
[13:49:33] Revision-Scoring-As-A-Service, Wikilabels: Add "info" URL to campaign data so that we can link to campaign page - https://phabricator.wikimedia.org/T139957#2505122 (Halfak) I should note that, we've never done any meaningful schema changes, so we're still figuring this one out as we go.
[13:52:52] Revision-Scoring-As-A-Service, ORES: Update wmflabs deploy repo for new version of ORES - https://phabricator.wikimedia.org/T141377#2505123 (Halfak) And deployed!
[13:52:59] o/ sabya
[13:54:24] halfak: o/
[13:54:46] I just woke up
[13:55:19] Will you be able to get on a call to talk framing for the upcoming bloggings or should we do IRC?
[13:57:07] Revision-Scoring-As-A-Service-Backlog, ORES, revscoring, Documentation: Add MacOS instructions for installation to README - https://phabricator.wikimedia.org/T139355#2505127 (Halfak) pip install git+git://github.com/rfk/pyenchant.git@v1.6.7 should work for now. I agree that it might be good to...
[13:59:51] I'll be on a call in a minute
[14:00:41] Sure. Just drop me a call through hangouts when you are ready
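A quick way to confirm that the T139355 pyenchant workaround mentioned above actually took effect after `pip install git+git://github.com/rfk/pyenchant.git@v1.6.7`. This is only an illustrative sanity check using pyenchant's standard `Dict`/`check`/`suggest` API; it assumes an en_US dictionary backend is installed on the machine.

```python
# Sanity check after installing the patched pyenchant.
# Assumes an en_US dictionary backend (hunspell/aspell) is installed locally.
import enchant

d = enchant.Dict("en_US")
print(d.check("revision"))   # True if the dictionary loads and works
print(d.suggest("revsion"))  # spelling suggestions, e.g. ['revision', ...]
```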
[14:03:14] k
[14:45:54] Amir1: halfak, is it ok for you guys if we talk hardware reqs in an hour or so ?
[14:46:27] Yeah that should work for me.
[14:46:29] :)
[15:04:34] (CR) Thiemo Mättig (WMDE): [C: +1] Some more CI tests [extensions/ORES] - https://gerrit.wikimedia.org/r/301786 (https://phabricator.wikimedia.org/T140455) (owner: Ladsgroup)
[15:15:17] akosiaris: we were at a meeting, It's okay
[15:15:28] I would love to
[15:52:24] o/
[15:52:32] akosiaris, want to talk now?
[15:57:45] halfak: in an ops meeting currently, should be done in about 3-4 mins
[15:57:58] kk
[16:01:12] I'm watching the metrics meetings
[16:01:16] *meeting
[16:04:10] halfak: I think Detox people can use wikilabels system
[16:04:19] have you talked to them about it?
[16:04:23] Agreed. That's one of the things we discussed.
[16:04:50] kk
[16:06:57] I am around now
[16:11:05] OK so, I see our CPU and especially memory requirements growing substantially in the near future.
[16:11:13] I'm around now too
[16:11:46] It seems to me that having a dedicated machine with lots of memory and substantial CPU for our workers would be a nice next step.
[16:12:04] So we could run the web workers on scb1001/2 and the celery workers on ores1001 (or whatever)
[16:12:44] In a perfect world, all of the recent interest and plans for ORES would translate directly into funding for new hardware.
[16:13:15] I 've proposed the same approach btw already to mark
[16:13:24] it's the one that makes the most sense
[16:13:29] great. Of course we have budget and other stuff to worry about.
[16:13:36] exactly
[16:13:40] I'm not sure how navigating the budgeting and stuff is going to work
[16:14:00] I 'd say it's not our job so much. My main concern is getting numbers
[16:14:07] But I could muster some political pressure or write up some justification.
[16:14:19] Gotcha. E.g. a projection for memory usage?
[16:14:45] as well as CPU (which is a tad more difficult)
[16:14:59] Na. CPU is super easy :)
[16:15:13] We've already done that work. It seems that 4 workers per CPU lets us max out nicely.
[16:15:33] And with a worker count, we can estimate capacity
[16:15:44] so I can get some numbers for memory out of the current installation. but I don't know how to correlate that with extra models and wikis
[16:15:50] let's do memory first, cpu later
[16:16:13] so, how do prediction models consume CPU ?
[16:16:16] eer
[16:16:20] memory, sorry
[16:16:57] It roughly depends on the prediction strategy and we mix prediction strategies based on their fitness.
[16:17:12] I think we need to do some experiments to see how real memory usage changes as we add models.
[16:17:59] ok, makes sense
[16:18:02] we currently have 50 models working and we can get current memory usage by ores, so getting an estimate is easy
[16:18:24] as an average of memory usage
[16:18:55] akosiaris, thinking about how to prioritize this work -- the experiments with memory.
[16:19:11] (We also need to increase the number of web workers to around 72)
[16:19:24] are we talking prod or labs here btw?
[16:19:29] prod
[16:19:30] both
[16:19:38] labs is easy to manage WRT memory and CPU
[16:19:53] So yeah, I guess prod.
[16:20:00] Prod is the only place I'm worried about hardware.
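As a starting point for the "numbers for memory out of the current installation" mentioned above, something like the following could pull an average resident-memory figure per worker process on a host. This is only a sketch: psutil and the process-name matching are assumptions, not the tooling actually used in this conversation.

```python
# Rough per-worker memory snapshot: average RSS of celery and uwsgi processes.
# Note that RSS double-counts memory shared between forked workers, so treat
# the result as an upper bound per worker.
from collections import defaultdict

import psutil  # assumed to be available on the host

totals = defaultdict(lambda: {"count": 0, "rss": 0})
for proc in psutil.process_iter(["cmdline", "memory_info"]):
    cmdline = " ".join(proc.info.get("cmdline") or [])
    mem = proc.info.get("memory_info")
    if mem is None:
        continue  # process info not accessible; skip it
    for kind in ("celery", "uwsgi"):
        if kind in cmdline:
            totals[kind]["count"] += 1
            totals[kind]["rss"] += mem.rss

for kind, t in sorted(totals.items()):
    if t["count"]:
        avg_mb = t["rss"] / t["count"] / 2**20
        print(f"{kind}: {t['count']} processes, avg RSS {avg_mb:.0f} MB")
```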
[16:21:12] ok, I suppose doing the experiments is not that difficult
[16:21:33] getting measurements is a bit trickier
[16:21:37] +1
[16:21:45] especially due to the shared infra right now
[16:21:52] but I can work on it
[16:22:03] shouldn't be too hard
[16:22:10] We should be able to get accurate measurements outside of the prod env.
[16:22:18] so, we will only care about celery workers, right ?
[16:22:39] we don't expect our experiments to change the footprint of the webapp
[16:22:42] akosiaris, yes, eventually that will be true
[16:22:51] And by eventually, I mean, probably by the end of today.
[16:22:52] yeah, memory usage by web workers will drop drastically soon
[16:23:01] oh.. nice
[16:23:03] how come ?
[16:23:57] we are preventing model files from being loaded in the web app, since we don't need them and they take lots of memory
[16:24:23] oh, we were loading them ? ouch
[16:24:50] :D
[16:24:57] Yeah. It's an artifact of how celery works.
[16:25:04] ok fixing that is great for starters
[16:25:12] Essentially, the client (in the webapp) and the server (in the worker) run the same code.
[16:25:25] yeah I know, I 've written my share of celery code
[16:25:30] never liked that part
[16:25:37] I always wondered why
[16:25:38] Indeed. But then again, we want to replace all of the uwsgi memory usage with new celery workers.
[16:26:23] that's a super urgent issue for us now, 1% of ores extension jobs fail because of that
[16:26:36] ?
[16:26:46] https://grafana-admin.wikimedia.org/dashboard/db/ores-extension
[16:27:11] I want to add more celery workers https://gerrit.wikimedia.org/r/301750
[16:27:17] but we are tight on memory
[16:27:20] and you think it's because uwsgi loads the models ?
[16:27:26] or we don't have enough workers ?
[16:27:38] we don't have enough workers
[16:27:42] ah ok
[16:27:53] but we can't add because memory is already packed with web apps
[16:29:12] as halfak said, these are short term issues, we can fix it by adding some more workers (8 per node would be enough for now), but if we want to grow, it'll be problematic
[16:29:21] * halfak starts to work on the remaining web memory issue.
[16:29:42] ok. so apart from that, when/how do you want to conduct the experiments so we can get some numbers on what hardware we will need in the long term ?
[16:30:03] long term being a couple of years btw
[16:30:28] where couple gets the usual definition https://xkcd.com/1070/
[16:30:36] akosiaris, I want to replicate the analysis I did for https://phabricator.wikimedia.org/T139177
[16:30:55] But I want to do it starting with a couple of models and slowly adding models back into the config.
[16:31:10] I think that using this will allow me to make projections.
[16:31:15] E.g. right now, we host 50 models.
[16:31:39] If we were to stop adding new wikis and just complete support for the wikis we currently partially support, that will change to 100 models.
[16:32:12] If I'm right, that'll double memory usage.
[16:32:32] I should be able to roughly approximate the trajectory we've taken re. memory usage.
[16:32:52] ok that sounds like a plan. how can I help ?
[16:32:53] Woops. that was the wrong phab ticket
[16:32:55] * halfak digs.
[16:32:58] Detox integration might add some memory usage too
[16:33:26] akosiaris, right now, I need a guide for "tell me this in this way so I can relay it to Mark et al."
[16:33:34] but not sure if it is negligible or not
[16:33:36] And as we go I might ask you to help me with some specifics.
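A sketch of the experiment described above: start from a couple of models and add them back one at a time while watching memory grow. The model path glob and the `ScorerModel.load(...)` call are illustrative assumptions about the deploy layout and revscoring API, and `ru_maxrss` only reports peak RSS, so this traces growth rather than exact per-model RES.

```python
# Load model files one at a time and report memory growth after each one.
# The glob pattern is a hypothetical layout; ru_maxrss is peak RSS (KB on Linux).
import glob
import resource

from revscoring import ScorerModel


def peak_rss_mb():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


print(f"baseline: {peak_rss_mb():.0f} MB")
loaded = []
for path in sorted(glob.glob("submodules/*/models/*.model")):
    with open(path, "rb") as f:
        loaded.append(ScorerModel.load(f))
    print(f"{len(loaded):>3} models ({path}): {peak_rss_mb():.0f} MB")
```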
[16:33:41] Amir1, +1
[16:33:47] btw T139177 wrongly gives the impression mobileapps problems were caused by ORES, which is not true
[16:33:47] T139177: Investigate increased memory pressure on scb1001/2 - https://phabricator.wikimedia.org/T139177
[16:33:47] Not sure what that'll look like yet either.
[16:34:20] This one https://phabricator.wikimedia.org/T140020
[16:35:24] so, what I 'd like to get in the long run is a "we 'll need in total X GB of RAM", which can be split over a number of servers ofc
[16:36:07] that's the end goal. now to get there I suppose we could break it down a bit more
[16:36:30] akosiaris: it would be great if you document somewhere why the mobileapp issue is not being caused by ORES (or point me to the right place)
[16:36:31] things like average memory consumption added per model, average memory consumption added per wiki
[16:36:38] akosiaris, I'll be able to give you a relationship between CPU cores and memory
[16:36:52] Amir1: er, even better I 'll link you to the patch that fixed the issue
[16:39:15] halfak: I 'll need that relation as well but it's not that important.
[16:39:41] it's more like a guideline in the end as to which set of equipment possibilities is better suited
[16:40:29] akosiaris, it's the only one that matters, but I can see how it doesn't directly help at purchase time.
[16:40:39] Amir1: https://gerrit.wikimedia.org/r/#/c/298714/
[16:40:44] E.g. if you ask "how much RAM does a machine need" I'll ask "how many cores?"
[16:41:02] actually that depends on the workload very much
[16:41:03] It's like one of those math problems where you need to identify whether the question is answerable or not.
[16:41:32] akosiaris, I'm looking to be *able* to fully utilize a machine. To do that I want 4 workers per core.
[16:41:34] thanks
[16:42:03] So take the RES of a worker and multiply that by 4*cores.
[16:42:06] Amir1: TL;DR service-runner has a limit on the heap, MCS would hit it and problems would arise
[16:42:09] And you get the RAM we need for the machine.
[16:42:26] why do you care about the utilization of the box ?
[16:43:10] I guess I want ORES to be able to handle as much capacity as possible.
[16:43:11] unless you got an infinitely parallelized problem it might not be that important
[16:43:25] Well, it's roughly infinite :)
[16:43:39] score thing * number of things.
[16:43:51] both are more or less finite
[16:43:57] at every given point in time
[16:44:06] finite * finite => finite :-)
[16:44:07] anyway
[16:44:16] that's more philosophical
[16:44:44] it's btw why I asked about CPU usage
[16:44:58] I see, that would directly relate memory usage to CPU usage
[16:45:01] I think utilization is the wrong way to look at this.
[16:45:16] potential utilization is better.
[16:45:29] We often get analysis jobs that will request scores as fast as possible.
[16:45:45] I'd like to tell people that they can make N parallel requests with the highest N we can muster.
[16:46:06] And in order to maximize that N on the hardware we have, I think about potential utilization.
[16:46:23] Batch jobs are funny like that.
[16:46:56] hmm, I get your point of view, but there are probably gonna be other things that will limit that N anyway
[16:47:06] like network, the frontend caches
[16:47:07] etc
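The rule of thumb given above (per-worker RES multiplied by 4 workers per core and the core count) as a tiny calculation. The numbers plugged in below are placeholders for illustration, not measurements from the experiments.

```python
# RAM needed for a box ≈ per-worker RES * workers-per-core * cores.
def required_ram_gb(worker_res_mb, cores, workers_per_core=4):
    return worker_res_mb * workers_per_core * cores / 1024

# Placeholder example: a 1 GB-RES celery worker on a 16-core machine.
print(required_ram_gb(1024, 16))  # -> 64.0 GB
```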
[16:47:29] I mean, you may go around saying to people N but it will end up not being N
[16:47:30] akosiaris, not in our experience on labs
[16:47:46] which is pretty much in line with the API netiquette: we don't clearly set a limit
[16:48:00] Though, you are right, we might end up with a bottleneck to the MediaWiki APIs we use.
[16:48:00] (which btw has some people frustrated at times)
[16:48:36] Still, I don't think we're talking about infinity here and it's not terribly productive to challenge an argument based on that.
[16:48:56] yeah, as I already said, philosophical
[16:49:04] Right now, we have 3X as many workers in labs as in prod
[16:49:17] And our utilization under heavy load comes together nicely.
[16:49:40] ah, so you got me talking about CPU
[16:49:57] so .. the labs workers are almost definitely less powerful than the prod ones
[16:50:19] as is less bogomips, less flops etc
[16:50:45] with all the extras that this means
[16:51:07] that is, the same number of workers in prod hardware will definitely be able to serve more requests
[16:51:10] Cool. So we *might* want to bump that up to 6 workers per CPU.
[16:51:15] that'll be nice to work out.
[16:51:52] well, as far as CPU consumption goes, YES
[16:52:07] as far as memory consumption goes... not currently ?
[16:52:40] +1
[16:52:49] see the nice rabbit hole we go down every time we estimate stuff for a new service ?
[16:54:04] OK. So I'll (1) aim to build a memory projection per celery worker, (2) use that estimation assuming 4 workers per CPU and machines like scb1001/2 to estimate necessary RAM.
[16:54:22] Once we have the new hardware to play with, we can look into increasing the number of workers per CPU.
[16:54:47] ok, sounds like a plan
[16:55:13] I 'd like to help, what can I help with ?
[16:56:29] * halfak thinks.
[16:57:48] ok ok, tell me if you got anything
[16:58:15] so on to another issue. We are missing ORES redis boxes in codfw, that's fully on my plate, I 'll start the wheels moving
[16:58:30] once we got those we should be fully operational in codfw as well
[16:58:36] there is 1 big question
[16:58:40] Awesome.
[16:58:50] That was going to be my suggestion for the thing to do.
[16:59:02] do we want the redis to be shared among DCs ?
[16:59:03] And I have a suggestion that might be reasonable.
[16:59:09] or not ?
[16:59:29] I am mostly thinking not ... but feel free to correct me
[16:59:34] Hmm... I think not since speed will be important and we want to be able to fall back to codfw if necessary.
[16:59:46] Why not run redis on scb nodes with twemproxy?
[17:00:10] memory ?
[17:00:28] We could potentially operate with very restricted memory.
[17:00:45] twemproxy... we are trying to figure out something to replace that software
[17:01:07] E.g. we should be able to get all recent scores in 2GB of memory
[17:01:14] And LRU for keeping it clean
[17:01:56] instead of the current 6 ?
[17:02:12] Yeah.
[17:02:14] 6 would be better
[17:02:17] But 2 could work
[17:02:51] well, we were planning long term here. "could work" should not be our target
[17:03:30] OK, long term, we should have a redis node in codfw.
[17:03:39] er 2
[17:03:41] not 1
[17:03:47] as we do in eqiad
[17:03:54] redundancy reasons
[17:04:00] Ah Yes. For redundancy.
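To illustrate the "recent scores in a bounded amount of memory, with LRU keeping it clean" idea from the discussion above: with redis-py this would look roughly like the following. In a real deployment the cap would live in redis.conf via puppet rather than being set at runtime, and the 2 GB value and host are placeholders from the conversation, not a decided configuration.

```python
# Illustration only: cap the score cache and evict least-recently-used keys.
import redis  # redis-py, assumed available

r = redis.StrictRedis(host="localhost", port=6379)  # placeholder host/port
r.config_set("maxmemory", "2gb")
r.config_set("maxmemory-policy", "allkeys-lru")
print(r.config_get("maxmemory"), r.config_get("maxmemory-policy"))
```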
[17:04:07] right now in eqiad we got 2 with replication between them
[17:05:11] but we could enable twemproxy for automatic failover
[17:07:37] lemme discuss the twemproxy with giuseppe and faidon .. they are not in love with the software either, it has caused us at least 1 big downtime
[17:07:49] I 'll move on with the rest in the meantime
[17:07:56] so anything else we want to discuss ?
[17:08:06] I think that's it.
[17:08:17] I'll ping early next week with progress on memory estimates.
[17:08:22] ok then
[17:08:24] Thanks akosiaris
[17:08:25] :)
[17:08:25] thanks!
[17:08:31] * halfak runs to Lunch
[17:08:38] * akosiaris dinner
[19:34:28] PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:36:17] RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 384 bytes in 3.114 second response time
[20:40:08] OMG done with meetings for the day
[20:40:10] \o/
[21:55:54] So, it looks like the memory usage isn't really all that tied to models.
[21:56:12] Hmm. It looks much more like language artifacts are the cause of massive amounts of memory usage.
[21:57:30] Darn.
[21:57:56] We need those to be loaded into the web nodes for feature extraction.
[22:09:28] Oh wait... no this might be working. I'm dumb. I was looking at the celery workers.
[23:29:57] Revision-Scoring-As-A-Service, ORES: Don't load models into memory on web workers - https://phabricator.wikimedia.org/T139407#2507213 (Halfak) Looking at this on my laptop with the wmflabs deploy config
| model set | uwsgi | celery |
| all | 532MB | 920MB |
| half contexts d...
[23:30:26] Looks like we're getting a benefit from keeping the models out of uwsgi, but it's not as big as I'd like.
[23:42:48] Revision-Scoring-As-A-Service, ORES: Don't load models into memory on web workers - https://phabricator.wikimedia.org/T139407#2507238 (Halfak) I tried writing a little script to get a sense for how our models took up memory. ``` >>> import glob >>> from revscoring import ScorerModel >>> # RES Check 1 ....
[23:56:28] OK. Research complete. Looks like we are getting gains from not loading the models into memory -- just not as much as we'd hoped.