[14:02:34] halfak: fyi, we picked up again the budgeting exercise. No rush but if you have cpu/ram/disk/network needs for the next fiscal year now is the time
[14:02:56] akosiaris! Oh! Good to know.
[14:03:30] This is good timing. I'm making some estimates right now for what I think we can still do with the hardware we have.
[14:04:04] akosiaris, can I have you comment on something quickly? I'm not sure if I'm doing a wasteful exploration or not.
[14:04:11] My next step with https://phabricator.wikimedia.org/T242705
[14:04:26] ...is to look into switching our uwsgi to be threaded.
[14:04:44] They are IO blocking so in theory it should not cause a bottleneck.
[14:04:57] But I was originally advised by yuvipanda not to go down that road.
[14:05:00] What do you think?
[14:06:41] It sounds like you recommend extensive testing. Would you rather see us focus on new hardware?
[14:06:58] generally speaking, yuvi made the sensible suggestion. Python is known to not behave very well with regards to threading due to the GIL.
[14:07:14] Honestly, this uwsgi restart issue is pretty insane and we haven't found another way around it.
[14:07:47] it is indeed. I cannot understand why it would consume all this memory on stop
[14:08:30] Right. Or CPU. That doesn't make any sense either.
[14:08:44] oh the CPU I think is a bit easier to explain
[14:08:51] Oh? How's that?
[14:09:00] it's the garbage collector that tries to GC all that memory that suddenly gets allocated
[14:09:05] at least I think
[14:09:17] Oh. So the CPU spike only makes sense given a memory spike?
[14:09:33] But you reported a CPU spike with no memory spike (which I am highly skeptical of)
[14:09:34] there must be a crazy churn rate of memory allocation/freeing at that point in time
[14:09:51] it was in a different example though, right?
[14:10:15] Oh. Well, it happens with gunicorn and uwsgi all the same.
[14:10:29] Even when it's that simple flask app that just uses a bunch of memory to store numbers.
[14:10:34] my test for that was to start it up, see the memory increase as normally expected, wait it out a bit and when things stabilized, send the INT signal to stop it
[14:10:46] then memory usage decreased and cpu usage increased for a short time
[14:11:03] which is expected. Python was using CPU to garbage collect all those objects
[14:11:12] Hmm. I've never seen memory usage decrease.
[14:11:18] While the CPU spiked.
[14:11:37] yes, this is not happening in our bug/case. quite the contrary
[14:11:45] I was checking memory usage by looking at KiB Mem "free" in top.
[14:11:51] memory increases when we send the INT signal
[14:11:57] which is insane tbh
[14:12:26] so I think that simple flask app is not really reproducing the issue
[14:12:51] Hmm. But it does on my machine. And on our VMs in cloud services.
[14:13:02] At least I see it in free memory on top.
[14:13:06] Let me double check.
[14:13:24] you see memory usage increase after you kill the service?
[14:13:42] s/service/the flask app/
[14:13:51] Right.
[14:14:04] I don't locally. let me retry it in fact
[14:15:10] I'm running Ubuntu 16.04 and testing with Flask 1.0.3
[14:15:49] Debian 10 and 1.1.1 but I doubt it's important in any way
[14:16:06] oh and gunicorn 20.0.4, with the command in your paste
[14:17:20] ok, started it up, I see the memory increase indeed as expected and then stabilize at a normal level (I've changed the number of integers a bit to fit in my local memory). Everything up to now is fully expected
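(For reference: the "simple flask app that just uses a bunch of memory to store numbers" isn't reproduced in this log. A minimal sketch of what such an app.py might look like is below; the 2m-item dict matches the figure mentioned later at 14:36, but the route and variable names are assumptions, not the actual paste.)

    # app.py -- minimal memory-hogging test app (sketch, not the original paste)
    from flask import Flask

    application = Flask(__name__)

    # Allocate a pile of small integer objects at import time so that, with
    # gunicorn's --preload, the memory lives in the master and is shared
    # copy-on-write with the forked workers.
    BIG_DICT = {i: i for i in range(2_000_000)}

    @application.route("/")
    def index():
        return "holding {} integers\n".format(len(BIG_DICT))

(Run roughly as "gunicorn --workers=16 --preload app:application", then watch free memory and per-worker RES while sending SIGINT, as described in the rest of the log.)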
[14:18:13] and I just hit Ctrl-C and I see in htop all my CPUs pegged at 100% for like 5s and in the meantime memory usage per worker decreasing substantially with each passing second
[14:19:04] and that increased CPU usage is explained by the garbage collector running
[14:19:45] but memory usage is definitely at its ceiling while the flask app is idling
[14:20:04] Hmm. Looks like you are right. I'm not seeing the memory spike. Maybe I was confusing gunicorn with uwsgi behavior
[14:20:13] which is not the case with uwsgi-ores. In that case memory usage skyrockets when it receives SIGINT
[14:20:40] btw, using gunicorn and the actual uwsgi-code, is the original issue reproducible?
[14:20:43] Hmm. I do see a rise in memory usage though. All of the workers go from 5500 KB to 5800 KB while they pin the CPU usage.
[14:21:01] I'd tested that and confirmed it in my notes. Let me retry.
[14:21:54] Oh! Looks like we need to use the "--preload" option or gunicorn works strangely.
[14:22:03] strangely == not like prod. No shared memory.
[14:22:10] I had that in my notes.
[14:22:24] I have them at a RSS of 1585MB at peak (flask app idling) and when I hit ctrl-c every process drops by 300-400MB per sec
[14:22:33] yeah, I use --preload per your notes
[14:22:50] Oh yeah. I see we did. Somehow didn't see it when I copy-pasted the command.
[14:23:12] * halfak runs ores-prod code.
[14:25:19] Forgot to look at free. Was looking at RES for each worker. FWIW, I do not see a rise in RES ever.
[14:25:29] It's always a huge drop in free.
[14:25:56] yeah I only see the rise if RSS while the app is starting up (while creating the integer array)
[14:26:02] s/if/in/
[14:26:18] 9.8GB free drops to 6.9GB free while shutting down.
[14:26:24] oh wait, the above is for the flask app, not ores production code
[14:26:31] Ores production code.
[14:26:43] I haven't tested ores production code with gunicorn, is there a place I can easily?
[14:27:33] Just pull down the ores-prod-deploy repo and run "gunicorn --workers=16 --preload ores_wsgi:application" in the base of it.
[14:27:54] oh that will work? no need to also clone submodules?
[14:28:19] * akosiaris trying
[14:28:22] Oh yeah, you'll need to pull in the submodules and install the requirements.
[14:28:28] Sorry that wasn't clear.
[14:28:33] ok, makes sense
[14:28:43] I just confirmed that I see the same drop in "free" with the test app that I made.
[14:29:06] I dropped from 10GB free while the app is running to 7GB free while it is shutting down. Then I rise back up to 10GB free as soon as it is gone.
[14:29:14] * halfak wipes brow
[14:29:21] I can believe my old notes and that is a relief :D
[14:30:13] BTW, I removed one zero from my loop so that each worker reports ~200MB of RES rather than 1.7GB :|
[14:30:21] Much more manageable.
[14:30:37] ORES in prod reports ~405MB RES so it's more comparable too.
[14:30:49] yup, done that already. On my first try with that code I almost lost my machine
[14:31:02] Ha. I'll edit the comment so no one else gets bit.
[14:36:41] Confirmed: Here's what I see with 2m items in the app.py dict: 9.7GB free, start gunicorn for app.py, 9.3GB free, INT, 6.4GB free, 9.7GB free
[14:36:53] top reports ~380MG of RES per worker.
[14:36:56] *MB
[14:42:31] a neat trick with top to monitor only those processes all the time is top -c -p $(pgrep -d ',' -f gunicorn)
[14:42:43] so you can focus only on the RES change while doing the tests
[14:42:44] Oh nice :)
[14:42:56] I just sort by RES and they are at the top.
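(A small sketch, not from the conversation, of doing the same observation programmatically instead of eyeballing top: sample per-worker RSS and the system's free memory once a second while starting, idling, and shutting the app down. It assumes the third-party psutil package; matching processes by command line mirrors the pgrep trick above.)

    # monitor.py -- watch gunicorn worker RSS and system free memory (sketch)
    import time
    import psutil

    def gunicorn_procs():
        # Roughly equivalent to `pgrep -f gunicorn`: any process whose
        # command line mentions gunicorn.
        return [p for p in psutil.process_iter(["pid", "cmdline"])
                if any("gunicorn" in part for part in (p.info["cmdline"] or []))]

    while True:
        rss_mb = []
        for p in gunicorn_procs():
            try:
                rss_mb.append(round(p.memory_info().rss / 1024 ** 2))
            except psutil.NoSuchProcess:
                pass  # a worker exited between listing and sampling
        free_gb = psutil.virtual_memory().free / 1024 ** 3
        print("workers={} RSS(MB)={} free={:.1f}GB".format(len(rss_mb), rss_mb, free_gb))
        time.sleep(1)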
[14:43:15] RES never seems to go up for the workers or anything else reported by top.
[14:44:16] So, one of the things I have been thinking about is this: All of the workers fork from the main process and are getting the benefits of shared memory. They report a high RES but are not actually using that much themselves.
[14:44:48] We send an INT to the main process and it starts to shut down. Maybe because the main process is shutting down, all of the workers immediately copy memory that was shared to preserve it.
[14:45:02] Then after they have copied the memory, they then deallocate and shut down themselves.
[14:45:33] I think the master tells the children to stop first and then it shuts down but you've got a point. Maybe it does internal cleanups before sending the signal?
[14:45:33] Seems like this would be the kernel doing something stupid. Or maybe both gunicorn and uwsgi don't do their shutdown dance in the right order. What do you think?
[14:46:25] if that is true it's quite possible that copy-on-write memory will have to be copied over to the children
[14:46:47] Right. That seems to line up with reporting no increase in RES but seeing a lot of free memory go away.
[15:01:35] Jumping into a meeting. akosiaris, if you have any high level thoughts like "it's worthwhile to keep digging into this restart issue" or "let's focus on new hardware", I'd like to take them into consideration as we plan out what models we can deploy and when.
[16:22:04] hey Helder: I saw your questions yesterday but couldn't get back to you due to meetings. I haven't run the example code in a couple of years, but last time I did it worked just fine. If it doesn't work, that's a bug we should fix
[16:22:23] and I appreciate your patience with me getting back to you on this! :)
[16:22:41] No problem :)
[16:24:30] Nettrom, this is what I got when I tried: https://pastebin.com/yNwCMr2U
[16:31:40] Hello halfak
[16:31:52] I've been going through: https://etherpad.wikimedia.org/p/nlp_for_ores
[16:32:21] I don't get this part though: Trey's code for stripping invisible/tokenization-breaking characters (it's Perl, sorry). Note that for my use case I've already done tokenization via Elasticsearch and I strip all these so I can do simple match with Unicode regex patterns (like /^\p{Arabic}+$/) on the remainder to classify tokens. If you are
[16:32:21] normalizing text before sending it to a tokenizer, some you want to strip, others you want to convert to spaces.
[16:32:31] It's on line 31
[16:32:51] Does this apply to revscoring? The part of doing tokenization via Elasticsearch
[16:33:21] Helder: thanks for capturing that, I'll look into it when I have time
[16:33:56] it's unfortunately something I'll have to do as a volunteer, so it might take a few days
[16:36:33] No hurries
[16:53:15] MediaWiki-extensions-ORES, Scoring-platform-team, Discovery-Search, NewcomerTasks 1.1, and 3 others: Expose ORES drafttopic data in ElasticSearch via a custom CirrusSearch keyword - https://phabricator.wikimedia.org/T240559 (MMiller_WMF) Open→Resolved I'm resolving this, now that we're se...
[16:53:17] Scoring-platform-team, Discovery-Search, Epic, Growth-Team (Current Sprint): [EPIC] Growth: Newcomer tasks 1.1.1 (ORES topics) - https://phabricator.wikimedia.org/T240517 (MMiller_WMF)
[16:53:53] haksoat, hey!
[16:53:57] Heavy meeting day today.
[16:54:43] Oh. Okay. We could talk about it tomorrow. Or when you are chanced. halfak
[16:55:07] So yeah, that etherpad is relevant. Right now, we do a lot of manual work to try to make our tokenizer process word-like things. Trey was pointing me towards better strategies.
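(As an aside on the etherpad quote above: a minimal sketch of classifying already-tokenized text by Unicode script with patterns like /^\p{Arabic}+$/. It uses the third-party `regex` module, since the standard library `re` doesn't support \p{...} script properties; the script list and example tokens are made up for illustration and are not Trey's actual Perl code.)

    # token_scripts.py -- classify tokens by Unicode script (sketch)
    import regex

    SCRIPT_PATTERNS = {
        "arabic": regex.compile(r"^\p{Arabic}+$"),
        "latin": regex.compile(r"^\p{Latin}+$"),
        "cyrillic": regex.compile(r"^\p{Cyrillic}+$"),
    }

    def classify(token):
        # Return the first script whose pattern matches the whole token.
        for name, pattern in SCRIPT_PATTERNS.items():
            if pattern.match(token):
                return name
        return "other"

    # e.g. classify("word") -> "latin", classify("слово") -> "cyrillic"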
[16:55:29] Using elasticsearch as an API is an interesting opportunity but I'm not sure we'll be able to do it.
[16:55:50] In the meantime, we can at least learn from Trey's experience to improve our tokenization.
[16:56:35] Yeah. Reading up on Trey's articles. Great work done there.
[16:57:10] Any reason why you don't think we'll be able to do it?
[16:57:21] That is, using elasticsearch as an API
[16:58:07] AFAICT, the tokenization API is not exposed and their team manager said they couldn't spare the time to expose it in the next 3 months.
[16:59:54] Oh. Okay. I get it now.
[19:38:12] Hey folks! Ooof it's been a long meeting day so I'm super late with async today.
[19:38:31] Kevin:
[19:38:33] :
[19:38:33] Jade
[19:38:33] Added MW message key for jade-setpreference
[19:38:33] Added MW message key for jade-moveendorsement
[19:38:34] MW Core
[19:38:36] Adjust notification type colors to match utility colors in the Wikimedia Design Style guide
[19:38:38] Add notification type success
[19:38:40] T:
[19:38:42] Remove "successfully" from translation messages
[19:38:44] Based on: https://www.mediawiki.org/wiki/Localisation#Avoid_jargon_and_slang
[19:38:46] Added MW message key for jade-setpreference
[19:38:48] Added MW message key for jade-moveendorsement
[19:38:50] accraze:
[19:38:52] Y: Worked on fixing up the link table helper classes to reflect our updated schema, started work on creating migration files for secondary schemas, also code review
[19:38:55] T: More of the same (mostly working on migration files), I need to make a schema file for jade_facet and write a script to populate it on extension install, also more code review.
[19:38:58] And me, halfak:
[19:39:00] Y: I mostly worked on ptwiki with chtnnh in the morning. I also reached out to the GSOC folks to get signed up as a mentor. I talked to Erika to get some updates on hiring (will bring to staff) and I worked on some metrics stuff and tuning session stuff. I ended up checking out early due to some back pain that has happily mostly eased up for today.
[19:39:07] T: It's been meetings all day. Tuning session, topic embeddings, intervals, interviews, and working with my grad student who is studying ORES auditing strategies (Zach who is not around IRC). I still have one more interview and then I'm going to start working on a capex proposal by estimating the amount of additional memory we need for each topic model in ORES.
[19:39:44] I should mention that most of this morning was dedicated to re-hashing debugging of our uwsgi woes. I'm going to try to get a post on stack overflow today and then link it from a bug filed against uwsgi.
[21:41:52] OK just finished drafting this monster: https://stackoverflow.com/questions/61130651/memory-available-free-plummets-and-cpu-spikes-when-shutting-down-uwsgi-gunicor
[21:49:48] accraze, if you have a minute, I'd appreciate any feedback about how I put that stackoverflow question together.
[21:50:02] cool taking a look right now
[21:51:46] I think tomorrow morning, I'm going to start working on estimates of the additional memory usage of adding more topic models. In the meantime, I'm really hoping we can use this to get some headspace!
[21:53:59] yeah this stackoverflow question looks great
[22:01:19] Alright! I'll leave it there for today.
[22:01:28] Have a good one accraze :)
[22:01:29] o/
[22:01:38] later halAFK
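(Finally, a minimal sketch — not the code from the stack overflow post — illustrating the copy-on-write theory from around 14:44: a parent allocates a big dict, forks workers that initially share those pages, and the system's free memory only drops once each child touches or deallocates the shared objects, because CPython's refcount updates dirty the pages and force private copies. The sizes and worker count are arbitrary, and it needs a Unix-like OS for os.fork().)

    # cow_demo.py -- fork workers that share, then privately copy, a big dict (sketch)
    import os
    import time

    BIG_DICT = {i: i for i in range(2_000_000)}  # allocated once, pre-fork

    children = []
    for _ in range(4):
        pid = os.fork()
        if pid == 0:
            # Child: idle while still sharing pages with the parent...
            time.sleep(10)
            # ...then touch every object. Refcount writes force copy-on-write
            # copies, which is when "free" memory actually drops.
            total = sum(BIG_DICT.values())
            os._exit(0)
        children.append(pid)

    print("parent", os.getpid(), "forked", children)
    for pid in children:
        os.waitpid(pid, 0)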