[00:04:47] o/
[08:17:39] o/
[13:26:53] Blah. Looks like my notes didn't make it since I was disconnected and my client didn't know
[13:26:55] * halfak repastes
[13:27:03] Hey Amir1. I've watched memory pressure increase on the scb machines, but our memory usage is roughly the same
[13:27:12] I'm reviewing my memory computation from before and it looks like I dropped an order of magnitude on one of the computations
[13:27:17] It looks like we went from using 72.8GB of RES to 53.2GB.
[13:27:27] Which is maybe good news because it means we should be able to start up 16 more workers
[13:27:31] If my calculations are correct, that would put us back up at 70GB
[13:27:37] We'd then have 80 workers in prod and 96 in labs
[13:27:41] Which would be pretty badass
[13:46:28] Well, I found an issue in prod.
[13:46:46] Looks like we sometimes store features with model scores in the cache.
[13:46:47] Arg.
[13:47:04] Looks like I know what I'm working on right away
[13:48:31] Ooh. This is an easy fix.
[13:58:50] Weird. Cannot replicate on localhost
[13:59:32] WTF
[14:18:19] halfak: o/
[14:18:25] I sent you a telegram earlier
[14:18:47] Sorry I missed it.
[14:19:12] Ahh. Yeah. So I'm working on an issue I found in production
[14:19:23] This result is improperly formatted: https://ores.wikimedia.org/v2/scores/enwiki/reverted/324231245
[14:19:31] I'm trying to work out why this has happened.
[14:20:10] okay, we can have the backlog grooming later :D
[14:20:17] We shouldn't have a "features" or "score" key in there at all. I think it might be that, somehow, we got something dumb in the cache. So I'm working out what could have caused it.
[14:20:27] Regretfully, I can't seem to replicate the issue locally.
[14:28:25] OK. I think I've got it.
[14:28:28] The issue is our cache
[14:28:42] And remember how we blew out the cache earlier to fix beta?
[14:28:47] Well, it seems to have happened again
[14:28:56] halfak: yeah, but for prod?
[14:29:00] The reason that it all worked out in prod is that we have some other cache messing us up
[14:29:08] I don't think we have proper rights
[14:29:25] Alex probably does
[14:29:29] So, here's what I propose: we take this opportunity to rebuild all the models
[14:29:43] why models?
[14:29:51] We ought to do that soon because of changes within revscoring
[14:29:57] And then we increment model numbers
[14:30:04] And that will invalidate all of the old cache
[14:30:39] okay, I get it. Sounds like a plan. I will keep monitoring for the extension
[14:30:40] This will be a bummer in that it will increase our cache misses within ORES for a while
[14:30:49] But it won't screw up the extension's tables
[14:31:18] So, in order to execute this, we need to rebuild the models in editquality and wikiclass before the deployment window today
[14:31:23] I think we can do it.
[14:31:31] I'll start up a second compute node.
[14:31:51] I want to totally re-extract features and build the models again.
[14:32:42] Amir1, what do you think?
[14:33:15] since it'll be our first model version increment
[14:33:41] and I have never tested that functionality for the ORES review tool
[14:33:50] I think we should go slowly
[14:34:01] Oh yeah. I forgot that the review tool will handle this too.
[14:34:02] Darn
[14:34:27] Oh wait.
[14:34:34] We handle model increments manually, right?
[14:34:50] The system doesn't automatically engage in rebuilding the oresscores table, right?
[14:35:07] some parts are manual, some parts are automatic
[14:35:23] What parts are automatic?
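The plan above (rebuild the models, bump their version numbers, and let the old cache entries simply stop matching) is easiest to see with a version-aware cache key. The sketch below is a minimal illustration of that idea, not the actual ORES implementation: the key layout, the version strings, and the in-memory dict standing in for the real cache backend are all assumptions.

```python
# Minimal sketch of version-in-the-key cache invalidation, assuming a key
# layout like "ores:<wiki>:<model>:<version>:<rev_id>" (hypothetical, not
# the real ORES key format) and a dict standing in for the cache backend.

class ScoreCache:
    def __init__(self):
        self._store = {}  # stand-in for the real cache service

    def _key(self, wiki, model, model_version, rev_id):
        # The model version is part of the key, so incrementing the version
        # makes every previously cached score unreachable (a cache miss)
        # without having to delete anything explicitly.
        return f"ores:{wiki}:{model}:{model_version}:{rev_id}"

    def get(self, wiki, model, model_version, rev_id):
        return self._store.get(self._key(wiki, model, model_version, rev_id))

    def put(self, wiki, model, model_version, rev_id, score):
        self._store[self._key(wiki, model, model_version, rev_id)] = score


cache = ScoreCache()
cache.put("enwiki", "reverted", "0.2.0", 324231245, {"prediction": False})

# After rebuilding the models and incrementing the version, the old
# (possibly corrupted) entries are never looked up again:
assert cache.get("enwiki", "reverted", "0.3.0", 324231245) is None
```

As noted in the chat, the cost of this approach is a temporary rise in cache misses inside ORES, but it avoids touching the extension's tables directly.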
[14:36:26] getting the new version and updating the ores_model table
[14:36:39] That sounds less concerning
[14:36:57] Than re-scoring the whole recentchanges table
[14:37:07] https://github.com/wiki-ai/ores/pull/164
[14:42:05] Looks like ores-staging-02 would serve nicely as a second compute node.
[14:42:47] * halfak downloads models at 30MB/s
[14:42:48] Mwahahaha
[14:43:19] This will be a good opportunity to switch the wikiclass models to GB
[14:43:34] the problem with the ORES review tool is that we never tried that system in beta or prod. I'm super worried it breaks something
[14:43:49] We should try in beta then?
[14:44:01] We need to know if this works anyway
[14:44:12] yeah, definitely
[14:44:15] kk
[14:44:29] Let's not be confident doing this in prod today
[14:44:37] Let's just aim for beta
[14:44:47] and if everything goes super fast, let's re-discuss prod.
[14:45:35] sure, for a start we need to increment the version number of testwiki damaging
[14:45:48] I think we can do that all together
[14:46:27] * halfak installs everything from wheels in his venv
[14:46:32] So awesome to not have to compile
[14:49:45] Yeah, wheels are super awesome
[15:02:24] Well, this was a good opportunity to clean up our editquality Makefile
[15:02:27] Lots of little issues
[15:04:05] OK. Looks like we're moving. I'll need to babysit this.
[15:24:15] I just kicked off rebuilding the wikiclass models.
[15:25:01] halfak: please make gb
[15:25:05] please
[15:25:06] Yup
[15:25:11] On it ^_^
[15:25:11] yesss
[15:25:13] :D
[15:25:21] <3 those tuning reports
[15:25:25] Such a good investment
[15:25:41] I want to start looking at profiling reports again too
[15:25:55] yeah, definitely
[15:26:16] halfak: btw, the change prop patch is live now
[15:26:24] Great!
[15:26:39] * halfak races to dashboard to look at CPU
[15:26:50] Ooooh
[15:26:53] I think I can see it
[15:26:59] https://grafana.wikimedia.org/dashboard/db/ores?panelId=5&fullscreen
[15:27:30] \o/
[15:27:32] Goddamn right
[15:27:34] Look at us
[15:27:36] Less CPU
[15:27:39] Less memory
[15:27:46] * halfak struts around like a damn fool
[15:28:30] :D
[15:29:08] I also added the pane for timeout errors
[15:29:31] Oh interesting. Almost all of our errors are timeout errors, it seems
[15:30:12] Oh, I suppose we deployed fixed models for nl and sv
[15:30:15] And hu
[15:30:16] \o/
[15:30:20] So that helps too
[15:31:45] we definitely need more workers
[15:32:04] I'll make the patch
[15:32:18] Cool
[15:32:49] If we believe that our memory usage before was safe, we should be able to have 40 per node now :)
[15:33:02] And still be using a little bit less
[15:38:12] woah
[15:38:26] I'll go for 32 / node for now
[15:38:32] Sounds very reasonable
[15:38:59] Will give us a 1/3rd capacity increase
[15:39:03] :D
[15:39:41] yeah, and when we deploy your changes we will have a smaller memory footprint
[15:39:48] wp10 models are huge
[15:40:08] Good point. I didn't even consider that :)
[15:41:33] OK. I'm going to take a quick shower (crazy morning) and then I'm going to start a series of meetings.
[15:41:45] I'll update re. model progress as I get a chance.
[15:42:09] I think it'll be tight, but I should have PRs ready by the deploy window today
[15:43:38] awesome
[15:51:43] akosiaris: hey, do you have time to review this? https://gerrit.wikimedia.org/r/#/c/304245/
[16:34:00] Amir1, looks like the processing for feature extraction is going slower than expected.
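The capacity numbers being traded back and forth here are quick back-of-the-envelope arithmetic. The sketch below just checks that the figures in the chat are consistent; the per-worker memory figure and the 24-workers-per-node baseline are assumptions inferred from the conversation (chosen to match halfak's 70 GB estimate and the "1/3rd capacity increase"), not measured values.

```python
# Back-of-the-envelope check of the capacity figures discussed above.

current_res_gb = 53.2        # RES after the recent drop (was 72.8 GB)
extra_workers = 16
res_per_worker_gb = 1.05     # assumed; picked so 16 extra workers land near 70 GB

projected_res_gb = current_res_gb + extra_workers * res_per_worker_gb
print(f"projected RES with 16 more workers: {projected_res_gb:.1f} GB")  # ~70 GB

# Later in the day: moving from an assumed 24 workers per node to 32 per node
# is the "1/3rd capacity increase" mentioned (40 per node was judged possible
# but 32 was the conservative choice).
old_per_node, new_per_node = 24, 32
increase = (new_per_node - old_per_node) / old_per_node
print(f"capacity increase: {increase:.0%}")  # 33%
```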
[16:34:27] :(
[16:34:32] So I'm much more skeptical that we'll be ready for beta today
[16:35:23] It's strange to me that we don't have a service deployment window on Friday
[16:35:34] we don't have any deployments on Friday at all
[16:36:07] Otherwise we could just do it tomorrow
[16:37:29] we already have one hell of a week: https://phabricator.wikimedia.org/tag/revision-scoring-as-a-service/
[18:42:16] Ran into another Makefile bug.
[18:42:31] I've processed arwiki, cswiki, dewiki and part of enwiki.
[18:42:38] So this may take all day :/
[19:16:51] Amir1, https://github.com/wiki-ai/wikiclass/pull/24
[19:17:00] Article quality models are all done
[19:30:40] awesome
[19:30:44] I was afk
[19:40:54] :D
[19:41:27] OK. Now ores-compute-01 and ores-staging-02 are processing models
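For context on what "switching the wikiclass models to GB" amounts to: the article quality models were rebuilt as gradient boosting classifiers, and the "tuning reports" mentioned earlier summarize parameter sweeps over candidate estimators. The sketch below shows the general shape of such a tuning run with scikit-learn; the synthetic feature matrix, the six-class labels standing in for wp10 quality classes, and the parameter grid are all placeholders, not the real training data or the grid used for these models.

```python
# A minimal sketch of tuning a gradient boosting model, standing in for the
# wp10 (article quality) rebuild discussed above. All data and parameters
# here are placeholders.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder for features extracted from article revisions and their
# quality labels (Stub/Start/C/B/GA/FA -> 6 classes).
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=10, n_classes=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A small grid standing in for the kind of sweep a tuning report summarizes.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300],
                "max_depth": [3, 5],
                "learning_rate": [0.1, 0.01]},
    cv=3,
    n_jobs=-1,
)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("test accuracy:", grid.best_estimator_.score(X_test, y_test))
```

Re-extracting features for every wiki before refitting is the slow part, which is why arwiki, cswiki, dewiki, and enwiki above took most of the day across two compute nodes.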