[02:52:41] Revision-Scoring-As-A-Service, revscoring, Spike: [Spike] Investigate HashingVectorizer - https://phabricator.wikimedia.org/T128087#2535653 (Sabya) Resolved→Open Closed by mistake.
[07:46:18] (CR) Thiemo Mättig (WMDE): [C: +1] "Code is fine, but I find it a bit complicated. Suggestions inside." (2 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/302703 (https://phabricator.wikimedia.org/T141978) (owner: Ladsgroup)
[13:57:26] o/ Amir1
[13:57:46] halfak: o/
[13:57:49] Hey
[13:57:58] I accidentally self-merged my updates to the huwiki reverted model yesterday. I figured it was OK and not worth the revert
[13:58:07] it's okay
[13:58:14] I'd like to get that out to labs today.
[13:58:23] great
[13:59:13] I think we should do with uwsgi what we did with celery
[13:59:20] restarting after a while
[13:59:20] What is that?
[13:59:23] Oh
[13:59:24] Hmm
[13:59:27] http://uwsgi-docs.readthedocs.io/en/latest/Options.html
[13:59:32] max-requests
[13:59:37] a very simple puppet change
[14:06:25] Amir1, that doesn't seem too crazy, but it also seems like we don't really have memory leaking issues on uwsgi.
[14:06:46] Amir1, do you think that we'll get some other benefit from periodic restarts?
[14:06:46] we can test that
[14:07:02] basically we know for sure that there is a memory leak
[14:08:17] we are dealing with tight memory on web nodes right now
[14:08:28] I'm not saying that we have a memory leak, but this way we can be sure
[14:08:44] are you okay with that test?
[14:13:13] Hmm... Yeah. I think so.
[14:13:48] I want to get some metrics for precached deployed soon. That will allow us to know what our request failure rate is.
[14:13:51] Amir1, ^
[14:15:23] yeah, that would be great
[14:15:39] OK. I think we're pretty much ready with the precached metrics.
[14:16:03] I'll make sure that is ready as soon as I finish testing the new Tamil PR.
[14:17:18] ^ unrelated
[14:17:28] Just wanted to push my last changes for work on sparse vectors.
[14:22:11] I need to go and travel for a while to see my parents
[14:22:22] I'll be back in a few hours
[14:23:13] OK. Will keep cleaning for a little bit. I might have a puppet change for uwsgi ready for you too.
[14:23:26] nah, I'm already on it
[14:24:09] Cool
[14:36:07] https://gerrit.wikimedia.org/r/#/c/303807/
[14:36:12] see you later
[14:41:59] o/
[14:42:08] Revscoring 1.2.9 is not on PyPI
[15:21:21] Revision-Scoring-As-A-Service-Backlog, ORES: Add graphite logging to precached - https://phabricator.wikimedia.org/T119341#2536684 (Halfak) https://github.com/wiki-ai/ores/pull/163
[15:21:30] Revision-Scoring-As-A-Service, ORES: Add graphite logging to precached - https://phabricator.wikimedia.org/T119341#2536685 (Halfak)
[15:21:45] Revision-Scoring-As-A-Service, ORES: Add graphite logging to precached - https://phabricator.wikimedia.org/T119341#1823853 (Halfak) a: Halfak
[19:39:41] Just released RTRC v1.3.0 to stable, which will move its traffic from wmflabs.org to ores.wikimedia.org (previously only for RTRC beta users)
[19:40:17] Great! Thanks for the heads up. I'll check out our performance in a couple of hours to make sure we're handling the capacity OK
[19:40:27] Regretfully, it turns out that we have less capacity in prod than in labs.
[19:40:31] A lot more constraints.
[19:40:38] Still, you should see *much* more stability.
[19:41:26] https://github.com/Krinkle/mw-gadget-rtrc/releases/tag/v1.3.0
[19:41:27] Okay :)
[19:42:11] halfak: I notice a significant improvement wrt latency. It's almost instantaneous for a batch of 50 revisions.
[19:42:15] recent revisions.
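An aside on the "metrics for precached" work mentioned above (T119341, wiki-ai/ores#163): in essence it comes down to bumping a pair of statsd counters around every precache scoring attempt, which is what later makes a failure-rate graph possible. A minimal sketch, assuming a plain `statsd` client; the prefix and counter names mirror the metric paths quoted further down in this log, while the host, port, and exact counting semantics are assumptions, and the real patch may route through ORES's own metrics collector instead.

```python
import statsd

# Assumed statsd endpoint; the prefix mirrors the metric paths that appear
# later in this log (ores.ores-web-03.*).
metrics = statsd.StatsClient('localhost', 8125, prefix='ores.ores-web-03')


def precache_score(score_fn, rev_id):
    """Score one revision for the precache daemon, recording counters."""
    metrics.incr('precache_score')  # every attempt; the real PR may count differently
    try:
        return score_fn(rev_id)
    except Exception:
        metrics.incr('precache_scoring_error')  # failures only
        raise
```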
[19:42:49] Krinkle, good. Yeah. We've had some caching issues in labs recently. I'm working to get that cleaned up before we move recent code to prod.
[19:42:59] there is no longer a human-observable difference imho between api-recentchanges and api-recentchanges.then(ores)
[19:43:15] It used to be that the latter was 300ms+ longer.
[19:43:26] I guess the warmup also helps with this
[19:43:39] Although that is presumably happening in labs too
[19:43:42] somehow?
[19:43:59] It is, but for some reason, it seems the cache isn't as fast as expected in labs.
[19:44:10] I'm not sure if it's not being used in some cases or if there's some other issue.
[19:47:08] halfak: What's the source for warmup in labs? rcstream?
[19:47:17] Yup
[19:47:20] In prod it's the job queue?
[19:47:29] Or also rcstream
[19:50:07] Prod uses Change Propagation
[19:52:29] (Kafka/Event stream)
[19:59:20] Nice
[20:00:03] halfak: Can I see where that is set up? I'm curious what is between Kafka and ORES for this.
[20:00:36] Sorry. Just hopping into a meeting. Will dig it up for you soon. Or you could ask about it in -services.
[21:22:36] halfak: https://grafana.wikimedia.org/dashboard/db/ores
[21:22:55] we definitely have a memory leak in uwsgi
[21:41:01] Amir1, look back a couple of days
[21:41:20] We went a whole week without a memory change.
[21:41:22] That's not a leak
[21:43:10] yeah, it stays the same, but still, periodic restarts improved the memory usage drastically
[21:43:33] Sure. The system dynamically allocates memory
[21:43:45] It's probably happening in feature extraction
[21:43:52] Rather, datasource extraction
[21:44:00] Which happens on the web workers
[21:44:30] yeah
[21:44:44] so you're okay with keeping it this way?
[21:44:59] Yeah. I think so. I don't think it's actually a problem to solve.
[21:45:58] okay
[21:46:55] halfak: I think we are good to go to production now
[21:46:56] Hmm... By doing periodic restarts of the workers, we're forcing a re-allocation of memory
[21:47:06] Amir1, I'm running some tests against staging.
[21:47:08] It does look OK
[21:47:13] But I want to make sure
[21:47:22] Also, we missed our window
[21:47:29] you, tomorrow
[21:47:32] *yeah
[21:47:48] Let me update editquality
[21:53:36] Make sure you update ORES too so that we get the right output format for v1
[21:53:45] Amir1, ^
[21:54:08] I think I did, and I deployed it to beta
[21:54:16] that's why we don't have failed jobs anymore
[21:54:22] https://grafana.wikimedia.org/dashboard/db/ores-extension
[21:54:57] Good.
[21:55:23] Just to confirm, you're not seeing a higher rate of TimeoutError or anything like that?
[21:55:40] on web nodes or beta?
[21:55:46] beta
[21:55:52] nope, beta is okay
[21:55:54] Oh... hmmm... Beta doesn't get that much activity
[21:56:09] the instance is running out of memory though, it's at 3% now
[21:56:09] :D
[21:56:12] I'm looking at staging now and I'm checking on the revisions that precache reports as timing out.
[21:56:26] Amir1, lol. We could probably turn uwsgi and workers down a bit there
[21:56:48] yup, I'll do it soon
[21:57:06] do you get timeouts in staging?
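For reference, the periodic-restart idea from earlier ([13:59]) maps onto uwsgi's documented `max-requests` option: each worker is recycled after serving that many requests, which forces the memory re-allocation halfak describes just above. A minimal sketch of the relevant stanza; the number is illustrative, and the real value is whatever landed in the puppet change (gerrit 303807).

```ini
[uwsgi]
; Recycle each worker process after it has handled this many requests,
; releasing whatever memory datasource extraction has ballooned it to.
; 1000 is an illustrative value, not the one from the actual change.
max-requests = 1000
```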
[21:57:18] in the labs setup, timeout errors stopped
[21:57:22] https://icinga.wikimedia.org/cgi-bin/icinga/avail.cgi?t1=1470596539&t2=1470682939&show_log_entries=&full_log_entries=&host=ores.wmflabs.org&service=ORES+worker+labs&assumeinitialstates=yes&assumestateretention=yes&assumestatesduringnotrunning=yes&includesoftstates=no&initialassumedhoststate=0&initialassumedservicestate=0&timeperiod=thisweek&backtrack=4
[21:57:24] It looks like the ones that do are timing out for good reason
[21:57:54] And the only DependencyErrors are in non-main revisions for Wikidata
[21:58:36] oh, we disabled those in prod, both in the extension and in change propagation
[21:59:03] Sure. Not worried about them :)
[21:59:20] I want to run this code for a little while in labs before we try prod. I think I'm ready for that deploy now. :)
[22:00:13] * halfak starts
[22:00:16] I wanted to say, it would be good if we filter them out in our precaching system too
[22:00:26] Agreed.
[22:03:07] RTRC just switched to ores.wikimedia.org and you can *really* see it in the scoring requests graph
[22:03:12] I think that was our last major user
[22:08:44] OK, deploy to labs is complete.
[22:10:44] nice, I'm thinking about how I can reduce the number of workers in beta
[22:10:58] it's in the deploy repo
[22:11:09] Custom config for beta?
[22:11:14] and if I change it, it also changes prod
[22:11:25] I don't think that would be a good idea
[22:11:41] I think we should rewrite that using puppet
[22:11:48] We already have one for the passwords
[22:11:58] Yeah, that's what I was imagining too
[22:12:08] and fix it via hiera
[22:12:29] it's not in puppet though, I should add it
[22:21:41] https://gerrit.wikimedia.org/r/303928
[22:24:18] halfak: this will add it ^ then I will change it via hiera for beta
[22:25:02] Amir, IMO, this should be in the config repo and configured specifically for beta.
[22:25:15] Oh... wait. I think I see what is going on
[22:25:28] We need to have it in the custom config because we'd like to set it in hiera
[22:25:38] yup
[22:25:42] Damn
[22:26:24] I don't want to diverge the repos for beta and prod
[22:26:36] and I want the puppet configs to stay as similar as possible
[22:26:38] Yeah... not what I was thinking
[22:26:54] I was thinking that we'd have a custom config for beta that was checked into the repo
[22:27:02] But hiera seems to be a more standard way to do these things.
[22:27:22] It's just that checking a config into the repo seems to be a more powerful way to do this.
[22:27:23] yeah,
[22:27:32] I think we're stuck with hiera
[22:27:52] I'd really like to have the config in the repo reflect prod as closely as possible.
[22:28:21] As it stands, we should really make it so you can't even run ORES without a custom config; otherwise we duplicate settings and that's confusing
[22:30:14] legoktm: hey, around?
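Roughly what "change it via hiera for beta" would look like: once the worker counts are parameters of the puppetized config (gerrit 303928 above), the deployment-prep (beta) hiera can override just those numbers while the checked-in defaults keep matching prod. A sketch with hypothetical key names; the real parameter names are whatever that puppet change defines.

```yaml
# Hypothetical hiera overrides for the beta cluster (deployment-prep);
# key names and values are illustrative, not taken from the actual change.
ores::web::workers: 5            # uwsgi worker processes
ores::worker::celery_workers: 7  # celery workers
```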
[22:32:05] (CR) Ladsgroup: "I'll do that in another change set :) Thanks for the review" [extensions/ORES] - https://gerrit.wikimedia.org/r/302703 (https://phabricator.wikimedia.org/T141978) (owner: Ladsgroup)
[22:32:12] (CR) Ladsgroup: [C: +2] Jobs fail instead of throwing error when score is not right [extensions/ORES] - https://gerrit.wikimedia.org/r/302703 (https://phabricator.wikimedia.org/T141978) (owner: Ladsgroup)
[22:32:17] :D
[22:33:02] (Merged) jenkins-bot: Jobs fail instead of throwing error when score is not right [extensions/ORES] - https://gerrit.wikimedia.org/r/302703 (https://phabricator.wikimedia.org/T141978) (owner: Ladsgroup)
[22:36:53] Amir1: sup
[22:37:15] I just merged my own patch, not a big deal
[22:37:26] Amir1, error rate in labs is 1.5-2% :)
[22:37:37] * halfak continues to watch precaching :D
[22:37:38] legoktm: but it would be great if you could review this one: https://gerrit.wikimedia.org/r/264608
[22:37:56] halfak: nice
[22:38:04] Oh wait... something is weird.
[22:38:10] it's around what we have with prod
[22:38:11] This might be 150-200%
[22:38:12] :S
[22:38:29] :/
[22:38:41] How is the metric working?
[22:39:12] Amir1: okay, I'll put that on my list for tonight
[22:39:15] "ores.ores-web-03.precache_scoring_error.count / ores.ores-web-03.precache_score.count" should work, right?
[22:39:20] yess
[22:39:23] Hmmm
[22:39:24] thanks legoktm
[22:39:44] The left side of the graph has values > 1
[22:39:50] Which should not be possible
[22:39:58] Unless it's somehow multiplying by 100
[22:41:00] Weird, when I plot the raw values, the 1-2% makes sense
[22:41:16] graphite must be like "Oh you want a percentage whenever you divide"
[22:41:22] STOP BEING SMART
[22:41:29] IT'S CONFUSING
[22:42:45] get the values
[22:43:09] see if they are really being multiplied by 100
[22:43:24] Yeah. They are
[22:43:28] Blah!
[22:46:10] Ha! I forced it to knock it off
[22:46:26] By scaling to seconds and then dividing that
[22:46:28] Mwahahahah
[22:51:13] https://grafana.wikimedia.org/dashboard/db/ores-extension
[22:51:30] halfak: we have a panel for failure rate now
[23:01:13] wmf.15 :((((
[23:06:44] Amir1, just finished adding a failure rate to https://grafana-labs-admin.wikimedia.org/dashboard/db/ores-labs
[23:06:54] Rather https://grafana-labs.wikimedia.org/dashboard/db/ores-labs
[23:07:08] Oh never mind. The dashboards are still not publicly visible
[23:07:43] halfak: why grafana-labs?
[23:07:45] put them in grafana
[23:07:59] Because people have been telling me to move
[23:08:00] it's fixed now (yuvi fixed it several days ago)
[23:08:09] But he said to move anyway?
[23:08:24] no, he told me to use grafana
[23:08:39] https://phabricator.wikimedia.org/T141891#2530657
[23:08:53] Oh... "once it works"
[23:09:00] grumble grumble
[23:22:32] halfak: https://grafana.wikimedia.org/dashboard/db/ores-beta-cluster
[23:22:55] Wow. Looking at that web memory
[23:22:56] I guess 7 workers is enough, and also 4 or 5 uwsgi workers
[23:23:31] Daniel Zahn just merged the puppetization change
[23:24:03] and I fixed stuff with it
[23:31:24] Amir, we should be able to get away with 1/3 of the workers with the recent changes in ORES
[23:48:57] RC in beta is pretty quiet
[23:50:21] * halfak starts working on a spam bot
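To reconstruct the failure-rate fix described above ("scaling to seconds and then dividing"), here is one way to sanity-check the corrected ratio against graphite's render API from Python. The endpoint, time range, and exact target string are assumptions; the metric paths are the ones quoted in the log, and the dashboard's actual panel target may differ.

```python
import requests

# Assumed graphite endpoint; the metric paths are the ones quoted above.
GRAPHITE = 'https://graphite.wikimedia.org/render'
TARGET = (
    'divideSeries('
    'scaleToSeconds(ores.ores-web-03.precache_scoring_error.count,1),'
    'scaleToSeconds(ores.ores-web-03.precache_score.count,1))'
)

resp = requests.get(GRAPHITE, params={'target': TARGET, 'from': '-1h', 'format': 'json'})
series = resp.json()
if series:
    # datapoints are [value, timestamp] pairs; value is None where there is no data
    rates = [v for v, _ in series[0]['datapoints'] if v is not None]
    if rates:
        print('mean precache failure rate: {:.2%}'.format(sum(rates) / len(rates)))
```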