[05:14:16] (PS2) Ladsgroup: Make FetchScoreJob.php more readable [extensions/ORES] - https://gerrit.wikimedia.org/r/304686
[05:22:29] (CR) Ladsgroup: [C: 2] Make FetchScoreJob.php more readable [extensions/ORES] - https://gerrit.wikimedia.org/r/304686 (owner: Ladsgroup)
[05:23:44] (Merged) jenkins-bot: Make FetchScoreJob.php more readable [extensions/ORES] - https://gerrit.wikimedia.org/r/304686 (owner: Ladsgroup)
[09:12:11] Revision-Scoring-As-A-Service, ORES, User-Ladsgroup: Add uwsgi-related metrics to grafana - https://phabricator.wikimedia.org/T143081#2556425 (Ladsgroup)
[12:37:51] Amir1: I see statsd works fine... quite a few metrics though. It's gonna be a bit of a pain to create meaningful dashboads
[13:35:42] akosiaris: yeah, I got some experience with metrics now :D
[13:43:24] o/ Amir1 & akosiaris
[13:43:34] halfak: hey
[13:43:44] Hey akosiaris. I have an idea I want to run past you re. starting up more workers on scb nodes
[13:43:45] sorry, I was asleep last night :)
[13:44:10] ^ totally don't need to apologize for basic human stuff :P
[13:44:23] I was working late because I was AFK for so long traveling
[13:45:01] halfak: how many more workers ?
[13:46:15] we should be ok with a few. We got some 20Gs free per box. But let's not pressure it too much
[13:47:08] halfak: we already in 32 / node
[13:47:27] (I just recently increased them)
[13:47:46] akosiaris, so, we dropped RES usage by 20GB per node with recent deployments
[13:48:00] I've been using RES to think about our real memory usage. Is that fair?
[13:48:14] In 20GB, we can fit 18 more workers
[13:48:36] So that would bring our original 24 workers up to 42 workers
[13:48:57] And our RES would be a little less than it was before the recent deployment
[13:49:03] yeah but let's not aim for filling 20Gs
[13:49:12] akosiaris, +1
[13:49:23] Just looking to use the same amount of resources we were using before
[13:50:23] if we follow that, it's about 10Gs we should fill
[13:51:03] 2 weeks ago, total memory usage was around 25GB per box, now it's 20GB per box
[13:53:13] akosiaris, seems something else is using more memory
[13:53:16] It's not us
[13:53:25] I'd like to recover what we were using before
[13:53:40] Could it be that less stuff is getting swapped out due to the amount of available memory?
[13:53:49] swap is zero
[13:53:58] Was it zero before?
[13:54:01] yes
[13:54:09] Well...
[13:54:18] So we lost some available memory in our changes?
[13:54:19] the goal is to almost always be at zero swap
[13:54:32] Still you see why this doesn't make sense, right?
[13:54:51] haven't seen the numbers so ... nope
[13:55:10] I mean the ORES memory usage numbers after the patch
[13:55:17] I see some 1G RES ?
[13:55:20] per worker that is
[13:55:35] Yeah. Looking for my notes.
[13:55:42] I thought I put them on a phab task and pinged you.
[13:55:45] Maybe not
[13:55:58] you mean the estimation one ?
[13:56:03] Nope
[13:57:06] Hmmm... I guess maybe it didn't make it to phab :(
[13:57:09] * halfak digs more
[13:58:17] Arg! Looks like this lives in chat :(
[13:58:22] To phab!
[14:08:12] Revision-Scoring-As-A-Service: Increase celery workers to 40 per scb node - https://phabricator.wikimedia.org/T143105#2557161 (Halfak)
[14:08:15] akosiaris, https://phabricator.wikimedia.org/T143105
[14:08:18] ^ numbers
[14:10:34] halfak: 2 "currently" lines ?
[14:11:02] Yes. That was for uwsgi and celery as of 8/09
[14:11:17] ah uwsgi vs celery
[14:11:21] I just saw that
[14:11:35] But note that we've bumped up the number of workers to 32 right now.
[14:11:42] I'd like to bump it even higher to 40
[14:12:02] 27.6 + 45.5 = 73.1GB of RES
[14:12:07] heh, so that can not be correct
[14:12:15] the 2 boxes have 64 in total
[14:12:24] Yes. I'm talking per node
[14:13:39] So, I'd like to use up 8.8GB more RES per node than we are using right now.
[14:14:25] In total, this would use up 17.7GB of the 19.2GB that we free'd up.
[14:17:33] ok, sounds sane
[14:20:49] heh, are you by any chance mapping a 1Gb file somewhere ?
[14:21:06] ^ not sure what you mean
[14:21:13] But none of our models are that big
[14:21:33] * halfak imagines memory-mapping
[14:22:03] memory mapping... hmm that actually may take a while to explain
[14:22:15] so there the mmap() system call
[14:22:20] is*
[14:22:44] it create creates a new mapping in the VIRT space of the process
[14:23:00] it is possible (and is being used a LOT) to map a file to memory
[14:23:24] akosiaris, gotcha. yeah, we don't do that manually, but theoretically, we'd like to do that for all of the worker processes so that they could share memory for each of the models
[14:23:30] Right now, it doesn't look like that is happening.
[14:23:44] akosiaris, is it constantly happening or happening a lot around our restarts?
[14:23:56] Oh! It would happen each time a worker gets restarted
[14:24:03] which happens once every 100 tasks or so
[14:24:37] I'm not sure how celery forks work, but I imagine it really is reloading the "application" which would result in a lot of memory activity.
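Editorial note: the worker arithmetic quoted between 14:11 and 14:14 can be checked with a small sketch. The per-worker figure is an assumption, inferred from "8.8GB more RES per node" for the bump from 32 to 40 workers; the log never states it directly, and the function names are hypothetical.

```python
# Assumption (not stated in the log): RES per celery worker, inferred from
# "8.8GB more RES per node" for going from 32 to 40 workers.
RES_PER_WORKER_GB = 8.8 / (40 - 32)


def extra_res_gb(current_workers, target_workers, per_worker_gb=RES_PER_WORKER_GB):
    """Additional resident memory per node needed for the extra workers."""
    return (target_workers - current_workers) * per_worker_gb


def workers_that_fit(free_gb, per_worker_gb=RES_PER_WORKER_GB):
    """Whole number of workers that fit into the given amount of free memory."""
    return int(free_gb // per_worker_gb)


if __name__ == "__main__":
    print(round(extra_res_gb(32, 40), 1))  # bump discussed at 14:13
    print(round(extra_res_gb(24, 40), 1))  # relative to the original 24 workers
    print(workers_that_fit(20))            # "In 20GB, we can fit 18 more workers"
```

With this inferred figure the 24 to 40 case comes out slightly below the 17.7GB quoted in the chat, which suggests the original estimate used a per-worker value a little above 1.1GB.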
[14:24:51] actually, they already shared memory
[14:24:53] Essentially, we'd load ~1GB every restart
[14:24:54] share*
[14:25:01] 1GB RES
[14:25:07] I think it's more like 1.7 VIRT
[14:25:23] * halfak doesn't look at VIRT all that carefully
[14:25:38] VIRT is just the memory the process has requested
[14:25:44] it does not mean it uses it
[14:25:48] it's mostly useless
[14:25:49] Yeah. Including shared stuff
[14:25:56] at least these days
[14:26:37] that part is heavily shared. And if you are reading memory models read-only (which I assume you do)
[14:26:54] sharing happens more or less automatically
[14:27:11] We rely on cpython to worry about that for us for the most part
[14:27:26] I haven't done much research into how we might be able to work with cpython on that
[14:27:38] you probably don't want to
[14:28:20] actually the kernel does most of the work btw
[14:28:45] for example when you fork() a process from another, the new process will not consume memory
[14:29:00] it will display VIRT and RES but will not really be eating up any memory
[14:29:05] until it starts using it
[14:29:23] the kernel uses a COW (copy on write) approach
[14:29:49] akosiaris, gotcha.
[14:29:53] until a child process requests to write at its memory, it will just use the memory the parent has
[14:30:03] That's what I figured
[14:30:35] so, IIRC from celery forks model, it's all done by a central process
[14:30:46] where I assume you do all the initialization, right ?
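Editorial note: the fork() and copy-on-write behaviour described above can be illustrated with a minimal sketch, assuming a Unix-like system where `os.fork` is available. The list stands in for a model loaded once in the main process, and `sum_in_child` is a hypothetical helper; this is not how celery or ORES manage their workers.

```python
import os
import pickle


def sum_in_child(data):
    """Fork, let the child read `data` inherited from the parent, return its result."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: it sees the parent's pages; the kernel copies a page
        # only when one of the two processes writes to it.
        status = 1
        try:
            os.close(read_fd)
            with os.fdopen(write_fd, "wb") as out:
                pickle.dump(sum(data), out)
            status = 0
        finally:
            os._exit(status)  # never fall through into the parent's code path
    # Parent: read the child's answer, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as inp:
        result = pickle.load(inp)
    os.waitpid(pid, 0)
    return result


if __name__ == "__main__":
    model = list(range(1_000_000))  # stand-in for data loaded before forking
    print(sum_in_child(model) == sum(model))
```

One caveat: CPython updates reference counts even on objects it only reads, which writes to the pages holding them, so pages shared after a fork can gradually become private copies.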
[14:32:04] that's the Main PID that systemd display btw
[14:33:42] so, that anon map I asked about earlier, is probably the sum of all the models
[14:33:44] akosiaris, +1
[14:33:58] akosiaris, seems like it year
[14:34:00] *yeah
[14:35:01] so, it is shared among the processes
[14:35:09] I seem the same memory address mapped
[14:35:12] I see*
[14:35:38] I wonder why it is not counted in shared though
[14:36:04] ah, it's read/write
[14:36:31] (that's btw irrelevant to the COW thing I talked about earlier)
[14:36:47] We don't actually write to a very large chunk of memory.
[14:36:55] It really just sits there for us to use in computations
[14:37:05] Maybe cpython doesn't know that in the ways we'd like
[14:37:59] ah, it doesn't matter for most stuff... just trying to get the exact memory usage of a worker
[14:38:48] RES is fine most of the times, but with those huge models that statistic is more polluted than usually
[14:38:52] hmmm
[14:39:12] I suppose the model loading code is in the main ores repo, right ? not some submodule
[14:39:54] akosiaris, it's in revscoring
[14:39:56] * halfak gets
[14:40:26] See https://github.com/wiki-ai/revscoring/blob/master/revscoring/scorer_models/scorer_model.py#L65
[14:40:29] load() and dump()
[14:40:48] oh, it's just pickled data ?
[14:40:52] They are pretty simple. They just use cpickle.load and cpickle.dump
[14:40:54] Yup
[15:02:12] halfak: when you have time: can you tell how many tagging is needed to be done in es.wiki? And how many are already done
[15:03:01] Oscar_, here's a way you can check: http://labels.wmflabs.org/campaigns/eswiki/?campaigns=stats
[15:03:15] It looks like 812/8434 have been completed.
[15:03:25] 8434 seems like too many.
[15:03:34] I want to review that today. I might be able to auto-label some of those.
[15:04:42] Oscar_, have you noticed any trends in the non-damaging edits you think we could pick up on?
[15:05:42] We have a few volunteers interested after a message that Johan left in our village pump, but yes, still a long way to go
[15:06:44] Thanks for checking in on this and doing the work. I'm excited to be able to deploy the damaging/goodfaith models in eswiki.
[15:06:48] 8434 are a lot :O
[15:06:53] That will allow us to deploy the ORES review tool as well.
[15:06:57] So it'll be worth it.
[15:11:06] halfak: I don't know if this is a trend, but in many occasions we see editions changing date of births or things like that without references, how do you deal with this cases? non-damaging and good faith? or the other way around?
[15:11:55] Oscar_, it would be helpful if you can make a judgement based on what you would do about it. E.g. some changes might be good and others bad.
[15:11:58] Would you revert it?
[15:12:28] Generally, "damaging" would be anything you'd revert or insist on fixing immediately.
[15:12:30] halfak: btw. We are deploying ores review tool in plwiki atm
[15:12:49] Oscar_: tools.wmflabs.org/dexbot/tools/wikilabels_stats.php I built this which gets updated daily
[15:12:54] Thanks Amir1. Anything you need from me? Otherwise, I'll be standing by.
[15:13:19] halfak: nah, Don't worry. We should have a sync meeting today
[15:13:43] Can we? please! A lot in backlog of done column
[15:13:59] +1 Amir1. How about right after the deploy?
[15:14:19] +1
[15:14:22] halfak: if I don't know the actual date of birth in that case, I just let pass the edit, non-damaging and good faith
[15:14:56] Oscar_, that'
[15:15:00] s probably fine.
[15:15:17] Mostly, I want the prediction models to replicate your judgement as closely as possible
[15:19:42] Ok then :)
[15:19:50] superb Amir1, I'm bookmarking that!
[15:21:06] :)
[15:29:50] halfak: It's live in plwiki now
[15:30:01] \o/
[15:30:05] TarLocesilion: We deployed ores review tool in plwiki
[15:30:17] * halfak got beat to the ping
[15:30:29] We will send an announcement soon
[15:31:10] yess, yess! splendid, wonderful! I was just about to ask :)
[15:31:42] halfak: I'll be afk for a minute and then we can have the meeting
[15:31:48] is it okay?
[15:31:53] Amir1, sounds good
[15:38:10] halfak: should I join revscoring?
[15:38:19] Give me 1 min
[15:38:26] kk
[15:43:49] Amir1, OK ready.
[15:43:58] which channel?
[15:44:04] revscoring?
[15:44:05] https://hangouts.google.com/hangouts/_/wikimedia.org/revscoring?authuser=0
[16:40:26] Revision-Scoring-As-A-Service-Backlog, Wikilabels, User-Ladsgroup: Wikilabels UI reports non-200 status errors badly - https://phabricator.wikimedia.org/T138255#2557601 (Halfak) a:schana>Ladsgroup
[16:41:08] Revision-Scoring-As-A-Service-Backlog, rsaas-editquality, User-Ladsgroup: Include specific user groups in the trwiki edit quality model - https://phabricator.wikimedia.org/T140474#2557610 (Halfak) a:schana>Ladsgroup
[16:42:22] Revision-Scoring-As-A-Service-Backlog, rsaas-articlequality, User-Ladsgroup: [Spike] NLP for article quality models. - https://phabricator.wikimedia.org/T132533#2557613 (Halfak) a:schana>Ladsgroup
[16:45:08] Revision-Scoring-As-A-Service-Backlog: Semi-supervised machine learning - https://phabricator.wikimedia.org/T143123#2557618 (Halfak)
[16:48:21] Revision-Scoring-As-A-Service-Backlog, Spike: [Spike] Semi-supervised machine learning - https://phabricator.wikimedia.org/T143123#2557636 (Halfak)
[16:51:25] akosiaris, could you say something positive in https://phabricator.wikimedia.org/T143105 ?
[16:59:29] Revision-Scoring-As-A-Service, rsaas-editquality: Update editquality models with new version of revscoring - https://phabricator.wikimedia.org/T143125#2557664 (Halfak)
[17:02:19] * halfak --> lunch
[17:04:08] ok guys, one simple question: a "damaging edit" means a "vandalism", or "any kind of edit that needs to be curated"?
[17:06:33] can the letter "r" for "Single letter for tagging possibly damaging recent changes" be translated as "v" ("vandalism") or rather "p" ("problem")? "r" out of nowhere seems to mean nothing
[17:09:43] TarLocesilion: All edits that need to be review
[17:09:45] *revied
[17:09:49] *reviewed
[17:10:01] r stands for review
[17:10:27] thx, okey, so we've got a broad definition.
[18:06:32] halfak|Lunch: For when you're back: https://etherpad.wikimedia.org/p/ores_weekly_update
[18:06:45] review it and then I will add links :)
[18:19:10] Amir1, edits complete
[18:19:12] Post when ready!
[18:19:35] Awesome
[18:31:54] halfak: sent
[18:32:01] SPAM IS COMING
[18:32:06] \o/ Thanks Amir1
[18:32:29] Revision-Scoring-As-A-Service, ORES, Patch-For-Review, User-Ladsgroup: Make scb1002.eqiad.wmnet the canary node for ORES - https://phabricator.wikimedia.org/T142630#2558115 (Ladsgroup) Open>Resolved
[18:32:33] Revision-Scoring-As-A-Service, MediaWiki-extensions-ORES, Puppet, User-Ladsgroup: Move vagrant role to use ores in production - https://phabricator.wikimedia.org/T142618#2558116 (Ladsgroup) Open>Resolved
[18:32:34] Revision-Scoring-As-A-Service, ORES, Patch-For-Review, User-Ladsgroup: Increase web and worker processes in production - https://phabricator.wikimedia.org/T142361#2558117 (Ladsgroup) Open>Resolved
[18:32:37] Revision-Scoring-As-A-Service, ORES, Puppet, User-Ladsgroup: Puppet config changes for ORES refactor - https://phabricator.wikimedia.org/T141575#2558118 (Ladsgroup)
[18:32:40] Revision-Scoring-As-A-Service, ORES, Puppet, User-Ladsgroup: Puppet config changes for ORES refactor - https://phabricator.wikimedia.org/T141575#2503579 (Ladsgroup)
[18:32:43] Revision-Scoring-As-A-Service, ORES, Puppet, User-Ladsgroup: Change CP to do several models at once. - https://phabricator.wikimedia.org/T142360#2558119 (Ladsgroup) Open>Resolved
[18:32:45] Revision-Scoring-As-A-Service, MediaWiki-extensions-ORES, User-Ladsgroup, WMF-deploy-2016-08-16_(1.28.0-wmf.15): ORES extension jobs should just fail when scoring is errored not to throw exception - https://phabricator.wikimedia.org/T141978#2558121 (Ladsgroup) Open>Resolved
[18:32:48] Revision-Scoring-As-A-Service, rsaas-articlequality: Migrate wp10 models to gradient boosting. - https://phabricator.wikimedia.org/T141603#2558122 (Ladsgroup) Open>Resolved
[18:32:50] Revision-Scoring-As-A-Service, ORES, Patch-For-Review, User-Ladsgroup: Enable statsd uwsgi settings - https://phabricator.wikimedia.org/T141543#2558123 (Ladsgroup) Open>Resolved
[18:32:52] Revision-Scoring-As-A-Service, MediaWiki-extensions-ORES, ORES, Epic, Research-and-Data-2017-Q1: [Epic] Deploy ORES review tool - https://phabricator.wikimedia.org/T140002#2558125 (Ladsgroup)
[18:32:55] Revision-Scoring-As-A-Service, revscoring: Tamil language utilities - https://phabricator.wikimedia.org/T134105#2558126 (Ladsgroup) Open>Resolved
[18:32:57] Revision-Scoring-As-A-Service, MediaWiki-extensions-ORES, Wikimedia-Site-requests, Beta-Feature, and 2 others: Deploy ORES review tool in Polish Wikipedia - https://phabricator.wikimedia.org/T140005#2558124 (Ladsgroup) Open>Resolved
[18:33:00] Revision-Scoring-As-A-Service, MediaWiki-extensions-ORES, User-Ladsgroup, User-notice: Integrate ORES extension with Special:Contributions - https://phabricator.wikimedia.org/T132371#2558127 (Ladsgroup) Open>Resolved
[18:33:04] Revision-Scoring-As-A-Service, rsaas-editquality, Epic: [Epic] Edit quality models (damaging/goodfaith) - https://phabricator.wikimedia.org/T130213#2558130 (Ladsgroup)
[18:33:06] Revision-Scoring-As-A-Service, ORES: Add graphite logging to precached - https://phabricator.wikimedia.org/T119341#2558128 (Ladsgroup) Open>Resolved
[18:33:08] Revision-Scoring-As-A-Service, rsaas-editquality, User-Ladsgroup: Deploy edit quality models for plwiki - https://phabricator.wikimedia.org/T130292#2558129 (Ladsgroup) Open>Resolved
[19:36:39] Revision-Scoring-As-A-Service, Bad-Words-Detection-System, User-Ladsgroup: Generate bad words for all languages more than 100K articles - https://phabricator.wikimedia.org/T134629#2558356 (Quiddity)