[00:00:22] Thank you!
[00:00:36] halfak: I have one open PR in editquality, please take a look
[00:01:00] I see two
[00:01:26] Oh I see a question in one.
[00:05:00] Added comments to both.
[00:17:11] Amir1, I'm out of here for the day, but ping me if you want me to take another look in my AM :)
[00:17:36] halfak: AM?
[00:17:49] ante meridiem
[00:17:53] aka morning :)
[00:18:02] AM/PM
[00:22:21] halfak: nah
[00:22:33] halfak: enjoy your day, will talk more tomorrow
[00:22:52] kk
[00:22:54] o/
[10:31:16] o/o/
[11:03:49] 10Scoring-platform-team, 10ORES, 10editquality-modeling, 10User-Sebastian_Berlin-WMSE, and 2 others: Check ORES feedback for possible bugs - https://phabricator.wikimedia.org/T188896#4031513 (10Lokal_Profil) Many thanks! I've [[ https://sv.wikipedia.org/w/index.php?title=Wikipedia%3ABybrunnen&action=histor...
[11:54:14] (03PS1) 10Ladsgroup: Cleanups and small fixes [extensions/JADE] - 10https://gerrit.wikimedia.org/r/416926
[12:03:12] (03PS1) 10Ladsgroup: Change default config of ores models to use the new system [extensions/ORES] - 10https://gerrit.wikimedia.org/r/416928
[14:48:38] o/
[14:50:31] o/
[14:52:05] halfak: you should take a look at this: https://gerrit.wikimedia.org/r/#/c/416004/
[14:53:26] So, it's structured data about the package itself?
[14:54:55] Amir1, ^
[14:55:02] I'm not sure what I'm reviewing really.
[14:56:05] Is there an SPDX schema that we should be referencing in these files?
[15:01:58] halfak: the problem is that several licenses have been mentioned as the license of the package
[15:02:09] the SPDX thing just unearthed it
[15:02:17] Staff meeting
[15:02:25] on it
[16:08:28] halfak: i take back my words, with no computation and only vectors loaded, this is the memory usage - https://gist.github.com/codez266/0ba6212378477619582daaf5622a21fb
[16:08:44] buggy.py is the script that's doing multiprocessing
[16:08:56] and RAM usage reaches 8GB which is outrageous
[16:19:19] Looks like the vectors are not loaded into shared memory somehow.
[16:19:35] It's really important that the vectors are fully loaded before we start forking the process.
[16:19:49] Can you try using the vectors to match a word just before starting the imap?
[16:41:16] codezee, ^
[16:41:22] halfak: forking happens only after loading the vectors, it's a sequential operation and logs show that
[16:41:47] I understand that the vectors are *supposed* to be loaded, but it might be a lazy operation.
[16:41:57] Can you confirm that they are loaded by using them before forking?
[16:42:38] but i'm assuming gensim's "loading" and "loaded" log messages should be accurate?
[16:42:45] ok i'll use them
[16:44:38] hmm. the logs should be believable, but it is certainly worth a shot.
[16:45:30] yeah i can do that
[16:45:39] clearly it seems that it's replicating
[16:45:53] i printed the length of the vector for 'help'
[16:47:33] Gross. My IP is blackholed for Freenode
[16:49:37] codezee, if it is replicating for imap, it's gonna replicate in celery.
[16:49:44] If it does that, we're mega-hosed.
[16:49:51] This is a show-stopper.
[16:49:54] halfak: i think i have a solution, ugly but it works
[16:50:08] let me check it one or two times
[16:51:51] halfak: yeah it does work!!!
[16:53:10] 150k vectors across 8 threads working at the original pace...
[16:53:19] it's extracting very fast now
[16:54:23] i'll have my features now in 3 hrs or so...
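A minimal sketch of the check being suggested here, assuming the GoogleNews vectors loaded through gensim with a 150k-word limit; the path, the pool size, and the extract() worker are hypothetical stand-ins for the actual buggy.py. The idea is just to touch the vectors in the parent process so that any lazy loading completes before the pool forks:

```python
# Hypothetical sketch (not the real buggy.py): force the vectors to be fully
# loaded in the parent before multiprocessing forks, to rule out lazy loading.
from multiprocessing import Pool

from gensim.models import KeyedVectors

# Hypothetical path; limit=150000 matches the ~150k vectors mentioned above.
kv = KeyedVectors.load_word2vec_format(
    "word2vec/GoogleNews-vectors-negative300.bin", binary=True, limit=150000)

# Touch the vectors in the parent; if loading were lazy, this completes it
# before fork(), so child processes can share the pages copy-on-write.
print("'help' vector length:", len(kv["help"]))


def extract(word):
    # Worker: looks up the module-level vectors instead of reloading them.
    return word, float(kv[word].mean())


if __name__ == "__main__":
    words = ["help", "article", "quality"]
    with Pool(8) as pool:
        for word, mean in pool.imap(extract, words):
            print(word, round(mean, 4))
```

If per-worker memory still balloons with a layout like this, lazy loading is not the culprit and the copies are coming from somewhere else, such as the vectors being pickled into each task.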
:D
[16:55:20] the solution is simple - make the vectors global, i have a feeling that global data persists across threads too
[16:55:33] remember my makeshift script for word vectors in drafttopic? - https://github.com/wiki-ai/drafttopic/blob/extract-from-text/drafttopic/feature_lists/w2v.py
[16:55:49] it specifies word2vec as global and assigns only once.
[16:56:07] a crude optimization i had made in the flow of the moment, not realizing its potential up until now
[16:56:44] i did the same thing in vectorizers.py and there it goes... we can even load more vectors now.
[16:56:52] halfak: Do you know if anyone has estimated the amount of paid editing happening? If not, that's an exciting result we'll get from our future model.
[16:57:00] lots of memory to spare
[16:57:13] awight: which future model?
[16:57:32] The paid promotional editing estimator
[16:57:42] *classifier
[16:57:47] :o :o
[16:57:47] codezee, can you share the code for that?
[16:58:07] awight, I've been working with doc james et al. on a paid editor dataset.
[16:58:13] Yassss
[17:00:07] halfak: https://gist.github.com/codez266/bde0d2384ef1cda0e105b8f59d25524a
[17:45:02] codezee, I wonder if the w2v is being serialized every time a process is spawned.
[17:45:14] Maybe that's the problem and the global issue isn't a big deal.
[17:45:23] * halfak works on the original script.
[17:45:42] Actually, could you share your test script with me?
[17:49:58] OK I take it back. I think the global is important because of the way that serialization works.
[17:51:14] We use pickle to pass features around (including in multiprocessing/celery) and pickle will serialize anything that isn't available via a module (e.g. importable)
[17:51:33] Any variable that isn't in global scope (called global or not) is not importable.
[17:52:40] So I think that we should be able to train and extract features efficiently so long as we load the keyed_vectors in our feature_list/enwiki.py file and assign it to a variable before passing it to word2vec()
[17:55:16] codezee, looking at https://github.com/wiki-ai/drafttopic/pull/15/files#diff-67ff4efd33a9e593d678e0795913200bR24 it looks like you're loading the vectors every time the process() method runs.
[17:55:21] O_O
[17:55:38] That seems slow and extremely memory intensive.
[17:58:28] halfak: i'm not loading word2vec every time process runs
[17:58:44] i'm caching it, so once it's loaded, i'm returning the cached version
[18:00:11] Oh I see the global. I'm confused why you'd do that at all. I'm working on a suggested new feature_lists file.
[18:02:37] Amir1: you want to do SoS today?
[18:03:13] yeah, that was something very improvised at the time, for doing experiments, hadn't given much thought to global then, just the thought of not loading it again and again :D
[18:03:33] awight: yeah, what should I say?
[18:03:53] We have 3 points in the staff doc
[18:05:47] back soon
[18:06:35] halfak: do you mean assigning the vectors to a variable in the script allows it to not serialize?
[18:08:05] https://gist.github.com/halfak/ce0a19c7f81fcaff5e8c9f75a5df5dcb
[18:08:08] codezee, ^
[18:08:50] E.g. you could import the kvs by doing something like "from feature_lists.enwiki import google_news_kv"
[18:10:59] halfak: my featurelist file contained exactly this
[18:11:10] i don't think this works
[18:12:15] why would we import google_news_kv if it's being used in the same script?
[18:12:29] We wouldn't but we could
[18:12:47] Option 2 is to make a file that just imports the kv and makes it available so that we explicitly import it.
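A sketch of the layout being described above, splitting vectors and features into separate files (Option 2). The feature_lists/enwiki.py path and the google_news_kv name come from the discussion; the word_vectors module, the file path, and the mean_vector() helper are hypothetical, standing in for drafttopic's real word2vec() feature code. The vectors are loaded once at import time and exposed as an importable module-level name, so fork-based workers (multiprocessing or celery prefork) reference that single copy instead of loading or receiving their own:

```python
# word_vectors/enwiki.py (hypothetical module that only loads the vectors)
from gensim.models import KeyedVectors

# Loaded exactly once, at import time, in the parent process.
google_news_kv = KeyedVectors.load_word2vec_format(
    "word2vec/GoogleNews-vectors-negative300.bin",  # hypothetical path
    binary=True, limit=150000)
```

```python
# feature_lists/enwiki.py (sketch): explicitly import the module-level vectors.
import numpy

from word_vectors.enwiki import google_news_kv


def mean_vector(words):
    """Toy stand-in for a word2vec-based feature: average the known word vectors."""
    vectors = [google_news_kv[w] for w in words if w in google_news_kv]
    if not vectors:
        return numpy.zeros(google_news_kv.vector_size)
    return numpy.mean(vectors, axis=0)
```

Keeping the heavy object at module scope means the per-task pickle payload never has to carry the vectors themselves, and under a fork start method the loaded pages are shared copy-on-write, which matches the "lots of memory to spare" result reported above.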
[18:13:06] yeah i was going to say that, split vectors and features
[18:13:08] * halfak edits the gist
[18:14:01] but isn't it so much more motivating when we know there's *some* solution at least :D compared to some time back when nothing seemed in sight
[18:14:02] https://gist.github.com/halfak/ce0a19c7f81fcaff5e8c9f75a5df5dcb
[18:14:05] ^ updated
[18:14:15] +1 :)))
[18:15:39] i'll try experimenting with this one, if it works it's better than global at least
[18:15:59] +1
[18:16:37] This is great because it will help me work out some of the issues with using grammars too.
[18:17:20] data extraction with the fast script is about to complete, so we're back at the same pace ;)
[18:25:45] Cool. Can you do a quick test with the multi-file strategy then too?
[18:25:55] * halfak looks at wikigrammar stuff again
[18:28:40] yeah, i'll look but need to go now
[18:29:01] OK sounds good. Thanks for working through this codezee :)
[18:29:02] o/
[18:29:13] np :)
[18:29:18] o/
[18:34:15] awight: let me know when you're back
[18:34:23] o/
[18:34:58] Amir1: Anything I can do?
[18:35:11] yeah
[18:35:17] let's start with this: https://gerrit.wikimedia.org/r/#/c/416004/
[18:35:21] :)
[18:36:15] (03CR) 10Awight: Use SPDX 3.0 license identifier (031 comment) [extensions/JADE] - 10https://gerrit.wikimedia.org/r/416004 (https://phabricator.wikimedia.org/T183858) (owner: 10Legoktm)
[18:36:17] awight: you're the main author, which license do you prefer?
[18:36:29] cool
[18:36:38] I'm a bit ignorant about it, but yeah, see my comment
[18:38:04] (03PS2) 10Legoktm: Use SPDX 3.0 license identifier [extensions/JADE] - 10https://gerrit.wikimedia.org/r/416004 (https://phabricator.wikimedia.org/T183858)
[18:38:04] (03CR) 10Ladsgroup: [C: 04-2] "Until Sam Reed approves, he has a patch in this extension." [extensions/JADE] - 10https://gerrit.wikimedia.org/r/416004 (https://phabricator.wikimedia.org/T183858) (owner: 10Legoktm)
[18:39:11] (03CR) 10Awight: [C: 031] "Approved, waiting on @Reedy" [extensions/JADE] - 10https://gerrit.wikimedia.org/r/416004 (https://phabricator.wikimedia.org/T183858) (owner: 10Legoktm)
[18:40:16] well
[18:40:37] you don't really need his approval since GPL 3 or later is compatible with GPL 2 or later
[18:42:04] very valid point, ignore my comment
[18:46:12] Yeah, I'm seeing that I made the two mistakes in the license tags, so I feel ok about CR+2
[18:46:52] (03CR) 10Awight: [C: 032] "On a second review, we're fine. I'm the author of both mistakes in the license tags." [extensions/JADE] - 10https://gerrit.wikimedia.org/r/416004 (https://phabricator.wikimedia.org/T183858) (owner: 10Legoktm)
[18:49:59] shoot, I missed SoS
[18:50:12] I thought it started at 8 but it started at 7:30
[18:50:22] and it's 7:50 now
[18:58:14] (03Merged) 10jenkins-bot: Use SPDX 3.0 license identifier [extensions/JADE] - 10https://gerrit.wikimedia.org/r/416004 (https://phabricator.wikimedia.org/T183858) (owner: 10Legoktm)
[18:58:18] :)
[19:00:30] (03CR) 10jenkins-bot: Use SPDX 3.0 license identifier [extensions/JADE] - 10https://gerrit.wikimedia.org/r/416004 (https://phabricator.wikimedia.org/T183858) (owner: 10Legoktm)
[19:48:20] awight: hey, please take a look at https://gerrit.wikimedia.org/r/#/c/416928/ and https://gerrit.wikimedia.org/r/#/c/416926/
[20:01:57] (03CR) 10Awight: [C: 032] Change default config of ores models to use the new system [extensions/ORES] - 10https://gerrit.wikimedia.org/r/416928 (owner: 10Ladsgroup)
[20:07:56] Amir1: Are you sure we can drop the namespace prefix in docstrings?
[20:08:19] awight: PhpStorm understands it
[20:08:49] awight: your patch should be live in beta really soon
[20:08:55] please take a look
[20:09:08] I just noticed, thanks!
[20:10:54] (03Merged) 10jenkins-bot: Change default config of ores models to use the new system [extensions/ORES] - 10https://gerrit.wikimedia.org/r/416928 (owner: 10Ladsgroup)
[20:11:00] I'll go lie down and get some rest for now
[20:11:56] (03CR) 10jenkins-bot: Change default config of ores models to use the new system [extensions/ORES] - 10https://gerrit.wikimedia.org/r/416928 (owner: 10Ladsgroup)
[20:13:54] Cool. I found some other examples of Doxygen correctly interpreting imported classes without the namespace.
[20:14:59] (03CR) 10Awight: [C: 04-1] "Very nice! Waiting to hear your opinion about the NS_JADE constant." (031 comment) [extensions/JADE] - 10https://gerrit.wikimedia.org/r/416926 (owner: 10Ladsgroup)
[20:16:01] Amir1: fyi, the config change isn't looking good.
[20:17:35] (03PS2) 10Ladsgroup: Cleanups and small fixes [extensions/JADE] - 10https://gerrit.wikimedia.org/r/416926
[20:17:38] (03CR) 10Ladsgroup: Cleanups and small fixes (031 comment) [extensions/JADE] - 10https://gerrit.wikimedia.org/r/416926 (owner: 10Ladsgroup)
[20:18:26] awight: oh, do you want to make a follow-up?
[20:18:47] All I know is that it's broken. Wasn't planning to debug right now.
[20:19:14] (03CR) 10Awight: [C: 032] Cleanups and small fixes [extensions/JADE] - 10https://gerrit.wikimedia.org/r/416926 (owner: 10Ladsgroup)
[20:21:55] (03Merged) 10jenkins-bot: Cleanups and small fixes [extensions/JADE] - 10https://gerrit.wikimedia.org/r/416926 (owner: 10Ladsgroup)
[20:22:26] (03CR) 10jenkins-bot: Cleanups and small fixes [extensions/JADE] - 10https://gerrit.wikimedia.org/r/416926 (owner: 10Ladsgroup)
[20:53:32] o/
[20:53:39] Was in meetings all day.
[20:53:57] Taking my lunch break now.
[20:54:35] I think I'm going to do a wikilabels deploy when I'm back
[20:54:44] Then I'll kick off the pilot for fawiki article quality
[20:59:56] 10Scoring-platform-team (Current), 10Packaging, 10Patch-For-Review: Package word2vec binaries - https://phabricator.wikimedia.org/T188446#4033364 (10mmodell) D1000 bumps scap version to 3.7.7 and adds git-lfs support ...we still need git-lfs packages on all relevant servers
[22:31:00] I think I'm on target for panicking about this paper.
[22:50:57] In meetings that I forgot about :|
[23:01:22] awight, panic?
[23:01:29] Anything you want from me this evening?
[23:29:15] 10Scoring-platform-team (Current), 10articlequality-modeling, 10User-Ladsgroup, 10artificial-intelligence: Article quality campaign for Persian Wikipedia - https://phabricator.wikimedia.org/T174684#4033850 (10Halfak) The pilot is alive! http://labels.wmflabs.org/ui/fawiki/ @Ladsgroup, any issues you can...
[23:29:51] 10Scoring-platform-team (Current), 10MediaWiki-extensions-ORES, 10MW-1.31-release-notes (WMF-deploy-2018-02-06 (1.31.0-wmf.20)), 10Patch-For-Review, 10User-Ladsgroup: Clean up ScoreLookup implementations - https://phabricator.wikimedia.org/T185534#3918856 (10Halfak) Did this get deployed?
[23:33:04] fawiki article quality stuff is live! :)