[08:31:39] Revision-Scoring-As-A-Service, rsaas-articlequality, Research-and-Data-2017-Q1, User-Ladsgroup: Generate recent article quality scores for English Wikipedia - https://phabricator.wikimedia.org/T135684#2622474 (Ladsgroup) Okay, the generation is done, I'm looking for a place to put the dump.
[11:40:47] Revision-Scoring-As-A-Service, rsaas-articlequality, Research-and-Data-2017-Q1, User-Ladsgroup: Generate recent article quality scores for English Wikipedia - https://phabricator.wikimedia.org/T135684#2306938 (Ladsgroup) https://datasets.wikimedia.org/public-datasets/enwiki/article_quality/wp10-...
[13:59:53] o/
[14:00:01] I've got to run an errand right away this morning.
[14:00:08] So I'm not here for long.
[14:02:40] ok halfak
[14:03:01] i think by tomorrow i can have the graph plotted
[14:04:03] \o/ great news sabya_
[14:05:21] btw, what do you guys use for seeing the archived IRC logs? from browser, saved locally or something else?
[14:07:46] sabya_, there's a link to the archive in the "/topic"
[14:07:51] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-ai/
[14:08:13] sabya_, I use that if I need to see something that's too old for my IRC client's history
[14:16:04] halfak: around?
[14:19:49] Amir1, just about to run away
[14:19:56] Gotta get the doggie to the vet
[14:20:07] halfak: https://phabricator.wikimedia.org/T135684#2622793
[14:20:19] I just wanted to let you know about this
[14:20:22] have fun
[14:20:27] Saw it! :)
[14:20:29] Great work!
[14:20:39] I've already downloaded it and started reviewing it.
[14:20:44] OK now I'm running away
[14:20:45] o/
[14:20:51] (Back in ~ an hour)
[14:21:35] o/
[15:30:39] back!
[15:31:10] Revision-Scoring-As-A-Service, rsaas-articlequality, Research-and-Data-2017-Q1, User-Ladsgroup: Generate recent article quality scores for English Wikipedia - https://phabricator.wikimedia.org/T135684#2623217 (Halfak) Looks great -- except that it still has the integer mapping of the predicted c...
[15:38:15] Revision-Scoring-As-A-Service, rsaas-articlequality, Research-and-Data-2017-Q1, User-Ladsgroup: Generate recent article quality scores for English Wikipedia - https://phabricator.wikimedia.org/T135684#2623228 (Ladsgroup) I explained in the PR that I did it because of storage and performance for...
[15:43:28] Revision-Scoring-As-A-Service, rsaas-articlequality, Research-and-Data-2017-Q1, User-Ladsgroup: Generate recent article quality scores for English Wikipedia - https://phabricator.wikimedia.org/T135684#2623242 (Halfak) It looks like there is a problem. There's only 664k lines in the file and the...
[15:43:34] Amir1, https://phabricator.wikimedia.org/T135684#2623242
[15:44:11] let me check
[15:44:26] I guess somewhat something made into the file
[15:47:25] Revision-Scoring-As-A-Service, rsaas-articlequality, Research-and-Data-2017-Q1, User-Ladsgroup: Generate recent article quality scores for English Wikipedia - https://phabricator.wikimedia.org/T135684#2623257 (Halfak) Re. the integer mapping, that's a fine point, but (1) that breaks the spec of...
[16:03:05] Revision-Scoring-As-A-Service, rsaas-articlequality, Research-and-Data-2017-Q1, User-Ladsgroup: Generate recent article quality scores for English Wikipedia - https://phabricator.wikimedia.org/T135684#2623312 (Ladsgroup) >>! In T135684#2623242, @Halfak wrote: > It looks like there is a problem....
[16:03:10] Amir1, did this run single-threaded?
[16:03:18] the article quality scoring script
[16:03:30] I don't think so
[16:03:55] halfak: It's fixed now. I'm uploading it
[16:03:57] How many dump files were you working from?
[16:04:20] one dump file
[16:04:36] Ahh yes. Must have been single-threaded then
[16:04:48] mwxml.map only splits processing based on input paths.
[16:05:03] Oh, okay. It's designed to work multi-threaded too
[16:07:54] halfak: uploaded. It takes some time to get stat1001 updated
[16:08:09] (I should make an access request to stat1001)
[16:08:13] Indeed. Great. I'm having a look at the score extraction script now.
[16:09:30] Revision-Scoring-As-A-Service, rsaas-articlequality, Research-and-Data-2017-Q1, User-Ladsgroup: Generate recent article quality scores for English Wikipedia - https://phabricator.wikimedia.org/T135684#2623331 (Ladsgroup) >>! In T135684#2623257, @Halfak wrote: > Re. the integer mapping, that's a...
[16:10:24] Amir1, I'll make it always run concurrently :)
[16:10:43] much faster
[16:10:58] If I knew that, I would do it
[16:11:01] my bad
[16:11:11] no worries :D
[16:11:20] Still got done pretty damn fast for single thread
[16:35:16] I'm excited to see what I can accomplish with a full 16 processes
[16:35:32] This will likely come in handy for moving to v1 and v2 of this dataset
[16:56:21] Revision-Scoring-As-A-Service, revscoring, Spike: [Spike] Investigate HashingVectorizer - https://phabricator.wikimedia.org/T128087#2623508 (Sabya) @Halfak Here is the plot. {F4451218}
[16:56:35] * halfak clicks
[16:58:14] o/ sabya_
[16:58:19] Looking at your graph...
[16:58:29] Why is it that 0 get_support shows the worst fitness? Didn't it show the best fitness in all of your past analyses?
[16:58:30] o/ halfak
[16:59:23] * sabya_ was thinking the same
[16:59:48] In the plot, it looks like 262 is the clear winner. Am I reading the colors right?
[17:00:09] yes
[17:00:25] Cool. So this seems much more promising than I thought before :)
[17:00:58] cool
[17:01:23] Do you think that mistakes were made previously?
[17:01:38] ...that associated get_support == 0 with the best fitness
[17:02:22] not sure if this is correct or the previous ones..
i'll do some sanity health check of my code, clean it up, publish as a complete ipython notebook
[17:02:45] based on the sequence of our conversation in the Phab
[17:03:06] Sounds great. Thanks sabya_. This is great work. :)
[17:03:25] basically your instructions on the Phab would be the notes, the code will interleave.
[17:04:11] i think I made some mistake somewhere. need to cleanup the code and see.
[17:05:54] Amir1, looking at the article scoring system, it looks like there were multiple input XML files.
[17:06:14] SO many dump files == multi-processing
[17:06:36] It looks like there are 27 pages-articles XML dump files
[17:07:20] Oh! You might have just processed the one!
[17:07:53] There's one big one and many smaller ones that can be combined into one big one. Hmm... That means we could process the smaller ones in order to get our performance gain.
[17:17:01] halfak: in the paws instances, can I get my data files uploaded?
[17:17:19] There's a button "upload"
[17:18:15] ok. using scp? i've my data on ores-staging. it would be much faster to transfer from there
[17:21:49] This is a good question for yuvipanda
[17:21:58] I'll ping him in -research
[17:22:14] Bah! He's offline :(
[17:22:21] OK. Best answer is "I don't know"
[17:22:23] :(
[17:22:33] sure, i'll check with him.
[23:37:22] halfak: Do you know what step 2 of https://phabricator.wikimedia.org/T137966 entails?
[23:38:26] I know there's an open question of how "good faith" should be represented in the UI, but we're interested in adding filtering good faith edits only in places like RC and Special:Contributions/newbies (because "good faith new users" is a thing), and for that we'd also need to have goodfaith scores in the DB
[23:44:24] Revision-Scoring-As-A-Service, rsaas-articlequality, Research-and-Data-2017-Q1, User-Ladsgroup: Generate recent article quality scores for English Wikipedia - https://phabricator.wikimedia.org/T135684#2306938 (Catrope) >>!
In T135684#2623331, @Ladsgroup wrote: >>>! In T135684#2623257, @Halfak wr...
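
Note on the 16:04-17:07 exchange about per-file parallelism: when work is split only by input path, a single dump file keeps one worker busy no matter how many are available, while the 27 smaller files could keep all 16 busy. The sketch below illustrates that point with the standard library only. It is not mwxml's API; `effective_workers`, `map_over_paths` and `fake_score` are hypothetical names, and threads stand in for processes to keep the example portable.

```python
from concurrent.futures import ThreadPoolExecutor


def effective_workers(paths, max_workers):
    """Workers that can actually be busy when work is split one task per path."""
    return max(1, min(len(paths), max_workers))


def map_over_paths(process_path, paths, max_workers=16):
    """Apply process_path to every path, one task per path, and collect results.

    Hypothetical helper for illustration, not mwxml's API. Threads are used
    here for simplicity; CPU-bound scoring would want processes instead.
    """
    workers = effective_workers(paths, max_workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as `paths`.
        return list(pool.map(process_path, paths))


def fake_score(path):
    # Stand-in for "read one dump file and score its revisions".
    return (path, len(path))


if __name__ == "__main__":
    one_big = ["one-big-dump.xml"]                      # assumed file name
    many_small = ["part%d.xml" % i for i in range(27)]  # assumed file names
    print(effective_workers(one_big, 16))     # single file: one busy worker
    print(effective_workers(many_small, 16))  # 27 files: all 16 workers busy
    print(map_over_paths(fake_score, many_small)[:2])
```

Under these assumptions, feeding the many smaller dump files instead of the one combined file is what would unlock the speed-up discussed in the log.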