[14:06:21] Hey folks.
[14:06:42] I've got to go back some coffee. I'll be back soon. :)
[14:10:17] *make
[14:19:30] OK!
[15:20:35] wiki-ai/revscoring#291 (more_languages - 9e87034 : halfak): The build was fixed. https://travis-ci.org/wiki-ai/revscoring/builds/88526493
[15:21:00] Had. Take that travis
[15:21:07] *Ha!
[15:50:50] o/ guillom
[15:50:59] thanks for the edits on the frwiki stuff
[15:51:10] I hope to try extracting features again today. :)
[15:51:40] hey halfak
[15:51:42] you're welcome :)
[15:53:04] Would you be interested in taking a pass on our bad/informal word lists too?
[15:53:26] See https://github.com/wiki-ai/revscoring/blob/master/revscoring/languages/french.py
[15:53:41] And https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/fr
[15:54:04] We have been working with a minimal set that a volunteer put together for us in Lyon.
[15:54:17] halfak: sure; I'll take a look later. I'm still in the waking-up phase :D
[15:54:27] Gotcha. Thanks :)
[16:08:05] o/ aetilley
[16:10:57] hello halfak
[16:11:28] how are you?
[16:11:58] Not bad. :)
[16:12:12] I just got three new languages added to revscoring: Dutch, German, and Italian.
[16:12:25] Now I'm hacking on the article quality predictor for French Wikipedia.
[16:12:42] I think we should be able to substantially boost fitness thanks to some help from guillom :)
[16:12:53] And now I know who to bother about French stuff :D
[16:13:28] is the recommendation API working?
[16:13:46] harej, which one?
[16:14:04] recommend.wmflabs
[16:14:28] Oh! Yes.
[16:14:48] e.g. http://recommend.wmflabs.org/api?s=en&t=ca
[16:15:03] Even better: http://recommend.wmflabs.org/api?s=en&t=ca&article=Orange
[16:15:13] That way you can give it a seed article.
[16:15:37] On a side note, I think that taking the "recommend" subdomain is lame.
[16:15:38] I like seed articles.
[16:15:42] They should name the project.
[16:15:49] Generic brands are the best.
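[Editor's note: a minimal sketch of how a language badword list like the french.py one linked above might be used. The word list and function names here are hypothetical illustrations, not the actual revscoring API.]

```python
import re

# Hypothetical badword fragments (illustration only, not revscoring's list).
# "connards?" uses an optional "s" so singular and plural both match.
BADWORD_PATTERNS = ["merde", "putain", "connards?"]

# Compile the fragments into one word-boundary, case-insensitive pattern.
badwords_re = re.compile(
    r"\b(" + "|".join(BADWORD_PATTERNS) + r")\b",
    re.IGNORECASE,
)

def count_badwords(text):
    """Count badword matches in a chunk of revision text."""
    return len(badwords_re.findall(text))
```

A feature extractor can then turn each revision's diff text into a number like `count_badwords(added_text)` for a classifier to consume.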
[16:17:56] > This webpage is not available
[16:17:56] > ERR_INCOMPLETE_CHUNKED_ENCODING
[16:18:06] ha
[16:18:24] lzia & ewulczyn are the right people to page
[16:18:40] They usually don't hang out here -- to my massive regret.
[16:18:52] actually, I think I've never got them to join :(
[16:27:00] OMG. I just realized we can have tests for our feature lists.
[16:27:04] TEST EVERYTHING
[16:27:11] Also, this thought is exhausting
[16:27:22] Tests are great. I like things working.
[16:46:08] o/ Krinkle
[17:00:48] Krinkle, would you be interested in helping me clean up our regex detector for Dutch badwords? See https://github.com/wiki-ai/revscoring/blob/more_languages/revscoring/languages/dutch.py#L27
[17:12:22] * halfak completes tests for enwiki and frwiki features.
[17:20:06] OK. Now to train models.
[17:38:41] * halfak waits for data to move around.
[18:02:33] Just finished a primitive script to recursively remove small clusters.
[18:02:45] Look for an update soon if you're interested
[18:02:51] halfak: ^
[18:02:56] Amir1: ^
[18:03:00] Cool :)
[18:04:00] it appears the min of the pair of cluster sizes goes 2, 11, 30, 40, 796.
[18:04:13] (as we remove small clusters and recluster)
[18:04:17] More to come though.
[18:05:34] aetilley, still thinking it is weird that you're getting substantially different cluster results from Amir
[18:05:51] Have you implemented scaling?
[18:05:57] halfak: are you sure it's the same dataset?
[18:06:14] aetilley, it's a different dataset, but they shouldn't be *that* different
[18:06:23] not yet
[18:06:24] Have you tried scaling?
[18:06:27] k
[18:06:47] Seems like that should be easy and more likely to address the problem to me.
[18:06:48] Well actually I think I tried it in R and it gave similar results.
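[Editor's note: a sketch of why scaling matters for the clustering discussion above. The features and data here are made up for illustration; the point is that columns on very different scales dominate Euclidean distances, so standardizing before clustering can change the result dramatically.]

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

# Two synthetic feature columns on wildly different scales, e.g. raw bytes
# changed vs. a small proportion. Without scaling, the first column
# dominates distance computations.
rng = np.random.RandomState(0)
X = np.column_stack([
    rng.normal(0, 10000, 500),   # e.g. bytes changed
    rng.normal(0, 0.1, 500),     # e.g. proportion of badwords
])

# Standardize each column to zero mean / unit variance, then cluster.
X_scaled = scale(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
```

`sklearn.preprocessing.scale` (linked later in the log) does this standardization in one call.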
[18:08:50] hmm
[18:21:52] halfak: oh look
[18:21:55] http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
[18:33:36] :)
[18:33:38] Perfect
[19:02:55] Aha, after rescaling, the first pair of cluster sizes is (14291, 5572). And the p value went from 1 to zero.
[19:03:08] yay!
[19:03:12] WOooT!
[19:03:16] I guess?
[19:03:25] Well yeah
[19:04:09] I'm happier with a p value of 0 than 1.
[19:17:09] * halfak implements a parallelized feature extractor.
[19:17:14] Hello 8x speedup :)
[19:18:29] win
[19:18:51] Now if only I could make model-building a bit faster
[19:30:18] Just sent you two files of rev_ids, one for each of the clusters.
[19:30:41] neat
[19:30:56] I intend to complete my pull request sometime today or tomorrow
[19:31:00] halfak: ^ also: Is there a way to get np.savetxt to NOT automatically convert my input to floats?
[19:31:01] I do not like it sitting there
[19:31:13] assuming someone will merge it once I am done :3
[19:33:01] aetilley, could you include the last column (damaging/not-damaging) in that?
[19:33:14] White_Cat, the one on revscoring?
[19:33:36] If so, I just closed it in favor of one that I have that completed the work.
[19:34:51] oh okay
[19:34:53] that's fine too
[19:35:00] sorry I got distracted from it
[19:38:27] No worries. You could still pick up Turkish.
[19:38:37] It needs some work on the regular expressions.
[19:38:51] A native speaker can do better than I can.
[19:46:38] halfak: yes. just one sec
[19:51:03] yes
[20:11:45] there is a pattern I noticed as well
[20:12:21] you know how people unnecessarily use ph instead of f?
[20:17:02] * aetilley goes for a walk
[20:20:59] White_Cat, indeed. That is a thing.
[20:32:01] FYI guillom: Substantially pushed the accuracy of the frwiki models from .54 to .57.
[20:32:16] It looks like the 'a' class is really hard to predict.
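[Editor's note: the `np.savetxt` question above has a direct answer: `savetxt` formats every value with its `fmt` argument, which defaults to `'%.18e'` (scientific float notation). Passing an integer or string format keeps rev_ids intact. The rev_id values below are made up for illustration.]

```python
import io

import numpy as np

# Example rev_ids (hypothetical values).
rev_ids = np.array([12345678, 87654321, 10203040])

# fmt="%d" writes one plain integer per line instead of float notation.
buf = io.StringIO()
np.savetxt(buf, rev_ids, fmt="%d")
print(buf.getvalue())
```

For mixed or non-numeric columns, `fmt="%s"` works similarly.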
[20:36:20] and for my last act, I
[20:36:35] 'll be implementing parallel processing in revscoring's feature extractor too
[20:59:07] wiki-ai/revscoring#295 (parallel_extraction - 1d5e701 : halfak): The build passed. https://travis-ci.org/wiki-ai/revscoring/builds/88566768
[20:59:18] shuddup travis
[21:08:21] And... time to go
[21:08:27] have a good one, folks!
[21:08:28] o/
[23:44:14] halfak: Hi
[23:45:14] halfak: OK. I'll check the list out
[23:45:19] halfak: I'd like to know how this process works
[23:45:29] I assume it's of course a lot more than detecting precoded bad regexes.
[23:45:45] From what I remember from the intro session at the office, I'm surprised such a list exists at all.
[23:46:06] I assume there are some sample revisions as well that it trains on
[23:46:21] "OK. Now to train models." – I'm curious what that entails, to "train models"
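[Editor's note: a rough sketch of what "training a model" means in this context, not revscoring's actual pipeline. Each labeled revision is turned into a feature vector (badword counts, bytes changed, etc.), and a classifier is fit to predict the label. The features and labels below are synthetic stand-ins.]

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 1000 "revisions", 4 features each, with a binary
# label (e.g. damaging / not damaging) derived from the first two features.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out a test set so fitness is measured on revisions the model
# has not seen during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# "Training" = fitting the classifier's parameters to the labeled examples.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # fraction of held-out labels predicted correctly
```

The accuracy figures quoted earlier in the log (e.g. frwiki going from .54 to .57) come from this kind of held-out evaluation.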