[17:07:44] Amir1: you around?
[17:07:56] yeah :)
[17:08:07] but I got to go soon
[17:08:12] ok
[17:08:21] be online in half an hour
[17:08:26] ok
[17:08:42] I'm chasing dependencies for Kian
[17:09:11] MySQLdb in particular
[17:09:19] We can talk later
[17:22:05] o/
[17:22:07] aetilley, thanks for the email update.
[17:23:17] I've also been pushing on some modeling work. I'm working on an implementation of grid search for revscoring that will test a set of model/parameter combinations and report those with the highest fitness.
[17:23:43] What do you need in order to experiment with massively boosting N for the NLP stuff?
[17:27:02] halfak: hello. That's a good question. I suppose we'd have to do feature extraction for all revisions.
[17:27:12] Or at least lots of them.
[17:27:35] But they would be new features.
[17:27:44] Would you want to implement some NLP features first?
[17:27:50] right
[17:28:19] This is new ground for me, but I'm up for a challenge.
[17:28:37] +1 Sounds good.
[17:28:51] I've been experimenting with removing user.age and user.is_anon from our feature set.
[17:29:05] We lose a lot of fitness, but most models are still OK.
[17:29:29] I think that anything we can do to boost fitness that isn't biased on the user will be great.
[17:29:39] NLP-based features seem like the right direction.
[17:31:38] Yes, but we need to be careful; I think there's a potential to overfit.
[17:32:27] I mean, if we're really talking about a very high-dimensional feature space, and we're trying to correlate word choice with, say, whether or not the editor is editing in good faith...
[17:32:34] yo :)
[17:35:57] aetilley: back
[17:36:11] Kian uses Python 2
[17:36:52] but you can use the core without any dependencies
[17:36:57] halfak: o/
[17:37:16] aetilley, yeah +1.
[17:37:22] o/ Amir1
[17:37:51] halfak: did you get my email?
[17:38:31] Amir1, re: new data for Wikidata?
[17:38:49] no, the one after that
[17:39:23] and re: new data. I have something for you to feed the training system directly
[17:39:36] do you have a little time to do it?
[17:41:13] Amir1: I see.
[17:43:12] I guess I'll try again on my non-VirtualBox machine
[17:43:23] (and Python 2)
[17:43:47] aetilley: we can use dimensionality reduction
[17:44:11] there are tons and tons of methods for that
[17:44:30] in fact, scikit does have something ready
[17:44:38] Amir1: sorry, which conversation is this pertaining to?
[17:44:53] re: using NLP
[17:44:58] ah
[17:45:07] bag-of-words approach
[17:45:32] halfak: Amir1: no, the one after that
[17:45:33] Amir1: and re: new data. I have something for you to feed the training system directly
[17:45:35] Amir1: do you have a little time to do it?
[17:45:44] Amir1, might be able to make time. I'm away visiting my wife's family, so I won't have as much time to hack this weekend.
[17:46:12] great
[17:46:26] can I do something to help out?
[17:46:42] maybe I can extract features for you
[17:47:07] Yes, I'm optimistic that it will give us some valuable information, but I'm just saying we should be careful. Yes, we can totally do a PCA and drop the lowest principal components.
[17:47:12] halfak: ^ (just to be sure you get it :D)
[17:47:38] Amir1, looking for the email now.
[17:47:52] But we're still talking about detecting a fairly subtle label from word choice.
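
As a rough illustration of the direction discussed above, here is a minimal sketch of a bag-of-words feature matrix with dimensionality reduction in scikit-learn (which the log notes "has something ready"). The edit texts and labels are made-up placeholders, and TruncatedSVD stands in for the PCA step mentioned at 17:47:07, since it works directly on sparse term-count matrices; nothing here is from revscoring itself.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    # Placeholder edit texts with hypothetical good-faith (1) / bad-faith (0)
    # labels; real input would be revision text paired with human labels.
    edit_texts = [
        "fixed a typo in the infobox and updated the references",
        "restructured the lead section for clarity",
        "asdf asdf total garbage vandalism",
        "you suck and this page sucks",
    ]
    labels = [1, 1, 0, 0]

    vectorizer = CountVectorizer()            # high-dimensional bag of words
    X = vectorizer.fit_transform(edit_texts)  # sparse term-count matrix

    # Keep only the strongest components to tame the feature space and reduce
    # the overfitting risk raised above; min() keeps this toy example valid.
    svd = TruncatedSVD(n_components=min(100, X.shape[1] - 1))
    X_reduced = svd.fit_transform(X)

On real data, X_reduced (plus the labels) would then feed whatever model the grid-search harness is comparing.
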
[17:47:57] thanks
[17:48:04] Amir1: ^
[17:48:32] I can see
[17:48:50] I should spend some time and think about it
[17:49:31] we can have two groups of words: words that are common in good edits but not in bad ones
[17:49:38] and bad words
[17:49:47] Amir1, good idea.
[17:49:51] Positive features.
[17:50:28] it's easy to extract for each language
[17:50:44] It would be very interesting to have a "good word" list.
[17:51:54] Ok. I suppose I assumed the idea of moving from the badwords list to the general NLP technique...
[17:51:58] among other things...
[17:52:13] was that you don't have to hard-code a list of badwords into your model.
[17:53:01] aetilley, +1, but in the meantime we should try things that ought to work, and later we can make a comparison between the approaches.
[17:53:15] ok
[17:53:59] So, Amir, regarding the Wikidata dataset: if you want to start extracting features, that'd be great.
[17:54:07] I'd down-sample to 20k balanced first.
[17:54:17] sure thing
[17:54:32] no, I built a 24K-edit sample
[17:54:38] it's balanced
[17:55:09] halfak: It'll be done in a few hours
[17:55:40] The 73K sample is good but unbalanced (16% reverted)
[17:56:19] afk for ten minutes
[18:00:05] Cool. Sounds good.
[18:00:08] halfak: So would you recommend I try to use your features as templates?
[18:00:30] Yes, I think so.
[18:00:43] It depends on how you think we should incorporate the NLP stuff.
[18:00:59] I imagine we're going to do some modeling work before we can even extract any features.
[18:01:11] Well, I think Amir1 is right that we should start with a bag-of-words approach
[18:01:36] just because it's simple.
[18:01:44] Maybe one day we can move to n-grams
[18:02:27] I think we have a neat opportunity here
[18:02:56] To apply semi-supervised EM on the huge dataset of (labeled and unlabeled) revisions.
[18:03:39] But I'll put that on the back-burner until B.O.W. is implemented.
[18:04:14] So I'm guessing the way this is typically done is through some library like enchant?
[18:05:55] (The BOW vocabulary)
[18:08:22] aetilley, no idea, unfortunately
[18:08:37] Ok. I'll figure it out.
[18:13:46] Amir1: Trying again on Python 2
[18:13:51] attempting to call
[18:13:52] python scripts/initiate_model.py -n faHuman -w fawiki -p P31 -v Q5
[18:14:20] It's yelling at me about MySQLdb,
[18:14:33] EnvironmentError: mysql_config not found
[18:15:32] Sorry, that's when I try to install MySQL-python
[18:15:52] I'm not sure if there's some package that takes care of all of this...
[18:22:32] aetilley, you'll need to install the dev package for MySQL
[18:22:35] in your OS
[18:22:59] http://stackoverflow.com/questions/7475223/mysql-config-not-found-when-installing-mysqldb-python-interface
[18:23:29] back
[18:23:34] reading
[18:26:15] aetilley: there's no need to do this at all
[18:26:27] ok
[18:26:31] those parts are not needed for our work
[18:27:11] those are to parse and make a training set
[18:27:20] (feature extraction)
[18:27:52] Ok, well I'm just trying to understand how it works, so I thought I'd start with the example in the README
[18:27:55] in case you want to use the ANN directly, build a training set.
[18:28:22] (just to learn)
[18:28:30] you can use revscoring data
[18:29:01] I'm here to guide you through :)
[18:38:58] halfak: Thanks for the link. I now have MySQL-python and that made the error go away.
[18:39:07] (I had thought I had MySQL itself, but I was wrong)
[18:42:43] Amir1: Ok, but is there any documentation on how to do this? Or should it be obvious if I look through the source code?
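
One way to derive the two word groups Amir1 describes at 17:49:31 (a "good word" list and a "bad word" list) is to compare smoothed word frequencies between good and reverted edits and rank words by how skewed they are toward one class. A minimal sketch, assuming tokenized edit text with revert labels as input; the tokens below and the skewed_words helper are hypothetical, not part of revscoring:

    from collections import Counter

    # Placeholder tokenized edits; real input would come from revision diffs.
    good_edits = [["fixed", "typo", "in", "references"], ["added", "citation"]]
    bad_edits = [["u", "suck", "lol"], ["page", "is", "garbage", "lol"]]

    good_counts = Counter(w for edit in good_edits for w in edit)
    bad_counts = Counter(w for edit in bad_edits for w in edit)

    def skewed_words(a, b, smoothing=1.0):
        """Words ranked by how much more frequent they are in `a` than `b`."""
        total_a = sum(a.values()) + smoothing
        total_b = sum(b.values()) + smoothing
        ratio = {
            w: ((a[w] + smoothing) / total_a) / ((b[w] + smoothing) / total_b)
            for w in a
        }
        return sorted(ratio, key=ratio.get, reverse=True)

    good_words = skewed_words(good_counts, bad_counts)  # candidate positive features
    bad_words = skewed_words(bad_counts, good_counts)   # candidate "badwords"

Because the lists are computed from each wiki's own labeled edits, this is "easy to extract for each language," and nothing is hard-coded into the model.
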
[18:43:21] It's obvious if you read the file kian/core.py
[18:43:25] ok
[18:43:29] especially the docstrings
[18:43:32] I started on that, but didn't finish
[18:47:52] And actually, right now I'm not seeing any docstrings in kian/core.py
[18:58:40] aetilley: I thought I put some docstrings in it
[18:59:16] aetilley: in the README file, read the dependency injection part
[19:11:15] ok
[19:12:36] halfak: What would it take to get ORES/Wiki Labels listed in
[19:12:38] https://en.wikipedia.org/wiki/Special:Preferences#mw-prefsection-betafeatures
[19:12:46] (The Beta tab)
[19:16:58] Amir1: So, for instance, in the dependency injection part you have
[19:17:06] bot = Kian(training_set=the_training_set)
[19:17:19] Should I assume that this is a TSV?
[19:17:22] aetilley, we can probably get ScoredRevisions (uses ORES) in the gadgets tab.
[19:18:01] We have an old card for that. Helder didn't want to try to get it added as a gadget until there was substantial usage.
[19:18:21] BUT the ORES extension that legoktm & awight have been working on will go in the Beta tab.
[19:18:29] coolbeans
[19:18:39] I think we might be able to get Wiki Labels into the gadgets tab now that we have several wikis with campaigns.
[19:18:53] But it's a little weird. Gadgets are usually more universal to the site.
[19:26:10] Amir1: Ok, I see now that training_set is a list, but in what format?
[19:26:51] Is each sample a tuple or a list or something else?
[19:37:38] aetilley: [[feature1, f2, f3, ..., y], ...]
[19:38:16] ok
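
To make the format at 19:37:38 concrete: the training set is a plain list of lists, where each inner list holds one sample's feature values followed by its label y. A minimal sketch with made-up numbers; the import path is an assumption based on the kian/core.py file mentioned above, and nothing beyond the Kian(training_set=...) call quoted from the README is assumed about Kian's API:

    from kian.core import Kian  # assumed import path, based on kian/core.py

    # Each inner list: features f1..f3, then the label y, as described above.
    # The values are hypothetical placeholders.
    the_training_set = [
        [0.2, 1.0, 0.7, 1],
        [0.9, 0.0, 0.1, 0],
        [0.4, 1.0, 0.3, 1],
    ]

    # As shown in the README's dependency injection example.
    bot = Kian(training_set=the_training_set)
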