[00:24:29] hey halfak
[00:24:45] I've got a vagrant wiki on my localhost now
[00:24:48] trying to install ORES
[00:26:22] No luck?
[00:28:28] Amir1, ^
[00:28:30] ?
[00:28:43] Trying to
[00:29:16] my biggest problem right now is that I need to build an ORES server
[00:29:24] I don't want to do it on my localhost
[00:29:35] Wait... is ORES installed?
[00:29:58] You shouldn't need to actually build anything other than the ores dependencies.
[00:30:17] then run "ores dev_server"
[00:30:29] let me try
[00:30:30] Oh! You'll need a config file.
[00:30:46] Copy this one: https://github.com/wiki-ai/ores/blob/master/config/ores-testwiki.yaml
[00:31:09] I don't remember what the vagrant dbname is, but you may have to change "testwiki" to whatever it really is in the config file.
[00:33:11] It was "wiki"
[00:33:23] where should I put it?
[00:34:45] Change these two lines:
[00:35:08] https://github.com/wiki-ai/ores/blob/master/config/ores-testwiki.yaml#L47
[00:35:09] https://github.com/wiki-ai/ores/blob/master/config/ores-testwiki.yaml#L29
[00:35:16] Line 47 is the definition
[00:35:22] Line 29 is the reference.
[00:36:22] ok
[00:51:27] Hi all.
[00:51:34] halfak: OSError: [Errno 98] Address already in use
[00:51:40] o/ aetilley
[00:52:28] probably because vagrant is up
[00:52:32] I'm changing ports
[00:54:16] It works but I get timeout errors; my internet is a mess now
[00:54:34] I have to wait for three hours to get something better
[00:56:49] Amir1, use the --port param to choose a different port
[00:57:03] I changed it exactly this way
[00:57:07] Oh. Sorry. Should have read the rest of the messages :/
[00:57:09] now I get a timeout error
[00:57:13] :)
[00:57:17] It's ok
[00:57:23] Oh! It's trying to contact a wiki!
[00:57:27] I forgot about that.
[00:57:29] Hmm...
[00:57:32] Sec.
[00:58:53] Hmm... that's a bummer.
[00:59:26] 1.
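The "[Errno 98] Address already in use" error above means another process (here, the vagrant VM) was already listening on the dev server's default port. A minimal stdlib sketch for checking whether a port is taken before launching; `port_in_use` is a hypothetical helper for illustration, not part of ORES:

```python
import socket

def port_in_use(port, host="localhost"):
    """Return True if something is already listening on host:port.

    connect_ex returns 0 when a TCP connection succeeds, i.e. when
    some process is already bound and listening there.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0
```

If the default port is busy, passing a different one via the `--port` param (as suggested in the chat) avoids the error.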
you need to change the URL testwiki to point back to your wiki's location
[01:00:07] This line: https://github.com/wiki-ai/ores/blob/master/config/ores-testwiki.yaml#L58
[01:00:50] 2. There's no reason this ORES server needs to contact the API host. I need to set up a basic Extractor that doesn't require a connection to function.
[01:05:56] I'm trying
[01:06:09] but my connection got crazy
[01:10:02] Amir1, you shouldn't need to have a connection at all. You should set the wiki host to "localhost", most likely.
[01:10:34] I'm trying to open the github page
[01:10:38] it's not opening
[01:10:45] :|
[01:12:41] halfak: I'll fix this. I'm waiting to get a reasonable connection, but when I'm done with this, please let's work on using it in tools.wmflabs.org/ores/mediawiki
[01:12:55] it would be amazing to show people how it will integrate with mediawiki
[01:13:40] Amir1, +1. Next deploy is going to have a model for testwiki.
[01:13:51] awesome
[01:14:09] also, do you consider using random forest for wikidata?
[01:14:10] I got the YuviPanda stamp on this not being stupid. :)
[01:14:39] Yes. I'm finishing up the reports for the rest of the editquality models and will do them all at once when I write up a report.
[01:14:55] Fun story: GradientBoost works better for all of the other wikis.
[01:15:21] :D
[01:15:42] we can see different approaches in choosing features
[01:16:07] If I were writing features for wikipedia, random forest might get better results
[01:16:09] :D
[01:16:26] awesome
[01:17:26] With this new tuning utility, we can check all of the models and parameter sets (90 total combinations) in a couple hours.
[01:17:44] (Really, more like 45 minutes, but I have some intermittent slowness I'm working on)
[01:17:59] \o/
[01:19:50] So yeah. Any time we substantially change the feature list, we can re-tune.
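The tuning utility discussed above sweeps every combination of model type and hyperparameter setting. A minimal sketch of how such a grid can be enumerated with the stdlib; the model names and parameter values below are made up for illustration (the real utility's 90 combinations come from its own, larger grid):

```python
from itertools import product

# Hypothetical grid -- not the actual revscoring tuning configuration.
models = ["RandomForest", "GradientBoosting", "SVC"]
param_grid = {
    "n_estimators": [10, 100, 500],
    "learning_rate": [0.01, 0.1],
}

def combinations(models, grid):
    """Yield every (model, params) pair from the cross-product of the grid."""
    keys = sorted(grid)
    for model in models:
        for values in product(*(grid[k] for k in keys)):
            yield model, dict(zip(keys, values))

combos = list(combinations(models, param_grid))
# 3 models x (3 * 2) parameter settings = 18 combinations in this toy grid
```

Each (model, params) pair would then be trained and scored on the same train/test split, and the best-performing combination kept.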
[01:23:16] http://tech.swamps.io/training-a-simple-pcfg-parser-using-nltk/
[01:23:55] https://labels.wmflabs.org/campaigns/wikidatawiki/?campaigns=stats
[01:24:06] I like where it's going
[01:25:20] Amir1, we're going to have data in no time!
[01:26:19] aetilley, cool. Extra nice that the stanford dataset is there.
[01:26:41] So, we train a parser on a language. That could work.
[01:26:50] We can include those trained models with the revscoring package.
[01:27:32] Then languages that have models trained will provide PCFG-based features.
[01:27:34] btw Kian now works with python3 (and 2) + it supports balancing datasets
[01:27:44] Amir1, awesome!
[01:27:51] balancing?
[01:29:25] let's say you have a set with 100 positives and 10 negatives. It supports sampling only 10 positives
[01:29:28] like what you do
[01:29:43] Gotcha.
[01:29:51] I've been thinking about the strategy for that.
[01:30:04] How do you work out the proportion for randomly sampling?
[01:30:09] Using this I was able to add 10K more statements to wikidata
[01:30:20] That's tremendous!
[01:31:20] https://github.com/Ladsgroup/Kian
[01:31:32] Can you check the last two commits?
[01:33:31] Gotcha. So you balance as you are loading in.
[01:33:42] Is this doing sampling with replacement?
[01:35:34] It changes the training set before loading it and making it an attribute
[01:35:56] i.e. it rebuilds the training set
[01:36:55] Gotcha.
[01:37:25] https://github.com/wiki-ai/revscoring/blob/master/revscoring/scorer_models/svc.py#L88
[01:38:25] Great!
[01:38:27] Sorry, was away
[01:38:33] No worries. Good find.
[01:38:39] I can't open github right now, it seems it got blocked again
[01:38:53] My next question is going to be: How big is the data structure when you pickle it?
[01:39:10] If it's smaller than 100k, we can just include it with the package.
[01:39:22] (that's why I asked you to find my last commits)
[01:39:27] If it's bigger, we might need to provide it as a dat file.
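The balancing described above (100 positives, 10 negatives, keep only 10 of each) is downsampling the majority class without replacement. A minimal stdlib sketch of the idea; `balance` is a hypothetical helper for illustration, not Kian's actual code:

```python
import random

def balance(dataset, label_of=lambda obs: obs[-1], seed=0):
    """Downsample every class to the size of the smallest class.

    `dataset` is a list of observations; by default the label is the
    last element of each observation (the Kian-style layout from the
    chat). Sampling is without replacement.
    """
    rng = random.Random(seed)
    by_label = {}
    for obs in dataset:
        by_label.setdefault(label_of(obs), []).append(obs)
    n = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced
```

With 100 positives and 10 negatives, this yields a 20-observation set with 10 of each class, matching the example in the chat.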
[01:39:46] halfak: I don't pickle it
[01:39:47] Amir1, gotcha. That's a bummer. I can't imagine having resources pulled away like that.
[01:40:03] I use .dat files and json
[01:40:18] Amir1, I was talking about aetilley's pcfg models
[01:40:22] I want to, but I've never encountered any issues that make me use it
[01:40:26] oh sorry
[01:40:28] :D
[01:40:33] :))))
[01:40:34] halfak: Idk but I can find out
[01:40:35] Amir1, but still, that's pretty cool.
[01:40:50] Amir1, how big is your model when saved to disk?
[01:41:00] usually about 30K
[01:41:07] training set
[01:41:07] That's pretty good!
[01:41:46] AUC is around 99.8%
[01:42:01] wha
[01:42:20] Amir1, maayyybe you should try that on our prediction problems too. :DDD
[01:42:40] It's pretty resource consuming
[01:42:52] We should do a comparison.
[01:42:54] e.g. my training set is big but the number of features is small
[01:43:22] but send me a file and I'll check it out
[01:43:27] a tsv file
[01:45:09] Amir1, still got the last one from clustering?
[01:46:59] Amir1, sent!
[01:47:33] I could find it
[01:47:41] but it would take a while
[01:47:44] thanks
[01:47:52] Sure. Hopefully email works.
[01:48:22] * aetilley clones Kian
[01:51:14] * halfak runs away
[01:51:24] have a good night/day/whatever see you tomorrow!
[01:51:28] o/
[01:52:05] :)
[01:52:08] o/ halfak
[01:52:31] ttfn
[02:48:56] Amir1: So in line 37 of core.py, right after you assign self.training_set, you have an if/else block that seems to assume that the last element of each case is either an int or a collection (or something with a len).
[02:49:13] Is this right?
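The "AUC is around 99.8%" figure above refers to the area under the ROC curve. It can be computed without any ML library as a rank statistic: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (ties count half). A small illustrative sketch, not Kian's actual evaluation code:

```python
def auc(scores_labels):
    """ROC AUC via the Mann-Whitney formulation.

    `scores_labels` is an iterable of (score, label) pairs with labels
    1 (positive) and 0 (negative). Quadratic in the number of examples,
    so only suitable for small evaluation sets.
    """
    pos = [s for s, y in scores_labels if y == 1]
    neg = [s for s, y in scores_labels if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

A perfect classifier scores 1.0; random guessing hovers around 0.5, which is why the 99.8% above is striking (and worth the comparison the chat proposes).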
[02:49:38] yeah
[02:50:19] for multiple y for each member of training set
[02:50:46] I forgot the word but I think it was called "multinomial classifier"
[02:50:56] gotcha
[02:51:13] and I had forgotten that in python we do have isinstance(, int)
[02:51:28] So we can have booleans in the last place too
[02:51:44] ("multivariate")
[02:52:33] Ok, so my data can be any iterable collection of items that can themselves be cast to list
[02:58:00] the last part can be a list
[02:58:06] aka y
[02:58:13] but x
[02:58:19] is different
[02:58:45] No, I mean the data as a whole
[02:58:57] input data object
[02:59:11] uh
[02:59:14] yeah
[02:59:16] for training
[02:59:17] yea
[02:59:25] input can be the same too
[02:59:42] thanks
[02:59:47] but disclaimer, I've never tested it on a multivariate classifier
[02:59:51] it should work fine
[03:00:02] but I didn't check it very carefully
[03:04:07] No problem. I'm about to check it on the enwiki_damaging_20k
[03:04:10] dataset
[03:56:34] Amir1: http://pastebin.com/N1EUyTSi
[03:58:51] probably you should change "True" and "False" to 1 and 0
[03:59:04] I need to examine this in great detail
[03:59:11] wait a sec
[03:59:18] maybe I can do something
[03:59:41] This might be on my end. Let me see what this "k" that gets passed is.
[04:00:53] I sent you something that Kian accepted
[04:03:36] aetilley: Kian accepts something like this set([(0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0),...])
[04:03:51] this is the first member of the training set
[04:03:57] the last part (0) is y
[04:04:01] others are x
[04:07:58] ah so I have to pack the feature_values into a tuple
[04:09:20] It does accept float feature values, correct?
[04:22:07] aetilley: yeah
[04:43:22] So in line 44 of core.py you pass len(self.training_set) / 5 as the second argument to random.sample, but that function requires an int
[04:43:38] Ah! It's a python 2/3 change!
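The input format described above — observations as tuples whose last element is y and whose other elements are x, with an isinstance check to tell a scalar label from a multivariate one — can be sketched as follows. This is a hypothetical helper inferred from the chat, not Kian's actual core.py:

```python
def split_xy(training_set):
    """Split Kian-style observations into feature rows and labels.

    Each observation is a tuple like (x1, x2, ..., xn, y). If the last
    element is a list/tuple, it is treated as a multivariate label;
    otherwise (int, and therefore also bool, since bool subclasses int)
    it is a single label.
    """
    xs, ys = [], []
    for obs in training_set:
        obs = list(obs)
        last = obs[-1]
        if isinstance(last, (list, tuple)):
            ys.append(tuple(last))   # multivariate y
        else:
            ys.append(last)          # single int/bool label
        xs.append(tuple(obs[:-1]))
    return xs, ys
```

Applied to the example from the chat, the 21-element tuple splits into 20 feature values and the label 0.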
[04:43:45] Amir1:
[04:44:30] Presumably you want integer division, and python3 reads float/int as floating point division.
[04:44:34] You can fix it easily :)
[04:45:00] If you like. Should I send a pull request?
[04:47:05] (sorry, I didn't mean float/int, I meant int1/int2 where this is not an even division.)
[04:49:56] you should do int(len(self.training_set) / 5)
[04:50:08] I can make it and commit it right away
[04:50:16] ok
[04:52:48] * aetilley stretches.
[04:54:38] shit, github is blocked
[04:54:46] please do it in your system
[04:54:57] I'll go to university and push the patch
[04:55:01] aetilley: ^
[05:40:46] aetilley: I pushed the commit
[16:08:56] Hello
[16:09:17] Asking around for Chinese
[16:50:52] hello internet
[16:50:55] press 1 to continue
[17:14:44] yo halfak
[17:15:00] o/
[17:15:10] In meeting.
[17:15:13] k
[17:15:15] Should be done soon.
[17:15:20] Want to chat before we meet?
[17:19:11] Just got out of meeting.
[17:19:18] Grab food or chat?
[17:20:42] o/ halfak
[17:20:44] hey
[17:20:55] I'll be at the meeting, but ten min. late
[17:21:06] Oh! that works for me.
[17:21:13] I might be late too.
[17:21:17] * halfak considers getting lunch
[17:21:19] Na.
[17:21:29] I'll be on time, but will be prepared to have you show up late :)
[17:21:49] ok
[17:21:55] see you soon
[17:25:11] sure
[17:25:28] I am struggling with identifying what is used for chinese wikipedia, but I am getting close to a resolution maybe
[17:25:34] it is being frustrating though :/
[17:25:46] I am ready to meat
[17:25:48] :p
[17:29:55] * White_Cat throws meat at halfak
[17:30:00] :D
[17:30:46] * halfak consumes meat and joins meeting
[20:53:11] halfak: hey, sorry I had to go. Is there anything I can do now?
[20:53:21] o/
[20:53:31] * halfak thinks.
[20:54:09] Any TFiDF runs necessary? Convert your balanced dataset work to a configurable script. Label edits in Wikidata!
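The Python 2/3 change being discussed: `/` between two ints is integer division in Python 2 but true division in Python 3, so `len(self.training_set) / 5` becomes a float and `random.sample` rejects it. A minimal demonstration of the fix (using floor division `//`, equivalent to the `int(...)` wrapper suggested in the chat):

```python
import random

# Stand-in data; Kian's real training set is a collection of feature tuples.
training_set = list(range(100))

# Python 2: len(training_set) / 5 -> 20 (an int), so random.sample worked.
# Python 3: "/" is true division -> 20.0 (a float), and random.sample raises
# TypeError for a non-integer sample size.
k = len(training_set) // 5          # floor division: an int in both 2 and 3
sample = random.sample(training_set, k)
```

`int(len(training_set) / 5)` behaves the same for non-negative lengths; `//` just states the intent directly.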
[20:54:27] there's no tfidf run for now
[20:54:28] * halfak pulls things from the top of his head.
[20:54:39] the blog post?
[20:54:45] or is it too soon
[20:54:47] Gotcha. Maybe we should also turn that TFiDF script into something that lives in 'editquality' too.
[20:54:54] Oh yeah. No, let's start drafting now.
[20:55:35] Stay high level. Think about the stories you'd like to tell and how you'd like people to think about ORES + Wikidata.
[20:55:45] * YuviPanda waves very vaguely
[20:55:47] Then let's have a follow-up discussion soon.
[20:55:50] o/ YuviPanda
[20:55:56] hi
[20:55:56] getting vaguer every time.
[20:56:00] yeah
[20:56:01] :P
[20:56:05] someday I'll just disappear :)
[20:56:08] :(
[20:56:36] :(
[20:56:50] okay halfak let me think and I'll let you know
[20:57:00] OK sounds good.
[20:57:04] See you tomorrow at hack time?
[20:57:08] Amir1, ^
[20:57:19] sure :)
[20:57:54] Cool :)
[23:11:49] awight: hey, around?
[23:13:31] I've installed the ORES extension on my localhost and http://tools.wmflabs.org/ores/mediawiki/index.php?title=Main_Page
[23:13:56] Amir1, great!
[23:14:05] I changed the source code in order to get scores from ores
[23:14:06] Does it seem to be working?
[23:14:19] but it seems it's not working as far as I can see
[23:14:25] see the recent changes
[23:14:31] maybe I'm missing some steps
[23:14:32] Look for a little "r" in the recent changes.
[23:14:56] In the legend on the upper right.
[23:15:05] You should see a symbol for "needs review"
[23:15:53] no
[23:15:59] it's not showing anything
[23:16:05] I guess something is wrong.
[23:16:10] Maybe awight has some suggestions.
[23:16:10] except " | Hide revert predictions"
[23:16:24] I can add him to the service group
[23:16:29] then he checks
[23:16:44] but he's not around
[23:17:24] :(
[23:17:33] Oh!
[23:17:35] Hey~!
[23:17:40] That's evidence that it is working.
[23:17:47] What does clicking that do?
[23:17:52] nothing
[23:17:57] there is no change
[23:17:58] "Hide revert predictions" seems like the old wording too.
[23:18:06] Are you sure you are on the most recent patch?
[23:18:07] yeah
[23:18:14] I'm pretty sure
[23:18:24] I suggest responding on the phab card.
[23:18:24] maybe the patches are not merged yet
[23:18:50] Make sure to note that you do see some changes ("hide revert predictions") but that you don't see an "r" in the legend.
[23:19:18] sure thing
[23:19:23] maybe it's a bug
[23:19:30] I'm trying to figure it out
[23:19:42] What browser?
[23:19:45] Could be related
[23:20:02] firefox
[23:20:28] but compatibility is important, and it's not js- or css-related stuff
[23:20:30] IMO
[23:20:59] Totally. It's important to know if the issue is browser-related though.
[23:22:15] Also, I can add you and you can make any changes you want to test
[23:22:17] and show people
[23:22:22] halfak: ^
[23:25:59] * halfak suddenly realizes that he can look at the wiki too :D
[23:27:02] Amir1, I can't get any edits to show up in recent changes.
[23:27:24] I made some edits
[23:27:32] and also maybe there is the issue
[23:27:46] I see 'em now
[23:27:59] it doesn't work because I'm getting results from the enwiki models
[23:28:08] and these edits are really old
[23:28:10] Amir1, did you run the maintenance script when setting this up?
[23:28:17] (diff=9 e.g.)
[23:29:04] yes
[23:29:10] several times
[23:29:35] I've got a call now. Will be back in 30 minutes
[23:29:45] okay