[00:52:58] o/ Helder [01:01:32] Looks like we can get back up to 0.84 AUC for enwiki "damaging" with hyperparameter optimization. :) [01:03:57] o/ halfak [01:04:04] o/ Amir1 [01:04:07] 90%! great [01:04:11] saw you merging stuff. [01:04:15] Right!? [01:04:16] :D [01:04:26] the good thing is we didn't use user.age at all [01:04:36] yeah :D [01:05:08] if we use user.age we can get AUC around 95% [01:07:33] Really! Honestly, I might rather use user.age than is_anon. Hmm. [01:09:08] depends on you :) [01:09:38] I'm not sure if it is better to be biased against anons or to take advantage of the full range of age. [01:10:24] So, also a fun story. I figured out how to slice an order of magnitude off of our SVC train times. [01:10:42] It turns out that there is a param for setting memory usage that is *super* low. [01:10:47] halfak: do we need to launch an edit quality campaign for wd? [01:10:48] If you turn that up, BAM. [01:10:54] Yes. Good Q. [01:11:10] amazing :) [01:11:11] So, one problem I haven't worked out yet. How do we sample? [01:11:30] we can use this one [01:12:04] the file I gave you [01:12:08] My thoughts are (1) that we run with the balanced set you made or (2) that we make a balanced set of edits that need review based on the 'prelabel' script. [01:12:21] Yeah. I don't see this being necessarily a bad thing. [01:12:35] But it makes me think about things that purely random sampling allows me to ignore. [01:20:25] halfak: can we let wikidatan(?) use the new model? [01:22:32] Yeah. I'm working on setting up a staging deploy now. [01:23:41] awesome [01:23:49] tell me when I can help [01:24:38] Once I get it up, I'd like to write up a little report of the problems we solved with example scores. [01:25:09] Amir1, I'm not sure I want you to type a bunch, but if you could copy-paste a bunch of rev_ids for revisions we should score, I can work from that. [01:25:27] E.g. recent revisions to Water and some merge edits. 
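The "param for setting memory usage" mentioned for SVC training above is most likely scikit-learn's `cache_size` (the kernel cache in MB, which defaults to just 200). A minimal sketch of raising it inside a hyperparameter search scored on AUC, using synthetic data rather than the actual enwiki "damaging" features, and a parameter grid that is purely illustrative (not the real ORES tuning configuration):

```python
# Sketch only: synthetic data stands in for the real enwiki "damaging"
# feature set; the parameter grid is illustrative, not the ORES config.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# cache_size is the kernel cache in MB; the default (200) is quite low,
# so raising it can cut SVC training time substantially on big datasets.
svc = SVC(cache_size=1000)

search = GridSearchCV(
    svc,
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    scoring="roc_auc",  # AUC, the metric discussed above
    cv=3,
)
search.fit(X, y)
print(round(search.best_score_, 3))
```

On a toy set like this the cache barely matters, but on tens of thousands of rows the difference between the default and a larger cache is consistent with the order-of-magnitude speedup described above.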
[01:26:58] I can use the Google voice-to-text tool in Google Docs if you want [01:33:29] Amir1, that would be great. So long as you're healing and comfortable :) [01:33:42] I just got the staging server up. [01:33:44] http://ores-staging.wmflabs.org/scores/wikidatawiki/reverted/4567894/ [01:33:55] So you can run tests. [01:34:01] I'll do the water edits real quick. [01:36:18] hi all [01:36:48] ohh, i expected a reply from an AI :( [01:37:48] awesome [01:46:18] o/ bmansurov_away [01:46:26] Welcome! [02:03:09] thanks! looking forward to learning more about this awesome project [02:04:33] bmansurov: I'd be happy to help [02:05:13] Amir1: thanks, I know how to bother now ;) [02:05:16] who [02:07:49] :D [02:08:18] If I'm not around ask halfak, he is not around now :) [02:12:57] ok [03:01:03] * yuvipanda waves at bmansurov [03:01:22] * bmansurov waves back at yuvipanda [03:50:16] halfak: you're on Wired! [03:50:16] http://www.wired.com/2015/12/wikipedia-is-using-ai-to-expand-the-ranks-of-human-editors/ [07:14:02] Amir1: sorry I just deleted your paws instance... [07:14:05] just testing some more things [07:14:28] yuvipanda: It's okay [07:14:40] tell me when I can test it :) [07:15:13] Amir1: :D probably in a few minutes [07:17:55] thanks :) [07:34:59] Amir1: you can play with it now :) [07:35:07] hi everyone [07:35:21] Amir1: you can file bugs against the PAWS project on phabricator [07:35:23] hi Vinh_ [07:35:27] awesome thanks yuvipanda [07:35:32] hi Yuvi [07:35:35] I have a question [07:35:40] https://github.com/wiki-ai/ores [07:35:41] here [07:36:00] or more precisely [07:36:01] https://github.com/wiki-ai/ores/blob/master/R/loader/enwiki_feature_reverts.R [07:36:21] the author used the dataset of reverted revisions of Wikipedia [07:36:34] named "enwiki.v2.features_reverted.5k.tsv" [07:36:41] is there a way to download this file? [07:37:06] Vinh_: ah, I'm sure there is. [07:37:27] great [07:37:34] could you tell me where I can get the file [07:37:36] Vinh_: maybe? 
http://datasets.wikimedia.org/public-datasets/enwiki/reverts/ [07:37:57] great [07:38:05] but why is the size 13GB [07:38:06] :-S [07:38:31] which file should I download to rerun the classification in the GitHub repo? [07:38:56] I've no idea :D [07:38:58] halfak would know [07:39:03] but he's asleep probably :( [07:39:15] Vinh_: can you file an issue on github? [07:40:37] yeah [07:40:40] here it is [07:40:41] https://github.com/wiki-ai/ores/issues/106 [07:51:43] Vinh_: cool :) [07:52:01] :) [07:52:16] Amir1: how do you like it? :) [07:52:34] (we are playing with https://tools.wmflabs.org/paws/hub/oauth_login, for others who are curious) [07:53:44] It's really good [07:53:56] I wish pwb was there by default [07:54:13] Amir1: it is [07:54:17] Amir1: in /srv/pwb [07:54:30] Amir1: also it's installed for the default python interpreter [07:54:40] are there any publications related to reverted prediction in Wikipedia? [07:55:58] hmm [07:56:12] I'm trying to figure out how it works [07:56:41] Can you put the scripts folder of pwb in the home directory? [07:56:49] (by default) [07:56:59] yuvipanda: ^ [08:01:41] Amir1: hmm, so that's a bit tricky [08:01:49] Amir1: since the home directory is persisted [08:01:56] Amir1: and the /srv/scripts is not [08:02:00] Amir1: but I can probably put a symlink there [08:06:49] I don't want to push. Do whatever is more convenient for you [08:16:47] is the reverted prediction based on any publication? [08:16:51] :-) [09:05:17] how can I download from http://datasets.wikimedia.org/public-datasets/enwiki/reverts/ [09:05:22] it always reports an error [09:28:38] Amir1: no no, this is useful feedback :) [09:28:43] Amir1: do you think you can file a bug? [09:48:43] Amir1: https://phabricator.wikimedia.org/T120072 [09:54:43] I was afk [09:54:52] thank you for doing it :) [14:12:09] :-( [14:13:46] still a server problem while downloading from datasets.wikimedia.org [14:38:26] Vinh_, what is the error? [14:40:18] Nevermind, I've got it. 
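The symlink idea discussed above (the home directory persists across PAWS restarts, `/srv` does not) could look like the sketch below. All paths are temporary stand-ins created for demonstration, not the real PAWS layout, and the `login.py` file is just a placeholder for a pywikibot script:

```python
# Sketch of the PAWS symlink idea: $HOME persists across restarts but
# /srv does not, so a startup hook could symlink the pywikibot scripts
# into the home directory. All paths below are temporary stand-ins.
import pathlib
import tempfile

root = pathlib.Path(tempfile.mkdtemp())
srv_scripts = root / "srv" / "pwb" / "scripts"  # stand-in for /srv/pwb/scripts
home = root / "home"                            # stand-in for the user home
srv_scripts.mkdir(parents=True)
home.mkdir()
(srv_scripts / "login.py").touch()              # placeholder pwb script

link = home / "scripts"
link.symlink_to(srv_scripts)  # what a PAWS startup hook might do

print([p.name for p in link.iterdir()])  # → ['login.py']
```

Because the link lives in the persistent home while its target is recreated on each restart, the link survives restarts and simply resolves again once `/srv` is repopulated.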
[14:42:58] Vinh_, https://phabricator.wikimedia.org/T120091 [14:45:40] thanks a lot, halfak [14:45:46] btw halfak [14:45:52] for ORES [14:45:57] for the wp10 service [14:46:03] you tested with English and French datasets [14:46:23] where can I get this dataset (if possible, with feature scores :-) )? it would be great [14:47:06] Oh! happy to share. [14:47:17] In fact, I might upload them to github. Let's see how big they are. [14:47:43] Yeah... a little bit too big. [14:48:17] could you upload to another host [14:48:30] I think I have an idea to improve the accuracy [14:48:33] Was going to use datasets.wikimedia.org :S [14:48:34] :) [14:48:53] so please [14:48:57] :-) [14:49:03] So, I have a dataset that contains the text for each observation. [14:49:14] I have another that just has rev_id/label pairs. [14:49:50] do you have a dataset with just feature scores [14:50:06] if I am correct, you're using the 11 features, right? [14:50:17] Oh yeah. I can do that. [14:50:31] Each row will be the list of feature values + the label [14:50:36] yeah [14:50:39] it will be great [14:50:55] from the authors' names, I guess the model is more or less similar to the model presented in the paper [14:50:56] Warncke-Wang, Morten, Vladislav R. Ayukaev, Brent Hecht, and Loren G. Terveen. "The Success and Failure of Quality Improvement Projects in Peer Production Communities." In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, pp. 743-756. ACM, 2015. [14:51:02] correct me if I am wrong [14:51:05] Yup. That's right. [14:51:34] great [14:51:35] We've made some improvements since then, but it hasn't affected our accuracy that much. [14:51:40] I see [14:51:42] Email OK for now? [14:51:46] yeah [14:51:49] could I email you [14:51:52] Want both frwiki and enwiki? [14:51:54] or we can talk via Skype [14:52:08] yeah [14:52:12] please [14:52:14] Gotta multitask :) Email is preferred. 
[14:52:20] sure [14:52:27] I will email you first [14:52:31] then please reply to me [14:52:33] is it okay [14:52:48] ? [14:52:49] ahalfaker-at-wikimedia.org [14:52:55] your email, right? [14:53:30] Yup [14:54:10] email sent [14:54:16] please check [14:55:12] Sending... [14:55:27] So, just to be clear, those datasets are balanced sets of observations. [14:55:40] sorry, what does it mean? [14:55:47] sorry if it is a noob question :-( [14:55:53] Not at all :) [14:56:01] Just woke up. Words are hard :\ [14:56:03] So. [14:56:15] We have roughly the same number of obs. for each label [14:56:20] ah I see [14:56:22] no problem [14:56:27] So in the enwiki dataset, we have 30k obs. [14:56:29] 5k per label [14:56:29] I got it from the code in github [14:56:33] Cool. :) [14:56:34] yeah [14:57:28] but how did you separate the train and test sets? [14:57:34] 80/20 [14:57:38] or 50/50 [14:57:40] or ... [14:58:30] 80/20 [14:59:43] I got the file [14:59:54] but could you please give me the col names :( [14:59:59] sorry if I asked too much [15:00:04] Not at all. :) [15:00:07] but now I cannot understand what is what [15:00:11] the last column is the class [15:00:13] :) [15:00:16] but that's it [15:00:22] See this dir: https://github.com/wiki-ai/wikiclass/tree/master/wikiclass/feature_lists [15:00:35] Those two files contain the descriptions of the columns. [15:00:47] See the wp10 lists. [15:00:48] yeah [15:00:49] I got it [15:01:04] The order in that list matches the column order in the file. [15:06:36] yup [15:29:33] hello halfak [15:29:38] are you still there :-) [15:29:50] I am :) [15:30:06] could you please provide me the dataset with the content also? [15:30:25] if you don't mind :) [15:30:35] and I think I have something to discuss about the prediction [15:30:40] so again, if you don't mind :) [15:31:14] Yeah. So this one is bigger. I'll try compressing to see if we can get it down to email-size. [15:31:39] yeah, otherwise please find another host :) [21:06:02] hello everyone! 
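The dataset layout halfak describes above (a balanced set, one row of feature values per observation with the label as the last column, split 80/20) can be sketched as follows. The six label names are assumed from the standard enwiki wp10 quality scale (30k obs at 5k per label implies six classes), and the data itself is synthetic:

```python
# Sketch: a balanced dataset (equal observations per label) with an
# 80/20 train/test split. 100 rows per label stands in for the real
# 5k per label; the dummy feature vectors stand in for the 11 features
# mentioned above.
from collections import Counter
from sklearn.model_selection import train_test_split

labels = ["Stub", "Start", "C", "B", "GA", "FA"]  # assumed wp10 scale
rows = [([0.0] * 11, lbl) for lbl in labels for _ in range(100)]
X = [feats for feats, _ in rows]
y = [lbl for _, lbl in rows]

# stratify=y keeps the 80/20 split balanced per class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(Counter(y_train))  # 80 of each label
print(Counter(y_test))   # 20 of each label
```

The `stratify` argument matters here: without it a random 80/20 cut would only be balanced per class in expectation, not exactly.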
[21:06:18] * yuvipanda waves too [21:07:42] I'd like to contribute to the development of the AI tools used, in some way [21:07:58] the blog says to ask for more info here [21:08:55] \o/ awesome [21:09:11] what kind of work/programming would you like to do? [21:10:12] I have >5yrs experience in Python and have actively followed research in the field of ML/DL for the last 4 yrs [21:11:09] Also, I participate in the closed Russian 'opendatascience' community, represented by data analysts from major players in the field of data analysis and ML [21:12:11] Actually, I'm in doubt if I can immediately start with "Wiki tool developers (use ORES to build what you think Wikipedians need)", but "Modelers (Computer science, stats or math)" <- this position seems pretty suitable [21:12:59] nice! [21:13:14] I have an MSc in information technology, strongly fond of OSS, control theory, DSP, etc [21:13:16] soupault: so halfak is probably the person who can tell you all the things you can help with in that aspect [21:13:28] since he's the primary driver behind all the ML/modeling stuff [21:13:30] Have a couple of commits in pandas and scikit-image [21:14:01] Great! Should I leave my contacts somewhere? [21:14:28] soupault: hmm, you could hang around on IRC, or put your email here (or PM me if you don't want it to be public) [21:15:29] yuvipanda, ok, cool! I'm using IRC, but not so regularly (a couple of days a week) [21:55:01] o/ hey! I saw pings. Reading scrollback [21:55:45] Hey soupault! [21:55:53] Glad to have you swing by. [21:55:55] hey! [21:56:03] We're definitely interested in your help. [21:56:22] We organize hack sessions on Saturdays (UTC 1300-1800ish) [21:56:28] Would you be able to join us this weekend? [21:56:34] I guess I have a lot of work to do before I can really be useful [21:57:02] I'm not so sure. I was just thinking that I wanted to show you a model tuning utility that we put together. [21:57:02] halfak, utc+00? [21:57:18] Yeah. 
We're geographically distributed, so UTC is the way we talk :) [21:57:33] sounds good to me [21:59:59] Cool. Also, I got the UTC time wrong :\ I meant to say UTC 1500-2000 [22:00:13] * halfak needs to sleep in a little bit :D [22:00:16] I still confirm ;) [22:00:43] Awesome. I'll be happy to give you a tour of what we've got. We're ridiculously understaffed, so I'm sure we'll find something :D [22:02:58] OK, thanks for being kind and open and talk to you later! I have to sleep now... [22:04:20] Have a good one! [22:04:22] o/ [22:04:42] Yessss. Moar collaborators... MWahahahaha [22:04:42] :)