[00:12:59] o/
[00:12:59] Finally get to focus on Wikidata stuff
[00:13:26] I've got a new makefile that performs the train/test split and works with the new train/test scripts -- or so I hope.
[00:13:33] Gotta run some tests to see if the whole pipeline works.
[00:13:48] I also went for splitting the "all" feature set using the cut method to get feature sub-samples.
[00:13:58] Wasn't as hard as I expected.
[00:14:20] Now if only I could get more than 200KB/s from stat3
[00:16:02] man. 51MB of feature data so far.
[00:16:23] I should really be compressing these files.
[00:20:22] 100MB ugh
[00:20:28] C'mon
[08:06:36] (CR) Siebrand: [C: 2] "i18n/L10n reviewed. Test failed on submit. Let's see if that persists." [extensions/ORES] - https://gerrit.wikimedia.org/r/269552 (owner: Reedy)
[08:08:55] (Merged) jenkins-bot: Fix spaces to tabs [extensions/ORES] - https://gerrit.wikimedia.org/r/269552 (owner: Reedy)
[09:13:57] (CR) Ladsgroup: "recheck" [extensions/ORES] - https://gerrit.wikimedia.org/r/269555 (owner: Reedy)
[09:14:27] (CR) Ladsgroup: "recheck" [extensions/ORES] - https://gerrit.wikimedia.org/r/269554 (owner: Reedy)
[09:16:10] (CR) Ladsgroup: [C: 2] Add \ to global classes/functions [extensions/ORES] - https://gerrit.wikimedia.org/r/269555 (owner: Reedy)
[09:16:14] (CR) Ladsgroup: [C: 2] pngcrush pngs [extensions/ORES] - https://gerrit.wikimedia.org/r/269554 (owner: Reedy)
[09:20:33] (Merged) jenkins-bot: Add \ to global classes/functions [extensions/ORES] - https://gerrit.wikimedia.org/r/269555 (owner: Reedy)
[09:20:36] (Merged) jenkins-bot: pngcrush pngs [extensions/ORES] - https://gerrit.wikimedia.org/r/269554 (owner: Reedy)
[09:35:29] (PS1) Ladsgroup: Remove redundant array key [extensions/ORES] - https://gerrit.wikimedia.org/r/269635 (https://phabricator.wikimedia.org/T126397)
[09:38:18] (CR) Ladsgroup: [C: 2] Remove redundant array key [extensions/ORES] -
https://gerrit.wikimedia.org/r/269635 (https://phabricator.wikimedia.org/T126397) (owner: Ladsgroup)
[09:44:05] (CR) Ladsgroup: "We won't move this to external databases per our discussion with Jaime in https://phabricator.wikimedia.org/T123795" [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795) (owner: Ladsgroup)
[09:44:40] (Merged) jenkins-bot: Remove redundant array key [extensions/ORES] - https://gerrit.wikimedia.org/r/269635 (https://phabricator.wikimedia.org/T126397) (owner: Ladsgroup)
[09:58:34] (PS9) Ladsgroup: Add PopluateDatabase.php [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795)
[10:06:47] (CR) Ladsgroup: "Also, We don't need grouping. That was my mistake :)" [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795) (owner: Ladsgroup)
[10:33:10] (CR) Hoo man: "Ah ok, didn't see that for some reason. It's to join directly then… but I'm not sure about the performance implications (would need to loo" [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795) (owner: Ladsgroup)
[10:47:51] (CR) Hoo man: "* It's ok" [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795) (owner: Ladsgroup)
[11:51:35] (CR) Ladsgroup: "I don't quite understand the issue here. We have two batching.
First one is the whole query which is batched (using LIMIT) and then we nee" [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795) (owner: Ladsgroup)
[12:24:36] (PS1) Aude: Remove unused imports [extensions/ORES] - https://gerrit.wikimedia.org/r/269662
[12:24:39] (PS1) Aude: Remove some unnecessary slashes and rename variable [extensions/ORES] - https://gerrit.wikimedia.org/r/269663
[12:29:42] (CR) Ladsgroup: [C: 2] Remove unused imports [extensions/ORES] - https://gerrit.wikimedia.org/r/269662 (owner: Aude)
[12:30:35] (Merged) jenkins-bot: Remove unused imports [extensions/ORES] - https://gerrit.wikimedia.org/r/269662 (owner: Aude)
[12:31:46] (CR) Aude: "there is a typo in the script / class name. it should be named PopulateDatabase" (2 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795) (owner: Ladsgroup)
[12:34:03] (CR) Ladsgroup: "Reedy just added those yesterday: https://gerrit.wikimedia.org/r/#/c/269555" [extensions/ORES] - https://gerrit.wikimedia.org/r/269663 (owner: Aude)
[12:34:53] (CR) Aude: Add PopluateDatabase.php (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795) (owner: Ladsgroup)
[12:37:58] (CR) Aude: [C: -1] "between each select, I think we should have LBFactory::waitForReplication" [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795) (owner: Ladsgroup)
[12:40:02] (PS10) Ladsgroup: Add PopluateDatabase.php [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795)
[14:13:21] (PS1) Ladsgroup: Exclude bots [extensions/ORES] - https://gerrit.wikimedia.org/r/269672
[14:19:26] (CR) Hoo man: [C: 1] Exclude bots [extensions/ORES] - https://gerrit.wikimedia.org/r/269672 (owner: Ladsgroup)
[14:21:55] (CR)
Ladsgroup: [C: 2] Exclude bots [extensions/ORES] - https://gerrit.wikimedia.org/r/269672 (owner: Ladsgroup)
[14:23:53] (Merged) jenkins-bot: Exclude bots [extensions/ORES] - https://gerrit.wikimedia.org/r/269672 (owner: Ladsgroup)
[14:27:11] halfak: do you have some time to work on the paper?
[14:27:25] Amir1, yeah. Been working to get this analysis stuff done.
[14:27:42] I realized that in order to generate curves for each model and compare them, I needed a common test set.
[14:28:17] it wouldn't be hard
[14:29:14] yeah. Just finishing extracting features for the standard test set now.
[14:29:32] I split the data 400k/100k
[14:30:06] Oh! Looks like I'm blocked by some processes you have running right now on ores-compute.
[14:30:36] Are those feature extractors?
[14:31:04] mind if we pause that job?
[14:31:08] halfak: yes
[14:31:21] I meant I don't mind
[14:31:24] If you Ctrl-Z in that session, it should pause it.
[14:31:33] I was saying it's the feature extraction
[14:31:58] Weird. Looks like my process got hung.
[14:32:12] Just restarted it.
[14:32:30] BTW, have you had feature extraction proceed without issue?
[14:32:43] I've been running into a weird revision around 408k revisions in.
[14:32:51] I had to restart, it hung
[14:32:52] Not sure what is up with that.
[14:33:01] But it did eventually make it?
[14:33:05] I had the same issue
[14:33:38] It turns out there is "revid " in lines 408ish
[14:33:56] search in the file
[14:34:05] Oh!
[14:34:14] Weird!
[14:34:20] Let's kill that line!
[14:34:46] (PS1) Aude: Rename variable to use camelCase [extensions/ORES] - https://gerrit.wikimedia.org/r/269676
[14:35:10] I did it in my repo
[14:35:38] :P Push commit.
[14:36:36] Must have accidentally shuffled in the header row!
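Note on the shuffled-in header row (14:36:36): one way to keep a header out of the sample is to set it aside before shuffling and splitting. This is only a sketch under stated assumptions. The function name `shuffle_and_split`, the fixed seed, and the 20% test fraction (chosen to mirror the 400k/100k split mentioned at 14:29:32) are hypothetical and are not taken from the project's Makefile or scripts.

```python
import random


def shuffle_and_split(lines, test_fraction=0.2, seed=0):
    """Shuffle data rows and split them, keeping the header out of the shuffle.

    `lines` is a list whose first element is the header row.
    Returns (header, train_rows, test_rows).
    """
    header, rows = lines[0], list(lines[1:])
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(rows)          # only data rows are shuffled, never the header
    n_test = int(len(rows) * test_fraction)
    return header, rows[n_test:], rows[:n_test]
```

With 500k data rows and the default fraction this yields 400k training rows and 100k test rows, and the header can be written back at the top of each output file.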
[14:36:49] I think so
[14:37:06] but I'm not sure it's the main cause of the freezing
[14:37:26] Seems like that should have just resulted in a "revision not found" error
[14:37:57] yeah
[14:38:10] maybe it's just a coincidence
[14:38:22] Just pushed a change to the sample_subsets branch.
[14:38:55] (CR) Ladsgroup: [C: 2] Rename variable to use camelCase [extensions/ORES] - https://gerrit.wikimedia.org/r/269676 (owner: Aude)
[14:39:34] Arg! I just accidentally restarted the 400k edit extraction.
[14:39:37] Darn Makefile
[14:39:43] *sigh*
[14:39:47] I did too
[14:39:53] looks like this will be running for a while.
[14:39:58] this morning
[14:40:18] (Merged) jenkins-bot: Rename variable to use camelCase [extensions/ORES] - https://gerrit.wikimedia.org/r/269676 (owner: Aude)
[14:40:32] Make doesn't know that the change to the original 500k file wouldn't affect half of the downstream files. How could it?
[14:40:46] Still. I wish I'd had the foresight. That would have saved an hour or so of server time.
[14:40:50] Maybe two.
[14:42:53] Amir1, I just restarted the job on a different server so you are free to use ores-compute for whatever
[14:43:22] ok thanks
[14:43:24] I'll run some tests to figure out what is going on with revscoring extract_features.
[14:43:52] how do I continue a task stopped with Ctrl+Z?
[14:44:23] type "fg" into term
[14:44:28] The same one it is paused in
[14:44:36] "foreground"
[14:44:41] As opposed to "background"
[14:45:11] "fg" = run in the foreground
[14:45:17] "bg" = run in the background
[14:45:27] thanks
[15:39:30] (PS11) Ladsgroup: Add PopluateDatabase.php [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795)
[15:45:26] (CR) Ladsgroup: "Based on discussions in #wikidata. I think it's good to do. I wait for some days and then +2 it (if no one beats me to it).
There is one b" [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795) (owner: Ladsgroup)
[15:47:12] halfak: I'm done with the extension stuff now. I also need to wait until the feature extractor finishes its work
[15:47:30] in the meantime I want to do something, especially related to the paper
[16:00:01] Amir1: here :)
[16:00:17] hey
[16:00:36] We have several methods to work on revscoring
[16:00:51] introducing: revscoring is the ML-based scoring system
[16:01:05] I read about these things, but I'm unable to find a way to start, something that'll also give me at least a little bit of exposure to the research background, while contributing
[16:01:06] ores is the service
[16:01:33] let me find some materials for you
[16:02:07] codezee: https://meta.wikimedia.org/wiki/Research:Revscoring
[16:20:03] Amir1: I've gone through the overview, and what it's about :), are there any small tasks or feature additions that could be tried, to get myself familiar with any of the projects (ores or revscoring)
[16:24:17] codezee: have you checked this? https://phabricator.wikimedia.org/tag/revision-scoring-as-a-service/
[16:25:30] ohh, missed that, thanks...
[16:26:22] halfak can tell more about small tasks possible
[16:27:43] o/ Amir1
[16:27:51] So, thinking...
[16:29:28] This is not a rule so much as a guideline, but our bibliography is way too short.
[16:30:00] We should be able to pull in some good work that discusses quality in open knowledge projects and review processes generally.
[16:30:39] Check out the related counter-vandalism lit for what work they reference -- other than other wiki counter-vandalism papers.
[16:30:53] Write a short report on what you find and what you think is relevant/not.
[16:31:04] Amir1, if I had the time, I'd be doing that right now.
[16:32:22] I was afk
[16:32:24] reading
[16:33:15] halfak: do you mean the "related work" section?
[16:33:34] It's usually called "Related work" or is included as part of the introduction.
[16:34:01] It's the part of the research paper that substantially discusses past research and makes a case for how the current work fits within and builds upon past work.
[16:34:02] we have a summary of them in intro and more details in the dedicated section
[16:34:36] We're doing a good job when we have ~20-40 citations. Again, this is a helpful guideline, but not a rule.
[16:34:52] E.g. we could pick up Geiger's "The Banning of a Vandal" paper to talk about counter-vandalism networks.
[16:35:12] This would help us motivate the critical metric of "filter-rate @ high recall"
[16:35:46] Okay
[16:35:56] I'll try to do more and read more
[16:35:58] We could also pull in the "When the Levee Breaks" paper to talk about how critical automated quality control is to the functioning of a wiki at scale.
[16:36:24] Banning of a vandal: http://www.pensivepuffin.com/dwmcphd/syllabi/info447_wi12/readings/wk05-ConflictInCollaborations/geiger.BanningAVandal.CSCW10.pdf
[16:36:39] thanks
[16:36:42] When the levee breaks: http://grouplens.org/site-content/uploads/2013/09/geiger13levee-preprint.pdf
[16:37:22] the latter is a good example of a paper with few citations.
[16:39:39] thanks
[16:40:07] I'll read them, and I'll also try to find some papers using the refs of these papers (and some other papers around)
[16:40:31] :) cool
[16:42:01] if you have time please spend some time improving the draft
[16:42:13] I would really appreciate that
[16:42:43] Amir1, timing is wrong for that. Need to get analysis done first and then discuss what our story is. :/
[16:42:57] okay :(
[16:43:11] 340K done
[16:43:18] :)
[17:32:16] I've got to go
[17:32:22] be back in several hours
[17:35:43] OK. Talk to you later!
[17:35:58] I think I figured out why our feature extractor was just sitting there hanging.
[17:37:27] It looks like the problem is that multiprocessing.Pool doesn't handle a generator failing mid-generation (in this case, running into a value that is not an integer)
[17:37:40] And it will just hang rather than either failing or continuing on.
[18:50:48] halfak: 457K
[18:51:05] o/
[18:51:23] Just finishing up the test set now. 46/100k
[18:51:57] awesome
[18:56:39] Amir1, I also worked out the issue with the hanging on error for the extract_features utility.
[18:56:49] I need to file a bug against python's multiprocessing library
[18:56:57] yay
[18:57:11] that's great we know these issues
[18:57:28] (Even though it wasted several hours of servers)
[18:57:35] Yeah. Boo to that :\
[18:57:35] but we are fixing an upstream bug
[18:57:40] Yup :)
[19:04:14] See http://bugs.python.org/issue26333
[19:14:48] halfak: why doesn't it have a subscribe option
[19:14:58] it's really old ewwwww
[19:15:25] Yup. Super duper old.
[19:15:32] 481K
[19:15:53] For now, I'm going to work around it by using "map" instead of "imap", so there might be a slight issue with memory usage.
[19:16:07] Then again, we're going to use that amount of memory in training and testing the model anyway.
[19:16:30] 91/100k
[19:16:48] Now to start writing the script that will generate the scores I want.
[19:16:52] one thing: I simply removed the bad line in my file
[19:16:58] the tsv file
[19:16:59] Amir1, me too.
[19:17:05] I pushed the new file
[19:17:16] so it works without any issues atm
[19:17:34] So if you 'git pull', it'll try to overwrite (or maybe just notice that we removed the same line and not complain)
[19:17:47] why work-around?
[19:18:41] So that the next time we have such an issue in our revision sets, it errors appropriately.
[19:18:45] rather than hanging.
[19:20:50] cool
[19:20:54] we should also check that too
[19:21:07] using grep (for example)
[19:22:00] I suppose we can also work around it by ignoring lines in the file too.
[19:22:14] E.g. emit a warning, but skip that rev_id.
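Note on the two workarounds discussed above (19:15:53 and 19:22:14): they can be combined by validating the input in the parent process, skipping bad lines with a warning, and only then handing a fully built list to `Pool.map`. This is a rough sketch, not the revscoring `extract_features` code. `read_rev_ids`, `extract_all`, and the `extract` callback are hypothetical names introduced for illustration.

```python
import logging
from multiprocessing import Pool

logger = logging.getLogger(__name__)


def read_rev_ids(lines):
    """Yield integer revision IDs; warn about and skip malformed lines."""
    for line_number, line in enumerate(lines, start=1):
        value = line.strip()
        if not value:
            continue  # ignore blank lines silently
        try:
            yield int(value)
        except ValueError:
            # e.g. a stray header row such as "revid " in the middle of the file
            logger.warning("Skipping line %d: %r is not a rev_id",
                           line_number, value)


def extract_all(lines, extract, processes=4):
    """Run the hypothetical `extract` callback over all valid rev_ids.

    list() consumes the generator in the parent process first, so malformed
    input is dealt with before any work is dispatched to the pool. The cost
    is holding every rev_id in memory at once.
    """
    rev_ids = list(read_rev_ids(lines))
    with Pool(processes=processes) as pool:
        return pool.map(extract, rev_ids)
```

`extract` has to be a picklable, module-level function for `Pool.map` to send it to the worker processes.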
[19:23:11] yeah, that's doable
[19:23:28] we can't be too safe :)
[19:27:21] done
[19:27:29] let me train and test
[19:29:35] Great!
[19:29:38] I'm doing the same.
[19:29:48] Working on a script now that will generate scores for the entire test set.
[19:30:06] I'll use those to generate a set of predictions we can load into R/Python or whatever you want to start plotting in.
[19:41:33] sure
[19:51:50] The models. they train. Just waiting on one to run through the testing script I have.
[19:52:04] Oh wait. I have one!
[19:52:06] Mwahahaha
[19:57:18] (CR) Hoo man: [C: -1] Add PopluateDatabase.php (4 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795) (owner: Ladsgroup)
[20:00:39] Wooo works great!
[20:00:59] So, it looks like 99% of edits get a 0.0 prediction
[20:01:06] It's funny to see them scroll by
[20:01:31] 0, 0, 0, 0, 0, ***0.0125***, 0, 0, 0, 0, 0, ...
[20:02:07] Also, it looks like the RandomForest model can score about 20-30 revisions per second.
[20:02:38] Yikes! At this rate, it will take 55 minutes to score the whole test set!
[20:02:40] Bah
[20:03:02] I wonder if the GradientBoosting classifier is any faster.
[20:05:32] * halfak trains a gradient_boosting classifier
[20:05:50] heh. took about 10 seconds just to load all the training data into memory.
[20:10:02] halfak: https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service/Work_log/2016-02-10
[20:14:18] Woah.
[20:14:25] Looks like that model doesn't get us much at all.
[20:14:37] Most of our signal is coming from general stuff and user features.
[20:15:24] The stats output suggests that the model is basically useless.
[20:16:06] E.g. you have to set the threshold at 0 to get even 0.75 recall
[20:16:24] Which suggests that we have a lot of positive examples that get zero scores.
[20:22:27] halfak: should I git add the .model files and push them to GitHub
[20:22:29] ?
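Note on the threshold observation above (20:16:06 to 20:16:24): recall and filter rate at a given threshold can be computed directly from the scored test set. The sketch below assumes that an edit is flagged for review when its score is at or above the threshold, and that "filter rate" means the share of edits that are not flagged. Both definitions, and the function name, are assumptions made for illustration rather than the project's actual test script.

```python
def recall_and_filter_rate(scores, labels, threshold):
    """Return (recall, filter_rate) for a score threshold.

    scores: predicted probabilities, one per edit
    labels: True where the edit really is a positive example
    An edit is flagged when score >= threshold (assumed convention).
    """
    flagged = [score >= threshold for score in scores]
    positives = sum(1 for label in labels if label)
    true_positives = sum(1 for is_flagged, label in zip(flagged, labels)
                         if is_flagged and label)
    recall = true_positives / positives if positives else 0.0
    filter_rate = 1 - sum(flagged) / len(scores)
    return recall, filter_rate
```

If many positive examples receive a score of exactly 0, no threshold above 0 can flag them, so reaching high recall forces the threshold down to 0 and the filter rate collapses to 0 as well.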
[20:22:41] branch sample_subsets
[20:24:33] Amir1, not quite yet.
[20:24:54] ok
[20:25:01] If you git pull, you'll notice some big changes in how I regenerated the models. It uses the train/test scripts independently now.
[20:26:18] I got a *slightly* better result with my general model than you did.
[20:26:29] This is why it is important that we standardize the test set
[20:26:48] But, I think you can use the results we have to start writing about the general trend.
[20:26:48] I totally agree
[20:27:16] I.e. that general_user gets us almost everything that we get with *all*.
[20:27:35] context and type do not add much by the stats, but we've shown that they work in practice.
[20:28:06] oooh! Should write up our process for receiving feedback about false positives/negatives and add that to the paper.
[20:28:20] I did that
[20:28:26] Oh! Must have missed it.
[20:28:33] maybe not in great detail, but I wrote it
[20:28:35] What do you think about including the table in the paper?
[20:28:48] Amir1, should get it to a paper-quality description.
[20:29:04] the table or method?
[20:29:17] At least method. Not 100% sure about the table.
[20:29:24] If we have space, it would be nice to also have the table.
[20:29:36] Seems like this should be a subsection in "Methods"
[20:32:01] what do you suggest?
[20:32:02] as name of subsection
[20:33:03] Feedback and iteration
[20:33:30] okay cool
[20:33:55] A more standard term might be "feature engineering". To some extent, we crowd-sourced feature engineering.
[20:34:21] I must go, I'll be back soon
[20:34:44] OK
[20:34:47] I'll be around
[22:31:50] (CR) Reedy: "https://youtrack.jetbrains.com/issue/WI-30685" [extensions/ORES] - https://gerrit.wikimedia.org/r/269663 (owner: Aude)
[22:33:21] (CR) Aude: "@reedy thanks!"
[extensions/ORES] - https://gerrit.wikimedia.org/r/269663 (owner: Aude)
[22:33:44] (Abandoned) Aude: Remove some unnecessary slashes and rename variable [extensions/ORES] - https://gerrit.wikimedia.org/r/269663 (owner: Aude)