[11:22:55] (CR) Reedy: "Yup, definitely a PhpStorm bug. I guess my patch should be (partially) reverted out..." [extensions/ORES] - https://gerrit.wikimedia.org/r/269663 (owner: Aude)
[11:25:48] (CR) Reedy: "Should be partially/manually reverted" [extensions/ORES] - https://gerrit.wikimedia.org/r/269555 (owner: Reedy)
[11:57:08] (PS12) Ladsgroup: Add PopluateDatabase.php [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795)
[12:07:30] (PS13) Ladsgroup: Add PopulateDatabase.php [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795)
[13:19:58] halfak: o/
[13:20:04] tell me when you're around
[13:38:08] o/ Amir1
[13:38:10] Just got up.
[13:38:15] Getting coffee and stuff.
[13:38:39] I'll have a dataset with scores computed from all of the models within the hour.
[13:40:43] awesome
[13:40:54] In the meantime I'm trying to write
[13:41:07] the feature engineering section
[13:59:58] halfak: Can you give me a chart like this? https://meta.wikimedia.org/wiki/File:ORES_server_response_timing.svg
[14:00:02] for wikidata only
[14:00:24] one of them is enough, or, to compare, let's say we have en.wp too
[14:00:37] (if you give me the data, I'll plot it)
[14:20:03] Amir1, actually, I think it would be great if you got that data.
[14:20:19] You should be able to just take a random set of revisions and request scores from the API.
[14:20:43] sure, how many?
[14:20:56] 10k should be plenty
[14:21:02] Actually, 1k should be plenty
[14:21:05] kk
[14:21:10] I'm on it
[14:21:44] I'm just finishing the scores for the all.rf model and then I'll dig into the plots & stats.
[14:22:13] I have an all.gradient_boosting model for comparison and it doesn't seem to perform as well.
[14:22:17] Which is a good sign, I guess.
[14:22:28] That we picked the better modeling strategy.
[14:23:07] yay
[14:41:09] Amir1, I just finished producing the scores dataset. See datasets/wikidata.reverted.experimental_models.test_scores.tsv in the branch we have been working from in wb-vandalism
[14:41:15] I'll be generating some plots for this today.
[14:41:27] yay
[14:41:40] I'm about to finish getting data for server response
[14:41:44] Whoops! Forgot to include the label in that dataset.
[14:41:47] Grabbing that now.
[14:42:35] Amir1, requesting 1 revision at a time or 50-revision batches?
[14:42:45] Seems like it would be good to plot the timing for both.
[14:43:01] sure
[14:43:03] Also, it would be nice to run the script over the same revisions again to get the cache timing. :)
[14:43:22] ok
[14:53:47] Amir1, when you get a chance, can you figure out KDD's submission system and upload a draft PDF?
[14:54:03] We'll be able to upload new versions, but it is important that you go through the whole flow in preparation for tomorrow.
[14:54:13] sure
[14:54:30] * halfak <3's the paste command.
[14:54:47] Never used it before. It works wonderfully for joining datasets together, assuming consistent ordering :)
[15:15:39] running the time scorer
[15:19:09] halfak: there are two tracks, "research track" and "applied data science track". Which one should I choose?
[15:19:54] research, probably.
[15:20:00] Can you link me to the distinctions?
[15:20:28] http://www.kdd.org/kdd2016/calls
[15:22:17] Hmm... Looks like applied data science might be good for us after all.
[15:22:24] "practical tasks and practical settings"
[15:22:27] Fits us well.
[15:22:36] It's funny that the distinction is made from "research"
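A minimal sketch of the timing measurement being set up here: read sampled revision IDs, request reverted-model scores from ORES one at a time (and optionally in 50-revision batches), and record wall-clock time per request. This is a hypothetical reconstruction rather than the actual time_scorer.py; the single-score URL pattern is confirmed later in the log, but the batch "revids" parameter is an assumption about the 2016 ORES API.

```python
# Hedged sketch of a time_scorer-style script (not the real one from
# Amir1's gist). Reads revision IDs from stdin, one per line.
import sys
import time
import requests

BASE = "http://ores.wmflabs.org/scores"  # confirmed later in this log

def time_single(wiki, model, rev_id):
    """Time one single-revision score request; returns seconds."""
    start = time.time()
    requests.get("%s/%s/%s/%d/" % (BASE, wiki, model, rev_id))
    return time.time() - start

def time_batch(wiki, model, rev_ids):
    """Time one batched request (e.g. 50 revisions); returns seconds.
    The 'revids' parameter is an assumption about the API."""
    start = time.time()
    requests.get("%s/%s/%s/" % (BASE, wiki, model),
                 params={"revids": "|".join(str(r) for r in rev_ids)})
    return time.time() - start

if __name__ == "__main__":
    rev_ids = [int(line) for line in sys.stdin if line.strip().isdigit()]
    for rev_id in rev_ids:
        print("%d\t%f" % (rev_id, time_single("wikidatawiki", "reverted", rev_id)))
```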
[15:33:19] halfak: OK, I'm able to upload and add data
[15:33:32] I need your and Dario's information too
[15:33:38] but it won't be hard
[15:34:30] biggest question: Subject Areas: Select one primary subject area.
[15:34:30] Primary: Deployed / Emerging / Discovery
[15:34:30] Which one? I think Deployed would be a good idea
[15:35:30] brb, meeting
[15:35:36] kk
[17:23:10] halfak: I've got these data and I'm trying to make some plots, but nothing good comes out of it
[17:23:34] Make sure you log the request time.
[17:23:43] It's generally log-normally distributed
[17:24:31] I tried
[17:25:34] result: file:///home/amir/test.png https://usercontent.irccloud-cdn.com/file/ugzFw2Sr/
[17:25:58] I probably need to plot only the time
[17:28:39] when only time is logged(?)
[17:29:02] file:///home/amir/test.png https://usercontent.irccloud-cdn.com/file/IPEznRl7/
[17:29:19] We need more data
[17:29:28] it makes the plot smoother
[17:42:23] OMG the meetings are still going :(
[17:42:53] Amir1, I would fit a gaussian kernel on the log-scaled data.
[17:43:03] *fit is the wrong word
[17:43:19] Looks like this isn't log-scaled
[17:44:07] the second plot is the log of response time
[17:44:30] so the peak is at 10^0.2
[17:45:36] file:///home/amir/test.png https://usercontent.irccloud-cdn.com/file/uvIViScn/
[17:45:50] halfak: this one is log-scaled by matplotlib
[17:46:18] The x axis is deceiving ;)
[17:47:26] maybe it only log-scales the y axis
[17:47:52] definitely, this is not log-scaled
[17:48:37] Need to log-scale the y
[17:51:24] file:///home/amir/test.png https://usercontent.irccloud-cdn.com/file/TuLIBpiA/
[17:51:28] cumulative
[17:54:22] file:///home/amir/test.png https://usercontent.irccloud-cdn.com/file/DqNAA6RU/
[17:54:35] log-scaled both x and y
[17:59:17] I think this one is the best: https://usercontent.irccloud-cdn.com/file/IPEznRl7/
[17:59:28] halfak: but it depends on you
[18:06:18] halfak: my intuition about these data is that there are two peaks; one of them is (probably) for cached data, the second one is normal revs
[18:06:47] Agreed. Can you send me the raw data?
[18:07:16] sure
[18:09:41] Do you have rev_id in the dataset?
[18:10:20] yes
[18:10:29] I've done some tests too
[18:11:02] It seems late responses are because they got stuck in the queue
[18:11:14] the edits are not anything special
[18:13:36] halfak: It's the time scorer
[18:13:37] https://gist.github.com/Ladsgroup/6f54e24d3f3d1deebec4
[18:13:49] sorry, it's a little bit ugly
[18:14:19] https://tools.wmflabs.org/dexbot/res2.txt <- result for one edit at a time and wikidata
[18:15:09] https://tools.wmflabs.org/dexbot/res_batched.txt <- 50 edits batched and wikidata
[18:15:44] https://tools.wmflabs.org/dexbot/res_enwp.txt <- result for one edit at a time and English Wikipedia
[18:16:37] https://tools.wmflabs.org/dexbot/res_enwp_batched.txt <- result for batched and English Wikipedia
[18:16:56] how to get it:
[18:16:57] sql wikidatawiki "select rc_this_oldid from recentchanges where rc_Type = 0 order by rand() limit 5000;" | python3 time_scorer.py wikidatawiki --model=reverted > res2.tsv
[18:17:04] you get the idea
[18:25:34] I've got to go, be back soon
[18:27:17] Awesome. Will work on this as soon as I have finished my first pass through the fitness metric graphs.
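To make the log-scaling suggestion above concrete: rather than relying on matplotlib's axis log-scaling (which only transforms the axis), take log10 of the timings and estimate the density on that scale, e.g. with a gaussian KDE. A sketch, assuming a two-column rev_id/seconds TSV like res2.tsv with error rows already removed:

```python
# Sketch: histogram + gaussian KDE of log10(response time).
# File name and two-column tab-separated layout are assumptions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

times = np.loadtxt("res2.tsv", usecols=[1])  # seconds column
log_times = np.log10(times)

kde = gaussian_kde(log_times)
xs = np.linspace(log_times.min(), log_times.max(), 200)

plt.hist(log_times, bins=50, density=True, alpha=0.5)
plt.plot(xs, kde(xs))  # a bimodal bump would match the two-peaks intuition
plt.xlabel("log10(response time in seconds)")
plt.ylabel("density")
plt.savefig("test.png")
```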
[18:50:17] I've started writing the section regarding response time
[18:55:49] the second one is https://tools.wmflabs.org/dexbot/res_wikidata.txt, not res_batched
[19:00:48] Amir1, can you write up a quick worklog on how you generated the samples and what your general observations are?
[19:00:56] We can adapt that for the methods section of the paper.
[19:01:04] Sure
[19:01:20] I'm sorry I didn't do it sooner, I always log everything
[19:03:17] No worries. Was half expecting you to say "It's already there" :)
[19:19:31] halfak: put it there. Right now I'm trying to write it up on the research page
[19:20:05] Great.
[19:20:09] I'm just finishing up the plots.
[19:20:23] Will have an update in the worklog shortly.
[19:27:40] Amir1, what file format do you prefer for LaTeX?
[19:27:45] PS/PDF/PNG/SVG?
[19:27:52] PDF
[19:29:20] halfak: "In order to build a classifier usable by Wikidata users, we use Wikimedia Labs \cite{}, hosted by the Wikimedia Foundation. The environment, called ORES (standing for Objective Revision Evaluation Service), hosts machine learning classifiers for all projects hosted by the Wikimedia Foundation, including Wikipedia and Wikidata. ORES accepts two methods of
[19:29:20] scoring edits: one-by-one and batched. We tested ORES response time on 1000 randomly sampled edits. Response time for the one-by-one request method varies between 0.0076 and 14.6 seconds, with a mean of 0.66 seconds and a median of 0.53 seconds, while batching edits in packages of 50 yields response times between 0.56 and 13.9
[19:29:20] seconds, with a mean of 6.23 seconds and a median of 5.58 seconds. Studying response time for the one-by-one method, two peaks are noticeable: the first around 0.3 seconds and another around 0.55 seconds. The first peak may be cached scores [But we are not sure and let's leave it for future research]"
[19:29:30] I haven't even proofread this
[19:30:26] Looks like a good start
[19:30:40] Just about to make some edits to the worklog
[19:30:52] We might want to have a minor section devoted to the design of ORES
[19:30:58] As a realtime service
[19:31:08] Might be the same section where we talk about the importance of realtime.
[19:32:38] the section I just wrote comes exactly after the section regarding the importance of realtime
[19:32:53] Awesome
[19:50:25] Amir1, https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service/Work_log/2016-02-11#Fitness_curves_for_our_models
[19:50:37] \o/
[19:51:00] \o/
[19:51:02] \o/
[19:52:12] OK. Going to actually take a lunch break now.
[19:52:20] When I get back, I'll dive into the results section
[19:52:38] And work on writing up the argument about work efficiency, realtime and filter rates.
[19:52:52] Maybe also some sadness about the predictive power of being anon. :/
[19:53:50] OK :)
[19:58:36] halAFK: what do you use to plot? They look really cool :D
[20:17:37] Amir1, it's ggplot in R.
[20:17:49] It's my favorite plotting library/system, hands down.
[20:18:00] the "gg" stands for "Grammar of Graphics"
[20:18:50] http://www.amazon.com/The-Grammar-Graphics-Statistics-Computing/dp/0387245448
[20:19:02] Holy moly, is that expensive
[20:19:44] Oh yeah. Before I get to writing, I'll work on the response timing plots.
[20:20:25] Amir1, I think the batch size of 50 needs more observations.
[20:20:39] Since we only "observe" one timing per 50 revisions.
[20:20:45] Then again, let me plot and assess.
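The summary statistics quoted in the draft paragraph above (min/max, mean and median for one-by-one vs 50-edit batches) can be recomputed from the timing TSVs with something like this sketch; the file names and tab-separated layout are assumptions:

```python
# Sketch: recompute the response-time summary statistics from the TSVs,
# skipping any rows that recorded an error instead of a timing.
import statistics

def summarize(path):
    with open(path) as f:
        times = [float(line.split("\t")[1])
                 for line in f if "error" not in line]
    return {"n": len(times), "min": min(times), "max": max(times),
            "mean": statistics.mean(times),
            "median": statistics.median(times)}

print(summarize("res2.tsv"))         # one-by-one requests
print(summarize("res_batched.tsv"))  # 50-edit batches
```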
[20:21:49] R code is pushed to the repo
[20:23:46] I just got back
[20:24:41] awesome, do you want to do more observations of 50 batches?
[20:25:15] Let me plot first and see how stable it is.
[20:34:12] Amir1, are any of these explicitly cached?
[20:34:31] E.g. run the request over the same data twice?
[20:34:33] I can't say
[20:34:39] every time I do a random sample
[20:34:48] so basically nothing is cached
[20:34:54] (explicitly)
[20:35:17] but I want to check it with explicit caching too
[20:35:36] do you want it for now, halfak?
[20:36:07] Yeah.
[20:36:11] Would like to have that in the graph too.
[20:37:27] sure, one quick question. How long do you keep cached scores?
[20:37:49] e.g. I requested several hours ago, is it still okay to ask again?
[20:38:03] Amir1, practically forever
[20:38:03] or should I do another batch and re-try right away
[20:38:25] kk
[20:42:38] Amir1, I wonder if you might grab a sample from 2014 so that we know they aren't (or shouldn't be) cached
[20:43:02] that's fairly easy
[20:43:40] I'm currently running the cached part
[20:44:01] it's done
[20:44:05] that was fast
[20:44:12] :D
[20:44:46] https://tools.wmflabs.org/dexbot/res2_cached.txt
[20:45:07] halfak: can you do some analysis on this while I run the sample set on 2014 edits?
[20:46:56] Yes.
[20:47:09] Is this link for wikidata singles?
[20:49:04] yes
[20:49:13] I can do cached batches too if you want
[20:50:00] the sql query is too damn slow
[20:51:16] lol
[20:51:25] Yeah, they are. :/
[20:52:11] Amir1, we have a few errored scores in the cached set that throw off the measurement.
[20:52:17] Looks like it is that pywikibase bug.
[20:52:22] E.g. http://ores.wmflabs.org/scores/wikidatawiki/reverted/293268611/
[20:52:32] Takes us 0.5 seconds to figure out we are going to error.
[20:52:36] Since we don't cache errors.
[20:53:05] I see
[20:53:32] I fixed one of the most important ones, but it's probably not deployed there
[20:53:45] yeah, that's the same error
[20:56:27] halfak: take them out, or fix pywikibase on the server
[20:56:37] whatever is more convenient for you
[20:57:35] Amir1, let's take them out for now.
[20:57:44] And get a new version of pywikibase on PyPI
[20:57:44] ok :)
[20:58:33] Heh. We have a weird spike at ~1 second.
[21:00:02] Looks like we get 0.5 seconds if the item is essentially empty.
[21:00:13] And we get 1.1 seconds if it is relatively substantial.
[21:00:48] pywikibase-0.0.4 is on the PyPI server now :)
[21:01:15] Great!
[21:03:21] halfak: btw, I did this: sql wikidatawiki "select rev_id from revision where rev_timestamp like '201406%' order by rand() limit 1000;" | python3 time_scorer.py wikidatawiki --model=reverted > res2_2014.tsv
[21:03:36] I used 201406 instead of 2014, because it's much faster
[21:03:40] is it okay?
[21:03:42] yes
[21:03:45] Looks good.
[21:04:10] Can you catch the error in time_scorer.py?
[21:04:48] I need to modify it a little bit, but it's possible
[21:04:49] I meant to say, don't produce a timing if there is an error.
[21:04:58] and easy to catch
[21:06:41] added
[21:06:47] re-running for the 2014 set
[21:08:50] halfak: do we need to re-run the other ones too?
[21:10:12] Amir1, I don't think so.
[21:10:19] kk
[21:11:08] Just to be clear, I'm hoping for a set of (1) wikidata singles, (2) wikidata batches, (3) wikidata cached
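A sketch of the error handling Amir1 just added to time_scorer.py: drop the timing whenever ORES reports an error for the revision, since errors are not cached and would skew the cached-set measurements. The response shape (a map from rev_id to either a score or an "error" object) is an assumption about the 2016 ORES response format:

```python
# Sketch: skip timings for errored scores (e.g. the pywikibase bug above).
import time
import requests

def timed_score(wiki, model, rev_id):
    """Return (seconds, score) or None if ORES reports an error."""
    start = time.time()
    resp = requests.get(
        "http://ores.wmflabs.org/scores/%s/%s/%d/" % (wiki, model, rev_id))
    elapsed = time.time() - start
    score = resp.json().get(str(rev_id), {})  # assumed response layout
    if "error" in score:
        return None  # don't produce a timing if there is an error
    return elapsed, score
```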
[21:11:23] Amir1, what machine are you running this from?
[21:11:43] tools
[21:12:10] because I'm not sure we have access to the replicas on ores-compute; I tried and it didn't work
[21:13:50] btw, have you seen my comment in the phab task regarding the ores extension? If the extension gets enabled on fa.wp, it makes 1.7 calls per min. I think ORES can handle that :D
[21:14:48] Yeah, definitely.
[21:15:05] Our precacher sends out about 10 calls per second.
[21:15:30] We should note in our writeup that we did this analysis against the live system during one of the most active times of the day.
[21:16:26] do you want a set of batches, singles and cached for 2014 too?
[21:16:49] we already have batches, singles and cached. Correct me if I'm wrong
[21:21:35] halfak: https://tools.wmflabs.org/dexbot/res2_2014.txt
[21:21:50] as you can see, some of them have "json error" in them :)
[21:21:55] throw them away
[21:23:10] That works
[21:24:22] some of them are non-main ns edits
[21:24:25] that's why
[21:24:43] It's OK. Joining to the page table would be crazy slow.
[21:25:11] How long did scoring 1k take?
[21:25:26] IDK, but it should be around ten min
[21:25:33] or less
[21:25:47] I think we should do 50k for batches
[21:26:07] that'll take ~1 hour
[21:27:04] but the sql query would take a really, really long time
[21:27:16] we are advised not to select more than 5k
[21:27:17] Amir1, I think it'll take the exact same amount of time to query
[21:27:40] Who says this? We regularly query 20k from quarry.
[21:27:55] I mean directly
[21:28:04] in the guidelines of mediawiki
[21:28:14] Manual:Database access, IIRC
[21:28:16] Oh. We're not writing mediawiki code :)
[21:28:30] okay :)
[21:28:34] How about we do 25k for batch and call it good?
[21:28:43] That should make for a good tradeoff.
[21:28:54] We're already regularizing the per-edit timing.
[21:29:21] I think if we query in a smaller time span
[21:29:28] it would be much faster
[21:30:16] yeah. The timespan is our most efficient filter. But I'd expect the query to still be quite quick.
[21:30:58] okay
[21:31:02] let me give it a try
[21:31:26] I just ran the query for 1k revs in 6 seconds.
[21:31:57] the query for 50k took 4.5 seconds O_O
[21:32:15] Must have kept the index in memory.
[21:33:13] depends on order
[21:33:22] order by rand() is a huge performance killer
[21:33:41] I did order by rand()
[21:33:51] I just copy-pasted your query and ran it with different limits.
[21:34:17] Here's the 50k version: "select rev_id from revision where rev_timestamp like '201406%' order by rand() limit 50000;"
[21:36:04] thanks
[21:37:06] https://commons.wikimedia.org/wiki/File:Ores_wikidatawiki_response_timing.single_batch_and_cached.svg
[21:39:55] awesome
[21:40:16] halfak: can you send me the result of this? select rc_this_oldid from recentchanges where rc_Type = 0 order by rand() limit 50000;
[21:40:27] put it somewhere and I'll download it
[21:40:57] Will need to filter out rc_deleted rows, but yes.
[21:41:09] Oh wait, no I don't :)
[21:41:33] throw them away later
[21:42:15] the graph is sexy
[21:42:46] :) I think we might just run with that.
[21:42:57] I had to do some cleanup, but I think it turned out OK.
[21:42:58] I definitely need to learn it
[21:43:15] But let your next timing query finish.
[21:43:22] I have code ready to update the plot with new data.
[21:43:43] I'm going to pull these plots into the results section and start writing around them.
[21:44:44] awesome
[21:45:45] How are you on sleep?
[21:47:44] not bad, slept 9 hours, so I can stay awake for much more now
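The single/batch/cached comparison plot linked above needs the three timing sets in one tidy table; a sketch of that assembly (the kind of cleanup halfak mentions), assuming two-column rev_id/seconds files where error rows carry text instead of a number:

```python
# Sketch: combine the singles, batched and cached timing sets for plotting.
import pandas as pd

frames = []
for label, path in [("single", "res2.txt"),
                    ("batch", "res_batched.txt"),
                    ("cached", "res2_cached.txt")]:
    df = pd.read_csv(path, sep="\t", names=["rev_id", "seconds"])
    # Drop rows where the timing column holds an error message, not a number.
    df = df[pd.to_numeric(df["seconds"], errors="coerce").notna()]
    df["seconds"] = df["seconds"].astype(float)
    df["type"] = label
    frames.append(df)

tidy = pd.concat(frames)
print(tidy.groupby("type")["seconds"].describe())
```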
[21:47:56] Bah! I just realized that I forgot the ROC curves!
[21:48:03] Damn it.
[21:55:38] I'm writing the acknowledgements section
[21:55:51] On another page, to avoid edit conflicts
[21:57:06] No worries. Working on graphs now.
[21:57:11] I'll only edit 1 section at a time.
[21:57:53] the scores are being generated
[22:02:12] Just added the ROCs
[22:02:14] https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service/Work_log/2016-02-11#Fitness_curves_for_our_models
[22:02:47] I only have 1 more hour today.
[22:02:53] Because I have to do other stuff.
[22:03:04] * halfak dumps stuff in the paper as fast as he can.
[22:04:44] \o/
[22:04:55] Okay, I'll try to do some polishing of my notes
[22:05:48] After you're done :)
[23:01:01] Amir1, https://meta.wikimedia.org/wiki/Research:Building_automated_vandalism_detection_tool_for_Wikidata#Results
[23:01:14] That's as far as I got. Off to my next meeting.
[23:01:21] reading
[23:02:43] awesome
[23:02:49] I'll polish everything
[23:02:54] (my work)
[23:03:10] I think our framing of this paper needs to be "we set a very useful baseline for vandalism detection in Wikidata"
[23:03:17] Useful in terms of effort reduction.
[23:09:14] okay
[23:09:18] like the blog post
[23:27:59] So, I have been thinking about the simple qualitative analysis we did of the hits and misses based on the prediction probability of the deployed model.
[23:28:26] I think we should include that as a note at the end of the results
[23:29:05] We can take the same opportunity to talk about the limitations in our methodology for generating a training and test set, as well as the limitations in Martin/Stefan's strategy.
[23:29:40] I think it would be worthwhile to call attention to the difficulty involved in generating a PAN-like test set that had *any meaningful amount of vandalism*.
[23:29:59] We'd need to review an order of magnitude more edits, and that is very expensive.
[23:30:28] Also, unlike Wikipedia, identifying vandalism in Wikidata requires one to know the rules and to be able to process contributions in a variety of languages.
[23:31:08] I'm reading the draft now
[23:31:22] you wrote "Replace the following paragraph with the new auto-labeling process. Find notes in the worklogs."
[23:31:50] "while the models that include user features are able to attain very high filter rates up to 89% recall. This implies a theoretical reduction in patrolling workload down to 1.5% of incoming human edits, assuming that it's tolerable to let 11% of potential vandalism be caught by other means."
[23:32:00] which is the same paragraph
[23:32:08] you added better numbers
[23:33:33] Also, it should be 1.8%; correct me if I'm wrong
[23:34:12] That's right
[23:34:47] but my question right now is: you added those numbers, so can we simply delete the paragraph in the red box?
[23:35:00] Oh. No.
[23:35:12] or should anything else be added to the results section
[23:35:18] I think we should keep that and move it to the bottom of the page.
[23:36:11] It should either be its own section at the end of Results or inside the Limitations section.
[23:36:32] It's not a robust/replicable analysis.
[23:36:50] I see
[23:36:58] But it is useful contextually
[23:37:09] Especially for describing the limitations of our results.
[23:37:53] I should add some stuff related to limitations
[23:38:26] e.g. as you said, checking edits in Wikidata is hard
[23:39:07] Also, our method for identifying a testing and training set involved the use of filters that are also encoded as features in our model.
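An illustration of the filter-rate arithmetic behind the quoted paragraph: sort edits by predicted risk, find the smallest cutoff that still catches the target share of vandalism (recall), and report what fraction of all edits a patroller would then review. This is purely illustrative; the paper's actual numbers (1.5% vs 1.8%) come from the real model scores, and the toy data below is random.

```python
# Sketch: fraction of edits needing review at a target recall.
import numpy as np

def review_fraction_at_recall(y_true, scores, target_recall=0.89):
    """Smallest share of edits to review so that recall >= target."""
    scores = np.asarray(scores)
    y = np.asarray(y_true)[np.argsort(-scores)]  # riskiest edits first
    recall = np.cumsum(y) / y.sum()
    k = np.searchsorted(recall, target_recall) + 1  # edits at the cutoff
    return k / len(y)

# Toy data: with uninformative random scores, catching 89% of vandalism
# requires reviewing ~89% of edits; a good model pushes this toward ~1-2%.
rng = np.random.default_rng(0)
y = rng.random(500_000) < 0.02  # ~2% vandalism rate, an assumption
s = rng.random(500_000)
print(review_fraction_at_recall(y, s))
```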
[23:39:45] Oh wait.
[23:39:47] is_bot etc.?
[23:40:07] Oh yeah. That's right. is_bot should actually be useless because it should be false for all of our edits.
[23:40:22] But I'm more worried about client edits and merge edits.
[23:40:35] In our analysis, we found that none of those were vandalism.
[23:40:50] So we excluded them from the possibility of being labeled "reverted"
[23:41:00] We also have "merge-edit" as an edit type feature.
[23:41:24] So our model -- if it is working as intended -- should very easily learn that merge edits are never labeled True
[23:41:53] we didn't label merges and client edits in our training system
[23:42:00] we only did it here
[23:42:15] In this case, we trained on the same type of data we tested on.
[23:42:20] It was just split 400k/100k
[23:42:26] From the original 500k set
[23:42:53] So, this is circular -- but also, our qualitative analysis suggests that it'll work in practice.
[23:43:03] The only problem I have is with our evaluation.
[23:43:12] It's going to be limited by the circularity of this.
[23:45:32] In the end, the reasonableness of this rests on our qualitative analysis of the edit subsets.
[23:45:38] Encoded in the worklog.
[23:45:45] OK. Looks like my meeting isn't happening.
[23:45:52] I'm going to go try to write that up.
[23:46:25] yay
[23:46:46] tell me when you're done
[23:47:30] Working on "Methods" now
[23:49:33] kk
[23:55:03] Amir1, where are the results of your analysis of reverted edit comments?
[23:55:23] Trying to work out what proportion would be missed had we relied on the Rollback-only method.
[23:57:01] it should be in related works
[23:57:08] because it was related to Martin
[23:57:31] halfak: "However, our classifier showed out of sample 698 ..."
[23:57:33] Yes. But where do I review your analysis so that I can incorporate the numbers?
[23:57:46] Oh! It's after methods.
[23:57:52] Sorry, I misunderstood
[23:58:34] Move it wherever you think is better
[23:58:48] No. That's great for now.
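For concreteness, a hypothetical sketch of the labeling rule discussed above: edits matched by the trusted filters (bot, client, merge) are never labeled "reverted", while the same signals also exist as model features, which is exactly the circularity halfak is worried about. All field names here are invented for illustration.

```python
# Hypothetical auto-labeling rule (invented field names, for illustration).
def auto_label(edit):
    """Training label: note the trusted filters mirror model features."""
    if edit["is_bot"] or edit["is_client_edit"] or edit["is_merge"]:
        return False                 # excluded from "reverted" by fiat
    return edit["was_reverted"]      # otherwise use the revert signal
```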