[07:07:13] wiki-ai/revscoring#518 (master - afa65c3 : Amir Sarabadani): The build was broken. https://travis-ci.org/wiki-ai/revscoring/builds/107713047
[12:11:13] legoktm: hey, around?
[12:11:38] (CR) Hoo man: [C: -1] "This is not going to work as you expect, as the rows will only be populated after all jobs ran. Thus you need to run the script, wait for " (3 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795) (owner: Ladsgroup)
[13:21:05] o/ Amir1
[13:21:18] Just about to head into a meeting and need to prep, but thought I'd get an update.
[13:21:21] I saw your email.
[13:21:57] (PS4) Ladsgroup: Add PopluateDatabase.php [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795)
[13:22:10] awesome
[13:22:14] what do you think
[13:22:26] I think about 1% of edits would be dropped
[13:22:49] (CR) jenkins-bot: [V: -1] Add PopluateDatabase.php [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795) (owner: Ladsgroup)
[13:22:56] ladsgroup@ores-compute:~/wb-vandalism/datasets$ wc -l wikidata.features_reverted.general.nonbot.500k_2015.tsv
[13:22:57] 180291 wikidata.features_reverted.general.nonbot.500k_2015.tsv
[13:25:24] (PS5) Ladsgroup: Add PopluateDatabase.php [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795)
[13:25:32] Amir1, so, it looks like that error is due to non-mainspace edits
[13:25:36] Which I think is fine.
[13:25:43] yesssssssssssssssssssssss
[13:25:44] they will not appear in the output.
[13:25:44] yes
[13:25:48] sssssssssssssssssssssssssssssssssssssssssssssssss
[13:25:50] :D
[13:26:09] should we move on to writing or do some writing?
[13:26:38] I'm torn because time is tight, but this analysis is critical
[13:26:42] I don't know much about writing at academia
[13:27:13] Maybe you can take over the analysis from here.
[13:27:26] Yeah
[13:27:29] I do the rest
[13:27:40] Do you have much experience with a plotting library?
[13:27:46] I ask question when I saw something
[13:27:54] I've worked with matlab and octave
[13:28:00] also matplotlib in python
[13:28:32] Great. So, we'll need some nice vector plots of (at least) the precision/recall curves.
[13:29:50] hmm, Do I have the test set?
[13:30:58] Amir1, was planning to split the 500k
[13:31:50] it'stoo late for this one
[13:32:09] we can do it for other models (not the first one)
[13:34:40] Oh! Because you already started extraction of features?
[13:35:08] That's OK. I think we can always do it again. It only takes ~30 minutes to extract features for the entire set.
[13:35:27] Let's use what you have for now and plan on a second pass.
[13:37:33] ok halfak
[13:37:47] what should I do now exactly
[13:38:09] Get those models built.
[13:38:28] We'll see what they look like.
[13:39:19] BRB coffee
[13:47:46] halfak: It's taking very long time for this 500K set
[13:48:08] how long is that?
[13:48:30] several hours has passed and we are 200K edits
[13:48:43] *at 200K
[13:49:55] also training models may take long time too halfak
[13:50:11] Na. Should be on the order of 10 minutes. :)
[13:50:19] Amir1, gotcha.
[13:50:21] Doing general now?
[13:50:48] yeas
[13:50:51] *yeah
[13:51:21] Hmm... Might take a bit longer with all.
[13:51:51] Now that I think about it, we don't need to do a test split.
[13:52:02] I think we want our stats from cross-fold validation anyway.
[14:05:58] halfak: that's better
[14:06:20] 215K
[14:06:29] Amir1, will work on getting data for the "all" set right now.
[14:06:46] awesome
[14:06:55] You know, we don't necessarily need to extract features for all sets individually.
[14:06:56] the system is working for general right now
[14:07:04] We could do it once and then subset
[14:07:10] But that's ticklish work.
[14:07:15] That's easy to mess up.
[14:07:16] exactly
[14:07:25] I thought about it to
[14:07:45] but I said it doesn't worth the issues
[14:08:03] Yeah. Regardless, I'll extract the "all" right now and we can re-assess.
[14:09:43] Amir1, how do I get the update for pywikibase?
[14:10:12] Good Q
[14:10:21] I haven't published another version
[14:10:30] so you need to either download from master
[14:10:45] or change it in the directory
[14:11:02] (apply the patch individually)
[14:14:58] OK. WOrks
[14:16:21] awesome
[14:40:04] halfak: In the mean time what I need to do
[14:41:50] Amir1, do you have a summary of the OSM paper?
[14:42:10] I have a very little of it
[14:42:21] I thought even it's not needed at all
[14:42:26] maybe one sentence
[14:42:36] If you want I can check again
[14:44:29] Yeah. I want to see your summary and check it out quickly.
[14:51:44] halfak: https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service/Work_log/2016-02-03#neis2012towards
[14:52:01] Thanks Amir1
[15:09:03] * halfak edits the report
[15:38:00] Amir1, just wrote a bunch of new intro. Please review.
[15:38:04] https://meta.wikimedia.org/wiki/Research:Building_automated_vandalism_detection_tool_for_Wikidata#Introduction
[15:38:18] :yessssss
[15:38:25] thanks
[15:40:13] halfak: aliases are not unique per lang.
[15:40:35] they are different but each lang has 0 - 10000000 aliases :D
[15:45:01] halfak: I fixed some typos
[16:01:02] halfak: what do you think of the image?
[16:01:03] http://mw-revscoring.wmflabs.org/wiki/Special:Preferences#mw-prefsection-betafeatures
[16:01:41] Looks good to me.
[16:02:29] we are ready to go
[16:02:41] one patch will be merged soon,
[16:02:43] Amir1, I'm not seeing the orange background color in the RC feed
[16:03:13] in recent changes?
[16:03:48] Yeah
[16:03:54] SPecial:RecentChanges
[16:03:59] I see the orange "r"
[16:04:01] because no edits has been done in past week
[16:04:01] http://mw-revscoring.wmflabs.org/w/index.php?title=Special:RecentChanges&days=30&from=
[16:04:05] But not a highlighted background.
[16:04:51] https://usercontent.irccloud-cdn.com/file/8MA13KUQ/
[16:05:00] try the hard refesh
[16:05:12] Yeah. Mine doesn't look like that. Hmmm.
[16:06:44] can you check console log?
[16:06:59] oh, have you enabled it?
[16:07:26] ORES, yes.
[16:08:38] I think I know why
[16:08:47] halfak: are you using enhanced RC?
[16:09:03] Maybe not.
[16:09:16] It doesn't work in enhanced RC because the r flag is at first of the page
[16:09:27] does it group edits? if yes, it's enhanced
[16:09:36] it's default you need to change it
[16:09:39] Whatever was enabled by default
[16:09:51] Amir1, oh. We should work around that then.
[16:10:39] Got it.
[16:11:08] Without js it's impossible
[16:11:14] I don't disagree
[16:11:34] but I'm not really in favor, because the flag is already visible enough to people
[16:11:51] and enhanced view is being used very seldom
[16:11:52] OK. That's fine with me.
[16:11:59] Amir1, enhanced view is the default
[16:12:04] It mist be used by most
[16:12:07] *must
[16:12:09] in vagrant yes
[16:12:15] but not by WMF Wikis
[16:12:18] AFAIK
[16:12:19] My default in enwiki is enhanced view.
[16:12:47] Yup.
[16:12:52] I never touched it.
[16:14:20] Huh. I just registered an account and it did not have it turned on by default.
[16:14:23] So maybe you are right.
[16:14:47] I can do some research about it
[16:16:25] I have 231k features extracted for the "all" feature set.
[16:17:04] for me it's 334K
[16:26:36] halfak: https://phabricator.wikimedia.org/T37785
[16:27:52] https://gerrit.wikimedia.org/r/#/c/124292/
[16:29:58] halfak: it's not default
[16:30:12] I can find a work around for this but it may take some time
[16:30:26] depends on you :)
[16:31:12] Na. Let's move forward and leave it as an open bug
[16:31:41] sure
[16:32:05] I disable it for our VM too
[16:33:01] Amir1, what's the behavior of the extension if we have a couple hours of downtime for ORES?
[16:33:09] Also, what about a couple minutes?
[16:33:33] It won't store scores in that time span
[16:33:35] BUT
[16:33:46] I wrote a maintenance script that fills that gap
[16:33:58] and should be ran after every down time
[16:34:11] nothing else
[16:34:49] I thought about it, that's why the maintenance script is one of important parts of the MVP
[16:34:59] and essential before deployment
[16:35:07] Amir1, so we manually kick off a script for each downtime?
[16:35:36] In labs, we have a 2 minute downtime (usually DNS issues out of our control) once every two days or so.
[16:37:48] Wow. That was a long read.
[16:38:02] halfak: two min is not big for fa.wp
[16:38:23] but for Wikidata or en.wp we should do it on regular basis
[16:39:02] thank you halfak :)
[16:39:13] Any way we can detect this and have it happen automatically?
[16:40:12] I should think about it
[16:41:10] there are several ways but each of them has its own downgrades
[16:41:23] we can keep jobs
[16:41:29] *hold jobs
[16:41:33] (automatically)
[16:41:48] we can run that maintenance script automatically
[16:42:05] we can attach it to a hook
[16:42:12] So, we get a ping from icinga every time that the service goes down and another when it comes back up.
[16:42:38] Yeah. I'm thinking that we need a variable of sorts if the server responds with a 500 error.
[16:42:54] As soon as we get a non-500 error, the maintenance script should be started and the variable reset.
[16:45:46] also you can put this down time into really quiet time
[16:45:55] to reduce the impact
[16:46:07] if it's a maintenance down time
[16:47:08] Amir1, indeed, but with our targeted us-case for ORES, we'll want it to be run ASAP when ORES comes back online.
[16:47:34] Given that, once an edit is out of the first couple pages of the RC, no one will see it.
[16:47:51] yeah
[16:47:55] I agree
[16:48:06] I think about the best way
[16:48:19] i.e. I mean, we should do both
[16:49:28] I bet legoktm would have some good advice.
[16:49:51] If you're around, our general question is about the ORES maintenance script that runs after downtime or after we deploy an updated model.
[16:50:21] It seems problematic to run it manually.
[16:50:35] But automatically running maintenance scripts seems like it would be a problem too.
[16:51:11] I do some research about it
[16:51:14] don't worry
[16:51:32] I come back with an answer to you
[16:51:47] kk Thanks Amir1
[16:52:23] I'm not feeling great today, so I'm going to go lay down for a bit. I'll be checking on the feature extraction stuff, but otherwise on and off IRC.
[16:52:37] So, if I don't respond here, ping me on telegram.
[16:52:45] sure :)
[16:53:01] tell me what to do in the mean time re. the paper
[16:54:08] I think we need some more substantial discussion of Martin and Stefan's methods.
[16:54:36] We might also look at our list of labeled vandalism (from a work log last week) and reflect on how much would be caught with "rollback"
[16:55:08] In the meantime, I'll look into how to discuss the OpenStreetMap paper and frame the argument around "efficiency of patrolling"
[16:55:28] Sound good, Amir1?
[16:55:34] great
[16:55:43] Cool.
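The "variable of sorts" idea from the downtime discussion above (set a flag on a 500 response, then start the maintenance script and reset the flag on the first non-500 response) could look roughly like the sketch below. This is only an illustration under stated assumptions, not the ORES extension's code: `DowntimeTracker`, `observe` and `run_backfill` are hypothetical names, and treating every 5xx status as "down" is an assumption that goes slightly beyond the plain "500 error" mentioned in the chat.

```python
from typing import Callable


class DowntimeTracker:
    """Remembers that the scoring service was down and backfills on recovery.

    Hypothetical sketch: the names here are illustrative and do not come
    from the ORES extension.
    """

    def __init__(self, run_backfill: Callable[[], None]) -> None:
        # Stand-in for "start the maintenance script".
        self.run_backfill = run_backfill
        # The "variable of sorts": True while the service is believed down.
        self.service_down = False

    def observe(self, status_code: int) -> None:
        """Record the HTTP status of one response from the scoring service."""
        if 500 <= status_code < 600:
            # Assumption: any 5xx response counts as downtime.
            self.service_down = True
            return
        if self.service_down:
            # First healthy response after an outage: fill the gap first,
            # then reset. If the backfill raises, the flag stays set and
            # the next healthy response triggers another attempt.
            self.run_backfill()
            self.service_down = False
```

Calling `observe` with the sequence 200, 500, 503, 200 would trigger exactly one backfill, on the final 200. Whether the backfill should run inline or be queued as a job is left open here, as it was in the discussion.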
[16:55:47] * halfak moves towards couch
[17:14:36] (PS5) Ladsgroup: ores extension as a beta feature [extensions/ORES] - https://gerrit.wikimedia.org/r/268345 (https://phabricator.wikimedia.org/T125762)
[17:34:30] (PS6) Ladsgroup: ORES extension as a beta feature [extensions/ORES] - https://gerrit.wikimedia.org/r/268345 (https://phabricator.wikimedia.org/T125762)
[17:36:02] (CR) Ladsgroup: [C: 2] ORES extension as a beta feature [extensions/ORES] - https://gerrit.wikimedia.org/r/268345 (https://phabricator.wikimedia.org/T125762) (owner: Ladsgroup)
[17:37:10] (Merged) jenkins-bot: ORES extension as a beta feature [extensions/ORES] - https://gerrit.wikimedia.org/r/268345 (https://phabricator.wikimedia.org/T125762) (owner: Ladsgroup)
[17:54:30] (CR) Hoo man: "I think the script should have a batch size parameter and then go over all entries on its own, otherwise it's not very usable for Wikidata" (5 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795) (owner: Ladsgroup)
[18:23:20] Checking on stuff now. Looks like I have features extracted for 405k revisions.
[18:46:22] Hmm... Seems like the script is hung there.
[18:46:26] That doesn't seem right.
[19:12:12] (CR) Ladsgroup: "I chose 50 because ores.wmflabs.org doesn't support bigger batches." (5 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795) (owner: Ladsgroup)
[19:14:48] (PS6) Ladsgroup: Add PopluateDatabase.php [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795)
[19:15:51] (CR) jenkins-bot: [V: -1] Add PopluateDatabase.php [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795) (owner: Ladsgroup)
[19:21:49] (PS7) Ladsgroup: Add PopluateDatabase.php [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795)
[19:27:30] (PS8) Ladsgroup: Add PopluateDatabase.php [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795)
[19:34:27] https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service/Work_log/2016-02-08
[19:37:41] that's good
[19:37:51] better than most of papers I've seen
[19:37:55] halfak: ^
[19:38:14] and with realtime approach
[19:38:20] I'm starting on general_and_user now
[19:38:37] awesome
[19:38:50] let me how the general one is going
[19:39:05] Not doing general. Working backwards from "all"
[19:39:11] Figured you were running the "general" one.
[19:39:40] yeah. I see
[19:40:02] Going to go lay back down
[19:40:11] ok
[19:40:34] Some notes here too https://meta.wikimedia.org/wiki/Research_talk:Building_automated_vandalism_detection_tool_for_Wikidata
[21:01:52] (CR) Ladsgroup: "@hoo: What do you suggest to boost performance of the select query" [extensions/ORES] - https://gerrit.wikimedia.org/r/268874 (https://phabricator.wikimedia.org/T123795) (owner: Ladsgroup)
[21:27:02] Looks like this model takes a long time to test.
[21:27:08] The RandomForest
[21:27:27] We spend most much more time testing than training.
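The review comments on PopluateDatabase.php above ask for a batch size parameter so the script can walk over all entries on its own, with 50 given as the largest batch ores.wmflabs.org accepted. A minimal sketch of that batching step follows. It is written in Python for illustration, while the actual maintenance script is PHP; `batches` and `rev_ids` are hypothetical names, and the scoring request itself is deliberately left out.

```python
from typing import Iterator, List, Sequence


def batches(rev_ids: Sequence[int], batch_size: int = 50) -> Iterator[List[int]]:
    """Yield consecutive slices of rev_ids holding at most batch_size items.

    The default of 50 mirrors the limit mentioned in the review; the
    function name and signature are illustrative only.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rev_ids), batch_size):
        # Slicing past the end is safe, so the final batch may be shorter.
        yield list(rev_ids[start:start + batch_size])
```

A caller would loop over `batches(all_rev_ids)` and send one scoring request per batch, so the whole table is covered without any single request exceeding the limit.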