[00:34:32] yuvipanda: hey, if you have a minute, can you check https://gerrit.wikimedia.org/r/#/c/274912/
[00:34:37] thanks
[14:30:03] legoktm: an interesting bug: compare these two: http://mw-revscoring.wmflabs.org/w/index.php?title=Special:RecentChanges&hidenondamaging=1
[14:30:08] http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:RecentChanges&hidenondamaging=1
[14:30:19] both use the same source code
[14:30:33] but one of them doesn't show the "r" and highlight for new pages
[14:31:13] more interestingly, scores are okay in db of both of them
[14:31:25] (that's why hidenondamaging=1 works)
[15:06:58] o/
[15:07:00] Sorry for being late.
[15:07:02] Joining call shortly.
[15:07:04] Hangouts or Skype?
[15:07:06] Amir1 & ToAruShiroiNeko ^
[17:17:27] o/ plward13
[17:17:48] o/ halfak
[17:17:50] So, you've been looking at the PCFG code.
[17:18:22] Yes I have
[17:18:37] It seems like the library could use some refactoring.
[17:18:45] But otherwise it serves a purpose.
[17:18:57] Agreed
[17:19:05] aetilley used to be a frequent collaborator with us, but it seems he's mostly moved on to other stuff now.
[17:19:18] Oh, I see
[17:19:18] We can ping him with questions via github and he should respond.
[17:19:31] Good to know
[17:19:39] o/
[17:19:47] Last I heard he was working on getting good language assets for matching POS patterns.
[17:19:51] So there are multiple PCFG implementations on github, but aetilley's is made for this project?
[17:19:58] o/ Amir1
[17:20:21] Amir1, plward13 is hoping to look into some NLP stuff with us :)
[17:20:29] So I'm working on giving him an overview of aetilley's work.
[17:20:56] awesome
[17:21:01] :)
[17:21:05] plward13, have you found some other python-based PCFG implementations? And do any of the others look better?
[17:21:18] * halfak was under the impression that aetilley looked around before starting on his
[17:21:28] So there may be some reason we didn't invest in any others.
[17:22:50] There is this one https://github.com/usami/pcfg, but it's 3 years old
[17:23:41] Then there are a few Java implementations
[17:23:58] plward13, Looks like our next major step is to get a model trained on sentences from good articles in Wikipedia.
[17:24:20] then we'll want to figure out a good strategy for training on "vandalized" sentences.
[17:24:29] E.g. sentences touched in an edit marked as vandalism.
[17:24:44] We'll need a scheme for turning a diff into a set of changed sentences.
[17:24:59] I can help with the latter as I have ideas.
[17:25:11] The former I'd like to leave in your hands.
[17:25:53] Cool, I'm happy to help!
[17:26:13] Will we be implementing the difference between C_vandalism and C_regular as mentioned in the research paper?
[17:27:14] plward13, good Q. I didn't read that carefully. Is C_vandalism trained on the whole content of a vandalized revision or just the sentences touched in the vandalized edit?
[17:30:45] halfak, interesting, I had interpreted C_vandalism to mean just the sentences touched in a vandalized edit. However, when I re-read the paper, I see "Learn vandalism language by training a new PCFG parser C vandal using only those tree-banked documents in D that correspond to vandalism"
[17:31:30] Would a tree-banked corpus refer just to the sentences touched by vandalism?
[17:31:33] halfak: I just made the pahb card for ORES extension,
[17:31:35] *phab
[17:31:52] and made a note regarding some patches
[17:32:03] let me fix the translatewiki PR
[17:32:09] then you review it
[17:33:04] plward13, now that I look, I'm reading that as all of the sentences in a vandalized revision -- but regardless, we should train on the vandalized ones.
[17:33:19] Amir1, OK.
[17:33:24] I have 30 minutes left
[17:34:33] \o/
[17:35:28] halfak, right, that makes sense. Just to be sure, "edit" and "revision" can be used interchangeably?
[17:37:54] yes
[17:38:03] plward13, ^
[17:38:16] plward13, do you have a username on phabricator.wikimedia.org?
[17:38:32] No, I don't
[17:39:18] This is where we coordinate most of our work. See https://phabricator.wikimedia.org/tag/revision-scoring-as-a-service/
[17:39:35] Would be great to have you join us there and we can make some "tasks" for what you'd like to look at :)
[17:39:43] Awesome, thank you!
[17:40:15] halfak, I'd like to look at an example of training on just vandalized sentences. One example from the paper is "Beatrice Rosen (born 29 November 1985 (Happy birthday)), also known as Batrice Rosen or Batrice Rosenblatt, is a French-born actress. She is best known for her role as Faith in the second season of the TV series “Cuts”."
[17:40:33] We would take the first sentence to train the vandalism PCFG, but not the second sentence. Is that correct?
[17:40:49] Assuming that the example is one atomic revision.
[17:45:17] +1
[17:46:07] plward13, so, in the short term, I think the goal is to get the PCFG trained on real wiki content (biased samples are fine - just need to demonstrate that it works)
[17:46:20] Then we can collaborate on getting you vandalized sentences to train on.
[17:46:47] Then finally, we'll need to work out how to integrate the strategy into the feature extractor.
[17:47:15] I think this is going to work OK. We can do it inside our feature definition files. We'll essentially have a trained sub-model within our feature extractors.
[17:47:53] We'll need to experiment with how much space this sub-model takes up since "models" *know* their own "features" and therefore, the trained feature extractor will likely be stored within the model itself.
[17:48:16] which is probably good to make sure that the same PCFG that trained the model is used in testing and eventual scoring of edits in practice.
[17:48:22] Sound like a good plan?
[17:48:27] plward13, ^
[17:50:12] Bah! Deployment of ORES is going to be messed up by the uwsgi --> uwsgi-ores-web bug
[17:51:15] Yes, that sounds great! I have 2 questions just to make sure I'm on the same page. 1) Training the PCFG on real wiki content is effectively creating C_regular, and training another PCFG on vandalized sentences is creating C_vandalism? 2) What is the difference between a model and a sub-model?
[17:51:34] 1) Yup. That's right
[17:52:01] 2) Model would be the thing that predicts "is this edit vandalism" and the sub-model would be the PCFG that predicts "is this sentence vandalism"
[17:52:25] So we'll likely use a GradientBoosting classifier or RandomForest for the "model" and PCFG for the "sub-model"
[17:53:40] halfak, perfect, thanks for the clear explanation!
[17:54:46] halfak: ^
[17:55:02] also have you fixed the bug in main page of ores.wmflabs.org?
[17:55:22] I must note I haven't tested yet
[17:55:31] I was afraid you'd go
[17:55:50] Amir1, haven'
[17:56:00] OK
[17:56:01] t fixed the bug with the link to Revision scoring as a service
[17:56:09] I assume it's easy
[17:56:12] But I figured it was worth deploying anyway.
[17:56:13] I do it :)
[17:56:21] And let fixing that bug happen in the next iteration
[17:56:24] +1
[17:56:31] kk. Working on making the web nodes work.
[17:56:35] Then I'll be running away.
[17:56:55] so I test those translatewiki net
[17:57:01] PRs later
[17:58:28] afk for a while
[17:58:46] kk. Just about to deploy to ORES
[17:58:53] So we should have a new home page shortly :)
[17:58:58] Will ping here when we do.
[17:59:23] plward13, I'll be back online tomorrow around this time. Feel free to email.
[17:59:50] I'll leave my machine connected to IRC too, so I'll read the scrollback if you just want to say stuff in this channel.
[18:00:06] * halfak waits for the web nodes to restart
[18:00:45] halfak, great, thank you! I'll test training the PCFG.
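(Editor's sketch, not from the channel.) The C_regular / C_vandalism idea discussed above could look roughly like the following. This is a minimal illustration under stated assumptions: parse trees are given as nested tuples (in practice a parser would have to produce them), rule probabilities are plain relative frequencies, and the names `train_pcfg`, `log_prob`, and `vandalism_score` are hypothetical, not aetilley's library or the paper's exact method.

```python
import math
from collections import Counter


def rules(tree):
    """Yield (lhs, rhs) productions from a nested-tuple parse tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, (children[0],))  # lexical rule, e.g. NN -> "she"
        return
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from rules(child)


def train_pcfg(trees):
    """Relative-frequency estimate: P(A -> b) = count(A -> b) / count(A)."""
    rule_counts = Counter(r for tree in trees for r in rules(tree))
    lhs_counts = Counter()
    for (lhs, _rhs), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {r: n / lhs_counts[r[0]] for r, n in rule_counts.items()}


def log_prob(tree, grammar, floor=1e-6):
    """Log-probability of a tree; unseen rules get a small floor (assumption)."""
    return sum(math.log(grammar.get(r, floor)) for r in rules(tree))


def vandalism_score(tree, c_regular, c_vandalism):
    """Positive when the tree is more likely under the vandalism grammar."""
    return log_prob(tree, c_vandalism) - log_prob(tree, c_regular)


# Toy data only; real training would use tree-banked wiki sentences.
regular_tree = ("S", ("NP", ("NN", "she")), ("VP", ("VB", "acts")))
vandal_tree = ("S", ("UH", "lol"))
c_regular = train_pcfg([regular_tree])
c_vandalism = train_pcfg([vandal_tree])
```

A score like `vandalism_score(...)` would then be one numeric feature handed to the edit-level classifier (the "model"), with the two trained grammars acting as the "sub-model".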
[18:01:16] halfak, if you haven't left yet, is there a place where we might store "golden" wiki content?
[18:01:47] Yeah. I think that uploading it to the repo would be great if it is < 50MB.
[18:02:04] If it is larger than that, let's consider using GitHub's Large File Storage API.
[18:02:33] If it's too big for that, we have some public file servers at Wikimedia that would work OK, but there will be more process and no versioning for that.
[18:02:49] halfak, I see. So if I start curating "golden" wiki content, would it also be used for training in other projects?
[18:02:53] https://git-lfs.github.com/
[18:03:02] Yeah :)
[18:03:17] One good option is to put it up on a site like figshare too.
[18:03:31] That way you can get a DOI and citations when people use it :)
[18:03:41] Which is good if you are ever looking for academic cred.
[18:04:05] New homepage is live: http://ores.wmflabs.org/
[18:04:06] WOOOO
[18:04:10] OK. I'm leaving now.
[18:04:15] Have a good one folks! Happy hacking :)
[18:04:23] Ok awesome! I'll go for a simple first pass (even if it's biased), then start thinking more carefully about getting golden wiki data.
[18:04:33] halfak, take care!
[19:44:25] even 50 MB might be a bit excessive imho
[19:44:47] remember that the repo will need at least another 50MB
[19:45:01] (probably won't compress too well…)
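(Editor's sketch, not from the channel.) The "scheme for turning a diff into a set of changed sentences" mentioned at 17:24:44 is left open in the discussion. One possible approach, assuming a naive punctuation-based sentence splitter and Python's standard `difflib`, is to diff the two revisions sentence by sentence and keep whatever is new or altered. The function names here are hypothetical.

```python
import difflib
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace (assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def changed_sentences(old_text, new_text):
    """Return sentences of the new revision that were inserted or modified."""
    old = split_sentences(old_text)
    new = split_sentences(new_text)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changed = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            changed.extend(new[j1:j2])
    return changed


# Shortened version of the example quoted earlier in the log.
before = "Rosen (born 29 November 1985) is an actress. She played Faith."
after = "Rosen (born 29 November 1985 (Happy birthday)) is an actress. She played Faith."
print(changed_sentences(before, after))
# ['Rosen (born 29 November 1985 (Happy birthday)) is an actress.']
```

Only the first sentence is returned, which matches the reading confirmed at 17:45:17: the touched sentence would go into vandalism training, the untouched one would not. A real splitter would need to handle abbreviations and wiki markup.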