[17:11:01] o/ lzia
[17:11:07] I'm looking for a dataset you might know more about.
[17:11:30] So I remember there being a POS-tagged Wikipedia dataset (I thought it was branded with "Stanford")
[17:12:38] hey halfak. what is POS? :)
[17:13:04] Parts of Speech
[17:13:16] oww, that! give me a sec
[17:13:26] Thanks :)
[17:14:02] halfak: is this what you're talking about? (search for WIKI on that page): http://deepdive.stanford.edu/opendata/
[17:14:45] it's not really POS though
[17:25:09] It looks like POS
[17:25:17] I'm trying to understand what exactly is going on here
[17:26:43] OK. I see that each sentence gets a block separated by \n\n
[17:26:58] Now to figure out the meaning of the columns.
[17:28:13] Aha! Look at the CoNLL specification
[17:35:10] OK, that's not quite right, but it helps a little
[17:35:28] So, I think that this could be used, but it won't help me build a parser -- just a statistical tagger
[17:36:57] Really, what I need is a treebank
[17:38:49] that's a parsed corpus, but parsed with a dependency parser
[17:40:23] as in https://en.wikipedia.org/wiki/Dependency_grammar
[17:41:20] Hey pintoch. My ultimate goal is to apply https://en.wikipedia.org/wiki/Stochastic_context-free_grammar to changes in sentences in Wikipedia
[17:41:37] To detect vandalism, look for spammy/POV statements, etc.
[17:42:09] In order to train a PCFG, I need a way to turn my training samples into tagged trees.
[17:43:13] I see
[17:43:16] Whoops. Missed with a Ctrl-W
[17:43:19] :-)
[17:43:41] See https://phabricator.wikimedia.org/T144636 if you are interested in following along and helping out :)
[17:44:51] excellent!
[17:45:21] if you want pre-trained models for stochastic CFG parsers, you can use (for instance) these: http://nlp.stanford.edu/software/lex-parser.shtml
[17:47:46] Noted!
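For reference, a minimal sketch of reading the dataset format halfak describes above: tab-separated columns, one token per line, with each sentence's block separated by \n\n. The column indices for the token and the POS tag are assumptions based on common CoNLL layouts, not confirmed against the actual file.

```python
# Sketch: read a CoNLL-style file in which each sentence is a block of
# tab-separated lines and blocks are separated by blank lines.
# Column positions (token in column 1, POS tag in column 3) are an
# assumption based on common CoNLL layouts -- check the actual file.

def read_conll_sentences(path, token_col=1, pos_col=3):
    """Yield each sentence as a list of (token, pos_tag) pairs."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # blank line ends the current sentence block
                if sentence:
                    yield sentence
                    sentence = []
                continue
            columns = line.split("\t")
            sentence.append((columns[token_col], columns[pos_col]))
    if sentence:  # final block may not be followed by a blank line
        yield sentence

# Hypothetical usage (the file name is invented for illustration):
# for sent in read_conll_sentences("wiki-pos.conll"):
#     print(sent)
```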
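Since the conversation turns to dependency parsing (and, later, spaCy), here is a small example of what a pre-trained dependency parser produces for a sentence. The model name en_core_web_sm is an assumption; any installed spaCy English model would do.

```python
# Sketch: output of a pre-trained dependency parser, as discussed above.
# Assumes spaCy with the (assumed) en_core_web_sm model installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Every cat loves a dog")
for token in doc:
    # token text, fine-grained POS tag, dependency relation, head word
    print(token.text, token.tag_, token.dep_, token.head.text)
```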
[21:08:30] halfak: sorry, my connection dropped and I only saw your messages just now.
[21:08:54] No worries. I've been looking into libraries that provide pre-trained parsers.
[21:08:55] halfak: though I see that you've already answered all your questions. :)
[21:09:01] yeah
[21:09:05] I'm now looking at spaCy, but it seems to be insane
[21:09:35] what should a pre-trained parser do, halfak?
[21:10:17] lzia, turns a sentence into a parse tree. E.g. "Every cat loves a dog" --> (S (NP (DET Every) (NN cat)) (VP (VT loves) (NP (DET a) (NN dog))))
[21:10:28] Note the hierarchical structure.
[21:10:55] got you
[21:11:06] halfak: you can email Chris Re or someone on his team. They can help with this.
[21:13:48] Chris Re == spaCy dev?
[21:13:56] Or Stanford NLP dev?
[21:14:58] lzia, ^
[21:15:08] halfak: why are you specifically interested in stochastic CFG?
[21:17:12] pintoch, as I said above, "My ultimate goal is to apply https://en.wikipedia.org/wiki/Stochastic_context-free_grammar to changes in sentences in Wikipedia"
[21:17:19] To detect vandalism, look for spammy/POV statements, etc.
[21:17:22] halfak: Chris Re, re: finding a library for pre-trained parsers: http://cs.stanford.edu/people/chrismre/
[21:18:19] halfak: yeah, but why this specific linguistic theory? language models based on dependencies or even no grammar at all (recursive NNs) are very popular too
[21:19:05] so I wonder whether you have a particular method in mind to score revisions, that relies specifically on parse trees?
[21:19:06] new types of signal based on what we're already getting. I've seen no useful recursive NN for vandalism detection
[21:19:20] And some past work
[21:19:36] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.221.8770&rep=rep1&type=pdf#page=123
[21:20:12] thanks :)
[21:25:20] Oh, I suppose there's one more reason. I had someone on my team looking into PCFGs for a while, so I'
[21:25:30] m looking to pick up from where he left off.
[21:44:19] halfak: is it possible to disaggregate the individual components from the overall quality score?
[21:44:37] As I understand, ORES calculates quality on the basis of various factors. What I'd like to see is those individual factors.
[21:45:19] Specific use case: taking articles with cleanup templates, and measuring them on their relative, uh, cleanliness.
[21:47:27] harej, sorry, can you give me an example of an ORES response that you'd like to get?
[21:48:59] { "wp10": "C", "factors": { "citation_density": 0.54444, "cleanup_tags": 0.7325, "article_length": 0.4425 }
[21:49:07] Or whatever you use to calculate the overall quality score
[21:49:13] Also, add a closing } there
[22:05:14] harej, we have that!
[22:05:54] https://ores.wikimedia.org/v2/scores/enwiki/wp10/1235245?features
[22:07:58] Ah
[22:08:04] I don't usually use the ?features flag, nor did I know it existed.
[22:15:50] harej, Oh, I think I see what you mean too... you are asking for how the features are weighted in the prediction.
[22:15:58] Not necessarily
[22:16:05] What you linked to is precisely what I wanted
[22:16:07] I appreciate it!
[22:16:53] Sweet :)
[22:17:14] We can get at the weight of each feature, though. Let me know if you want something like that. We could add it to the model_info output.
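Picking up halfak's "Every cat loves a dog" example from earlier, here is a toy sketch of a PCFG that yields exactly that tree. NLTK is an assumption (the chat only mentions spaCy and the Stanford lex-parser), and the rule probabilities are invented for illustration; a real grammar would be induced from a treebank.

```python
# Sketch: a toy PCFG that produces halfak's example parse tree for
# "Every cat loves a dog". NLTK is an assumption, and the rule
# probabilities are invented; a real model would be learned from data.
import nltk

grammar = nltk.PCFG.fromstring("""
    S   -> NP VP    [1.0]
    NP  -> DET NN   [1.0]
    VP  -> VT NP    [1.0]
    DET -> 'Every'  [0.5]
    DET -> 'a'      [0.5]
    NN  -> 'cat'    [0.5]
    NN  -> 'dog'    [0.5]
    VT  -> 'loves'  [1.0]
""")

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("Every cat loves a dog".split()):
    print(tree)         # (S (NP (DET Every) (NN cat)) (VP (VT loves) ...))
    print(tree.prob())  # probability of this parse under the toy grammar
```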
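Finally, a small sketch of fetching the per-feature values harej asked about, via the v2 endpoint halfak linked. The use of the requests library and the exact nesting of the JSON response are assumptions; inspect the payload before relying on any particular path.

```python
# Sketch: fetch an ORES v2 score with the ?features flag halfak linked,
# which returns the raw feature values used by the wp10 model alongside
# the prediction. The nesting of the response below is an assumption
# based on the URL structure; check the actual JSON first.
import json
import requests

url = "https://ores.wikimedia.org/v2/scores/enwiki/wp10/1235245"
response = requests.get(url, params={"features": ""})
response.raise_for_status()

data = response.json()
print(json.dumps(data, indent=2))  # inspect the full payload first

# The feature values are expected somewhere under
# data["scores"]["enwiki"]["wp10"] -- an assumed path, not confirmed.
```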