[17:11:01] o/ lzia
[17:11:07] I'm looking for a dataset you might know more about.
[17:11:30] So I remember there being a POS-tagged Wikipedia dataset (I thought it was branded with "Stanford")
[17:12:38] hey halfak. what is POS? :)
[17:13:04] Parts of Speech
[17:13:16] oww, that! give me a sec
[17:13:26] Thanks :)
[17:14:02] halfak: is this what you're talking about? (search for WIKI on that page): http://deepdive.stanford.edu/opendata/
[17:14:45] it's not really POS though
[17:25:09] It looks like POS
[17:25:17] I'm trying to understand what exactly is going on here
[17:26:43] OK. I see that each sentence gets a block separated by \n\n
[17:26:58] Now to figure out the meaning of the columns.
[17:28:13] Aha! Look at the CoNLL specification
[17:35:10] OK, that's not quite right, but it helps a little
[17:35:28] So, I think that this could be used, but it won't help me build a parser -- just a statistical tagger
[17:36:57] Really, what I need is a treebank
[17:38:49] that's a parsed corpus, but parsed with a dependency parser
[17:40:23] as in https://en.wikipedia.org/wiki/Dependency_grammar
[17:41:20] Hey pintoch. My ultimate goal is to apply https://en.wikipedia.org/wiki/Stochastic_context-free_grammar to changes in sentences in Wikipedia
[17:41:37] To detect vandalism, look for spammy/POV statements, etc.
[17:42:09] In order to train a PCFG, I need a way to turn my training samples into tagged trees.
[17:43:13] I see
[17:43:16] Whoops. Missed with a Ctrl-W
[17:43:19] :-)
[17:43:41] See https://phabricator.wikimedia.org/T144636 if you are interested in following along and helping out :)
[17:44:51] excellent!
[17:45:21] if you want pre-trained models for stochastic CFG parsers, you can use (for instance) these: http://nlp.stanford.edu/software/lex-parser.shtml
[17:47:46] Noted!
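For reference, a minimal sketch of reading the dataset format halfak describes above: tab-separated columns, one token per line, with each sentence's block separated by \n\n. The column indices for the token and the POS tag are assumptions based on common CoNLL layouts, not confirmed against the actual file.

```python
# Sketch: read a CoNLL-style file in which each sentence is a block of
# tab-separated lines and blocks are separated by blank lines.
# Column positions (token in column 1, POS tag in column 3) are an
# assumption based on common CoNLL layouts -- check the actual file.

def read_conll_sentences(path, token_col=1, pos_col=3):
    """Yield each sentence as a list of (token, pos_tag) pairs."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # blank line ends the current sentence block
                if sentence:
                    yield sentence
                    sentence = []
                continue
            columns = line.split("\t")
            sentence.append((columns[token_col], columns[pos_col]))
    if sentence:  # final block may not be followed by a blank line
        yield sentence

# Hypothetical usage (the file name is invented for illustration):
# for sent in read_conll_sentences("wiki-pos.conll"):
#     print(sent)
```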
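Since the conversation turns to dependency parsing (and, later, spaCy), here is a small example of what a pre-trained dependency parser produces for a sentence. The model name en_core_web_sm is an assumption; any installed spaCy English model would do.

```python
# Sketch: output of a pre-trained dependency parser, as discussed above.
# Assumes spaCy with the (assumed) en_core_web_sm model installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Every cat loves a dog")
for token in doc:
    # token text, fine-grained POS tag, dependency relation, head word
    print(token.text, token.tag_, token.dep_, token.head.text)
```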
[21:08:30] halfak: sorry, my connection dropped and I only saw your messages just now.
[21:08:54] No worries. I've been looking into libraries that provide pre-trained parsers.
[21:08:55] halfak: though I see that you've already answered all your questions. :)
[21:09:01] yeah
[21:09:05] I'm now looking at spaCy, but it seems to be insane
[21:09:35] what should a pre-trained parser do, halfak?
[21:10:17] lzia, turns a sentence into a parse tree. E.g. "Every cat loves a dog" --> (S (NP (DET Every) (NN cat)) (VP (VT loves) (NP (DET a) (NN dog))))
[21:10:28] Note the hierarchical structure.
[21:10:55] got you
[21:11:06] halfak: you can email Chris Re or someone on his team. They can help with this.
[21:13:48] Chris Re == spaCy dev?
[21:13:56] Or Stanford NLP dev?
[21:14:58] lzia, ^
[21:15:08] halfak: why are you specifically interested in stochastic CFG?
[21:17:12] pintoch, as I said above, "My ultimate goal is to apply https://en.wikipedia.org/wiki/Stochastic_context-free_grammar to changes in sentences in Wikipedia"
[21:17:19] To detect vandalism, look for spammy/POV statements, etc.
[21:17:22] halfak: Chris Re, re: finding a library for pre-trained parsers: http://cs.stanford.edu/people/chrismre/
[21:18:19] halfak: yeah, but why this specific linguistic theory? language models based on dependencies or even no grammar at all (recursive NNs) are very popular too
[21:19:05] so I wonder whether you have a particular method in mind to score revisions, that relies specifically on parse trees?
[21:19:06] new types of signal based on what we're already getting. I've seen no useful recursive NN for vandalism detection
[21:19:20] And some past work
[21:19:36] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.221.8770&rep=rep1&type=pdf#page=123
[21:20:12] thanks :)
[21:25:20] Oh, I suppose there's one more reason. I had someone on my team looking into PCFGs for a while, so I'
[21:25:30] m looking to pick up from where he left off.
[21:44:19] halfak: is it possible to disaggregate the individual components from the overall quality score?
[21:44:37] As I understand, ORES calculates quality on the basis of various factors. What I'd like to see is those individual factors.
[21:45:19] Specific use case: taking articles with cleanup templates, and measuring them on their relative, uh, cleanliness.
[21:47:27] harej, sorry, can you give me an example of an ORES response that you'd like to get?
[21:48:59] { "wp10": "C", "factors": { "citation_density": 0.54444, "cleanup_tags": 0.7325, "article_length": 0.4425 }
[21:49:07] Or whatever you use to calculate the overall quality score
[21:49:13] Also, add a closing } there
[22:05:14] harej, we have that!
[22:05:54] https://ores.wikimedia.org/v2/scores/enwiki/wp10/1235245?features
[22:07:58] Ah
[22:08:04] I don't usually use the ?features flag, nor did I know it existed.
[22:15:50] harej, Oh, I think I see what you mean too... you are asking for how the features are weighted in the prediction.
[22:15:58] Not necessarily
[22:16:05] What you linked to is precisely what I wanted
[22:16:07] I appreciate it!
[22:16:53] Sweet :)
[22:17:14] We can get at the weight of each feature, though. Let me know if you want something like that. We could add it to the model_info output.
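Picking up halfak's "Every cat loves a dog" example from earlier, here is a toy sketch of a PCFG that yields exactly that tree. NLTK is an assumption (the chat only mentions spaCy and the Stanford lex-parser), and the rule probabilities are invented for illustration; a real grammar would be induced from a treebank.

```python
# Sketch: a toy PCFG that produces halfak's example parse tree for
# "Every cat loves a dog". NLTK is an assumption, and the rule
# probabilities are invented; a real model would be learned from data.
import nltk

grammar = nltk.PCFG.fromstring("""
    S   -> NP VP    [1.0]
    NP  -> DET NN   [1.0]
    VP  -> VT NP    [1.0]
    DET -> 'Every'  [0.5]
    DET -> 'a'      [0.5]
    NN  -> 'cat'    [0.5]
    NN  -> 'dog'    [0.5]
    VT  -> 'loves'  [1.0]
""")

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("Every cat loves a dog".split()):
    print(tree)         # (S (NP (DET Every) (NN cat)) (VP (VT loves) ...))
    print(tree.prob())  # probability of this parse under the toy grammar
```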
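Finally, a small sketch of fetching the per-feature values harej asked about, via the v2 endpoint halfak linked. The use of the requests library and the exact nesting of the JSON response are assumptions; inspect the payload before relying on any particular path.

```python
# Sketch: fetch an ORES v2 score with the ?features flag halfak linked,
# which returns the raw feature values used by the wp10 model alongside
# the prediction. The nesting of the response below is an assumption
# based on the URL structure; check the actual JSON first.
import json
import requests

url = "https://ores.wikimedia.org/v2/scores/enwiki/wp10/1235245"
response = requests.get(url, params={"features": ""})
response.raise_for_status()

data = response.json()
print(json.dumps(data, indent=2))  # inspect the full payload first

# The feature values are expected somewhere under
# data["scores"]["enwiki"]["wp10"] -- an assumed path, not confirmed.
```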