[16:14:35] o/ aetilley
[16:16:35] My notes and work re PCFGs are here: https://phabricator.wikimedia.org/T144636
[16:19:13] So, my first question -- do we need to have CNF?
[16:19:31] Or can I fudge that a bit with a non-CNF tree-building parser?
[16:28:30] ^halfak test
[16:28:34] o/ aetilley
[16:28:38] hiya
[16:28:45] Here's a paste of what you missed
[16:28:48] My notes and work re PCFGs are here: https://phabricator.wikimedia.org/T144636
[16:28:49] So I'm looking at your notes now
[16:28:54] So, my first question -- do we need to have CNF?
[16:29:01] Or can I fudge that a bit with a non-CNF tree-building parser?
[16:29:15] E.g. spaCy seems very fast, but it does not build CNF trees.
[16:29:19] Instead, dependency pairs.
[16:29:30] Only if you're using the CKY algorithm and Inside algorithm for parsing and scoring respectively.
[16:30:02] You might instead consider looking at general parsers/scorers
[16:30:18] I'm learning quite a bit about NLTK, but there's a lot of it.
[16:31:02] First of all, let me say that if you decide to go the CKY/Inside route, you should consider NLTK functionality:
[16:31:03] https://www.cs.bgu.ac.il/~elhadad/nlp11/nltk-pcfg.html
[16:31:48] But I think I should answer some of your concerns in your link
[16:32:18] Is scoring more complicated than multiplying probability as you apply the PCFG to a real sentence?
[16:34:29] Scoring is as easy as that once you have the parse tree.
[16:34:52] What's the significance of the "inside" algorithm?
[16:35:01] But scoring takes a raw sentence as input, and gives the sum of the scores of all possible parses.
[16:36:05] And yes, there are namespace considerations that would need to be changed in the PCFG file format.
[16:36:42] "namespace considerations"?
[16:38:02] e.g. if you happened to have a terminal symbol with the same name as a non-terminal.
[16:38:11] Ahh yeah.
[16:38:15] I was thinking about this.
[16:38:24] Is it possible to have a unary non-terminal?
[16:38:41] Maybe it's possible, but it is unnecessary?
[16:39:44] so 'unary' is a property of a rule, not a symbol
[16:39:52] Indeed.
[16:40:04] unary with a non-terminal production
[16:40:10] unary just means the target list has length one
[16:40:14] Yes
[16:40:19] We're on the same page
[16:40:24] So CFG format:
[16:40:30] Is it not clear that I mean a unary rule to have a non-terminal production?
[16:40:32] ...says no
[16:40:42] In a general PCFG, yes
[16:40:44] So all unary rules produce a terminal
[16:40:47] oh
[16:41:06] sorry, I meant to say CNF
[16:41:20] CNF says that all rules are one of two types:
[16:41:35] 1) Binary from a nonterminal to two nonterminals
[16:41:44] 2) Unary from a nonterminal to one terminal.
[16:41:50] Cool.
[16:41:58] Maybe I don't need to go inventing new formats then
[16:42:05] If your grammar takes this form, you can use the CKY/Inside algorithms.
[16:42:55] CKY is inside the parser though, right?
[16:43:07] ... something we'd like to ignore and pretend is magic?
[16:50:14] ^halfak back sorry
[16:50:43] No worries
[16:50:58] I'd asked: CKY is inside the parser though, right?
[16:51:03] ... something we'd like to ignore and pretend is magic?
[16:51:10] So both algorithms have the ability to score a specific parse tree. But the parser finds the parse tree with the largest score for a given sentence
[16:51:13] (Preferably fast magic)
[16:51:25] the inside algorithm computes the sum of all these scores.
[16:51:36] This is for choosing a parse?
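For reference, here is a minimal sketch of the CKY/Inside setup discussed above, using NLTK's PCFG tools. The toy grammar and sentence are illustrative (not from the task); `ViterbiParser` performs the CKY-style best-parse search, and summing probabilities over all parses reproduces, by brute-force enumeration, the quantity the Inside algorithm computes:

```
from nltk import PCFG
from nltk.parse import ViterbiParser, InsideChartParser

# Toy grammar in CNF: every rule is NT -> NT NT or NT -> terminal.
grammar = PCFG.fromstring("""
    S   -> NP VP  [1.0]
    NP  -> DT NN  [0.7]
    NP  -> 'dogs' [0.3]
    VP  -> VBZ NP [1.0]
    DT  -> 'the'  [1.0]
    NN  -> 'cat'  [1.0]
    VBZ -> 'sees' [1.0]
""")

tokens = "the cat sees dogs".split()

# Most probable parse: the tree the CKY/Viterbi search finds.
best = max(ViterbiParser(grammar).parse(tokens), key=lambda t: t.prob())
print(best, best.prob())

# Sum of the probabilities of all parses of the sentence -- the
# Inside-algorithm quantity (fine to enumerate for a toy grammar).
print(sum(t.prob() for t in InsideChartParser(grammar).parse(tokens)))
```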
[16:52:01] the former, yes
[16:52:28] the scorer rather gives you a measure of how "characteristic" the sentence is of the grammar.
[16:52:47] But not in a PCFG sense
[16:53:06] because presumably, we'd have the same parser/grammar for two competing probability maps.
[16:53:21] For instance, I might have a sentence with multiple parses, but such that all of its parses have a relatively high vandalism-grammar score.
[16:53:41] so the inside algorithm trained on the vandalism grammar would return a high score
[16:54:03] I'm not sure if I understand your last question.
[16:54:47] No, you would have two different PCFG objects, one for each grammar
[16:54:54] each with its own scoring method
[16:55:57] In the research paper we looked at, they added things like the max/min/mean log difference of the two scores as features for the classifier.
[16:56:17] Yeah. OK. I think I get this.
[16:56:36] So... final question... Is spaCy's parser incompatible with this strategy?
[16:56:41] max/min/mean over each sentence in the revision, that is.
[16:56:55] I don't know about spaCy.
[16:57:25] Trust me, I'd much rather be doing research on this than what I'm doing for my 9-5.
[16:57:37] But I only have so many hours in the week.
[16:57:43] And startups are hell.
[16:58:05] Yeah. No worries. I wish Wikipedia could grow a magical money tree and we could just hire you / be your patron
[16:58:21] OK. So I can show you what I'm asking about.
[16:58:23] One sec.
[16:58:26] sure
[16:58:43] https://gist.github.com/halfak/bfb4772ffde2c4d04d3c9a549d247abb
[16:58:52] This is the best tree I can build with spaCy.
[16:59:00] note the ternary rules
[16:59:29] One even has a 5-ary rule
[17:00:04] But I figure that on a big enough corpus, we could learn the probability of these complex rules
[17:00:47] Yup. CNF is pretty artificial.
[17:01:12] n-ary rules for n>2 are quite common.
[17:01:21] OK. cool.
[17:01:41] So, I might try to see if I can pursue this, since spaCy is really quite a lot faster than the alternatives.
[17:01:58] So you can use a parser/scorer that allows these, or convert these to CNF (artificially) and accept the resulting score
[17:02:07] cool
[17:02:21] Does it have a scorer?
[17:02:29] spaCy, that is?
[17:02:34] Good Q. I didn't explicitly look for that
[17:02:42] Because that's what you want eventually
[17:02:55] You want some numeric score for your classifiers.
[17:03:12] So, my plan is to (1) do simple probability math for all known productions and (2) have a base probability for unknown productions based on the # of examples we have seen in context
[17:03:34] aetilley, I was thinking I'd adapt your library to do the scoring as ^
[17:03:44] Interesting.
[17:04:01] So, e.g. we see a "VBZ" but the terminal "runz" has no recorded probability
[17:04:18] I'm sure that's possible. CNF makes it particularly easy, but I bet it could be done for non-CNF forms.
[17:04:47] let `a` be the # of observations of VBZ that we've trained our probabilities on. p("runz"|VBZ) = 1/a
[17:04:58] Right, well, that's another issue:
[17:05:25] The library as it stands has very limited tokenizing.
[17:05:47] Oh yeah. No worries there. Assume we have tokenizing and sentence segmenting solved.
[17:06:11] Cool.
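A rough sketch of the scoring plan described at 17:03-17:04: multiply probabilities for known productions, and back off to 1/a for unseen ones. The `(label, children)` tree format and the `prod_counts`/`lhs_counts` tallies are hypothetical stand-ins (not spaCy or NLTK structures), and the logic works for n-ary rules as well as CNF:

```
import math

# Hypothetical tallies learned from a parsed corpus (names are illustrative).
prod_counts = {("VBZ", ("sees",)): 99}   # observations of each (lhs, rhs) production
lhs_counts = {"VBZ": 100}                # total observations of each lhs symbol

def production_logprob(lhs, rhs):
    a = lhs_counts.get(lhs, 1)           # `a` as in the chat: # of observations of lhs
    count = prod_counts.get((lhs, rhs), 0)
    if count > 0:
        return math.log(count / a)
    return math.log(1.0 / a)             # unseen production, e.g. p("runz"|VBZ) = 1/a

def tree_logprob(tree):
    """Log-probability of a (label, children) tree; leaf children are strings."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    total = production_logprob(label, rhs)
    for child in children:
        if not isinstance(child, str):   # recurse into non-terminal children
            total += tree_logprob(child)
    return total

print(tree_logprob(("VBZ", ["sees"])))   # log(99/100), a known production
print(tree_logprob(("VBZ", ["runz"])))   # log(1/100), the 1/a fallback
```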
[17:06:12] (Rather, we're going to use our own broken implementation :D )
[17:07:02] * halfak curses IRC
[17:15:09] Revision-Scoring-As-A-Service, revscoring: Implement PCFG features - https://phabricator.wikimedia.org/T144636#2625714 (Halfak) I've been thinking that our PCFG might not need to care about non `CNF` productions. This parse is useful and we could learn some probabilities with it: ``` >>> print_tree(tre...
[17:24:57] Going AFK for a while
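A companion sketch of the "learn probabilities from non-CNF parses" idea in the Phabricator note above: tally every observed production, n-ary rules included, to build the `prod_counts`/`lhs_counts` tables the scorer sketch earlier consumes. Again, the `(label, children)` tree format is a hypothetical stand-in, not the gist's actual spaCy output:

```
from collections import Counter

prod_counts, lhs_counts = Counter(), Counter()

def record(tree):
    """Tally every production in a (label, children) tree, n-ary rules included."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prod_counts[(label, rhs)] += 1
    lhs_counts[label] += 1
    for child in children:
        if not isinstance(child, str):
            record(child)

# A ternary (non-CNF) production, like those noted in the spaCy-derived tree.
record(("VP", [("VBZ", ["sees"]), ("NP", ["dogs"]), ("ADVP", ["often"])]))
print(prod_counts[("VP", ("VBZ", "NP", "ADVP"))])  # -> 1
print(lhs_counts["VP"])                            # -> 1
```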