[21:59:55] hey isaacj. I was just looking over the text processing code your students worked on. I was wondering why they didn't use mwparserfromhell. Too slow?
[22:00:24] When trying to parse some real-world articles, it's become clear how difficult handling templates is :|
[22:04:10] Yeah, we were trying to speed things up. Especially because we were playing with only processing the lede paragraph, so not parsing the whole document can save A LOT of time
[22:05:34] Oh interesting. I wonder if that would be a better approach. Right now, I'm seriously considering writing an island grammar for templates.
[22:05:39] And that's not a good place to be :|
[22:06:02] Since templates can nest, regexes can't handle them alone.
[22:08:24] Ooh. I just figured out a good strategy for dropping template stuff.
[22:09:34] halfak: ^^
[22:09:41] Oops, you had seen it
[22:10:43] Are you trying to populate text based on the templates too, or just remove the syntax?
[22:13:06] Just remove them. I really just want the text in the paragraphs.
[22:17:17] I'm also looking at just replacing all numbers with "anumber". I don't see how "two zero one seven" makes any more sense.
[22:17:28] Hmm, I'll think on it then, but yeah, we might just have some noise. At least nested templates tend to be on longer articles, I'd expect, where missing some text is OK
[22:18:08] Or maybe a cheap way to check for them and only use mwparserfromhell if they exist
[22:30:12] Also agree with the #s so long as we're working with unigrams. But even if we fully captured a year or something, it seems unlikely to tell us anything about topic. Though it also would not change the embeddings at all if someone was shifting the numbers, so that's something to consider
[22:39:09] good point re shifting numbers.
[22:39:50] I did a bit of tuning. I can process the entire Alan Turing article in less than 0.02 seconds on average. That's pretty fast.
[22:40:01] The darn thing would take 0.3 seconds to tokenize!
[23:10:36] oooh that first number is good -- is that mwparserfromhell or regexes?
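
(For reference, a minimal sketch in Python of the kind of brace-matching pass that could drop templates despite nesting -- the log never shows the actual strategy, so the strip_templates helper and its details are assumptions, not the code discussed above. The point at 22:06:02 is that {{...}} pairs can nest, so a single regex can't match them; a depth counter in one scan can.)

    def strip_templates(text):
        """Drop {{...}} template markup, including nested templates,
        by tracking brace depth in one left-to-right scan."""
        out = []
        depth = 0
        i = 0
        while i < len(text):
            if text.startswith("{{", i):
                depth += 1
                i += 2
            elif text.startswith("}}", i) and depth > 0:
                depth -= 1
                i += 2
            else:
                if depth == 0:
                    out.append(text[i])
                i += 1
        return "".join(out)

    # e.g. strip_templates("Turing {{infobox|born={{birth date|1912}}}} was a pioneer.")
    # -> "Turing  was a pioneer."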
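
(A sketch of the cheap-check idea from 22:18:08: skip parsing entirely when no template markup is present, and fall back to mwparserfromhell only when it is. parse() and strip_code() are real mwparserfromhell calls; the clean_text wrapper is hypothetical.)

    import mwparserfromhell

    def clean_text(wikitext):
        """Only pay for a full parse when template markup actually appears;
        ledes without any "{{" stay on the fast path."""
        if "{{" not in wikitext:
            return wikitext  # no templates to strip
        # The parser handles nesting and other markup correctly.
        return mwparserfromhell.parse(wikitext).strip_code()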
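
(And one possible reading of the "anumber" replacement from 22:17:17; the exact number pattern is an assumption, since the log doesn't show it.)

    import re

    NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")

    def normalize_numbers(text):
        """Collapse numeric tokens into a single placeholder so the unigram
        vocabulary isn't flooded with years, dates, and counts."""
        return NUMBER_RE.sub("anumber", text)

    # normalize_numbers("Turing was born on 23 June 1912.")
    # -> "Turing was born on anumber June anumber."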