[14:59:13] o/ isaacj
[14:59:25] Re: your question from yesterday, that number was based on regexes.
[14:59:39] I found some good strategies for filtering out template matter. :)
[14:59:46] It just has to work most of the time.
[15:00:19] It turns out that parsing Alan Turing takes on average 0.25 seconds with mwpfh.
[15:00:35] So, a little more than an order of magnitude longer than the regex approach.
[15:01:27] Fun story: my tokenizer that runs in ORES takes 0.18 seconds to split Alan Turing into named tokens. That's a lot, so I'm suddenly re-interested in making that run faster.
[15:06:29] Oh excellent re: regexes - agree that a mostly-perfect solution is probably what we should be aiming for, unfortunately, at least right now.
[15:07:16] What is a named token vs. just splitting on white space / punctuation?
[15:11:10] The white space is a type of token too.
[15:11:38] When parsing, it's useful to know whether I have a number, a word, or a piece of markup (e.g. "{{", which gets named OPEN_TEMPLATE).
[15:12:08] We'll use the "names" to generate features later. Or to segment the page into sentences/paragraphs.
[15:12:44] But I bet we have some regexes in there that have bad performance. Naming tokens isn't especially slow.
[15:30:00] Oh ok, that makes sense. Hmm... maybe show me the code today, because I'd be curious to at least see it. I took a quick look at Keras too and it wasn't doing nearly as well as fastText out of the box, so while it may be worth hacking on a bit more, it's not an immediate modeling solution.
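
A rough sketch of the kind of timing measurement mentioned at 15:00:19, i.e. averaging how long mwparserfromhell (mwpfh) takes to parse the Alan Turing article. The API call, repetition count, and printed output are assumptions for illustration, not something taken from the log:

```python
import time

import mwparserfromhell
import requests

# Fetch the current wikitext of "Alan Turing" from the MediaWiki API.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": "Alan Turing",
        "format": "json",
        "formatversion": "2",
    },
)
wikitext = resp.json()["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]

# Average the parse time over a few runs.
runs = 10
start = time.perf_counter()
for _ in range(runs):
    mwparserfromhell.parse(wikitext)
elapsed = (time.perf_counter() - start) / runs
print(f"mwparserfromhell.parse: {elapsed:.3f} s per parse")
```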
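
For the 15:07:16 question about named tokens: this is not the actual ORES tokenizer, just a minimal regex sketch of the idea that every piece of text (including the whitespace itself and markup like "{{") gets a type name, which can later feed into features or sentence/paragraph segmentation. OPEN_TEMPLATE is the name used in the chat; the other token names here are invented:

```python
import re

# Alternatives are tried left to right, so more specific patterns come first.
TOKEN_RE = re.compile(
    r"(?P<OPEN_TEMPLATE>\{\{)"
    r"|(?P<CLOSE_TEMPLATE>\}\})"
    r"|(?P<NUMBER>\d+)"
    r"|(?P<WORD>\w+)"
    r"|(?P<WHITESPACE>\s+)"
    r"|(?P<OTHER>.)",
    re.UNICODE,
)

def tokenize(text):
    """Yield (token_name, token_text) pairs covering the whole input."""
    for match in TOKEN_RE.finditer(text):
        yield match.lastgroup, match.group()

print(list(tokenize("{{birth date|1912}} Turing was born in 1912.")))
# [('OPEN_TEMPLATE', '{{'), ('WORD', 'birth'), ('WHITESPACE', ' '), ('WORD', 'date'), ...]
```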