[14:59:13] o/ isaacj
[14:59:25] Re: your question from yesterday, that number was based on regexes.
[14:59:39] I found some good strategies for filtering out template matter. :)
[14:59:46] It just has to work most of the time.
[15:00:19] It turns out that parsing Alan Turing takes on average 0.25 seconds with mwpfh.
[15:00:35] So, a little more than an order of magnitude longer than the regex approach.
[15:01:27] Fun story: my tokenizer that runs in ORES takes 0.18 seconds to split Alan Turing into named tokens. That's a lot, so I'm suddenly re-interested in making that run faster.
[15:06:29] Oh excellent re: regexes - agree that a mostly-perfect solution is probably what we should be aiming for, unfortunately, at least right now.
[15:07:16] What is a named token vs. just splitting on white space / punctuation?
[15:11:10] The white space is a type of token too.
[15:11:38] When parsing, it's useful to know whether I have a number, a word, or a piece of markup (e.g. "{{", which gets named OPEN_TEMPLATE).
[15:12:08] We'll use the "names" to generate features later. Or to segment the page into sentences/paragraphs.
[15:12:44] But I bet we have some regexes in there that have bad performance. Naming tokens isn't especially slow.
[15:30:00] Oh ok, that makes sense. Hmm... maybe show me the code today, because I'd be curious to at least see it. I took a quick look at Keras too and it wasn't doing nearly as well as fastText out of the box, so while it may be worth hacking on a bit more, it's not an immediate modeling solution.
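
A rough sketch of the kind of timing measurement mentioned at 15:00:19, i.e. averaging how long mwparserfromhell (mwpfh) takes to parse the Alan Turing article. The API call, repetition count, and printed output are assumptions for illustration, not something taken from the log:

```python
import time

import mwparserfromhell
import requests

# Fetch the current wikitext of "Alan Turing" from the MediaWiki API.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": "Alan Turing",
        "format": "json",
        "formatversion": "2",
    },
)
wikitext = resp.json()["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]

# Average the parse time over a few runs.
runs = 10
start = time.perf_counter()
for _ in range(runs):
    mwparserfromhell.parse(wikitext)
elapsed = (time.perf_counter() - start) / runs
print(f"mwparserfromhell.parse: {elapsed:.3f} s per parse")
```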
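
For the 15:07:16 question about named tokens: this is not the actual ORES tokenizer, just a minimal regex sketch of the idea that every piece of text (including the whitespace itself and markup like "{{") gets a type name, which can later feed into features or sentence/paragraph segmentation. OPEN_TEMPLATE is the name used in the chat; the other token names here are invented:

```python
import re

# Alternatives are tried left to right, so more specific patterns come first.
TOKEN_RE = re.compile(
    r"(?P<OPEN_TEMPLATE>\{\{)"
    r"|(?P<CLOSE_TEMPLATE>\}\})"
    r"|(?P<NUMBER>\d+)"
    r"|(?P<WORD>\w+)"
    r"|(?P<WHITESPACE>\s+)"
    r"|(?P<OTHER>.)",
    re.UNICODE,
)

def tokenize(text):
    """Yield (token_name, token_text) pairs covering the whole input."""
    for match in TOKEN_RE.finditer(text):
        yield match.lastgroup, match.group()

print(list(tokenize("{{birth date|1912}} Turing was born in 1912.")))
# [('OPEN_TEMPLATE', '{{'), ('WORD', 'birth'), ('WHITESPACE', ' '), ('WORD', 'date'), ...]
```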