[02:21:38] o/ halfak
[14:50:50] o/
[16:49:50] o/ halfak
[16:55:04] o/ sabya
[16:55:15] I see that you've got a precached running :)
[16:55:24] yes.
[16:55:34] We had a confusing experience figuring out where all that traffic was coming from yesterday :)
[16:55:46] oops.
[16:55:49] *on saturday
[16:55:50] No worries!
[16:55:52] Was fine.
[16:56:07] Was a good training run in raising the cluster's capacity and monitoring the queues.
[16:56:26] I actually thought of that, then missed informing you folks.
[16:57:53] Thinking about what I should pick next. I want to start learning ML. Anything appropriate?
[16:59:02] We're just starting up some new work on article quality prediction. This is applying a method that works on English and French Wikipedia to Russian and Portuguese.
[16:59:30] On the more algorithmy side of things, I'd like someone to look into hashing vectorizers soon.
[16:59:46] It's a bag-of-words strategy that could prove useful.
[17:01:51] sabya, ^ either of those sound fun?
[17:03:06] Both. Could I get more details on both? What is the problem that we would be solving with hashing vectorizers? Where exactly in the codebase will it fit?
[17:04:19] The process for a hashing vectorizer is:
[17:04:34] 1. Take text and convert it to words, bigrams, trigrams, skipgrams, etc.
[17:04:51] 2. Take all of these text chunks and hash them -- let's say with a sha1()
[17:05:27] 3. Truncate the hash down to N bytes
[17:05:59] 4. Make a mapping between the trimmed_hash and the count of occurrences of the word/phrase/whatever.
[17:06:49] 5. Use this giant mapping of trimmed_hashes as features in a prediction model -- preferably one that deals well with a very large number of features.
[17:06:59] I've been researching XGBoost for step 5
[17:07:20] https://phabricator.wikimedia.org/T128086
[17:07:33] It works pretty well and fits within our modeling scheme in revscoring
[17:08:07] For steps 2, 3, and 4, there's an sklearn utility. See my notes here: https://phabricator.wikimedia.org/T128087
[17:08:54] ok.
[17:09:30] Sabya, I'm not sure how we'll manage the feature explosion yet, but it seems like it would be worth running a few tests with this.
[17:09:57] You might be able to show that we can make better edit quality predictions by comparing hash vectors before and after an edit.
[17:10:15] If that's the case, we can publish a paper and then figure out how to implement it in production :)
[17:10:46] ok
[17:12:50] If you want to go this direction, I think the first bit of work is to just play with the HashingVectorizer and XGBoost to get a sense for how big the feature explosion is going to be and how we might start training a model.
[17:13:38] Is the hashing vectorizer then at the same level as these: damaging, goodfaith, reverted, wp10, etc.?
[17:14:17] in terms of where it fits in the project.
[17:14:30] Darn. Github is down. I wanted to show you a notebook that will give a good overview.
[17:15:19] sabya, a mirror! http://paws-public.wmflabs.org/paws-public/EpochFail/projects/examples/editquality.ipynb
[17:15:22] Sort of :)
[17:15:35] So, this will give you a basic overview of how we build the models.
[17:15:45] The hashing vectorizer would fit in around feature engineering.
[17:16:21] (Please ignore errors in the notebook. If you scroll past them, everything works OK.)
[17:17:22] Right now, we store data about each individual feature within a ScorerModel -- that's probably not going to work with a huge hash map.
[17:17:51] But let's not worry about that for now and just try to see if we can extract such a mapping from text.
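A minimal sketch (not from the chat) of steps 1-5 in plain Python. The unigram-through-trigram range, the use of sha1(), and the 4-byte truncation are illustrative choices; sklearn's HashingVectorizer does the hashing differently under the hood.

    import hashlib
    from collections import Counter

    def ngrams(tokens, n):
        """All contiguous n-word chunks of the token list."""
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def hash_vector(text, n_bytes=4, max_n=3):
        tokens = text.lower().split()
        chunks = []
        for n in range(1, max_n + 1):
            chunks.extend(ngrams(tokens, n))  # 1. words, bigrams, trigrams
        trimmed = [
            hashlib.sha1(chunk.encode("utf-8")).digest()[:n_bytes]  # 2./3. hash, then truncate to N bytes
            for chunk in chunks
        ]
        return Counter(trimmed)  # 4. trimmed_hash -> count of occurrences

    # The Counter is the sparse feature mapping that step 5 would feed to a model.
    print(hash_vector("I am the walrus I am the eggman"))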
[17:19:37] sabya, this version of that notebook is cleaner: https://github.com/wiki-ai/editquality/blob/master/ipython/reverted_detection_demo.ipynb
[17:19:42] Not that github is back online :)
[17:19:45] *now
[17:20:09] * sabya checks the links
[17:21:21] brb
[17:42:37] back
[17:44:51] sabya, I'd like to head out in a minute. Do you have enough stuff to look at for now?
[17:45:44] Yes, halfak. But if you have other relevant links, please share.
[17:46:16] Regretfully, I'm just at the beginning of my exploration of this method. Will be interested in seeing your notes :)
[17:46:18] o/
[17:46:44] o/
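A hypothetical sketch of the experiment discussed above: hash-vectorize the text before and after an edit with sklearn's HashingVectorizer, use the difference of the two vectors as features, and fit an XGBoost classifier. The toy edits, n_features value, and XGBoost parameters are placeholders and are not part of editquality or revscoring.

    from sklearn.feature_extraction.text import HashingVectorizer
    from xgboost import XGBClassifier

    # 2**20 features and unigrams-through-trigrams are arbitrary starting points.
    vectorizer = HashingVectorizer(n_features=2 ** 20, ngram_range=(1, 3))

    # Placeholder (parent_text, edited_text, reverted) triples; real data would
    # come from labeled edits.
    edits = [
        ("the walrus was a bird", "the walrus was a bird [citation needed]", 0),
        ("neutral article text", "BUY CHEAP PILLS NOW", 1),
    ]

    before = vectorizer.transform(parent for parent, _, _ in edits)
    after = vectorizer.transform(edited for _, edited, _ in edits)
    X = after - before  # sparse "diff" of the before/after hash vectors
    y = [reverted for _, _, reverted in edits]

    model = XGBClassifier(n_estimators=100, max_depth=4)
    model.fit(X, y)
    print(model.predict(X))

Taking the difference keeps the feature matrix sparse, which may help keep the feature explosion manageable in the first round of tests.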