[02:21:38] o/ halfak
[14:50:50] o/
[16:49:50] o/ halfak
[16:55:04] o/ sabya
[16:55:15] I see that you've got a precached running :)
[16:55:24] yes.
[16:55:34] We had a confusing experience figuring out where all that traffic was coming from yesterday :)
[16:55:46] oops.
[16:55:49] *on saturday
[16:55:50] No worries!
[16:55:52] Was fine.
[16:56:07] Was a good training run in raising the cluster's capacity and monitoring the queues.
[16:56:26] I actually thought of that, then missed informing you folks.
[16:57:53] Thinking about what I should pick next. I want to start learning ML. Anything appropriate?
[16:59:02] We're just starting up some new work on article quality prediction. This is applying a method that works on English and French Wikipedia to Russian and Portuguese.
[16:59:30] On the more algorithmy side of things, I'd like someone to look into hashing vectorizers soon.
[16:59:46] It's a bag-of-words strategy that could prove useful.
[17:01:51] sabya, ^ either of those sound fun?
[17:03:06] Both. Could I get more details on both? What is the problem that we would be solving with hashing vectorizers? Where exactly in the codebase will it fit?
[17:04:19] The process for a hashing vectorizer is:
[17:04:34] 1. Take text and convert it to words, bigrams, trigrams, skipgrams, etc.
[17:04:51] 2. Take all of these text chunks and hash them -- let's say with a sha1()
[17:05:27] 3. Truncate the hash down to N bytes
[17:05:59] 4. Make a mapping between the trimmed_hash and the count of occurrences of the word/phrase/whatever.
[17:06:49] 5. Use this giant mapping of trimmed_hashes as features in a prediction model -- preferably one that deals well with a very large number of features.
[17:06:59] I've been researching XGBoost for step 5
[17:07:20] https://phabricator.wikimedia.org/T128086
[17:07:33] It works pretty well and fits within our modeling scheme in revscoring
[17:08:07] For steps 2, 3, and 4, there's an sklearn utility. See my notes here: https://phabricator.wikimedia.org/T128087
[17:08:54] ok.
[17:09:30] Sabya, I'm not sure how we'll manage the feature explosion yet, but it seems like it would be worth running a few tests with this.
[17:09:57] You might be able to show that we can make better edit quality predictions by comparing hash vectors before and after an edit.
[17:10:15] If that's the case, we can publish a paper and then figure out how to implement it in production :)
[17:10:46] ok
[17:12:50] If you want to go this direction, I think the first bit of work is to just play with the HashingVectorizer and XGBoost to get a sense for how big the feature explosion is going to be and how we might start training a model.
[17:13:38] Is the hashing vectorizer then at the same level as these: damaging, goodfaith, reverted, wp10, etc.?
[17:14:17] in terms of where it fits in the project.
[17:14:30] Darn. Github is down. I wanted to show you a notebook that will give a good overview.
[17:15:19] sabya, a mirror! http://paws-public.wmflabs.org/paws-public/EpochFail/projects/examples/editquality.ipynb
[17:15:22] Sort of :)
[17:15:35] So, this will give you a basic overview of how we build the models.
[17:15:45] The hashing vectorizer would fit in around feature engineering.
[17:16:21] (Please ignore errors in the notebook. If you scroll past them, everything works OK.)
[17:17:22] Right now, we store data about each individual feature within a ScorerModel -- that's probably not going to work with a huge hash map.
[17:17:51] But let's not worry about that for now and just try to see if we can extract such a mapping from text.
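A minimal sketch (not from the chat) of steps 1-5 in plain Python. The unigram-through-trigram range, the use of sha1(), and the 4-byte truncation are illustrative choices; sklearn's HashingVectorizer does the hashing differently under the hood.

    import hashlib
    from collections import Counter

    def ngrams(tokens, n):
        """All contiguous n-word chunks of the token list."""
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def hash_vector(text, n_bytes=4, max_n=3):
        tokens = text.lower().split()
        chunks = []
        for n in range(1, max_n + 1):
            chunks.extend(ngrams(tokens, n))  # 1. words, bigrams, trigrams
        trimmed = [
            hashlib.sha1(chunk.encode("utf-8")).digest()[:n_bytes]  # 2./3. hash, then truncate to N bytes
            for chunk in chunks
        ]
        return Counter(trimmed)  # 4. trimmed_hash -> count of occurrences

    # The Counter is the sparse feature mapping that step 5 would feed to a model.
    print(hash_vector("I am the walrus I am the eggman"))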
[17:19:37] sabya, this version of that notebook is cleaner: https://github.com/wiki-ai/editquality/blob/master/ipython/reverted_detection_demo.ipynb
[17:19:42] Not that github is back online :)
[17:19:45] *now
[17:20:09] * sabya checks the links
[17:21:21] brb
[17:42:37] back
[17:44:51] sabya, I'd like to head out in a minute. Do you have enough stuff to look at for now?
[17:45:44] Yes, halfak. But if you have other relevant links, please share.
[17:46:16] Regretfully, I'm just at the beginning of my exploration of this method. Will be interested in seeing your notes :)
[17:46:18] o/
[17:46:44] o/
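A hypothetical sketch of the experiment discussed above: hash-vectorize the text before and after an edit with sklearn's HashingVectorizer, use the difference of the two vectors as features, and fit an XGBoost classifier. The toy edits, n_features value, and XGBoost parameters are placeholders and are not part of editquality or revscoring.

    from sklearn.feature_extraction.text import HashingVectorizer
    from xgboost import XGBClassifier

    # 2**20 features and unigrams-through-trigrams are arbitrary starting points.
    vectorizer = HashingVectorizer(n_features=2 ** 20, ngram_range=(1, 3))

    # Placeholder (parent_text, edited_text, reverted) triples; real data would
    # come from labeled edits.
    edits = [
        ("the walrus was a bird", "the walrus was a bird [citation needed]", 0),
        ("neutral article text", "BUY CHEAP PILLS NOW", 1),
    ]

    before = vectorizer.transform(parent for parent, _, _ in edits)
    after = vectorizer.transform(edited for _, edited, _ in edits)
    X = after - before  # sparse "diff" of the before/after hash vectors
    y = [reverted for _, _, reverted in edits]

    model = XGBClassifier(n_estimators=100, max_depth=4)
    model.fit(X, y)
    print(model.predict(X))

Taking the difference keeps the feature matrix sparse, which may help keep the feature explosion manageable in the first round of tests.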