[07:29:47] <wikibugs>	 06Revision-Scoring-As-A-Service, 10Wikilabels: Build a web app to show progress of wikilabels campaigns - https://phabricator.wikimedia.org/T139874#2445161 (10Ladsgroup)
[07:31:59] <wikibugs>	 06Revision-Scoring-As-A-Service, 10Wikilabels: Autlabel Azeri damaging campaign - https://phabricator.wikimedia.org/T139875#2445175 (10Ladsgroup)
[07:32:16] <wikibugs>	 06Revision-Scoring-As-A-Service, 10Wikilabels: Autolabel Azeri damaging campaign - https://phabricator.wikimedia.org/T139875#2445189 (10Ladsgroup)
[10:27:55] <wikibugs>	 10Revision-Scoring-As-A-Service-Backlog, 10MediaWiki-extensions-ORES: Build an entry point to store scores in ORES extension - https://phabricator.wikimedia.org/T131785#2445328 (10Ladsgroup) >>! In T131785#2440499, @Legoktm wrote: > What is the use case for this?  For example when someone wants to check user c...
[13:43:25] <wikibugs>	 06Revision-Scoring-As-A-Service, 06Community-Tech, 10CopyVio-tools: CopyPatrol should show ORES scores - https://phabricator.wikimedia.org/T139009#2445543 (10Ladsgroup)
[13:54:16] <wikibugs>	 06Revision-Scoring-As-A-Service, 06Community-Tech, 10CopyVio-tools: CopyPatrol should show ORES scores - https://phabricator.wikimedia.org/T139009#2445560 (10Ladsgroup) https://github.com/Niharika29/PlagiabotWeb/pull/18
[16:01:23] <halfak>	 o/ sabya_ 
[16:01:30] <sabya_>	 o/ halfak 
[16:02:16] <halfak>	 So, I've been thinking about next steps.
[16:02:33] <sabya_>	 ok.
[16:02:52] <halfak>	 I think that solving the fitness problem should be our top priority. 
[16:03:16] <halfak>	 i.e. it seems that the hash vector features are simply too numerous for the 77 original features to get selected. 
[16:03:39] <halfak>	 So, Justin suggested we try building a sub-model
[16:03:52] <sabya_>	 yes. stacking?
[16:03:54] <halfak>	 ... one that uses just the hash features and then makes a prediction. 
[16:03:56] <halfak>	 Yeah.  That
[16:04:04] <halfak>	 What do you think of going in that direction?
[16:04:15] <sabya_>	 I think that makes sense
[16:05:04] <halfak>	 I agree that it's most likely to be effective.  Do you already have a sense for how you'd build that into a test?
[16:05:15] <sabya_>	 is there a way to predict if a adding one feature will improve the efficiency over those 77 features. based on that that single feature's performance?
[16:05:42] <halfak>	 Hmm... I'm not sure we can do that with the kind of confidence that we'd want. 
[16:06:05] <halfak>	 We'd likely end up overfitting if we use a non-stochastic strategy. 
[16:06:28] <sabya_>	 got it
[16:06:55] <halfak>	 But then again, I think that we could -- in theory -- (1) build a gradient boosting model, (2) read the selected features used and (3) only include those selected features in a secondary model that contains the 77 base features. 
[16:07:16] <halfak>	 This would be an alternative to stacking that would allow us to do feature selection independently. 
[16:07:36] <halfak>	 It won't find "the most useful features" but it's likely to find "many of the more useful features"
[16:08:36] <halfak>	 I bet there's a set of methods we could read about in http://scikit-learn.org/stable/modules/feature_selection.html
[16:08:57] <sabya_>	 regarding #1, you mean, we should build gbc only with hv features?
[16:10:05] <halfak>	 Yes.  ANd look at this! http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel
[16:10:17] <halfak>	 It looks like sklearn already worked out "someone's going to want to do this"
[16:12:38] <halfak>	 Looks like we train a GB model, then pass it to SelectFromModel
[16:13:01] <halfak>	 And then call get_support() to get a mask that tells which hashes from the vector to include. 
[16:13:35] * halfak is really happy to see that is hare-brained idea isn't totally unreasonable. 
[16:14:10] <sabya_>	 :)
[16:15:59] <sabya_>	 how to use the array returned by get_support? 
[16:16:18] <halfak>	 Looks like you'd want to somehow limit a sparse vector with it. 
[16:16:53] <halfak>	 Or sparse matrix? 
[16:17:03] * halfak tries to remember the numpy types that are relevant here
[16:18:31] <halfak>	 http://stackoverflow.com/questions/20080332/slicing-a-scipy-sparse-matrix-using-a-boolean-mask
[16:18:50] <halfak>	 We might want the indexes instead. 
[16:18:58] <halfak>	 A boolean mask could get quite large. 
[16:19:11] <halfak>	 2.1 million-ish columns 
[16:19:20] <halfak>	 We'll probably end up selecting 1k or less
[16:20:23] <halfak>	 1k ints = 1.5% of the in-memory space of 2.1m bools
[16:20:30] <halfak>	 I think...
[16:20:47] <halfak>	 Oh we set the "threshold will be interesting. 
[16:20:54] <halfak>	 *threshold* param that is. 
[16:21:02] <halfak>	 OK.  so I think i have a plan. 
[16:22:11] * halfak goes to https://etherpad.wikimedia.org/p/sparse_features
[16:27:58] <halfak>	 sabya_, I just finished the gist in https://etherpad.wikimedia.org/p/sparse_features
[16:28:00] <halfak>	 What do you think?
[16:28:36] <sabya_>	 I think I most understand it. Except for few steps.
[16:28:44] <sabya_>	 *mostly
[16:29:34] <halfak>	 I think a little bit of exploration will be in order for mixing the matrices/vectors together. 
[16:29:35] <sabya_>	 the plan is to filter high impact features from the very large hashed vectors. use them along with original 77 features
[16:29:42] <halfak>	 +1
[16:29:58] <halfak>	 Setting the threshold for "high impact" will be hard, so the histogram will be important. 
[16:30:11] <halfak>	 Have you used matplotlib or any of the other python plotting libraries?
[16:30:32] <sabya_>	 nope.
[16:31:06] <halfak>	 OK.  When we get to that point, if you can dump a dataset, I can work on it with you. 
[16:31:18] <halfak>	 Or I can help get a code snippet together that'll do the plotting for you. 
[16:31:42] <sabya_>	 sure. 
[16:31:57] <halfak>	 Looks like this would work for us: https://plot.ly/matplotlib/histograms/
[16:32:09] * halfak usually doesn't plot in python 
[16:32:15] <halfak>	 I do the stats and plotting in R
[16:32:20] <halfak>	 and the fun programming in python :) 
[16:32:54] <sabya_>	 :)
[16:32:54] <halfak>	 I'll try to make one in my ipython notebook
[16:33:00] <halfak>	 And we'll see how that goes.  
[16:33:20] <sabya_>	 any ideas/links for how to select the threshold?
[16:33:26] <sabya_>	 from the histogram?
[16:33:56] <halfak>	 My thought right now is that I want to see if there obvious a cluster of high-low fitness
[16:34:09] <halfak>	 So that would mean there would be a *step* between high and low fitness hashes
[16:34:22] <halfak>	 Then we just set the threshold so that the step crosses it. 
[16:34:34] <halfak>	 but if there's no obvious location then we can consider some tradeoffs 
[16:34:50] <halfak>	 E.g. Let's just select the top N where N = 100, 500, 1000 and see how that goes. 
[16:35:13] <sabya_>	 got it.
[16:35:48] <sabya_>	 so the histogram is supposed to give a scientific value for N, rather than us guessing it?
[16:36:00] <halfak>	 Yeah.  Maybe.  If we're lucky. 
[16:36:01] <halfak>	 :D
[16:36:21] <halfak>	 We might have to empirically discover N. 
[16:36:29] <halfak>	 That'll be fun, but more complicated. 
[16:36:42] <halfak>	 Still would be interesting. 
[16:37:54] <sabya_>	 as for me, I am mostly getting it "intuitively" need to "understand" the science behind it.
[16:39:01] <halfak>	 Maybe we can science this anyway.  It would be good experience for you and probably very interesting for others. 
[16:39:16] <halfak>	 We can make some decisions based on how the histogram ends up looking
[16:39:36] <halfak>	 E.g. if there's a step, but it's at 10k features, do we really need all of those hashes included?
[16:39:40] <halfak>	 Science can tell us!
[16:39:42] <halfak>	 :) 
[16:39:59] <sabya_>	 :)
[16:41:20] <sabya_>	 I think we have clear plan ahead. Looks like this could work!
[16:41:48] <halfak>	 \o/
[16:42:29] * halfak compiles matplotlib
[16:42:49] <sabya_>	 also, our first assumption is (hv of revision 2 - hv revision 1) is supposed to give hv only for the words added, modified, or deleted. right?
[16:44:41] <halfak>	 Yeah.  Or at least the hashes that do not change will just have a zero value. 
[16:46:43] <psychoslave>	 Does ORES provide a way to evaluate the readability of a given chunk of text? For example through an API :)
[16:47:52] <halfak>	 psychoslave, we don't have that yet, but it would be possible to set up a readability scorer in the system
[16:48:20] <halfak>	 It wouldn't take much to implement https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests
[16:48:43] <halfak>	 The hardest part would be implementing a sentence parser.
[16:48:54] <halfak>	 And I guess we already have one of those for western languages in `deltas`
[16:49:09] <halfak>	 Oh!  Even better!  I bet there's a python library for this. 
[16:49:22] <halfak>	 psychoslave, do you want to apply this to a wiki page or something more specific?
[16:49:50] <halfak>	 https://pypi.python.org/pypi/readability
[16:50:05] <halfak>	 We could, for example, build a scorer that will score a whole article as well as individual sections. 
[16:51:48] <halfak>	 psychoslave, any interest in hacking on this?
[16:51:48] <sabya_>	 halfak: just curious. what score could be claimed as an improvement on predictability of the current production models?
[16:52:11] <halfak>	 sabya_, I'd like to see the ROC-AUC and PR-AUC 
[16:52:29] <halfak>	 Once we're ready, we can actually do a statistical significance test based on 10-fold cross-validation. 
[16:53:38] <sabya_>	 ok
[16:58:46] <halfak>	 OMG matplotlib is actually installing!
[17:03:15] <sabya_>	 \o/
[17:07:59] <sabya_>	 Oops My experimental instance is gone. I'll need red-download the revisions and prepare the data.
[17:08:21] <sabya_>	 *re-download
[17:08:58] <halfak>	 Darn.  I'm sorry about that.  We did some cleanup recently and weren't sure about a few instances. 
[17:09:04] <halfak>	 I hope that it won't be too much of a pain. 
[17:12:38] <sabya_>	 Nope, but it would take several hours. I'll try to get the gbc by Wed.
[17:13:09] <halfak>	 Gotcha.  I've almost got the histogram code finished. 
[17:13:15] <halfak>	 I'm working on some display issues. 
[17:16:24] <halfak>	 sabya_, https://github.com/wiki-ai/revscoring/blob/master/ipython/hashing_vectorizer.ipynb
[17:16:27] <halfak>	 See the bottom
[17:16:41] <halfak>	 This is the original example I put together that makes a toy classification prediction
[17:16:50] <halfak>	 On the very bottom you can see the feature weight frequency. 
[17:17:23] <halfak>	 It looks like there is obviously a step at 0.001
[17:18:10] <halfak>	 And that cuts out about 95% of the hashes
[17:18:13] <halfak>	 :) 
[17:18:55] <halfak>	 Essentially it's ~ 100 features that end up being useful here :) 
[17:19:36] <sabya_>	 got it
[17:21:47] <wikibugs>	 06Revision-Scoring-As-A-Service, 10revscoring, 07Spike: [Spike] Investigate HashingVectorizer - https://phabricator.wikimedia.org/T128087#2445766 (10Halfak) I just worked with @Sabya to put together these notes about next steps for the work: https://etherpad.wikimedia.org/p/sparse_features   I also updated t...
[17:21:58] <halfak>	 I just added my notes to the card. 
[17:22:11] <halfak>	 So you can find links & the thoughts I posted here later :D
[17:22:35] <sabya_>	 thanks!
[17:23:28] <halfak>	 OK and with that I'm going to head out and run some errands.  
[17:23:38] <halfak>	 Thanks for hacking today, sabya_  :) 
[17:24:09] <halfak>	 Oh!  And psychoslave, if you want to talk more about readability scoring, ping me again later.  I think it could be a good idea :)  
[17:24:19] <sabya_>	 sure :) i'll start on this from tomorrow morning. time for bed! 
[17:24:25] <sabya_>	 o/
[17:24:33] <halfak>	 It'll be a lot easier than training probabilistic prediction models. 
[17:24:40] <halfak>	 Good night sabya_! o/
[19:02:00] <psychoslave>	 halfak: hep, sorry I was away, thank you for your answers and hints
[19:03:50] <psychoslave>	 for the context : https://github.com/psychoslave/catscore http://tools.wmflabs.org/catscore/?cat=Women%20philosophers