[07:29:47] 06Revision-Scoring-As-A-Service, 10Wikilabels: Build a web app to show progress of wikilabels campaigns - https://phabricator.wikimedia.org/T139874#2445161 (10Ladsgroup)
[07:31:59] 06Revision-Scoring-As-A-Service, 10Wikilabels: Autolabel Azeri damaging campaign - https://phabricator.wikimedia.org/T139875#2445175 (10Ladsgroup)
[07:32:16] 06Revision-Scoring-As-A-Service, 10Wikilabels: Autolabel Azeri damaging campaign - https://phabricator.wikimedia.org/T139875#2445189 (10Ladsgroup)
[10:27:55] 10Revision-Scoring-As-A-Service-Backlog, 10MediaWiki-extensions-ORES: Build an entry point to store scores in ORES extension - https://phabricator.wikimedia.org/T131785#2445328 (10Ladsgroup) >>! In T131785#2440499, @Legoktm wrote: > What is the use case for this? For example when someone wants to check user c...
[13:43:25] 06Revision-Scoring-As-A-Service, 06Community-Tech, 10CopyVio-tools: CopyPatrol should show ORES scores - https://phabricator.wikimedia.org/T139009#2445543 (10Ladsgroup)
[13:54:16] 06Revision-Scoring-As-A-Service, 06Community-Tech, 10CopyVio-tools: CopyPatrol should show ORES scores - https://phabricator.wikimedia.org/T139009#2445560 (10Ladsgroup) https://github.com/Niharika29/PlagiabotWeb/pull/18
[16:01:23] o/ sabya_
[16:01:30] o/ halfak
[16:02:16] So, I've been thinking about next steps.
[16:02:33] ok.
[16:02:52] I think that solving the fitness problem should be our top priority.
[16:03:16] i.e. it seems that the hash vector features are simply too numerous for the 77 original features to get selected.
[16:03:39] So, Justin suggested we try building a sub-model
[16:03:52] yes. stacking?
[16:03:54] ... one that uses just the hash features and then makes a prediction.
[16:03:56] Yeah. That.
[16:04:04] What do you think of going in that direction?
[16:04:15] I think that makes sense
[16:05:04] I agree that it's most likely to be effective. Do you already have a sense for how you'd build that into a test?
[16:05:15] is there a way to predict whether adding one feature will improve fitness over those 77 features, based on that single feature's performance?
[16:05:42] Hmm... I'm not sure we can do that with the kind of confidence that we'd want.
[16:06:05] We'd likely end up overfitting if we use a non-stochastic strategy.
[16:06:28] got it
[16:06:55] But then again, I think that we could -- in theory -- (1) build a gradient boosting model, (2) read off the selected features, and (3) only include those selected features in a secondary model that contains the 77 base features.
[16:07:16] This would be an alternative to stacking that would allow us to do feature selection independently.
[16:07:36] It won't find "the most useful features" but it's likely to find "many of the more useful features"
[16:08:36] I bet there's a set of methods we could read about in http://scikit-learn.org/stable/modules/feature_selection.html
[16:08:57] regarding #1, you mean we should build the gbc only with hv features?
[16:10:05] Yes. And look at this! http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel
[16:10:17] It looks like sklearn already worked out "someone's going to want to do this"
[16:12:38] Looks like we train a GB model, then pass it to SelectFromModel
[16:13:01] And then call get_support() to get a mask that tells us which hashes from the vector to include.
[16:13:35] * halfak is really happy to see that his hare-brained idea isn't totally unreasonable.
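For reference, here is a minimal sketch of the pipeline described above (hashed features, a gradient boosting model, then SelectFromModel/get_support()). The texts, labels, and parameter values are hypothetical placeholders, not the actual revscoring setup, and the hash space is shrunk so the toy runs quickly.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectFromModel

# Hypothetical edit texts and damaging / not-damaging labels.
texts = ["reverted obvious vandalism", "asdf asdf!!!", "added a cited sentence"]
labels = [False, True, False]

# The real experiments use roughly 2**21 (about 2.1 million) hash columns;
# a smaller space keeps this toy fast.
hv = HashingVectorizer(n_features=2 ** 16)
X = hv.transform(texts)  # scipy sparse matrix, shape (n_obs, n_features)

# (1) Build a gradient boosting model on the hash features alone.
gbc = GradientBoostingClassifier(n_estimators=100)
gbc.fit(X, labels)

# (2) Read off which hashes the model actually uses. prefit=True tells
# SelectFromModel to reuse the already-fitted model's feature_importances_.
selector = SelectFromModel(gbc, threshold=1e-3, prefit=True)
mask = selector.get_support()                  # boolean mask over all columns
indexes = selector.get_support(indices=True)   # usually far more compact
print(len(indexes), "hash features selected")

# (3) would then combine X[:, indexes] with the 77 base features
# in a secondary model.
```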
[16:14:10] :)
[16:15:59] how do we use the array returned by get_support()?
[16:16:18] Looks like you'd want to somehow limit a sparse vector with it.
[16:16:53] Or sparse matrix?
[16:17:03] * halfak tries to remember the numpy types that are relevant here
[16:18:31] http://stackoverflow.com/questions/20080332/slicing-a-scipy-sparse-matrix-using-a-boolean-mask
[16:18:50] We might want the indexes instead.
[16:18:58] A boolean mask could get quite large.
[16:19:11] 2.1 million-ish columns
[16:19:20] We'll probably end up selecting 1k or less
[16:20:23] 1k ints = 1.5% of the in-memory space of 2.1m bools
[16:20:30] I think...
[16:20:47] Oh, how we set the *threshold* param will be interesting.
[16:21:02] OK. So I think I have a plan.
[16:22:11] * halfak goes to https://etherpad.wikimedia.org/p/sparse_features
[16:27:58] sabya_, I just finished the gist in https://etherpad.wikimedia.org/p/sparse_features
[16:28:00] What do you think?
[16:28:36] I think I mostly understand it. Except for a few steps.
[16:29:34] I think a little bit of exploration will be in order for mixing the matrices/vectors together.
[16:29:35] the plan is to filter high-impact features from the very large hashed vectors and use them along with the original 77 features
[16:29:42] +1
[16:29:58] Setting the threshold for "high impact" will be hard, so the histogram will be important.
[16:30:11] Have you used matplotlib or any of the other python plotting libraries?
[16:30:32] nope.
[16:31:06] OK. When we get to that point, if you can dump a dataset, I can work on it with you.
[16:31:18] Or I can help get a code snippet together that'll do the plotting for you.
[16:31:42] sure.
[16:31:57] Looks like this would work for us: https://plot.ly/matplotlib/histograms/
[16:32:09] * halfak usually doesn't plot in python
[16:32:15] I do the stats and plotting in R
[16:32:20] and the fun programming in python :)
[16:32:54] :)
[16:32:54] I'll try to make one in my ipython notebook
[16:33:00] And we'll see how that goes.
[16:33:20] any ideas/links for how to select the threshold?
[16:33:26] from the histogram?
[16:33:56] My thought right now is that I want to see if there's an obvious cluster of high/low fitness
[16:34:09] So that would mean there would be a *step* between high and low fitness hashes
[16:34:22] Then we just set the threshold so that it falls in the step.
[16:34:34] but if there's no obvious location then we can consider some tradeoffs
[16:34:50] E.g. let's just select the top N where N = 100, 500, 1000 and see how that goes.
[16:35:13] got it.
[16:35:48] so the histogram is supposed to give a scientific value for N, rather than us guessing it?
[16:36:00] Yeah. Maybe. If we're lucky.
[16:36:01] :D
[16:36:21] We might have to empirically discover N.
[16:36:29] That'll be fun, but more complicated.
[16:36:42] Still would be interesting.
[16:37:54] as for me, I am mostly getting it "intuitively"; I need to "understand" the science behind it.
[16:39:01] Maybe we can science this anyway. It would be good experience for you and probably very interesting for others.
[16:39:16] We can make some decisions based on how the histogram ends up looking
[16:39:36] E.g. if there's a step, but it's at 10k features, do we really need all of those hashes included?
[16:39:40] Science can tell us!
[16:39:42] :)
[16:39:59] :)
[16:41:20] I think we have a clear plan ahead. Looks like this could work!
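A rough sketch of the histogram and the top-N fallback discussed above. It continues the earlier snippet (it assumes that `gbc` and the sparse matrix `X` already exist); `N` and `X_base`, a stand-in for the 77 original features, are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix, hstack

importances = gbc.feature_importances_

# Look for a "step" between a small cluster of high-fitness hashes and the
# long tail of useless ones.
plt.hist(importances[importances > 0], bins=50)
plt.xlabel("feature importance")
plt.ylabel("number of hash features")
plt.show()

# If there's no obvious step, fall back to top-N (N = 100, 500, 1000, ...).
# Integer indexes stay small even with ~2.1 million columns, which is why
# they're preferable to a boolean mask here.
N = 1000
top_indexes = np.argsort(importances)[-N:]

# Column-slice the sparse matrix and put the 77 base features back alongside.
X_selected = X.tocsc()[:, top_indexes]
X_base = csr_matrix(np.zeros((X.shape[0], 77)))  # placeholder for the real features
X_combined = hstack([X_base, X_selected]).tocsr()
print(X_combined.shape)
```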
[16:41:48] \o/
[16:42:29] * halfak compiles matplotlib
[16:42:49] also, our first assumption is that (hv of revision 2 - hv of revision 1) is supposed to give the hv only for the words added, modified, or deleted. right?
[16:44:41] Yeah. Or at least the hashes that do not change will just have a zero value.
[16:46:43] Does ORES provide a way to evaluate the readability of a given chunk of text? For example through an API :)
[16:47:52] psychoslave, we don't have that yet, but it would be possible to set up a readability scorer in the system
[16:48:20] It wouldn't take much to implement https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests
[16:48:43] The hardest part would be implementing a sentence parser.
[16:48:54] And I guess we already have one of those for western languages in `deltas`
[16:49:09] Oh! Even better! I bet there's a python library for this.
[16:49:22] psychoslave, do you want to apply this to a wiki page or something more specific?
[16:49:50] https://pypi.python.org/pypi/readability
[16:50:05] We could, for example, build a scorer that will score a whole article as well as individual sections.
[16:51:48] psychoslave, any interest in hacking on this?
[16:51:48] halfak: just curious. what score could be claimed as an improvement in predictive power over the current production models?
[16:52:11] sabya_, I'd like to see the ROC-AUC and PR-AUC
[16:52:29] Once we're ready, we can actually do a statistical significance test based on 10-fold cross-validation.
[16:53:38] ok
[16:58:46] OMG matplotlib is actually installing!
[17:03:15] \o/
[17:07:59] Oops, my experimental instance is gone. I'll need to re-download the revisions and prepare the data.
[17:08:58] Darn. I'm sorry about that. We did some cleanup recently and weren't sure about a few instances.
[17:09:04] I hope that it won't be too much of a pain.
[17:12:38] Nope, but it will take several hours. I'll try to get the gbc by Wed.
[17:13:09] Gotcha. I've almost got the histogram code finished.
[17:13:15] I'm working on some display issues.
[17:16:24] sabya_, https://github.com/wiki-ai/revscoring/blob/master/ipython/hashing_vectorizer.ipynb
[17:16:27] See the bottom
[17:16:41] This is the original example I put together that makes a toy classification prediction
[17:16:50] On the very bottom you can see the feature weight frequency.
[17:17:23] It looks like there is obviously a step at 0.001
[17:18:10] And that cuts out about 95% of the hashes
[17:18:13] :)
[17:18:55] Essentially it's ~100 features that end up being useful here :)
[17:19:36] got it
[17:21:47] 06Revision-Scoring-As-A-Service, 10revscoring, 07Spike: [Spike] Investigate HashingVectorizer - https://phabricator.wikimedia.org/T128087#2445766 (10Halfak) I just worked with @Sabya to put together these notes about next steps for the work: https://etherpad.wikimedia.org/p/sparse_features I also updated t...
[17:21:58] I just added my notes to the card.
[17:22:11] So you can find links & the thoughts I posted here later :D
[17:22:35] thanks!
[17:23:28] OK and with that I'm going to head out and run some errands.
[17:23:38] Thanks for hacking today, sabya_ :)
[17:24:09] Oh! And psychoslave, if you want to talk more about readability scoring, ping me again later. I think it could be a good idea :)
[17:24:19] sure :) i'll start on this from tomorrow morning. time for bed!
[17:24:25] o/
[17:24:33] It'll be a lot easier than training probabilistic prediction models.
[17:24:40] Good night sabya_!
o/
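On the readability side question: a very rough, self-contained sketch of the Flesch reading-ease formula halfak links above. This is not ORES code and not the `readability` package from PyPI; the sentence splitter and syllable counter are deliberately naive, which is exactly the part a real scorer (e.g. the sentence parser in `deltas`) would do better.

```python
import re

def count_syllables(word):
    # Count groups of vowels as syllables; crude but serviceable for a demo.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch reading ease:
    # 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

print(flesch_reading_ease("The cat sat on the mat. It was warm."))
```

A real scorer would likely wrap something like this so it could score a whole article as well as individual sections, as suggested above.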
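And for the model-comparison question: one way to get ROC-AUC and PR-AUC numbers with a 10-fold cross-validation significance check might look like the sketch below. The data is synthetic, `X_old`/`X_new` are hypothetical stand-ins for the current production features and the hash-augmented features, and the paired t-test is just one reasonable choice of test, not necessarily the one the team will use.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-ins: X_new plays the hash-augmented feature set,
# X_old the current production features.
X_new, y = make_classification(n_samples=300, n_features=100, random_state=0)
X_old = X_new[:, :77]

def cv_scores(X, y, scoring):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    return cross_val_score(GradientBoostingClassifier(), X, y,
                           scoring=scoring, cv=cv)

# "average_precision" is sklearn's PR-AUC-style scorer.
for scoring in ("roc_auc", "average_precision"):
    old = cv_scores(X_old, y, scoring)
    new = cv_scores(X_new, y, scoring)
    t, p = ttest_rel(new, old)  # paired test across the same 10 folds
    print(scoring, round(old.mean(), 3), round(new.mean(), 3), "p =", round(p, 3))
```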
[19:02:00] halfak: hey, sorry I was away. thank you for your answers and hints
[19:03:50] for the context: https://github.com/psychoslave/catscore http://tools.wmflabs.org/catscore/?cat=Women%20philosophers