[00:04:49] wiki-ai/revscoring#316 (travis_shuddup - c2527ff : halfak): The build was fixed. https://travis-ci.org/wiki-ai/revscoring/builds/90211763
[00:08:08] halfak it is something I need to figure out
[00:08:25] wiki-ai/revscoring#318 (master - b63070b : Aaron Halfaker): The build passed. https://travis-ci.org/wiki-ai/revscoring/builds/90212602
[00:08:46] STFU WTF? BBQ travis.
[00:18:36] \o/
[00:27:21] halfak: o/
[01:44:24] wiki-ai/wb-vandalism#77 (travis - 94da75a : amir): The build has errored. https://travis-ci.org/wiki-ai/wb-vandalism/builds/90224644
[01:50:44] halfak: hey around?
[01:53:48] wiki-ai/wb-vandalism#80 (travis - d83b2a2 : amir): The build passed. https://travis-ci.org/wiki-ai/wb-vandalism/builds/90225446
[01:54:15] \o/
[01:58:08] wiki-ai/wb-vandalism#82 (master - df7a0d0 : Amir Sarabadani): The build passed. https://travis-ci.org/wiki-ai/wb-vandalism/builds/90225860
[03:11:28] http://www.wired.com/2015/11/google-open-sources-its-artificial-intelligence-engine/
[03:12:47] \o/
[03:12:52] Hey Amir1 :)
[03:13:04] Hey
[03:13:18] I fixed wb-vandalism travis too
[03:13:20] :)
[03:13:23] thanks to you
[03:13:48] Woot! Yeah. I was hoping to get to both, but I had to run away.
[03:14:15] Did you copy my changes to travis.yml in revscoring?
[03:15:49] Anyway, I have to run away again. I'm glad it worked out. Have a good one!
[03:15:50] o/
[03:16:08] :(
[05:34:37] wait, halfak responded
[05:34:48] halfak, is that thing happening actually a big deal?
[13:58:18] HareJ, regarding your question about Google TensorFlow, maybe. There are a lot of deep/recurrent NN libraries/frameworks. Google's might be better.
[14:01:14] I wouldn't say that it is some sort of breakthrough and I don't think we'll be digging into it soon on the revscoring project. However, someone who is working with NNs right now would probably want to scope this out, and I'd like to read a summary of what they find.
[17:23:59] Woo! We fixed *something* with idwiki and are now getting 93.4 AUC :D
[17:45:50] o/ Krinkle
[17:45:58] heya
[17:46:00] What is AUC
[17:46:09] We're getting 93.3 :D
[17:46:17] Which is pretty darn good.
[17:46:19] It's dropped already?
[17:46:20] Better than enwiki
[17:46:40] It's not online yet. I'm still building a big new set of models.
[17:46:56] We'll be deploying for 13 languages + Wikidata in one shot
[17:47:14] Oh! Sorry. I was talking about nlwiki right now
[17:47:34] And when you said AUC, I thought you were asking for the stat -- not the meaning of it.
[17:47:38] * halfak gets article
[17:47:54] https://en.wikipedia.org/wiki/Receiver_operating_characteristic
[17:47:58] Krinkle: No, lol, what do I know
[17:48:28] I would guess, since you're mentioning it in this channel, that it has to do with how accurate ores is retroactively compared to real reverts?
[17:48:41] AUC is the area under the receiver operating characteristic curve -- a plot of false positives vs. false negatives with varying sensitivity.
[17:48:55] We use AUC as our primary measurement of the "signal" that the classifier has.
[17:49:29] It's a good way of asking "How good is this classifier at sorting edits by how likely they are to be damaging?"
[17:49:33] ^?
[17:49:34] Hm.. do you measure only against new edits and their revert status? Otherwise it could be biased, as it has already seen other edits in the creation of its model.
[17:49:47] But what do you use as comparison
[17:49:51] actual revert status?
[17:49:52] We withhold a random test set
[17:49:56] Revert status, yes.
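A minimal sketch of the evaluation setup described above: withhold a random test set, use revert status as the label, and compute ROC-AUC from the model's probabilities. This assumes scikit-learn with synthetic stand-in data; it is not revscoring's actual API.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in data: one feature row per edit; the label is True if the edit
# was reverted. weights=[0.97] gives a ~3% positive (reverted) rate.
X, y = make_classification(n_samples=20000, weights=[0.97], random_state=0)

# Withhold a random test set, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)

# AUC is computed from the predicted *probabilities*, not the hard
# predictions: it measures how well the model sorts edits by how likely
# they are to be damaging.
scores = model.predict_proba(X_test)[:, 1]
print("ROC-AUC: %.3f" % roc_auc_score(y_test, scores))
```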
[17:50:13] We limit it, so the revert must happen within 48 hours, within 3 edits, and by someone else.
[17:50:22] This removes a lot of self-reverts.
[17:50:26] Right
[17:50:53] so the objective (within the context of that metric) is that ORES must predict when that happens, based on its training model, which is based on the sample set of edits from the past.
[17:51:04] That's right
[17:51:58] So what have you done for nlwiki so far? (asking from the angle of, I want to know everything you did, I'm not impatient)
[17:53:08] We've built a set of features specific to Dutch. We've extracted a random sample of reverted/not-reverted edits from the last year. We've trained a model to predict which edits will be reverted and shown that it works pretty well.
[17:53:18] We'll soon be deploying this model to ORES so that people can use it.
[17:53:35] We'll be looking to have someone help us make an announcement to nlwiki.
[17:54:04] Depending on YuviPanda's availability, we might not be able to get this into production until the weekend/next week.
[17:54:19] But people will be able to test on our staging server by the end of the day.
[17:55:47] halfak: I can schedule time on monday
[17:55:49] err
[17:55:51] thursday
[17:55:56] That'd work for me.
[17:56:13] Assuming all goes well, I'll have it running on staging by the end of the day.
[17:56:29] BTW, YuviPanda we have new dependencies on myspell packages.
[17:56:41] Is there a puppet thingie you can point me to?
[17:56:42] halfak: ok
[17:56:52] halfak: yeah was just going to do that
[17:57:00] halfak: in the 'operations/puppet' repo in gerrit
[17:57:09] you can find it under 'modules/ores/manifests'
[17:57:15] look around and you'll find a package list
[17:57:20] just add it to that and submit a patch
[17:57:22] OK. I'll try to make the change myself. I need to start doing this anyway.
[17:57:24] halfak: OK. I'm still very new to all this, so forgive my questions :)
[17:57:37] Not a problem Krinkle :)
[17:57:50] Happy to have you here asking for stuff :D
[17:57:54] halfak: So the badword list, where does that come into play? Is that used by ores, or used by the process that creates the model for ores, or is it used to populate wikilabels?
[17:58:19] And the process you used to train the model, where can I learn more about that?
[17:58:40] Krinkle, it's all about the "features" used to detect damage.
[17:59:01] Krinkle, depends on what level you want to come in on. I have a Makefile that will show you the process step-by-step :)
[17:59:27] ORES uses the badwords to extract "features" (really, statistics about the change made in an edit).
[17:59:44] These features are used both to train the model and to apply it to new data.
[18:00:18] wiki-ai/wikilabels, that is used for manual training only?
[18:00:19] See https://github.com/wiki-ai/editquality/blob/master/Makefile
[18:01:01] Krinkle, yes, that's right. With that, we're showing people a sample of items (here, edits) and asking them to use their judgement so that we can train the model to replicate that judgement.
[18:01:09] In wikilabels ^
[18:01:24] The feature extraction part (building a model and using it) comes afterwards.
[18:01:26] Does that feed into ores (in)directly in some way?
[18:01:38] Krinkle, right now, model building is somewhat manual.
[18:01:45] Because I assume for nlwiki, wikilabels hasn't been used much there
[18:01:58] So the model is mostly based on automated training.
[18:02:02] But if you look in that Makefile, you can see that we use wikilabels' API to gather training sets.
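Roughly what the badword "features" amount to, as a sketch: the word list becomes numeric statistics about the change an edit makes, and the same extraction feeds both training and scoring. The word list and function here are hypothetical; revscoring's real feature system uses dependency injection (see the Makefile and feature lists linked above).

```python
import re

# Hypothetical mini badword list; the real ones live in the per-wiki
# feature_lists modules (plus the shared English list).
BADWORDS = re.compile(r"\b(?:stupid|sucks|poop)\b", re.IGNORECASE)

def extract_features(parent_text, current_text):
    """Turn an edit (parent revision -> current revision) into numbers."""
    chars_added = len(current_text) - len(parent_text)
    badwords_delta = (len(BADWORDS.findall(current_text))
                      - len(BADWORDS.findall(parent_text)))
    # Not a boolean trigger: just two columns among many that the
    # trainer weighs against the labels.
    return [chars_added, badwords_delta]

# The same vector is used to train the model and, later, to score new
# revisions as they come in.
print(extract_features("A fine article.", "A fine article. Also it sucks."))
```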
[18:02:10] Which I gather is rather impressively accurate, so no complaint there :)
[18:02:22] Krinkle, right. I have the data to load into wikilabels for nlwiki, but I haven't gotten to it yet.
[18:02:34] Right now, I am the SPOF for too many aspects of revscoring.
[18:03:06] Ah, I see. So the model building is always automated; the difference is where the revisions come from (from wikilabels, or from sampling just any reverted revisions as described earlier).
[18:03:31] But once it has the set of revisions and what their (whether or not inferred) label is, then it goes auto from there
[18:03:47] to build the model and make it usable by ores to judge new revisions
[18:03:51] score
[18:04:00] (I know you don't like 'judge')
[18:04:22] Don't judge halfak
[18:04:23] Krinkle, that's right.
[18:04:25] :D
[18:04:33] "Strong suggestions" ;)
[18:04:44] Yeah
[18:05:06] As far as ORES is concerned, machine learning trainers are black boxes.
[18:05:16] Right
[18:05:17] We evaluate their effectiveness, but not how they go about doing their work.
[18:05:56] So the badword list confuses me somewhat.
[18:06:33] If it uses wikilabels and revert status when training the model, how does the badword list come in? And presumably, once in ores (which also uses the same badword list), I assume there it doesn't use this as a boolean trigger to assume damage?
[18:07:01] Here. Let's look at how we use it for nlwiki
[18:07:04] * halfak gets a link
[18:07:18] https://github.com/wiki-ai/editquality/blob/master/editquality/feature_lists/nlwiki.py
[18:08:03] So, this file looks like it is doing a lot of simple math. Interpret that math in a way that is intuitive to you, but know that there's a dependency injection system that translates this into coherent goodness for extracting from the API.
[18:08:54] Note that we look for English badwords in every wiki because they show up in every wiki.
[18:09:20] Hm.. this is used to create the model with the sample revisions, or this is used at runtime?
[18:09:44] Both
[18:09:58] The same feature list is extracted to train the model, test it, and then apply it to new revisions
[18:10:28] Ah, right.
[18:10:34] So the model takes an array of values that correspond to this list of features, and out comes a prediction.
[18:10:51] So the array contains all the different factors the model will use to try to correlate similar intent, essentially?
[18:11:01] Yes
[18:11:41] There are lots of strategies. All the way from naive Bayes (match each feature to a normal distribution and figure out how it correlates with the outcome, stack probabilities)
[18:12:08] To support-vector machines that use distances in a multi-dimensional feature space to detect thresholds.
[18:12:12] Can you give an example of how these factors play together? For example (looking at the enwiki feature list), would it be able to correctly predict an edit is good if it uses a bad word in a talk namespace, if past edits show that that is commonly accepted, but not in the content namespace?
[18:12:35] (since it includes page_is_content_namespace in the array)
[18:12:54] Krinkle, we make predictions across all namespaces and rely on the "content_namespace" feature being False to do that.
[18:13:18] A naive Bayes classifier wouldn't be able to do that interaction, but an SVM should be able to do it.
[18:17:04] Krinkle, You can think of the training process as something like this:
[18:17:05] "Given this set of 20k edits [e.g. from wikilabels], and for each one
[18:17:05] 1. the number of bytes added
[18:17:05] 2. the number of badwords removed
[18:17:05] ...
[18:17:07] 15. the number of misspellings removed
[18:17:09] and also a 'label' indicating which ones are damaging the article,
[18:17:11] LEARN which features are present in a typical 'damaging' edit"
[18:17:41] halfak: Right. Bayes would wrongly attribute damage intent to the namespace, words etc. directly, not the combination of them.
[18:18:15] Then the model learned (by some blackbox process) will be added to the ores server
[18:18:15] and be able to answer questions like this:
[18:18:15] "Given this recent edit which
[18:18:15] 1. added N bytes
[18:18:15] 2. removed M badwords
[18:18:16] ...
[18:18:18] 15. removed X misspellings
[18:18:22] do you think it is damaging the article?"
[18:21:32] Krinkle, right now, our evaluation strategies are simple, so we don't really know if we're doing a good job in the talk namespaces. We do know that we're doing a good job overall.
[18:21:46] We're working on new ways to evaluate the classifiers.
[18:21:58] One way to do this is subsets.
[18:22:02] E.g. edits to talk pages.
[18:22:15] I also want to look critically at edits by anons and newcomers.
[18:22:19] halfak: Sure, I wasn't asking whether we know it is doing well in a particular namespace. I understand.
[18:22:39] I meant, is it able to understand that one of the many factors can be bad, but only bad if one of the other factors has a certain value.
[18:23:02] I think so
[18:23:07] e.g. an added badword 'shit' is considered damaging if content_ns=True, but considered okay if content_ns=False
[18:23:16] (if that is what the sample data told us)
[18:23:22] Yeah. That's where an SVM should be able to do that -- assuming the correlation of those two variables is common enough and carries enough unambiguous signal in the training set.
[18:23:28] Yeah
[18:24:22] especially given that we rarely revert anything in the talk namespaces (maybe revdelete or block, but not revert, and none of the bots should be reverting there, though typically things like ClueBot have their own restrictions to make sure of that)
[18:24:43] OK. I've got enough for the moment. This has been very useful.
[18:25:39] halfak, did I miss something or are we not using the namespace as a feature for ptwiki (anymore)?
[18:25:39] https://github.com/wiki-ai/ores-wikimedia-config/blob/master/feature_lists/ptwiki.py
[18:26:19] I only see "page.is_content_namespace" for enwiki
[18:26:59] Cool. Krinkle, would you be interested in helping us with the nlwiki announcement next week?
[18:27:01] halfak: One last thing, back to enwiki for a minute, where things are a bit further along. Is there any talk with ClueBot (and perhaps other bots, if any) about making use of revscoring in favour of its own model? It wouldn't be very valuable in the short term as people consider ClueBot quite effective, but the motivation is to make sure ClueBot (or its
[18:27:02] replacement) is in a state where we can also enable it on other wikis. Several wikis are quite eager to finally get a ClueBot for non-English non-Wikipedia wikis, since traditionally ClueBot has proven difficult to set up for other wikis, either due to design or resources.
[18:27:31] Helder, see enwiki.damaging + [...] in the beginning.
[18:27:55] hah
[18:27:58] that is it :)
[18:28:06] Since English-language vandalism is so prolific, I've been using "enwiki's feature set" + "language-specific features for this wiki" as my standard pattern.
[18:28:28] I am not optimizing features. I'm really just making sure we have a good model. Someone could do a lot with about 5 hours of work on a per-wiki basis.
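A toy illustration of the interaction point from the exchange above (naive Bayes vs. SVM): a label that depends only on the *combination* of two features (XOR-style, like "a badword is damaging in one namespace but fine in another") defeats naive Bayes, while an RBF SVM picks it up. Purely synthetic data, not ORES's actual features.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two binary-ish features; the label depends only on their combination.
X = rng.integers(0, 2, (4000, 2)) + rng.normal(0, 0.1, (4000, 2))
y = (X[:, 0] > 0.5) ^ (X[:, 1] > 0.5)

train, test = slice(0, 3000), slice(3000, None)
for model in (GaussianNB(), SVC()):
    model.fit(X[train], y[train])
    print(type(model).__name__, model.score(X[test], y[test]))
# GaussianNB scores ~0.5 (each feature is uninformative on its own);
# SVC scores ~1.0 because it can represent the interaction.
```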
[18:29:17] no problem, I was just checking :)
[18:29:17] Krinkle, I've not been able to get a response from the ClueBot folks. Though, in fairness, I haven't tried recently.
[18:29:29] Helder, :)
[18:29:50] Krinkle, I'm not sure that our model is better than ClueBot's yet.
[18:30:05] We need to do an analysis of recall with a set level of precision
[18:30:09] halfak: I'd be happy to help with the announcement and to also advocate it a bit.
[18:30:24] Once it is in ores, it will automatically be picked up by RTRC, of which nlwiki is the single biggest consumer.
[18:30:29] E.g. with the max false-positive rate set at 2%, how many positives can we catch?
[18:30:50] Krinkle, cool! Good to know. :)
[18:30:53] It fetches an array of dbnames from http://ores.wmflabs.org/scores/ before deciding whether to query it.
[18:30:58] https://tools.wmflabs.org/usage/?action=usage&group=Krinkle
[18:31:08] Krinkle, you can install https://github.com/he7d3r/mw-gadget-ScoredRevisions and it will start coloring nlwiki once the model is set up on ores
[18:31:33] (e.g. in the watchlist and recent changes)
[18:32:09] halfak: Yeah, it'd be interesting to compare ClueBot's AUC if possible. If anything, just as a research detail of its own.
[18:32:24] And further, I can help to try and poke their creators, and perhaps they can help improve revscoring's ability.
[18:32:35] I'm sure there are a lot of valuable and salvageable lessons learned from them.
[18:32:41] +1 Krinkle
[18:33:08] I spoke with one of the creators a while back in 2011, I think you'd like him. He has a similar background in statistics and analytical research and all that stuff I don't understand :)
[18:33:55] I run into the problem of "assumed quackhood" a lot.
[18:34:08] I get this when contacting other researchers, Jimmy, etc.
[18:34:41] So I send an email and (I suspect) it gets lumped in with the "ClueBot is going to become skynet!" emails.
[18:35:08] This happened?
[18:36:00] Well, I send an email and get no response -- try a talk page posting -- no response.
[18:36:01] as for assumed quackhood, I appreciate your ability to break it down for me. Certainly makes me feel it's a lot more accessible than I thought, but make no mistake, this is a complicated matter and you're very good at it :)
[18:36:37] Na. All engineering is complicated. Anyway, I'm glad that it seemed simple in explanation. That's the sign of good communication. :D
[18:37:13] halfak: k, gotta go. Please send me links to some of your communications, I'll see what I can do to get things moving.
[18:37:25] Will do. :)
[18:39:09] did that blog post happen already?
[18:40:07] Helder, it did not.
[18:40:18] I'm still waiting for the comms people to help me with my communication style :D
[18:40:27] They are supposed to take a pass on the intro and get back to me.
[18:40:30] ah ok
[18:40:34] This is a good reminder to ping them.
[18:40:40] Did I share the draft with you?
[18:41:06] I don't think so... I think I saw a link to it somewhere, but then I didn't have access
[18:41:54] OK. I'll go fix that.
[18:43:10] {{done}}
[18:44:05] Helder, here's my personal version of the blog post: http://socio-technologist.blogspot.com/2015/10/ores-hacking-social-structures-by.html
[18:44:24] It's a more coherent read than the document as it is.
[18:45:11] hey, could you comment on this idea: https://phabricator.wikimedia.org/T118302
[18:45:25] halfak, ^
[18:45:35] I was thinking about this today
[18:45:57] Yeah. I'll comment quick. Thanks for pointing to it.
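Circling back to the analysis halfak proposed above ("max false-positive rate set at 2%, how many positives can we catch?"), one way to read that off the ROC curve, as a sketch. The labels and scores here are synthetic stand-ins for a held-out test set like the one in the earlier sketch.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
# Stand-ins: true revert labels (~3% positive) and model scores that
# carry some signal but are far from perfect.
y_test = rng.random(20000) < 0.03
scores = np.clip(y_test * 0.4 + rng.random(20000) * 0.6, 0, 1)

fpr, tpr, thresholds = roc_curve(y_test, scores)

# Largest score threshold whose false-positive rate stays within budget.
max_fpr = 0.02
i = np.searchsorted(fpr, max_fpr, side="right") - 1
print("At FPR <= %.0f%%: score threshold %.3f catches %.1f%% of damaging edits"
      % (max_fpr * 100, thresholds[i], tpr[i] * 100))
```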
[18:46:52] what is a "shoe-string team"?
[18:46:55] heh
[18:47:13] "shoe-string" is a colloquialism suggesting something has a minimal budget
[18:47:47] "A small sum of money; capital that is barely adequate" http://www.thefreedictionary.com/shoestring
[18:49:24] ok thanks
[18:55:05] * halfak runs off for lunch
[18:55:07] back in a bit
[19:07:03] it was a nice read :)
[20:40:43] Hm.. /me tries to linkbait-ify that title.
[20:40:44] Machine learning software that classifies quality of Wikipedia edits can improve social structures
[20:41:55] :D
[20:42:08] YuviPanda, https://gerrit.wikimedia.org/r/252277 <-- I think I did OK.
[21:14:59] halfak, can we have some kind of disambiguation text shown in the ROC-AUC stats? E.g.:
[21:14:59] False True (<- actual)
[21:15:00] -- ------- ------
[21:15:00] 0 119 709
[21:15:00] 1 20 3134
[21:15:01] ( ^-- predicted )
[21:15:13] +1
[21:15:16] I always forget if the predictions are in the columns or the rows
[21:15:21] Been meaning to do that.
[21:15:32] also, 0 vs False and 1 vs True
[21:16:40] https://github.com/wiki-ai/revscoring/blob/master/revscoring/scorer_models/scorer_model.py#L320
[21:17:11] Helder, yeah. I don't know why that happens. I'm guessing it has something to do with numpy's True and False not actually being the same thing as Python's standard True and False.
[21:29:35] Well..... that went smoothly.
[21:29:47] ores-staging now has all the newest models.
[21:32:39] halfak: I see you got it merged too
[21:32:41] halfak: congrats
[21:37:09] halfak: I hope I didn't get to this page via something you initiated, but have you heard of Google TensorFlow? It's something they open-sourced recently, supposedly the world's best machine learning, now open source.
[21:37:16] C++ code with Python bindings
[21:37:26] http://www.tensorflow.org/
[21:57:37] Yeah. I saw that. Looks like another NN library.
[21:57:47] Not something we're looking to pick up right away.
[21:57:55] This is newsworthy because it is Google.
[21:58:25] But it might be pretty good too.
[21:59:35] halfak: so I think their deal is that they want it to be a 'runs on all the things' thing, from mobile apps to large clusters
[21:59:47] if they do get the 'large cluster' thing right this will be a good deal
[21:59:57] since it can do the scaling stuff for you
[22:00:00] and sharding and what not
[22:00:01] Not sure I want to run any additional CPU-intensive stuff on my phone.
[22:00:09] halfak: right, so it's running on the GPU
[22:00:12] and it's not CPU intensive
[22:00:16] That either
[22:00:17] which seems like the important thing about it
[22:00:25] halfak: it's actually super battery efficient these days
[22:00:41] Gotcha
[22:00:48] but anyway, the premise is super promising
[22:01:02] ofc, I know nothing of the actual thing it is doing to be useful :)
[22:01:28] but a 'write once and run in different sizes, from phone to one server to a cluster of servers' thing will be nice
[22:02:00] +1 Will need to expand the revscoring project beyond off-the-shelf models to take advantage of that.
[22:02:09] Know anyone who wants to do that for an internship or IEG?
[22:02:17] halfak: I don't think it's relevant for revscoring tbh
[22:02:34] well, the scaling aspects at least don't seem useful for revscoring :)
[22:03:20] Yeah
[22:21:04] halfak: Hm.. how does accuracy differ from AUC? I know it's different but I'm trying to understand. Is it that AUC only includes false positives, rather than false negatives, or something?
[22:21:14] RE: https://phabricator.wikimedia.org/T116937#1797319
[22:21:51] e.g. 9% of responses were different from the actual revert status after the fact (per 0.91 accuracy).
[22:21:55] Krinkle, accuracy is the proportion of classifications that are right. If we just predict False all the time, we'll have ~97% accuracy
[22:22:24] Yeah
[22:22:38] But how does ROC/AUC translate back to classifications?
[22:23:23] ROC is a measure of signal. AUC essentially tells us "how good is this at sorting things by the probability that they are damaging or not"
[22:23:44] The cleaner the threshold between damaging and not in that sorted list, the higher the AUC
[22:24:07] Still feels vague to me, I'm missing some part of the puzzle.
[22:24:59] Yeah. AUC is really just a metric that is hard for us to cheat. That's a big reason why I like it.
[22:25:01] When you say 94% AUC, you mean on a ROC graph, 94% of X data points is under the curve of Y
[22:25:07] Yes
[22:25:08] But I'm not sure I know the X and Y axes in this context.
[22:25:30] Let's say we set the classifier at a particular probability
[22:25:51] At 0% we'll have 100% false positives and no false negatives.
[22:26:10] At 100%, we'll have no false positives and 100% false negatives.
[22:27:09] Hm.. so AUC is the average response as a percentage of the average of real reverts?
[22:28:41] Check out the first sentence of https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve
[22:31:07] Hm.. so this requires that the formula knows which side of the curve you want to prefer.
[22:31:25] It turns out it doesn't :)
[22:31:30] We get to make that choice later
[22:31:40] But you're right to note that the curve shape matters.
[22:31:45] AUC doesn't capture the shape
[22:32:29] So if ORES would predict randomly based on how often it knows something is reverted (e.g. 3% / 97% per your earlier stat).
[22:32:43] Then Accuracy would presumably be ~ 0.5
[22:32:57] .... err AUC would be .5
[22:33:16] or maybe lower, since it's harder than 1/2 to hit 3% out of 100%
[22:33:20] Accuracy would be ... um ... I have to look at the raw data and do some math.
[22:33:23] Yeah
[22:33:26] Probably lower
[22:33:31] OK. I get it.
[22:33:52] AUC is essentially the accuracy without the influence of how rare a feature is in general.
[22:34:24] Does that make sense?
[22:34:34] Hmmm... Not sure if that's quite right, but the statistics should generally move in the same direction assuming you're making an informed prediction.
[22:36:43] halfak: If the revert rate was 50% instead of 3%, then accuracy and AUC of a random response would both be 0.5, right?
[22:36:55] OH. Yes.
[22:37:49] So in my limited understanding, I'm hoping that considering AUC to be kind of like accuracy, but scaled to the scope within the range of how rare the feature is (3% in this case), feels kind of right. But maybe it's just a coincidence.
[22:38:11] Krinkle, that seems like a fine way of thinking about it to me.
[22:38:55] Make the AUC go up. Be skeptical of accuracy. Neither will tell you what proportion of edits you could revert if you were ClueBot NG directly.
[22:39:03] But the ROC curve will help you work that out.
[22:40:16] So the one variable missing here for me is the reference curve. The one ORES's response data points are usually under. That curve is what? e.g. If AUC is 1.0, what is ORES doing?
[22:40:41] (aside from cheating)
[22:45:01] If AUC is 1.0, ORES is right all the time (in the training set)
[22:45:05] *test set
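A toy version of the closing point, assuming the ~3% revert rate mentioned above: an always-False "classifier" looks great on accuracy, while AUC correctly reports that a no-signal scorer sits at 0.5 regardless of how rare the positive class is.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y = rng.random(100000) < 0.03     # ~3% of edits actually get reverted

# "Classifier" that always predicts not-damaging: ~97% accurate.
print("accuracy:", accuracy_score(y, np.zeros_like(y)))

# Random scores carry no signal; AUC stays at ~0.5 no matter the class
# balance. An AUC of 1.0 would mean the test-set ranking is perfect:
# every damaging edit scored above every non-damaging one.
print("AUC:", roc_auc_score(y, rng.random(len(y))))
```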