[02:53:02] wiki-ai/revscoring#504 (master - aaeb421 : Amir Sarabadani): The build was broken. https://travis-ci.org/wiki-ai/revscoring/builds/105321444
[04:07:18] o/
[04:07:35] Just hopping online to check things out before I call it a night.
[04:07:54] Hopefully, I have new models with more reasonable accuracy scores.
[04:08:19] Well, it looks like the process finished.
[04:09:04] wikidatawiki.reverted gets 89% with a balanced test set. Not bad. We really need to test this with a representative test set.
[04:09:07] ^ Amir1
[04:09:16] Re. my note about putting together a test set.
[04:09:22] For wikidata.
[04:09:53] I'm hoping that by filtering out bot edits and client edits, we can get a much larger proportion of damaging edits to try to catch.
[04:10:08] I wonder if there is a better way of catching client edits than comment regexes.
[04:10:16] Maybe a change tag?
[04:11:54] PR-AUC is a finicky statistic. I'm not sure it is telling us what we need to know -- or sklearn is computing it weirdly.
[04:15:17] Looks like it behaves weirdly with our goodfaith models too since the "true" class is common.
[04:15:26] We often get a PR-AUC of nearly 1.0.
[04:16:52] yeah... filter rate is weird too when we have "false" as the interesting class.
[04:17:02] We may want to flip that one and call it "badfaith".
[04:19:30] OK. These accuracies look good to me, so I'm going to push this up.
[04:32:26] I just came back, I took a shower
[04:33:51] halfak: Can you send me a list of badly scored edits so I can put it somewhere and check it?
[04:34:27] I highly doubt catching client edits would be possible another way
[04:34:46] I can ask the wikidata team but I really, really doubt that
[04:35:52] maybe we need to read some university textbooks or review articles regarding AI with skewed classes
[04:36:27] I'll do some lit. review
[04:45:20] Amir1, can you look into how we could catch all client edits via comment matching?
[04:45:31] I want to see if we can get this golden test set together soon.
[04:45:39] I think it will form the basis of our stats for the paper.
[04:45:54] I'd like to load it into Wikilabels.
[04:46:40] I'm hoping to look into generating this dataset tomorrow to see if we get a reasonable revert rate.
[04:51:40] If we get within an order of magnitude of enwiki, I think that would be good for running through wikilabels.
[04:51:54] Especially since I suspect that a lot of wikidata damage might not show up as a reverted edit.
[04:55:36] It looks like we already have a couple of the regexes spec'd out here: https://github.com/wiki-ai/editquality/blob/master/editquality/feature_lists/wikidatawiki.py#L33
[04:56:13] If you don't have time, I could probably run a few queries tomorrow to try to get a sense for what variations exist.
[04:59:46] Actually, I think I can do this pretty easily.
[05:00:08] Assuming that all client comments start with "/* client", we should be able to run this query quickly.
[05:00:16] * halfak hacks on that.
[05:00:33] http://quarry.wmflabs.org/query/7098
[05:09:10] Huh. Looks like there are only two client actions: sitelink-update (aka pagemove) and sitelink-delete (aka pagedelete).
[05:09:17] Cool. That will make it easier.
[05:09:45] Amir1, do you think we should include mergeinto and mergefrom?
[05:10:06] sure
[05:10:11] they aren't client actions, but it seems like they should be considered different from regular edits.
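A minimal sketch of the kind of Quarry query discussed above: survey which client-edit "actions" appear by grouping edit comments on their prefix. It assumes, as stated in the chat, that client comments start with "/* client"; the table and column names follow the MediaWiki schema of the time, and this is an illustration, not the actual query behind quarry.wmflabs.org/query/7098.

    -- Count client edit actions by grouping on the comment prefix.
    -- Assumes client comments look like "/* client<action>: ... */".
    SELECT SUBSTRING_INDEX(SUBSTRING(rc_comment, 4), ':', 1) AS client_action,
           COUNT(*) AS edits
    FROM recentchanges
    WHERE rc_comment LIKE '/* client%'
    GROUP BY client_action
    ORDER BY edits DESC;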
[05:10:25] regarding comment matching
[05:11:23] halfak: we should include mergefrom and mergeinto
[05:11:34] these are pretty predictive features
[05:12:00] Could someone do mergefrom and mergeinto as vandalism?
[05:12:07] And would we have any hope of catching that?
[05:12:28] It seems like this is sort of like catching an item-page-deletion as vandalism.
[05:12:40] It could be, but it's not really an edit.
[05:13:02] I'm thinking about what set of edits we want to use to evaluate our classifier.
[05:13:06] hmm
[05:13:10] valid point
[05:13:14] it seems like these might need to be excluded.
[05:13:25] It's much more clear to me that client edits should be excluded.
[05:13:31] merging, I'm not so sure.
[05:14:22] in order to evaluate the classifier, I think mergefrom and mergeinto should be excluded
[05:15:00] OK. Cool. So if there is downtime tomorrow, I'm going to write a database query to gather a random sample of edits that are (1) not bots, (2) not client edits and (3) not merges.
[05:15:09] And then run those through the revert detector.
[05:15:13] And see what we get.
[05:15:32] If the proportion is not 0.1% or less, we can draw our test set from it.
[05:16:16] When it comes to learning our filter rate, we should keep in mind all of the "edits" that show up on the recent changes page that we are excluding.
[05:16:28] but there's no reason that we should spend the time to *label* them.
[05:16:40] one thing: set aside all of the merges
[05:16:51] let's see if we can find any vandalism in them
[05:16:53] Want to assess them separately?
[05:16:55] Sure!
[05:17:24] sorry, at first I wanted to say "they should not be excluded"
[05:17:34] Maybe I'll draw a complete sample and add some fields to flag the nature of each edit.
[05:17:48] That way, we can explore subsets ad hoc before loading them into wikilabels.
[05:26:11] halfak: please see this comment: https://phabricator.wikimedia.org/T123795#1970561
[05:26:24] Oh yeah.
[05:26:40] So varchar, unlike char, uses a variable-length storage format
[05:26:43] Right now I'm trying to estimate the db size
[05:26:47] hence the "var"
[05:27:11] I see
[05:27:15] There's a table here that gives you storage sizes: http://dev.mysql.com/doc/refman/5.7/en/char.html
[05:27:42] It takes 1 byte + the number of chars you want to store
[05:27:51] so 'a' takes two bytes
[05:28:00] 'ab' takes three.
[05:28:28] So, if I were to back-of-the-envelope this for English Wikipedia, I'd start with the number of rows in recentchanges.
[05:28:55] Which is currently ~8.1m
[05:28:57] 'true' takes 5 bytes
[05:29:03] Meh. More like 8.2.
[05:29:04] Yeah.
[05:29:16] 'true' takes a substantial number of bytes.
[05:29:29] I'll dig deeper and get results soon
[05:29:35] don't worry about that
[05:29:43] But I have a proposal. Let's have a per-model configuration where we specify what class probability we are interested in.
[05:29:52] 'true' for damaging and 'false' for goodfaith
[05:29:54] but regarding cutting rows with "false" in them
[05:29:55] what do you think? Do you agree?
[05:29:58] And then drop the class column.
[05:30:19] That will allow us to cut the "false" rows.
[05:30:59] I don't think cutting the ores_class column would be a good idea
[05:31:14] Why is that?
[05:31:22] because we will add the wp10 model (and non-binary models) later
[05:31:25] Essentially, we'd have that column in the config.
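As a rough sketch of the sampling query described at 05:15:00, the selection might look something like the following on the wikidatawiki replica. The comment prefixes used to exclude client edits and merges are assumptions drawn from the discussion above (the real patterns live in editquality's wikidatawiki feature list), and the sample size is arbitrary.

    -- Random sample of wikidatawiki edits that are (1) not bots,
    -- (2) not client edits and (3) not merges, to run through the
    -- revert detector.
    SELECT rc_this_oldid AS rev_id
    FROM recentchanges
    WHERE rc_type = 0                                -- ordinary edits only
      AND rc_bot = 0                                 -- (1) no bot edits
      AND rc_comment NOT LIKE '/* client%'           -- (2) no sitelink-update/-delete
      AND rc_comment NOT LIKE '/* wbmergeitems%'     -- (3) no merges (assumed prefix)
    ORDER BY RAND()
    LIMIT 20000;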
[05:31:40] and it would be impossible to work without the class
[05:31:46] +1
[05:31:52] and changing the database schema at that point
[05:31:54] Then again, I'm not sure we should worry about that right now.
[05:31:55] is much harder than now
[05:32:08] Maybe we'll want a different table structure (or memcached or something) for wp10
[05:32:43] we can have another table for non-binary ones
[05:33:02] Yeah. And with wp10, we don't want that table to match the recentchanges table anyway :)
[05:33:16] but the biggest problem right now is that if we don't make a flexible database for wp10 now, it'll come back to us later
[05:33:55] One bit of good news is that we need to build in full-table loading scripts anyway.
[05:33:59] I don't know, but maybe our database will grow and store data for more than one month
[05:34:08] Since every update to ORES will require that we re-generate the ORES tables.
[05:34:29] So if we ever want to do a schema change, that won't be more expensive than updating a model.
[05:34:31] (especially since user contribs should be supported and that comes from the revs table)
[05:34:57] yeah. that's a tough one. but on the other hand, we can rely on ORES' cache to a large extent.
[05:35:06] no, the database is designed to store values for different versions of models too
[05:35:42] "ores_model"?
[05:36:06] for example you can have multiple rows for each revision: one is damaging 1.0.1 and another one is damaging 1.0.2
[05:36:18] We might have a bit of trouble sorting on that column.
[05:36:35] e.g. with ascii sort, 1.10.0 comes before 1.2.0
[05:36:38] yes, oresc_model is a foreign key to oresm_id
[05:37:04] we do something else
[05:37:09] can you link me to the schema SQL quick?
[05:37:20] * halfak should do a better job of reviewing this.
[05:37:27] we set damaging 1.0.1 as the current version
[05:37:31] yeah sure
[05:37:46] I need to get better at working with gerrit.
[05:37:53] Or push people to use phab's review system
[05:38:02] https://github.com/wikimedia/mediawiki-extensions-ORES/blob/master/sql/ores_model.sql
[05:38:30] note the "oresm_is_current"
[05:38:36] I see. We'll use "is_current"
[05:38:56] So, when we run the update script, we flip that flag.
[05:39:13] then we use it in our queries
[05:39:26] (using a left join)
[05:39:28] So we need to join against the ores_model table.
[05:39:36] That's going to be somewhat expensive.
[05:39:48] it'll save us lots of database storage
[05:39:48] But we can probably work the indexes in a nice way.
[05:39:51] +1
[05:39:54] no, I asked
[05:40:09] * halfak is amazed that *storage* space is a serious concern.
[05:40:18] Hoo told me that joining on primary keys is negligible
[05:40:38] he actually suggested these changes
[05:40:47] Yeah.... Try joining the recentchanges table to the user table ;)
[05:41:02] one sec to give you a link
[05:41:06] https://phabricator.wikimedia.org/T124443
[05:41:06] But really, I think that if we have a filter on time and model ID, that will work pretty well.
[05:41:29] Yeah. It's not wrong.
[05:41:34] It'll probably work in practice.
[05:42:18] * halfak tries to imagine the likely query plan and hits the sleep wall.
[05:42:24] the discussion is here: https://gerrit.wikimedia.org/r/#/c/265944/
[05:42:24] I should call it a night pretty soon.
[05:43:29] OK. To summarize: we need to trim the number of rows and the size of columns in the ores_classification table and then come up with an estimate.
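A minimal sketch of the join pattern discussed above: fetch scores for the current version of a model via the oresm_is_current flag, with a filter on time and model. Column names other than oresc_model, oresm_id and oresm_is_current, which are mentioned in the chat or the linked schema file, are assumptions about the extension's schema rather than quotes from it.

    -- Fetch recent "damaging" scores from the current model version only.
    -- The oresm_is_current flag is flipped by the update script when a new
    -- model version is loaded, so older scores can stay in the table.
    SELECT rc_this_oldid AS rev_id, oresc_probability
    FROM recentchanges
    JOIN ores_classification ON oresc_rev = rc_this_oldid
    JOIN ores_model ON oresm_id = oresc_model
    WHERE oresm_name = 'damaging'
      AND oresm_is_current = 1
      AND rc_timestamp > '20160101000000';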
[05:44:13] yes
[05:44:15] We should be able to drop the "false" rows by setting a config variable about which "classes" we want to store and use.
[05:44:15] you go and sleep; I'll have results by the time you wake up
[05:44:39] We're not sure if we want to change the class name varchar to a tinyint
[05:44:45] OK
[05:44:48] * halfak goes to sleep.
[05:45:10] Have a good day!
[05:45:35] I'll give you numbers for each improvement separately, so we can judge which one is feasible
[05:45:39] you too
[05:45:41] o/
[05:45:49] Sounds great.
[05:45:50] o/
[17:50:00] halfak: https://www.wikidata.org/wiki/Wikidata:ORES/List_of_features
[17:50:11] I just wrote that as part of the paper
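For reference, the two space-saving ideas summarized above (before the conversation moves on to the features list) could be sketched roughly as follows. Both statements are illustrative assumptions about the ores_classification table rather than the extension's actual migration, and they are independent options, not a sequence.

    -- Option 1: keep only the rows for the class named in the per-model
    -- config (e.g. 'true' for damaging), dropping the "false" rows.
    DELETE FROM ores_classification
    WHERE oresc_class = 'false';

    -- Option 2: store the class as a small integer index into the model's
    -- class list instead of a varchar name, saving a few bytes per row.
    ALTER TABLE ores_classification
    MODIFY oresc_class TINYINT UNSIGNED NOT NULL DEFAULT 0;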