[00:40:01] Barebones mwdb 0.0.1: https://github.com/mediawiki-utilities/python-mwdb
[00:40:12] And I'm off to enjoy the weekend in other ways.
[00:40:13] o/
[12:11:19] halfak: ping me if you are around :)
[14:07:06] halfak: around?
[14:27:58] o/ Amir1
[14:28:13] o/ halfak :)
[14:28:19] lots of things to discuss
[14:29:01] first of all, there is something we should consider
[14:29:31] halfak: check this out
[14:29:31] https://www.wikidata.org/w/index.php?title=Q879275&action=history
[14:29:33] * halfak reads intently
[14:29:39] Adil2015adil
[14:30:00] is vandalising
[14:30:29] but not in one edit; it's a very common thing in Wikidata, since the interface doesn't let you make a whole change at once
[14:30:59] using JS, people add a sitelink or change one thing per edit (that's why my bot has 23M edits there :D)
[14:31:16] Looks like Sjoerddebruin should have reverted back to the Aug 8th edit.
[14:31:31] yeah, I know that
[14:32:32] he didn't know it was vandalized; that's why he gave me this example
[14:32:51] because it confuses people
[14:33:16] but my point is, it's hard to catch all edits by one user at once
[14:33:19] in revscoring
[14:33:33] Sorry. I'm still confused as to why this is a problem?
[14:33:40] technically it is not hard
[14:33:43] Shouldn't each individual edit score badly?
[14:34:09] each one should get a rather high score, but not that high
[14:34:41] since removing a sitelink is a common, healthy practice in Wikidata, but removing all of the sitelinks is not
[14:35:07] Indeed. This is a limitation of a revision-at-a-time model.
[14:35:24] It would be nice to have a feature for the score of the user's last revision too.
[14:35:43] Or maybe we can stand up good-faith/bad-faith models on top of the scores (my plan)
[14:35:49] E.g. Snuggle
[14:36:03] hmm
[14:36:11] interesting
[14:37:58] let's see how well my model works on Wikidata, and if it's not really good, I'll add features for the user's last revision before the edit
[14:38:22] Amir1, I'd advise two models.
[14:38:35] good-faith/bad-faith?
[14:38:54] Yeah. So one would be "is this edit damage?" and another would be "is this editor bad-faith?"
[14:39:05] dang
[14:39:08] Do you think we have good-faith damaging edits in Wikidata?
[14:39:11] just spilled coffee on my keyboard
[14:39:19] Amir1, not really; just separating concerns.
[14:39:45] because the editing interface is completely straightforward compared to Wikipedia
[14:39:52] If we decide to use the scores later to build a more robust good-faith/bad-faith predictor, we don't want historical info gumming up the signal.
[14:39:55] ok, sure :)
[14:40:22] So, we might have a third classifier: "is this vandalism?"
[14:40:38] I see
[14:40:41] Which would be informed by "is this damage?" and "is this editor probably bad-faith?"
[14:41:10] you're somehow building an ANN
[14:41:18] Kinda, yeah.
[14:41:27] But each of those levels will be useful :)
[14:41:32] big fan of ANNs :D
[14:41:32] In their own way.
[14:41:46] hmm
[14:41:50] Dimensionality reduction is a useful strategy for a lot of problems.
[14:41:52] o/ ToAruShiroiNeko
[14:42:00] Sorry about your keyboard.
[14:42:04] I know how that goes.
[14:42:08] dimensionality reduction is very good for performance
[14:42:19] well, my keyboard's q key wasn't working for a while
[14:42:23] good excuse to dishwash it
[14:42:24] * Amir1 says hello to ToAruShiroiNeko too :)
[14:42:25] Does it work now?
[14:42:27] :D
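A minimal sketch of the layered setup halfak describes above — one model for "is this edit damage?", one for "is this editor bad-faith?", and a third "is this vandalism?" classifier fed by their scores. The scikit-learn estimators and the random feature matrices here are illustrative assumptions, not the actual revscoring pipeline:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical feature matrices and labels; real features would come
# from revscoring's feature extraction, not random numbers.
rng = np.random.RandomState(0)
X_edit = rng.rand(1000, 20)            # per-edit features
X_editor = rng.rand(1000, 10)          # per-editor features
y_damage = rng.randint(0, 2, 1000)     # "is this edit damaging?"
y_badfaith = rng.randint(0, 2, 1000)   # "is this editor bad-faith?"
y_vandalism = y_damage & y_badfaith    # vandalism ~ damaging AND bad-faith

# Level 1: two separate concerns, two separate models.
damage_model = GradientBoostingClassifier().fit(X_edit, y_damage)
badfaith_model = GradientBoostingClassifier().fit(X_editor, y_badfaith)

# Level 2: "is this vandalism?", informed only by the level-1 scores.
level1_scores = np.column_stack([
    damage_model.predict_proba(X_edit)[:, 1],
    badfaith_model.predict_proba(X_editor)[:, 1],
])
vandalism_model = LogisticRegression().fit(level1_scores, y_vandalism)
```

Each level stays useful on its own, and the meta-classifier works from two low-dimensional scores rather than the full feature space — the dimensionality-reduction point made above.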
[14:42:32] another thing: it seems adding the name of a language (like "English", etc.) is a common type of vandalism in Wikidata
[14:42:39] it works fine, it just types something random :p
[14:42:48] I am on my laptop, so no real loss of efficiency
[14:42:52] Amir1, can you link to an example of that easily?
[14:42:55] Do you think it would be a good thing to add as a feature (or features)?
[14:43:00] We should have a library of common types of vandalism.
[14:43:05] of course: https://www.wikidata.org/wiki/Special:AbuseFilter/8
[14:43:15] +1 for adding any type of change that is common to vandalism.
[14:44:01] check the log
[14:45:29] but I would need to add a huge regex for that
[14:46:02] Yeah... Maybe.
[14:46:27] Yeah... I think so.
[14:46:47] It seems like we might want to use that utility to detect language names in the future too.
[14:47:11] a(frikaa?ns|lbanian ...
[14:47:15] Why do people do this?
[14:47:23] It makes the language name unreadable.
[14:47:35] The state machine that the regex builds is the same!
[14:47:37] haha
[14:47:50] I will fix this in my code
[14:47:50] afrikaans|albanian
[14:48:10] I recommend doing something like the badwords detector.
[14:48:26] You give it a list of language/country regexes and it will extract any one.
[14:48:45] See https://github.com/wiki-ai/revscoring/blob/master/revscoring/languages/meta/regex_extractors.py#L10
[14:48:58] I just make a group regex by concatenating the options together.
[14:49:05] It works nicely and it's super easy to maintain.
[14:49:12] Compared to a giant regex.
[14:49:29] halfak: last thing before I go back to work. Do you think it would be good to have a fuzzy system for changes in descriptions, labels, etc.? I see lots of vandalism in Wikidata that is a complete relabeling, e.g. changing the label of someone to "shit"
[14:50:02] Yeah. I was testing out Shilad's new API to see if it would work for that.
[14:50:11] Regretfully, it didn't perform as expected.
[14:50:38] E.g. http://como.macalester.edu/wikibrain/similarity?lang=simple&phrases=Barack%20Obama%20is%20a%20President|Barack%20Obama%20is%20a%20Terrorist
[14:50:57] Compares "Barack Obama is a President" with "Barack Obama is a Terrorist"
[14:51:02] And gets high similarity :(
[14:51:54] It's really fast though. So long as we can rely on it, it would make for a good feature.
[14:52:43] * halfak lost Shilad's readme
[14:54:22] hmm, it's good to use it, but I'm thinking of something easier and less resource-consuming (a library like this: https://github.com/seatgeek/fuzzywuzzy)
[14:55:08] Yeah. That would be easier to get to prod than an external service.
[14:55:22] Is it just edit distance?
[14:56:20] edit distance?
[14:56:34] https://en.wikipedia.org/wiki/Edit_distance
[14:57:07] anyways
[14:57:15] so I am ready to rumble
[14:57:24] sorry for the hiccup due to my coffee love :p
[14:57:36] No worries.
[14:57:42] So, did you find that script I sent to you?
[14:57:43] I thought you were talking about distance between edits
[14:57:52] :)))
[14:57:57] ok
[14:58:10] It has lots of other features that we probably won't use
[14:58:28] like token_sort_ratio or something like that, which is a simple edit distance
[14:58:31] https://pypi.python.org/pypi/python-Levenshtein/0.12.0
[14:58:51] no, they were misplaced somehow :(
[14:59:00] kk, will look ToAruShiroiNeko
[14:59:07] do you know the date you sent em? :)
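The badwords-detector approach halfak recommends above ("give it a list of language/country regexes and it will extract any one") could look roughly like this; the language list and function name are illustrative, not the real revscoring extractor:

```python
import re

# One readable alternative per line -- easy to maintain, unlike a
# hand-minimized "a(frikaa?ns|lbanian|..." monster.  As noted above,
# the state machine the regex builds ends up the same either way.
LANGUAGE_NAMES = [
    r"afrikaans",
    r"albanian",
    r"arabic",
    r"english",
    r"french",
    # ... extend with the rest of the language list
]

LANGUAGE_RE = re.compile(
    r"\b(?:" + "|".join(LANGUAGE_NAMES) + r")\b", re.IGNORECASE)

def extract_language_names(text):
    """Return every language-name mention found in `text`."""
    return LANGUAGE_RE.findall(text)

extract_language_names("added sitelink English; label now Albanian")
# -> ['English', 'Albanian']
```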
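And a rough sketch of the fuzzy-relabeling feature Amir1 asks about at [14:49:29], using python-Levenshtein (which fuzzywuzzy wraps); the example labels and the threshold are made up for illustration:

```python
import Levenshtein  # pip install python-Levenshtein

def label_similarity(old_label, new_label):
    """Normalized similarity in [0, 1]; a very low value suggests a
    complete relabeling rather than a small correction."""
    return Levenshtein.ratio(old_label, new_label)

Levenshtein.ratio('Brian', 'Jesus')                    # 0.0 -- no overlap
label_similarity("Douglas Adams", "Douglas N. Adams")  # high: small edit
label_similarity("Douglas Adams", "shit")              # low: full relabel

# A made-up cutoff, to be tuned against real labeled data:
COMPLETE_RELABELING_THRESHOLD = 0.25
```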
[14:59:08] ok
[14:59:20] halfak: fuzzywuzzy is actually built on top of Levenshtein
[14:59:38] https://gist.github.com/halfak/456da74cd98ca9f199bd
[14:59:45] ToAruShiroiNeko, ^
[14:59:49] Amir1, gotcha :)
[15:00:01] I'd do a performance/signal analysis to make sure they aren't slowing us down too much
[15:00:12] Should be super easy with a little script.
[15:00:45] ToAruShiroiNeko, we're using mw.api and I think we should switch to mwapi
[15:01:00] https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html#Levenshtein-distance
[15:01:05] We also need to figure out how a flagged revision plays out if it is rejected.
[15:01:12] >>> ratio('Brian', 'Jesus')
[15:01:12] 0.0
[15:01:12] Really? I thought there was some similarity.
[15:01:18] :D
[15:01:34] lol
[15:01:53] For the lurkers, see https://en.wikipedia.org/wiki/Monty_Python%27s_Life_of_Brian
[15:01:59] Awesome movie
[15:02:52] lol
[15:03:16] * halfak does some reverse engineering of the DB
[15:05:26] mw.api -> mwapi ?
[15:05:38] you mean what, exactly?
[15:08:30] Different libraries
[15:08:52] * Amir1 gets back to work
[15:09:13] https://pythonhosted.org/mediawiki-utilities/core/api.html#mw-api vs. http://pythonhosted.org/mwapi/
[15:09:40] mwapi is more basic, but it's the new way.
[15:09:47] And it will serve our purposes nicely.
[15:10:05] Actually, I think I should get these bits into mwreverts.
[15:10:12] http://pythonhosted.org/mwreverts/
[15:12:30] halfak okay
[15:13:01] I'll take a look at mwreverts as soon as I figure out what we need to do for flagged revs.
[15:13:11] I do not have a strong opinion on either version; I am inclined to trust you more than I am inclined to trust myself on the matter
[15:13:18] so I am looking at this script
[15:13:28] what exactly is missing?
[15:13:41] I dunno. I never ran it.
[15:13:45] I made stuff up.
[15:13:52] oh, ok
[15:13:53] E.g. the 'pending' field in the revision document
[15:13:59] I was on a plane without internet
[15:14:16] You should consider it structurally useful.
[15:14:16] I thought these were the queries you were using for autolabelling in the past
[15:14:23] Nope
[15:14:36] Did it much more manually/ad-hoc
[15:14:45] its lack of psql queries was quite interesting :p
[15:15:35] Why would we have psql queries?
[15:15:59] to autolabel? weren't you using PostgreSQL for that?
[15:18:35] Oh! Well, when we autolabelled, yeah, but I don't know if we should autolabel now.
[15:18:38] That would be OK though.
[15:18:59] I was thinking that we'd just load the revisions we want labeled into Wikilabels from here forward.
[15:19:25] indeed
[15:19:30] but it is the same logic
[15:19:50] I want to pull something like 50k and de facto label them
[15:20:06] maybe I will get 37k good ones, or maybe 38k
[15:20:25] Yup
[15:20:35] randomly sample 2,000 from the bad ones and 18,000 from the good ones
[15:21:02] let people deal with the 2,000 bad ones
[15:21:46] we discussed that to death before
[15:21:56] merely rephrasing consensus from before
[15:22:06] any disagreements with the above strategy?
[15:22:36] Nope
[15:22:52] First things first anyway: we need a robust way to find the ~2,000 potentially bad edits.
[15:46:13] ToAruShiroiNeko, pairjam?
[15:55:46] pairjam.com/#bshb95
[15:55:47] yep
[15:56:31] do we want to rely on quarry for the initial 50k revisions?
[15:56:45] because ideally it should be more straightforward than that in terms of execution
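A minimal sketch of the sampling step agreed at [15:19:50]–[15:22:52]; the 2,000/18,000 counts come from the discussion, while the stand-in data and seed are illustrative assumptions:

```python
import random

random.seed(0)  # reproducible sample

# Stand-in for the ~50k de facto labeled (rev_id, label) pairs
# described above; real pairs would come from the labeling script.
revisions = [(i, "bad" if random.random() < 0.25 else "good")
             for i in range(50000)]

good = [rev_id for rev_id, label in revisions if label == "good"]
bad = [rev_id for rev_id, label in revisions if label == "bad"]

to_label = random.sample(bad, 2000) + random.sample(good, 18000)
random.shuffle(to_label)
# -> a 20,000-revision set to load into Wikilabels for human review
```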
[15:56:59] Halfak and all: http://librarybase.wmflabs.org
[15:58:11] Oh... Isn't this one of the things Wikisource does?
[15:58:23] Either way, huge +1 to this idea. Let's do more of this.
[15:59:32] Denny wants me to integrate with Wikidata. Okay, if they're fine with me creating items for URLs because they appear as Wikipedia citations
[15:59:34] I think my work can help with mass imports and then building up the cross-wiki relationship with whatever WikiBase is hosting this.
[15:59:44] Yes
[16:00:02] Conceptually a "work", but the URL might be all we have.
[16:00:10] And we'll need to do *A LOT* of merging
[16:00:16] But we can build merging tools :)
[16:00:18] It'll be fun
[16:00:47] The entire thing could also blow up in our faces, which is why I'm sandboxing the project before even considering Wikidata
[16:00:57] umm
[16:01:11] harej, how is this different from Wikisource?
[16:01:24] * halfak is not 100% clear on the scope of Wikisource
[16:01:27] Wikisource is a collection of texts.
[16:01:30] I remember a group talk at Wikimania that was trying to do something like this
[16:01:36] Wikisource is Commons for texts
[16:01:44] texts that already exist and are freely licensed
[16:01:46] Like, you have copies of really old public domain books and government publications.
[16:01:47] like the US Constitution
[16:01:54] or international treaties
[16:02:13] or possibly transcripts of audio from jet pilots
[16:02:34] Wikisource is a really cool project that doesn't get enough love
[16:03:49] In principle, Librarybase could point to a Wikisource page as a full-text resource.
[16:03:54] harej, Wikisource is missing a huge opportunity to document the texts that it can't legally collect.
[16:04:16] Wikisource should document all works and collect text for those it can.
[16:04:37] Wikisource suffers from not being able to convert scanned documents to wikitext
[16:04:38] E.g. we could have copyright horizon events where we go collect a bunch of documents that just fell out of the copyright window.
[16:04:47] ToAruShiroiNeko, that too.
[16:05:00] ToAruShiroiNeko, BTW, I'm learning about FlaggedRevs. I'll post a report shortly.
[16:05:11] neat
[16:05:24] there is hardly anyone working on Wikisource, and it is by far more important than Wikipedia
[16:05:27] Wikisource is missing a lot of things. But I think it's decidedly out of scope to include stuff it can't legally collect. Librarybase doesn't have that problem :)
[16:05:38] since it is laws and treaties that govern every living, breathing minute of our lives
[16:06:57] There is a Wikisource conference in Vienna
[16:07:44] I would attend it if I had the time :p
[16:07:54] then again, I would be saying the same thing like a broken record...
[16:07:55] Vienna isn't that far for you!
[16:07:58] AI AI AI AI AI AI!
[16:08:31] Hmm... I think I can do a good Steve Ballmer impression with that
[16:08:40] I want to go, but only if they pay. I've already paid for two international trips this year and it really eats into your budget.
[16:08:56] budgets are overrated. :p
[16:09:07] "Intelligence intelligence intelligence intelligence"
[16:09:11] real men spend cash without any kind of planning - and end up hobos. :p
[16:09:24] harej, that would trigger the NSA more.
[16:14:37] my chant: "Science Hypothesis Evidence Theory, Science Hypothesis Evidence Theory, ..."
[16:14:55] * halfak does science against the MediaWiki DB
[16:15:04] Isn't that just the scientific method?
[16:15:14] science is the scientific method.
[16:15:39] The theory doesn't make sense without the context of method. :)
[16:17:20] I.e., theory, being a constructed thing, is best understood in the context in which it was constructed -- by whom, for what reason, and using what methods.
[16:17:31] * halfak feels philosophical recently.
[16:23:48] * ToAruShiroiNeko facepalms at everyone dismissing scientific fact on the basis that "it is just a theory"
[16:24:01] "theory" != baseless guess.
[16:24:08] gravity is a theory -_-
[16:24:12] There are two sides to this.
[16:24:27] and Newton's theory of gravity was wrong.
[16:24:33] There's also the assumption of theory as fact, and the lack of nuanced treatment of the concept of "truth".
[16:24:45] indeed
[16:24:55] theory always has a hint of uncertainty
[16:25:06] which reasonable people do not see as an excuse to dismiss it entirely
[16:25:57] oh, I was curious what you thought about the gtalk remarks I sent you
[16:25:59] https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service/Work_log/2015-09-06
[16:26:01] we can discuss here too
[16:26:05] whatever is convenient for you
[16:26:08] TL;DR: pending revisions don't matter; look for the revert.
[16:26:47] ToAruShiroiNeko, -->
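On "look for the revert": a sketch of checking a revision's revert status with mwreverts over an mwapi session (the library switch discussed at [15:00:45]). The wiki host, user agent, and rev_id are placeholders, and the call signature should be checked against http://pythonhosted.org/mwreverts/:

```python
import mwapi          # pip install mwapi
import mwreverts.api  # pip install mwreverts

# Placeholder host/user-agent/rev_id -- not a real case under study.
session = mwapi.Session("https://www.wikidata.org",
                        user_agent="revert-check sketch <someone@example.com>")

# Returns revert statuses for the revision: whether it reverts others,
# whether it was itself reverted, and whether it reverts back to a
# past state.  Each element is a Revert status or None.
reverting, reverted, reverted_to = mwreverts.api.check(
    session, rev_id=12345678, radius=15)

if reverted is not None:
    # The revision was identity-reverted within `radius` edits,
    # regardless of what FlaggedRevs says about pending status.
    print("Reverted by revision", reverted.reverting.get("revid"))
```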