[00:40:01] Barebones mwdb 0.0.1: https://github.com/mediawiki-utilities/python-mwdb
[00:40:12] And I'm off to enjoy the weekend in other ways.
[00:40:13] o/
[12:11:19] halfak: ping me if you are around :)
[14:07:06] halfak: around?
[14:27:58] o/ Amir1
[14:28:13] o/ halfak :)
[14:28:19] lots of things to discuss
[14:29:01] first of all, there is something we should consider
[14:29:31] halfak: check this out
[14:29:31] https://www.wikidata.org/w/index.php?title=Q879275&action=history
[14:29:33] * halfak reads intently
[14:29:39] Adil2015adil
[14:30:00] is vandalising
[14:30:29] but not in one edit; it's a very common thing in Wikidata, since the interface doesn't let you make a whole change at once
[14:30:59] using JS, people add a sitelink or change one thing per edit (that's why my bot has 23M edits there :D)
[14:31:16] Looks like Sjoerddebruin should have reverted back to the Aug 8th edit.
[14:31:31] yeah, I know that
[14:32:32] he didn't know it was vandalized; that's why he gave me this example
[14:32:51] because it confuses people
[14:33:16] but my point is, it's hard to catch all edits by one user at once
[14:33:19] in revscoring
[14:33:33] Sorry. I'm still confused as to why this is a problem?
[14:33:40] technically it is not hard
[14:33:43] Shouldn't each individual edit score badly?
[14:34:09] each one should get a rather high score, but not that high
[14:34:41] since removing a sitelink is a common, healthy practice in Wikidata, but removing all of the sitelinks is not
[14:35:07] Indeed. This is a limitation of a revision-at-a-time model.
[14:35:24] It would be nice to have a feature for the score of the user's last revision too.
[14:35:43] Or maybe we can stand up good-faith/bad-faith models on top of the scores (my plan)
[14:35:49] E.g. Snuggle
[14:36:03] hmm
[14:36:11] interesting
[14:37:58] let's see how well my model works on Wikidata, and if it's not really good, I'll add features for the user's last revision before the edit
[14:38:22] Amir1, I'd advise two models.
[14:38:35] good-faith/bad-faith?
[14:38:54] Yeah. So one would be "is this edit damage?" and another would be "is this editor bad-faith?"
[14:39:05] dang
[14:39:08] Do you think we have good-faith damaging edits in Wikidata?
[14:39:11] just spilled coffee on my keyboard
[14:39:19] Amir1, not really; just separating concerns.
[14:39:45] because the editing interface is completely straightforward compared to Wikipedia
[14:39:52] If we decide to use the scores later to build a more robust good-faith/bad-faith predictor, we don't want historical info gumming up the signal.
[14:39:55] ok, sure :)
[14:40:22] So, we might have a third classifier: "is this vandalism?"
[14:40:38] I see
[14:40:41] Which would be informed by "is this damage?" and "is this editor probably bad-faith?"
[14:41:10] you're somehow building an ANN
[14:41:18] Kinda, yeah.
[14:41:27] But each of those levels will be useful :)
[14:41:32] big fan of ANNs :D
[14:41:32] In their own way.
[14:41:46] hmm
[14:41:50] Dimensionality reduction is a useful strategy for a lot of problems.
[14:41:52] o/ ToAruShiroiNeko
[14:42:00] Sorry about your keyboard.
[14:42:04] I know how that goes.
[14:42:08] dimensionality reduction is very good for performance
[14:42:19] well, my keyboard's q key wasn't working for a while
[14:42:23] good excuse to dishwash it
[14:42:24] * Amir1 says hello to ToAruShiroiNeko too :)
[14:42:25] Does it work now?
[14:42:27] :D
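A minimal sketch of the layered setup halfak describes above — one model for "is this edit damage?", one for "is this editor bad-faith?", and a third "is this vandalism?" classifier fed by their scores. The scikit-learn estimators and the random feature matrices here are illustrative assumptions, not the actual revscoring pipeline:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical feature matrices and labels; real features would come
# from revscoring's feature extraction, not random numbers.
rng = np.random.RandomState(0)
X_edit = rng.rand(1000, 20)            # per-edit features
X_editor = rng.rand(1000, 10)          # per-editor features
y_damage = rng.randint(0, 2, 1000)     # "is this edit damaging?"
y_badfaith = rng.randint(0, 2, 1000)   # "is this editor bad-faith?"
y_vandalism = y_damage & y_badfaith    # vandalism ~ damaging AND bad-faith

# Level 1: two separate concerns, two separate models.
damage_model = GradientBoostingClassifier().fit(X_edit, y_damage)
badfaith_model = GradientBoostingClassifier().fit(X_editor, y_badfaith)

# Level 2: "is this vandalism?", informed only by the level-1 scores.
level1_scores = np.column_stack([
    damage_model.predict_proba(X_edit)[:, 1],
    badfaith_model.predict_proba(X_editor)[:, 1],
])
vandalism_model = LogisticRegression().fit(level1_scores, y_vandalism)
```

Each level stays useful on its own, and the meta-classifier works from two low-dimensional scores rather than the full feature space — the dimensionality-reduction point made above.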
[14:42:32] another thing: it seems adding the name of a language (like "English", etc.) is a common type of vandalism in Wikidata
[14:42:39] it works fine, it just types something random :p
[14:42:48] I am on my laptop, so no real loss of efficiency
[14:42:52] Amir1, can you link to an example of that easily?
[14:42:55] Do you think it would be a good thing to add as a feature (or features)?
[14:43:00] We should have a library of common types of vandalism.
[14:43:05] of course: https://www.wikidata.org/wiki/Special:AbuseFilter/8
[14:43:15] +1 for adding any type of change that is common to vandalism.
[14:44:01] check the log
[14:45:29] but I would need to add a huge regex for that
[14:46:02] Yeah... Maybe.
[14:46:27] Yeah... I think so.
[14:46:47] It seems like we might want to use that utility to detect language names in the future too.
[14:47:11] a(frikaa?ns|lbanian ...
[14:47:15] Why do people do this?
[14:47:23] It makes the language name unreadable.
[14:47:35] The state machine that the regex builds is the same!
[14:47:37] haha
[14:47:50] I will fix this in my code
[14:47:50] afrikaans|albanian
[14:48:10] I recommend doing something like the badwords detector.
[14:48:26] You give it a list of language/country regexes and it will extract any one.
[14:48:45] See https://github.com/wiki-ai/revscoring/blob/master/revscoring/languages/meta/regex_extractors.py#L10
[14:48:58] I just make a group regex by concatenating the options together.
[14:49:05] It works nicely and it's super easy to maintain.
[14:49:12] Compared to a giant regex.
[14:49:29] halfak: last thing before I go back to work. Do you think it would be good to have a fuzzy system for changes in descriptions, labels, etc.? I see lots of vandalism in Wikidata that is a complete relabeling, e.g. changing the label of someone to "shit"
[14:50:02] Yeah. I was testing out Shilad's new API to see if it would work for that.
[14:50:11] Regretfully, it didn't perform as expected.
[14:50:38] E.g. http://como.macalester.edu/wikibrain/similarity?lang=simple&phrases=Barack%20Obama%20is%20a%20President|Barack%20Obama%20is%20a%20Terrorist
[14:50:57] Compares "Barack Obama is a President" with "Barack Obama is a Terrorist"
[14:51:02] And gets high similarity :(
[14:51:54] It's really fast though. So long as we can rely on it, it would make for a good feature.
[14:52:43] * halfak lost Shilad's readme
[14:54:22] hmm, it's good to use it, but I'm thinking of something easier and less resource-consuming (a library like this: https://github.com/seatgeek/fuzzywuzzy)
[14:55:08] Yeah. That would be easier to get to prod than an external service.
[14:55:22] Is it just edit distance?
[14:56:20] edit distance?
[14:56:34] https://en.wikipedia.org/wiki/Edit_distance
[14:57:07] anyways
[14:57:15] so I am ready to rumble
[14:57:24] sorry for the hiccup due to my coffee love :p
[14:57:36] No worries.
[14:57:42] So, did you find that script I sent to you?
[14:57:43] I thought you were talking about distance between edits
[14:57:52] :)))
[14:57:57] ok
[14:58:10] It has lots of other features that we probably won't use
[14:58:28] like token_sort_ratio or something like that, which is a simple edit distance
[14:58:31] https://pypi.python.org/pypi/python-Levenshtein/0.12.0
[14:58:51] no, they were misplaced somehow :(
[14:59:00] kk, will look ToAruShiroiNeko
[14:59:07] do you know the date you sent em? :)
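The badwords-detector approach halfak recommends above ("give it a list of language/country regexes and it will extract any one") could look roughly like this; the language list and function name are illustrative, not the real revscoring extractor:

```python
import re

# One readable alternative per line -- easy to maintain, unlike a
# hand-minimized "a(frikaa?ns|lbanian|..." monster.  As noted above,
# the state machine the regex builds ends up the same either way.
LANGUAGE_NAMES = [
    r"afrikaans",
    r"albanian",
    r"arabic",
    r"english",
    r"french",
    # ... extend with the rest of the language list
]

LANGUAGE_RE = re.compile(
    r"\b(?:" + "|".join(LANGUAGE_NAMES) + r")\b", re.IGNORECASE)

def extract_language_names(text):
    """Return every language-name mention found in `text`."""
    return LANGUAGE_RE.findall(text)

extract_language_names("added sitelink English; label now Albanian")
# -> ['English', 'Albanian']
```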
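And a rough sketch of the fuzzy-relabeling feature Amir1 asks about at [14:49:29], using python-Levenshtein (which fuzzywuzzy wraps); the example labels and the threshold are made up for illustration:

```python
import Levenshtein  # pip install python-Levenshtein

def label_similarity(old_label, new_label):
    """Normalized similarity in [0, 1]; a very low value suggests a
    complete relabeling rather than a small correction."""
    return Levenshtein.ratio(old_label, new_label)

Levenshtein.ratio('Brian', 'Jesus')                    # 0.0 -- no overlap
label_similarity("Douglas Adams", "Douglas N. Adams")  # high: small edit
label_similarity("Douglas Adams", "shit")              # low: full relabel

# A made-up cutoff, to be tuned against real labeled data:
COMPLETE_RELABELING_THRESHOLD = 0.25
```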
[14:59:08] ok
[14:59:20] halfak: fuzzywuzzy is actually built on top of Levenshtein
[14:59:38] https://gist.github.com/halfak/456da74cd98ca9f199bd
[14:59:45] ToAruShiroiNeko, ^
[14:59:49] Amir1, gotcha :)
[15:00:01] I'd do a performance/signal analysis to make sure they aren't slowing us down too much
[15:00:12] Should be super easy with a little script.
[15:00:45] ToAruShiroiNeko, we're using mw.api and I think we should switch to mwapi
[15:01:00] https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html#Levenshtein-distance
[15:01:05] We also need to figure out how a flagged revision plays out if it is rejected.
[15:01:12] >>> ratio('Brian', 'Jesus')
[15:01:12] 0.0
[15:01:12] Really? I thought there was some similarity.
[15:01:18] :D
[15:01:34] lol
[15:01:53] For the lurkers, see https://en.wikipedia.org/wiki/Monty_Python%27s_Life_of_Brian
[15:01:59] Awesome movie
[15:02:52] lol
[15:03:16] * halfak does some reverse engineering of the DB
[15:05:26] mw.api -> mwapi ?
[15:05:38] you mean what, exactly?
[15:08:30] Different libraries
[15:08:52] * Amir1 gets back to work
[15:09:13] https://pythonhosted.org/mediawiki-utilities/core/api.html#mw-api vs. http://pythonhosted.org/mwapi/
[15:09:40] mwapi is more basic, but it's the new way.
[15:09:47] And it will serve our purposes nicely.
[15:10:05] Actually, I think I should get these bits into mwreverts.
[15:10:12] http://pythonhosted.org/mwreverts/
[15:12:30] halfak okay
[15:13:01] I'll take a look at mwreverts as soon as I figure out what we need to do for flagged revs.
[15:13:11] I do not have a strong opinion on either version; I am inclined to trust you more than I am inclined to trust myself on the matter
[15:13:18] so I am looking at this script
[15:13:28] what exactly is missing?
[15:13:41] I dunno. I never ran it.
[15:13:45] I made stuff up.
[15:13:52] oh, ok
[15:13:53] E.g. the 'pending' field in the revision document
[15:13:59] I was on a plane without internet
[15:14:16] You should consider it structurally useful.
[15:14:16] I thought these were the queries you were using for autolabelling in the past
[15:14:23] Nope
[15:14:36] Did it much more manually/ad-hoc
[15:14:45] its lack of psql queries was quite interesting :p
[15:15:35] Why would we have psql queries?
[15:15:59] to autolabel? weren't you using PostgreSQL for that?
[15:18:35] Oh! Well, when we autolabelled, yeah, but I don't know if we should autolabel now.
[15:18:38] That would be OK though.
[15:18:59] I was thinking that we'd just load the revisions we want labeled into Wikilabels from here forward.
[15:19:25] indeed
[15:19:30] but it is the same logic
[15:19:50] I want to pull something like 50k and de facto label them
[15:20:06] maybe I will get 37k good ones, or maybe 38k
[15:20:25] Yup
[15:20:35] randomly sample 2,000 from the bad ones and 18,000 from the good ones
[15:21:02] let people deal with the 2,000 bad ones
[15:21:46] we discussed that to death before
[15:21:56] merely rephrasing consensus from before
[15:22:06] any disagreements with the above strategy?
[15:22:36] Nope
[15:22:52] First things first anyway: we need a robust way to find the ~2,000 potentially bad edits.
[15:46:13] ToAruShiroiNeko, pairjam?
[15:55:46] pairjam.com/#bshb95
[15:55:47] yep
[15:56:31] do we want to rely on quarry for the initial 50k revisions?
[15:56:45] because ideally it should be more straightforward than that in terms of execution
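A minimal sketch of the sampling step agreed at [15:19:50]–[15:22:52]; the 2,000/18,000 counts come from the discussion, while the stand-in data and seed are illustrative assumptions:

```python
import random

random.seed(0)  # reproducible sample

# Stand-in for the ~50k de facto labeled (rev_id, label) pairs
# described above; real pairs would come from the labeling script.
revisions = [(i, "bad" if random.random() < 0.25 else "good")
             for i in range(50000)]

good = [rev_id for rev_id, label in revisions if label == "good"]
bad = [rev_id for rev_id, label in revisions if label == "bad"]

to_label = random.sample(bad, 2000) + random.sample(good, 18000)
random.shuffle(to_label)
# -> a 20,000-revision set to load into Wikilabels for human review
```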
[15:56:59] Halfak and all: http://librarybase.wmflabs.org
[15:58:11] Oh... Isn't this one of the things Wikisource does?
[15:58:23] Either way, huge +1 to this idea. Let's do more of this.
[15:59:32] Denny wants me to integrate with Wikidata. Okay, if they're fine with me creating items for URLs because they appear as Wikipedia citations
[15:59:34] I think my work can help with mass imports and then building up the cross-wiki relationship with whatever WikiBase is hosting this.
[15:59:44] Yes
[16:00:02] Conceptually a "work", but the URL might be all we have.
[16:00:10] And we'll need to do *A LOT* of merging
[16:00:16] But we can build merging tools :)
[16:00:18] It'll be fun
[16:00:47] The entire thing could also blow up in our faces, which is why I'm sandboxing the project before even considering Wikidata
[16:00:57] umm
[16:01:11] harej, how is this different from Wikisource?
[16:01:24] * halfak is not 100% clear on the scope of Wikisource
[16:01:27] Wikisource is a collection of texts.
[16:01:30] I remember a group talk at Wikimania that was trying to do something like this
[16:01:36] Wikisource is Commons for texts
[16:01:44] texts that already exist and are freely licensed
[16:01:46] Like, you have copies of really old public domain books and government publications.
[16:01:47] like the US Constitution
[16:01:54] or international treaties
[16:02:13] or possibly transcripts of audio from jet pilots
[16:02:34] Wikisource is a really cool project that doesn't get enough love
[16:03:49] In principle, Librarybase could point to a Wikisource page as a full-text resource.
[16:03:54] harej, Wikisource is missing a huge opportunity to document the texts that it can't legally collect.
[16:04:16] Wikisource should document all works and collect text for those it can.
[16:04:37] Wikisource suffers from not being able to convert scanned documents to wikitext
[16:04:38] E.g. we could have copyright horizon events where we go collect a bunch of documents that just fell out of the copyright window.
[16:04:47] ToAruShiroiNeko, that too.
[16:05:00] ToAruShiroiNeko, BTW, I'm learning about FlaggedRevs. I'll post a report shortly.
[16:05:11] neat
[16:05:24] there is hardly anyone working on Wikisource, and it is by far more important than Wikipedia
[16:05:27] Wikisource is missing a lot of things. But I think it's decidedly out of scope to include stuff it can't legally collect. Librarybase doesn't have that problem :)
[16:05:38] since it is laws and treaties that govern every living, breathing minute of our lives
[16:06:57] There is a Wikisource conference in Vienna
[16:07:44] I would attend it if I had the time :p
[16:07:54] then again, I would be saying the same thing like a broken record...
[16:07:55] Vienna isn't that far for you!
[16:07:58] AI AI AI AI AI AI!
[16:08:31] Hmm... I think I can do a good Steve Ballmer impression with that
[16:08:40] I want to go, but only if they pay. I've already paid for two international trips this year and it really eats into your budget.
[16:08:56] budgets are overrated. :p
[16:09:07] "Intelligence intelligence intelligence intelligence"
[16:09:11] real men spend cash without any kind of planning - and end up hobos. :p
[16:09:24] harej, that would trigger the NSA more.
[16:14:37] my chant: "Science Hypothesis Evidence Theory, Science Hypothesis Evidence Theory, ..."
[16:14:55] * halfak does science against the MediaWiki DB
[16:15:04] Isn't that just the scientific method?
[16:15:14] science is the scientific method.
[16:15:39] The theory doesn't make sense without the context of method. :)
[16:17:20] I.e., theory, being a constructed thing, is best understood in the context in which it was constructed -- by whom, for what reason, and using what methods.
[16:17:31] * halfak feels philosophical recently.
[16:23:48] * ToAruShiroiNeko facepalms at everyone dismissing scientific fact on the basis that "it is just a theory"
[16:24:01] "theory" != baseless guess.
[16:24:08] gravity is a theory -_-
[16:24:12] There are two sides to this.
[16:24:27] and Newton's theory of gravity was wrong.
[16:24:33] There's also the assumption of theory as fact, and the lack of nuanced treatment of the concept of "truth".
[16:24:45] indeed
[16:24:55] theory always has a hint of uncertainty
[16:25:06] which reasonable people do not see as an excuse to dismiss it entirely
[16:25:57] oh, I was curious what you thought about the gtalk remarks I sent you
[16:25:59] https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service/Work_log/2015-09-06
[16:26:01] we can discuss here too
[16:26:05] whatever is convenient for you
[16:26:08] TL;DR: pending revisions don't matter; look for the revert.
[16:26:47] ToAruShiroiNeko, -->
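On "look for the revert": a sketch of checking a revision's revert status with mwreverts over an mwapi session (the library switch discussed at [15:00:45]). The wiki host, user agent, and rev_id are placeholders, and the call signature should be checked against http://pythonhosted.org/mwreverts/:

```python
import mwapi          # pip install mwapi
import mwreverts.api  # pip install mwreverts

# Placeholder host/user-agent/rev_id -- not a real case under study.
session = mwapi.Session("https://www.wikidata.org",
                        user_agent="revert-check sketch <someone@example.com>")

# Returns revert statuses for the revision: whether it reverts others,
# whether it was itself reverted, and whether it reverts back to a
# past state.  Each element is a Revert status or None.
reverting, reverted, reverted_to = mwreverts.api.check(
    session, rev_id=12345678, radius=15)

if reverted is not None:
    # The revision was identity-reverted within `radius` edits,
    # regardless of what FlaggedRevs says about pending status.
    print("Reverted by revision", reverted.reverting.get("revid"))
```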