[12:57:06] Quarry, Patch-For-Review: Quarry should remember my login - https://phabricator.wikimedia.org/T164390#3489031 (zhuyifei1999) Open>Resolved a:zhuyifei1999 It should now remember login for 31 days (default). If this is too long or too short, or if a "remember my login" checkbox should be added,...
[15:03:21] halfak: Wanted to get your thoughts. Related to our conversation last week about identifying semi-automated Wikidata edits from tools, I split revision comments into words and some pretty obvious indicator words of tool edits were prevalent: “#quickstatements”, “petscan”, and “autolist2”. The first is applied to 50 million revisions. I’m thinking that identifying tool edits based on these
[15:03:21] three words should identify a pretty large chunk of semi-automated tool edits.
[15:03:43] Any thoughts? There's always going to be some semi-automated revisions that get classified as non-automated I'm guessing. But I think we can be pretty confident that what we classify as semi-automated is, in fact, semi-automated…and we can for example look at whether highly viewed items have a large percentage of bot or semi-automated edits, etc...
[15:05:03] btw, I just looked at the part of comments outside of "/*...*/"
[15:05:35] That's good. Are only those three terms very common?
[15:05:46] Do you have a list of the most common terms?
[15:08:07] Yeah, I do have a list. I think those are the three most common clear indicators of tool edits but there are also some variations on spelling...e.g. "#quick-statements"
[15:08:47] will have to handle those. Can I see the list?
[15:10:01] halfak: do you have a canonical version of the rise & decline graph somewhere I can grab it?
[15:10:16] You want the old one or a new version?
[15:10:39] Original: https://commons.wikimedia.org/wiki/File:The_English_Wikipedia_Decline.png
[15:10:40] hmmm, I think I’d prefer the old one since I’ll refer to the paper
[15:11:00] OK cool.
[15:11:10] awesome, thanks!
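Note on the 15:03-15:08 discussion: the idea of flagging tool edits by indicator words in the part of the comment outside "/*...*/" could be sketched roughly as below. This is not the script used in the chat. The function names, the pattern table, and the sample comments in the usage note are hypothetical, and the only spelling variant handled is the "#quick-statements" one mentioned above.

```python
import re

# Auto-generated summary section of a comment, i.e. anything inside /* ... */.
AUTO_SUMMARY = re.compile(r"/\*.*?\*/", re.DOTALL)

# Hypothetical indicator patterns built from the three terms named in the chat.
# The optional hyphen covers the "#quick-statements" spelling variant.
TOOL_PATTERNS = {
    "quickstatements": re.compile(r"#quick-?statements", re.IGNORECASE),
    "petscan": re.compile(r"petscan", re.IGNORECASE),
    "autolist2": re.compile(r"autolist2", re.IGNORECASE),
}


def free_text(comment):
    """Return the comment with every /* ... */ section removed."""
    return AUTO_SUMMARY.sub(" ", comment).strip()


def detect_tool(comment):
    """Return the name of the first matching tool, or None if nothing matches."""
    text = free_text(comment)
    for name, pattern in TOOL_PATTERNS.items():
        if pattern.search(text):
            return name
    return None
```

For example, detect_tool("/* wbsetclaim-create:2||1 */ [[Property:P31]]: [[Q5]], #quickstatements") gives "quickstatements", while a comment with no indicator word gives None. As noted in the chat, some semi-automated edits will still be missed, so a None result only means that no known indicator was found.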
[15:11:26] halfak: Sure, the list has 54 million words in it. It's a long tail distribution so most rarely occur. Want me to send all or just part of the list?
[15:12:19] It's sorted in descending order
[15:13:28] top 1k
[15:13:38] Maybe just post it on a wiki :)
[15:13:47] Like in a work log on your meta page
[15:13:47] Sounds good, one moment
[15:14:05] Sure, I'll do that and point to it from here
[15:28:07] halfak: had issues uploading the file to a work log...something about the file name. So, I just put it here: https://github.com/hall1467/wikidata_usage_tracking/blob/master/results/top_1000_word_counts_for_wikidata_page_revisions_20170501.tsv
[15:28:34] Oh. I was suggesting you paste it onto a wiki page.
[15:28:48] Ah okay, my bad
[15:29:12] I'd match "Bot"
[15:29:19] Just in case there's something unflagged
[15:29:33] What comments match "python"?
[15:30:14] For "Bot", classify those revisions as bot edited?
[15:30:30] Let me query for "python" in my database
[15:30:46] I used the word query correctly that time haha
[15:30:59] lol
[15:31:26] "importing" seems like it might be interesting too
[15:31:49] [[MediaWiki:Gadget-Merge.js|merge.js]]
[15:32:01] WD-system-description.py
[15:32:07] #item-creator
[15:32:40] Here's a few rows with "python" in it
[15:32:43] [[User:Harej|]]
[15:32:45] lol
[15:32:53] https://www.irccloud.com/pastebin/u1XndoY4/
[15:32:58] harej, you're a very common term in Wikidata edit comments!
[15:33:19] haha
[15:33:56] Looks like URLs with goo.gl are pointing to code a lot.
[15:34:00] Could you check on that?
[15:34:03] Maybe it's a norm
[15:35:23] Comments with "importing" in it seem to indicate an import from some location
[15:35:30] https://www.irccloud.com/pastebin/GimZWrWG/
[15:35:54] Let me check on the goo.gl
[15:41:54] Looked for "goo.gl" and limited results to 10,000. Appears they all are related and maybe done by the same bot
[15:41:59] https://www.irccloud.com/pastebin/ArgtcJzd/
[15:42:58] They seem to essentially all point to the log file: https://goo.gl/BezTim
[15:44:43] hall1467, what would it take to rebuild the dataset after removing edits by flagged bots?
[15:45:56] If you're running again, I'd also recommend normalizing case and removing punctuation.
[15:46:26] The word count dataset? Just one sql query and then a rerun of my python script to generate the counts....so wouldn't be too much work
[15:47:06] Is there an easy way to remove all punctuation? The script is in python
[15:47:10] halfak: makes sense I would be common in the revision table in general, but edit comments?
[15:55:06] harej: I was just looking at the part of revision comments outside of "/*...*/"
[15:55:26] Here are a couple examples
[15:55:37] https://www.irccloud.com/pastebin/QLCO1oN1/
[15:55:46] There we go.
[18:04:20] * yuvipanda waves
[18:04:31] it's yuvipanda!
[18:06:01] hi harej
[18:39:39] people who use PAWS! I've set up a newer better version at https://paws.tools.wmflabs.org! I'll switch it over sometime later this week
[18:39:49] please test it out and let me know if there are any bugs :)
[20:32:08] For today's history lesson, I'd like to present the task that almost killed Quarry: https://phabricator.wikimedia.org/T104308
[20:33:39] I think this says something about how we ought to approach technology -- a quote from yuvipanda: "My current setup for dealing with abuse is: 1. Ban user from Quarry, and 2. Trout them on their talk page. And I've so far not had to use it."
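Note on the 15:45-15:47 exchange: one common way to normalize case and remove punctuation in Python uses only the standard library, sketched below. The function names are hypothetical and this is not the word-count script referred to in the chat. It assumes ASCII punctuation only (string.punctuation), so characters such as curly quotes would pass through unchanged.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(comment):
    """Lowercase the comment, drop ASCII punctuation, and split on whitespace."""
    return comment.lower().translate(PUNCT_TABLE).split()


def count_words(comments):
    """Count normalized words across an iterable of comment strings."""
    counts = Counter()
    for comment in comments:
        counts.update(normalize(comment))
    return counts
```

A side effect worth noting: because "#" and "-" are removed, "#quickstatements" and "#quick-statements" both normalize to "quickstatements", so those spelling variants collapse into one count. Tokens such as "WD-system-description.py" lose their separators too, which may or may not be wanted.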