[00:11:06] Revision-Scoring-As-A-Service, rsaas-articlequality: Generate spam and vandalism new page creation dataset - https://phabricator.wikimedia.org/T135644#2681382 (Halfak) https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_draft_quality/Work_log/2016-09-28 https://meta.wikimedia.org/...
[00:14:37] Revision-Scoring-As-A-Service, rsaas-articlequality: Generate spam and vandalism new page creation dataset - https://phabricator.wikimedia.org/T135644#2681383 (Halfak) I've got about 3k concerning deletions per month. There are about 80k total article creations per month. ``` $ cat enwiki.draft_qual...
[00:17:29] Revision-Scoring-As-A-Service, rsaas-articlequality: Generate spam and vandalism new page creation dataset - https://phabricator.wikimedia.org/T135644#2681384 (Halfak) Actually, it looks like it fluctuates a little bit ``` $ grep -v OK enwiki.draft_quality.201508.tsv | wc 2374 11870 124306 $ grep...
[00:22:34] Revision-Scoring-As-A-Service, rsaas-editquality: Implement new json-lines pattern in editquality - https://phabricator.wikimedia.org/T146410#2681386 (Halfak) OK. All done, but I found an issue in the ruwiki datasets so I'm regenerating those models. After that, I think we're all done!
[16:16:26] o/ doing some errands this morning. I'll get to hacking in about 15 mins
[16:31:04] Revision-Scoring-As-A-Service, rsaas-articlequality: [Discuss] Hosting the monthly article quality dataset on labsDB - https://phabricator.wikimedia.org/T146718#2681754 (jcrespo) @Halfak So my suggestion would be, if this can wait 3 months, wait for the labsdb pending work, where we will have a more stab...
[17:29:29] o/
[17:29:33] OK. Way later than I thought
[17:29:41] I was packing up to travel to AMS today
[17:47:09] Had an error in the ruwiki model building scripts. I'm working on it now.
[17:55:51] Darn. Looks like we're extracting features again
[17:56:03] It also looks like the ruwiki sample got messed uo.
[17:56:05] *up
[17:56:12] The random sample link is just plain wrong.
[17:56:21] It looks like it was generated for a different time period.
[17:56:40] So we have to do some backflips in order to merge the data from wikilabels in with the sample we've got.
[17:57:32] Yikes! it's labeled 2016, but it is certainly generated in 2015. Hmm
[18:05:00] I think I found the original sample :)
[18:23:30] Revision-Scoring-As-A-Service, rsaas-articlequality: Generate spam and vandalism new page creation dataset - https://phabricator.wikimedia.org/T135644#2681897 (Halfak) ``` $ wc datasets/enwiki.draft_quality.201508-201608.tsv 909053 4544642 49006910 $ cat datasets/enwiki.draft_quality.201508-201608.t...
[18:33:03] Revision-Scoring-As-A-Service, rsaas-articlequality: Generate spam and vandalism new page creation dataset - https://phabricator.wikimedia.org/T135644#2681912 (Halfak) Luckily, I could clean this all up on the command line with a little bit of sed. I also found that I was getting multiple rows for some...
[18:40:31] Revision-Scoring-As-A-Service, rsaas-articlequality: Generate spam and vandalism new page creation dataset - https://phabricator.wikimedia.org/T135644#2681925 (Halfak) So thinking of down-sampling, and it looks like the "attack" class might get under-sampled if we go too low. It might make sense to jus...
[19:03:43] Revision-Scoring-As-A-Service, rsaas-articlequality: Generate spam and vandalism new page creation dataset - https://phabricator.wikimedia.org/T135644#2681930 (Halfak) Dataset and new Makefile stuff is in https://github.com/wiki-ai/draftquality
[19:12:32] Aaaand. it looks like we didn't do eswikibooks.
[19:12:36] So off that goes.
[19:12:37] ARG!