[11:38:24] revi: Hey, can you check this one? https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/ko
[11:38:56] 18 to 46 are Japanese
[11:39:24] 16 and 17 are a language I don't recognize (which isn't Korean)
[11:39:47] 16 and 17 are Persian
[11:39:49] :D
[11:39:50] oh.
[11:40:04] 1-15 have to be excluded too? Not Korean
[11:40:14] Yeah, we will exclude all of them
[11:40:19] 239 and 240 are almost identical
[11:40:22] 투자재
[11:40:22] 투자재를
[11:40:30] the difference is just a postposition in 240
[11:40:47] same for 228 and 229
[11:40:53] hmm, we pick all forms
[11:40:57] 188-191
[11:41:02] it's okay
[11:41:04] hmm
[11:41:46] https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/en
[11:41:52] same for 54 and 16
[11:42:10] do we really want to have "CommonsCat" in "Generated Common Words"?
[11:42:50] nope, we exclude all of those too
[11:42:59] It's AI, it's stupid :P
[11:43:35] revi: my biggest question is: are the listed badwords really bad words?
[11:43:43] most of them are not
[11:43:54] (check English for example)
[11:43:57] only a few are
[11:44:18] I can see fewer than 10 bad words
[11:44:31] hmm
[11:44:42] Do you think you can write a list down?
[11:44:48] take those 10 from this list
[11:44:59] and add anything you want
[11:45:29] read the list again
[11:45:33] and it was two :P
[11:45:39] 81 and 181
[11:45:46] yeah
[11:45:49] I'll make one
[11:45:59] one thing: we need two lists: 1- informal words: words that are not okay in the article namespace but okay elsewhere, like "hey" or "lol"; 2- swear words: words that are not okay to use anywhere
[11:47:05] revi: ^
[11:47:10] ack
[11:47:16] Thanks!
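The two-tier word-list design proposed above (informal words flagged only in the article namespace, swear words flagged everywhere) could be sketched like this. This is a hypothetical illustration, not revscoring's actual API; the list contents and the `flag_tokens` name are made up for the example.

```python
# Hypothetical sketch of the two-list idea discussed above.
# INFORMAL words are only a problem in article (main) namespace text;
# SWEAR words are flagged in any namespace. Contents are examples only.

INFORMAL = {"hey", "lol", "haha"}
SWEAR = {"shit", "damn"}

def flag_tokens(tokens, in_article_namespace):
    """Return the tokens that should be flagged for this namespace."""
    flagged = [t for t in tokens if t.lower() in SWEAR]
    if in_article_namespace:
        flagged += [t for t in tokens if t.lower() in INFORMAL]
    return flagged
```

On a talk page (`in_article_namespace=False`) only the swear list applies, which is why the two lists need to be kept separate rather than merged.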
[11:49:39] I forgot we have a "Bad words" AbuseFilter
[11:49:40] LOL
[11:50:19] the Persian badwords in ORES came from the bad words abuse filter :P
[11:50:57] just looked it up, found it's incomplete
[11:51:12] but obviously a good place to start :D
[12:05:33] <3
[12:06:33] hmm
[12:06:50] if 'shit' is included, does it fetch 'shitposter' or such?
[12:07:13] I mean, if some word is included, does it catch words containing that bad word?
[12:08:14] revi: it doesn't fetch shitposter
[12:08:18] ㄷ그
[12:08:19] erm
[12:08:34] size tripled :cries:
[12:08:36] revi: but you can make regexes
[12:08:47] that's sooooo good to hear
[12:09:06] :D
[13:21:16] 10Revision-Scoring-As-A-Service-Backlog, 10rsaas-editquality, 07Spanish-Sites: Deploy edit quality models for eswiki - https://phabricator.wikimedia.org/T130279#3112881 (10Liuxinyu970226)
[13:22:04] 10Revision-Scoring-As-A-Service-Backlog, 10Wikilabels, 10rsaas-editquality, 07Spanish-Sites: Complete eswiki edit quality campaign - https://phabricator.wikimedia.org/T131963#3112882 (10Liuxinyu970226)
[13:22:10] 10Revision-Scoring-As-A-Service-Backlog, 10Wikilabels, 10rsaas-editquality, 07Spanish-Sites: Complete eswiki edit quality campaign - https://phabricator.wikimedia.org/T129701#3112883 (10Liuxinyu970226)
[15:36:02] made an (incomplete) list of bad words http://pastebin.com/wDpuzvzs
[15:36:23] pasting the link before shutting down my laptop (so I can resume work tomorrow)
[15:36:39] maybe the phab task should be the place to log things, but who cares? nobody!
[15:51:27] o/
[15:51:58] Thanks revi. I'll use this to get started.
[16:02:02] 10Revision-Scoring-As-A-Service-Backlog, 10Bad-Words-Detection-System, 10revscoring: Add language support for Korean - https://phabricator.wikimedia.org/T160757#3109896 (10Halfak) @revi put together this list based on some other sources: P5073
[16:03:26] Amir1, it looks like we might be giving up on BWDS for Korean. Is that right?
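The distinction discussed above (a plain word entry does not catch "shitposter", but a regex entry can) comes down to word-boundary matching versus prefix matching. A minimal sketch, with illustrative patterns rather than the actual ORES word lists:

```python
import re

# A word-boundary pattern matches only the exact word; the \b after
# "shit" fails when the next character is a letter, so "shitposter"
# is not caught. Adding \w* makes the pattern catch derived forms.
exact = re.compile(r"\bshit\b", re.IGNORECASE)
loose = re.compile(r"\bshit\w*", re.IGNORECASE)

text = "that shitposter again"
exact_hits = exact.findall(text)  # [] -- "p" is a word character, no boundary
loose_hits = loose.findall(text)  # ["shitposter"]
```

This is also why switching a list over to regexes can triple its effective coverage without adding entries.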
[16:04:00] this badwords list will help get us started, but we'll also want some informals. Maybe revi can help us with that later.
[16:04:08] halfak: hey, yes it seems so
[16:04:23] revi told me that he will do it too
[16:04:29] I wonder if limiting the char set would be helpful for Korean.
[16:04:33] Sounds great :)
[16:04:54] I really like the idea of having char sets. That will help us get some interesting signal. :)
[16:19:00] * halfak is going to try to polish up his work on BWDS and then maybe start moving over some of the cool stuff with alphabets.
[16:59:29] I actually made P5072 after posting the pastebin lol
[17:00:18] and I'll work on informals, but my new semester just started; I have things to do in real life, so my availability for working on it is limited
[17:00:19] :-p
[17:05:25] woops
[17:05:40] revi, thanks and no worries. You've been very responsive and helpful. We really appreciate it :)
[17:05:50] I'm working on Korean char sets right now :)
[17:09:09] :D
[18:42:47] OK, I now have normalizers implemented for the following:
[18:43:04] removing lang codes from inter-wiki links
[18:43:17] removing tokens that do not have at least one char from the target alphabet
[18:43:25] converting everything to lower case
[18:43:36] removing 1337 5p3ak
[18:43:44] stemming
[18:43:56] and de-repeating characters, like loooooooooooooool --> lol
[18:44:15] But the output will include the originals. We'll just make counts based on normalized, filtered tokens. :D
[18:48:37] I'm pretty well positioned to use all of this in the dump processor too.
[22:10:16] halfak: hey o/
[22:10:25] hey glorian_wd
[22:10:50] are you still here for some time?
[22:10:57] or are you gonna go offline shortly?
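Two of the normalizers listed above, the target-alphabet filter and character de-repeating, can be sketched in a few lines. This is a hypothetical illustration of the idea, not revscoring's actual implementation; the function names are made up.

```python
import re

# Target-alphabet filter for Korean: keep only tokens containing at
# least one precomposed Hangul syllable (U+AC00..U+D7A3).
HANGUL = re.compile(r"[\uac00-\ud7a3]")

# De-repeating: collapse runs of 3+ identical characters to one,
# so "loooooooooooooool" normalizes to "lol".
REPEAT = re.compile(r"(.)\1{2,}")

def keep_korean_tokens(tokens):
    """Drop tokens with no character from the target alphabet."""
    return [t for t in tokens if HANGUL.search(t)]

def derepeat(token):
    """Lowercase and collapse character runs of length 3 or more."""
    return REPEAT.sub(r"\1", token.lower())
```

Keeping the originals in the output while counting only normalized tokens, as described above, means these functions would feed the counting stage rather than rewrite the text itself.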
[22:11:07] Offline shortly
[22:12:53] https://usercontent.irccloud-cdn.com/file/ivrlVO2v/looks%20good%3F
[22:15:24] halfak: I only added a single-line hack: http://pastebin.com/MvqASTfj
[22:16:18] not a really elegant solution, but hopefully it gets the job done *hopeful*
[22:16:35] Not done yet. Make it good. Also, don't overwrite the old PageAsOfRevision :P
[22:17:04] Oh, also that's not loading the page up from Wikidata.
[22:17:29] halfak: Concerning overwriting PageAsOfRevision: yeah, I will create a new method for that, maybe something like PrintablePageAsOfRevision
[22:17:46] what do you mean by "make it good"?
[22:18:35] Make it render as nicely as you can :)
[22:18:43] oh, with CSS stuff?
[22:18:49] Yes
[22:18:55] Also, load the iframe from Wikidata
[22:19:04] hmm
[22:19:25] from here: https://www.wikidata.org/wiki/?oldid=424539371&printable=yes ?
[22:22:14] yeah
[22:22:57] Oh okay. Then I have to delve into your code again to find out how you access the MediaWiki API
[22:23:25] I bet I should look at api.js?
[22:27:06] or could you give me a hint about how you pull the MediaWiki API to get all of the item attributes (e.g. statements, sitelinks, etc.)?
[22:40:40] halfak: ?
[22:41:09] Hit that exact URL ^
[22:41:16] interpret the content as HTML
[22:41:18] :)
[22:41:24] api.js is a good place to put it.
[22:46:50] okay, I will work on this again tomorrow
[22:46:56] thanks for the hint, halfak
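The approach halfak suggests above — hit the `oldid=...&printable=yes` URL and treat the response as HTML — can be sketched with a small URL-building helper. The function name is illustrative (the project puts this logic in api.js on the JavaScript side); only the URL format comes from the chat.

```python
from urllib.parse import urlencode
from urllib.request import urlopen  # only needed for the actual fetch

def printable_revision_url(oldid, site="https://www.wikidata.org"):
    """Build the URL for the printable HTML rendering of a revision,
    suitable for loading into an iframe."""
    return f"{site}/wiki/?{urlencode({'oldid': oldid, 'printable': 'yes'})}"

# Fetching the rendered HTML (commented out to avoid a network call here):
# html = urlopen(printable_revision_url(424539371)).read().decode("utf-8")
```

Because the printable rendering is already full HTML, no separate API calls for statements or sitelinks are needed for display; those would only matter if the attributes had to be processed individually.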