[05:50:56] leila!
[05:50:56] ;D
[05:50:58] * Ironholds hugs
[05:51:24] It is 1:51am and I am copyediting a paper Scott and I are submitting to CHI. Hi! Welcome back to the U.S. of....dammit, I knew that letter..
[05:51:52] U.S. of E? It was around that bit of the alphabet. Eh, it's gone.
[05:52:07] Hi Ironholds.
[05:52:51] how goes?
[05:52:54] I've arrived in Frankfurt, will take me some more time to get to the U.S.
[05:52:58] ahh
[05:53:06] well, you know. Frankfurt has good stuff too.
[05:53:13] yeah
[05:53:19] ...actually I know nothing about Frankfurt, except that at least at some point, they did sausages.
[05:53:39] hahaha! that's a safe bet around here.
[05:53:58] also: I left town, then you left town, then half of the org quit.
[05:54:05] I'll be online for the next 6 hours or so, unless I need to move around.
[05:54:13] ...I think the conclusion here is, there is a minimum density of researchers without which nobody can survive.
[05:54:23] we have to have >=2 of us in the office at any one time.
[05:54:28] I mean, these emails are quite overwhelming. someone should explain to me what's going on in a map
[05:54:28] :D
[05:54:44] haha! I like your conclusion
[05:55:01] summary: Heather from Recruitment left, Terry left, Sumana left, Benny left, Alolita left.
[05:55:16] what this means to us: ...we have to go through Emily for the traffic analyst position, I guess?
[05:55:26] oh, and Steven left
[05:55:34] so...we get a bit more of Aaron's time, or something.
[05:55:45] I have no idea how many are gone and how many are going on [date], because I am remote and all.
[05:56:30] yeah, this seems to be an accurate summary. ;-)
[05:57:08] so yeah, more emily, more aaron.
[05:57:20] ...if more people leave what other fun stuff do we get?
[05:57:36] I like Emily and Aaron. If this is the only direct consequence of mass departures...
[05:57:38] I'd recommend you continue copy-pasting. ;-)
[05:57:47] huh?
[05:57:52] your paper
[05:58:58] you mean copy-editing?
[05:59:07] one being academic plagiarism and the other being fixing typos? :D
[05:59:17] yes. :D
[06:02:24] b.t.w., what is Ellery's IRC? does he use it?
[06:02:53] Ironholds, ^
[06:03:22] leila, ewu...however you spell his last name.
[06:03:29] all I know is that it's far too long to not contain any vowels.
[06:03:55] ah! got it
[06:03:56] It's like a dive bar without any PBR. I'm not saying it's not okay, I'm just saying I've never been to a dive bar that didn't have PBR.
[16:59:59] Hey HenriqueCrang!
[17:21:47] hi halfak!
[17:22:41] i am just catching up with my emails now
[17:22:50] and reading Revision scoring as a service
[17:24:17] :)
[17:24:45] I'm just writing up the tests for the features related to bad words and misspellings. :)
[17:26:53] you deserve a medal!
[17:28:40] Woot! I think the feature extractor is pretty close. I still haven't trained a model yet. :/
[17:28:51] How much trouble do you think it would be to construct a train/test set for ptwiki?
[17:33:10] halfak, I think the right guy to help us just entered the room!
[17:33:12] hi Helder
[17:33:29] o/ Helder
[17:34:44] Ideally, we'd train/test models with a large (~10k) random sample of recent revisions hand-coded for damage/good-faith/etc.
[17:35:10] It would be great if we could start testing with a dataset that is already available.
[17:35:24] There are good options for enwiki, but I don't know about ptwiki.
[17:36:27] * Ironholds thinks
[17:36:41] halfak, didn't Maryana and Steven do some work on ptwiki waybackwhen?
[17:36:49] It was while they were halfway between Community and Product
[17:36:56] Ironholds, nothing I remember.
[17:36:59] they might have produced something, or know if it exists
[17:37:08] I can't remember what they were studying, unfortunately
[17:38:36] halfak, we have some smaller samples
[17:38:46] i think we will need to evaluate a bigger one now
[17:38:55] HenriqueCrang, might be fine to run a few tests.
[17:38:59] I was talking with Helder on Friday about just this
[17:39:31] We might also consider building a nice hand-coding tool so that we can ask a crowd of volunteers to help us.
[17:40:10] halfak, we had a gadget on ptwiki to do something like this
[17:40:13] let me find it
[17:40:56] Oooh! Awesome
[17:42:23] but, AFAIR, we were evaluating whether the revisions should be reverted or not, but didn't go through good and bad faith
[17:42:46] That's cool.
[17:43:04] I think we should have a classifier for good-faith, but damage is more important
[17:45:19] halfak, i just mixed up stories in my memory
[17:46:02] we did this evaluation with 400 revisions that lived for 10 or more days without being patrolled
[17:46:27] https://pt.wikipedia.org/wiki/Wikip%C3%A9dia:Projetos/AntiVandalismo/Pesquisas
[17:47:04] the evaluation gadget was used for something else: evaluating filter actions and looking for false positives
[17:47:07] https://pt.wikipedia.org/wiki/Wikip%C3%A9dia:Filtro_de_edi%C3%A7%C3%B5es/An%C3%A1lise#Detec.C3.A7.C3.B5es_realizadas
[17:48:06] but I definitely think we can use these experiences to create a gadget to generate our sample
[17:48:21] How does the gadget work?
[17:52:40] https://meta.wikimedia.org/w/index.php?title=User:He7d3r/Tools/AbuseLogStatus.js
[17:53:38] Gotcha. It seems like we might be able to do a little bit better on the cheap.
[17:53:49] in this case of the abuse filter, it adds a question in the log pages and asks the user to answer "yes" or "no" to the question "was this action correct?"
[17:54:09] I'll write up a proposal for a gadget that works like this one and a random sample set hosted somewhere (probably tool labs).
[17:54:12] and then saves the answer in a support page
[17:54:34] I wonder if we should do it on-wiki rather than interacting with labs. ... that could work.
[17:55:11] i thought something like this... a page that shows random revisions and asks questions like "valid edit?" "good or bad faith?"
[17:56:08] +12
[17:56:11] or just 1
[17:56:16] na. I like 12
[17:56:18] +12
[17:58:17] we can create pages on-wiki to do it on the Wikipedia domain. A temp database on tool labs would also do the trick, but may be less transparent
[17:58:39] sorry, didn't understand the "+12"
[18:04:01] was a typo. I intended to do "+1".
[18:05:21] ow
[18:05:22] :)
[18:05:33] +13
[18:05:34] :D
[18:06:14] Check it out, HenriqueCrang: https://gist.github.com/halfak/0b362fae3ce143bd3877
[18:10:10] One more thing. I need a BADWORDS list for Portuguese. See the badwords list I have for English: https://github.com/halfak/Revision-Scoring/blob/master/revscores/language/english.py
[18:10:12] great. we need to i18n badwords and misspelingg
[18:10:37] :)
[18:10:56] I think I'll be able to find a corpus for misspellings. Let me poke around.
[18:11:18] we have some regex used in the filters
[18:13:04] https://pt.wikipedia.org/wiki/Especial:Filtro_de_abusos/18
[18:13:15] Good starting place.
[18:13:43] i am wondering here if the abuse filter log could be used to train bad words
[18:13:53] It looks like I can do stemming in Portuguese with Python's nltk.
[18:13:59] But I'm struggling to find a good dictionary
[18:14:07] but, on the other hand, these bad words aren't allowed to be saved at all now
[18:14:14] Gotcha
[18:14:22] Even for experienced editors?
[18:14:35] oops, you are right
[18:14:49] confirmed users bypass it
[18:15:15] just for the record, we now have 4 filters on ptwiki for bad words
[18:15:17] https://tools.wmflabs.org/ptwikis/Filtros:18&68&7&70
[18:15:56] That's a pretty graph
[18:15:59] but 2 are just sending warnings and labeling revisions
[18:16:38] you can generate them for enwiki too if you want
[18:16:39] https://tools.wmflabs.org/ptwikis/Filters:enwiki
[18:16:40] ;)
[18:17:27] you can select the filters you want and click "show graph"
[18:17:27] '"Your mom" Vandalism'
[18:17:48] lol
[18:18:13] https://tools.wmflabs.org/ptwikis/Filters:enwiki:11&320
[18:18:21] It's very nice that vandals are mostly unoriginal.
[18:18:29] "your mom" beats "it sucks"!
[18:25:31] Oh! It looks like I *can* use wordnet for ptbr :D
[18:27:20] * halfak downloads the Open Multilingual Wordnet
[18:27:30] Woo! It works!
[18:28:13] never used it, I'll google it!
[18:28:27] Looks like the only thing I really need now is a bad word list.
[18:28:58] I'll start with this: https://en.wikipedia.org/wiki/Portuguese_profanity
[18:31:17] wow, never seen this article
[18:31:59] It's giving me ideas for additions to the English bad word list.
[18:32:23] some of those can create false positives, would you like me to review it in some way?
[18:32:29] Yes please.
[18:32:40] False positives are OK to an extent since we're looking for signal.
[18:33:01] The features that I expect to use will control for the overall "badwordiness" of the article before the edit took place.
[18:33:36] words like "monkey" or "horn" are only bad in some contexts
[18:34:06] before? so you are thinking about a filter?
[18:34:39] +1 It would probably be best to have you generate the list.
[18:35:09] For controlling for badwordiness, check out: https://github.com/halfak/Revision-Scoring/blob/master/revscores/feature_extractors/added_badwords_ratio.py
[18:35:32] It compares the badword/word proportion of the added text with the badword/word proportion of the text before the change.
[18:35:59] cool
[18:36:32] now I got it
[18:36:32] This should reduce false positives, but it means that including something like "monkey" would reduce the usefulness of this feature.
[18:38:08] Daww. Thanks for the wikilove. :)
[18:38:24] you deserve it! :)
[18:39:05] so, about the bad words list
[18:39:17] in which format do you need it?
[18:39:30] https://github.com/halfak/Revision-Scoring/blob/master/revscores/language/portuguese.py
[18:40:05] ^ Just added the Portuguese language file. Note that it will make use of a stemmer.
[18:40:48] In order to test, you'll need to download some Python nltk corpora
[18:41:14] But, if you simply fill in the list of badwords, I can probably take it from there.
[18:42:47] I've got to run.
[18:42:48] i'll just create the list now
[18:42:55] Have a good sunday!
[18:43:00] o/ Ironholds
[18:43:07] hey halfak :)
[18:43:09] take care!
[18:43:17] when you get back (if you get back) I have an RT to ask of ya
[18:43:32] thanks a lot, halfak. Have a great sunday!
[22:51:19] halfak: hey!
[22:51:33] Ironholds: what are the valid device classes?
[22:51:46] Guerillero, "phone", "tablet"
[22:51:48] it seems I missed some talking... will check the log
[22:51:55] but I may have found a way to automate it, so...hold off for a bit
[22:52:00] ok
[22:52:10] you have the Wii on there
[22:52:15] and it is a game system
[23:05:43] Guerillero, well, yes
[23:05:51] but it's one people can browse the internet on
[23:07:28] nice! random sets of revisions waiting for volunteers to train some system
[23:08:22] LOL... Portuguese profanity
[23:20:41] halfak: Here is another list of badwords (with badness indexes :-):
[23:20:41] https://pt.wikipedia.org/w/index.php?title=Usu%C3%A1rio%28a%29:Salebot/Config&oldid=39984868
[23:21:27] (and this might be useful for viewing that page:
[23:21:27] https://github.com/he7d3r/mw-gadget-FormatSaleBotRegexes/blob/master/src/FormatSaleBotRegexes.js
[23:21:28] )
[23:22:04] halfak: there is also this:
[23:22:05] https://pt.wikipedia.org/wiki/Wikip%C3%A9dia:Software/Anti-vandal_tool/badwords
[23:22:13] and this
[23:22:14] https://pt.wikipedia.org/wiki/Usu%C3%A1rio%28a%29:Alchimista/Express%C3%B5es.css
[23:22:28] and this analysis:
[23:22:28] https://pt.wikipedia.org/wiki/Wikip%C3%A9dia:Projetos/AntiVandalismo/Express%C3%B5es_problem%C3%A1ticas
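
The feature described at [18:35:32], comparing the badword/word proportion of the added text with that of the text before the change, could be sketched roughly as below. This is only an illustration and not the code in added_badwords_ratio.py: the names `BADWORDS`, `tokenize`, `badword_proportion`, `added_badwords_ratio` and the `smoothing` constant are all assumptions, the word list is a harmless placeholder, and no stemming is applied.

```python
import re

# Hypothetical placeholder list; the real Portuguese list is what the log sets out to build.
BADWORDS = frozenset({"foo", "bar"})


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def badword_proportion(text, badwords=BADWORDS):
    """Return the fraction of word tokens in `text` that are in `badwords`."""
    words = tokenize(text)
    if not words:
        return 0.0
    return sum(word in badwords for word in words) / len(words)


def added_badwords_ratio(text_before, text_added, badwords=BADWORDS, smoothing=0.01):
    """Compare the badword proportion of added text with that of the prior text.

    `smoothing` is an assumption made here so that a perfectly clean prior
    text does not cause a division by zero.
    """
    added = badword_proportion(text_added, badwords)
    before = badword_proportion(text_before, badwords)
    return (added + smoothing) / (before + smoothing)
```

Under these assumptions, the same added text scores lower when the article already contained listed words, which is the false-positive control mentioned at [18:36:32]. It also shows why a context-dependent entry such as "monkey" would weaken the signal: it raises the baseline proportion for ordinary articles.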