[00:00:31] * YuviPanda is about to fly off
[00:00:40] any recommended reading folks?
[00:00:51] * YuviPanda contemplates reading the DNS RFC
[00:01:16] YuviPanda, I never got the notes from that last paper.
[00:01:19] Umm....
[00:01:22] * halfak thinks
[00:01:40] halfak: oh, Borg.
[00:01:45] Yeah!
[00:01:46] Worth it?
[00:02:00] YuviPanda, http://www.pensivepuffin.com/dwmcphd/syllabi/info447_wi12/readings/wk05-ConflictInCollaborations/geiger.BanningAVandal.CSCW10.pdf
[00:02:02] tldr is 'omg guys, we had so much money we threw a lot of people at the 'network as a machine' problem and have some pretty elegant solutions!'
[00:02:20] I think so - it's a fairly easy read
[00:02:33] and talks about stuff that makes for great optimizations
[00:02:57] YuviPanda, http://www.researchgate.net/profile/Carolyn_Miller4/publication/232915504_Review_of_Sorting_Things_Out_Classification_and_Its_Consequences/links/55218a110cf2f9c130528363.pdf
[00:05:00] halfak: downloaded!
[00:05:19] halfak: that looks awesome!
[00:06:38] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.8928&rep=rep1&type=pdf
[00:06:58] ^ If you're still around, that's not my favorite from Bar-Yam, but it's the best I could find really fast
[00:07:03] * halfak digs for that one paper.
[00:07:35] This one! http://www.necsi.edu/projects/yaneer/Civilization.html
[00:07:50] YuviPanda, ^
[00:07:56] Much more fun to consume :)
[00:08:26] wheee
[00:08:36] done
[00:08:53] vice versa, do read https://en.wikipedia.org/wiki/Maher_Arar :)
[00:10:17] Best lead I ever read
[00:11:02] I mean, terrifying, but very clear.
[00:12:01] heh, it's technically possible that I try to enter Germany, get deported to the US, get denied entry to the US, and then get deported to India... :)
[00:12:06] 'technically'
[00:12:33] Stop looking so much less like the TSA!
[00:12:53] halfak: I will hold off on roadmapping for now and continue indefinite iteration. But my plan for renewal mostly consists of taking what already exists and making it better, plus better integration with stuff that already exists on Wikipedia.
[00:13:09] heh
[00:18:22] *MediaWiki ;)
[00:18:30] If you're going to start thinking cross-wiki.
[00:21:12] I was thinking of services offered specifically on English Wikipedia, but yes, cross-Wikimedia project integration.
[00:22:05] now my challenge is expressing my requests in terms of sprints ;]
[00:22:15] but james! you can't specify ten sprints in advance! sprints don't work like that!
[00:24:19] * halfak cracks himself up writing curse-filled sentences for his unit tests.
[00:24:47] For me, SCRUM is a ritual and nothing more.
[00:25:07] So are grant proposals
[00:25:34] "We're going to " stops no one when they realize that is *way* more interesting and easier to do.
[00:26:16] I hadn't bothered with a project management strategy because there were only two of us. So it all worked very well.
[00:26:28] (Two of us, with occasional collaborators.)
[00:27:55] Though, I want to bring on a third person to do JavaScript!
[00:30:14] [No definite plan to do so, of course; I need to collect feedback from my pilot projects]
[00:32:09] I'm amazed that I haven't been banned from everything.
[00:32:22] I submit a huge number of offensive words in many languages to GitHub.
[00:33:04] You think anyone would care? Programmers are a notoriously salty crowd.
[00:33:22] It's somehow different when I use the word in a sentence.
[00:33:56] "Commit summary: Fuck you!"
[00:34:04] I need to make sure that the curses get parsed out correctly.
[00:34:13] 'I work an association. Of stupidasses.'
[00:34:25] I needed to make sure that words that start with 'ass' wouldn't get picked up.
[00:34:42] But also that I could find an 'ass' word that didn't start with 'ass'.
[00:34:45] Except for when they should: "assclown," "asshat," "asshole"
[00:34:56] Yeah. I have a lot of variants of ass
[00:35:01] ...my list is in alphabetical order and I didn't even do that on purpose
[00:35:22] * halfak scrutinizes the list to verify that.
[00:35:24] confirmed
[04:09:05] halfak you mean like an association?
[04:09:17] assistant
[14:05:15] Hey folks!
[14:05:17] o/ ToAruShiroiNeko
[14:18:55] hey halfak
[14:19:07] what is the difference between "informals" and "badwords"?
[14:19:26] are the two lists supposed to be mutually exclusive?
[14:19:48] Helder, "informals" are words that you would use in conversation that aren't really appropriate in a formal text.
[14:19:59] I think they'll work better if they are exclusive.
[14:20:11] We can always sum the features together if we want them to intersect.
[14:32:38] Helder, I think that we'll get good signal from informals in articles and less good signal in talk pages.
[14:32:51] So it's good to differentiate
[14:33:11] We'll get *way* better signal for badwords in talk pages if they don't include informals.
[14:33:16] I was trying to understand the difference between the two by looking into existing languages in revscoring
[14:33:21] but then I found intersections =/
[14:34:31] E.g. "pene": https://github.com/wiki-ai/revscoring/blob/master/revscoring/languages/spanish.py#L40
[14:35:30] Yeah. I think those are mistakes and I've been trying to clean them up.
[14:35:45] This gets some of them: https://github.com/wiki-ai/revscoring/pull/169
[14:36:07] I have another PR in progress that'll include some more fixes.
[14:36:18] I'm turning languages into simple feature collections.
[14:36:31] Something I'm a little more worried about is that since the introduction of RegexLanguage the library might not be considering some badwords as being badwords, because if I remember correctly we don't use stemming anymore
[14:36:59] (in Portuguese, I mean)
[14:37:00] Helder, yeah. That's right, but I saw AUC go up.
[14:37:19] Sub-1%, but still
[14:37:31] We could also use stemming.
[14:37:46] Stemming + regex will be weird.
[14:38:23] I mean, when I constructed the list, I threw away many badwords which Salebot matches because they had the same stem, and since we were using stemming, only one of the words would be enough in the list
[14:39:33] Helder, we can manually stem in the regex for terms that need that.
[14:39:46] E.g. for English, I implemented the following regex:
[14:40:05] r"\w*f+u+c*k+\w*"
[14:40:26] Which will match "motherfucker" and "fuuuuuuuckkkk"
[14:40:29] yeah, but since the list was huge, I didn't do that filtering manually
[14:40:42] Gotcha. We'll likely want to take another pass.
[14:41:03] I'm also sad about our testing. We need to include more variants in our tests so that we know when these types of things fail.
[14:41:19] Anyway, we can always do both stemming and regex matching.
[14:45:37] halfak, what is the procedure for adding a test which demonstrates a regression?
[14:45:44] I mean, can a failing test be added?
[14:46:00] Helder, generally, you should add the test and then solve the problem.
[14:46:14] Either that or you can demonstrate it on the command line and then file a bug.
[14:46:32] "add the test" and leave it failing until a fix is committed in the future?
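A minimal Python sketch of the regex-style badword matching discussed above, with the "stemming" done manually inside each regex so that variants like "motherfucker" and "fuuuuuuuckkkk" are caught. The pattern list and helper function are illustrative names, not revscoring's actual API:

    import re

    # Hypothetical patterns in the style discussed above: each regex is meant to
    # catch a badword plus common variants (prefixes, suffixes, repeated letters)
    # without a separate stemming step.
    BADWORD_PATTERNS = [
        re.compile(r"\w*f+u+c*k+\w*", re.IGNORECASE),
        re.compile(r"a+ss+(clown|hat|hole)s?", re.IGNORECASE),
    ]

    def is_badword(token):
        """Return True if the token matches any badword regex."""
        return any(pattern.fullmatch(token) for pattern in BADWORD_PATTERNS)

    # Variants mentioned in the chat:
    assert is_badword("motherfucker")
    assert is_badword("fuuuuuuuckkkk")
    assert is_badword("asshole")
    assert not is_badword("assistant")     # starts with 'ass' but isn't a badword
    assert not is_badword("association")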
[14:47:27] I don't think we want to add tests and just let them fail.
[14:48:31] https://gist.github.com/halfak/681dd5a51630c0d416df
[14:48:35] Helder, ^
[14:48:41] Do that and submit it as a bug.
[14:50:22] halfak, could you run that for the words "gozar" and "gozei" in Portuguese? (both should be considered badwords, and I believe they were before the change to regexes)
[14:50:32] (my setup is a little messed up right now)
[14:51:55] Appended to https://gist.github.com/halfak/681dd5a51630c0d416df
[14:52:15] Could you propose a better regex for the word?
[14:52:23] Or maybe list a set of expected variants?
[14:52:33] yey! So we probably have more than 50% of the badwords matched by Salebot not being matched anymore
[14:54:17] For example, this line shows the list of words whose stem is "goz", by number of removals (edits which removed that word):
[14:54:17] https://gist.github.com/he7d3r/7e3718a43f5ce65e0dab#file-salebot-stems-words-stats-txt-L152
[14:54:43] it has many variants, and we are only considering the variant "gozar"
[14:55:14] the same happens for all other lines in that file (i.e. all our Portuguese badwords), which have their own variants
[14:56:52] halfak, ^
[15:05:17] Helder, OK. Did you ever confirm that stemming was working for this before?
[15:05:29] Also, let's demonstrate the problem and then fix it.
[15:05:38] yep, when I first generated the list
[15:07:40] halfak, reported as https://github.com/wiki-ai/revscoring/issues/170
[15:07:42] I ran into some Quarry fans in Stockholm
[15:07:50] YuviPanda, \o/
[15:08:02] Helder, should have been in the tests :\
[15:08:03] I didn't even know I was going to be in Stockholm
[15:08:35] Helder, why is 'como' a badword?
[15:09:49] halfak, depends.
[15:13:15] Hmm... Seems like a bad idea then.
[15:13:22] Could it be more 'informal'?
[15:13:39] Is there a good reason that como should appear in reference material outside of a quote?
[15:15:40] sure
[15:15:44] But see https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service#Badwords
[15:16:09] The whole badwords list is full of things which are not necessarily obscene
[15:16:50] Sure. That's an automatically extracted list. It contains a lot of words that aren't "badwords" that got picked up by mistake.
[15:17:31] The Portuguese badwords will need to work outside of Wikipedia too.
[15:17:42] that is what I said when I generated it, but since we didn't have clear criteria for filtering it, we never did it
[15:18:05] Hmm.. We've been having people filter these lists for a while.
[15:18:13] as for the word "como" ("to eat"), it can be a verb or an adverb (e.g. "like")
[15:18:15] I regret that the criteria were unclear when you looked at it.
[15:19:12] Here are the instructions that I gave for Spanish: "One list of offensive words. Another of informal words (that don't belong in a wiki)"
[15:23:11] Is this still the current list for ptwiki? https://gist.github.com/Ladsgroup/cc22515f55ae3d868f47#file-ptwiki
[15:25:12] https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/pt
[15:25:39] o/ bearloga
[15:27:29] halfak: ahoy there! how goes it?
[15:27:45] halfak, if I have a file "pages-meta-history.xml.7z", is there a command I can use in the terminal to pipe it to a grep command?
[15:27:50] Not bad. Just thought I'd wave and say good morning :)
[15:28:11] 7z e -so | less
[15:28:20] Might be '7zr'
[15:28:26] depending on how it was installed.
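A rough sketch of the same dump-piping idea driven from Python instead of the shell, assuming the `7z` (or `7zr`) binary is on the PATH; the filename comes from the chat and the search term is just an example:

    import subprocess

    # Equivalent to: 7z e -so pages-meta-history.xml.7z | grep gozar
    DUMP = "pages-meta-history.xml.7z"   # example filename from the chat
    PATTERN = "gozar"                    # hypothetical search term

    extract = subprocess.Popen(["7z", "e", "-so", DUMP], stdout=subprocess.PIPE)
    grep = subprocess.Popen(["grep", PATTERN], stdin=extract.stdout,
                            stdout=subprocess.PIPE)
    extract.stdout.close()  # let 7z receive SIGPIPE if grep exits early

    for line in grep.stdout:
        print(line.decode("utf-8", errors="replace"), end="")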
[15:30:00] thanks
[15:40:26] Helder, are you planning to take a pass over the Portuguese badwords to turn them into regexes?
[15:40:42] If not, I'll do my best with what's there
[15:41:02] Also, it would be nice to have good test support for this.
[15:41:24] I'm considering taking all of the words in the dict and passing them through a regex so that you can see what words you match.
[15:45:57] Yeah... that might not work.
[15:46:18] Looks like you can't just ask a dict for all its words. :\
[15:50:23] why not just copy the Salebot page then? They are already regexes
[15:50:42] I only converted it to words because that was what revscoring would use
[15:51:16] Helder, not sure. Does Salebot do a good job of identifying offensive words?
[15:51:37] We should probably grab their list and curate it.
[15:51:59] that is what I "did"
[15:52:40] I mean, I took the lines with the greatest "score", and looked for matches in the dumps, sorting by number of removals
[15:54:21] Yes. It seems that we should either re-implement a stemmer (what I suggested earlier) or do that again targeting regex matching.
[15:54:40] Either way, it seems that we need to capture these concerns in the tests.
[15:57:27] I was hoping that when the "badwords as features" were implemented (what you call bag of words) we could finally filter my ptwiki list by keeping the ones with the highest weights learned by the models
[15:57:33] https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service/2015#.28Bad.29Words_as_features
[15:57:50] Helder, seems like that's slow to arrive.
[15:57:51] "In the context of vandalism detection, top predictors could be used to improve the lists of badwords used by abuse filters, Salebot and similar tools which do not use machine learning. In the specific case of the Salebot list, we could even use the learned weights to fine tune the weights used by the bot."
[15:58:24] Also, learned weights from Wikipedia may not work for Wiktionary or WikiHow.
[15:58:52] I guess that is OK. We can repeat the process there.
[15:59:21] the same way the badwords lists are generated from Wikipedias, and may not be valid for other wikis =P
[16:00:49] I suspect that the "badword" peculiarities of a wiki will primarily result in overfitting.
[16:07:20] * Helder is sad for having to maintain too many badwords lists, each in a different format
[16:07:36] They are all in the same format.
[16:07:38] https://pt.wikipedia.org/wiki/Usu%C3%A1rio(a):Salebot/Config#Ver_tamb.C3.A9m
[16:07:51] Oh... that.
[16:08:03] Yeah, I don't see a good proposal for a unified format.
[16:08:45] that
[16:11:37] Could write one up. :)
[16:11:51] What I really want is a format that includes:
[16:12:11] , ,
[16:20:56] It would be great to have a DB of this stuff that would generate matches and non-matches after an update.
[16:44:04] halfak, can you explain roughly the idea Amir used to generate the lists?
[16:44:29] TF/iDF
[16:44:38] Take the words that are added in revisions.
[16:45:02] TF is the proportion of reverted edits that a particular word appears in.
[16:45:17] iDF is the inverse proportion of all edits that a particular word appears in.
[16:45:26] TF == how common this word is in reverted edits
[16:45:33] DF == how common this word is in all edits.
[16:45:55] It's really TF/DF or TF * iDF
[16:47:12] thanks!
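A minimal sketch of the TF/DF scoring described above, assuming each edit is represented as the set of words it added plus a flag for whether it was reverted; the function and variable names are illustrative, not Amir's actual implementation:

    from collections import Counter

    def badword_scores(edits):
        """edits: iterable of (added_words, was_reverted) pairs."""
        reverted_counts = Counter()  # reverted edits each word appears in
        all_counts = Counter()       # edits overall each word appears in
        n_reverted = 0
        n_total = 0
        for added_words, was_reverted in edits:
            n_total += 1
            all_counts.update(set(added_words))
            if was_reverted:
                n_reverted += 1
                reverted_counts.update(set(added_words))
        if n_reverted == 0:
            return {}
        scores = {}
        for word, edit_count in all_counts.items():
            tf = reverted_counts[word] / n_reverted  # how common in reverted edits
            df = edit_count / n_total                # how common in all edits
            scores[word] = tf / df                   # high score: peculiar to reverted edits
        return scores

    # Words with the highest TF/DF ratio are badword candidates, e.g.:
    # sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:100]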
[16:53:26] I was trying to understand why a username appeared in the ptwiki list
[17:14:40] halfak, so for consistency, I filtered Amir's list and added it here:
[17:14:41] https://meta.wikimedia.org/w/index.php?diff=13088354
[17:15:04] I won't have the time to move it into revscoring for now
[17:15:12] Great! I can work from this.
[17:15:33] * Helder needs to go
[17:15:40] * Helder o/
[20:58:48] o/ ToAruShiroiNeko
[20:58:58] We were PMing about handling Chinese and Japanese.
[20:59:13] yes
[20:59:35] so I think Amir's strategy would work well.
[20:59:41] I have a lot to learn, but I've already done some work in deltas to make tokenization make more sense.
[20:59:42] https://github.com/halfak/deltas/releases/tag/v0.3.2
[21:00:05] wikitext_split now splits CJK (Chinese, Japanese, Korean) chars as tokens.
[21:08:36] halfak also we can translate between simplified Chinese and traditional
[21:08:39] wiki does this somehow
[21:08:47] the two used to be separate wikis
[21:09:00] use of the character set is a good feature
[21:09:02] Is it just a character mapping?
[21:09:06] yes
[21:09:10] OK
[21:09:23] each traditional and simplified character should have matching equivalents
[21:09:33] however some vandalism will be in one or the other character set
[21:10:14] someone from mainland China/PRC (traditional?) will use bad words differently from Taiwan/ROC (modern?)
[21:10:37] Can we even tell what character set an edit was made in?
[21:11:38] we should be able to. I think they translate it based on the character set pref
[21:11:48] not only the UI changes but all text
[21:12:03] how MediaWiki handles it would be an interesting question
[21:12:16] mixing the two character sets could be good signal too
[21:12:26] I imagine both have separate Unicode ranges
[21:55:39] ToAruShiroiNeko, indeed they do. I have the ranges labeled in the source file in deltas.
[21:55:50] https://github.com/halfak/Deltas/blob/master/deltas/tokenizers/wikitext_split.py#L45
[22:34:02] halfak I am not sure which one refers to what range.
[22:34:12] perhaps you could denote that with a comment
[22:34:14] ?
[22:35:47] \u4E00-\u9FFF is mostly simplified Chinese
[22:36:20] This is the best I've got: http://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode
[22:36:55] If you're interested in doing the work to figure this out, submit a pull request to deltas and we can have different token types for simplified/traditional etc.
[22:45:16] halfak: I'm having success replicating our last session together on my other machine. But could you remind me what we installed after we successfully installed ores?
[22:45:40] and from which repositories?
[22:45:48] Once you are able to ssh into the machine, I had you run the following commands:
[22:46:04] Oh! I made a gist for this.
[22:46:07] Virtualenv: https://gist.github.com/halfak/9f4830895496af9e9731
[22:51:22] Then set up your 'projects' folder with this:
[22:51:23] https://gist.github.com/halfak/5146e66178fadd8d3ac8
[22:51:45] These things should migrate to the wiki eventually.
[22:54:13] Ok. I had done everything from the first gist.
[22:54:28] So is the (venv) shell to be used?
[22:55:08] (ever)?
[23:18:51] So you prefer that I clone rather than fork and clone?
[23:19:15] (Second gist step 2)
[23:29:57] Ok, I just cloned directly. That's all for now, thanks.
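Relating to the CJK tokenization exchange above (the 21:00-22:36 discussion): a rough Python sketch of splitting CJK characters into one-character tokens while keeping Latin word runs together, in the spirit of what wikitext_split does in deltas. The Unicode ranges and regex here are a simplification for illustration (e.g. U+4E00-U+9FFF is the main CJK Unified Ideographs block), not copied from deltas' source:

    import re

    TOKEN_RE = re.compile(
        r"[\u4E00-\u9FFF\u3040-\u30FF\uAC00-\uD7AF]"  # one CJK ideograph / kana / hangul char per token
        r"|[A-Za-z0-9_]+"                             # or a run of Latin word characters
        r"|\s+"                                       # or whitespace
        r"|."                                         # or any other single character
    )

    def tokenize(text):
        return TOKEN_RE.findall(text)

    print(tokenize("Vandalism 破坏 test"))
    # ['Vandalism', ' ', '破', '坏', ' ', 'test']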