[06:27:20] lzia, you're home! you can go to sleep now!
[10:21:24] I wonder if there will be more or fewer researchers for "us" :| HP and Microsoft fired hundreds/thousands http://spectrum.ieee.org/view-from-the-valley/at-work/innovation/hewlettpackard-splits-again-but-what-about-the-labs
[12:38:34] Nemo_bis, ooh, we should try and grab..hmn.
[12:39:38] oh damn, she became an academic. Nevermind.
[15:44:36] are we on or off for today's R&D meeting?
[15:53:24] Nettrom, I don't know, but you'll be pleased to know that I'm following up on "Party in the R&D" with "this is how we 'doop"
[15:53:42] * Ironholds is going to get an IEG grant to form a research cover band called We Are CHIentists
[15:53:53] More seriously, I'll poke halfak when he gets in
[16:11:54] as long as "this is how we 'doop" doesn't become "this is how we 'noop", it should be good
[16:13:18] 'noop?
[16:13:23] NOOP
[16:13:46] https://en.wikipedia.org/wiki/NOP
[16:14:31] * Nettrom admits that was a bit cryptic
[16:15:42] * YuviPanda fills Nettrom's code with NOPs, and then patches them at runtime to do malicious things
[16:15:56] ah, that explains why SuggestBot is down
[16:16:05] haha
[16:23:00] Hey Ironholds.
[16:23:18] (Nothing to say. Just saying hello.)
[16:23:19] :)
[16:23:53] Oh! What patch are you talking about?
[16:27:07] check the github badwords thingy
[16:27:12] I britishised it!
[16:34:03] halfak: any ideas how to make mw.xml_dump work with 7z files in the case of an xml dump which is split into 4 parts?
[16:34:47] My code only used the first file when I tried it like this: https://gist.github.com/he7d3r/f99482f4f54f97895ccb
[16:35:04] It's designed to do that.
[16:35:10] The "map()" function
[16:35:22] enwiki is 167 individual files :)
[16:35:42] In my test I passed "ptwiki-20140928-pages-meta-history1.xml.7z" but the other 3 files were ignored =/
[16:35:44] http://pythonhosted.org/mediawiki-utilities/core/xml_dump.html#mw.xml_dump.map
[16:37:24] Ironholds, britishizing is not necessary when stemming.
[16:38:03] halfak, fair!
[16:38:04] Ironholds, it seems we missed adding you to an important meeting. :-\
[16:38:09] leila, oh?
[16:38:14] me and Nettrom are sat in an awesome meeting
[16:38:14] we just figured it out with DarTar. You have an invite now
[16:38:17] join if you can.
[16:38:19] it's got me and nettrom!
[16:38:23] okay! You want a nettrom? ;p
[16:38:28] no
[16:38:30] just you
[16:38:32] it's goals
[16:38:39] gotcha
[16:39:16] Helder, do you have a gist that includes a call to "map()"?
[16:39:20] Maybe I'm missing something.
[16:39:31] leila, when did the meeting actually kick off?
[16:39:39] half an hour ago
[16:39:42] like 40 minutes ago
[16:39:44] so no R&D meeting today, thanks for the heads up on that
[16:39:44] :-(
[16:39:54] Sorry Nettrom.
[16:39:56] * Nettrom goes back to his CSCW paper
[16:39:58] halfak: I don't think so... That is the code I was using
[16:40:03] I declined it last night Nettrom.
[16:40:28] I'm planning to stick to keeping my calendar as a source of updating others, emails are too much for me. ;-)
[16:40:41] Gotcha. Check out map(). It is designed to serve your needs. :) It will spool up individual processors and process dump files in parallel.
[16:40:49] ah, I see, I'll have to check the participant list to know if it's on or not?
[16:40:54] You can always tell it to only start one thread too if you want it to be sequential.
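A minimal sketch (not from the log) of how the xml_dump.map() call discussed above might be pointed at all four parts of the split ptwiki dump, following the mediawiki-utilities docs linked at 16:35; the file names for parts 2-4 are assumed to follow the same pattern as part 1:

    from mw import xml_dump

    # Assumed file names: parts 2-4 are assumed to follow the part passed at 16:35.
    paths = ["ptwiki-20140928-pages-meta-history%d.xml.7z" % i for i in range(1, 5)]

    def page_info(dump, path):
        # Called once per dump file; map() spools the files out to parallel workers.
        for page in dump:
            yield page.id, page.namespace, page.title

    for page_id, namespace, title in xml_dump.map(paths, page_info):
        print(page_id, namespace, title)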
[16:41:04] Nettrom, not usually. This week is weird.
[16:41:04] * Ironholds nods
[16:41:06] That's what I do, Nettrom. not sure what others do
[16:41:18] so, quick request, for the whole team then
[16:41:18] nice! I was wondering how to use more than one processor too :-)
[16:41:30] this is the Nth meeting that has started substantially late or not started at all this week
[16:41:35] I'll take a look into map then
[16:41:46] which one Ironholds?
[16:41:56] Ironholds, this meeting started on time.
[16:42:18] yeah, I agree with halfak. it's just that you were missed in the invitation
[16:42:18] fair. But the lack of an invite and I guess, the general problem of miscommunication or not-communication...
[16:42:24] halfak: why do you say this is not necessary? https://github.com/Ironholds/Revision-Scoring/commit/a6172c315a2eac50d636233517667983ed6f05af
[16:42:26] anyhow, join if you'd like. sorry.
[16:42:28] Ironholds, no one is in charge.
[16:42:43] "Ironholds, britishizing is not necessary when stemming."
[16:43:00] I thought it was spelling. (british vs. american)
[16:43:11] Adding some words isn't really "britishizing"
[16:43:30] ah, makes sense
[16:43:32] The title is "De-americanize list"
[16:43:39] *look of disapproval*
[16:44:30] I changed the commit message to be more accurate and merged.
[16:45:58] :-)
[16:46:15] "yid"? Do people really still use that?
[16:46:22] Emufarmers, yep!
[16:46:32] hell, we use it as an affectionate name
[16:46:37] us Spurs fans are the Yid Army for a reason
[16:46:49] What's a Spurs?
[16:47:25] Tottenham Hotspur are a football team (actual football, not American Football)
[16:49:37] Hand Egg
[16:50:21] Gah! It was only two additional bad words.
[16:50:32] Damn. Commit message is still wrong.
[16:50:35] But less wrong.
[16:50:37] Off by one.
[16:50:58] heh
[16:51:13] there are only two hard problems in computer science; naming, and Error: index out of bounds.
[16:53:38] you forgot cache invalidation
[16:54:21] Nettrom, that's an invalid problem
[16:54:54] Well. It was, but now, we're not so sure.
[17:02:55] Ironholds, we think we won't be blocked on sqoop
[17:03:03] yep
[17:03:11] okay, cool!
[17:13:57] halfak: when a dump.xml is compressed into 2+ files, is there any guarantee that all revisions of a given page will be inside of one of the parts?
[17:14:25] or can the history of a page happen to be split into both files?
[17:15:59] Yes. Guaranteed to have complete pages.
[17:16:13] You'll notice that the size of XML files can vary wildly for that reason.
[17:16:24] Helder, ^
[17:16:39] * halfak is super stoked about Helder's research direction.
[17:16:40] :)
[17:16:54] great!
[17:17:29] halfak: BTW did you have time to look into any of my questions from yesterday?
[17:17:57] Ack! The only non-meeting work I got in yesterday was between 9 and 10 PM.
[17:18:00] * halfak scrolls back.
[17:18:15] heh
[17:19:32] OK. First thing: Should we include badwords in other languages?
[17:31:35] (got distracted) I think this will have important signal.
[17:37:09] halfak: also, depending on the language of the wiki where a badword appears, the corresponding stemmer will produce different results
[17:37:12] >>> pt.stem('nigger')
[17:37:12] 'nigg'
[17:37:13] >>> en = SnowballStemmer("english")
[17:37:13] >>> en.stem('nigger')
[17:37:13] 'nigger'
[17:37:23] Gah!
[17:37:30] Oh wait. This is OK.
[17:37:45] so, if a badword is common in more than one language, maybe it needs to be in more than one list?
[17:38:02] yes. In the short term, I think we should union language lists.
[17:38:16] In the long term, I think we should consider features that have different types of lists.
[17:38:38] e.g. racial slurs, curse words, sexuality, etc.
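A self-contained version of the stemmer comparison pasted at 17:37, assuming the interactive session was using NLTK's SnowballStemmer (the expected outputs are copied from the paste):

    from nltk.stem.snowball import SnowballStemmer

    pt = SnowballStemmer("portuguese")
    en = SnowballStemmer("english")

    print(pt.stem('nigger'))  # 'nigg'   -- per the paste above
    print(en.stem('nigger'))  # 'nigger' -- the English stemmer leaves it unchanged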
[17:41:31] "union" as in "stem1(listForLang1) union stem2(listForLang2)" or "stem1(listForLang1 union listForLang2) union stem2(listForLang1 union listForLang2)"?
[17:50:01] halfak: ^
[17:50:21] halfak, JFYI we appear to be hacking edit stuff, which is awesome but totally not my metier :)
[17:50:27] so I'mma be benchmarking python if anyone wants me
[17:50:57] Ironholds, I'm still catching up on email so that i can get to hacking :(
[17:51:10] heh!
[17:51:20] Helder, I think we'll need to use the same stemmer, however, we could have wordlist/stemmer pairs for different features.
[17:52:19] e.g. num_local_badwords(local_stemmer, local_badwords) & num_foreign_badwords(foreign_stemmer, foreign_badwords)
[17:53:49] We could even have num_all_badwords([])
[17:55:26] sum((stem(word) in badwords) for word in words for stem, badwords in stemmers_and_badwords)
[17:56:07] Where stemmers_and_badwords = [(<stemmer1>, <badwords1>), (<stemmer2>, <badwords2>)]
[17:56:10] ...
[17:57:39] for that, a given list of words (say in English) should be stemmed by many stemmers (e.g. "en" and "pt") and the results saved to separate lists (<stemmedByEn> and <stemmedByPt>)
[17:58:04] and then repeat the same procedure for a list of Portuguese words
[18:00:56] Yes.
[18:01:10] We'd need many stemmers, but they are relatively cheap.
[18:01:17] Stemming seems to be cheap too.
[18:01:23] Lookups in sets are cheap.
[18:17:55] halfak: so, since we currently have a "list of bad stems", we would need to convert these back to "badwords" and keep the badwords in the code instead of the stems (to allow applying different stemmers to each word)
[18:30:59] Helder, I made a list of bad words in english.py and just ran the stemmer on it.
[18:31:14] If you mean that we should leave the list alone, I agree.
[18:41:46] halfak: the list seems to contain stems instead of words:
[18:41:46] https://github.com/halfak/Revision-Scoring/blob/master/revscores/language/english.py#L30
[18:41:54] https://github.com/halfak/Revision-Scoring/blob/master/revscores/language/english.py#L106
[18:42:56] hm... it also contains words =/
[18:43:04] Helder, you're right about "homosexu"
[18:43:20] It mostly contains words, it seems
[18:43:21] >>> en.stem("gyppy")
[18:43:21] 'gyppi'
[18:43:43] As you can see, I run a stemmer on it when constructing the set
[18:43:52] https://github.com/halfak/Revision-Scoring/blob/master/revscores/language/english.py#L10
[18:44:09] ouch
[18:44:15] didn't see that
[18:44:23] (or I forgot about it)
[18:44:28] That's good though. We can fix minor issues and stem later.
[18:44:37] *later in the code flow.
[18:45:16] so, in principle, the "stems which are not words" in the list should be replaced by the proper words that give rise to them?
[18:47:26] +1
[18:56:49] halfak: what about badwords with a low frequency (i.e. not removed often from pages)? should we filter out the ones with fewer than N occurrences?
[18:57:23] Good Q. I'd like to see a writeup about your methods and findings before making this decision. Also, the real decider is the model fitness.
[18:57:40] If it gets us a substantially better model, I don't care if it "makes sense"
[18:58:36] Though I suspect that this work will contain better signal and it is certainly worth a try.
[18:58:52] Would you be willing to put a writeup on meta?
[18:58:55] Helder, ^
[18:59:17] On the talk page of the research page?
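A runnable sketch of the wordlist/stemmer-pairs idea from 17:51-17:58, again assuming NLTK's SnowballStemmer; the badword lists here are short illustrative stand-ins, not the real revscores lists:

    from nltk.stem.snowball import SnowballStemmer

    en = SnowballStemmer("english")
    pt = SnowballStemmer("portuguese")

    # Lists are kept as plain words and stemmed on construction, so a different
    # stemmer could be applied to the same words later (illustrative words only).
    en_badwords = ["stupid", "moron"]
    pt_badwords = ["idiota", "burro"]

    stemmers_and_badwords = [
        (en.stem, {en.stem(w) for w in en_badwords}),
        (pt.stem, {pt.stem(w) for w in pt_badwords}),
    ]

    def num_all_badwords(words):
        # The sum() sketch from 17:55: a word counts once per (stemmer, list)
        # pair whose stemmed list contains that word's stem.
        return sum(
            stem(word) in badwords
            for word in words
            for stem, badwords in stemmers_and_badwords
        )

    print(num_all_badwords(["you", "stupid", "idiota"]))  # 2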
[19:00:49] halfak: BTW: the source code of Salebot is not live anymore: http://fisheye.toolserver.org/browse/gribeco/salebot2
[19:00:50] =/
[19:01:21] I asked the author for a copy by email
[19:02:38] DarTar, can I grab you for 10 minutes at some point today?
[19:03:36] actually I can probably just PM it
[19:55:37] Helder, could it be that salebot is now in tool labs?
[19:55:59] not sure
[20:01:28] halfak: apparently it is not on Labs
[20:15:19] Yeah. I can't find it either.
[20:23:04] can anyone explain to me where I go to get 2FA set up with my labs account? I mean, I had it - on my old phone
[20:23:17] office hours with gnubeard in -office, if anyone wants...
[20:23:39] who?
[20:23:52] Ironholds: Damon
[20:23:57] oh
[20:36:22] Ironholds, do you want to join the video? everyone is after their own business but if you'd like to join, the video is set up
[20:37:36] leila, is any of it related to pageviews, LUCIDs or suchlike?
[20:39:01] not sure what everyone is doing
[20:39:15] What nuria_ and I are doing is not related to those three topics
[20:39:20] you can join and ask folks
[20:39:28] not everyone is here, b.t.w.
[20:39:49] it's a mini-hack call
[20:40:44] I'll probably just work on kicking my backlog hard in the teeth
[20:41:01] I've spent the last 4 days not doing any of the quarterly goals so we can plan for the quarterly goals. This seems bass-ackwards ;p
[20:41:20] I bet you're exaggerating here
[20:41:21] :D
[20:41:39] it's okay. just wanted to make sure you know how to join if you'd like
[20:42:45] I'm not!
[20:42:51] I mean, I've spent time thinking about the pageviews feedback
[20:43:02] but I don't think that counts because mostly I was thinking "I feel bad for not replying to any of it"
[20:43:19] you worked on WMUtils
[20:43:25] that's one thing I know
[20:43:36] anyhow, do what you see fit
[20:43:39] :-)
[20:43:41] yeah, but that wasn't a quarterly review, I just really wanted to write C++ :D
[20:43:45] *quarterly goal
[21:34:08] Ironholds, want to talk to an EFF guy about unique counting on Monday?
[21:35:03] halfak, sure
[21:35:06] it's my free day
[21:36:03] "free" as in no meetings?
[21:37:09] yeah, minus my 1:1 with Toby
[21:40:00] Sorry. I would like to invite you to a new one. :)
[21:42:27] huh?
[22:15:42] Hey Ironholds, I failed.
[22:15:46] Monday is a holiday.
[22:15:50] Moved meeting to Tuesday.