[00:08:55] [16:54:41] * +Ironholds generates Gini versus Herfindahl values <-- I'm starting to think YuviPanda|zzz is on to something about you guys making up names
[00:09:40] Emufarmers, we should generate random names, and then statistical names
[00:09:47] and then run a Kolmogorov–Smirnov two-sample test to see if they match up
[16:55:55] * Ironholds blinks
[16:55:58] amazon just shipped me 200 diapers
[16:56:07] do their data-mining algorithms know something about me I don't?
[16:56:28] Ironholds: you have a 'send me a magic package every month' subscription?
[16:57:01] Ironholds: or have they just decided 'no, you don't need those books, as you're not going to have any time soon. These diapers, however....'
[16:57:09] I definitely do not need diapers
[18:05:45] +Ironholds standup
[18:05:50] ggellerman___, oop! Thanks :)
[18:05:55] for some reason my calendar invite didn't trigger
[18:05:56] there in 2m
[18:06:18] my haiku didn't work :(
[18:06:47] well, mostly it did ;)
[18:34:58] halfak, do you have an example of one of your "take a TSV and split it into smaller TSVs, making sure all [unique_value_column] values stay in the same file"?
[18:35:15] Ironholds, in the API talk now
[18:35:28] You're supposed to join up :P
[18:35:49] yessir
[19:17:41] halfak, do you have an example now? :D
[19:17:58] Oh yeah.
[19:18:42] yay!
[19:18:48] 1. sampling identifiers is non-trivial, but we can.
[19:19:03] 2. why do you need this? Hadoop will partition and split for you
[19:20:54] it will? Huh.
[19:20:57] Example?
[19:21:41] * Ironholds googles
[19:22:33] Ironholds, you're looking for "secondary sort"
[19:22:50] Specifically the "-partitioner" you can specify.
[19:22:59] gotcha
[19:23:14] and this'll result in multiple output files?
[19:23:26] or simply partitioned data within hive?
[19:23:55] Well, if you want multiple output files, you just tell it to use multiple reducers.
[19:24:00] :)
[19:24:45] and it'll know to determine what goes into each file based on a particular field?
[19:25:04] Exactly. Let me show you an example in streaming.
[19:25:07] * halfak digs
[19:25:09] okie-dokes
[19:25:39] http://socio-technologist.blogspot.com/2014/11/fitting-hadoop-streaming-into-my-python.html
[19:25:50] At the bottom of the ol' blog entry is a call to hadoop.
[19:26:01] In that call, I set up a org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
[19:26:14] And I tell it mapreduce.partition.keypartitioner.options='-k1,1n'
[19:26:29] Which means "partition on the first TSV field and interpret the field as a number".
[19:26:39] (not sure if the number bit matters)
[19:26:57] I also use mapred.output.key.comparator.class="org.apache.hadoop.mapred.lib.KeyFieldBasedComparator"
[19:27:02] To sort the data.
[19:27:16] For that I tell it mapreduce.partition.keycomparator.options='-k1,1n -k2,2n'
[19:27:33] "Sort on the first and second fields interpreted as numbers"
[19:27:50] gotcha
[19:27:52] thankee!
[19:28:36] :)
[19:29:05] Check out stat1002:/home/halfak/projects/intertime/run_in_hadoop.sh
[19:29:37] copied and saved for future runs :D
[19:29:46] so, ah. For the time being...I have 52GB of requestlogs sitting on stat1002.
[19:29:59] Want to session them?
[19:30:00] :)
[19:30:14] I just need to split them into chunks R can deal with
[19:30:32] all the records for each uri_path need to be together, unfortunately
[19:30:35] Oh! Then I might have a different strategy.
[19:30:37] One sec.
[19:30:41] so I was thinking unix sort by column and then ???
[19:31:24] hehehe.
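(Aside: a minimal sketch of the streaming call halfak describes above, assuming an identity mapper and reducer. Only the partitioner and comparator classes and their -k options come from the channel; the jar location, the stream.num.map.output.key.fields setting, the reducer count, and the input/output paths are placeholders, not the contents of run_in_hadoop.sh.)

```python
# Hypothetical sketch of the hadoop-streaming invocation described above, built
# as a subprocess call. Only the partitioner/comparator classes and their -k
# options come from the discussion; every path, the key-field count and the
# reducer count are placeholders.
import subprocess

cmd = [
    "hadoop", "jar", "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar",  # assumed jar location
    # Treat the first two TSV fields as the map output key (assumption)...
    "-D", "stream.num.map.output.key.fields=2",
    # ...but partition on field 1 only, read as a number, so every row sharing
    # that field goes to the same reducer (and therefore the same output file).
    "-D", "mapreduce.partition.keypartitioner.options=-k1,1n",
    # Sort within each partition on fields 1 and 2, both read as numbers.
    "-D", "mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator",
    "-D", "mapreduce.partition.keycomparator.options=-k1,1n -k2,2n",
    # One part-file per reducer; pick however many chunks you want (assumption).
    "-D", "mapreduce.job.reduces=10",
    "-partitioner", "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner",
    "-mapper", "cat",   # identity mapper: the rows are already keyed TSV
    "-reducer", "cat",  # identity reducer: we only want the grouping and sort
    "-input", "/path/to/input.tsv",
    "-output", "/path/to/output_dir",
]
subprocess.run(cmd, check=True)
```

With N reducers this yields N part-files, each holding complete groups of the partition key, which is the "multiple output files" behaviour discussed above.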
sort | uniq | shuf -n > sampled_ids.tsv
[19:31:36] I did this for the lol session dataset
[19:32:26] see stat1003:/home/halfak/projects/activity_sessions/datasets/release/clean
[19:33:13] there's a little python script there that will load a hash map of IDs and then filter the dataset using them.
[19:33:18] sample_users.py
[19:33:37] Let's say I was going to sample down the AOL dataset.
[19:34:41] user_id is the first column. bzcat aol_search.clean.tsv.bz2 | tail -n+2 | cut -f1 | sort | uniq | shuf -n 100000 > sampled_aol_ids.tsv
[19:36:01] then I'd run: bzcat aol_search.clean.tsv.bz2 | python3 sample_users.py sampled_aol_ids.tsv > sampled_aol_search.tsv
[19:39:06] cool!
[19:39:21] and 100000 is the unique var count, or the row count?
[19:39:40] Unique user count. You could find a user with a ton of events and that will inflate your row count.
[19:40:33] gotcha
[19:40:34] thanks!
[19:40:59] hmn
[19:41:39] I can't connect to stat3. Weird.
[19:41:50] Have you tried since the vlan switch?
[19:42:11] You used to be able to use stat1003.wikimedia.org, but now you have to use stat1003.eqiad.wmnet
[19:42:21] huh, good point
[19:42:22] I'll check
[19:43:10] that got it. Ta!
[19:43:19] Woot
[20:03:39] Ironholds: do you know if we're meeting now?
[20:04:32] halfak: do you have project pages for what Morten does handy?
[20:05:18] and do you know how long is the length of his collaboration (Bob is 6 months, extend-able to 1 year in our minds, for example)
[20:06:10] leila, I hope so, because I've been sat in the meeting for five minutes
[20:06:23] me, too. :D let me at least set up the giant Ironholds
[20:07:10] Ironholds: they were in another meeting and Dario has told Dan that he's gonna go get lunch
[20:07:23] so I'm announcing this meeting as canceled, and I go to my shell
[20:07:23] :D
[20:08:48] I'll stay around, Ironholds. I just saw Dario's email
[20:08:55] kk
[20:13:14] leila, I don't think that Morten has a project page for what he is working on with us specifically, but if you want to see what he is up to more generally, check out https://en.wikipedia.org/wiki/User:SuggestBot
[20:13:41] Recently, he has contributed heavily to https://meta.wikimedia.org/wiki/Research:Measuring_article_importance
[20:17:19] got it, halfak. thanks!
[20:29:38] I have officially discarded mwparserfromhell in favor of an island grammar.
[20:29:45] regexes to the rescue!
[20:35:47] Does anyone here have a copy of "Carter, Jacobi. 2010. ClueBot and Vandalism on Wikipedia"?
[20:36:14] (it was previously on http://www.acm.uiuc.edu/~carter11/ClueBot.pdf)
[20:37:01] archive.org to the rescue!
[20:37:03] Helder: https://web.archive.org/web/20120305082714/http://www.acm.uiuc.edu/~carter11/ClueBot.pdf
[20:37:23] great! :-)
[20:37:27] thanks valhallasw`cloud
[20:37:33] yw
[20:37:59] Oh wow. This is *old* cluebot
[20:38:01] rules!
[20:38:51] yeah
[20:39:11] I was curious about it
[20:40:16] We could implement it as a custom scorer ;)
[20:40:39] :-)
[20:48:46] BTW halfak: https://meta.wikimedia.org/w/index.php?title=Research_talk:Revision_scoring_as_a_service#.28Bad.29Words_as_features
[20:50:11] Yeah. I saw that. I've been thinking about the same thing.
[20:50:34] Really, we can probably just have a separate feature for a small set of critical bad words.
[20:50:55] And another feature that counts up bad-ish words.
[20:51:35] the other thing is that idea of having different sets of words (profanities, pronouns, etc)
[20:51:55] I like that one better because it is easier to apply cross-language.
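(Aside, looping back to the sampling recipe above: a guess at the shape of a sample_users.py-style filter. This is not the script in halfak's directory, just the behaviour described in the channel: load the sampled IDs into a set, then keep only the TSV rows whose first column matches. Passing the header row through is an assumption.)

```python
#!/usr/bin/env python3
# Hypothetical stand-in for the sample_users.py described above.
# Usage: bzcat aol_search.clean.tsv.bz2 | python3 sample_users.py sampled_aol_ids.tsv > sampled_aol_search.tsv
import sys


def main():
    ids_path = sys.argv[1]

    # Load the sampled identifiers (one per line) into a set for O(1) membership tests.
    with open(ids_path) as ids_file:
        sampled_ids = {line.strip() for line in ids_file if line.strip()}

    # Pass the header row through (assumption), then keep only rows whose
    # first tab-separated field is one of the sampled identifiers.
    header = sys.stdin.readline()
    sys.stdout.write(header)
    for line in sys.stdin:
        if line.split("\t", 1)[0] in sampled_ids:
            sys.stdout.write(line)


if __name__ == "__main__":
    main()
```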
[20:57:05] BUT I still think it is worthwhile to explore these word classes and which words are the most predictive.
[20:57:10] Helder, ^
[21:00:18] about your pull request (https://github.com/halfak/Revision-Scoring/pull/22/files), what other things do you imagine as possible modifiers beside "log"?
[21:00:20] halfak: ^
[21:00:53] Not sure, but I figure we'll have some.
[21:01:21] e.g. inverse, abs, etc.
[21:02:53] halfak: I was wondering if for example "added_badwords_ratio", "badwords_added", "prev_badwords", "proportion_of_badwords_added" and "proportion_of_prev_badwords" would also be considered as some kind of "modifications" of a single feature...
[21:04:22] I was thinking about that too. We could do a "ratio" modifier that takes two features. :)
[21:04:38] that makes sense, I think
[21:05:08] Do it! That sounds great. :)
[21:10:57] leila, so can I run a stats problem past you?
[21:11:12] (I warned you in standup :D)
[21:22:10] are we doing the R&D meeting this week or did the combined staff cover it?
[21:22:22] I think that combined staff covers it
[21:22:26] From my POV the Combined was kind of dominated by Lila's game of 20 questions
[21:22:27] okie!
[21:22:35] heh. That's a good point.
[21:22:54] tnegrin, what are your thoughts re. R&D staff meeting
[21:22:56] Bah
[21:23:09] I swear he can see my typing
[21:23:23] hahah
[21:23:33] If you still see the 1:30pm PST RD Staff Meeting on your calendar today, then please ignore it...I can't cancel it, but we are not having it so enjoy that hour :)
[21:23:47] Thanks ggellerman___
[22:13:29] Ironholds, http://i.imgur.com/WB6Prcw.png
[22:13:41] hah
[22:13:45] the twin cities are hella-dope
[22:14:02] :)
[22:14:12] DarTar, nuria, other people with spawn, would you have any use for, like...400 diapers?
[22:14:14] for whatever reason, my IRC client thinks that my real IRC handle is DarTar-lunch
[22:14:18] Amazon was meant to send me a work bench
[22:14:21] it sent me 400 diapers
[22:14:25] congrats!
[22:14:33] either it knows something I don't, or they fucked up and I have 400 diapers
[22:14:38] I guess I could build a really terrible boat
[22:14:40] 400?
[22:14:41] wait, what
[22:14:54] that is a lot of money Ironholds, for real
[22:14:59] sorry, that was a lie
[22:15:01] 160 diapers
[22:15:03] if they have a Frozen theme you can put them on ebay and get rich
[22:15:04] ahhhhh
[22:15:07] they shipped me 160 diapers
[22:15:19] DarTar: you are a business ninjaaaaa
[22:15:31] naw, he's a sociophysicist
[22:15:32] Ironholds: same size?
[22:15:40] there was a time when Frozen costumes were apparently the hottest investment on the planet
[22:16:08] I watched the movie with my nieces over the holiday. I was unimpressed.
[22:16:22] Ironholds: I’m happily approaching the end of potty training with the little one, so I’ll decline the offer, but thanks for thinking of us
[22:16:35] no problemo!
[22:16:35] Ironholds, I vote boat
[22:16:46] halfak: I haven’t seen the movie, but I guess I don’t need to at this stage
[22:16:48] nuria, size 4, apparently
[22:16:49] 22-37lbs
[22:16:55] I don't know if that's baby size or the poop capacity
[22:17:02] halfak, ooh! Hot air balloon!
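(Aside on the modifier thread from earlier in the hour: a minimal illustration of the idea that "log", "inverse", "abs" and a two-feature "ratio" can all be thin wrappers around existing feature functions. None of the names below come from the Revision-Scoring codebase; they are invented for the sketch.)

```python
# Hypothetical illustration only -- not the Revision-Scoring code from the PR.
# A "feature" here is any function from a revision to a number; a "modifier"
# wraps one (or two) features and returns a new feature.
import math


def log_of(feature):
    """Modifier: natural log of a feature (plus one, so zero counts are safe)."""
    return lambda revision: math.log(feature(revision) + 1)


def ratio_of(numerator, denominator):
    """Modifier that takes two features, e.g. badwords added over words added."""
    def feature(revision):
        denom = denominator(revision)
        return numerator(revision) / denom if denom else 0.0
    return feature


# Two made-up base features for the usage example:
def badwords_added(revision):
    return revision["badwords_added"]


def words_added(revision):
    return revision["words_added"]


proportion_of_badwords_added = ratio_of(badwords_added, words_added)
print(proportion_of_badwords_added({"badwords_added": 3, "words_added": 50}))  # 0.06
```

A ratio_of(badwords_added, words_added) built this way is essentially the proportion_of_badwords_added feature Helder lists above.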
[22:17:22] Ironholds: I vote you call the nearest church to your dwellings and take them in, it is a VERY appreciated donation, given how expensive they are
[22:17:23] the diaper ship of theseus will totally secure a spot at the next burning man
[22:17:25] *holds up PUN sign* an art installation of constellations called The Big Diaper
[22:17:31] nuria, that sounds like an awesome idea!
[22:18:47] Ironholds: they will love you anywhere you go with those: http://www.roomtogrow.org/index.php?option=com_content&view=article&id=67&Itemid=54
[22:19:00] How much space does 160 diapers take up?
[22:19:35] nuria, oh cool! One of those is like 5 blocks from me. Thanks!
[22:20:05] halfak, 15x5x7 inches
[22:20:07] this is an approximate, mind
[22:20:16] I don't have a tape measure on me and so had to use my bow
[22:20:22] it's like, 1/4th of a 62-inch recurve long
[22:20:27] Ironholds, that's a smaller package than I thought.
[22:20:35] I was imagining a mountain of diapers.
[22:20:40] I wish
[22:20:45] Really?
[22:20:47] at least then I could Scrooge McDuck in them and put them to use
[22:20:57] lol
[22:21:19] (http://www.neatorama.com/wp-content/uploads/2012/04/scrooge-500x281.jpg for anyone without the cultural context)
[22:22:22] halfak: cc’d you in https://phabricator.wikimedia.org/T84923 because that sounded vaguely… related
[22:22:31] http://www.earthbanana.com/wp-content/uploads/2013/02/scrooge-mcduck-swimming-in-money.jpeg
[22:22:44] YuviPanda, saw that.
[22:22:57] Scrooge McDuck is probably my earliest memory of american culture.
[22:23:03] was on TV a lot here.
[22:23:34] DuckTales was an awesome show.
[22:24:02] yup
[22:24:03] mm
[22:24:04] it was
[22:25:08] halfak: I’m pretty sure I never remembered anyone’s names other than Scrooge McDuck and Launchpad
[22:25:39] Huey, Dewey and Louie!
[22:25:50] how do you not remember Huey, Dewey and Louie?!
[22:25:56] that's like forgetting the firemen in Camberwick Green
[22:26:00] halfak: https://en.wikipedia.org/wiki/Huey,_Dewey,_and_Louie
[22:26:08] Pugh, Pugh, Barney McGrew, Cuthbert, Dibble, Grubb
[22:27:12] Ironholds: I was 8, and did not know English in any form...
[22:27:24] and also am pretty sure that show was dubbed in Tamil when I saw it
[22:27:33] ...
[22:27:39] I desperately want to see DuckTales in Tamil
[22:28:39] I can not seem to find any proof online, sadly
[22:28:46] nuria, my friend Alice has pointed me to a local women's shelter. Perfect :). Thank you for the initial idea!
[22:41:54] Ironholds: I also just read backscroll. That’s… a lot of diapers
[22:42:31] It sounds like it isn't that much. It's sized like a breadbox.
[22:44:55] Ironholds, ballpit of diapers
[22:45:16] I'm still liking nuria's idea the best ;p
[22:45:53] While it may be what reasonable people might actually do with them, the ideas of what you *might* do with them are more fun.
[23:33:18] Ironholds: I'm back. what's the stat question? :-)
[23:33:38] leila, so, I have a dataset, let's say c(103,9,1)
[23:33:43] (simple example)
[23:33:57] they represent the number of appearances of each unique user agent for that page
[23:34:12] I want to extract the user agents responsible for the concentration/inequality going above [threshold].
[23:34:35] So, easy, right? sort by value, keep removing from the top and recalculating until the value goes below [threshold]
[23:34:39] can you say what are the three components in c(...)
[23:34:50] table(user_agents), say ;p
[23:35:17] the problem is removing values from consideration means the sequence of values aren't comparable
[23:35:28] c(9,1) actually has more concentration than c(103,9,1)
[23:35:55] wait, that's wrong. But, not much less. You get what I'm trying to say.
[23:36:23] So, what I was thinking was instead of dropping 103 and ending up with c(9,1) for the second run, repeating the value 1, 103 times. That way the actual sum of the set remains the same, even if I have knocked a value off the top
[23:36:35] so, c(103,9,1), and then for the second iteration, c(9,1,rep(1,103))
[23:36:47] does this introduce errors in the opposite direction, or?
[23:38:18] yes, it does, since your threshold will probably change
[23:38:38] that threshold is mean or some other threshold based on the data, probably
[23:38:40] let me think
[23:38:45] what is the goal?
[23:38:58] you want to find the outliers and then do what with that information?
[23:40:39] be able to remove them
[23:40:45] so, the use case is: I grab all of a page's requests
[23:41:02] I calculate the Herfindahl measure over the user agents and find that the requests are really concentrated from one UA
[23:41:08] I want to extract that UA
[23:41:27] ...except, it might be two UAs, or three, or five, and knowing when to stop extracting is...beyond my maths
[23:49:36] Ironholds, when you say UA you mean user_agent?
[23:49:46] or unique identifier?
[23:49:57] (mix of IP and user_agent, for example)
[23:52:11] the latter
[23:52:14] ah, wait
[23:52:16] the former!
[23:52:26] but user agent on its own is not unique
[23:53:43] doesn't have to be!
[23:54:19] we care only that the proportion of requests for [page] coming from one user agent is disproportionately high
[23:54:26] I agree it is a problem for exclusion, but it's not one we can resolve :(
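(Aside: a sketch of the procedure Ironholds is describing, under the assumption that the concentration measure is the plain Herfindahl index (sum of squared proportions) and that the stopping threshold is an arbitrary constant. The "replace the removed UA's requests with rep(1, n)" idea is implemented exactly as proposed; whether that biases the measure is the open question above, not something this sketch settles.)

```python
# Hypothetical sketch of the procedure described above, not a vetted method.

def herfindahl(counts):
    """Herfindahl index: sum of squared proportions of the total."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


def extract_concentrated(ua_counts, threshold=0.5):
    """Peel user agents off the top until the index drops below the threshold.

    ua_counts is a dict of user agent -> request count (i.e. table(user_agents));
    the 0.5 threshold is an arbitrary placeholder.
    """
    # Largest user agents first.
    remaining = sorted(ua_counts.items(), key=lambda kv: kv[1], reverse=True)
    values = [count for _, count in remaining]
    flagged = []
    while remaining and herfindahl(values) > threshold:
        ua, count = remaining.pop(0)
        values.pop(0)
        flagged.append(ua)
        # Ironholds' proposal: keep the total constant by replacing the removed
        # user agent's requests with that many singleton requests, rep(1, count).
        values.extend([1] * count)
    return flagged


# The c(103, 9, 1) example from the chat: only the 103-request UA gets flagged.
print(extract_concentrated({"UA-a": 103, "UA-b": 9, "UA-c": 1}))  # ['UA-a']
```

On the c(103, 9, 1) example this flags only the 103-request user agent, since the index drops from roughly 0.84 to about 0.01 once its requests are spread out as singletons.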