[00:06:16] Ironholds: yt?
[00:06:23] DarTar, I deny everything
[00:06:37] no, I swear it was you
[00:06:58] anyway, yes, I'm here.
[00:07:42] alright, I need to talk to Maryana 15 mins but stick around, I’ve got something interesting to discuss
[00:08:02] sure
[00:08:10] I'm up until 10pm PST tonight.
[01:55:40] Ironholds, you around?
[01:55:59] I'm looking into our 18 minute insanity.
[01:56:04] I am!
[01:56:11] Which BTW, is actually centered at 20m.
[01:56:26] So, there are a lot of different UUIDs represented here.
[01:56:50] But it's obvious which ones have WAY too many observations.
[01:57:07] yup
[01:57:09] In the previous 10 seconds, a max of 8 obs per UUID.
[01:57:23] 1200 and 1201 seconds
[01:57:28] automated as HELL
[01:57:43] In the problematic 10 seconds, there's several with over 100 obs.
[01:57:45] Yeah.
[01:57:54] So, I'm going to filter out everyone with > 8 obs.
[01:57:57] >100? I had some with >1000. Although that might've been in the entire dataset.
[01:58:28] makes sense for hand-working
[01:58:31] Indeed. And many obs per UUID is not a problem, but all in one tiny time slice is.
[01:58:42] although obviously for the writeup we'll need, like, 1.5 MADs out for shits and giggles, or something
[01:58:59] (the "for shits and giggles" must make it into the text)
[01:59:23] 1.5 MADs?
[02:00:22] random-ass example; any kind of measure'll do.
[02:00:34] But so we have "[value], which is 1.5 deviations out" or whatnot, rather than just "[value]"
[02:00:46] I may not be making much sense on account of it's 9pm and I'm handling revert detection.
[02:01:39] Ironholds, let me aid you in revert detection :)
[02:02:02] naw, I've got it!
[02:02:05] we're good
[02:02:13] kk
[02:13:28] There's another spike!
[02:15:55] ooh, where?
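
A minimal Python sketch of the filtering and reporting described in the spike discussion above. It assumes the events for one 10-second window are available as (uuid, timestamp) pairs; the >8-observations cutoff and the "[value], which is N deviations out" phrasing come from the chat, while the function names and data layout are hypothetical.

    from collections import Counter
    import statistics

    def filter_busy_uuids(events, max_obs=8):
        """Drop events from UUIDs with more than max_obs observations in the
        window; 8 obs per UUID was the ceiling seen in the preceding,
        non-problematic 10 seconds."""
        events = list(events)  # (uuid, timestamp) pairs
        counts = Counter(uuid for uuid, _ in events)
        keep = {uuid for uuid, n in counts.items() if n <= max_obs}
        return [(uuid, ts) for uuid, ts in events if uuid in keep]

    def mads_out(value, sample):
        """How many median absolute deviations value sits from the median of
        sample, so the writeup can say "[value], which is N deviations out"
        rather than just "[value]"."""
        med = statistics.median(sample)
        mad = statistics.median(abs(x - med) for x in sample)
        return abs(value - med) / mad if mad else float("inf")
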
[14:52:47] morning halfak!
[14:53:01] hey dude. How's it going?
[14:53:05] say, if we were dual-publishing would you have to be Aaron Halfauthor? How deep does this go?
[14:53:21] Had better years, but I'm excited to work on the stuffs today.
[14:53:31] wat
[14:53:35] Halfauthor?
[14:54:02] * halfak might have enough coffee to get the joke and groan.
[14:54:30] if you were one of two authors...
[14:54:44] Oh goddamn
[14:54:46] * halfak groans
[14:55:35] the worrying thing is I know exactly what that "oh goddamn" sounded like.
[14:55:39] I think I might pun at you too much.
[15:21:36] halfak, your internet is bad and it should feel bad ;p
[19:32:47] halfak, you documenting the edit dataset, or are you using my one?
[19:32:47] and, would you like me to take a stab at documenting the AOL set?
[19:32:48] You can leave the edit dataset to me.
[19:32:48] Yes please for AOL
[19:32:50] cool!
[19:38:38] YuviPanda, congratulations! it's official now. :-)
[19:38:49] leila: :D indeed \o/
[19:39:02] announcement went out and almost immediately labs died :)
[19:39:03] not my fault tho
[19:39:29] haha! totally! I saw my queries getting 504 errors and I knew what happened. :D
[19:40:17] leila: :)
[19:40:55] the machines were overcome with happiness
[19:41:06] ;p
[19:41:09] :P
[20:21:36] * Nettrom also cheers for YuviPanda, tots awesome!
[20:21:43] *totes
[20:22:02] or maybe it's "tots", my English is conflicted
[20:22:16] Nettrom: \o/ :D ty
[20:32:28] halfak: lemme know when you get a sec to talk about some hadoopy stuff
[20:32:55] Hey ottomata, I'm OoO (sort of) today.
[20:33:03] ah ok
[20:33:03] But you can pull me back with hadoopy stuff.
[20:33:05] :)
[20:33:11] will you be OoO tomorrow/
[20:33:11] ?
[20:33:21] i'm doing a little hack day with fabian tomorrow, and we are looking for projects :)
[20:33:33] i also want to talk with you about revision dump stuff a little bit
[20:49:44] ottomata, I will be out tomorrow. I'm out until Thurs. next week.
[20:49:51] ah ok
[20:49:55] well, quick then!
[20:49:59] what should fabian and I work on tomorrow?!
[20:50:04] Oh! Ha.
[20:50:09] So...
[20:50:35] I want to be able to stream a sequence of revisions from a page or group of pages -- in order -- to a python script.
[20:50:48] OR
[20:51:02] given what, start and stop rev_ids?
[20:51:09] and page id?
[20:51:24] Na. All of the revs for all pages.
[20:51:30] The python script can filter on its own.
[20:51:31] so, given a page id
[20:51:37] Ahh. yes.
[20:51:40] get all revisions in order
[20:51:47] and you want text
[20:51:48] right?
[20:51:50] revision_id, text
[20:51:52] Sort on (page_id, rev_id OR rev_timestamp)
[20:51:56] Yes.
[20:51:57] ok
[20:51:58] Text
[20:51:58] hm
[20:52:27] The other thing I am working on is replicating the diffdb dataset with some more advanced diff algorithms.
[20:52:33] well, fyi, i am working on something (if it ever finishes) that will make this data presentable in hive. so there's that. fabian wants to do fancy scalding stuff i think
[20:52:36] but that is a good start
[20:52:48] scalding?
[20:52:58] yeah, uh so many abstractions there
[20:52:59] haha
[20:53:14] scala (language) version of cascading (java framework) for doing mapreduce
[20:53:27] but, i think we can work on that
[20:53:34] oh, halfak, the other thing I want to talk to you about (this can wait though)
[20:53:40] is import format of xmldumps
[20:53:50] we are not going to use the xmldumps as the source of analysis in hadoop
[20:53:57] too inefficient and difficult to work with
[20:54:09] so, particularly, i want to talk to you about how to represent diffs, if we should at all
[20:54:13] wikihadoop gives you two revisions
[20:54:15] to work with
[20:54:32] i think i'd like to store the current rev text, along with the diff from the previous revision, maybe.
[20:54:32] not sure
[20:54:36] but, we can figure that out later
[20:55:27] Diffs depend heavily on tokenization and algorithm. I'm skeptical that storing one diff's result for future processing will be that useful.
[20:55:38] For example, i have my own diff algorithm that I'd like to use for WikiCredit.
[20:57:12] aye i was afraid you'd say that
[20:57:14] ok
[20:57:28] i don't really want to duplicate the text for everything though, but, i think it will be ok.
[20:58:52] halfak: but, if you get a stream of revisions, you should be able to generate the diffs yourself, right? without having to worry about having the two consecutive revisions in each key?
[20:59:16] yes.
[20:59:25] Stream of revisions would make so many analyses easier.
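
To illustrate the exchange just above: once revisions arrive as a stream already sorted by (page_id, rev_id), per-page consecutive diffs fall out of a simple pairwise walk, with no need to pack two revisions into each key. A rough sketch, assuming (page_id, rev_id, text) tuples and using the stdlib difflib as a stand-in for whatever diff algorithm (e.g. the WikiCredit one) would actually be used:

    import difflib
    from itertools import groupby
    from operator import itemgetter

    def pairwise_diffs(revision_stream):
        """Yield (page_id, rev_id, diff-vs-previous) for every revision,
        treating a page's first revision as a diff against empty text.
        Input: (page_id, rev_id, text) tuples sorted by (page_id, rev_id)."""
        for page_id, revs in groupby(revision_stream, key=itemgetter(0)):
            prev_text = ""
            for _, rev_id, text in revs:
                delta = "\n".join(difflib.unified_diff(
                    prev_text.splitlines(), text.splitlines(), lineterm=""))
                yield page_id, rev_id, delta
                prev_text = text

    # e.g. pairwise_diffs([(1, 10, "foo"), (1, 11, "foo bar"), (2, 7, "baz")])

Since difflib here is only a placeholder, this also reflects the point above that the resulting diffs depend entirely on which tokenizer and algorithm get plugged in.
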
[21:00:18] Seems like a hive partition on page_id and sorting would make that trivial -- assuming the infrastructure exists to deliver the data structure to hadoop.
[21:01:39] it will! :D
[21:01:52] :D
[21:01:52] partition on page_id...interesting
[21:01:55] that would be a lot of partitions
[21:01:56] maybe.
[21:02:01] Yes it would :/
[21:02:13] but ja, might work, hive does have indexes too....haven't used them though :)
[21:03:48] Depending on how indexes work, that might not help us. For example, for many of my analyses, I want to read the entire history of all pages (or a large subset)
[21:04:08] In this case, MySQL gives up on the btree index and just sorts the whole table.
[21:04:29] What I really want is to (conceptually) store the whole thing on disk in-order
[21:04:37] So that processing it in order is trivially easy.
[21:04:53] Second best is asking hadoop to sort for me on the way to the mapper/reducer.
[21:07:17] hm
[21:07:24] i think we can and should do that, i will have to investigate
[21:07:35] when converting from xml, i can sort by some value, i wonder if i can sort by two
[21:07:38] page_id, revision_id
[21:08:06] hm
[21:09:14] :D
[21:09:29] BTW, if we get this worked out, we need to find a way to open it to the world.
[21:09:37] I have people who will kill to have access to this.
[21:30:43] haha, well, opening the cluster would be hard, but toby was saying something similar
[21:30:53] we could export different dump formats, no problem
[21:30:57] if we are creating them anyway
[21:49:39] ottomata, halfak: obligatory link to https://github.com/diegoceccarelli/json-wikipedia
[21:49:55] diego is a good guy, he used to stalk the analytics channel
[21:50:47] milimetric, Ironholds: just heard from toby about the new PV data, you guys are amazing
[21:51:46] last night I found myself wondering: what do we mean exactly by “delivering the data” by Monday
[21:52:30] hm, interesting!
[21:53:15] ottomata, halfak: happy to write a line of intro if you think it’s useful
[21:53:39] how long does it take to parse that!?
[21:53:57] I don’t know, let me see if he’s around
[21:56:41] Ironholds: ping me when you're around
[21:56:53] re DarTar's comment from above ^
[21:57:05] I don't like being called amazing until *after* I've done something :)
[21:57:24] milimetric: ha ha, deal
[21:58:06] I was told you’ll try ingesting terabytes of data from a wifi connection on a plane, which did sound amazing ;)
[22:01:32] DarTar: it sounds like we're getting new dimensions added to that cube
[22:01:53] ideally I'd have some info about that tonight, I'm juggling other eggs :)
[22:05:07] milimetric: I just dropped a line to Oliver
[22:05:12] he’s technically OoO
[22:05:25] but in case he checks his mail...
[22:05:38] let me send you an existing exchange we had about the schema
[22:06:41] milimetric: in your inbox
[22:07:07] thx DarTar
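
Returning to the idea above of asking hadoop to sort on the way to the mapper/reducer: a rough reducer-side sketch, assuming a Hadoop Streaming job already configured to partition on page_id and secondary-sort on (page_id, rev_id), and assuming tab-separated page_id, rev_id, text records on stdin. The field layout and the process_page hook are hypothetical, and real revision text would need its tabs and newlines escaped upstream.

    import sys

    def read_sorted_revisions(lines):
        """Parse tab-separated (page_id, rev_id, text) records that the
        framework has already partitioned by page_id and sorted by
        (page_id, rev_id)."""
        for line in lines:
            page_id, rev_id, text = line.rstrip("\n").split("\t", 2)
            yield int(page_id), int(rev_id), text

    def process_page(page_id, history):
        # Placeholder: hand the full, in-order history to a diff engine,
        # persistence metric, etc. Here we just emit a revision count.
        print("%d\t%d" % (page_id, len(history)))

    def main():
        current_page, history = None, []
        for page_id, rev_id, text in read_sorted_revisions(sys.stdin):
            if current_page is not None and page_id != current_page:
                process_page(current_page, history)
                history = []
            current_page = page_id
            history.append((rev_id, text))
        if current_page is not None:
            process_page(current_page, history)

    if __name__ == "__main__":
        main()
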