[00:09:40] o/ Emufarmers. Planning to attend the Lyon Hackathon?
[00:11:12] Probably not
[00:11:17] I've gone to three conferences in the last year
[00:11:26] two of them with non-trivial plane rides
[00:11:34] I'm not sure I can handle more than that
[00:16:06] Bummer.
[00:16:28] I know how it is with too much conferencing though :\
[12:26:34] I return!
[15:33:17] Good morning science people!
[15:33:28] (depending on your timezone)
[19:12:53] ewulczyn: I just got in, do you have a minute to talk?
[19:13:52] yes. I'm still at home though. I need to get on my bike and head over at 11:45.
[19:14:09] meet in the hangout?
[19:14:09] ok, no worries then, chat later in the afternoon?
[19:14:38] * DarTar waves at Ironholds
[19:14:43] wb, sir
[19:15:27] How about after backlog grooming?
[19:15:45] I have a 4pm with Oliver
[19:16:07] only other window is 2.30-3
[19:16:27] or after 5
[19:20:46] ewulczyn: feel free to move it to a time that works for you
[19:23:40] hey DarTar
[19:23:53] (sorry, was in an awesome meeting with Jon K
[20:54:06] sweet!
[20:54:09] halfak, guess what?
[20:54:32] the sampled-log-and-R and UDF-and-Hadoop counts for pageviews line up!
[20:54:40] I am [[not an actively terrible programmer|A GENIUS]]
[20:55:19] Woo! Awesome!
[20:56:40] halfak, like, the difference is...
[20:57:06] <1%
[21:01:35] DarTar, would YOU like great news? :D
[21:01:42] I love everyone having meetings, I get to spread the joy repeatedly
[21:02:03] so, you know the test to see if the new definition held water, with comparing new def UDF + hadoop to new def R + sampled logs?
[21:02:10] Just finished a run for January 2015
[21:02:14] the difference is <1% :D
[21:10:50] Ironholds: wow, that’s.. surprising
[21:10:56] DarTar, what?
[21:11:00] I'm The Best At What I Do.
[21:11:04] And What I Do Is Pageviews.
[21:11:27] well, it’s more about the nature of the data (sampled vs unsampled), not so much the implementation
[21:11:36] yeah, I found it a bit disconcerting too for the same reason
[21:11:39] this is aggregate data for the whole month of January?
[21:11:48] nope, granular; day-by-day for January.
[21:11:57] ok cool
[21:12:05] I'm gonna produce some quick visualisations of the different implementations, and also of the deltas between them
[21:12:13] great, let’s review this later
[21:12:18] but there's a reaaaally big gap for the legacy implementations and I want to work out what's going on there
[21:12:24] ha
[21:12:24] totally! I'll get something done before our 1:1 :)
[21:12:33] thanks dude
[21:18:35] hey halfak, Ironholds, just got this working, haven't played with it at all yet
[21:18:37] http://spark.apache.org/docs/1.2.0/quick-start.html
[21:18:40] try it on stat1002 if you wanna
[21:18:44] pyspark should work
[21:19:06] Cool. Will block some time to play tomorrow morning.
[21:19:09] Meetings today :(
[21:19:23] cool
[21:20:15] ottomata, cool! hey, you want some good news? :D
[21:20:28] that UDF we wrote? <1% difference, day-by-day, from the sampled log output + R.
[21:20:42] so either we're geniuses or I've managed to subtly break both in the same way. I prefer the null hypothesis.
[21:20:51] haha, Ironholds i saw that!
[21:20:54] you are excited, and me too!
[21:21:05] also, is spark just your cunning plan to get me to learn Scala? :D
[21:22:05] or python!
[21:22:15] https://github.com/apache/spark/tree/master/examples/src/main/python
[23:59:06] DarTar, five minute break? I need to walk around and scream at the moon
[23:59:16] actually I'll probably just walk around and grab a glass of water
[23:59:19] but that first one sounds fun too
[23:59:55] sure
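
Editor's note: the day-by-day check discussed between 20:54 and 21:20 (UDF + Hadoop counts versus sampled logs + R, agreeing to within 1%) could be sketched roughly as below. This is only an illustration, not the code used in the channel: the function name, the dictionary layout, and every number are hypothetical, and the sampled-log counts are assumed to have already been scaled up to full-traffic estimates.

```python
def daily_relative_diff(reference, candidate):
    """Return {day: |candidate - reference| / reference} for days present in both.

    Days whose reference count is zero are skipped to avoid dividing by zero.
    """
    shared_days = sorted(set(reference) & set(candidate))
    return {
        day: abs(candidate[day] - reference[day]) / reference[day]
        for day in shared_days
        if reference[day]
    }


# Made-up daily pageview counts for illustration only; not taken from the log.
hadoop_udf_counts = {"2015-01-01": 1_000_000, "2015-01-02": 1_050_000}
sampled_log_counts = {"2015-01-01": 1_004_000, "2015-01-02": 1_044_750}

diffs = daily_relative_diff(hadoop_udf_counts, sampled_log_counts)

# The day with the largest relative gap, and whether every day is under 1%.
worst_day = max(diffs, key=diffs.get)
all_under_one_percent = all(d < 0.01 for d in diffs.values())

for day, diff in diffs.items():
    print(f"{day}: {diff:.2%}")
print("worst day:", worst_day, "| all under 1%:", all_under_one_percent)
```

Comparing per day rather than on the monthly total matters here because daily over- and under-counts could otherwise cancel out in the aggregate.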