[13:39:45] morning guys
[13:40:06] Good afternoon. :)
[15:09:27] Diederik, y u work so much on weekend?
[15:09:37] good morning and happy mlk day to you!
[15:09:40] drdee
[15:09:41] !
[15:09:50] gooood morning
[16:33:09] yo ottomata, once the blog data is flowing into hdfs, can you let me know?
[16:34:00] yup, can do
[16:34:28] aight
[18:00:41] hmmmmm
[18:00:49] possible wmf network just died?
[18:00:50] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[18:32:59] oh drdee, real quick, this is nit-picky
[18:33:14] but when you are talking about 'sharding' the webrequest firehose
[18:33:14] aight
[18:33:21] you should probably use the term 'partition'
[18:33:32] yes you are right
[18:33:52] sharding usually refers to horizontal partitioning, and partitioning to vertical (uh, right?)
[18:34:03] thx
[18:48:12] ok now it is for sure time for lunch, drdee, kafka is producing the blog udp2log stream now
[18:48:18] it should be consumed hourly in the exact same way as before
[18:48:26] WOOOOT WOOOOOOT
[18:48:32] do you think I should archive all of the past files?
[18:48:42] so we don't look at them?
[18:48:51] yeah good idea
[18:49:38] k
[19:39:09] hokey dokey drdee, all archived and starting anew
[19:39:09] http://hue.analytics.wikimedia.org/filebrowser/view/wmf/raw/webrequest/webrequest-blog
[19:39:16] looks like the 1 import in there right now has all seq #s
[19:39:24] niiiiiiice
[19:39:24] from my eyeball of it
[19:40:27] i see gaps
[19:42:34] or is the sequence so random that you should do a cat | cut | sort ?
[19:44:27] yes and there are two files
[19:44:33] so you have to dl them both and do that
[19:44:44] cat * | awk '{print $3}' | sort -n
[19:44:48] cat * | awk '{print $3}' | sort -n | less
[19:46:20] duuhhhhh
[19:47:42] there are some very minor gaps
[19:48:48] i copied a range of 100 seq numbers
[19:48:52] and i got 95 lines
[19:49:14] so that's a 5% loss
[19:49:26] however, that could be the same as we always see on locke, emery etc
[19:51:05] ottomata ^^
[19:56:34] really hmmmmmmm
[19:56:39] (sorry was doing dishes)
[19:56:41] hmmm
[19:57:06] let's wait for one more import, i'm not certain that all of the previous hour was from the new stream
[19:57:18] should be one coming up in a few minutes
[20:05:23] oh it runs at :30, so uhhh, in 25 mins
[20:16:17] aight
[20:45:23] GROWL
[20:45:27] drdee you are right
[20:45:39] buuuuut
[20:45:52] we always have a 4-5% packet loss
[20:46:02] oh right, hm yeah but, hm, well
[20:46:04] if that were true
[20:46:12] if I saved an hour's worth of sampled data to disk
[20:46:14] we'd see the same thing
[20:46:15] right?
[20:46:40] yes
[20:46:46] i am looking for the ganglia graph
[20:46:58] ok
[20:47:01] i'm starting that now
[20:47:02] saving data to a file
[20:47:03] to compare
[20:47:48] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=emery.wikimedia.org&v=3.40132434783&m=packet_loss_average&r=hour&z=default&jr=&js=&st=1358801206&vl=%25&z=large
[20:48:11] on emery, for example, we always have 3 to 4% loss
[20:48:25] ottomata: limnify pull request merged, fyi
[20:48:32] that would be genuine packet loss that happens in the network
[20:48:32] cool!
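The gap check discussed above (pulling the sequence number out with `awk '{print $3}' | sort -n` and counting missing entries) can be done end to end in a few lines. Below is a rough sketch, not taken from the log: it assumes the sequence number is the third whitespace-separated field of each udp2log line, as the awk one-liner does, and that the files come from a single counter; the script name and function are hypothetical.

```python
#!/usr/bin/env python
"""seq_loss.py -- estimate udp2log packet loss from sequence-number gaps.

Sketch only: assumes field 3 of each line is a monotonically increasing
sequence number from one counter, matching the awk one-liner in the log.
"""
import sys


def loss_from_seqs(lines, field=2):
    """Return (expected, seen, percent_lost) for a stream of log lines."""
    seqs = sorted(int(line.split()[field]) for line in lines if line.strip())
    if not seqs:
        return 0, 0, 0.0
    expected = seqs[-1] - seqs[0] + 1   # how many lines the seq range should cover
    seen = len(set(seqs))               # distinct sequence numbers actually present
    return expected, seen, 100.0 * (expected - seen) / expected


if __name__ == '__main__':
    expected, seen, pct = loss_from_seqs(sys.stdin)
    print('expected %d, saw %d, ~%.1f%% loss' % (expected, seen, pct))
```

Run as `cat * | python seq_loss.py` over both imported files; a range of 100 sequence numbers with only 95 lines present reports ~5% loss, matching the arithmetic in the conversation. If several hosts write into the same stream with independent counters, the lines would need to be grouped by host before applying this.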
[20:49:39] btw, auto build tools suck balls
[20:50:03] my configure script is broken because it claims there is a space in my build path, but that is not true :)
[20:50:12] there is no space in my build path
[21:15:03] drdee, i'm playing with limnify atm
[21:15:15] i could pretty easily make it work in a cron job
[21:15:25] making it work with oozie will take a lot more research
[21:15:29] let's do that
[21:15:43] i don't think we should do this when we get more stuff in production
[21:15:49] but I can at least make the mobile graph update every 15 minutes
[21:15:56] aight
[21:16:00] i'm just going to rewrite the datafiles and datasources every 15 minutes from all the data
[21:29:47] drdee
[21:29:49] yeehaw
[21:29:49] http://dev-reportcard.wmflabs.org/#hourly-graphs-tab
[21:29:58] that's fast!
[21:29:58] should be updating every 15 minutes
[21:30:00] wellll
[21:30:02] oh wait it's hourly
[21:30:04] every hour i think
[21:30:05] nm
[21:30:24] ok, we should also set this up for the blog traffic
[21:30:40] is this now running as the stats user?
[21:30:41] ok, this is all just in my own jobs and crontab right now
[21:30:42] no
[21:30:50] ok
[21:31:07] but now would be a good time to do that, i don't like the crontab hackiness, doing this in oozie would be much cleaner
[21:31:38] but i'd have to figure out how to package limnify so it can run as a shell job…or install it on all machines
[21:32:04] well you can ship python scripts inside a pig script
[21:32:43] right, i know it is possible
[21:32:46] same thing for oozie jobs
[21:32:57] but it's a matter of figuring out how to ship and what needs to be shipped
[21:33:04] cause it's not just the script, but the script and all its dependencies
[21:33:08] all tarred up together
[21:33:46] aaaargggghhhhhhhhhhhhh
[21:33:52] so one way of doing it
[21:34:02] is create a pip package for limnify
[21:34:16] and just use puppet to install it and distribute it
[21:35:37] but then we'd have to host our own pip repo too, right?
[21:35:47] i can script puppet to automatically install things, no problem
[21:36:11] if they are downloadable from an01 or something (I did that while I was trying to work with a flume version that doesn't come with cdh4)
[21:38:33] drdee: why not debian?
[21:38:44] drdee: we could have a debian package for limnify :D
[21:39:14] for a python package it might not be that complicated
[21:39:19] does pandas have a deb package?
[21:40:00] it seems like it does
[21:40:01] BUT
[21:40:09] you get 110MB in dependencies
[21:41:00] that's absurd
[21:41:35] you get gfortran, python2.7 and R as gifts
[21:42:16] average_drifter, i could use your help with the current debian package :)
[21:43:16] drdee: yeah sure
[21:43:28] screenshare?
[21:43:35] ok
[21:43:46] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[21:44:55] yeah pandas does
[21:52:00] but you get 250MB of extra stuff
[21:53:40] actually
[21:53:45] if we installed all deps everywhere
[21:53:49] i don't mind shipping limnpy
[21:53:59] with the oozie job
[22:17:46] but 250MB of installed crap on 20 nodes
[22:17:54] that's crazy
[23:05:10] hm, drdee it looks like my cron script works but the job doesn't complete correctly when it runs hourly
[23:05:14] only when I run it manually
[23:05:16] i'll check that out tomorrow
[23:05:47] aight
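One route discussed above is making limnify pip-installable so puppet can install it on the nodes, rather than tarring it up with its dependencies for every oozie job. A minimal sketch of that idea follows; the module layout, dependency list, and console-script entry point are assumptions for illustration, not the project's actual metadata.

```python
# setup.py -- hypothetical packaging sketch for the "pip package for limnify" idea.
from setuptools import setup

setup(
    name='limnify',
    version='0.1.0',
    description='Turn tabular data into Limn datafiles and datasources',
    py_modules=['limnify'],                  # assumes a single-module layout
    install_requires=[
        'limnpy',                            # assumed dependency
        'pandas',                            # the heavyweight dependency discussed above
    ],
    entry_points={
        'console_scripts': ['limnify = limnify:main'],  # hypothetical CLI entry point
    },
)
```

With `python setup.py sdist` producing a tarball, puppet could fetch it from an internal host (an01 was mentioned) and run `pip install` against the downloaded file on each node, so no public or self-hosted pip repo would be strictly required.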