[14:20:36] mornnggiingngngngngg [15:03:41] morning [15:03:56] woa how'd it become 10:00 already? [15:04:41] drdee - changing to parse tab separated now [15:06:19] morning! [15:06:22] milimetric, q for you [15:06:30] hey, sure [15:06:42] http://hue.analytics.wikimedia.org/filebrowser/view/user/otto/webrequest_loss_by_hour/2013-02-05_14.00.00/part-r-00002?file_filter=any [15:06:46] see that last percent loss value? [15:06:50] 3.472027322540344E-4 [15:06:52] what will limn think of that? [15:06:59] looking [15:07:12] oh I think javascript parses that, lemme check [15:07:27] yep [15:07:48] so I didn't do anything to handle it, but it's handled [15:08:36] the only trick would be, when setting up a datasource with a column that has values like that, to give it a type of float instead of int [15:09:12] right, those will be floats anyway [15:09:15] since its a percentage [15:09:32] i could round them, but I'd kind of like to be able to keep small percentage values [15:09:38] 0.1% would be nice to see [15:09:51] i don't really konw why that doesn't just say 0 though [15:10:00] pig is being weird [15:10:05] that is basically calculated [15:10:17] (0.0 / 1728091.0) * 100.0 [15:10:26] apparently that == 3.472027322540344E-4 [15:12:58] yoyo [15:14:02] yooo [15:14:04] morning! [15:15:49] how many underscores will drdee_ end up with today :) [15:15:57] no clue [15:16:17] today is official LIMNIFY day [15:16:25] ottomata - Limn rounds to two places I believe [15:16:31] i am so pumped about today [15:16:45] because we can really connect some cool pieces together [15:16:46] that's cool [15:16:50] :) yeah, one more moment and I'll get the tab parsing working - been distracted with wikitech [15:17:10] milimetric we need to talk [15:17:12] drdee_, how are you dealing with multiple output files and limn? [15:17:20] exactly [15:17:22] sure - what's up drdee_ [15:17:23] ROLLUPS [15:17:24] oh [15:17:30] LIMNIFY and ROLLUP day [15:17:31] is today [15:17:42] heheh [15:17:49] So I'd love to do everything in Limn but maybe that's not good for rollups yet [15:17:51] well, i betcha that bit could be in oozie [15:17:52] yeah [15:18:05] that shouldn't be hard, there's got to be a way to chain a cat command [15:18:09] ORRRR [15:18:11] remember you don't have to actually rollup [15:18:15] because Limn's doing that [15:18:22] but you need a single URL, right? [15:18:24] for limn? [15:18:25] yeah [15:18:27] there are 2 issues [15:18:35] right now pig generates files in intervals [15:18:35] http://hue.analytics.wikimedia.org/filebrowser/view/wmf/public/mobile [15:18:37] for example [15:18:40] 1) you need to merge separate files [15:18:52] its just that limn doesn't care about sorting? [15:18:54] it will sort? [15:18:54] 2) rollsup for counts by day / week / month / quarter / year [15:18:59] oh [15:19:02] that kind of rollup [15:19:04] yeah [15:19:10] i think that should be done in kraken [15:19:10] what I mean is if CA, Android, Samsung, IGH 7000, 5 is in two files, Limn can add those to 10 or Kraken can add them to 10 [15:19:21] I'd think Kraken adding them to 10 would be more useful [15:19:27] dunno [15:19:53] the files are so small that the overhead gets quit significant to fire kraken up [15:20:00] and [15:20:03] definition of an OLAP cube: 2) rollsup for counts by day / week / month / quarter / year [15:20:20] milimetric: that was not a coincide :) [15:20:22] one second, we're being crazy [15:20:38] there HAS to be an OLAP cube generation technique for Hadoopy stuff [15:20:39] * drdee_ loves CRAZY [15:21:15] http://www.slideshare.net/Hadoop_Summit/low-latancy-olap-with-hadoop-13386744 [15:21:39] ok, you guys go investigate, I'm gonna go to the bathroom, then crank out this tab delimited change, then help investigate [15:21:43] we shall have cubes soon! [15:21:53] that would be hive [15:39:23] maybe hive would be the way to go with something like that [15:39:33] then the rollups would be easily expressed as queries, right? [15:39:54] that's sounds like a good idea [15:41:59] average_drifter, will you try to log into analytics1001.wikimedia.org ? [15:42:51] we are mixing two approaches here [15:43:15] approach mixer [15:43:27] either we run pre-defined scripts with 'most-wanted' metrics and we only do rollups by time on the frontend [15:43:28] or [15:43:34] dis is dee new approach mix, its da hottest jammin out daaa [15:43:49] we expose the raw data in a cube and let people query it [15:44:13] but having pre-aggregated metrics and then do querying over them….mmmmmmmm not shooo [15:46:37] bacon/beercan mixed approaches coming atcha! [15:46:37] http://www.baconorbeercan.com/ [15:47:31] ROFLOL [15:49:08] regardless, we need to merge the 15 minute interval datasets into 1 day datasets and i think we should have a generic oozie job to do this because this is a recurring usecase [15:49:47] 1 day? won't limn need a single file [15:49:49] I agree [15:49:50] though [15:50:03] shouldn't we just concat all of the existing data into a single file? [15:50:06] we can overwrite the file each time [15:50:16] as long as the data is small enough, this shouldn't take more than a few seconds [15:50:24] but yeah, i think that should be in oozie too [15:50:27] however that works [15:50:46] that's what i am saying ): [15:50:50] could pig do it? [15:50:53] at the end of its run? [15:50:56] after STORE data [15:51:08] reload the same data, sort, and STORE again, but only using parallel 1 [15:51:10] 1 reducer [15:51:14] that way there's only one file? [15:51:28] it would be more elegant to do it in oozie for sure [15:51:29] better to make a separate pig script [15:51:32] yeah [15:51:33] probably [15:51:38] yeah then oozie could just chain that real easy [15:51:39] and do that [15:51:42] yup [15:52:02] that would be simpler for us right now, but it would be more elegant if we could figure out how to just do hadoop streaming or hadoop fs commands via oozie [15:52:27] i will spend 1 hour to see if that's possible [15:52:33] else i will go the way of the ig [16:02:12] hey ottomata, wanna join us in our hangout to help with a proxy password problem? [16:02:22] https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc [16:35:22] http://code.google.com/p/js-hypercube/ is awesome drdee_ [16:35:38] yo welcome [16:35:54] I think this is something great to shoot for [16:36:18] either MySQL or PIG outputs tuples like this: (time, facts, metrics) [16:36:39] aggregated at whatever level keeps the file size below X [16:36:54] where X is the largest dataset that js-hypercube can handle [16:37:10] and then I'll make us a cube interface :) [16:37:41] maybe if you're nice I'll even make it multi-player so two people can be playing with the same cube [16:37:47] hehe [16:37:53] milimetric [16:38:16] :) [16:38:17] what about multiple columns in the pig output [16:38:28] and then specifying which time, facts, metrics you are interseted in [16:38:30] or i guess [16:38:39] time, fact, metrict1, metrict2, metric3 [16:39:06] you would always use time, fact, but the selection of which metric value to graph would be selectable [16:39:08] possible? [16:55:52] ottomata: I think that's definitely possible, it's a subset of what I was envisioning. js-hypercube lets you slice the data holding zero or more facts constant and rolling up one or more metrics [16:56:26] btw guys - I processed the reportcard in 8 minutes this month. New Limn makes me very happy :) [16:56:58] ottomata, cat on hadoop is possible but it's just ugly and hacky [16:57:07] better to have a simple pig script [16:57:17] that just reads the data and have a single reducer to output the dat [16:57:18] a [17:01:35] a pig script it is [17:01:55] if you don't specify the input schema it will just read all the fields [17:02:04] so this script can be super easily reused [17:02:13] you just parametrize the input and output folders [17:02:14] that's all [17:02:47] this is the script [17:02:49] SET default_parallel 1; [17:02:50] LOG_FIELDS = LOAD '$input' USING PigStorage('\t'); [17:02:50] STORE LOG_FIELDS into '$output'; [17:04:27] hmm, ok cool! [17:04:29] yeah that is easy [17:12:08] the only downside of doing this is that oozie will have to delete that file before it's running and so every hour or so there is a small gap that the file is not available [17:12:23] hmmmmm [17:12:24] so probably we need to do some renaming stuff [17:12:27] yeah [17:12:30] output do tmp file [17:12:36] delete and move when finished [17:12:43] pig can maybe do that? [17:12:44] hmm [17:12:58] job succeeded http://hue.analytics.wikimedia.org/oozie/list_oozie_workflow/0002941-121220185229624-oozie-oozi-W/ [17:13:04] no that's hadoop shell [17:13:18] lemme check [17:13:23] you can run hadoop shell commands in grunt [17:13:24] so mabye [17:13:31] okay so there is now a concatenated file in /wmf/public/aggregated/mobile/ [17:13:39] oh it works in grunt great [17:13:46] cool! [17:14:10] oh totally [17:14:12] yeah its easy peasy [17:14:21] what command did you use? [17:14:26] mv f3 f4; [17:14:29] ls [17:14:31] ls f4 [17:14:33] all works [17:14:37] even in .pig file [17:14:39] that works too [17:14:42] k [17:20:23] milimetric can you verify that the concatenated file works in limn? [17:20:42] sure. link? [17:22:13] the one that otto gave but then use /wmf/public/aggregated/mobile/result.tsv [17:22:47] http://analytics1001.wikimedia.org:81/wmf/public/aggregated/mobile/result.tsv/part-m-00000 [17:23:13] just the result.tsv doesn't work [17:26:23] ohh duhh [17:31:52] drdee_ - it works but it's slow. I mean it's like lightning fast considering it's 20MB. It still takes like 20 seconds or so [17:32:07] yes i am doing a rollup now in pig [17:32:36] this did help me track this bug that makes the colors not show up sometimes [17:32:58] it happens when the data loads after the map [17:34:20] it's 12 seconds actually, and never mind - the colors show up. What a weird bug [17:51:34] drdee_: http://analytics1001.wikimedia.org:81/wmf/public/aggregated/mobile/result.tsv/part-m-00000 got deleted [17:51:41] yes i am fixing it [17:51:43] hold on [17:51:45] oh ok [17:55:06] milimetric: new path: / wmf/public/aggregated/mobile/ [17:55:09] and 200kb file [17:56:41] what's the limn link that visualizes that data? [17:58:26] milimetric ^^ [17:58:51] it's not deployed [17:58:55] I'm just doing it locally [17:59:10] the deployer broke because it depends on a repository that david maintains [17:59:14] and he [17:59:22] 's killed some dependency or something [17:59:42] can you get it deployed within the next 2 hours? [17:59:49] oh yea [17:59:58] i was trying to work out some bugs [18:00:57] hangouts down for you guys? [18:01:00] skype? [18:01:09] i got product meeting [18:01:14] so i am skipping [18:01:16] but you know what i did [18:01:24] oozily mobile page view job [18:01:33] plus rolling up by day [18:02:42] cool [18:28:16] erosen, ottomata, you have access to the VUMI labs instances [18:28:21] yay! [18:29:59] yaaa [18:30:28] ottomata1: where did you send the udp2log files? [18:30:42] s/send/put/ [18:32:46] let's seee [18:32:58] what's the instance name again? [18:32:58] vumi? [18:33:05] ssh vumi.pmtpa.wmflabs [18:33:06] vumi and vumi-metrics [18:33:10] aah [18:33:41] udp2log 4023 0.0 0.0 21704 1216 ? Ss Jan09 8:24 /usr/bin/udp2log --config-file=/etc/udp2log --daemon -p 5678 --recv-queue=16384 [18:33:53] erosen [18:33:56] vumi-metrics [18:34:00] /var/log/vumi/metrics.log [18:34:05] ya [18:34:08] just found it [18:34:09] thanks [18:54:05] ottomata: the vumi log doesn't seem to reflect new searches [18:54:16] want to check it out before the meeting, in case it is a problem? [18:54:57] send gchats to wikipedaivumitest and wikipediavumi (this is the account through which is sends the data) [18:58:47] hmk [19:01:06] erosen, can you hit it? [19:01:38] just search automaton [19:01:50] or first automata and then disambiguated to automata [19:01:52] do it again [19:02:00] Hmmm [19:02:07] welp i'm following the ok [19:02:09] hmm [19:02:12] done [19:02:13] following his instructions [19:02:13] hm [19:02:16] i see it in the other logs [19:02:17] checking [19:02:18] yeah [19:02:21] i noticed that too [19:02:33] this one is up to date: /var/log/vumi/wikipedia_worker_0.log [19:03:43] what is the meeting number? [19:04:20] erosen: what is the meeting number? [19:04:20] no hangout? [19:04:23] webex [19:04:36] drdee [19:04:36] 1-877-668-4493 [19:04:41] that's the number [19:04:44] Meeting Number: 802 485 174 [19:04:44] US Toll-free: 1-877-668-4493 [19:04:44] South Africa Toll-free: 0800-99-9610 [19:04:48] but you need a meeting number [19:04:52] hm [19:04:56] oh wait [19:04:57] duhh [19:05:00] it's the 1st [19:05:39] i'm proposing a hangout [19:05:42] i think it will be better [19:05:55] i think i'm in [19:06:03] yup [19:06:10] your both on dee phone [19:27:32] zehn and silver [19:38:32] erosen, can you hit that vumi search again? [19:38:41] yeah [19:38:48] it is really easy to set up yourself, though btw [19:38:56] i keep trying to do it, but it says XMPP error [19:39:02] PSSHHH [19:39:06] through chat? [19:39:07] ok, its not sending traffic anymore [19:39:09] yeah [19:39:11] gchat [19:39:12] maybe its adium [19:39:18] OHG [19:39:18] yes it is [19:39:19] i see it [19:39:21] it just takes a sec [19:39:24] nice [19:39:27] ok, its almost certainly udp2log buffering then [19:39:44] i have some xmpp scripts around, so I could send a bunch of messages if necessary [19:39:55] ok cool, lemme turn udp2log back on and see [19:39:59] k [19:42:22] do it again [19:42:58] keep going [19:43:27] grr [19:44:09] again [19:46:28] pssshhhhhHHH [20:01:49] erosen, gimme another hit! [20:02:38] donezo [20:02:43] andrew jackson [20:04:28] agh, gimme another erosen [20:04:51] oh wait [20:04:56] File "/usr/lib/python2.7/dist-packages/twisted/internet/udp.py", line 195, in connect [20:04:56] self.socket.connect((host, port)) [20:04:56] File "/usr/lib/python2.7/socket.py", line 224, in meth [20:04:56] return getattr(self._sock,name)(*args) [20:04:56] socket.error: [Errno 13] Permission denied [20:04:57] dr dee comin atcha [20:05:08] hmm [20:05:12] so maybe logs aren't getting sent? [20:05:28] is there anymore to the traceback? [20:06:16] https://gist.github.com/ottomata/4717212 [20:06:43] gimme a hit again [20:07:11] wait, it is listening on udp port 5678??? [20:07:23] i thought it was supposed to send [20:07:57] ... [20:09:00] blinkenlights? [20:09:00] haha [20:10:02] BLINKENLICHTEN? [20:16:30] ottomata: i thin your e-mail is confusing: "supposed to listen on port 5678, or listen on port 5678" [20:17:53] oops [20:18:38] thanks [20:19:03] np [20:21:50] drdee_, should we start putting oozie workflows and coordinator .xml and properties files in kraken repo? [20:21:55] hue is cool and all [20:22:00] but it doesn't yet do everything we need [20:22:07] i've been editing workflows and coordinators manually [20:22:11] and then copying them into hdfs [20:24:29] also, can you check in your pig script to concat files? [20:24:48] ah, I found it, I can do that [20:25:24] oh you are doing more than concating with it [20:25:26] you are rolling it! [20:40:03] ottomata: drdee_, should we start putting oozie workflows and coordinator .xml and properties files in kraken repo? [20:40:20] workflows and coordinators yes, property files no [20:40:38] i am rolling it as well [20:40:51] was needed for demo [20:41:02] did you puhs? [20:42:02] no, i'm creating a new one [20:42:06] that also moves the old dir around [20:42:10] renames and stuff [20:42:13] and also sorts [20:42:14] just in case [20:42:26] DATA = LOAD '$input'; [20:42:27] DATA = ORDER DATA BY *; [20:42:27] STORE DATA into '$output.tmp'; [20:42:27] rm '$output' [20:42:27] mv '$output.tmp' '$output' [20:42:33] making sure it works [20:42:45] re oozie .xml [20:42:50] can I create an oozie/ dir in craken [20:42:52] kraken [20:42:53] and save there? [20:43:14] yes [20:43:22] oh i just pushed concat.pig [20:43:25] just delete it [20:43:33] k [20:43:36] i am gonna write a Zero pig udf now [20:49:44] erosen, which repo had the python script again? [20:50:02] https://github.com/embr/squidpy/blob/master/squid/scrape-mcc-mnc.py [20:50:08] ty [20:50:30] np [20:50:35] and i checked the space thing and it seems to be working [20:51:33] 452,7,vn,Viet Nam,84,Beeline ,452-7,beeline-viet-nam [20:51:34] 452,8,vn,Viet Nam,84,EVN Telecom ,452-8,evn-telecom-viet-nam [20:51:35] 452,1,vn,Viet Nam,84,Mobifone ,452-1,mobifone-viet-nam [20:51:36] 452,4,vn,Viet Nam,84,Viettel Mobile ,452-4,viettel-mobile-viet-nam [20:51:37] 452,5,vn,Viet Nam,84,VietnaMobile ,452-5,vietnamobile-viet-nam [20:51:41] ohh [20:51:45] it is a csv [20:51:50] I think column 2 needs to be capitalized [20:51:53] and the last column has the identifiers [20:52:15] the csv has lot [20:52:17] s of stuff in it [20:52:27] the json file is info i found useful [20:52:31] but you may want other stuff [20:53:06] i would like to have them both the same output [20:53:35] what do you mean? [20:53:49] csv and json same fields [20:53:54] gotcha [20:59:06] erosen: how do you serialize a DataFrame to json/ [20:59:15] pushing it presently [20:59:31] but fyi, [r[1].to_dict() for r in df.iterrows()] [20:59:40] for a list of dicts [21:00:23] pushed [21:00:32] doesn't capitalize the country name yet [21:00:34] awesome [21:00:34] ty [21:05:31] erosen, final request [21:05:40] sup? [21:05:54] just capitalized country [21:05:59] can you change 'Country Code' to CountryCode', 'str' to 'Str' and 'key' to 'Key'? [21:06:03] ty! [21:06:27] in particular the Country Code is important [21:08:03] sure [21:09:47] and maybe str and key can get more descriptive names? [21:11:26] yeah [21:11:33] any suggestions [21:12:33] MNC-MCC? [21:12:46]