[02:25:36] ottomata: hey, I know it's late
[02:25:44] ottomata: but can we try another deployment of packages?
[02:25:49] like tomorrow or something?
[02:37:41] ja tomorrow let's do
[14:32:41] good morning analytics!
[14:46:46] morning ottomata
[14:48:14] morning
[14:54:22] good afternoon gentlemen
[15:03:03] morning!
[15:03:06] average_drifter!
[15:03:10] how can I help you today? :)
[15:23:04] ottomata: let's have another go at deploying the packets please
[15:23:28] the packets?!
[15:23:29] ok!
[15:23:30] heh
[15:23:37] packages :) sorry
[15:23:42] so, we're just trying to get these on stat1 and run them
[15:23:44] so ooook
[15:24:09] where are they now?
[15:36:36] average_drifter ^
[15:45:24] yeah
[15:45:35] trying to finish something small, and re-make packages on build1
[15:45:37] moment
[15:46:24] drdee had the baby!!!
[15:46:58] 9:54pm last night, mother, father, and baby girl all doing great
[15:47:02] :)
[15:49:28] oops, not sure if I was supposed to break the news. Excitement got the better of me :)
[15:50:46] np, i told ottomata as well
[15:56:30] drdee: congratulations! :)
[15:56:38] TY!
[17:13:11] packages ready
[17:13:34] ottomata: please find the packages in lucid/ and precise/ on build1
[17:13:39] /home/spetrea
[17:13:51] bb in 1h
[17:14:19] ok cool
[17:49:05] morning dschoon
[17:49:13] mornin captain
[17:49:15] brb a moment
[17:50:15] reminder: no scrum today; platform engr meeting at 10a
[17:50:35] ohohhhh ok
[17:56:04] quick question, has anyone messed around with rq (http://python-rq.org/)
[18:06:39] hey guys
[18:06:49] coming to the platform hangout? https://plus.google.com/hangouts/_/37334325de6288c776406a93b3532c5eb6204ef6
[18:06:51] robla, i'm in the etherpad
[18:06:52] ahh
[18:06:54] was just about to ask for that
[18:07:00] was chatting in etherpad
[18:07:01] dschoon ^
[18:07:03] ah
[18:07:07] what's that link?
[18:08:03] http://etherpad.wmflabs.org/pad/p/PlatformEng-Meeting-2012-11
[18:17:36] erosen: haven't used those.
I've used the queue structure from redis, which provided a very good semi-poor-man's job queue :)
[18:17:44] erosen: but I hear zeromq (0mq) is big these days
[18:17:50] cool
[18:17:52] i'll check it out
[18:18:18] thanks
[18:41:22] for job queuing in python, i've used http://itybits.com/pyres/ quite a bit
[18:41:39] which is a python port of resque
[18:41:40] https://github.com/defunkt/resque#readme
[18:41:45] yeah
[18:41:49] looks pretty good
[18:42:01] but none of that (or pyrq) is intended to scale.
[18:42:04] they're not distributed
[18:42:14] rq is also supposed to be "inspired by" resque
[18:42:15] for that you need a real messaging system with brokers.
[18:42:17] yes.
[18:42:27] yeah
[18:42:34] i don't think I need it to fully scale
[18:42:36] they're simple to use and work great for a workload that can be handled by a single queue manager
[18:42:44] i really like pyres
[18:42:52] it depends on redis
[18:42:55] great interface
[18:42:57] mostly I am spawning multiprocessing pools within multiprocessing pool workers
[18:42:59] which is getting messy
[18:43:15] yes.
[18:43:17] well good to know you like pyres
[18:43:18] not a fan.
[18:43:46] i've used it with 16 workers across two machines to eat through ~20k tasks
[18:43:59] with both producers and consumers mutating things
[18:44:11] nice
[18:44:19] well maybe I'll give that a try first
[18:46:47] (that was when i wrote a spider, so the tasks were high variance and high latency (seconds))
[19:22:38] average_drifter, where's the new udp-filters .deb?
[19:22:42] i only see orig.tar.gz
[19:26:37] ottomata: have you messed around with pig streaming + python?
[19:27:06] hmmm, i think I did once, yeah
[19:27:22] have you any experience making 3rd party modules available?
[19:27:29] from what I can tell there are a few options
[19:27:36] either I use python and compile it into a jar
[19:27:42] install the module on every node (not a good option)
[19:28:11] or somehow tar the dependencies
[19:28:20] and unpack them yourself in the python script
[19:28:37] erosen: i wouldn't go down that rabbit hole right now
[19:28:45] yeah
[19:28:46] stick with plain Pig + PiggyBank
[19:28:57] we'll work out a solution for everybody regarding python
[19:29:09] because we totally know that's everyone's preferred language for data processing
[19:29:18] but as you say, it's not straightforward
[19:29:21] it's only because I have the code already written in python
[19:29:24] i know
[19:29:28] cool
[19:29:37] well i'll see if I can get around it
[19:29:39] but a lot of that is handling the grunt work that mapred does for you
[19:30:03] try to break the code into small, individual transforms on a piece of data, and write those as pig functions
[19:30:03] yeah
[19:30:08] cool
[19:30:13] (i think they're called "macros"?)
[19:30:46] hmmm, i dunno, i got the pig streaming python stuff working once
[19:30:48] i forget. but anyway, they're modular and reusable. as we build up a library of those, data processing in pig becomes less about duplicating labor
[19:30:50] it wasn't that hard I think
[19:30:54] really?
[19:30:57] welp. sweet.
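[Editor's note: for readers following the pig streaming discussion above — Pig's STREAM operator pipes each tuple to an external process over stdin/stdout as tab-separated text. Below is a minimal hypothetical sketch; the script name, field layout, and the `/wiki/` filter are invented for illustration and are not the actual code from this log.]

```python
#!/usr/bin/env python
# Hypothetical Pig streaming script. Pig sends each tuple as one
# tab-separated line on stdin; whatever we print to stdout becomes
# the output tuples. The Pig side would look roughly like:
#
#   DEFINE my_filter `filter.py` SHIP('filter.py');
#   out = STREAM logs THROUGH my_filter AS (ip:chararray, url:chararray);
import sys

def transform(line):
    """Keep only rows whose second (url) column contains '/wiki/'."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2 and "/wiki/" in fields[1]:
        return "\t".join(fields[:2])
    return None  # dropped row: emit nothing

def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        out = transform(line)
        if out is not None:
            stdout.write(out + "\n")

if __name__ == "__main__":
    main()
```

Because the script only speaks stdin/stdout, it can be tested locally with a pipe (`cat sample.tsv | ./filter.py`) before shipping it to the cluster; third-party imports are still the hard part, as discussed above.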
[19:30:58] heh
[19:31:09] you can just ignore me if ottomata has magically solved this already
[19:31:14] well, I did once
[19:31:17] i agree that we should build a good library of pig transforms
[19:31:28] but I was just trying this as a quick solution
[19:31:29] when I was trying to get the geocoding stuff to work, i was having trouble with the pig udf for some reason
[19:31:32] so I tried python
[19:31:45] and I got the streaming bit to work (never solved my orig problem that way though)
[19:32:09] i'm usually the last person to say this, but
[19:32:10] were you importing pygeoip
[19:32:11] or whatever
[19:32:24] when it comes to mapred stuff, it's good to stick to pig/hive due largely to performance
[19:32:31] we have finite compute and lots of jobs to run
[19:32:42] it *will* matter, sadly, that we try to squeeze stuff for perf
[19:32:48] true
[19:33:01] though as you well know, it sounds a bit like premature optimization
[19:33:08] but point taken
[19:33:12] pig is faster than python
[19:33:25] i mean, it's not really *optimization*
[19:33:33] it's merely not *deoptimizing* things :)
[19:33:38] hehe
[19:34:27] yeah, dunno if I can help you here though erosen
[19:34:38] i don't remember what I did, I think mine was simple enough to not have any real deps… not sure though
[19:34:46] cool
[19:34:53] well I'll figure something out
[19:35:01] another reason to stick to pig
[19:35:03] erosen: /q erosen
[19:35:05] we want to focus our energy at first
[19:35:07] lol
[19:36:40] 21:27 < erosen> either I use python and compile it into a jar
[19:36:41] 21:27 < erosen> install the module on every node (not a good option)
[19:36:44] 21:28 < erosen> or somehow tar the dependencies
[19:36:46] why not make a .deb package for it?
[19:37:02] because we should be avoiding python and stick to pig :)
[19:37:07] rather than us each writing 10% of a solution for each way to process data
[19:37:22] because that would require installing the deb everywhere, and making the deb is complicated to ship via pig
[19:37:30] i see the point about efficiency, but I do think this might be one of those developer time vs computation time tradeoffs
[19:37:47] totally
[19:37:51] i just don't think we're there yet.
[19:37:56] fair
[19:37:58] does pig know about job queues or redis?
[19:38:02] and i'd like to stick to the basics rather than getting exotic at first.
[19:38:08] average_drifter: why do you ask?
[19:39:11] 21:30 < dschoon> try to break the code into small, individual transforms on a piece of data, and write those as pig functions
[19:39:25] it seems that something else will do the job queue stuff
[19:39:30] * average_drifter isn't sure
[19:39:43] we're talking about this in the context of a map-reduce job.
[19:39:49] you familiar with the paradigm?
[19:40:45] yes
[19:40:48] a "job" is a pair of tasks (map, reduce). the input is expected to be enormous -- often terabytes of data. you break it up into chunks and feed them in parallel to copies of the mapper running on many, many machines.
[19:40:49] right?
[19:40:56] so pig is a DSL for writing those tasks.
[19:40:59] no queue.
[19:41:05] not in the traditional sense.
[19:41:21] (the reducers then aggregate the map results and format the output)
[19:41:26] in this particular case, does all processing sit on the same machine?
[19:41:59] never.
[19:42:09] the mappers are run in parallel on the cluster
[19:42:12] as are the reducers.
[19:42:23] This is the whole point of Hadoop.
[19:42:39] Parallel processing that preserves data-locality.
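[Editor's note: to make the paradigm just described concrete, here is a toy single-machine word-count sketch of the (map, reduce) pair. On Hadoop the same two functions would run with the chunks stored on HDFS and the map calls spread across many machines; everything here is local and invented for illustration.]

```python
from collections import defaultdict

def map_phase(chunk):
    """Mapper: emit a (word, 1) pair for every word in one input chunk."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reducer: aggregate the mappers' (key, value) pairs by key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# The input, pre-split into chunks; Hadoop would do this split for you
# and run map_phase on each chunk in parallel across the cluster.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
result = reduce_phase(mapped)
# e.g. result["the"] == 3 and result["fox"] == 2
```

Pig Latin is a DSL that compiles down to exactly this shape of job, which is why, as noted above, there is no job queue in the traditional sense.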
[19:43:59] http://research.google.com/archive/mapreduce.html
[19:44:13] the original paper is surprisingly short and readable
[19:57:35] average_drifter, i'm having trouble getting the udp2log stream on stat1, so i'm going to run this test on analytics26
[19:57:40] i've already got it running there
[19:57:41] so.
[19:58:22] i'm supposed to use udp-filter instead of filter, right?
[20:02:46] average_drifter ^
[20:26:46] where is the udp-filter package?
[20:42:58] dschoon
[20:43:05] ottomata
[20:43:13] now consuming any topic that starts with "event"
[20:43:19] from kafka into hadoop :)
[20:43:36] /user/otto/event/logs/$topic
[20:44:36] https://github.com/wmf-analytics/kraken/blob/master/bin/kafka-hadoop-event-consume.sh
[20:44:37] woo
[20:44:48] zookeeper-client -server analytics1023.eqiad.wmnet:2181 ls /brokers/topics
[20:44:54] cool, i'll check it out
[20:44:54] will get you all kafka topics from zookeeper
[20:44:56] ahh, nice!
[20:45:00] beautiful!
[20:45:26] it's not the most elegant thing but it will do for now!
[21:18:01] average_drifter, where is the new udp-filter package?
[21:27:35] sl
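[Editor's footnote on the "multiprocessing pools within multiprocessing pool workers" mess mentioned at 18:42: short of adopting a real job queue like pyres or rq, the usual fix is to flatten the nested work into one task list fed to a single pool. A stdlib-only sketch with an invented task shape — shown with threads for brevity; the same shape works with multiprocessing.Pool when the work is CPU-bound.]

```python
from concurrent.futures import ThreadPoolExecutor

def process(task):
    """Placeholder for the real per-task work."""
    outer, inner = task
    return outer * inner

def run(outer_jobs, inner_jobs, workers=4):
    # Instead of each outer worker spawning its own inner pool,
    # enumerate all (outer, inner) pairs up front: one flat list,
    # one pool, no nesting.
    tasks = [(o, i) for o in outer_jobs for i in inner_jobs]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, tasks))

results = run([1, 2, 3], [10, 20])
# six independent tasks handled by one flat pool
```

With a single flat pool, moving to pyres/rq later is a small step: the task tuples become queued jobs and the pool workers become queue consumers.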