[14:11:17] YOYO MR milimetric
[14:11:23] howdy
[14:11:24] and MR ottomata
[14:11:32] YOU WERE so RIGHT ABOUT INTELLIJ
[14:11:49] i've been using eclipse for like 7 years or something
[14:11:56] and man what a difference
[14:12:08] but the Maven thing's still broken right?
[14:12:13] yes
[14:12:16] very weird
[14:12:23] wanna do a hangout, see if we can puzzle it out?
[14:12:28] because on the cli mvn compile works
[14:12:31] sure
[14:14:41] https://plus.google.com/hangouts/_/135b63a5ba742d9bb495dc988087611a7c74f5d4
[14:34:07] yoyo ottomata
[14:34:12] gooooooood news
[14:34:20] an09 has data including wikivoyage
[14:34:49] what i suggest is:
[14:34:49] 1
[14:35:10] 1) make a backup of the entire webstats folder on locke and put it in a safe and protect it with our lives
[14:35:25] 2) only replace filter on locke with the new one you compiled
[14:35:43] 3) let's not touch collector, the one that is running is daemonized
[14:36:02] and works fine, else i have to fix the one in the time_travel branch
[14:37:27] thoughts?
[14:37:48] sounds good
[14:41:59] aight
[14:42:08] brb
[14:43:58] man, my coffee is better than any cafe
[14:44:01] MM
[14:44:02] so good
[14:44:06] i should start a cafe
[14:45:26] drdee, when you are back, I am ready to do that, i'll go ahead and compile filter on locke
[14:48:56] ok
[14:48:57] i'm ready
[14:49:31] all I have to do are a couple of mv commands, and restart udp2log
[15:24:27] ready
[15:29:50] hai ok
[15:29:55] ok so
[15:30:01] in /a/webstats,
[15:30:08] there is now a directory called
[15:30:08] source.wikivoyage.2013-01-16
[15:30:14] that is your code
[15:30:17] with the compiled binary
[15:30:22] i'm going to leave everything in place
[15:30:31] but symlink bin/filter to the filter inside of that dir
[15:30:36] and rename the old one, and leave it where it is
[15:30:43] s'ok?
[15:30:57] yes, and maybe add a README for future reference
[15:31:12] this is stuff that we are likely to forget
[15:31:18] doing that right now :)
[15:31:20] even before you said that
[15:31:21] hehe
[15:31:27] :D
[15:31:57] so, once we have flume and unsampled working, i can write a pig udf to generate these hourly counts
[15:32:07] then we can have an oozie job to rerun them
[15:32:22] and push the data directly to dumps.wikimedia.org
[15:32:28] and then we can burn this piece of code
[15:33:45] if I get flume up, do you want me to try to do full stream unsampled?
[15:33:56] i might need to take kafka down to do that, so I can use those machines
[15:34:01] anyway, wait
[15:34:04] let's talk about that in a minute
[15:34:11] i'm going to do the webstats collector thing now
[15:34:25] ok
[15:34:35] when you are ready you can join https://plus.google.com/hangouts/_/135b63a5ba742d9bb495dc988087611a7c74f5d4
[15:34:48] oh i'm restarting udp2log now
[15:35:34] k it's running
[15:36:13] logbot?
[15:36:25] ?
[15:36:30] oh
[15:36:40] i'm going to do that in ops channel too
[15:37:03] !log deployed new webstatscollector filter to collect stats on wikivoyage domains and restarted udp2log on locke
[15:37:05] Logged the message, Master
[15:43:28] ottomata, the cronjob that pushes the data to dumps.wikimedia.org runs every hour, right?
[15:44:30] uhhh
[15:45:00] or every day?
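A rough sketch of the filter swap described above, assuming the binary lives at /a/webstats/bin/filter and that udp2log is restarted via its init script; the backup name, the renamed-filter name, and the restart command are assumptions, as the exact commands are not in the log:

    # 1) back up the entire webstats folder on locke
    cp -a /a/webstats /a/webstats.backup.2013-01-16
    # 2) keep the old filter in place under a new name, then point bin/filter at the new build
    cd /a/webstats
    mv bin/filter bin/filter.pre-wikivoyage
    ln -s /a/webstats/source.wikivoyage.2013-01-16/filter bin/filter
    # 3) leave collector alone; just restart udp2log so it picks up the new filter
    /etc/init.d/udp2log restart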
[15:45:18] i do not know about such a cron job… or I have forgotten, i am looking
[15:45:28] it is not puppetized
[15:45:30] hard to find
[15:46:10] ah found it
[15:46:11] nobody's crontab
[15:46:16] /a/webstats/scripts/ship
[15:46:23] every hour
[15:46:27] no wait
[15:46:30] it is commented out
[15:46:34] # Ship aggregations to dammit.lt
[15:46:34] # 2 * * * * /a/webstats/scripts/ship
[15:46:36] oh that is dammit
[15:46:37] hmm
[15:46:41] :D
[15:47:25] i don't see a cron for this
[15:48:28] mmmm but the data is published on dumps.....
[15:48:51] haha, who knows man, maybe it's running in a screen somewhere
[15:48:58] i'm looking, not finding much
[15:51:32] the joy of locke
[15:58:09] ottomata: there can be individual cron files for each package
[15:58:37] ottomata: I remember one time I had to package something, I wrote a .cron for it. I can't remember right now how exactly that got to be run
[15:58:50] hm, i looked in all the places I know of, and also, this def is not from a package
[15:58:54] i looked in
[15:59:12] /var/spool/cron/crontab files and /etc/cron.* files
[16:00:13] ottomata: grep CRON /var/log/syslog
[16:07:23] yeah, nothing, i see plenty of stuff about other crons that I could find
[16:07:25] not the webstats ones
[16:08:13] what keywords are you searching for?
[16:08:23] data|webstats
[16:08:30] on both locke and dataset2
[16:08:41] files are copied to the /data directory
[16:08:46] on dataset2
[16:09:06] maybe try damnit and collector
[16:11:19] nothing
[16:11:21] or ask apergos
[16:11:22] btw, it is 'dammit'
[16:12:51] drdee: the mobile log files are ranging from 137mb to 139kb
[16:13:23] drdee: compared to the regular squid logs which are averaging (and always close to) 500mb
[16:14:08] drdee: https://gist.github.com/41208be304d05ba43582
[16:14:09] 139kb?
[16:14:23] drdee: yes
[16:15:29] hold on
[16:15:53] ottomata, maybe it's a logrotate?
[16:16:48] drdee, don't see anything there either
[16:17:23] k
[16:17:49] rsync?
[16:19:59] poking apergos
[16:21:12] drdee, can I stop webstats on an09?
[16:21:33] yes, but let's leave the source there until we know for sure everything works
[16:23:09] sure
[16:48:30] ottomata, hangout?
[16:49:22] about flume? lemme just work on it, let's talk about what to do once it works
[16:49:25] i'm really close
[16:51:40] cool
[17:00:21] okay the .voy counts show up on dumps.wikimedia.org
[17:00:41] yay!
[17:00:43] so i think we are all good
[17:20:04] \away
[17:31:42] mornin all
[17:31:53] morning David :)
[17:31:56] https://fbcdn-sphotos-f-a.akamaihd.net/hphotos-ak-ash4/471493_10150868620512466_780224815_o.jpg
[17:31:59] here are some ducks
[17:32:25]
[17:35:09] are these your ducks?
[17:35:44] no, just some random ducks
[17:35:51] hehe
[17:39:07] ducks!
[17:39:11] random ducks!
[17:39:14] EXACTLY WHAT I NEEDED
[17:39:17] THANK GOODNESS
[17:39:21] THANK YOU, average_1rifter!!1
[17:47:58] Analytic data is fantastic to use. :)
[17:56:28] http://commons.wikimedia.org/wiki/File:IPC_NorAmCup.pdf is a recent example where I used it.
[17:59:05] awesome!
[17:59:10] i'll check it out, purplepopple
[18:00:01] :) Love data for writing reports.
[18:03:55] you guys having trouble getting into the hangout again?
[18:04:38] ottomata, dschoon, erosen ^
[18:21:50] i think erosen basically wrote this: http://discoproject.org/ -- "so yeah. i had to merge some spreadsheets together and ended up writing a map-reduce framework in python running on top of erlang."
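A consolidated version of the cron hunt described above, for reference; nothing here beyond the locations already mentioned in the log, plus /etc/cron.d and the dataset2 side, which may be pulling the files itself (an assumption):

    # per-user crontabs and system-wide cron locations on locke
    ls -l /var/spool/cron/crontabs/
    grep -ri webstats /etc/crontab /etc/cron.d /etc/cron.hourly /etc/cron.daily 2>/dev/null
    # anything cron actually ran
    grep -i cron /var/log/syslog | grep -iE 'webstats|dumps'
    # the copy may also be driven from dataset2, so the same sweep there (plus a look
    # for a long-running rsync or screen session) may be needed
    ps aux | grep -iE 'rsync|webstats'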
[18:31:07] it seems that zcat on osx has a bug, it always adds a 'Z' to the input path and then it says it cannot find the file
[18:31:55] dschoon is being trolled by average_drifter with ducks, ROFLOL :D
[18:34:14] ah, Barkeep was considered, as was like every code review tool ever. Very intriguing: http://www.mediawiki.org/wiki/Git/Gerrit_evaluation
[18:36:47] gerrit is here to stay
[18:36:51] i cry about it two/three nights a week
[18:36:58] but i've made my peace with it
[18:37:53] anyways, ottomata -- is kafka scrapped now? :/
[18:39:54] not scrapped, ori-l. we're adding a flume feed to provide short-term reliability to the data imports
[18:39:59] gotta get shit done.
[18:40:21] it'll also answer the question for good whether it's a problem with the multicast feed or with the jangly import setup we have
[18:40:37] yea, not scrapped
[18:40:56] but in the current setup (udp2log, no storm), flume is a better fit for what we are doing than kafka
[18:41:06] i don't care for kafka, just a bit bummed to have done the work for nothing :/
[18:41:17] fear not.
[18:41:31] flume can't work longterm anyway, it writes directly to hdfs.
[18:41:32] if/when we get around to building the architecture we had originally planned (kafka producers on frontends, storm doing ETL and hdfs importing), then kafka should be better than flume
[18:41:40] (well, it could, we'd just have to redesign ETL)
[18:41:47] right, i wouldn't say 'can't'
[18:41:51] it would just be a different arch
[18:41:55] than we planned
[18:41:56] how are you going from udp -> flume? is that the thing you e-mailed, ottomata?
[18:42:54] yeah, ha, wish we had talked friday ori-l, i had thought about writing what you wrote for kafka for a while, but flume had custom pluggable sources and also a really nice feature: bucketing based on content timestamp
[18:42:54] so, this is how i'm doing it:
[18:43:02] https://issues.apache.org/jira/browse/FLUME-1838
[18:43:50] then in the flume config file I can just do this:
[18:43:50] webrequest.sources.udp2log.type = org.apache.flume.source.UDPSource
[18:43:50] webrequest.sources.udp2log.host = 233.58.59.1
[18:43:50] webrequest.sources.udp2log.port = 8420
[18:43:50] webrequest.sources.udp2log.multicast = true
[18:43:54] and it starts consuming
[18:44:05] welllllllllllllll
[18:44:09] it sounds like you made the right decision
[18:44:23] alas, my java career was too brief
[18:44:28] well, we'll see, i'm just now getting it running on a single node
[18:44:32] IF! we keep using kafka for a while
[18:44:33] but this looks like a way better approach
[18:44:44] i will def want to use your udp producer
[18:44:46] also has a good shot of getting merged upstream
[18:44:46] fo show
[18:44:48] it looks great
[18:45:31] do you have a min to chat about this or am i interrupting your hangout? i wanted to ask a few q's
[18:45:57] sure
[18:46:10] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90?authuser=1
[18:46:27] i'm in bed and my wife is sleeping :)
[18:46:35] ohoh
[18:46:38] i thought you wanted a hangout
[18:46:42] yeah no prob, can chat
[18:47:28] so one of the things i did with the EL stream is just make it pub/sub so that it's trivial to write another tool that works with the data feed
[18:47:58] i was able to add ganglia monitoring, sequence id monitoring, etc as standalone scripts because the 0mq stream is subscribable by an arbitrary # of scripts
[18:48:08] also makes code upgrades easier to manage
[18:48:12] yeah, that's one of the reasons I like kafka more than flume
[18:48:15] pubsub stuff
[18:48:24] well, you can get much of the benefits by doing something like say
[18:48:52] writing a super-thin program that consumes the multicast udp and rebroadcasts it over 2 udp ports
[18:49:05] 'dev' / 'prod' or whatever
[18:49:46] that way you can keep one thing chugging along but still have a way of tapping into the firehose easily
[18:50:10] that's pretty easy with multicast in general, no? just join the group again on another machine
[18:50:17] i think you can do it on the same machine too, just need another port…i think
[18:50:49] do you have a machine that is on the multicast group but can be used for development/experimentation?
[18:51:26] anything in eqiad can do that
[18:53:02] ottomata: wee, i had no idea
[18:54:03] so, this socat is a wee bit funky (I don't know socat very well yet)
[18:54:04] socat UDP4-RECVFROM:8420,ip-add-membership=233.58.59.1:10.64.36.126,fork FD:2
[18:54:13] (replace that last IP with your node's IP)
[18:54:30] i think the ,fork FD:2 bit is weird, but I at least get that to work
[18:54:42] that will blast your terminal with the webrequest stream
[18:56:56] well, i can do something useful with it i think, which is track seq id gaps (=udplog data loss)
[18:57:02] if you guys want
[19:00:46] ottomata: jesus christ. i did not know about socat at all. where has it been all my life.
[19:01:05] (answer: "/usr/bin", apparently.)
[19:01:39] haha
[19:01:45] ori-l, that's this, no?
[19:01:45] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=Analytics+Webrequest+Packet+Loss&vl=&x=&n=&hreg%5B%5D=analytics10%5Cd%5Cd.eqiad.wmnet&mreg%5B%5D=packet_loss_average&gtype=line&glegend=show&aggregate=1
[19:02:11] that examines hostnames and seq numbers in the stream on each of the current udp2log instances in the analytics cluster
[19:02:25] now, that uses the packet-loss.c code that ships with udp2log, so it will only work for udp2log data really
[19:02:27] and, also
[19:02:53] * ori-l looks for the ganglia password again, gah
[19:02:54] just starting up an instance somewhere and checking for gaps in seq numbers will tell you if there are dropped packets from the source machine to the dest machine, somewhere in the network
[19:02:57] ah
[19:03:10] but
[19:03:30] we need to track dropped packets on each of the udp2log machine instances
[19:03:46] since usually, dropped packets are caused by the kernel udp buffer filling up on the dest machine
[19:03:51] not some in-between network problem
[19:04:05] so, you could start consuming the stream for seq gaps anywhere
[19:04:13] but that won't tell you if there is packet loss in the places where it counts
[19:04:51] that examines hostnames and seq numbers in the stream on each of the current udp2log instances in the analytics cluster
[19:04:57] are you sure? i thought what you just described is what it was looking for
[19:05:14] yes
[19:05:16] yes
[19:05:30] if you check seqs on the actual machines that are already consuming the data
[19:05:33] that is a good check
[19:05:35] but i was saying
[19:05:45] if you were to just spawn up something on some machine in eqiad that was checking for gaps
[19:05:57] it wouldn't tell you if there were dropped packets on the production udp2log consumers
[19:06:15] because usually, we have packet loss due to filled buffers on each node
[19:06:28] not because packets get lost in the network somewhere
[19:06:38] but, even more annoying!
[19:06:44] right, but this is a problem you can solve reliably rather than have to monitor on an ongoing basis, imho
[19:06:56] by just getting a receiver up that can consume the stream fast enough
[19:07:02] which i think you are on your way to doing
[19:07:03] aye, agreed
[19:07:08] but even more annoying!
[19:07:10] is that packet loss is not related to our recent 0 byte import problems!
[19:07:27] the 4 udp2log instances + kafka producers I have seem to do pretty well
[19:07:29] so you guys are using 0mq after all
[19:07:36] (sorry, couldn't resist)
[19:07:42] (we are?)
[19:07:47] bad pun
[19:08:03] haha, i get it
[19:08:03] ha
[19:08:30] the problem is udp2log writing to stdout and kafka producers reading from stdin not playing well together
[19:09:32] yeah
[19:10:03] but udpkafka would work for that, i hope?
[19:10:26] but the reasons for choosing flume for now seem sensible
[19:10:48] okay, so i basically have nothing helpful to add, i don't think
[19:10:55] seems like you have a good handle on the problem
[19:11:55] yeah i think it would
[19:12:12] if flume doesn't work out (it might not), then udpkafka is next on the list!
[19:12:29] haha, it's OK, i was just making a show of sulking
[19:12:31] you should use what works
[19:12:37] flume looks sensible
[19:12:47] if you can get HDFS to tolerate the load, stick w/it
[19:15:18] well, except --
[19:15:55] if you're going to do _that_, why not udp -> hdfs directly?
[19:18:22] that's a lot harder than it sounds
[19:18:55] it needs to be buffered, for one thing, and appends are impossible or hard; also, one of the main features i'm trying to get flume to do
[19:19:00] is content-timestamp-based bucketing
[19:19:19] so, bucketing based on the web request timestamp, instead of when the event was read
[19:19:33] two cool things:
[19:19:55] http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
[19:19:56] http://flume.apache.org/FlumeUserGuide.html#regex-extractor-interceptor
[19:21:24] ......this is a way better tool for the problem
[19:21:29] i completely agree
[19:22:10] ok, carry on :P
[19:23:46] hehe :)
[21:24:54] hrmmph
[21:29:29] what's up ottomata, run into flume probs?
[21:31:20] yeah, i'm hoping i've just configured something funky
[21:31:24] just asked on flume mailing list
[21:31:25] http://mail-archives.apache.org/mod_mbox/flume-user/201301.mbox/browser
[21:52:45] gotta head out for dr's appointment
[21:52:48] back online after
[21:52:50] (from home)
[22:00:14] laatas
[23:59:50] nite everyone
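Building on the socat one-liner quoted earlier, a minimal sketch of the kind of standalone seq-gap check discussed above, assuming the udp2log line format puts the source hostname in field 1 and the sequence number in field 2 (an assumption here), and sending socat's output to stdout instead of FD:2 so it can be piped; as noted in the conversation, this only shows loss on the box running it, not on the production udp2log consumers:

    # join the multicast group and report any jump in per-host sequence numbers
    # (a gap usually means packets were dropped somewhere on the way to this box)
    socat UDP4-RECVFROM:8420,ip-add-membership=233.58.59.1:10.64.36.126,fork STDOUT \
      | awk '{ if (($1 in last) && $2 != last[$1] + 1) print "gap on " $1 ": expected " (last[$1] + 1) ", got " $2; last[$1] = $2 }'

And one way the two Flume features linked above could be wired together for content-timestamp-based bucketing, extending the UDPSource snippet quoted earlier; the interceptor/serializer/sink names, the regex, the field layout of the log line, and the HDFS path are illustrative assumptions, not the actual config:

    # extract the request timestamp from the event body and store it in the
    # 'timestamp' header as epoch millis (regex and timestamp format are assumptions)
    webrequest.sources.udp2log.interceptors = ts
    webrequest.sources.udp2log.interceptors.ts.type = regex_extractor
    webrequest.sources.udp2log.interceptors.ts.regex = (\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2})
    webrequest.sources.udp2log.interceptors.ts.serializers = s1
    webrequest.sources.udp2log.interceptors.ts.serializers.s1.name = timestamp
    webrequest.sources.udp2log.interceptors.ts.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
    webrequest.sources.udp2log.interceptors.ts.serializers.s1.pattern = yyyy-MM-dd'T'HH:mm:ss

    # the HDFS sink then buckets on that 'timestamp' header, i.e. the request time,
    # not the time the event was read (sink name and path are illustrative)
    webrequest.sinks.hdfsSink.type = hdfs
    webrequest.sinks.hdfsSink.hdfs.path = /wmf/raw/webrequest/%Y-%m-%d/%H
    webrequest.sinks.hdfsSink.hdfs.fileType = DataStream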