[14:22:13] hey ottomata
[14:22:17] heya, morning
[14:25:42] ottomata: the packages are now available :)
[14:26:57] cool! i will work on that today then
[14:27:05] gimme a few to go through my email
[14:27:26] ok
[15:03:46] hmm, ok average_drifter
[15:03:51] does collector not daemonize itself anymore?
[15:06:58] ok, i'm running udp-filter -o | log2udp and collector on analytics1026 right now
[15:24:57] ottomata: does daemonization of collector not work ?
[15:25:13] I haven't done many updates on collector. I'd expect it to work
[15:25:23] it works, it just doesn't daemonize by default like it used to
[15:25:25] ottomata: did you get any error ? how did you run collector ?
[15:25:31] /usr/bin/collector
[15:25:31] oh ok
[15:25:42] i'm running it in a screen right now
[15:25:48] build1 ?
[15:25:52] no, analytics1026
[15:25:57] i have the udp2log stream available there
[15:25:59] i couldn't get it on stat1
[15:26:18] could I access analytics1026 too ?
[15:26:21] it just keeps printing
[15:26:21] 3: Handling the message
[15:26:44] yes well collector is dumping to disk
[15:26:51] each hour
[15:27:09] right
[15:27:10] so STDOUT for collector is more like printing debug messages
[15:27:29] I mean that's what collector uses STDOUT for
[15:27:31] ok
[15:28:58] right, it looks like it is working
[15:29:03] it just didn't background/daemonize
[15:29:08] like I thought it was going to
[15:30:56] ok, for example, on build2
[15:31:04] i extracted the webstatscollector .deb at
[15:31:16] /usr/otto/webstats_new/
[15:31:17] so you can try
[15:31:24] /usr/otto/webstats_new/usr/bin/collector
[15:31:29] and see that it doesn't background
[15:31:45] honestly, I don't think that it should background, that's just what it used to do
[15:35:11] stopping hadoop
[15:56:53] ok back
[15:57:48] ottomata: so we should be checking for collector output after it's dumping some on disk
[15:57:55] right
[15:58:03] nothing yet
[15:58:16] hasn't been an hour yet
[16:01:03] ottomata: we can tweak the time to get faster results through the -t switch
[16:01:25] ottomata: can we run with -t 120 (that's 2 minutes) ?
[16:02:05] oh, cool, ok!
[16:02:09] it's almost been an hour
[16:02:14] so i'll let this dump and then run with -t
[16:04:20] alright :)
[16:15:14] welp.
[16:15:26] office redesign could have been worse, i suppose.
[16:15:30] not sure how, yet
[16:15:34] but it could have!
[16:15:40] haha, uh oh
[16:15:45] they're done?
[16:16:19] something like that.
[16:19:23] morning milimetric
[16:32:05] brb moving upstairs
[16:40:16] ah, average_drifter
[16:40:17] no dumps
[16:40:18] but
[16:40:22] Segmentation fault
[16:40:29] running with -t 120
[16:40:58] ran with -t 2
[16:41:00] seg fault
[16:43:08] okay okay.
[16:43:12] i'll do it tomorrow, i think.
[16:43:15] i'll give it a day.
[16:53:22] ottomata: ok reproduced, gonna fix it now
[16:53:28] ok cool
[16:54:14] if you want to try it with the stream, you can run it on stat1
[16:54:16] stat1 should get the stream
[16:54:18] not sure, but i think so
[16:54:37] yeah, it's there
[16:54:49] run collector on stat1 to test with the udp-filter -o stream
[17:32:38] ottomata: solved the bug
[17:32:42] ottomata: what do I do now ?
[17:32:53] ottomata: do I make another set of webstatscollector packages ?
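
The daemonization being discussed above is the standard Unix double-fork pattern: fork, exit the parent, start a new session, fork again, and detach stdio. The sketch below is a generic illustration of that pattern, assuming nothing about webstatscollector's actual source; the daemonize() helper is hypothetical.

    /* Minimal sketch of the classic Unix daemonization pattern that a tool
     * like collector would use to "background" itself.  Illustrative only;
     * this is not taken from webstatscollector's source. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    static void daemonize(void) {
        pid_t pid = fork();
        if (pid < 0) exit(EXIT_FAILURE);   /* fork failed */
        if (pid > 0) exit(EXIT_SUCCESS);   /* parent exits, child continues */

        if (setsid() < 0) exit(EXIT_FAILURE);  /* become session leader */

        pid = fork();                      /* second fork: can never reacquire a tty */
        if (pid < 0) exit(EXIT_FAILURE);
        if (pid > 0) exit(EXIT_SUCCESS);

        umask(0);
        chdir("/");

        /* detach stdio, so debug output like "3: Handling the message" no longer
         * reaches the terminal */
        int devnull = open("/dev/null", O_RDWR);
        if (devnull >= 0) {
            dup2(devnull, STDIN_FILENO);
            dup2(devnull, STDOUT_FILENO);
            dup2(devnull, STDERR_FILENO);
        }
    }
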
[17:33:10] we can just test on stat1 if you want to copy the files over there before making a deb
[17:33:13] you can test actually
[17:33:17] just copy them over there and run collector
[17:33:28] it should be able to pick up the udp-filter -o stream
[17:33:52] so I tested locally with
[17:34:03] cat sampled_bugfixing | ./udp-filter -o | nc -u 127.0.0.1 3815
[17:36:07] ottomata: which stream ?
[17:36:30] ottomata: at the moment I take input for udp-filter from a file
[17:36:35] ottomata: is there a stream on stat1 ?
[17:36:45] so, i have udp-filter -o | log2udp stat1.wikimedia.org 3815 running right now
[17:36:58] so a udp stream of the output from udp-filter -o is available on stat1
[17:37:01] try this, and see
[17:37:12] netcat -lu 208.80.152.146 3815
[17:37:16] that will just output the udp stream
[17:37:23] since collector listens on 3815 for data,
[17:37:31] if you just run collector on stat1
[17:37:34] it will test the live stream
[17:58:33] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[18:00:02] can someone repaste link?
[18:00:14] discovered that colloquy was disconnected
[18:00:25] drdee
[18:00:46] dschoon?
[18:00:48] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[18:00:54] thanks
[18:00:58] dschoon: nvm
[18:00:59] (i should not be here :D )
[18:01:25] no you shouldn't!
[18:01:45] i am just lurking
[18:23:02] woo.
[18:23:08] ottomata: lmk when hue is back up
[18:27:02] ok
[18:34:28] ok, might be a bit still, need to do some mysql moving stuff and partition creation on an27
[18:52:53] ottomata: new packages ready
[18:54:11] ottomata: on build1
[18:54:16] ottomata: fixed the bug
[18:54:28] udp-filter has a new package too? or just webstats?
[18:57:53] ok, average_drifter, i'm trying the new webstatscollector collector on stat1
[18:57:56] it isn't segfaulting
[18:58:01] but the files it is creating are empty
[18:58:03] i'm using -t 10
[19:00:16] ottomata: just webstats
[19:00:36] ok cool, so yeah
[19:00:37] no output
[19:00:39] well
[19:00:41] it creates files
[19:00:45] but they are empty
[19:01:45] ottomata: so you're doing this on stat1 ?
[19:01:52] yes
[19:01:57] in my homedir you can try it
[19:02:03] ok
[19:02:03] /home/otto/webstats_new/usr/bin/collector
[19:06:36] ottomata: the stream you have, where does it come from ?
[19:06:49] ottomata: does it come from the old filter or from the new one ?
[19:07:04] it comes from your new udp-filter -o
[19:07:13] you can read the stream yourself on stat1
[19:07:40] netcat -lu stat1.wikimedia.org 3815
[19:14:40] ottomata: the stream you have has the following format
[19:14:48] NUMBER NUMBER STRING NUMBER NUMBER STRING
[19:14:58] the new udp-filter -o format is
[19:15:19] NUMBER STRING NUMBER NUMBER STRING
[19:15:25] ah, that number is from log2udp
[19:15:27] it adds it
[19:15:35] erm, should I adapt to it ?
[19:15:37] that's how collector works, no? on locke right now
[19:15:39] the process is
[19:15:51] udp2log stream -> filter -> log2udp -> collector
[19:16:02] you should test with log2udp i guess
[19:16:52] udp-filter -o | log2udp -h 127.0.0.1 -p 3815
[19:16:55] and then run collector
[19:16:56] test with that
[19:17:52] where can I get log2udp ?
[19:18:24] I need the stream (or a sample of it) locally so I can debug/develop on it, and then deploy to stat1 to test
[19:18:28] stat1 has big latency for me
[19:18:37] ok I have a sample of the stream
[19:18:51] ok cool
[19:18:52] my current question is what do I do with the first number ?
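
The extra leading NUMBER in the stream collector receives is the sequence number that log2udp prepends to each line it relays, which is why there is one more field than raw udp-filter -o output carries. Below is a minimal sketch of reading a line and discarding that leading field before parsing the rest; the skip_seq() helper and the stdin loop are hypothetical, not taken from collector's real code.

    /* Hypothetical sketch: strip the sequence number that log2udp prepends
     * to each relayed line before handing the remaining udp-filter -o
     * fields to the normal parser. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Given a line whose first whitespace-separated token is the log2udp
     * sequence number, return a pointer just past that token; if there is
     * no leading number, return the line unchanged. */
    static const char *skip_seq(const char *line) {
        char *rest;
        (void)strtol(line, &rest, 10);     /* read (and discard) the leading number */
        if (rest == line || *rest != ' ')  /* no leading number: leave line untouched */
            return line;
        return rest + 1;                   /* skip the separating space */
    }

    int main(void) {
        char buf[4096];
        while (fgets(buf, sizeof buf, stdin)) {
            const char *fields = skip_seq(buf);
            fputs(fields, stdout);         /* a real collector would aggregate here */
        }
        return 0;
    }
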
[19:19:00] i dunno, did you rewrite collector or something?
[19:19:03] because I'm not expecting the first number in there
[19:19:05] this worked with the previous collector
[19:19:22] I didn't rewrite it, but I modified it in some parts yes
[19:19:33] ummmmmm
[19:19:35] hm
[19:19:51] well, i'm not exactly sure WHY we need to use log2udp and collector
[19:19:57] if we are changing this
[19:20:10] let's actually forget log2udp
[19:20:13] we don't need it
[19:20:24] ok, can we exclude log2udp from the test ?
[19:20:27] yes, so
[19:20:28] but
[19:20:30] if we do that
[19:20:31] if we can do that then it has no reason not to work
[19:20:37] that means that collector should read from stdin
[19:20:39] not from network
[19:20:41] what happens if we exclude log2udp ?
[19:20:49] log2udp is needed for network relay
[19:21:03] it's basically just a netcat with some extra stuff
[19:21:10] like adding those seq #s
[19:21:16] can it be just the netcat without the other extras ?
[19:21:23] so that's what the first field is !
[19:21:25] the seq
[19:21:27] right ?
[19:21:27] errrr
[19:21:28] yes
[19:21:38] i guess so
[19:21:44] sigh, it sucks though, because I don't really know
[19:21:55] it's like you and I were given this thing without any instructions
[19:22:03] ok, maybe I can read the seq too and discard it. because if I don't read it then I get "invalid" data and you get 0 files on disk
[19:22:20] the best thing to do
[19:22:25] would be to change as little as possible, right?
[19:22:29] yes
[19:22:37] you probably should have just let collector read the same format as before
[19:41:11] dschoon, erosen
[19:41:19] hue should be back up, but at a different url
[19:41:19] yo
[19:41:20] hm?
[19:41:25] awesome
[19:41:33] the index.php on analytics1001 has been updated with new urls
[19:41:43] nice
[19:41:53] coolio. ty
[19:41:55] dschoon, what do you think about assigning some alias hostnames to these nodes?
[19:42:04] i always approve of pretty names
[19:42:11] only for interface purposes, not for config files
[19:42:30] I would puppetize /etc/hosts on kraken nodes
[19:42:33] with all hostnames and aliases
[19:42:39] ahh. yeah, that sounds good.
[19:42:41] are we talking about names like http://analytics1027.eqiad.wmnet:8888/about/
[19:42:42] ?
[19:42:52] so, that way we don't have to remember that an10 is namenode urls, and an27 is hue oozie etc.
[19:42:53] yes
[19:42:58] so that could change to
[19:43:04] hmm
[19:43:07] so, not sure if this is on my end, but
[19:43:14] doesn't seem to work for me
[19:43:16] oh?
[19:43:22] you using proxy?
[19:43:33] actually
[19:43:34] yeah
[19:43:36] my bad
[19:43:39] I am in the office
[19:44:09] works
[19:44:11] my b
[19:44:24] brb lunch.
[19:44:35] want to grab it before it gets silly out there.
[19:45:13] ottomata: another question which may just be me messing up
[19:45:21] what are the credentials to enter for the hue login
[19:45:39] the special account
[19:45:41] ?
[19:46:07] or the wmf-analytics
[19:46:54] you should have an account, right?
[19:47:36] yeah
[19:47:37] I do
[19:47:52] i think i must just be misremembering
[19:47:54] I'll figure it out
[19:47:57] should be erosen
[19:47:59] and your pw
[19:48:00] yeah
[19:48:08] ooo
[19:48:29] hmm
[19:48:48] yeah, I can't seem to figure out the pw
[19:48:56] i think diederik told it to me
[19:48:59] hmmmmm
[19:49:05] but that one doesn't seem to be working
[19:49:31] hmmmmmmmmmmmmm
[19:49:33] hang on
[19:50:19] i still sort of assume it is my fault
[19:53:24] ok erosen
[19:53:24] try now
[19:53:27] no, it was mine
[19:53:38] worked
[19:53:42] thanks
[19:55:26] while I'm in Hue, what is the path for the kafka log stuff?
[20:04:54] oh, so, the only stuff that is being imported right now is the fake event log
[20:04:56] in my homedir
[20:05:01] cool
[20:05:25] /user/otto/event/logs
[20:05:27] can't view your homedir
[20:05:33] seems to be a new thing
[20:05:41] hm
[20:05:44] Cannot access: /user/otto.
[20:05:44] AccessControlException: Permission denied: user=erosen, access=READ_EXECUTE, inode="/user/otto":otto:hadoop:drwx------ (error 403)
[20:05:51] now?
[20:05:58] works
[20:06:07] what did you change?
[20:06:21] chmod
[20:07:01] gotcha
[20:07:04] cool
[20:07:18] okay I see now they are regular web server logs
[20:07:51] i could use those
[20:08:02] but i would need to do the geocoding in hadoop
[20:08:08] i mean the ip range checking
[20:08:20] or have the X-Carrier
[20:10:24] ottomata: new packages in 1-2 minutes
[20:10:31] ok cool
[20:10:41] erosen, X-Carrier has to come from the log sources, right?
[20:10:56] I can't really give you that, we want to add that, but FR has told us not to touch those til after the fundraiser
[20:11:05] gotcha
[20:11:09] the ones in my home directory are not regular webserver logs
[20:11:17] they are the proposed event/pixel log format
[20:11:28] i'm not regularly importing the web access logs yet
[20:11:29] yeah I figured X-Carrier is a ways out
[20:11:37] but we could do the zero ones if you want
[20:12:01] hmmmmm
[20:12:18] whatever is the best way for you to get the zero logs in hdfs
[20:12:25] i am pretty flexible
[20:12:25] ok, i'm going to try with kafka
[20:12:28] cool
[20:12:46] it's just annoying cause i'll have to run the same filters on kraken as on oxygen
[20:12:52] so the next step would be to write a UDF which has the carrier IP ranges in it
[20:13:08] oh, that would be cool
[20:13:14] i can do that
[20:13:17] actually, I would love it if I could import all of these to one topic dir
[20:13:20] where do the definitive ranges live?
[20:13:21] that way I only have to run one filter
[20:13:37] that is fine with me
[20:13:55] does that seem like the best long term strategy
[20:14:02] https://office.wikimedia.org/wiki/Partner_IP_Ranges
[20:14:17] like is the intention to do that sort of thing in hadoop?
[20:14:38] or would we want to use udp2log
[20:14:44] not exactly sure, but if the X-Carrier is in the log line
[20:14:59] it would be pretty easy, right? I think once we have the X-Carrier in the log line, we plan on only saving one file anyway
[20:15:16] which (for this process anyway) would mean a single filter/stream/directory/filename into hadoop
[20:15:34] yeah
[20:15:35] cool
[20:16:02] when you were looking around for ip utility libraries did you happen to see any java ones that looked promising?
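
The UDF discussed above would tag requests by carrier using the ranges on the Partner_IP_Ranges page, and the chat suggests the eventual implementation would be Java. Whatever the language, the core of the check is just masking an address against a CIDR prefix and comparing. Below is a sketch of that logic with a placeholder range and hypothetical names, written in C only for consistency with the other sketches in this log; it is not an existing UDF.

    /* Illustrative sketch of the core "is this IP inside a carrier's range"
     * check that a carrier-tagging UDF would perform.  The range used here
     * is a placeholder, not a real carrier range. */
    #include <arpa/inet.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* True if ip (dotted quad) falls inside net/prefix_len, e.g. 10.0.0.0/8. */
    static bool ip_in_cidr(const char *ip, const char *net, int prefix_len) {
        struct in_addr a, n;
        if (inet_pton(AF_INET, ip, &a) != 1 || inet_pton(AF_INET, net, &n) != 1)
            return false;
        uint32_t mask = prefix_len == 0 ? 0 : htonl(~0u << (32 - prefix_len));
        return (a.s_addr & mask) == (n.s_addr & mask);
    }

    int main(void) {
        /* placeholder range standing in for one entry from the ranges page */
        printf("%d\n", ip_in_cidr("10.1.2.3", "10.0.0.0", 8));   /* prints 1 */
        printf("%d\n", ip_in_cidr("192.0.2.1", "10.0.0.0", 8));  /* prints 0 */
        return 0;
    }
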
[20:16:33] naw, didn't look at java stuff then
[20:16:38] cool
[20:16:45] ottomata: new packages ready
[20:16:49] ottomata: please try again
[20:17:05] so, just let me know when the kafka stream is coming in and i'll start messing around with it.
[20:17:37] ok, how often do you want it to import into hadoop?
[20:17:38] hourly?
[20:17:44] average_drifter, ok...
[20:18:15] hmm
[20:18:18] daily would be fine
[20:18:34] hm, ok
[20:19:57] * drdee is looking forward to having a single file with all Wikipedia Zero data using the X-Carrier http header
[20:20:12] * drdee is still not here :D
[20:20:37] average_drifter, same thing
[20:20:45] i'm still using the log2udp stream
[20:20:47] should I not be?
[20:35:48] back
[20:37:04] ottomata: you should be using that
[20:37:28] ok, I am, but still no output
[20:46:53] hey dschoon, you there?
[20:49:16] yeep
[20:49:17] sup?
[20:51:36] ottomata: I think the sockets are different
[20:52:05] any opinions on the hadoop directory structure?
[20:52:09] dschoon?
[20:52:15] like for storing shared log files and such
[20:52:24] i'm going to import some of the zero stuff for evan
[20:52:28] hm.
[20:52:32] average_drifter, whatcha mean?
[20:53:01] i favor //<...DATEPARTS> for anything new
[20:53:30] for archival stuff, i'd vastly prefer we migrate things into one unified structure by assigning PCs and creating the appropriate data directories
[20:53:41] we might want a root set of prefixes, i guess
[20:53:54] maybe /raw and /products?
[20:54:03] which would let you import stuff into /raw now
[20:54:11] and we could run jobs to migrate things to /products
[20:55:02] how would people know which product code means what
[20:55:04] i would prefer one more level there too
[20:55:19] dunno what though, /share maybe?
[20:55:24] /data
[20:55:29] milimetric: there's a product code wiki page
[20:55:34] cool
[20:55:45] well, permissions are based on groups and users, right?
[20:56:05] isn't it more sensible just to have a "shared" group that allows read-only access to certain product codes?
[20:56:23] yeah that's cool, i just don't want to clutter up /
[20:56:26] and also have groups for each team and each product?
[20:56:52] i don't think it's bad to have lots of codes at /products
[20:57:01] but i agree that hierarchy is good.
[20:57:13] /share just sounds like an invitation for trouble though
[20:57:21] yeah
[20:57:22] ok
[20:57:37] /data sounds too generic
[20:57:45] we could put this in /var/wmf
[20:57:55] ?
[20:57:57] /var/wikimedia/
[20:58:03] i'm confused
[20:58:05] there is already a hadoop /var dir
[20:58:06] what is /var?
[20:58:11] oh.
[20:58:14] what usually goes in there?
[20:58:28] i'd prefer we stuck it in /raw/
[20:58:29] well, in unix, it is usually a place where installed packages store their data files
[20:58:31] and log files
[20:58:43] and processed stuff would end up in /products/
[20:58:53] i thought it was runtime data?
[20:59:21] yeah
[20:59:30] which um, is what we are talking about?
[20:59:36] wikimedia specific uhhhhh data
[20:59:46] like, mysql creates DBs in /var/lib/mysql
[20:59:51] yes
[21:00:06] yeah, but hdfs is only "data"
[21:00:20] otherwise i'd suggest /var/wmf/{raw,products}
[21:00:32] no point in the prefix, as there *is* nothing else
[21:00:42] other than hdfs metadata and our random system files
[21:01:14] hmmmmmmmm, what if you had shared .jar files?
[21:01:25] shared pig udfs, etc.
[21:01:35] shared MR job .jar files
[21:02:10] /jars?
[21:02:12] (do those usually go in hdfs?)
[21:02:22] (i guess there's no reason not to.)
[21:02:22] sometimes, yeah, you can upload them so they already exist and then refer to them
[21:02:38] in pig at least
[21:02:43] i want to avoid replicating the unix filesystem with hdfs, though :P
[21:02:45] /lib
[21:02:46] that seems silly.
[21:02:50] /lib would work
[21:03:00] jars: /lib/java ?
[21:03:00] i guess so, seems to me like something that would be good to base stuff off of
[21:03:06] rather than create our own
[21:03:08] i'm fine with that
[21:03:18] does it automatically create user directories in /home?
[21:03:23] i vaguely recall that
[21:03:26] it assumes /user
[21:03:27] not /home
[21:03:32] :P
[21:03:45] but, some of the dirs in /user are not real users, like 'history'
[21:04:09] what about just a root /wikimedia dir
[21:04:38] /wikimedia/{raw,products/{…}}
[21:04:39] ?
[21:05:00] ottomata: I'm trying to nc -lu 208.80.152.146 3815 | nc -u 127.0.0.1 3815
[21:05:18] ottomata: the collector on stat1 I'm running from sandbox_collector is not receiving any data
[21:05:32] ottomata: I think that's because it's binding to the loopback
[21:05:38] ottomata: uhm, I may be wrong..
[21:05:47] why do you need to do the double netcat?
[21:05:51] can't you just run collector
[21:05:52] ?
[21:06:05] ottomata: to steer the flow of data on the loopback interface
[21:06:08] oh
[21:06:12] why?
[21:06:20] because it's hardcoded in sandbox collector?
[21:06:27] ottomata: /wmf makes my typey bits happier
[21:06:45] pssshhhhh i'm fine with that only because it is a well-known acronym
[21:06:51] in general I am in favor of verbosity
[21:07:00] but ok.
[21:07:03] /wmf it is
[21:07:08] ottomata: yeah
[21:07:09] spetrea@stat1:~/sandbox_collector/dumps$ netstat -antup 2>&1 |grep 3815
[21:07:12] tcp 0 0 127.0.0.1:3815 0.0.0.0:* LISTEN 31835/collector-sta
[21:07:14] i'm going to save these zero files in /wmf/raw/wikipedia-zero
[21:07:15] ottomata: ^^
[21:07:30] y u hardcode on 127.0.0.1?
[21:07:39] y u have 'sandbox_collector'?
[21:07:53] ottomata: just a directory for testing
[21:07:56] oh sorry
[21:07:57] ok
[21:08:02] y u hardcode on 127.0.0.1?
[21:08:05] ottomata: I didn't hardcode 127.0.0.1, it was there before I had written any code
[21:08:10] hm
[21:08:24] oh hmmmmmmmm right, because i'm sending this from an27 via log2udp
[21:08:25] sigh
[21:08:33] collector is so bad!
[21:08:33] ok
[21:08:38] so your double netcat works?
[21:08:40] for testing this?
[21:08:49] ottomata: the double netcat doesn't work, it immediately returns
[21:08:52] ok
[21:08:52] ottomata: I dunno why
[21:08:54] ottomata: nc -lu 208.80.152.146 3815 | nc -u 127.0.0.1 3815
[21:08:54] hm
[21:08:58] ottomata: does it look wrong ?
[21:09:19] ottomata: looks good to me, I mean it says "take the data from udp 3815 this ip and throw it on the loopback same port on udp"
[21:10:13] yeah i dunno, looks fine to me
[21:10:17] sigh, ergh
[21:10:30] lemme just test this on an26
[21:10:36] ottomata: alright
[21:10:47] ottomata: on an26 the loopback should coincide with the other stuff
[21:10:51] so it should work there
[21:10:53] exactly
[21:10:54] k
[21:12:07] what the crap is the point of collector reading from a network socket if it is hardcoded to 127.0.0.1!?!!?!?!?
[21:12:10] man so dumb
[21:14:11] ok cool, i have output
[21:14:50] ok, average_drifter, I'm running this in a screen now on an26 and it is outputting
[21:14:59] i'm leaving it on the default hourly output
[21:15:15] this is to make 100% sure it works before we deploy it for real, right?
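
The double-netcat workaround above exists only because collector binds its UDP listener to a hardcoded 127.0.0.1. Below is a sketch of what a configurable bind address could look like; the -l flag, the default, and the listener loop are hypothetical, not options collector actually provides.

    /* Sketch of binding a collector-style UDP listener to a configurable
     * address instead of a hardcoded 127.0.0.1.  The -l flag shown here is
     * hypothetical, not something collector actually supports. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        const char *listen_ip = "127.0.0.1";      /* current hardcoded behaviour */
        int opt;
        while ((opt = getopt(argc, argv, "l:")) != -1)
            if (opt == 'l') listen_ip = optarg;   /* e.g. -l 0.0.0.0 or -l 208.80.152.146 */

        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_port = htons(3815);
        inet_pton(AF_INET, listen_ip, &addr.sin_addr);

        if (bind(sock, (struct sockaddr *)&addr, sizeof addr) < 0) {
            perror("bind");
            return 1;
        }

        char buf[65536];
        ssize_t n;
        while ((n = recvfrom(sock, buf, sizeof buf - 1, 0, NULL, NULL)) > 0) {
            buf[n] = '\0';
            fputs(buf, stdout);   /* a real collector would aggregate/dump here */
        }
        return 0;
    }
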
[21:15:25] drdee wanted to leave this running for a few days and then compare?
[21:15:53] yes, until we know 100% sure that everything works as expected
[21:16:38] ok, it's running on an26 right now
[21:16:50] saving dumps at /home/otto/tmp/webstats_test/dumps
[21:17:47] :)
[21:18:30] drdee, ottomata: do you see a solution for us to add another cmdline param that would indicate which ip to bind the 3815 udp port to, so collector listens on that particular (ip,port) pair ?
[21:19:01] do we really need it? i would like to move on to the next project
[21:19:22] well, it's not needed; if it will run on an26, the param will not be needed
[21:19:34] ottomata: is that where the collector will run ? on an26 ?
[21:19:39] no
[21:19:41] this is just for testing
[21:19:43] it will run on locke
[21:19:48] but it will be able to listen on 127.0.0.1
[21:19:50] so it will work the same
[21:19:56] ok nice :)
[21:20:01] but
[21:20:03] ?
[21:20:12] if you were going to modify this thing, i'd say get rid of the networking bit altogether
[21:20:16] just make it read from stdin
[21:21:11] * drdee mumbles if it ain't broke don't fix it, we have a lot of wikistats bugs still to fix
[21:21:37] ok :)
[21:27:25] drdee!
[21:56:48] ok, erosen
[21:56:51] http://analytics1027.eqiad.wmnet:8888/filebrowser/view/wmf/raw/wikipedia-zero?file_filter=any
[21:57:11] word
[21:57:15] do you see a file in there?
[21:57:16] this should start importing daily
[21:57:17] no
[21:57:19] cool
[21:57:20] just checking
[21:57:23] it will import 24 hours from now
[21:57:26] hopefully :)
[21:57:28] great
[21:57:29] this is kinda hacky and new
[21:57:31] i'll keep you posted
[21:57:33] and I don't trust it
[21:57:35] so we'll see
[21:57:45] do you think it will be dropping data and such
[21:57:46] ?
[21:57:51] no, def not
[21:57:55] or do you think it will just break
[21:57:56] all the data is in kafka
[21:58:00] k
[21:58:05] the import into hadoop bit is hacky and weird
[21:58:16] running in a while loop in a screen right now
[21:58:18] gotcha
[21:58:20] i'm not sure about limitless consumption
[21:58:21] hehe
[21:58:22] we'll have to see
[21:58:26] cool
[21:58:31] well thanks for setting it up
[21:58:33] all the imports i've done into hadoop via kafka thus far have been small
[21:58:48] gotcha
[22:14:23] robla, we're trying to listen to everything the good dr. dee is saying without condoning his working during his time off :)
[22:14:41] (hence calling him dr. dee so he doesn't get pinged)
[22:28:37] too late!