[00:00:10] that's probably what I'm looking for
[00:01:04] great!
[14:00:26] morning guys
[14:02:09] morninnnng
[14:02:11] ottomata, would this be a good day to set up a udp2log instance in kraken and count the traffic by hour?
[14:02:19] (not storing in hadoop :) )
[14:02:28] you want bytes, right?
[14:02:31] yup
[14:02:37] or do you want me to just save all the data to a file?
[14:02:48] we could pipe it into kafka again
[14:02:58] the data there just gets stored on disk in files
[14:03:03] we don't even need to save the data
[14:03:13] just count the bytes per hour for a 7 day period
[14:03:16] and save those numbers
[14:03:23] hmm
[14:03:40] saving is okay with me but it's not a requirement,
[14:03:45] hmm
[14:04:05] yeah that sounds fun i guess, I can figure that out
[14:08:06] cool
[14:30:10] hey average_drifter
[14:42:57] milimetric sorry for not saying gooood moooorning sooooooooner
[14:43:04] so: GOOOD MORNING!
[14:45:18] :)
[14:45:41] morning drdee - sorry I'm doing maths and my brain is apparently crusty with age
[14:46:28] matrix stuff?
[14:47:27] no, simple stuff to add equal padding to the top and bottom of a log scale: log(y0 - yp) - log(y0) = log(y1 + k*yp) - log(y1)
[14:47:40] (solve for yp)
[14:48:00] i just got it but I was like - man, you don't use logs for 12 years and it all goes to sh*t
[14:48:05] :)
[14:48:10] yep
[14:52:34] i was doing some geometry this week when buying a couch because our room has angles
[14:52:49] man, that was like 20 years ago
[14:56:51] ottomata, i ran some more pig jobs last night
[14:56:53] it's cool
[14:57:16] i do think we need to do more fine-tuning of the configuration
[14:57:17] :) I got sick of those calculations so I put our entire apartment into google sketchup. Was very nice
[14:57:23] :D
[14:57:36] so ganglia monitoring would be very helpful
[14:58:07] also, i sent you a link last night with a sample chapter from the hadoop operations book that gave some instructions on how to set it up
[15:02:46] http://files.cloudera.com/pdf/Hadoop-Operations_sampler_2012-08.pdf
[15:02:56] coffee coffee time
[15:32:16] grr, sometimes I am not signed onto IRC and I don't know it
[15:56:11] hokay drdee, I *think* I've got a super simple hourly line and bytecount summary running on an11
[15:56:19] check this out, lemme know if you think it is too naive
[15:56:31] https://github.com/wmf-analytics/kraken/blob/master/bin/netcat-wc
[15:56:51] i was gonna do something fancier with signal traps, but this seemed to work just fine
[15:57:09] I'm probably going to miss a few bytes in between the killall netcat and the next exec in the loop
[15:57:12] reading now
[15:57:15] but i think it will be inconsequential
[15:58:35] maybe you can also output the date and hour to which the count applies
[15:58:38] i am
[15:58:44] sorry
[15:58:46] echo $(date "+%F_%H.%M.%S") $(netcat -lu $ip $port | wc -lc) &
[15:58:53] man i don't know what's wrong with me
[15:58:55] hehe
[15:59:14] the echo isn't in the background
[15:59:15] so
[15:59:16] is it running?
[15:59:17] yup
[15:59:26] i've run it with intervals of 10 seconds and the like
[15:59:33] it's running with an hourly interval right now
[16:00:18] should see the first output in 46 minutes
[16:00:23] are you writing it to a file?
[16:00:26] yes
[16:00:31] /home/otto/hourly_udp2log_wc.txt
[16:00:33] perfect
[16:05:14] average_drifter you here?
[16:05:25] louisdang? need more assistance with labs?
[16:29:02] average_drifter that youtube clip is crazy!!!!
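A minimal sketch of the hourly counting loop described above. This is not the actual bin/netcat-wc script from the kraken repo, only the approach the chat outlines; the bind address, port, interval, and output path are assumptions.

    #!/bin/bash
    # Count lines and bytes arriving on a udp2log stream once per interval.
    # All values below are hypothetical; adjust to the real relay setup.
    ip=0.0.0.0
    port=8420              # hypothetical udp2log port
    interval=3600          # one hour
    outfile=hourly_udp2log_wc.txt

    while true; do
        # date is evaluated first, so the timestamp marks the start of the interval;
        # netcat listens until it is killed, then wc reports "<lines> <bytes>"
        echo $(date "+%F_%H.%M.%S") $(netcat -lu $ip $port | wc -lc) >> "$outfile" &
        sleep "$interval"
        # stop the current listener; the next iteration starts a fresh one, so a few
        # bytes between the killall and the next netcat may be missed (as noted above)
        killall netcat
    done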
[16:29:13] milimetric, check this out: http://www.youtube.com/watch?v=YDW7kobM6Ik
[16:29:19] from average_drifter
[16:45:20] drdee geez - or you could just ask the girl out and move on if she says no :)
[16:45:41] haha
[16:47:15] cool, drdee, it is working
[16:47:16] cat hourly_udp2log_wc.txt
[16:47:16] 2012-10-10_15.47.00 361257720 155228433849
[16:47:46] that's 144G in the last hour
[16:47:47] whoa
[16:47:51] that seems like way more than I thought
[16:47:53] is that right?
[16:50:11] that's 3.6 TB / day
[16:50:21] i thought we were estimating like 300GB / day
[16:51:39] this has always been my big big worry
[16:53:59] .0
[16:59:55] ottomata, some fields we can discard but it's still a lot of data
[17:00:24] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[17:01:31] ottomata ^^
[17:16:19] here
[17:16:27] drdee: hey, glad you liked it :)
[17:30:36] hey erosen
[17:30:42] hangout die on you?
[17:30:48] ya
[17:30:50] just restarted
[17:30:53] will be rejoining shortly
[17:31:01] if it still exists...
[17:31:16] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[17:31:29] ottomata - gotcha, I was leaving out the 301 which I'm not really sure should be counted
[17:31:46] unbelievable that youtube clip
[17:31:52] just finished watching it
[17:34:41] yeah, i'll count whatever drdee tells me to :p
[17:34:43] be back in a bit
[18:19:50] ottomata question about pig
[18:19:56] i ran a pig script last night
[18:19:59] and i was counting URLs
[18:20:15] but it seemed that pig / hadoop shortens the URLs to a fixed max length
[18:20:23] do you know anything about this?
[18:21:00] louisdang around?
[18:21:45] hm
[18:22:06] whatcha mean? chararray has a max size?
[18:22:10] show me
[18:27:52] hey drdee
[18:27:53] i deleted the output files
[18:28:03] but the URLs were shortened
[18:28:12] they all had the same length and they were unique
[18:28:22] so it almost seems that the key has a max size
[18:28:28] i can start the job again
[18:28:31] are you running something?
[18:28:36] hey louisdang
[18:28:41] do you need any help?
[18:28:46] not running anything
[18:28:52] did you get the hadoop instances running?
[18:28:59] maybe ottomata can help you as well
[18:29:08] I wrote a pig script for the example apache logs but I can't get openjdk to find my classpath
[18:29:17] don't use openjdk
[18:29:20] use sun jdk
[18:29:26] does the terms of service allow me to?
[18:29:41] yes, you just can't redistribute it yourself
[18:29:47] but you are not doing that anyways
[18:29:50] oh ok
[18:30:06] I took the longest time trying to fix it last night
[18:30:31] daww, louisdang, maybe ja I can help?
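Circling back to the traffic numbers above: a rough arithmetic check of the hourly figure from /home/otto/hourly_udp2log_wc.txt (155228433849 bytes in one hour), showing it is broadly consistent with the ~144G/hour and ~3.6 TB/day estimates quoted in the chat.

    # sanity-check arithmetic only; the byte count is the one reported above
    bytes_per_hour=155228433849
    echo "scale=1; $bytes_per_hour / 1024^3" | bc       # ~144.5 GiB in the last hour
    echo "scale=2; $bytes_per_hour * 24 / 1000^4" | bc  # ~3.72 TB per day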
[18:31:08] I'll try installing sun java first
[18:32:45] I got the hadoop instance to work in pseudo-distributed mode since I can't open the ports for the full version myself
[18:34:01] aye
[18:34:03] hm
[19:02:13] yeah, most of my time before was spent on figuring out how to get fully distributed mode to work until I found out that you need a net admin to open the ports
[19:02:32] ottomata, asher suggested that we could look into the kafka go producer
[19:03:14] hm
[19:03:59] it has zookeeper support
[19:04:03] (i believe)
[19:04:13] hmmm
[19:04:22] and it was his suggestion :D
[19:04:32] i thought I looked through the included clients and the only ones that had zookeeper were C# and java
[19:06:34] https://github.com/jdamick/kafka.go
[19:09:54] ottomata, pig output is like this
[19:09:56] http://en.wikipedia.org/wiki/Bush_Rat 3
[19:09:56] http://en.wikipedia.org/wiki/Bushfood 17
[19:09:57] http://en.wikipedia.org/wiki/Butyrate 10
[19:09:58] http://en.wikipedia.org/wiki/Butzbach 2
[19:09:59] http://en.wikipedia.org/wiki/Buy_side 19
[19:10:09] all URLs have the exact same length so they get truncated somehow
[19:10:19] or maybe i am doing something wrong
[19:10:43] just run hadoop fs -cat /user/diederik/referer/part-r-00000
[19:10:59] hm, that is really truncated
[19:11:54] where's your pig script?
[19:11:56] can I see that?
[19:13:13] home/diederik/referer.pig on an01
[19:13:34] (very strongly inspired by your code :D) btw
[19:16:16] fyi, you don't need that first line
[19:16:22] since you aren't using RegexExtract
[19:17:08] did you want to match the referrer on that one?
[19:17:10] not the uri?
[19:17:28] also, i think I saw your chat last night
[19:17:43] you were trying to get numbers about 404s for BannerController and referrers, right?
[19:18:30] no, close
[19:18:50] i want to match the URI, it should contain Special:BannerController
[19:19:01] but the key should be the referer
[19:34:44] drdee, isn't $11 content type?
[19:34:55] ah naw sorry
[19:34:57] you are right
[19:35:55] oh, re before
[19:36:05] shouldn't you be running this on the 404 log data though?
[19:36:10] we have a sample of 404s that jeff generated
[19:36:36] no, don't think so
[19:36:53] matter wants to know which pages call the Special Banner page
[19:37:00] matter = Matt
[19:37:25] hm ok
[19:39:32] any idea why it's truncating the URLs?
[19:39:42] i did some googling but couldn't find anything
[19:40:24] only barely looked at it, chatting with RobH atm
[19:40:53] k
[19:55:26] drdee,
[19:55:36] i just ran something very similar on a smaller subset of data
[19:55:42] i get long results, like:
[19:55:43] (http://www.google.co.jp/search?q=%E5%90%8C%E5%92%8C%E5%AF%BE%E7%AD%96%E4%BA%8B%E6%A5%AD&hl=ja&gbv=1&gs_l=heirloom-hp.1.4.0l10.2781.4625.0.6907.11.10.0.0.0.1.250.1159.1j6j1.8.0...0.0...1c.4j1.-odbap4VLXY&oq=%E5%90%8C%E5%92%8C%E5%AF%BE%E7%AD%96%E4%BA%8B%E6%A5%AD,1)
[19:55:57] mmmmmmm
[19:56:10] this is mine
[19:56:11] it might try to reduce memory usage or something
[19:56:30] and just make sure that the key stays unique
[19:56:38] https://gist.github.com/3868011
[19:56:42] but i cannot find any parameter that hints at this option in the docs
[19:56:47] or did you fix my script?
[19:56:50] this will lower stuff:
[19:56:55] i mainly added this:
[19:56:55] URI_REFERER = FOREACH LOG_FIELDS GENERATE uri, referer;
[19:57:07] that will minimize the amount of data M/R has to send around
[19:57:11] k
[19:57:20] and where do you define $input?
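(The $input question is answered just below: parameters are passed on the CLI.) For reference, a consolidated sketch of the referer-counting job being discussed. This is not the actual /home/diederik/referer.pig or the gist linked above; the field names, field order, and paths are assumptions based on a typical udp2log line layout.

    # hypothetical sketch; adjust the AS(...) schema to the real log format
    cat > referer.pig <<'PIG'
    LOG_FIELDS = LOAD '$input' USING PigStorage(' ') AS (
        hostname:chararray, sequence:chararray, dt:chararray,
        request_time:chararray, ip:chararray, status:chararray,
        size:chararray, method:chararray, uri:chararray,
        hierarchy:chararray, content_type:chararray, referer:chararray,
        x_forwarded_for:chararray, user_agent:chararray );

    -- keep only requests for Special:BannerController, key on the referer
    BANNER  = FILTER LOG_FIELDS BY uri MATCHES '.*Special:BannerController.*';
    REFERER = FOREACH BANNER GENERATE referer;
    COUNTS  = FOREACH (GROUP REFERER BY $0 PARALLEL 7)
              GENERATE $0 AS referer, COUNT($1) AS num;
    STORE COUNTS INTO '$output';
    PIG

    # parameterized invocation, as described below (paths are hypothetical)
    pig -p input=/user/example/logs -p output=/user/example/banner_referers -f referer.pig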
[19:57:24] on the cli
[19:57:30] oh really
[19:57:31] pig -p input=/path/to/logs/*
[19:57:32] how?
[19:57:37] ahhhh cool
[19:57:39] didn't know that
[19:57:44] pig -p input=/a/b -p output=/b/c -f ./myscript.pig
[19:57:44] output the same i guess?
[19:57:57] yup
[19:57:59] -p ==> parameter?
[19:58:00] yup
[19:58:04] right on
[19:58:06] i think there is a --param long option
[19:58:09] i was just hardcoding that stuff
[19:58:20] i hardcode in my text editor when I am testing with a small file
[19:58:21] because i didn't know how you were running it
[19:58:26] ty
[19:58:34] any of the ones I check in I try to parameterize
[19:58:53] actually, we don't even need URI after that point
[19:58:54] hang on
[19:59:13] i think that should work too
[19:59:27] first filter on uri, then generate referrer bag, then group and count referrer
[19:59:29] https://gist.github.com/3868011
[19:59:40] ty sir
[19:59:46] so, i was running that without the filter though
[19:59:48] on this dataset:
[20:01:08] i get Out of bound access. Trying to access non-existent column: 1. Schema referer:chararray has 1 column(s).
[20:01:30] /user/otto/logs0/sampled1-2months-200_lines.log
[20:01:32] oh oops
[20:01:51] yeah i removed uri, but didn't change $1 to $0
[20:01:54] COUNT = FOREACH (GROUP REFERER BY $0 PARALLEL 7) GENERATE $0, COUNT($1) as num;
[20:02:20] ok running
[20:05:47] ottomata, where did you store the udp2log byte count file again?
[20:05:54] i mean, I didn't really change much in your job though, other than cleaning it up with the stuff i'd do
[20:06:25] analytics1011:/home/otto/hourly_udp2log_wc.txt
[20:06:33] oh machine 11
[20:06:55] aye
[20:08:57] so far traffic is robust, about 144GB per hour
[20:10:03] do you know what an uberized job in hadoop is?
[20:10:04] yup
[20:10:14] nope
[20:10:30] i always see 'uberized: false'
[20:10:35] no clue what it means
[20:11:02] you wanna fiddle with ganglia monitoring?
[20:23:12] i did a bit, not really sure what to set some things to
[20:23:16] asked in ops room but got crickets
[20:23:27] :)
[20:23:48] so is hadoop already sending info to ganglia?
[20:26:53] afaik, all the relevant parameters should be configured in hadoop-metrics2.properties
[20:27:16] that file should contain the ganglia endpoint info among other stuff
[20:37:06] ?
[20:37:08] really?
[20:37:14] i'm looking on an01
[20:37:19] nothing about ganglia there, no?
[20:37:23] ok so
[20:37:28] i don't know ganglia very well
[20:37:43] i was looking into this earlier, but didn't want to restart things while your jobs were running
[20:37:49] http://wiki.apache.org/hadoop/GangliaMetrics
[20:37:56] i'm not sure what to replace @GANGLIA@
[20:37:57] with
[20:37:58] maybe localhost
[20:38:01] since gmond is running
[20:38:20] i'm going to try the mcast_join host first
[20:38:36] if you aren't running anything I can try
[20:40:14] the thing is, i'm not really sure what this is going to do
[20:40:15] or how to test it
[20:40:24] other than hope that it shows up at ganglia.wikimedia.org
[20:43:56] i don't think we need gmond
[20:44:06] ah drdee, I just restarted namenode and resourcemanager on an01, i think I did it before you started your latest job though
[20:44:18] you know, you should run on a smaller sample set
[20:44:20] we need to create hadoop-metrics2.properties
[20:44:20] til you get it right
[20:44:22] maybe just one log file?
[20:44:26] metrics2?
[20:44:31] yup
[20:44:34] i just edited hadoop-metrics.properties
[20:44:35] why 2?
[20:44:48] because it's v2
[20:44:54] v2 of what?
[20:45:01] of the hadoop metrics stuff
[20:45:10] not backwards compatible with 1 (AFAIK)
[20:45:25] pig job seems to be running
[20:45:33] yeah
[20:46:02] did you have a look at the pdf file i pasted?
[20:46:10] or i mean the link that i pasted
[20:46:19] http://files.cloudera.com/pdf/Hadoop-Operations_sampler_2012-08.pdf
[20:46:55] no didn't see that
[20:49:31] see https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/conf/hadoop-metrics2.properties
[20:49:37] as example conf file
[20:50:31] i think we run ganglia 3.1
[20:50:37] but you might want to check that
[20:50:42] yea think so too
[20:50:54] the gmetad version is later than that at least
[20:53:14] brb guys
[20:54:30] yeah i'm not sure what our ganglia server is
[20:56:38] according to banisher nickel
[20:56:44] banisher == binasher
[20:56:48] stupid autocorrect
[20:58:07] wat?
[20:58:08] tell me
[20:58:12] according to him whaaaat?
[21:04:27] drdee do you know of a good list of wikipedia languages?
[21:04:33] i can't seem to find a canonical list
[21:04:46] other than a wikipedia page which I am reluctant to parse
[21:06:05] ottomata: where can you get GeoIPCity.dat?
[21:07:25] i think there is a package you can install
[21:07:52] ja
[21:07:53] geoip-database
[21:09:37] louisdang: ubuntu package
[21:09:46] louisdang: libgeoip-dev
[21:09:48] something like that
[21:09:51] ok thanks
[21:10:07] louisdang: I think you can also get it from maxmind's website
[21:10:36] erosen: gimme the page, I'll write you a oneliner to parse that
[21:10:53] hehe
[21:11:03] if you really want to, it is http://meta.wikimedia.org/wiki/List_of_Wikipedias
[21:11:52] and I just need the "wiki" column from all of the tables (that is, for any number of articles)
[21:12:03] erosen: so you want a list like "en,fr,de,ru,pt,br etc" ?
[21:12:09] yeah
[21:19:38] erosen: here you go :)
[21:19:39] erosen: curl http://meta.wikimedia.org/wiki/List_of_Wikipedias 2>/dev/null | perl -ne 'm|wikipedia.org/wiki/" class="extiw" title=".*:">(.*?)| && print "$1\n"' | sort | uniq
[21:19:43] done
[21:19:51] awesome
[21:19:54] i appreciate it
[21:21:02] no problemo :) here's the result of running that https://gist.github.com/61971fa0a9912b028f71
[21:22:07] great
[21:22:13] hopefully you enjoyed that
[21:26:39] yea no problem
[21:27:09] * average_drifter goes back to researching debianization processes
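As a follow-up to the hadoop-metrics2.properties discussion above, a minimal sketch of what a Ganglia 3.1 sink configuration could look like. The aggregator host/port, file path, and daemon prefixes are assumptions, not the actual WMF values; only the sink class and property names follow the stock example config linked above.

    # hypothetical sketch; point the *.servers lines at the real gmond/aggregator
    cat > /etc/hadoop/conf/hadoop-metrics2.properties <<'EOF'
    # send Hadoop metrics to a Ganglia 3.1+ endpoint every 10 seconds
    *.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
    *.sink.ganglia.period=10

    # per-daemon endpoints (host:port of the gmond or multicast group)
    namenode.sink.ganglia.servers=ganglia-aggregator.example:8649
    datanode.sink.ganglia.servers=ganglia-aggregator.example:8649
    resourcemanager.sink.ganglia.servers=ganglia-aggregator.example:8649
    nodemanager.sink.ganglia.servers=ganglia-aggregator.example:8649
    EOF
    # restart the Hadoop daemons afterwards so they pick up the new sink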