[13:49:16] hey ottomata
[13:51:14] yoyoooo
[13:54:24] drdee, whatcha think of parse.ly?
[14:20:51] drdee: morning
[14:20:54] afternoon
[14:21:04] hey average_drifter
[14:21:08] somewhere in between
[14:21:29] today is a US / Canadian holiday so it will be quiet
[14:26:40] average_drifter: which variable in wikistats (SquidCountArchiveProcessLogRecord.pm) counts the overall number of pageviews?
[15:51:06] it is cooooold in here, i'm going to find a cafe
[15:51:14] be back in a little bit
[15:51:49] laterz
[16:26:37] average_drifter: ^^
[17:23:07] Hi drdee, is there any work for me?
[17:23:29] hey louisdang,
[17:23:38] do you have a working hadoop / pig setup in labs?
[17:23:46] yes
[17:24:10] ottomata ^^
[17:24:22] I cloned the kraken repo from github too
[17:24:26] ok
[17:24:44] louisdang, 1 sec
[17:30:21] louisdang,
[17:30:28] how about downloading this dataset: http://waxy.org/bt/seed/star_wars_kid_logs.zip.torrent
[17:30:28] yes
[17:30:34] ok
[17:31:04] that should give you a good start with testing your pig scripts
[17:31:10] check also http://www.quora.com/Are-there-any-free-large-datasets-in-the-format-of-an-Apache-access-log
[17:31:15] for other datasets
[17:32:18] what metrics should I pull
[17:32:30] so the first script you could work on is: geocode the traffic and count page views by day by country
[17:34:52] louisdang, also check https://github.com/mozilla-metrics/akela
[17:35:09] ok
[17:37:17] mozilla wrote a geocoding pig UDF there, but there might be better ways to do it
[17:48:26] ottomata, feel like changing the block size of HDFS?
[17:48:31] it's not hard, http://hadoop.apache.org/docs/r0.19.2/distcp.html
[17:48:46] "You can change the block size of existing files with a command like hadoop distcp -Ddfs.block.size=$[256*1024*1024] /path/to/inputdata /path/to/inputdata-with-largeblocks. After this command completes, you can remove the original data."
[17:49:13] all the data is in /user/otto/
[17:49:14] ?
[17:49:17] yeah
[17:49:19] that's fine
[17:49:30] i'm puppetizing mongodb right now
[17:49:32] ok, i will do that, but first a stroll with the lady :)
[17:49:32] you are welcome to do that
[17:49:36] oook
[17:49:53] also, lemme know the confs you want to change, and I will puppetize them
[17:50:19] at the end of today, it will be core-site.xml, mapred.xml
[17:50:28] and a new file, hadoop-metrics2
[17:50:33] or something like that
[17:50:46] can i just create a new file in the puppet folder?
[17:52:14] and can you add me as an admin to wmf-analytics on github?
[17:54:33] and can you set HDFS block size to 256MB (i will fix it then for the existing files)
[17:56:40] well, i'd rather not make any changes until you are done with your changes, since I am having you edit the cloned puppet repo directly
[17:56:50] but, puppet is setting that too
[17:56:56] i think in hdfs.xml
[17:56:58] or whatever it is
[17:57:02] you can change it there the same way you have been
[17:57:06] with the others
[17:57:12] hadoop-metrics2
[17:57:12] ?
[17:57:14] what's that?
[17:58:30] drdee, what's your github username?
[17:58:43] ah dvanliere
[17:58:43] ?
[18:20:36] grabbing food, brb
[18:56:04] baaack
[19:03:44] drdee, you still running stuff?
[19:11:53] back as well
[19:12:06] github username dvanliere
[19:12:23] ottomata ^^
[19:12:34] not running anything right now but soon i will
[19:12:35] :)
[19:13:17] ok added you
[19:13:53] ty
[19:14:27] i am gonna enable ganglia monitoring on hadoop, so you get charts with number of mappers & reducers and stuff like that, okay with you?
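
A bare-bones sketch of the Pig script discussed above (geocode the traffic, then count page views by day by country) could look something like the following. Everything in it is a placeholder for illustration: the jar name, UDF class, input path, and field layout are assumptions, and splitting an Apache access log on spaces is a simplification of real log parsing.

    -- sketch only: jar, UDF class, input path and schema are hypothetical
    REGISTER 'geo-udfs.jar';                          -- stand-in for whatever jar holds the GeoIP UDF
    DEFINE GeoCountry org.example.pig.GeoCountry();   -- hypothetical UDF: ip -> country code

    -- crude parse of an Apache access log split on spaces;
    -- in common log format, field 3 looks like "[08/Oct/2012:17:30:28"
    logs    = LOAD 'star_wars_kid_logs' USING PigStorage(' ')
              AS (ip:chararray, ident:chararray, user:chararray, time:chararray);
    views   = FOREACH logs GENERATE SUBSTRING(time, 1, 12) AS day, GeoCountry(ip) AS country;
    grouped = GROUP views BY (day, country);
    counts  = FOREACH grouped GENERATE FLATTEN(group) AS (day, country), COUNT(views) AS page_views;
    STORE counts INTO 'pageviews_by_day_by_country';

Mozilla's akela (linked above) is one option for the GeoCountry step, but as noted in the conversation there might be better ways to do it.
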
[19:14:56] so this is hadoop specific monitoring, not just the hardware/os stuff
[19:20:35] hmm, drdee
[19:20:36] [Fatal Error] mapred-site.xml:41:5: The element type "name" must be terminated by the matching end-tag "</name>".
[19:20:38] ok
[19:20:41] ok
[19:20:43] whoops
[19:21:00] i hadn't finished yet
[19:21:51] hehe
[19:21:55] i can't hadoop fs -ls
[19:23:07] ok i am gonna run puppet now, okay?
[19:24:46] oh yeah sorry, ok cool
[19:24:51] btw i am running quick pig stuff
[19:24:54] okay, puppet ran
[19:24:54] nothing big, just checking some changes
[19:25:01] restarting hadoop
[19:25:21] ok
[19:26:39] hadoop is coming back online right now
[19:27:00] i can do hadoop fs -ls /
[19:29:23] yeah its cool now
[19:29:31] btw, lemme know when you are about to run a benchmark
[19:29:38] so my pig stuff doesn't interfere
[19:33:19] mmmmm
[19:33:41] are you running something?
[19:34:59] yeah
[19:35:05] but it isn't working
[19:35:06] ERROR 2997: Unable to recreate exception from backed error: AttemptID:attempt_1349724329215_0006_m_000000_3 Info:Container [pid=12079,containerID=container_1349724329215_0006_01_000008] is running beyond virtual memory limits. Current usage: 31.0mb of 1.0gb physical memory used; 30.7gb of 2.1gb virtual memory used. Killing container.
[19:35:09] are you running something?
[19:36:08] i am getting:
[19:36:09] Error reading task output Server returned HTTP response code: 400 for URL: http://analytics1004:8080/tasklog?plaintext=true&attemptid=attempt_1349724329215_0005_m_000000_2&filter=stdout
[19:36:16] not running anything
[19:37:12] right, it's not running right now
[19:38:08] restart nodes?
[19:38:24] but i get that error when i launch a job
[19:40:04] did you change the hdfs blocksize?
[19:40:34] no
[19:52:39] drdee, should I not mess with hadoop right now?
[19:52:57] no, please don't
[19:53:00] k
[19:53:43] it is running again :)
[19:53:58] i accidentally used an MRv1 option instead of a yarn option
[19:54:04] that didn't work
[19:54:07] obviously
[19:54:14] right now running my benchmark
[19:55:56] BAM, 23% performance improvement during writing of data
[19:56:08] from 83 seconds to 65 seconds
[19:56:49] ah yeah, mapreduce site probably won't need any changes
[19:57:18] right?
[19:58:02] it does!
[19:58:17] the 23% improvement comes from tweaking mapred-site.xml
[19:58:57] read improvement 27%!
[19:59:33] can you try running your pig script and re-enabling the combine stuff, see if it works now
[19:59:59] (i am not running anything atm)
[20:00:25] ok…well my combine stuff was on the geocoding thing
[20:00:26] ottomata ^^
[20:00:28] i haven't been disabling it on others
[20:00:37] ok
[20:00:38] and i haven't gotten the geocoding to work yet
[20:00:43] k
[20:00:46] but i'll keep messing with pig
[20:00:52] actually, i'm probably taking off in an hour or something
[20:00:56] maybe i'll puppetize your changes?
[20:01:07] yes please do
[20:01:10] ok, which files?
[20:01:18] 1 sec
[20:01:49] etc/puppet.analytics/modules/cdh4/templates/hadoop/mapred-site.xml.erb
[20:02:09] etc/puppet.analytics/modules/cdh4/templates/hadoop/conf-site.xml.erb
[20:03:36] or is it just git commit -a?
[20:04:26] coz i can do that ;)
[20:05:06] naw,
[20:05:13] because they need to be templatized
[20:06:04] ok
[20:06:59] your pig scripts should also run faster now, curious to hear if you see improvements
[20:11:23] btw, drdee,
[20:11:28] how did you choose these numbers?
[20:11:30] like
[20:11:34] map/reduce tasks maximum?
[20:11:39] are they dependent on the number of nodes?
[20:11:51] (number_of_cores/2)-2
[20:12:02] these are cisco-dependent settings
[20:12:23] once the dell machines are there we should calculate that based on the hardware of the machines
[20:12:26] cisco has 24 cores
[20:12:32] (24/2)-2
[20:12:52] i can code that to be automatic in puppet
[20:12:53] the -2 is because you want a process for the name node, and other stuff, else it can become unresponsive
[20:12:54] based on cores
[20:13:02] yup that would be very cool
[20:13:23] this way you have a max total of 22 slots
[20:29:28] oh drdee, do you want me to up the default block size?
[20:29:36] yes please,
[20:29:43] i think 256mb would be good
[20:30:10] hokay
[20:30:44] louisdang, you got the data loaded into hadoop?
[20:31:31] been trying to fix latency issues with ssh
[20:32:15] it's just slow….
[20:32:17] drdee
[20:32:20] 24/2 − 2 is 10
[20:32:20] not 11
[20:32:22] zat ok?
[20:32:55] sorry it's 11
[20:33:05] it's (24-2)/2
[20:33:27] else you keep 4 cores standby
[20:33:29] oh
[20:33:32] sorry
[20:33:34] k
[20:34:02] louisdang: if you have any tips on reducing latency, then please share them ;)
[20:37:17] k, drdee, just restarted hadoop with that, tis all puppetized
[20:37:28] thanks so much!
[20:37:31] also, now that block size is 256MB, maybe you want to run your benchmark one more time?
[20:37:39] yeah totally
[20:37:53] also i made some notes of things to do but we can discuss that tomorrow
[20:38:07] ok
[20:38:32] btw, if you are interested
[20:38:39] you can see how I am configuring things here:
[20:38:39] http://git.less.ly/?p=kraken-puppet.git;a=blob;f=manifests/site.pp;h=4d8a9e79a6118e92cd3a13a0bda6fbcfb08dfd63;hb=HEAD#l45
[20:46:02] how'd that benchmark go?
[20:46:04] any diff?
[20:46:52] same speed
[20:46:57] aye k
[20:47:07] but i know what to do because the IO speed is all over the place
[20:47:13] so i think i can do this better
[20:47:45] oh?
[20:48:11] 12/10/08 20:42:35 INFO fs.TestDFSIO: Total MBytes processed: 10000.0
[20:48:12] 12/10/08 20:42:35 INFO fs.TestDFSIO: Throughput mb/sec: 69.65631573814798
[20:48:13] 12/10/08 20:42:35 INFO fs.TestDFSIO: Average IO rate mb/sec: 232.24398803710938
[20:48:14] 12/10/08 20:42:35 INFO fs.TestDFSIO: IO rate std deviation: 227.75360937451464
[20:48:32] so the IO std deviation is very high, you want that as close to 0 as possible
[20:53:38] drdee: kinda cool?
[20:53:39] https://github.com/linkedin/datafu/blob/master/src/java/datafu/pig/urls/UserAgentClassify.java
[20:54:33] that looks slow to me
[20:55:18] psshhhhhh you look slow
[20:55:30] :D
[20:58:21] another 14% improvement, bigger hdfs block + more memory for reducers
[20:58:47] hadoop totally needs to warm up
[20:59:01] right after a restart i ran the benchmark: 49 seconds
[20:59:09] same benchmark, second time: 37 seconds
[20:59:41] did you just run the same bench again?
[21:01:17] whem?
[21:01:55] 12/10/08 20:57:26 INFO fs.TestDFSIO: Test exec time sec: 37.542
[21:15:16] ooooooooooook i'm outty
[21:15:18] laataaaas
[22:02:37] drdee: $scripts{"$ext,$file,"} += $count_event
[22:02:49] drdee: I presume the $scripts hash counts per-page view counts
[22:03:08] drdee: but you need overall, right?
[22:03:24] uhm, just summing over all values in $scripts should give an overall pageview count, like
[22:03:34] use List::AllUtils qw/sum/;
[22:03:43] sum(values %$scripts);
[22:03:54] drdee: would this solve the problem ?
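
A minimal, self-contained sketch of the summation average_drifter proposes just above, written against a plain %scripts hash and using core List::Util instead of List::AllUtils. The keys and counts here are invented; the real structure of %scripts in SquidCountArchiveProcessLogRecord.pm may differ, and (as comes up right after this) you may only want to count certain mime types.

    #!/usr/bin/perl
    # Rough sketch: sum all values in %scripts to get an overall pageview count.
    # Keys follow the "$ext,$file," pattern quoted above; the numbers are made up.
    use strict;
    use warnings;
    use List::Util qw(sum);

    my %scripts = (
        'php,index.php,' => 120,
        'php,api.php,'   => 45,
        'php,load.php,'  => 30,
    );

    my $overall_page_views = sum( values %scripts ) // 0;   # 0 if the hash is empty
    print "overall page views: $overall_page_views\n";

Whether summing any one of the hashes is enough depends on the context drdee describes next, which is project-level rather than per-page counting.
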
[22:05:17] drdee: there is another solution
[22:06:15] drdee: actually, considering that these counts are recorded in multiple hashes, you can sum the values of any of them
[22:06:27] drdee: uhm, can you please tell me the context where you want to use this ?
[22:06:55] also, different pages have different mime-types
[22:07:07] maybe you are only interested in the ones which have mimetype text/html
[22:07:18] ?
[22:17:00] hey
[22:17:26] we are trying to replicate the page view count from reportcard.wmflabs.org using the hadoop cluster
[22:17:57] but we need to know the business logic of ex's scripts, else we can't replicate his counts; so i am not talking about individual pages but entire projects
[22:18:57] average_drifter ^^
[22:19:00] yes I'm here
[22:19:06] but let's continue tomorrow :)
[22:19:11] uhm I can dive into the code of reportcard
[22:19:16] really finalize the debian packages ;)
[22:19:19] and then wikistats!
[22:19:24] ohh alright
[22:19:25] you're right
[22:20:18] drdee: are you on tonight ? might have some questions
[22:20:29] i am now online :)
[22:20:33] so better ask me now
[22:20:49] drdee: alright, uhm, I'm gonna go and look at that log thing and fix that
[22:21:05] what log thing?
[22:21:12] drdee: git-dch related
[22:21:19] perfect!
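
For the project-level replication discussed above, one rough Pig starting point might look like this. The input path, the three-column schema, and the bare text/html filter are all assumptions; the actual business logic behind the wikistats counts is exactly what still has to be pinned down before the numbers can be expected to match reportcard.

    -- sketch only: input path and schema are hypothetical
    logs     = LOAD 'sampled_squid_logs' USING PigStorage('\t')
               AS (project:chararray, uri:chararray, mime_type:chararray);
    html     = FILTER logs BY mime_type MATCHES 'text/html.*';   -- keep only text/html responses
    projects = GROUP html BY project;
    counts   = FOREACH projects GENERATE group AS project, COUNT(html) AS page_views;
    STORE counts INTO 'pageviews_per_project';

Since COUNT is algebraic, Pig can apply the combiner to the per-project tally, which ties back to the "combine stuff" mentioned earlier in the conversation.
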