[01:15:29] drdee: when is the august report card coming out? http://reportcard.wmflabs.org/ [02:03:26] the august report is out, except it still says july [02:12:21] ok - https://wikimediafoundation.org/w/index.php?title=Template:Reports-en&diff=84365&oldid=83946 [02:12:50] (although strictly speaking http://reportcard.wmflabs.org/graphs/pageviews_mobile_target and http://reportcard.wmflabs.org/graphs/active_editors_target are still not quite up to date ;) [02:45:47] you have to poke howie for those numbers, we don't have them [12:38:39] morning average_drifter [12:45:36] hey drdee [12:46:41] yoyoyoyo [12:48:44] one more small feature we need: collector should have a command line option to specify where to store the output :( [12:49:32] alright [12:49:33] (applies primarily to where to store the berkeley db's) [12:50:30] is the git tutorial today? [12:50:39] yes [12:50:42] where? [12:50:56] #git-gerrit [12:51:03] now? [12:51:28] http://bit.ly/US19bQ [12:51:32] this should resolve to your local hour [12:51:54] ty [12:52:09] you're welcome :) [12:52:35] allright [12:52:39] what we could do [12:52:41] is to go to labs [12:53:01] fetch your recoded udp-filter stuff in a new branch [12:53:11] and start building the debian package to see if it works [12:53:18] meanwhile we can wait for ottomata to do his review [12:53:23] one approved, we mege [12:53:24]  [12:53:27] we merge [12:53:33] and then we also have the debian script ready [12:53:36] good plan? [12:56:50] yes [12:57:19] ok so I have this on my list [12:57:36] 1) merge changes for run.sh from /25408 we talked about yesterday [12:57:47] 2) add collector output directory switch for cmdline [12:58:06] 3) debianize on labs [12:58:26] 4) go through review of Diederik of wikistats and make a new git review [12:58:33] please feel free to re-prioritize [12:59:07] excellent! [12:59:12] and i will start with 3) [12:59:23] we can do that now, shared screen [12:59:23] ? [13:00:51] hold on, let me first make one phone call [13:00:57] ok [13:20:16] ready [13:52:17] morning! [13:52:22] drdee, it looks like the figured people might have found their problem [13:52:38] morning [13:52:47] yes 404's due to caching issues, right? [13:52:55] yeah [13:53:15] so. i was going to get up and run the pig stuff this morning (i was having a problem, was goign to figure it out) [13:53:16] but now... [13:53:17] hm [13:53:21] we should be able to confirm that quickly by counting the 404's :D [13:53:29] aye hm [13:53:45] ohh and for 2011 the fundraising campaign used both B11 and C11 [13:53:56] that's why the count was so low [13:54:18] well, also they weren't really running in it september [13:54:23] there are def way more B11s in nov [13:54:45] k [13:55:18] ottomata: I just looked on the code review [13:55:29] ja? [13:55:41] ottomata: yes, the match_interal_traffic is splitting stuff, putting '\0' stuff in the string to split it into parts [13:55:58] ottomata: so I don't want to affect what happens after that [13:56:04] ottomata: so that's why I'm creating a copy [13:56:26] aye cool [13:56:36] average_drifter: did you add sufficient checks to prevent buffer overflows [13:56:37] tis merged! 
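[Editor's note: a rough sketch of the "fetch the patch set into a new branch and try the Debian build" step planned above. The gerrit change number for the recoded udp-filter is never stated in the log, so it is left as a placeholder, and the build flags are just the usual unsigned-build options, not anything specified here.]

```bash
# fetch the patch set into its own local branch (git-review's -d does the fetch + checkout)
git review -d NNNNN        # NNNNN = the udp-filter change number, not given in the log

# then try building the Debian package from the repo's debian/ directory to see if it works
dpkg-buildpackage -us -uc -b
```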
[13:57:05] still building meld [13:57:06] drdee: I should do strncmp wherever needed instead of strcmp and also strncpy instead of strcpy [13:57:14] yes, good idea [13:58:31] I watched this explanation about buffer overflows recently https://vimeo.com/22550600 [13:58:48] it does explain how they can be crafted but not very much about preventing them [13:59:18] ty [13:59:20] it has voiceover so the guy goes through all the steps, it was interesting to see that [14:12:26] just wanted to say that in that video at 12:18 he shows the actual shellcode used [14:12:35] the vid is 18m so it's quite long [14:15:11] the shellcode there was probably crafted using nasm(or some other assembler) and hexdump to show the bytecode [14:17:18] very interesting [14:17:29] still installing meld [14:17:37] dear lord that has many dependencies [14:19:02] it does but it is worth it (if installation is succesful :) [14:24:15] average_drifter... [14:24:27] so i just fetched the patch set into a new branch [14:24:34] cool [14:24:47] mergetool [14:25:54] are yoy watching? [14:25:56] yes [14:26:11] show me :) [14:26:20] a new meld will be fired up for each file to be merged [14:27:02] soo uh [14:27:07] can you open a console please ? [14:29:48] that was the wrong direction [14:29:53] now we have the older version [14:29:54] anyways it's fine [14:30:03] wait [14:30:06] so that's just one file [14:30:18] but there were multiple files which had to be merged [14:31:13] yes [14:31:15] hold on [15:28:24] changing locs, be back in a bit [16:03:31] ottomata; is all the fundraising data loaded in kraken [16:03:33] ? [16:09:10] * milimetric is grabbing lunch [16:31:13] back [16:32:42] me oto [16:32:50] well [16:32:51] me too [16:32:52] and [16:32:53] me otto [16:43:41] and me toooo [16:50:31] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 [17:23:02] drdee, ready? [17:23:11] yeah [17:23:12] I got https://gerrit.wikimedia.org/r/#/c/26474/ through https://gerrit.wikimedia.org/r/#/c/26480/ [17:23:45] yep [17:24:06] so you can accept all of them except one [17:24:13] (i'd go with one of the ones where i push a file) [17:26:10] ok [17:26:37] i will do a -1 on https://gerrit.wikimedia.org/r/#/c/26475/ [17:27:17] k, cool [17:27:35] ok all have been accepter except for 475 [17:27:45] so basically all that means for me is that this commit wasn't merged to origin/master yes? [17:28:05] and the other ones are not merged either [17:28:08] hang on, gotta restart because my sip client sux [17:28:12] oh really? [17:28:13] ok [17:28:23] because they are dependent on each other [17:28:23] brb [17:28:32] i gotcha [17:34:41] ooo, drdee: [17:34:42] https://github.com/linkedin/datafu [17:36:26] cool cool cool [17:38:21] drdee, i'm back but i'll try to listen to the call [17:39:02] this seems way too convoluted to me [17:39:07] i am out [17:39:16] and getting coffee :) [17:39:41] i'm gonna skip the gerrit thing after all [17:44:04] bah! [17:44:04] http://www.cloudera.com/blog/2012/10/cdh4-1-now-released/ [17:44:11] the day I installed the first version! [17:44:11] hehe [17:44:40] whoaaa [17:44:41] Quorum based storage – Quorum-based Storage for HDFS provides the ability for HDFS to store its own NameNode edit logs, allowing you to run a highly available NameNode without external storage or custom fencing. [17:44:43] nice! [17:45:43] ncie [17:46:12] yeah super nice! [17:46:23] this was the main reason we wanted DSE in the first place [17:46:30] now that CDH has it [17:46:32] coooool! [17:54:18] drdee - good call. 
I just wasted 30 minutes of my life connecting to whatever the heck sip is [17:54:26] epic fail [17:56:16] drdee when you get back ping me so I can ammend the commit properly and git review again [18:00:03] back [18:01:52] milimetric ^^ [18:04:24] cool, so git commit -a --amend [18:04:30] I'm trying to figure out how to pass it the specific commit [18:04:48] yes that's the crucial part [18:05:19] maybe git rebase ^ --interactive [18:07:21] http://git-scm.com/book/ch6-1.html [18:07:22] reading the docs on that now [18:10:09] drdee - I don't understand rebase, the docs are like 10 pages [18:10:21] reading (this may take a while) [18:17:00] GREAT walkthrough of git rebase further down the page: http://blog.jacius.info/2008/6/22/git-tip-fix-a-mistake-in-a-previous-commit/ [18:44:02] drdee, wanna help me figure out why my pig stuff isn't working? [18:44:10] most certainly [18:44:45] ok, so the comments here are helpful [18:44:46] https://github.com/mozilla-metrics/akela/blob/master/src/main/java/com/mozilla/pig/eval/geoip/GeoIpLookup.java [18:44:55] on how to use the GeoIPLookup thing [18:44:57] from akela [18:45:25] here's my error: [18:45:25] ERROR 1200: Pig script failed to parse: [18:45:26] Failed to generate logical plan. Nested exception: java.lang.RuntimeException: could not instantiate 'com.mozilla.pig.eval.geoip.GeoIpLookup' with arguments '[GeoIPCity.dat]' [18:46:45] yep, got it and what is your command line? [18:47:26] https://gist.github.com/3828956 [18:47:37] i've tried several different ways of passing the .dat file [18:47:38] in DSE [18:47:44] it worked when I specified the local filesystem path [18:48:01] /usr/share/GeoIP/GeoIPCity.dat [18:48:08] but I think things have changed in the newer version of pig i'm using [18:48:13] like the instructions there say [18:48:22] it expects it to be in my /user/otto/GeoIPCity.dat folder [18:48:24] and it is [18:50:02] if I use the local filesystem path [18:50:06] I get much farther in the script [18:50:09] all the way to the DUMP command [18:50:26] so the script will parse with /usr/share/GeoIP/GeoIPCity.dat [18:50:30] and the job will start [18:50:32] k [18:50:34] but it dies [18:50:40] :[ [18:50:45] 2012-10-03 18:49:23,588 [Thread-3] ERROR org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:otto (auth:SIMPLE) cause:java.io.FileNotFoundException: File does not exist: /usr/share/GeoIP/GeoIPCity.dat [18:50:45] 2012-10-03 18:49:23,588 [Thread-3] INFO org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob - Job2819459054452218790.jar got an error while submitting [18:50:45] java.io.FileNotFoundException: File does not exist: /usr/share/GeoIP/GeoIPCity.dat [18:51:11] is it looking locally or on HDFS? [18:51:30] i don't know, on the DSE pig stuff [18:51:32] it was def local [18:51:34] that's how I got it to work [18:51:47] but, those docs say it looks in HDFS [18:51:53] try putting in HDFS? [18:51:59] yeah tried that too [18:52:35] if I use an HDFS or relative path [18:52:40] the script will not parse [18:52:47] if I use an absolute local filesystem path [18:52:50] it parses, but the job dies [18:53:22] strange.... [18:53:40] what version of pig had DSE and what version does cDH4 have? [18:53:56] we are currently running Apache Pig version 0.9.2-cdh4.0.1 (rexported) [18:53:58] and dse? ummm [18:54:04] i dunno [18:56:43] i can't really tell without having it installed somewhere... 
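[Editor's note: a minimal sanity-check sketch for the GeoIpLookup problem being debugged above, assuming (per the akela comments quoted later) that the UDF expects the file's basename in the HDFS home directory. Paths are the ones mentioned in channel; the relative destination and the version check are just the obvious first things to verify, not steps actually run here.]

```bash
# confirm the .dat file is in the HDFS home dir under the exact basename passed to the UDF
hadoop fs -put /usr/share/GeoIP/GeoIPCity.dat GeoIPCity.dat   # relative path lands in /user/otto/
hadoop fs -ls GeoIPCity.dat

# confirm which Pig version this cluster runs, since DSE and CDH4 apparently differ
pig -version
```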
[18:58:14] :) [18:58:20] i am googling [18:58:22] but no luck soy ar [18:58:31] i am peeking into DSE .debs [18:58:34] peaking* [18:58:37] peeking peaking [18:58:38] peek [18:58:39] peak [18:58:40] peek [18:58:51] peek [18:58:53] . [18:59:28] ./usr/share/dse/pig/lib/pig-0.8.3.jar [18:59:52] and from the commit message in akela: [18:59:53] https://github.com/mozilla-metrics/akela/commit/432c02a153789c28409902863aa1b8dec5be065f [18:59:55] oops [19:00:00] https://github.com/mozilla-metrics/akela/commit/432c02a153789c28409902863aa1b8dec5be065f [19:00:03] ack! [19:00:09] Add the getCacheFiles method from Pig 0.9 [19:03:59] yup [19:04:05] that looks like the culprit [19:10:27] ottomata: http://www.jarvana.com/jarvana/view/org/dspace/dependencies/dspace-geoip/1.2.3/dspace-geoip-1.2.3-javadoc.jar!/com/maxmind/geoip/LookupService.html [19:11:02] if we would add this line: [19:11:03] String fileName = getClass().getResource("/GeoIP.dat").toExternalForm().substring(6); [19:11:25] to line 64 in src/main/java/com/mozilla/pig/eval/geoip/GeoIpLookup.java [19:11:28] would that solve it? [19:14:44] or [19:14:46] @param filename Basename of the GeoIP Database file. Should be located in your home dir in HDFS [19:14:55] did you put it in your home dir on HDFS? [19:14:58] yes [19:15:09] grumble [19:16:12] did you put GeoIPCity.dat in your home folder or GeoIP.dat? [19:16:16] City [19:16:19] that is weird though [19:16:24] try the other one [19:16:27] the docs don't actually seem to do what the cdoe looks like it does [19:16:32] :) [19:16:36] no that won't make a difference, I'm passing the argument in [19:16:42] see ok [19:16:47] lookupService = new LookupService(lookupFilename); [19:16:56] right? that comes from directly what I pass in [19:16:57] but [19:17:00] cacheFiles.add(lookupFilename + "#" + lookupFilename); [19:17:23] i find this weird [19:17:24] acheFiles.add(lookupFilename + "#" + lookupFilename); [19:17:26] that is supposed to tell pig that it should 'cache' the given filename from the local filesystem (i'm pretty sure') and name it with the value that comes after the # [19:17:37] but I don't thikn I need to cache this file [19:17:42] since it is already deployed on all the machines [19:17:51] or, hmmm [19:17:52] haah [19:17:53] i bet [19:17:57] if I make a local directory [19:18:09] /user/local/GeoIPCity.dat [19:18:10] it might work [19:18:15] that is dumb, but lemme try [19:19:23] grr nope [19:19:25] ould not instantiate 'com.mozilla.pig.eval.geoip.GeoIpLookup' with arguments '[GeoIPCity.dat]' [19:19:34] hmm, ok i'm going to try to edit the java code and see if I can fix it... [19:19:40] don't really know what I'm doing, but mayyybe [19:34:47] ottomata: could you copy oxygen:/a/squid/404.log to kraken? [19:35:42] zat a request from jeff green? [19:35:48] yes [19:36:15] and another simple pig request: count URL's and sort them by frequency [19:36:25] (no geocoding :D ) [19:37:36] hmmmm, would that be more urgent? i'm having trouble with my udf at the moment [19:37:42] think that doesn't require a udf? [19:37:50] nope it doesn't [19:39:05] request URLs? [19:40:11] YES [19:40:21] i mean yes without the capitals [19:42:45] loading 404.log into hadoop now [19:44:00] k [19:50:00] just got in [19:50:04] woo, STUFF [19:50:05] I HAVE STUFF [19:51:32] WHAT KIND OF STUFF? [19:52:27] THE KIND THAT USED TO BE IN STORAGE [19:52:30] LIKE AN AERON [19:52:34] AND A 30" MONITOR [19:52:53] THAT"S AWESOME! 
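[Editor's note: a sketch of the load step mentioned just above, plus a quick local spot check of the "count URLs and sort by frequency" request before the Pig job runs. The oxygen path is the one quoted in channel; the HDFS destination directory and the URL field number are guesses, not confirmed by the log.]

```bash
# copy the log off oxygen and load it into HDFS (destination path is illustrative)
scp oxygen:/a/squid/404.log .
hadoop fs -put 404.log /user/otto/logs/404.log

# quick local frequency count, assuming whitespace-delimited squid lines with the
# request URL in field 9 (field position is an assumption)
awk '{print $9}' 404.log | sort | uniq -c | sort -rn | head -20
```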
[19:54:31] drdee, re 404 uri counts [19:54:37] got it, but there are a lot of lines [19:54:38] yes sir [19:54:42] you want me to truncate? [19:54:50] the uris with fewer hits? [19:55:02] just count every URL [19:55:06] yeah I did that [19:55:18] how many uniques ar ether? [19:55:24] are there? [19:56:02] but the file i7278207s [19:56:03] oops [19:56:04] 7278207 [19:56:29] can haz cdh4? [19:56:34] yessuh you can [19:56:44] sorry, i should have told that you should only count URL's with BannerControl [19:56:50] haha [19:56:52] geez [19:56:56] whoopsie [19:57:35] okay. i desperately need a shower [19:57:38] of this i am certain [19:57:43] as i smell like moving. [19:57:44] oof [19:57:49] thankfully, i do not feel like vom [19:57:53] BannerController [19:57:53] which is a vast improvement over yesterday [19:57:54] ? [19:58:01] brb guys [19:58:22] yes count any url that contains the string BannerControl [19:58:53] BannerController [19:58:56] ler [19:58:58] right? [19:59:17] for example [19:59:17] http://en.wikipedia.org/w/index.php?title=Special:BannerController&cache=/cn.js&303-4 444496 [19:59:50] yep looks good [20:00:27] that's better [20:00:30] 543 uniques [20:00:52] can you email me that list? [20:01:57] https://gist.github.com/3829470 [20:02:07] ty [20:06:35] super cool right! [20:06:55] 14 seconds to process 6.5 GB data! [20:07:03] BAM [20:15:58] ottomata, feel like running another MR job? (this time probably using hadoop streaming and udp-filter)? [20:16:13] using the 1:1 banner impression data [20:17:01] counts for 404 vs 200/302 [20:17:41] and possibly geocoded as well [20:20:16] that sounds exciting! [20:21:40] yeah bring it on, as long as I don't have to UDF geocode [20:21:42] this thing is annoying [20:22:06] can I use the days we talked about? [20:22:11] 9-30 and 10-01? [20:22:15] i loaded those in already [20:22:27] and for that, I do'nt think I need udp-filter [20:22:31] I already have the status as a field in pig [20:22:34] I can just filter on those and count [20:22:35] great [20:22:42] okay let's start with that [20:22:54] ok, so you want two numbers? [20:23:08] 404 count and 200+302 count? [20:23:14] yes as a start and assuming that geocoding is not possible right now [20:23:19] yes [20:23:40] i'm going to get you three numbers, 200 and 302 separate [20:23:41] that'll be easier [20:23:44] and we can just add it afterwards [20:23:49] k [20:29:40] hey dschoon, check this: [20:29:42] http://www.cloudera.com/blog/2012/10/cdh4-1-now-released/ [20:29:46] i saw your email [20:29:48] i mean [20:29:49] yeah! [20:29:56] i agreed with asher [20:30:04] that namenode spof was not that big a deal [20:30:10] it was maintenance that i care about [20:30:12] with dse [20:31:17] aye, thus far it has been a bit harder to set things up [20:31:20] than with DSE [20:31:22] more services to start, etc. [20:33:53] but! [20:33:57] it's nice to have those changes [20:39:31] aye ya [20:49:30] dudes does gerrit just keep emailing you over and over until you fix the review? [20:50:07] because it's doing that to me so I'm trying to figure out if it's singled me out in some sort of hazing the new guy process [20:51:16] it likes you :) [20:53:34] how about this low tech solution: just make another commit on the release branch and submit all of them again? [20:54:20] gerrit is great at helping [20:54:24] see how it's helping you? [20:54:26] no [20:54:26] lol [20:54:29] it's a regular little helper [20:54:38] because then you have a new change set in gerrit [20:54:39] it's just a little... 
dumb [20:54:51] oh really? so what happens to the old one? [20:54:52] i would have said "very dumb" but whatevs [20:55:19] in your particular case that's very annoying because the other 4 commits will not be merged [20:55:20] the old changeset can't be deleted or cancelled by me? [20:55:30] that's why in gerrit you don't want dependent changesets [20:55:57] so i don't think your idea is going to fly with gerrit [20:56:33] nah that's crazy talk, the idea shall fly! [20:56:59] good luck! [20:57:00] I'm just surprised by how smart I have to be to defeat its dumbness [20:57:56] remember: "A fool may ask more questions in an hour than a wise man can answer in seven years." [20:58:14] (here the fool is gerrit) [21:05:41] drdee: since the TCP part is mentioned to not be very stable and stuff in the collector, do we really need it ? [21:05:53] drdee: I mean, maybe it's used by some systems or components [21:06:00] yes, just leave it there [21:06:03] drdee: but since the majority of stuff is geared towards udp [21:06:06] oh alright [21:06:15] the tcp stuff is only for testing IIRC [21:06:20] ok [21:14:09] drdee: ok, I was able to amend via git rebase -i HEAD~2 [21:14:13] BUT [21:14:21] :) [21:14:27] that squashed my two merge commits I think [21:14:28] there is a BUT? [21:14:45] so I did git review and it said it processed the updates [21:14:49] can you check it out? [21:15:15] I'm hoping that it just updated the commits I already submitted as opposed to making a new set because it didn't give me any addresses [21:15:44] :) [21:15:52] * average_drifter smiles [21:15:55] oooh, look at that - fixed https://gerrit.wikimedia.org/r/#/c/26475/ [21:16:50] awsome! [21:16:53] it merged [21:16:57] oh f*** yea [21:17:00] wooooot! [21:17:02] maybe you can write this down :) [21:17:06] yeah totally WOOT WOOT [21:17:19] np, so the only weird part is figuring out what to tell it to rebase to [21:17:24] right [21:17:35] like I did git rebase -i HEAD~2 and that got the last 4 commits!! [21:17:38] wt*? [21:17:48] any idea what that's about? [21:18:04] isn't git rebase -i HEAD~2 only considering the last 2 commits ? [21:18:11] you would THINK so [21:18:25] git is subtle [21:18:28] lol [21:18:43] so yeah, basically I did git rebase -i HEAD~x and varied x until I was happy [21:18:57] k, I'll write it up [21:19:15] so what was x? [21:19:48] 2 [21:19:55] hehe [21:19:57] lol [21:20:48] isn't' there a syntax that just provide the short sha to rebase? [21:22:17] yeah, that breaks horibly in our case [21:22:24] because the SHA is from a different branch [21:22:49] I'm putting this on a wiki so you're welcome to try it and update the HEAD~x part [21:22:50] :) [21:23:37] oo, somehow I filled up an03's HDFS space!! [21:23:47] http://analytics1001.wikimedia.org:50070/dfsnodelist.jsp?whatNodes=LIVE [21:24:26] wow [21:24:34] replication factor? 
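[Editor's note: a condensed version of the rebase-then-review flow that finally merged change 26475 above, the one milimetric says he will write up. This is a sketch, not the exact commands used; the rebase depth is whatever reaches the commit that needs fixing, which is the "varied x until I was happy" part.]

```bash
# mark the commit that needs fixing as "edit" in the interactive rebase;
# HEAD~2 here just means "deep enough to reach it"
git rebase -i HEAD~2

# ...make the fix, then fold it into that commit; keeping the Change-Id line in the
# commit message is what lets gerrit treat this as a new patch set on the existing change
git add -A
git commit --amend
git rebase --continue

# push the amended commit (and anything rebased on top of it) back to gerrit
git review
```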
[21:26:18] yeah 3 [21:26:33] drdee, ugh, it still failed to merge the two commits that got lost in the rebase [21:26:39] the 10 of banner logs that I loaded in from nov 2011 are 2TB [21:26:52] sent you an example [21:27:01] brb [21:27:08] annnd +500Gb for the days you asked me to load today [21:27:23] not sure why just an03 is full, and not others [21:27:31] ok, now I have more of a reason to get those other machines up [21:27:32] :) [21:27:38] i don't really want to delete any of this [21:28:14] but hmm, it might actually interfere with my status counts job [21:28:15] grrrrr [21:28:19] getting failed maps [21:28:55] erggghhhhhhhh, i'm going to remove those 10 days from last year [21:37:13] yeah, hbase would automatically resplit things for you [21:37:19] but i think you have to do it manually with hdfs [21:38:26] this is why most people break up big files into small chunks [21:38:35] but i'm not as familiar with hadoop as i once was [21:38:43] so poke around at the CLI docs [21:39:14] drdee [21:39:18] the status counts job isrunning [21:39:23] i gotta go soon, i'm not sure i'll be back on to check it today [21:39:25] it is here [21:39:26] http://analytics1001.wikimedia.org:8088/proxy/application_1349195921521_0030/mapreduce/job/job_1349195921521_0030 [21:39:35] you can wait til it is done [21:39:37] then check the output at [21:39:51] /user/otto/logs/banner1/status_counts_20120930-20121001 [21:39:55] you should be able to run [21:40:00] hadoop fs -cat /user/otto/logs/banner1/status_counts_20120930-20121001/part* [21:40:02] to see the output [21:40:05] i *think* it will work [21:40:09] Can anyone help me get a web directory on stat1? [21:40:10] sweet [21:40:32] halfak, I would say yes, but I gotta leave now [21:40:40] if you want, send me an email with what you need and I will see what I can do in the morning [21:40:45] otto@wikimedia.org [21:40:51] Will do. Thanks [21:41:58] drdee: should the output dir be created automatically if it isn't there ? [21:42:31] nah, just exit with an explanation [21:42:35] ok [21:42:48] really? [21:42:56] ...why not just create it? [21:43:25] because what happens when you don't have permission? [21:43:35] you still have to exit and explain the error [21:43:38] THEN you exist :P [21:43:41] *exit [21:45:24] it just seems lame to exit when you can easily fix the problem. [21:46:08] drdee, my writeup though you see that email? I don't think the first changeset was resolved [21:46:09] http://www.mediawiki.org/wiki/User:DAndreescu/GitFlowGerrit [21:46:43] dschoon, it is the collector daemon we don't need to make this a shiny piece of software, it just needs to run [21:48:50] hokay [21:48:52] good point [21:54:03] milimetric: yeah so the problem is that gerrit still expects to merge the V1 of the change set but it has become V2 [21:54:11] it almost sounds like a gerrit bug…..[( [21:55:00] ok, so how do I just cancel the change set? [21:55:21] blasphemy! gerrit has no bugs! [21:55:24] :D [21:55:32] you can abandon a changeset [21:55:41] just click the big abandon button [21:55:51] but not sure what happens to the change sets that are dependent on it [21:56:06] oh! i bet this is the voodoo chicken bug. first you need to find a chicken [21:56:28] then you kill it in halal style, and spread the blood out on the gerrit machine [21:56:45] wait, no. i'm thinking of something else. 
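[Editor's note: once the status-counts job above finishes, the "we can just add it afterwards" step for 200+302 can be done straight from the part files. This assumes each output line is a status code and a count separated by whitespace or a tab, the usual layout for Pig STORE output, which is not confirmed in the log.]

```bash
hadoop fs -cat /user/otto/logs/banner1/status_counts_20120930-20121001/part* \
  | awk '$1 == 200 || $1 == 302 { ok += $2 }
         $1 == 404              { nf += $2 }
         END { print "200+302:", ok; print "404:", nf }'
```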
[21:56:46] nm [21:57:21] as you can see milimetric, dschoon is our staunchest defender of gerrit [21:57:30] the flag bearer [21:57:34] ya he seems to really love it [21:57:41] i definitely have never argued that we should abandon it and use github instead [21:57:50] and surely my team rues the day i moved limn there [21:57:58] lol [21:58:55] ok so one way to fix this would be to just start the release branch and then squash your commits into one big commit [21:59:08] do you guys hate that? [21:59:20] yes, but then you have the massive review at the end [21:59:24] most folks at WMF hate that [21:59:54] well it'd still be a massive review at the end unless you do git-review as you go [21:59:56] OOOH! [22:00:02] i personally find it really unfortunate that we rebase at all [22:00:04] it loses history [22:00:14] yeah, i don't like it either [22:00:35] (my oooh was misplaced, nvm) [22:03:16] if we're trying to do a git-review for each commit then I'd think gerrit should commit to develop [22:03:28] gerrit commits?! [22:03:35] holy chirst that would be terrifying [22:04:32] guys guys! [22:04:40] i totally forgot i have this excellent soundsystem! [22:04:43] I AM SO EXCITED [22:04:55] WE ARE ALSO VERY EXCITED FOR YOU!!!! [22:05:04] MOVING IS LIKE CHRISTMAS [22:05:08] YOU FIND SO MUCH STUFF [22:05:14] SOME OF IT YOU EVEN LIKE [22:20:40] dschoon: we are launching too many mappers on that job, we need to increase the HDFS block to 256Mb or maybe even 512 (it is right now 64mb [22:21:13] really? [22:21:23] because quantity of mappers has never been a problem [22:21:29] it just means the shuffle is expensive [22:21:32] oh. [22:21:37] we're low on disk, aren't we? [22:21:48] no if a mapper runs for 5 seconds than that's bad for perofmrance [22:21:55] ehh. [22:22:00] yes, sometimes. [22:22:01] quantity of mappers is a problem [22:22:12] but changing the blocksize means resplitting everything [22:22:16] it should be between 10-100 (depending on mapper) per node [22:22:16] i'd wait until otto is back [22:22:24] of course, i am just saying [22:22:34] yeah, 40/node was roughly what the original paper said, iirc [22:23:16] we have launched about 2500 mappers right now and we are half way the job [22:23:23] eh. [22:23:29] i agree a bigger block would help. [22:23:39] 256 sounds good [22:23:45] we also need more space :( [22:24:37] most definitely [22:25:04] hdd tshirt! [22:26:47] yes [22:26:56] you should finish the kraken logo [22:27:03] i should [22:27:18] but i havne't come up with a design i'm satisfied with [22:47:32] drdee: https://gerrit.wikimedia.org/r/26554 [22:47:34] drdee: please review [22:51:12] average_drifter: done, https://gerrit.wikimedia.org/r/#/c/26554/ [22:51:22] drdee: wow that was fast [22:51:57] drdee: yes you're right [22:52:09] drdee: thanks [22:53:45] drdee: sorry about the gerrit crapstorm. I'll let it be for now and try to tackle it again after I've used it a bit more [22:54:12] i plan on sticking with git flow [22:54:34] k, i'm off to dinner - nite [22:54:39] laterz [23:01:10] dschoon: are you going to watch the presidential debates tonight? [23:02:21] is that tonight? [23:02:22] i should. [23:02:42] yup
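[Editor's note: on the block-size change discussed above (64 MB to 256 MB to cut the number of short-lived mappers): the block size only applies to newly written files, so data already in HDFS would need to be rewritten, which is the "resplitting everything" caveat raised in the log. A sketch using the Hadoop 2.x / CDH4 property name; file and directory names are illustrative.]

```bash
# write new files with 256 MB blocks (268435456 bytes); dfs.blocksize is the Hadoop 2.x
# property name, and a cluster-wide default would go in hdfs-site.xml
hadoop fs -D dfs.blocksize=268435456 -put banner-impressions.log /user/otto/logs/banner1/

# existing files keep their old 64 MB blocks until rewritten, e.g. copied with distcp
hadoop distcp -D dfs.blocksize=268435456 /user/otto/logs/banner1 /user/otto/logs/banner1-256m
```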