[13:32:39] hiyaa ottomata, average_drifter
[13:33:09] hey drdee
[13:33:46] http://garage-coding.com:8010/one_line_per_build <-- this be buildbot
[13:35:04] nice! for which projects are you using it?
[13:35:09] morning!
[13:35:31] how was your weekend?
[13:36:05] drdee: I'm using it for 2 personal projects (which are on github) and Mediawiki::Bot from CPAN. I am thinking about using it for wikistats as well, as I'm solving bugs on it
[13:36:20] ok
[13:37:18] good weekend! sorry I missed the bit about oxygen not being happy
[13:37:21] it looks ok atm
[13:37:25] buildbot is very very cool, it has a master and slaves. and you can easily set it up for multiple different architectures, different compiler versions and different operating systems (vagrant can solve the problem of quickly getting vms up for that purpose)
[13:37:30] ottomata, so about the pig script
[13:37:58] it seems that in some cases the http_status is mixed up with the response time field
[13:38:11] oh weird
[13:38:14] hm
[13:38:16] and then the pig script takes the first three digits of the response time
[13:38:21] can you show me a couple of cases?
[13:38:36] check the current output in hadoop
[13:38:46] ?
[13:38:57] user/diederik/mobile/part-r-00000
[13:39:30] \/
[13:39:36] /
[13:39:46] aye ok, ok, but do you know of a line where it is like that?
[13:39:49] what source data is that from?
[13:39:56] i don't know in which file it happens (yet)
[13:40:04] that is from all sampled?
[13:40:07] yes
[13:40:26] hmmm
[13:40:31] but it's still weird because you would expect that erikz would have encountered this problem as well
[13:40:45] well, hm, yeah, maybe he did and manually filtered it out
[13:40:53] and maybe it only happens in a small set of the data
[13:40:55] let's try to find it!
[13:41:20] he would have told me, and i have looked at wikistats often enough and never seen any code about this
[13:42:17] we can look at the output on hdfs and get one of the response times
[13:42:24] and grep for that response time in the sampled files
[13:42:34] response time should be quite unique
[13:43:10] right yeah, but hmm
[13:43:22] i'd like to find the sampled file
[13:43:34] not sure if there is a way to print the current match's filename in pig
[13:43:38] buuut, i could print the timestamp
[13:43:41] and we could find it
[13:54:26] drdee, are you sure that is response time?
[13:54:49] have you looked at the output time?
[13:54:53] i mean file?
[13:54:54] yes
[13:55:09] what else could it be?
[13:55:20] in the logs i'm looking at
[13:55:26] response time is an integer
[13:57:25] really?
[14:00:05] good morning everyone
[14:00:22] morning
[14:01:17] ok, drdee, trying to find fields that match that pattern in all sampled logs :)
[14:01:18] hehe
[14:01:27] :)
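A concrete version of the grep-the-sampled-files idea above, which also sidesteps the pig-can't-print-filenames problem since grep -l reports the matching file — a rough sketch only; the glob is a placeholder and RESPONSE_TIME stands in for a value copied out of the pig output on hdfs:

    # track one suspect "response time" value back to its source file
    RESPONSE_TIME='123.456'                       # hypothetical value from the pig output
    for f in /path/to/sampled-1000.log-*.gz; do   # placeholder glob for the sampled logs
      # -l prints the filename on a match; then show a few offending lines
      zgrep -l -F "$RESPONSE_TIME" "$f" && zgrep -F "$RESPONSE_TIME" "$f" | head -3
    done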
[14:01:34] good mooooorning milimetric
[14:01:42] morning drdee
[14:02:04] bam bam bam another bean counting day!
[14:02:41] :)
[14:05:16] average_drifter, i think that the webstatscollector build script does not correctly replace the version number yet
[14:05:40] also, when i ran your git2debchange script it kept repeating itself
[14:11:59] ottomata, milimetric, are you in for some serious brain gymnastics?
[14:12:11] always
[14:12:25] 'coz if you are, you should most definitely buy 'causality' by judea pearl
[14:12:55] it's a mathematical theory for reasoning about and establishing causality
[14:13:13] it's wicked, even though i understand like 25%
[14:13:16] like a rigorous version of that article you linked us to a while back
[14:13:22] yes
[14:13:32] that article was the 101 version
[14:13:39] this is the 'bible' so to speak
[14:13:41] cool
[14:13:46] hm, let's see
[14:24:49] mornin kids
[14:25:11] morning dschoon
[14:25:36] milimetric, do you remember what we were working on last week? heh
[14:34:16] ottomata, oxygen is kicking up again
[14:34:48] i think we should disable the 1:1 banner impression filter and run it somewhere else
[14:36:22] looking at it...
[14:36:37] load is fine...
[14:36:54] iowait is low
[14:39:47] restarting udp2log… there is a socat process running, not sure what that is...
[14:40:14] what is it doing?
[14:40:48] multicast stuff?
[14:41:02] is it puppetized?
[14:41:07] /usr/bin/socat UDP-RECV:8419,su=nobody UDP4-DATAGRAM:233.58.59.1:8420,ip-multicast-ttl=10
[14:41:08] i'm not sure...
[14:41:17] not puppetized, i'm just not sure if it is a part of udp2log?
[14:41:23] which machine is 233.58.59.1?
[14:41:40] that is the multicast addy that udp2log will receive on
[14:42:36] k
[14:44:12] who is the owner of the process?
[14:44:15] nobody
[14:44:22] :)
[14:46:07] the socat is taking 80% cpu
[14:46:16] and i am not used to seeing it in the process list on oxygen
[14:46:22] i can't figure out where it came from though
[14:47:14] hmmmmmmm vanadium
[15:00:19] what IS socat?
[15:00:41] kinda like netcat but fancier
[15:00:55] ...is it doing multicast?
[15:01:08] this sounds like something ori-l was talking about on friday
[15:01:08] i'm not sure what it is doing, it can do multicast
[15:01:15] what was he talking about?
[15:01:18] but it was just an idea.
[15:01:53] he was talking about sharding the incoming stream on seq# to different boxes, before applying rules
[15:02:02] but i doubt he'd just muck with production
[15:02:20] so i'm sure it's just coincidence that a mysterious process has appeared, piping data somewhere else.
[15:03:21] yeah
[15:03:26] i can't kill these procs either!
[15:09:02] even with -9?
[15:09:06] they start back up
[15:09:08] yeah so
[15:09:12] init.d?
[15:09:28] or do you think they have some sort of ctl?
[15:09:29] there is a socat receiving on 8419, sending to a udp-readyer python script on vanadium
[15:09:46] welp.
[15:09:50] that's def ori-l
[15:10:07] no init, it's not acting like init, more like a parent proc starting them back up, but I can't find the parent
[15:10:32] same on vanadium
[15:10:35] socat is relentless!
[15:10:35] hehe
[15:10:42] huh
[15:10:46] screen?
[15:10:47] hehe
[15:11:23] don't see one
[15:12:20] i mean, ps should tell you the ppid
[15:12:26] (presuming there is one)
[15:15:17] same as the pid
[15:15:57] (it was worth a shot)
[15:16:00] who owns it?
[15:16:02] yeah i was looking for that too
[15:16:03] nobody
[15:16:11] bleh.
[15:16:18] and lsof won't tell us much
[15:16:27] because all it's doing itself is routing bytes
[15:16:52] who is logged in?
[15:16:57] just me
[15:17:05] i guess check the login history?
[15:17:14] at least that'll tell you where to look for whatever is starting it
[15:17:29] i grepped auth.log for socat and didn't find anything
[15:17:34] hm. if `nobody` owns it, doesn't it have to be started by root?
[15:17:56] would think so yeah
[15:18:12] no cron or anything?
[15:18:16] can't find one
[15:18:29] i mean, there could be a trap/signal handler
[15:18:34] but now we're getting exotic
[15:18:41] yeah but it is just the /usr/bin/socat
[15:18:46] that said: it would behave exactly like there were a monitor process
[15:19:02] hm. the handler would have to reregister after you kill it
[15:19:05] so that can't be it, i guess
[15:19:12] you checked puppet?
[15:19:20] puppet wouldn't be able to do what this is doing
[15:19:23] heh
[15:19:24] it only runs every 30 mins or so
[15:19:26] but i mean
[15:19:33] just for clues
[15:19:36] how it's doing it
[15:19:36] will check
[15:19:41] brb
[15:20:01] ah clues, i found some
[15:20:41] ah ha
[15:20:45] upstart script checked in by asher on friday
[15:22:12] cool, killed the vanadium one
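For the record, the checks that finally paid off here, in sketch form — assuming an upstart system like the one above; the job name is whatever grep turns up, not a known value:

    initctl list | grep -i socat         # is upstart supervising a respawning job?
    grep -rl socat /etc/init/            # find the job definition that was checked in
    ps -o pid,ppid,user,args -C socat    # a ppid of 1 points at init/upstart as the parent
    # a job with the 'respawn' stanza restarts the process as fast as it is killed;
    # 'initctl stop <jobname>' stops it cleanly where 'kill -9' cannot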
[15:27:25] ahh
[15:33:53] ok, so i'm not sure that is the cause, i'm still waiting for the nagios alert to settle
[15:33:56] i'm going to run to a cafe pretty soon
[15:43:59] ok, be back in a bit
[16:13:56] good morning everyone
[16:14:06] hey louisdang
[16:14:26] thanks so much for the pig UDF, i am reading it right now and will give some feedback shortly
[16:14:45] maybe you can quickly explain the purpose of the schema function
[16:15:02] ok, I'll find the link I have about it
[16:16:00] basically, it runs during compile-time to check if your input schema matches whatever you put in the schema function
[16:16:07] k
[16:16:49] http://ofps.oreilly.com/titles/9781449302641/writing_udfs.html for more detail
[16:17:04] ok, i will read that as well
[16:41:34] ottomata, so far not having any luck finding lines that cause issues with the pig script, are you having more luck?
[16:41:47] i am running a match right now on all the files to find offending lines
[16:41:50] so I can examine them
[16:42:11] k
[16:44:47] louisdang, i have two suggestions for your UDF, let me know what you think
[16:44:55] ok
[16:45:27] 1) maybe the return type can be a boolean (true/false) as that simplifies the filtering in the pig script
[16:45:43] er
[16:45:47] before you guys get too far into this
[16:45:53] may i suggest you both check out avro?
[16:46:04] i'd prefer we not roll our own serialization/schema language
[16:46:11] we are not doing that dschoon
[16:46:15] and avro is what most people use.
[16:46:31] "schema validation" is a subset of the things that avro would do.
[16:46:53] unless i'm misunderstanding
[16:46:58] i am afraid you are
[16:47:21] i'll reread.
[16:47:45] the schema validation is part of pig and just makes sure that it receives the right input, there is no serialization
[16:47:53] hm.
[16:48:37] louisdang, 2) i am worried that your current method of identifying an ip4 / ip6 address might be too naive, why not use a regex for both (compile it once) and use that?
[16:49:43] ok drdee, for 1) i was thinking in case there will be a new standard and we would have to differentiate between ip4, ip6 and that new standard
[16:50:05] but I can easily change it to boolean
[16:50:13] that sounds like very far into the future :D
[16:50:53] so maybe have two separate functions, isValidIp4Address() and isValidIp6Address()
[16:51:02] alright
[16:51:29] and make the UDF be any one of them?
[16:51:56] and for 2) if you think the speed tradeoff is worth it I can do that
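A quick way to exercise the proposed boolean UDF locally once it exists — a sketch only; the class name IsValidIp4Address and the kraken.jar filename are assumptions based on the names floated in the discussion:

    # write a tiny input file and a pig script, then run pig in local mode
    cat > /tmp/test_ips.txt <<'DATA'
    192.168.1.1
    2001:db8::1
    not-an-ip
    DATA
    cat > /tmp/test_udf.pig <<'PIG'
    REGISTER kraken.jar;
    DEFINE isValidIp4 org.wikimedia.analytics.kraken.pig.IsValidIp4Address();
    ips = LOAD '/tmp/test_ips.txt' AS (ip:chararray);
    valid = FILTER ips BY isValidIp4(ip);   -- boolean return keeps the filter one-line
    DUMP valid;
    PIG
    pig -x local /tmp/test_udf.pig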
[16:52:28] going to miss hangout today, fyi
[16:52:30] train issues
[16:53:18] drdee: added more tests to git2deblogs, build is green http://garage-coding.com:8010/builders/git2deblogs-builder
[16:53:30] drdee: can you please pull from the github and try again?
[16:53:45] yup
[16:56:15] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[17:01:40] erosen ^^
[17:25:56] ok drdee
[17:25:59] i figured out the problem
[17:26:00] yoyo
[17:26:03] pig stuff
[17:26:10] multiple spaces
[17:26:15] in between fields on some lines
[17:26:17] in lines?
[17:26:21] yeah
[17:26:22] ohhhhhhhh arrrghhhhhh
[17:26:31] the only ones i've seen are from dec 2011, but I haven't looked very hard yet
[17:26:42] just glanced at a few of the matches I've found
[17:26:42] okay so i think we need to bump up the tab delimiter project
[17:26:56] we've got stefan to make the fixes in wikistats
[17:27:06] well i mean, yes
[17:27:07] but
[17:27:11] i don't think it is happening now at all
[17:27:15] it just happened once upon a time
[17:27:59] why would it not happen every time?
[17:28:26] so far, with my pig script, it happens every time
[17:29:04] no, i mean
[17:29:16] the data that is currently being generated doesn't have extra spaces in it
[17:29:18] just old data
[17:29:23] ohhh, ok
[17:29:29] so bumping up the tab delimiter in sources
[17:29:33] is not going to fix your problem here
[17:29:53] i assumed that the problem still exists
[17:30:01] okay, so that means the old files need to be fixed
[17:30:10] so far i've only seen dec 2011 records with the extra spaces
[17:30:14] but i've only looked at a few matches
[17:33:13] it also happens in 2012, with the mime type field
[17:33:28] there is a trailing space after text/html; charset=UTF-8
[17:33:39] this field now causes two issues already :)
[17:38:36] ottomata, i'll write a shell script to go through each line of each file and replace two spaces with a single space
[17:38:48] and then copy the files again to hdfs?
[17:38:50] how about any number of spaces with a single space
[17:38:57] sure
[17:39:02] and that will take forever on its own, how about doing it in hadoop streaming?
[17:39:03] :)
[17:39:45] i don't think it will take forever,
[17:39:47] i think you can do it in pig even!
[17:39:54] no? ha, with all of the sampled logs?
[17:40:13] http://pig.apache.org/docs/r0.9.1/func.html#replace
[17:40:15] but if you want to turn this into a hadoop thing, sure why not
[17:41:18] drdee
[17:41:21] but we probably lose the coupling of one day, one file
[17:41:29] here are the months and counts of lines that match your decimal http_status problem:
[17:41:30] 2011-10 330718
[17:41:30] 2011-11 1034329
[17:41:30] 2011-12 2744652
[17:41:30] 2012-01 807112
[17:41:30] not sure if that's an issue
[17:41:57] it isn't an issue for hadoop, the stuff we are running doesn't care, as long as the files are as big as the block size
[17:42:11] i gotta get some food soon
[17:42:26] i know, but if you just want to run a quick job for july
[17:42:40] then right now you can delimit that using the filenames
[17:42:48] you will lose that
[17:43:24] that's true
[17:43:37] so then a small job becomes a big job
[17:43:49] and so far, most of the queries are for particular timeframes
[17:43:55] that's my worry
[17:44:54] aye
[17:45:25] ack, 15 mins til ops meeting, need food, brb
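The cleanup discussed above, roughly — a minimal sketch; the glob and hdfs destination are placeholders, and the commented pig line shows the same normalization via the REPLACE builtin just linked:

    # collapse any run of spaces to a single space before re-uploading
    for f in sampled-1000.log-2011-12*; do      # placeholder glob for the bad months
      sed -E 's/ +/ /g' "$f" > "${f}.fixed"
    done
    hadoop fs -put *.fixed /placeholder/hdfs/path/   # destination not given in the log
    # equivalent inside a pig script (REPLACE uses java regex semantics):
    #   cleaned = FOREACH logs GENERATE REPLACE(line, ' +', ' ');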
[18:29:58] drdee: https://github.com/louisdang/kraken/tree/master/src/org/wikimedia/analytics/kraken/pig
[18:30:31] I can probably encapsulate both functions into one general regex match UDF later
[18:30:46] hm! my copy of eclipse auto-crashes instantly!
[18:30:50] exciting!
[18:31:00] time to redownload! oh eclipse, how i missed you
[18:31:57] eclipse is the only ide I know, is it bloated/buggy compared to other ides?
[18:32:33] louisdang: looks good, can you create a jar so we can test it in labs?
[18:33:21] drdee: is the export wizard in Eclipse for the jar ok?
[18:33:32] let's try it!
[18:34:08] alright one sec
[18:37:22] drdee, let's try this: https://github.com/downloads/louisdang/kraken/kraken.jar
[18:37:58] hoi erosen
[18:38:02] hoi???
[18:38:06] oh hoi
[18:38:06] :p
[18:38:16] are you in the office?
[18:38:22] I was this morning
[18:38:23] but no longer
[18:38:27] ah k
[18:38:33] what's up?
[18:38:39] drdee, this might help you write a test script: https://gist.github.com/3894251
[18:38:59] I wanted to have a chat about access to DBs after OH'ing a fragment of conversation between you and Jessie last Friday
[18:39:00] ty!
[18:39:13] (i'm working from home, in case anyone cares. as i usually do on mondays.)
[18:39:42] chat later over Skype maybe?
[18:39:46] sounds good
[18:39:52] any time works for me
[18:40:41] great, finishing something and I'll ping you
[18:41:45] cool
[18:44:13] louisdang: i get an error
[18:44:14] https://gist.github.com/a705fce0e608e38b6401
[18:46:46] louisdang, are you compiling with java 7?
[18:46:59] drdee, let me check
[18:47:06] i think you are
[18:47:16] drdee, openjdk 7 on my local machine
[18:47:23] came with Ubuntu
[18:47:26] try compiling with java 6
[18:47:40] ok I think I have to install it first
[18:47:43] pig / hadoop is java 6
[18:47:53] so you can't mix and match that
[18:47:59] ok
[18:48:10] (it doesn't matter for compilation, but for testing, you ought to use sun-jdk-6)
[18:48:45] alright, I'll look into that some time
[18:48:51] great!
[18:49:20] would openjdk6 work? it's kind of a pain to find the sun version on Ubuntu 12.10
[18:49:39] we can try
[18:50:32] alright installing
[18:51:01] drdee, while we're waiting, is there a time we can schedule that Skype call tomorrow?
[18:51:12] totally, send me an iCal invite
[18:51:27] cool ok. You're on the east coast right?
[18:51:32] yup
[18:52:10] do you have any preference? (morning, afternoon, evening)
[18:52:42] firefox has a very annoying bug: when i have a page with the hadoop cluster stuff running (which is a protected page) and then open a tab for a new page, it will ask me for the credentials of this new page even though it's a publicly accessible page
[18:52:47] louisdang: afternoon
[18:55:38] drdee: doesn't matter. the creds are for the proxy, not the page
[18:56:40] apparently foxyproxy will let you selectively apply the tunnel, but i never got it to work with our stuff
[19:00:50] ohhhhh
[19:00:54] but that's still retarded
[19:01:01] but thanks for the tip
[19:01:13] didn't realize it was foxyproxy
[19:04:50] ...no.
[19:04:57] drdee, it is not foxyproxy.
[19:05:14] you have all your traffic going through an HTTP proxy for the cluster
[19:05:23] that proxy has a password
[19:05:35] you have to enter it for *every request* that doesn't already have creds
[19:05:45] right, i only enabled the proxy for cluster traffic
[19:05:55] yeah, well, that didn't work for me at all
[19:06:05] and for me neither ;)
[19:06:20] ok drdee, you can try https://github.com/downloads/louisdang/kraken/kraken.jar
[19:06:24] yup
[19:10:55] louisdang, are you sure you compiled it with Java 6? because i get the exact same error
[19:11:26] make sure that both your JDK and JRE are Java 6
[19:12:04] drdee, alright let's try sun
[19:12:11] ok
[19:12:44] although i doubt this is a sun / openjdk issue
[19:12:50] it seems to be a 6 vs 7 issue
[19:13:04] ok let me check if I uploaded the right jar
[19:15:54] drdee, it looks like I have to make a new project in Eclipse since a simple update-alternatives didn't change the compiler it uses
[19:16:19] can't you just update the build config?
[19:16:29] and specify a different JVM?
[19:17:04] right click properties on the root of your project, and then Build Path or Build Config
[19:25:19] drdee, https://github.com/downloads/louisdang/kraken/kraken.jar
[19:25:37] I updated the build path
[19:31:02] louisdang, i can't download the jar file
[19:34:05] brb foods
[19:36:47] drdee, I tried reuploading a few times but keep on getting file not found
[19:36:53] so I uploaded this: https://github.com/downloads/louisdang/kraken/kraken2.jar
[19:37:24] note that the file name is now kraken2 so just rename it
[19:40:52] tried it, still same error as the original error
[19:41:02] Unsupported major.minor version 51.0
[19:41:12] version 51 maps to Java 7
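Two checks that would have shortened this loop — a sketch; the class name IsValidIp4Address inside the jar is an assumption carried over from the earlier discussion:

    # confirm which bytecode version a jar actually contains:
    # 'major version: 50' means java 6, 51 means java 7
    javap -verbose -classpath kraken.jar \
        org.wikimedia.analytics.kraken.pig.IsValidIp4Address | grep 'major version'
    # compile for java 6 even under a java 7 jdk, sidestepping eclipse entirely:
    javac -source 1.6 -target 1.6 -cp pig.jar src/org/wikimedia/analytics/kraken/pig/*.java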
[19:58:50] drdee, https://github.com/downloads/louisdang/kraken/kraken.jar I've tried changing the compliance level to 1.6, not sure what else to do.
[20:00:51] I also rebooted eclipse before exporting the jar
[20:01:12] louisdang: it is now at least running!
[20:01:26] great!
[20:14:03] drdee, I accidentally scheduled the call for 2 pm PDT but meant it to be 2 pm EDT. Is the time ok for you?
[20:14:25] yes, 2 pm EDT is fine
[20:47:32] louisdang, good news, your UDF seems to work, i ran some tests and so far the categorization works
[20:47:40] great job!
[20:47:59] nice, I was worried about the IPv6 since I didn't have sample data for that
[21:34:21] christ, nothing is intuitive in eclipse
[21:45:20] brb getting a coffee
[21:59:54] mk
[23:06:53] drdee: trying to upload the missing dumps to the datahub despite the 500 errors, ping me in 10-15 from now if you're ready for the data policy draft
[23:13:17] well it's 7:15 pm, i am about to call it a day :)
[23:14:04] later guys!
[23:14:16] k tty tomorrow
[23:14:20] ciao
[23:58:35] brb