[00:08:28] grr [00:08:38] trying to get internet set up here [00:08:40] such a pain [00:10:01] anyone need me for anything? sorry i was all disappeared for a while. milimetric? erosen? [00:10:14] no worries [00:10:14] now that you mention it [00:10:20] i think i did have a question [00:10:28] oo [00:10:33] I fixed my limn branch (it was my changes to the reportcard that I forgot about) [00:10:34] how do you get descriptions to show up on limn [00:11:04] dschoon I'm in a mess with my ssh-ing, I have to move the conf file out of .ssh every time I want to push to gerrit (trying to puzzle it out now) [00:11:17] descriptions of the graph? [00:11:40] like in the json desc field or whatever it is?? [00:12:06] hm. [00:12:16] sorry? what are you asking? [00:12:28] oh. yeah. [00:12:35] it's the description field. [00:12:35] it's in the edit ui [00:12:41] i think it's called "desc" in the JSON [00:14:42] milimetric: i'm down to help with the ssh woes [00:15:09] k, so the symptoms [00:15:28] ...../projects/reportcard-data# git pull [00:15:41] that gives public key error access denied [00:16:04] if I mv ~/.ssh/config ~/.blah then it works [00:16:33] and the opposite is true with ssh kripke and ssh reportcard [00:16:40] (works with the config, not without) [00:28:23] k dschoon, I pushed a new branch on reportcard-data called feature/d3. I'll be working on that so it's parallel to limn and so I don't blow up prod [00:28:32] sweet [00:28:38] i'll update my copy [00:28:46] if you do the same, limn works again and it puzzled out the problem I was having [00:28:55] one sec on ssh [00:29:01] i'm trying to sign up for internet still [00:29:01] yeah, my apologies on that mistake, good think EZ had me do those tests :) [00:29:22] nah, forget it, I just ~/.configon and ~/.configoff [00:29:30] not worth the hours we'd spend [00:29:57] heh [00:29:57] once I type those commands 5000 times then the investment you're about to put in will pay off. Seriously, leave it [00:30:02] yeah, but it's probably just username or something in the file. [00:30:13] if it bothers you personally :) [00:30:17] stick the file on etherpad and i'll at least read it [00:30:55] http://etherpad.wikimedia.org/IlszcZZEEq [00:31:31] i know why. [00:31:42] i think. [00:32:07] i'm going to make some changes [00:32:14] gonna go eat dinner, I'll try them after [00:32:21] like removing the IdentityFile directives, because you only have one key [00:32:33] oh, reminder: I'll be gone tomorrow for jury duty. Forgot to mention today [00:33:17] nope, still breaks with those gone [00:33:33] k, goin to eat. ttyl :) [00:33:35] i also changed [00:33:39] the *.wikimedia.org [00:33:45] entry [00:33:45] which will effect gerrit [00:34:13] lmk how that goes [00:59:26] nice, that was what it was [00:59:30] woo [00:59:32] thanks dschoon! you're an ssh warrior [00:59:42] i figure [00:59:54] i've invested a ton of time in figuring this out for myself already [01:00:00] the marginal cost in helping others is smal [01:00:01] small [01:00:22] :) steph and I are playing in secret chamber if you wanted to get a game [01:00:36] *will be playing shortly [01:00:59] word [01:01:04] omw. [13:01:07] morning average_drifter [13:32:55] morning ottomata [13:34:18] can you have a look at http://www.mediawiki.org/wiki/Analytics/Kraken/Metadata_Schemas [13:42:23] can dooooo [13:43:00] and louisdang got hue to work with yarn..... 
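A minimal sketch of the ~/.ssh/config shape behind the fix above (the *.wikimedia.org entry clobbering gerrit): ssh takes the first value it finds for each option, so a wildcard entry that carries the wrong User for gerrit will break gerrit pushes while plain ssh to the labs hosts still works. Host names, user names and ordering below are illustrative guesses, not the actual file from the etherpad.

    # Specific hosts first; ssh_config uses the first match per option.
    Host gerrit.wikimedia.org
        Port 29418                      # standard gerrit ssh port
        User yourgerritname             # gerrit account (placeholder)

    Host kripke reportcard
        HostName kripke.pmtpa.wmflabs   # placeholder hostname
        User yourshellname              # placeholder shell account

    # Wildcard last, so it no longer overrides the gerrit entry above.
    Host *.wikimedia.org
        User yourshellname              # placeholder shell account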
[13:43:01] in labs [13:43:06] (apparently0 [13:43:20] hi drdee [13:43:23] got a new review [13:43:27] morning [13:43:42] for you https://gerrit.wikimedia.org/r/#/c/31145/ [13:45:21] it seems that the spacing is all weird :( [13:46:01] ;( I tried to set up my editor [13:46:04] I got tabs in [13:46:11] but I think I did somethin wrong [13:46:46] my tabs are showing up on my screen(in my vim) as 2 spaces. I mean they're tabs but they take up 2 spaces [13:46:56] they are tab characters [13:47:09] uhm, I think gerrit shows them as 4 spaces [13:47:47] :) [13:51:55] ok cool, good to know [13:52:19] that is a setting for sure, average_drifter [13:52:23] tab width can be changed [13:52:42] http://www.linuxquestions.org/questions/suse-novell-60/how-to-make-a-tab-is-4-spaces-width-in-vim-355658/ [13:54:17] set tabstop=4 [13:54:17] set noexpandtab [13:54:17] set shiftwidth=4 [13:54:17] set softtabstop=4 [13:54:17] this is in my .vimrc [13:54:31] it was set to 2 when I did the git-review for the link I posted above [13:54:34] but now I set it to 4 [13:56:32] right, so that won't actually change anything unless you delete or add tabs [13:56:54] it will just help make things look normal for you and everyone else (this is why I prefer spaces in general :p , too bad we have to use tabs) [13:57:28] drdee, how's this work? [13:57:29] "name": "ip", "type": ["int", "string"], [14:00:08] we could store IPv6s as 4 ints [14:04:40] drdee: [14:04:40] http://www.mediawiki.org/wiki/Talk:Analytics/Kraken/Metadata_Schemas [14:05:39] gerrit expands one tab to 6 spaces btw [14:05:48] in the diff view I mean [14:09:49] man, google docs works so much better than etherpad or mediawiki for draft docs [14:11:34] ottomata: do you mean it's faster ? [14:12:18] ottomata, how can you store an ip6 as 4 ints? [14:13:54] no, i mean inline comments are nicer [14:14:02] its just 128 bits, right? [14:14:06] so an ipv6 is like aaaa:bbbb:cccc:dddd:eeee:ffff:gggg:hhhh where every letter there is a hex digit. so we actually have 2*8 bytes there, that's 16 bytes. now an integer is 4 bytes [14:14:08] http://en.wikipedia.org/wiki/Integer_(computer_science)#Value_and_representation [14:14:13] so yeah, 4 ints [14:15:42] 128 bits == 16 bytes [14:15:51] aaaa:bbbb:cccc:dddd:eeee:ffff:gggg:hhhh [14:15:55] is 40 chars long [14:15:57] that is 40 bytes [14:17:17] yes, but each group like "aaaa" is actually just 2 bytes, because those letters are actually hex digits [14:18:23] right [14:19:00] but how do you handle ip6 addresses like aaaa:::dddd:eeee:ffff:gggg:hhhh? [14:22:35] that means that the bbbb and cccc are both zero ? [14:23:47] http://en.wikipedia.org/wiki/IPv6_address#Presentation <-- says here under the subtitle "groups of zeroes" [14:28:25] average_drifter: yes [14:34:55] yeah but no matter what, if you store it as a string [14:35:46] you need to be able to store 40 bytes [14:35:53] which is more than 16 bytes [14:36:17] i think your idea makes sense [14:36:43] my only question is do we need to supply converters? and what type of representation do ip6 libraries in general expect [14:37:48] yea we probably do, but i mean, C has a native funciton for this [14:37:54] probably other languages (certainly java) do too [14:38:05] shall i send out the email to analytics list asking for feedback on the schema's? [14:41:55] sure, check the talk page [14:41:57] I added things there [14:42:26] i saw it, excellent! [14:42:33] okay sending now [14:43:49] cool [14:43:50] ottomata, you wanna headbutt with hue again? [14:44:02] (how did the piano go btw?) 
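A sketch of the "IPv6 as 4 ints" idea above: in C, inet_pton() already parses the textual form, including the :: zero-compression asked about at [14:19:00], into the 16-byte binary address, which can then be kept as four 32-bit integers (left in network byte order here). Function and variable names are illustrative, not udp-filter code.

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Parse an IPv6 string (handles "::" compression) into four 32-bit
     * integers.  Returns 0 on success, -1 if the text is not valid IPv6. */
    int ipv6_to_ints(const char *text, uint32_t out[4]) {
        struct in6_addr addr;
        if (inet_pton(AF_INET6, text, &addr) != 1) {
            return -1;
        }
        memcpy(out, addr.s6_addr, 16);   /* 128 bits -> 4 x 32 bits */
        return 0;
    }

    int main(void) {
        uint32_t ints[4];
        if (ipv6_to_ints("2001:db8::dddd:eeee", ints) == 0) {
            printf("%u %u %u %u\n", (unsigned)ints[0], (unsigned)ints[1],
                   (unsigned)ints[2], (unsigned)ints[3]);
        }
        return 0;
    }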
[14:44:27] yea for sure [14:44:27] decided not to get it! [14:44:37] :) [14:44:45] so louisdang said he got it to work [14:44:58] which is good news [14:45:03] yeah! [14:45:04] saw that [14:45:06] that is good news [14:45:13] means we can too :) [14:45:13] he was talking about webhdfs vs httpfs [14:45:18] iiiinteresting [14:45:27] should we enable webhdfs? [14:45:28] yes i remember reading about that, only glossed over it once I got most of it to work [14:45:29] maybe so [14:45:31] i will look into it [14:45:34] k [14:45:41] i can reproduce this problem locally, so that makes it much easier to play [14:46:17] when dschoon is on later, let's talk more about my comments and content in event stream [14:46:25] especially re cookies [14:46:30] i want to get asher a varnish log format specification soon [14:46:55] yeah totally that's why i sent out the email right now [14:47:11] but can we make that spec match the current web traffic logs? [14:47:29] or alternatively update the current web traffic logs to match the new varnish log specification? [14:49:25] to enable webhdfs: [14:49:31] add [14:49:32] [14:49:33] dfs.webhdfs.enabled [14:49:34] true [14:49:35] [14:49:36] to hdfs-site.xml [14:50:19] dschoon and I talked about that a bunch yesterday [14:50:43] i think we should enable it internally [14:50:43] we decided that since we are going to have to do a bunch of different ETL stuff for event stream than the request stream, that we can change the format if we want [14:50:51] (re event log format) [14:51:00] right sorry [14:51:04] ok [14:51:08] if the ETL for requests and events was going to be the same [14:51:18] then we'd probably want to be able to share the same format so the ETL process would be the same [14:51:35] but, one of dschoon's examples was that [14:51:47] the event stream will have referer set to the page that generated the event [14:51:59] not the actual referer to the page that the event was triggered on [14:52:18] so the event generating JS lib (or whatever) will probably add a real referer metadata string in the query params [14:52:25] so we'll have to parse that out [14:52:34] yup [14:52:43] i think there are more examples too [14:52:59] also, ori much prefers that the line starts with the url/product_id [14:53:10] we should talk with ori about that, and agree on a referer key that is always present in the payload data [14:53:14] that way his pub/sub services can much easier subscribe based on string prefix [14:53:26] well, that is kind of irrelevant, i think, i mean [14:53:38] if/when we code a JS event lib for people to use [14:53:40] that will be an issue [14:53:56] the varnish log format won't matter [14:54:07] it will always put the http header referrer in the field [14:54:11] but yeah [14:54:55] re webhdfs [14:54:57] it shoudl be on [14:56:18] k [14:58:17] hmm, so louisdang's cluster is not using local mode? [15:00:02] no [15:00:23] interesting [15:00:33] he's on labs right? maybe i'll copy over his configs and compare them all [15:00:37] yeah i am waiting for him to come online [15:00:44] yes his instance is on labs [15:00:53] i think it's the one in the hadoop group [15:13:51] hmm, can you make me part of hadoop group? [15:13:51] hadoop project? [15:13:53] i don't think I can... [15:18:46] 1 sec [15:20:15] drdee: hey! [15:20:29] hey ori-l [15:20:46] hey a_d [15:20:53] hey a_d [15:20:54] hey o-l [15:21:07] hey doctor diederik [15:21:26] hey gusy! [15:21:30] hey louisdang! 
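The markup in the hdfs-site.xml snippet at [14:49:32]-[14:49:36] was eaten by the paste; the property being quoted is the standard one:

    <!-- hdfs-site.xml: enable WebHDFS on the namenode/datanodes -->
    <property>
      <name>dfs.webhdfs.enabled</name>
      <value>true</value>
    </property>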
[15:21:36] hey drdee [15:21:49] what instance on labs did you use for your hadoop installlation> [15:22:22] drdee / ottomata: so, i'm giving a quick eventlogging preso after stevenw at the metrics meeting. kinda scrambling to finish it. can/should i point people to yr design docs? [15:22:27] I actually been using my own machine... [15:22:39] ori-l, yeah totally [15:22:46] louisdang, ok [15:22:50] can you send us your conf files? [15:23:07] ok. it's for pseudo distributed mode [15:23:16] us (is ottomata and myself) [15:23:51] drdee: ok can you give me a one or two-sentence pitch? like, when can people start using it, what is it going to be awesome for? [15:24:31] * ori-l is not trolling, despite appearances to the contrary [15:25:51] ori-l, are you referring to the avro serialization stuff? or the whole event/ hadoop logging chain in general? [15:27:07] drdee, louisdang [15:27:12] there are these: [15:27:12] https://labsconsole.wikimedia.org/wiki/Nova_Resource:Hadoop [15:27:22] tried to log into one, didn't ahve access [15:27:27] whichever you prefer. i'm basically going to start by looking at some really pathological example from the CT logs where we triple-encoded our data and then talk about how the EventLogging extension is supposed to help with that [15:27:30] ottomata: https://gerrit.wikimedia.org/r/#/c/31145/ [15:28:05] but i dont want to give the impression that its the only game in town or something we invented [15:28:40] ori-l i think it would be okay to just alert people about the existence of the wiki page with the two proposed schema's and that we are actively seeking feedback [15:28:57] cooooooool i will [15:28:58] hmm, well, maybe one of the nicest piece is the ability to quickly query data you log [15:29:08] ottomata: please review [15:29:15] once this is all set up, data should be available in hadoop within a few minutes [15:29:24] ottomata, drdee I've been using my own machine instead of the hadoop cluster since it was just faster and more efficient [15:29:38] and will be queryable via pig and hive very easily via hue web interface or somethign similar [15:29:57] that's cool louisdang, I do that too [15:30:04] but the whole pig/hive thing is too early to mention :) [15:30:07] oh ok [15:30:11] louisdang: can you tar up your configs and send them too me? i wonder what I'm doing wrong [15:30:18] particularly as we are still fighting with hue [15:30:26] what about the ability to intersect the data with server logs? [15:30:48] the data == mediawiki data? [15:31:02] ottomata, ok doing that now [15:31:11] tar -cvf ~/louisdang.hadoop.confs.tar /etc/{hadoop*,hive,hue,oozie,hbase,zookeeper,pig,sqoop}/ [15:31:19] yeah, im guessing you guys are going to make that queryable too alongside more human-designed events? [15:32:06] yes, we definitely want the ability to intersect web traffic data with mediawiki data [15:32:20] i have been working on importing data from mediawiki into kraken [15:32:46] drdee, just curious since you know more about how hive works than I do [15:32:47] if sqoop imports mysql data [15:33:00] and we have this avro serialized web log data [15:33:23] can we map a hive schema onto the serialized web data, without having to copy it into hive data dirs and/or format? [15:33:32] we import the mysql data as avro files as well [15:33:35] hive is mostly just defining a schema around existing data files, right? [15:33:36] oh ok [15:33:38] ottomata: i thought that was the idea! 
that would be awesome [15:33:53] yeah, i think so too, i just haven't done much with hive yet, so i'm curious [15:34:03] i know you can import data with hive into its own warehouse stuff [15:34:13] so the answer is yes to your question [15:34:13] cool [15:34:13] there ought not be a hard distinction between auto-generated server logs and custom events [15:34:29] ok cool. /runs to put something together [15:34:32] right now i am talking with the analysts about the proper schema for mediawiki data [15:34:43] nice cool [15:34:48] so with hive, it would be like: [15:35:01] create table bla bla bla bla using /path/to/log/data [15:35:11] exactly [15:35:11] then you can select and join agains whatever schemas are in hive [15:35:14] yup [15:35:16] no matter where the data is stored [15:35:17] coooooool [15:35:18] yup [15:35:19] super cool [15:35:55] ottomata: sorry to jump topics; can i also ask that we provide notice and schedule any changes to the varnish / udp logging configs? i want to make sure i know to expect data loss then [15:36:09] sounds good to me [15:36:12] doesn't have to be too far in advance, even a day's notice would be great [15:36:16] you mean, once it is actually running? [15:36:23] yeah, totally [15:36:51] anything that would impact the flow of data.. like restarting varnishncsa to grep for the new pattern, or whatever [15:37:35] aye cool [15:37:35] yeah [15:37:45] we can even write that down in the official format spec doc [15:37:50] notice policy for changes [15:37:51] etc. [15:37:56] ottomata can you have a look at https://gerrit.wikimedia.org/r/#/c/31145/ [15:39:00] yeah, sorry, average_drifter and drdee, looking at that now [15:39:04] k [15:39:50] ottomata: hey, just fixed spacing problems [15:39:50] hey average_drifter, I did have one question that maybe isn't that relevant about append_field [15:39:57] ottomata: I just hit a new git review [15:40:01] ottomata: yes please , ask [15:40:05] wanted to ask this before, but i wante dto focus on that other stuff [15:40:14] why do you increase the field count before you add the field? [15:40:16] if you did it after [15:40:32] you wouldn't have to do fields[*i-1] and fields[*i-2], etc. [15:41:17] drdee, ottomata : https://github.com/downloads/louisdang/kraken/louisdang.hadoop.confs.tar [15:41:25] brb [15:41:27] thx! [15:41:37] ottomata: I can increase the field count after I add the field, is that ok ? [15:42:53] yeah that's cool, i mean, effectively it is the same [15:43:24] it just reads slightly more sane in the code if you don' t have to do index arithmetic when you don't need to [15:43:56] ottomata: I will make it more readable [15:44:28] cool, danke! [15:46:43] oo, average_drifter, i think you have a misplaced semicolon on line 1049 [15:47:34] oo, louisdang, since it isn't apparent in the .tar [15:47:45] what is your /etc/hadoop/conf symlink pointing to? [15:47:47] ls -l /etc/hadoop.conf [15:47:53] ls -l /etc/hadoop/conf [15:48:00] is it conf.empty or conf.psuedo? [15:49:49] i think conf.pseudo [15:52:48] actually, the symlink is there but it points at alternative [15:52:51] louisdang: [15:52:58] ls -l /etc/alternatives/hadoop-conf [16:01:04] is there a canonical way to encode ipv6 and ipv4 address so I can perform checks on them ? [16:01:13] I'm mainly interested in checking whether some ipv6/ipv4 is local or not [16:01:32] ottomata: do you think representing them as CIDR would solve the problem easily ? 
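Roughly what the "create table bla bla bla bla using /path/to/log/data" at [15:35:01] expands to: an external Hive table is just a schema declared over files already sitting in HDFS. Column names and the path below are placeholders, and Avro-serialized files would need the Avro SerDe rather than the delimited row format shown here.

    -- Hypothetical example; columns and the HDFS path are placeholders.
    CREATE EXTERNAL TABLE web_requests (
      ts       STRING,
      ip       STRING,
      uri      STRING,
      referer  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/path/to/log/data';

    -- Once defined, it joins like any other Hive table:
    -- SELECT ... FROM web_requests JOIN some_mediawiki_table ON ...;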
[16:01:46] I mean maybe I can use libcidr to check if they are local or not [16:02:00] currently I need to run string matching on them to see if they are local [16:03:05] by local I mean both loopback, or local area networks [16:03:10] ottomata, I'm using conf.pseudo [16:03:28] /etc/alternatives/hadoop-conf -> /etc/hadoop/conf.pseudo [16:04:58] ok cool, danke [16:09:41] drdee: I think if the ip is local then the geoip library resolves them to XX [16:10:01] drdee: which means I don't need to match the ips from x-forwarded-for to local ips myself [16:10:01] perfect [16:10:03] drdee: because the geoip already takes care [16:10:06] yea :) [16:11:43] interesting, louisdang, I don't have this: [16:11:43] hadoop.proxyuser.httpfs.hosts [16:12:29] oh, louisdang [16:12:32] you are not using webhdfs [16:12:37] you are just using httpfs [16:12:41] you have [16:12:41] < dfs.webhdfs.enabled [16:12:41] < false [16:13:06] iiinteresting [16:14:17] ottomata: can you have another look please, I pushed another patchset [16:16:26] ok, average_drifter [16:16:26] 2 qs [16:16:33] in append_field [16:16:40] yes [16:16:45] 1. is this safe? [16:16:45] new_field_data[strlen(new_field_data)] = 10; [16:17:12] does 10 == \n? [16:17:15] yes [16:17:23] 0x0a if you prefer [16:17:37] ok, do we know for sure that new_field_data is long enough? [16:18:20] ottomata, yes I turned it off to try httpfs [16:18:39] but you did have hue+hive working with webhdfs? [16:18:48] ottomata: so basically append_field just takes a pointer. now.. whoever provides that pointer is responsible for making sure there's enough memory [16:18:49] ottomata, yes that's what I had first [16:19:02] ottomata: can we go by that convention and have it written as a comment ? [16:19:02] louisdang: ok cool, and did it work with httpfs [16:19:17] ummmm, your snprintf might be safer there [16:19:34] i don't thikn we should expect the user to know that his field needs to be long enough to add \n [16:19:36] also [16:19:36] 2. [16:19:39] ottomata, beeswax still worked with httpfs but I get an error with the filebrowser [16:19:49] hm, ok [16:20:01] ottomata, no error before with webhdfs [16:20:14] louisdang; problem with filebrower is a known issue, it does not support yarn yet [16:20:17] average_drifter: you should be consistent when using \n [16:20:28] you have 0x0a, you have 10 [16:20:28] drdee, ok [16:20:29] why not just use '\n'? [16:20:37] ottomata: ok '\n' then [16:20:42] filebrowswer works with yarn [16:20:43] jobbrowswer doesn't [16:20:48] cool [16:20:50] oh sorry mixed them up [16:21:10] anyways it is not related to beeswax AFAICT [16:22:26] louisdang: is there anything else that you did to get it to work or did you just use a vanilla installation? [16:22:57] ottomata: there's a problem with the snprintf because if I do snprintf(new_field_data,"%s\n",new_field_data); <=== the input and output are the same [16:23:10] ottomata: apparently the output is undefined if I use snprintf to add a \n with the same input and output [16:23:11] drdee, I just followed the instructions on cloudera [16:23:17] ottomata: http://stackoverflow.com/a/1973595/827519 [16:23:36] louisdang: which url? 
[16:23:53] drdee, https://ccp.cloudera.com/display/CDH4DOC/Hue+Installation [16:24:14] drdee, also I had to chown hue /user temporarily to make the sample files [16:24:56] k [16:25:29] ottomata: new_field_data[strlen(new_field_data)] = '\n'; [16:25:52] ottomata: if I do that it works fine, although I do agree that there are concerns about memory like "do we have enough memory to write another '\n' character ?" [16:26:09] ottomata: on the other hand since this is C and we're using strings allocated in some other place with malloc [16:26:21] ottomata: we don't have access to the malloc size.. [16:26:45] ottomata: so there is no way of telling if we have enough memory to add another \n .. however ! [16:27:18] ottomata: the area string was produced in geo_lookup [16:27:26] ottomata: and it was produced inside static char area[MAX_BUF_LENGTH]; [16:27:32] ottomata, did you restart the hadoop cluster after enabling webhdfs? [16:28:03] ottomata: and MAX_BUF_LENGTH is 128 (udp-filter.h) [16:28:34] ottomata: so if every country code is 2 bytes (US , JP, DE , etc) and we add just one byte for '\n' , we still have 125 bytes left because MAX_BUF_LENGTH is 128 [16:28:45] ottomata: would you agree ? :) [16:29:43] so in my particular case it does work, but yes, I cannot guarantee that everyone using an append_field will not produce problems. but I can write in a comment "Use with care. Make sure you have an extra byte of memory in there for append_field to put a \n" [16:32:12] it looks like there's no standard way to answer the question "Given a pointer, how much memory was allocated for that pointer through malloc ?" http://stackoverflow.com/a/1281721/827519 [16:33:08] mornin [16:33:15] yo dschoon [16:33:20] howdy drdee [16:33:22] ottomata, "/tmp (on the local file system) must be world-writable, as Hive makes extensive use of it." [16:33:24] ok cool, average_drifter, if we add a comment, then I am cool with that fo sho [16:33:43] ottomata: great ! thanks [16:33:49] ohhh, that is just because you are snprintfing into the same place [16:33:52] hm [16:34:01] yeah i see, we coudl snprintf a new string, buuuuut, whatever [16:34:06] if you add a comment, this is less mem and easier [16:34:12] yeah [16:34:14] it is [16:34:26] local /tmp is usually world writeable, unless you make it not so [16:34:40] k [16:35:08] also, average_drifter, I appreciate the use of descriptive variable names, rather than 'i', etc. [16:35:10] much more readable, thank you :) [16:35:55] the only times I personally use single letters like k, j, etc. for variable names, is when doing short loops where the variables refer only to iterated indexes [16:36:27] if you ever need more meaning to a variable, like you do in this case (e.g. last_field_index, new_field_index), then it is really good to do like you did [16:36:53] so readers of your code don't have to keep the indexes and offsets associated with their meanings in their heads [16:37:03] since the variable names are descriptive, the meaning is there [16:37:06] so um, thanks! [16:37:25] ottomata: ok, I can get rid of the "i" [16:37:55] drdee: i see you sent out the avro schemas. i don't know that was a good idea. [16:38:00] oh, no I mean, that's ok, I wouldn't do it here like you did, but it doesn't really matter that much to me [16:38:03] it is pretty localized inside of this func [16:38:09] it's an implementation detail that people shouldn't really care about. why rfc? 
[16:38:14] i'm just rambling about preferences I have when I choose variable names [16:38:29] for 'i' here, i mean, it isn't really necessary, but it doesn't matter so much [16:38:35] ottomata: new patchset [16:38:47] dschoon: making sure we capture all the important fields [16:38:48] im just saying I appreciate your use of last_field_index = i-2 [16:38:49] etc. [16:38:52] sure. [16:39:55] and sharing what we want to capture so people can start imagining what kind of questions they can ask [16:41:06] i don't expect a ton of feedback anyways [16:41:43] *nod* [16:43:21] I had started a Data Storage Formats page when my internets ran out yesterday. [16:43:46] I was planning on discussing the various methods and schemas there. [16:43:53] Is your page actually linked from anywhere? [16:45:03] not yet [16:45:04] hmm, average_drifter, do you think you shoudl be safe and also add a null byte after you add the \n? [16:46:47] ottomata: yes, sorry [16:47:01] aye cool [16:48:48] drdee: you mind if i edit that page? [16:48:54] ottomata: new patchset [16:49:01] not at all [16:49:08] kk [16:49:14] have we heard from dan today? i hope he's ok [16:49:28] he seriously hurt his head. [16:50:04] no he is still out [16:50:10] he looked dizzy yesterday [16:50:12] dschoon: how ? [16:50:28] what happened ? [16:50:42] average_drifter: he walked quickly down some stairs in the place where he has lived for years and went headfirst into a beam. [16:50:51] (it is not the most flattering story.) [16:50:54] haa [16:51:50] I hope he gets better [16:51:59] i do also. [16:52:12] louisdang: how big was your input file that you used to run the query on? [16:52:25] i already told moeller that we won't be demoing the new limn stuff, so i think we should send him home if he attempts to think hard today. [16:53:09] drdee, I ran the 3 sample queries that came with hue [16:53:13] k [16:53:44] I'm gonna go wash some dishes [16:53:51] be back in 20m [16:54:14] and he washes dishes, too! [16:54:15] drdee, can't check the logs anymore it says: The result of this query has expired. [16:54:20] can we keep him? please? [16:54:27] hmm, I will try those sample queries too [16:55:34] do we have some unsampled data somewhere? [16:55:58] not really, no [16:56:01] we can get it easily [16:56:14] could you grab me maybe a gig or so? [16:56:25] ah i mean, we'd have to capture it, but yeah [16:56:27] yeah. [16:56:31] ok [16:56:35] if it's not a big deal, that'd be great. [16:56:54] i want to run some experiments to see what sort of savings we get if i do some bundling. [16:57:41] drdee: why did you call it "metadata"? [16:58:48] we have the 10 minute unsampled data file on stat1 [16:58:57] i thought we used metadata all the time [16:59:06] ottomata, sample files are not fully installed [16:59:11] what's "meta" about it? [16:59:19] isn't it just data? it's not data about data... [16:59:36] isn't a schema always meta? [16:59:38] drdee: can you give me the path to the unsampled file? [16:59:49] I mean, if you're talking about the schema itself. [16:59:59] But it's a Data Schema. [17:00:09] dschoon: stat1:/a/squid/archive/sampled-1.log.gz [17:00:20] that is 3.2G gz compressed [17:00:41] I think it's just a buzzword [17:00:48] Whereas the schema, for example, found in the header of Avro datafiles describes the Avro schema itself. [17:01:05] i just don't like making things sound more complicated than they are. :) [17:01:23] another example of a metadata schema would be the avro directory [17:01:27] since it describes the descriptions of data. 
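A sketch of the newline append being reviewed above, with the two agreed fixes folded in (write the trailing '\0' back, and don't snprintf a buffer into itself); passing the capacity explicitly sidesteps the "how much did malloc give me" question from [16:32:12]. This is illustrative, not the actual udp-filter append_field().

    #include <string.h>

    /* Append '\n' in place and restore the NUL terminator.  The caller
     * must pass the buffer's real capacity (e.g. the 128-byte
     * MAX_BUF_LENGTH area[] buffer holding a 2-byte country code). */
    static void append_newline(char *buf, size_t buf_size) {
        size_t len = strlen(buf);
        if (len + 1 < buf_size) {       /* room for '\n' plus '\0' */
            buf[len]     = '\n';
            buf[len + 1] = '\0';
        }
    }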
[17:01:53] standup time? [17:02:14] gotta be quick. [17:02:20] metrics meeting at 1030 [17:02:34] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 [17:02:40] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 [17:02:48] oh dan's is having jury duty [17:11:07] g2g interview [17:20:44] average_drifter, approved change. [17:20:53] shutting down hue for a brief moment [17:22:41] sure [17:29:47] ottomata, installed samples in hue again, ran sample query, same error [17:30:48] ok cool, good to know [17:33:53] ottomata: thanks ! :) [17:38:12] fyi errybody, meeting is starting [17:55:09] dschoon, do we want need request time in seconds in the event log data? [17:55:10] it will just be varnish 204ing [17:55:31] that's the time for it to reply, right? [17:55:39] Varnish:time_firstbyte [17:55:39] Time to the first byte from the backend arrived [17:55:45] not really any backend bytes [17:56:51] ottomata: i don't have it listed in the EventData schema [17:56:54] so we can drop it. [17:57:07] ok cool [17:57:17] https://github.com/wmf-analytics/kraken/blob/master/src/main/avro/EventData.avro.json [17:57:59] i'm reading media wiki page [17:57:59] cool [17:58:04] i've added comments there, btw [17:58:14] http://www.mediawiki.org/wiki/Talk:Analytics/Kraken/Metadata_Schemas [17:58:21] cool. [17:58:28] i'm actually editing it atm [17:58:29] cool [17:58:31] you posted to the talk page? [17:59:10] yes [17:59:28] help me figure out best varnish log format [17:59:31] http://etherpad.wikimedia.org/Analytics-Pixel-Service [18:00:45] you have a link to the varnish log dogs? [18:00:50] *docs? [18:00:58] https://www.varnish-cache.org/docs/trunk/reference/varnishncsa.html [18:01:06] ty [18:01:33] mainly, lets figure out if we need to add or remove fields, and if the order makes sense [18:04:39] be back in a min, lunchtime [18:04:44] half listening ot meeting too [18:05:33] let's use 'tabs' to delimit fields :D [18:32:06] yeah for sure [18:32:12] (re tabs) [19:23:23] dschoon, i am still waiting for your reply on the avro schema regarding UA parsing, handling of referrers and the Accept_Language header :) [19:23:40] yep. [19:23:42] eating food. [19:23:52] watching fundraising video atm [19:23:55] then back downstairs :) [19:26:49] anyone opposed to me rebooting kripke? [19:27:00] ottomata, drdee, erosen? [19:27:04] no [19:31:12] nm, unnecessaray [19:42:14] aiight, ottomata [19:42:15] ... [19:42:21] who is now gone. [19:42:21] hokay. [19:47:16] drdee, are you using pig 0.10 now? Can optimize ParseWikiURL by using booleans instead of chararrays [19:52:49] the total irony of this is that louisdang back ported it to make it work pig 0.9 [19:53:02] cdh 4.1 upgraded to pig 0.10 [19:53:37] yeah it's a simple fix though. Just remove .toString() on the Boolean object at the end [19:54:07] can also make the UDF accept a boolean at declaration if it's pig 0.10 then return boolean [19:54:27] yup [19:54:47] or maybe there's a reflection method to check the pig version calling the udf [19:54:51] drdee: we know where otto went? [19:54:55] or when he'll be back? [19:55:01] he wont'be back today [19:55:04] oh, derf [19:55:06] half day [19:55:06] duh. [19:55:07] he went to guardia airport [19:55:07] right. [19:55:43] i am also heading out to the airport, picking up my mom :) [19:55:43] okay. anyway. i have to bolt in ~2h. new housemates moving in today, and i'm switching rooms (finally!) so I need to get that done beforehand. [19:55:46] nice. 
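For reference, the shape of the Pig 0.10 change mentioned at [19:47:16]/[19:53:37]: with boolean as a native type, the UDF can extend EvalFunc<Boolean> and return the Boolean directly instead of calling .toString() on it. The class name and URL test below are made up, not the real ParseWikiURL source.

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Illustrative Pig 0.10 UDF returning a native boolean.
    public class IsWikiURL extends EvalFunc<Boolean> {
        @Override
        public Boolean exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) {
                return null;
            }
            String url = (String) input.get(0);
            // Pig 0.9 back-port would return Boolean.valueOf(...).toString();
            // on 0.10 the Boolean itself is returned.
            return url != null && url.contains("/wiki/");
        }
    }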
[19:55:57] i guess we all are taking a slow day [19:56:26] don't feel particularly guilty as worked whole weekend on hive [19:56:40] yeah, heh. [19:56:40] also we should talk about some more sqoop challenges [19:56:48] i usually work during the weekend. [19:57:12] and take a 5 day weekend :D :D :D :D ? [19:57:13] i also usually eat dinner after work, dick around for a few hours, and then write code at night. [19:57:26] about sqoop..... [19:57:26] hence me committing at weird times. [19:57:38] like what? [19:57:44] there are a number of important mediawiki tables that delete data [19:57:50] or delete rows i should say [19:57:52] interesting. [19:57:54] why? [19:57:55] that sucks [19:57:59] design [19:58:19] so incremental updates become complicated [19:58:35] and doing a bulk import every week is really hammering the db's [19:59:01] yeah. [19:59:06] well, you remember my plan, right? [19:59:11] yes [19:59:16] we could just move on that a little faster [19:59:19] i don't think it'd be that hard. [19:59:19] we might need to bump it up [19:59:30] and it only depends on the pixel service going live [19:59:35] so once we get it set up in varnish, bam [19:59:44] i like the plan [19:59:45] okay i gotta go now [19:59:53] then i figure out what to bribe roan with so i don't actually have to understand mediawiki to make it happen :) [20:02:13] promise him 12 bottles of 'Beerenburg' [20:02:26] should do the trick [20:02:52] and that's not a joke [20:04:26] heh [20:04:31] What is that? [20:10:14] drdee, figured out anything from my conf files? [20:27:09] drdee: what was the answer about unsampled data? [20:27:19] is there any just sitting around? [20:37:22] I have a proposition [20:37:40] in order for people on various mediawiki projects to be able to kickstart/jumpstart their work extremely fast [20:38:00] I propose the creation of vagrant/virtualbox boxes equipped with all necessary project-specific software to be available [20:38:37] this decreases dramatically the time it takes for someone new to get up to speed and get to grips with a particular project [20:39:51] so when someone new comes, he can download the box and then just hit a git pull on that box [20:40:00] and he can have the current state of the project within minutes [20:40:08] with all configuration taken care of by the box itself [20:41:43] just a thought :) [20:44:31] drdee: https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=blob;f=docs/hooks.txt [20:45:10] mediawiki has ~2200 lines worth of hooks. [20:45:19] and that is merely the documented ones. [20:45:28] we now need to wade through them and figure out what we care about. [20:46:37] average_drifter: github.com/wikimedia/wmf-vagrant [20:46:58] ori-l: wow [20:47:36] hehe [20:48:32] i paid him five bucks to ask that on this channel [20:49:25] *grin* [20:49:25] brb [20:49:30] what the WMF _really_ needs is someone handsome / dashing like ori. he works here, you say? wow! [20:49:54] average_drifter: j/k obviously [20:51:02] did I write that ? oh j/k [20:51:21] hey I really didn't know about the project btw [20:51:28] sorry about that :) [21:03:25] average_drifter: it isn't quite all that you hoped for. the vagrant puppet stuff is not the same puppet configs used to provision production server. more work is needed to get things to that state. [21:03:45] if you're familiar with vagrant and want to take over, go for it! 
[22:08:59] dschoon: answer about unsampled data: stat1:/a/squid/archive/sampled-1.log.gz or something like that [22:09:23] dschoon: about the hooks: i know what i'll be reading soon :) [22:10:59] dschoon: about the unsampled file, don't look in the sampled folder look one folder higher [22:12:15] hey, I'm trying to track down this dataset: SquidDataVisitsPerCountryDaily.csv [22:13:22] supposedly it should be on 'locke' [22:13:45] where did you get that information from? [22:13:59] I'm digging through the wikistats repo [22:14:07] it's best to look at wikistats.wikimedia.org [22:14:11] nothing is on locke anymore [22:14:21] it's on stat1 [22:14:28] ah, thx. The specific page I'm looking at is broken... and doesn't expose the granularity we (fundraising) needs [22:14:51] what exactly are you looking for? [22:15:07] pageviews by country, by hour or day [22:15:24] preferrably, broken out by project as well... [22:15:32] we are working on that, [22:15:36] sorry for not keeping you in the loop [22:15:40] will get back to it tomorrow [22:15:47] not at all... thanks for the tips [22:16:18] we were working today on improving the accuracy of geocoding [22:16:20] is this the place to lurk for current news, or is there a wiki page / listserv... [22:16:30] this is a good place to lurk [22:16:45] bigger things we usually announce on on the analytics mailinglist [22:17:02] can you reprocess a few days of archival stuff? We have an event from yesterday we would like to analyze [22:17:09] i am working on a breakdown by project / country / month / desktop or mobile [22:17:22] rad, that's *exactly* what we need [22:17:28] i know :) [22:17:32] sure, just send me an email with the specifics [22:17:56] ah, telling the future as well... ;) [22:29:44] awight: tell me more please [22:29:55] hehe [22:30:22] awight: if you already have made a bugreport for this, or we can do that right now if you want, I can take care of the exact problem you have [22:30:46] average_drifter: i am already working on the pig script to run this analysis, your help is appreciated though [22:31:33] drdee: oh I thought awight was mentioning a problem in the wikistats [22:31:36] 00:15 < awight> pageviews by country, by hour or day [22:31:36] 00:15 < awight> preferrably, broken out by project as well... [22:31:52] yes but that's outside the scope of wikistats [22:31:59] it's a special request for the fundraising folks [22:32:04] oh, alright [22:32:05] average_drifter: yes actually, here's the bug tracker: https://bugzilla.wikimedia.org/show_bug.cgi?id=41663 [22:34:25] drdee: so you're saying "sampled-1.log.gz" is unsampled? [22:34:34] because that is horrible naming if so. [22:36:55] fyi: i need to boogie now to move between rooms in my house now that $OLD_ROOMMATES have finally moved out [22:37:00] hopefully back online later. [22:37:45] drdee, anything for me to do?