[00:08:28] grr [00:08:38] trying to get internet set up here [00:08:40] such a pain [00:10:01] anyone need me for anything? sorry i was all disappeared for a while. milimetric? erosen? [00:10:14] no worries [00:10:14] now that you mention it [00:10:20] i think i did have a question [00:10:28] oo [00:10:33] I fixed my limn branch (it was my changes to the reportcard that I forgot about) [00:10:34] how do you get descriptions to show up on limn [00:11:04] dschoon I'm in a mess with my ssh-ing, I have to move the conf file out of .ssh every time I want to push to gerrit (trying to puzzle it out now) [00:11:17] descriptions of the graph? [00:11:40] like in the json desc field or whatever it is?? [00:12:06] hm. [00:12:16] sorry? what are you asking? [00:12:28] oh. yeah. [00:12:35] it's the description field. [00:12:35] it's in the edit ui [00:12:41] i think it's called "desc" in the JSON [00:14:42] milimetric: i'm down to help with the ssh woes [00:15:09] k, so the symptoms [00:15:28] ...../projects/reportcard-data# git pull [00:15:41] that gives public key error access denied [00:16:04] if I mv ~/.ssh/config ~/.blah then it works [00:16:33] and the opposite is true with ssh kripke and ssh reportcard [00:16:40] (works with the config, not without) [00:28:23] k dschoon, I pushed a new branch on reportcard-data called feature/d3. I'll be working on that so it's parallel to limn and so I don't blow up prod [00:28:32] sweet [00:28:38] i'll update my copy [00:28:46] if you do the same, limn works again and it puzzled out the problem I was having [00:28:55] one sec on ssh [00:29:01] i'm trying to sign up for internet still [00:29:01] yeah, my apologies on that mistake, good think EZ had me do those tests :) [00:29:22] nah, forget it, I just ~/.configon and ~/.configoff [00:29:30] not worth the hours we'd spend [00:29:57] heh [00:29:57] once I type those commands 5000 times then the investment you're about to put in will pay off. Seriously, leave it [00:30:02] yeah, but it's probably just username or something in the file. [00:30:13] if it bothers you personally :) [00:30:17] stick the file on etherpad and i'll at least read it [00:30:55] http://etherpad.wikimedia.org/IlszcZZEEq [00:31:31] i know why. [00:31:42] i think. [00:32:07] i'm going to make some changes [00:32:14] gonna go eat dinner, I'll try them after [00:32:21] like removing the IdentityFile directives, because you only have one key [00:32:33] oh, reminder: I'll be gone tomorrow for jury duty. Forgot to mention today [00:33:17] nope, still breaks with those gone [00:33:33] k, goin to eat. ttyl :) [00:33:35] i also changed [00:33:39] the *.wikimedia.org [00:33:45] entry [00:33:45] which will effect gerrit [00:34:13] lmk how that goes [00:59:26] nice, that was what it was [00:59:30] woo [00:59:32] thanks dschoon! you're an ssh warrior [00:59:42] i figure [00:59:54] i've invested a ton of time in figuring this out for myself already [01:00:00] the marginal cost in helping others is smal [01:00:01] small [01:00:22] :) steph and I are playing in secret chamber if you wanted to get a game [01:00:36] *will be playing shortly [01:00:59] word [01:01:04] omw. [13:01:07] morning average_drifter [13:32:55] morning ottomata [13:34:18] can you have a look at http://www.mediawiki.org/wiki/Analytics/Kraken/Metadata_Schemas [13:42:23] can dooooo [13:43:00] and louisdang got hue to work with yarn..... 
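A minimal sketch of the ~/.ssh/config shape behind the fix above (the *.wikimedia.org entry clobbering gerrit): ssh takes the first value it finds for each option, so a wildcard entry that carries the wrong User for gerrit will break gerrit pushes while plain ssh to the labs hosts still works. Host names, user names and ordering below are illustrative guesses, not the actual file from the etherpad.

    # Specific hosts first; ssh_config uses the first match per option.
    Host gerrit.wikimedia.org
        Port 29418                      # standard gerrit ssh port
        User yourgerritname             # gerrit account (placeholder)

    Host kripke reportcard
        HostName kripke.pmtpa.wmflabs   # placeholder hostname
        User yourshellname              # placeholder shell account

    # Wildcard last, so it no longer overrides the gerrit entry above.
    Host *.wikimedia.org
        User yourshellname              # placeholder shell account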
[13:43:01] in labs [13:43:06] (apparently0 [13:43:20] hi drdee [13:43:23] got a new review [13:43:27] morning [13:43:42] for you https://gerrit.wikimedia.org/r/#/c/31145/ [13:45:21] it seems that the spacing is all weird :( [13:46:01] ;( I tried to set up my editor [13:46:04] I got tabs in [13:46:11] but I think I did somethin wrong [13:46:46] my tabs are showing up on my screen(in my vim) as 2 spaces. I mean they're tabs but they take up 2 spaces [13:46:56] they are tab characters [13:47:09] uhm, I think gerrit shows them as 4 spaces [13:47:47] :) [13:51:55] ok cool, good to know [13:52:19] that is a setting for sure, average_drifter [13:52:23] tab width can be changed [13:52:42] http://www.linuxquestions.org/questions/suse-novell-60/how-to-make-a-tab-is-4-spaces-width-in-vim-355658/ [13:54:17] set tabstop=4 [13:54:17] set noexpandtab [13:54:17] set shiftwidth=4 [13:54:17] set softtabstop=4 [13:54:17] this is in my .vimrc [13:54:31] it was set to 2 when I did the git-review for the link I posted above [13:54:34] but now I set it to 4 [13:56:32] right, so that won't actually change anything unless you delete or add tabs [13:56:54] it will just help make things look normal for you and everyone else (this is why I prefer spaces in general :p , too bad we have to use tabs) [13:57:28] drdee, how's this work? [13:57:29] "name": "ip", "type": ["int", "string"], [14:00:08] we could store IPv6s as 4 ints [14:04:40] drdee: [14:04:40] http://www.mediawiki.org/wiki/Talk:Analytics/Kraken/Metadata_Schemas [14:05:39] gerrit expands one tab to 6 spaces btw [14:05:48] in the diff view I mean [14:09:49] man, google docs works so much better than etherpad or mediawiki for draft docs [14:11:34] ottomata: do you mean it's faster ? [14:12:18] ottomata, how can you store an ip6 as 4 ints? [14:13:54] no, i mean inline comments are nicer [14:14:02] its just 128 bits, right? [14:14:06] so an ipv6 is like aaaa:bbbb:cccc:dddd:eeee:ffff:gggg:hhhh where every letter there is a hex digit. so we actually have 2*8 bytes there, that's 16 bytes. now an integer is 4 bytes [14:14:08] http://en.wikipedia.org/wiki/Integer_(computer_science)#Value_and_representation [14:14:13] so yeah, 4 ints [14:15:42] 128 bits == 16 bytes [14:15:51] aaaa:bbbb:cccc:dddd:eeee:ffff:gggg:hhhh [14:15:55] is 40 chars long [14:15:57] that is 40 bytes [14:17:17] yes, but each group like "aaaa" is actually just 2 bytes, because those letters are actually hex digits [14:18:23] right [14:19:00] but how do you handle ip6 addresses like aaaa:::dddd:eeee:ffff:gggg:hhhh? [14:22:35] that means that the bbbb and cccc are both zero ? [14:23:47] http://en.wikipedia.org/wiki/IPv6_address#Presentation <-- says here under the subtitle "groups of zeroes" [14:28:25] average_drifter: yes [14:34:55] yeah but no matter what, if you store it as a string [14:35:46] you need to be able to store 40 bytes [14:35:53] which is more than 16 bytes [14:36:17] i think your idea makes sense [14:36:43] my only question is do we need to supply converters? and what type of representation do ip6 libraries in general expect [14:37:48] yea we probably do, but i mean, C has a native funciton for this [14:37:54] probably other languages (certainly java) do too [14:38:05] shall i send out the email to analytics list asking for feedback on the schema's? [14:41:55] sure, check the talk page [14:41:57] I added things there [14:42:26] i saw it, excellent! [14:42:33] okay sending now [14:43:49] cool [14:43:50] ottomata, you wanna headbutt with hue again? [14:44:02] (how did the piano go btw?) 
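A sketch of the "IPv6 as 4 ints" idea above: in C, inet_pton() already parses the textual form, including the :: zero-compression asked about at [14:19:00], into the 16-byte binary address, which can then be kept as four 32-bit integers (left in network byte order here). Function and variable names are illustrative, not udp-filter code.

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Parse an IPv6 string (handles "::" compression) into four 32-bit
     * integers.  Returns 0 on success, -1 if the text is not valid IPv6. */
    int ipv6_to_ints(const char *text, uint32_t out[4]) {
        struct in6_addr addr;
        if (inet_pton(AF_INET6, text, &addr) != 1) {
            return -1;
        }
        memcpy(out, addr.s6_addr, 16);   /* 128 bits -> 4 x 32 bits */
        return 0;
    }

    int main(void) {
        uint32_t ints[4];
        if (ipv6_to_ints("2001:db8::dddd:eeee", ints) == 0) {
            printf("%u %u %u %u\n", (unsigned)ints[0], (unsigned)ints[1],
                   (unsigned)ints[2], (unsigned)ints[3]);
        }
        return 0;
    }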
[14:44:27] yea for sure [14:44:27] decided not to get it! [14:44:37] :) [14:44:45] so louisdang said he got it to work [14:44:58] which is good news [14:45:03] yeah! [14:45:04] saw that [14:45:06] that is good news [14:45:13] means we can too :) [14:45:13] he was talking about webhdfs vs httpfs [14:45:18] iiiinteresting [14:45:27] should we enable webhdfs? [14:45:28] yes i remember reading about that, only glossed over it once I got most of it to work [14:45:29] maybe so [14:45:31] i will look into it [14:45:34] k [14:45:41] i can reproduce this problem locally, so that makes it much easier to play [14:46:17] when dschoon is on later, let's talk more about my comments and content in event stream [14:46:25] especially re cookies [14:46:30] i want to get asher a varnish log format specification soon [14:46:55] yeah totally that's why i sent out the email right now [14:47:11] but can we make that spec match the current web traffic logs? [14:47:29] or alternatively update the current web traffic logs to match the new varnish log specification? [14:49:25] to enable webhdfs: [14:49:31] add [14:49:32] [14:49:33] dfs.webhdfs.enabled [14:49:34] true [14:49:35] [14:49:36] to hdfs-site.xml [14:50:19] dschoon and I talked about that a bunch yesterday [14:50:43] i think we should enable it internally [14:50:43] we decided that since we are going to have to do a bunch of different ETL stuff for event stream than the request stream, that we can change the format if we want [14:50:51] (re event log format) [14:51:00] right sorry [14:51:04] ok [14:51:08] if the ETL for requests and events was going to be the same [14:51:18] then we'd probably want to be able to share the same format so the ETL process would be the same [14:51:35] but, one of dschoon's examples was that [14:51:47] the event stream will have referer set to the page that generated the event [14:51:59] not the actual referer to the page that the event was triggered on [14:52:18] so the event generating JS lib (or whatever) will probably add a real referer metadata string in the query params [14:52:25] so we'll have to parse that out [14:52:34] yup [14:52:43] i think there are more examples too [14:52:59] also, ori much prefers that the line starts with the url/product_id [14:53:10] we should talk with ori about that, and agree on a referer key that is always present in the payload data [14:53:14] that way his pub/sub services can much easier subscribe based on string prefix [14:53:26] well, that is kind of irrelevant, i think, i mean [14:53:38] if/when we code a JS event lib for people to use [14:53:40] that will be an issue [14:53:56] the varnish log format won't matter [14:54:07] it will always put the http header referrer in the field [14:54:11] but yeah [14:54:55] re webhdfs [14:54:57] it shoudl be on [14:56:18] k [14:58:17] hmm, so louisdang's cluster is not using local mode? [15:00:02] no [15:00:23] interesting [15:00:33] he's on labs right? maybe i'll copy over his configs and compare them all [15:00:37] yeah i am waiting for him to come online [15:00:44] yes his instance is on labs [15:00:53] i think it's the one in the hadoop group [15:13:51] hmm, can you make me part of hadoop group? [15:13:51] hadoop project? [15:13:53] i don't think I can... [15:18:46] 1 sec [15:20:15] drdee: hey! [15:20:29] hey ori-l [15:20:46] hey a_d [15:20:53] hey a_d [15:20:54] hey o-l [15:21:07] hey doctor diederik [15:21:26] hey gusy! [15:21:30] hey louisdang! 
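The markup in the hdfs-site.xml snippet at [14:49:32]-[14:49:36] was eaten by the paste; the property being quoted is the standard one:

    <!-- hdfs-site.xml: enable WebHDFS on the namenode/datanodes -->
    <property>
      <name>dfs.webhdfs.enabled</name>
      <value>true</value>
    </property>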
[15:21:36] hey drdee [15:21:49] what instance on labs did you use for your hadoop installlation> [15:22:22] drdee / ottomata: so, i'm giving a quick eventlogging preso after stevenw at the metrics meeting. kinda scrambling to finish it. can/should i point people to yr design docs? [15:22:27] I actually been using my own machine... [15:22:39] ori-l, yeah totally [15:22:46] louisdang, ok [15:22:50] can you send us your conf files? [15:23:07] ok. it's for pseudo distributed mode [15:23:16] us (is ottomata and myself) [15:23:51] drdee: ok can you give me a one or two-sentence pitch? like, when can people start using it, what is it going to be awesome for? [15:24:31] * ori-l is not trolling, despite appearances to the contrary [15:25:51] ori-l, are you referring to the avro serialization stuff? or the whole event/ hadoop logging chain in general? [15:27:07] drdee, louisdang [15:27:12] there are these: [15:27:12] https://labsconsole.wikimedia.org/wiki/Nova_Resource:Hadoop [15:27:22] tried to log into one, didn't ahve access [15:27:27] whichever you prefer. i'm basically going to start by looking at some really pathological example from the CT logs where we triple-encoded our data and then talk about how the EventLogging extension is supposed to help with that [15:27:30] ottomata: https://gerrit.wikimedia.org/r/#/c/31145/ [15:28:05] but i dont want to give the impression that its the only game in town or something we invented [15:28:40] ori-l i think it would be okay to just alert people about the existence of the wiki page with the two proposed schema's and that we are actively seeking feedback [15:28:57] cooooooool i will [15:28:58] hmm, well, maybe one of the nicest piece is the ability to quickly query data you log [15:29:08] ottomata: please review [15:29:15] once this is all set up, data should be available in hadoop within a few minutes [15:29:24] ottomata, drdee I've been using my own machine instead of the hadoop cluster since it was just faster and more efficient [15:29:38] and will be queryable via pig and hive very easily via hue web interface or somethign similar [15:29:57] that's cool louisdang, I do that too [15:30:04] but the whole pig/hive thing is too early to mention :) [15:30:07] oh ok [15:30:11] louisdang: can you tar up your configs and send them too me? i wonder what I'm doing wrong [15:30:18] particularly as we are still fighting with hue [15:30:26] what about the ability to intersect the data with server logs? [15:30:48] the data == mediawiki data? [15:31:02] ottomata, ok doing that now [15:31:11] tar -cvf ~/louisdang.hadoop.confs.tar /etc/{hadoop*,hive,hue,oozie,hbase,zookeeper,pig,sqoop}/ [15:31:19] yeah, im guessing you guys are going to make that queryable too alongside more human-designed events? [15:32:06] yes, we definitely want the ability to intersect web traffic data with mediawiki data [15:32:20] i have been working on importing data from mediawiki into kraken [15:32:46] drdee, just curious since you know more about how hive works than I do [15:32:47] if sqoop imports mysql data [15:33:00] and we have this avro serialized web log data [15:33:23] can we map a hive schema onto the serialized web data, without having to copy it into hive data dirs and/or format? [15:33:32] we import the mysql data as avro files as well [15:33:35] hive is mostly just defining a schema around existing data files, right? [15:33:36] oh ok [15:33:38] ottomata: i thought that was the idea! 
that would be awesome [15:33:53] yeah, i think so too, i just haven't done much with hive yet, so i'm curious [15:34:03] i know you can import data with hive into its own warehouse stuff [15:34:13] so the answer is yes to your question [15:34:13] cool [15:34:13] there ought not be a hard distinction between auto-generated server logs and custom events [15:34:29] ok cool. /runs to put something together [15:34:32] right now i am talking with the analysts about the proper schema for mediawiki data [15:34:43] nice cool [15:34:48] so with hive, it would be like: [15:35:01] create table bla bla bla bla using /path/to/log/data [15:35:11] exactly [15:35:11] then you can select and join agains whatever schemas are in hive [15:35:14] yup [15:35:16] no matter where the data is stored [15:35:17] coooooool [15:35:18] yup [15:35:19] super cool [15:35:55] ottomata: sorry to jump topics; can i also ask that we provide notice and schedule any changes to the varnish / udp logging configs? i want to make sure i know to expect data loss then [15:36:09] sounds good to me [15:36:12] doesn't have to be too far in advance, even a day's notice would be great [15:36:16] you mean, once it is actually running? [15:36:23] yeah, totally [15:36:51] anything that would impact the flow of data.. like restarting varnishncsa to grep for the new pattern, or whatever [15:37:35] aye cool [15:37:35] yeah [15:37:45] we can even write that down in the official format spec doc [15:37:50] notice policy for changes [15:37:51] etc. [15:37:56] ottomata can you have a look at https://gerrit.wikimedia.org/r/#/c/31145/ [15:39:00] yeah, sorry, average_drifter and drdee, looking at that now [15:39:04] k [15:39:50] ottomata: hey, just fixed spacing problems [15:39:50] hey average_drifter, I did have one question that maybe isn't that relevant about append_field [15:39:57] ottomata: I just hit a new git review [15:40:01] ottomata: yes please , ask [15:40:05] wanted to ask this before, but i wante dto focus on that other stuff [15:40:14] why do you increase the field count before you add the field? [15:40:16] if you did it after [15:40:32] you wouldn't have to do fields[*i-1] and fields[*i-2], etc. [15:41:17] drdee, ottomata : https://github.com/downloads/louisdang/kraken/louisdang.hadoop.confs.tar [15:41:25] brb [15:41:27] thx! [15:41:37] ottomata: I can increase the field count after I add the field, is that ok ? [15:42:53] yeah that's cool, i mean, effectively it is the same [15:43:24] it just reads slightly more sane in the code if you don' t have to do index arithmetic when you don't need to [15:43:56] ottomata: I will make it more readable [15:44:28] cool, danke! [15:46:43] oo, average_drifter, i think you have a misplaced semicolon on line 1049 [15:47:34] oo, louisdang, since it isn't apparent in the .tar [15:47:45] what is your /etc/hadoop/conf symlink pointing to? [15:47:47] ls -l /etc/hadoop.conf [15:47:53] ls -l /etc/hadoop/conf [15:48:00] is it conf.empty or conf.psuedo? [15:49:49] i think conf.pseudo [15:52:48] actually, the symlink is there but it points at alternative [15:52:51] louisdang: [15:52:58] ls -l /etc/alternatives/hadoop-conf [16:01:04] is there a canonical way to encode ipv6 and ipv4 address so I can perform checks on them ? [16:01:13] I'm mainly interested in checking whether some ipv6/ipv4 is local or not [16:01:32] ottomata: do you think representing them as CIDR would solve the problem easily ? 
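Roughly what the "create table bla bla bla bla using /path/to/log/data" at [15:35:01] expands to: an external Hive table is just a schema declared over files already sitting in HDFS. Column names and the path below are placeholders, and Avro-serialized files would need the Avro SerDe rather than the delimited row format shown here.

    -- Hypothetical example; columns and the HDFS path are placeholders.
    CREATE EXTERNAL TABLE web_requests (
      ts       STRING,
      ip       STRING,
      uri      STRING,
      referer  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/path/to/log/data';

    -- Once defined, it joins like any other Hive table:
    -- SELECT ... FROM web_requests JOIN some_mediawiki_table ON ...;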
[16:01:46] I mean maybe I can use libcidr to check if they are local or not [16:02:00] currently I need to run string matching on them to see if they are local [16:03:05] by local I mean both loopback, or local area networks [16:03:10] ottomata, I'm using conf.pseudo [16:03:28] /etc/alternatives/hadoop-conf -> /etc/hadoop/conf.pseudo [16:04:58] ok cool, danke [16:09:41] drdee: I think if the ip is local then the geoip library resolves them to XX [16:10:01] drdee: which means I don't need to match the ips from x-forwarded-for to local ips myself [16:10:01] perfect [16:10:03] drdee: because the geoip already takes care [16:10:06] yea :) [16:11:43] interesting, louisdang, I don't have this: [16:11:43] hadoop.proxyuser.httpfs.hosts [16:12:29] oh, louisdang [16:12:32] you are not using webhdfs [16:12:37] you are just using httpfs [16:12:41] you have [16:12:41] < dfs.webhdfs.enabled [16:12:41] < false [16:13:06] iiinteresting [16:14:17] ottomata: can you have another look please, I pushed another patchset [16:16:26] ok, average_drifter [16:16:26] 2 qs [16:16:33] in append_field [16:16:40] yes [16:16:45] 1. is this safe? [16:16:45] new_field_data[strlen(new_field_data)] = 10; [16:17:12] does 10 == \n? [16:17:15] yes [16:17:23] 0x0a if you prefer [16:17:37] ok, do we know for sure that new_field_data is long enough? [16:18:20] ottomata, yes I turned it off to try httpfs [16:18:39] but you did have hue+hive working with webhdfs? [16:18:48] ottomata: so basically append_field just takes a pointer. now.. whoever provides that pointer is responsible for making sure there's enough memory [16:18:49] ottomata, yes that's what I had first [16:19:02] ottomata: can we go by that convention and have it written as a comment ? [16:19:02] louisdang: ok cool, and did it work with httpfs [16:19:17] ummmm, your snprintf might be safer there [16:19:34] i don't thikn we should expect the user to know that his field needs to be long enough to add \n [16:19:36] also [16:19:36] 2. [16:19:39] ottomata, beeswax still worked with httpfs but I get an error with the filebrowser [16:19:49] hm, ok [16:20:01] ottomata, no error before with webhdfs [16:20:14] louisdang; problem with filebrower is a known issue, it does not support yarn yet [16:20:17] average_drifter: you should be consistent when using \n [16:20:28] you have 0x0a, you have 10 [16:20:28] drdee, ok [16:20:29] why not just use '\n'? [16:20:37] ottomata: ok '\n' then [16:20:42] filebrowswer works with yarn [16:20:43] jobbrowswer doesn't [16:20:48] cool [16:20:50] oh sorry mixed them up [16:21:10] anyways it is not related to beeswax AFAICT [16:22:26] louisdang: is there anything else that you did to get it to work or did you just use a vanilla installation? [16:22:57] ottomata: there's a problem with the snprintf because if I do snprintf(new_field_data,"%s\n",new_field_data); <=== the input and output are the same [16:23:10] ottomata: apparently the output is undefined if I use snprintf to add a \n with the same input and output [16:23:11] drdee, I just followed the instructions on cloudera [16:23:17] ottomata: http://stackoverflow.com/a/1973595/827519 [16:23:36] louisdang: which url? 
[16:23:53] drdee, https://ccp.cloudera.com/display/CDH4DOC/Hue+Installation [16:24:14] drdee, also I had to chown hue /user temporarily to make the sample files [16:24:56] k [16:25:29] ottomata: new_field_data[strlen(new_field_data)] = '\n'; [16:25:52] ottomata: if I do that it works fine, although I do agree that there are concerns about memory like "do we have enough memory to write another '\n' character ?" [16:26:09] ottomata: on the other hand since this is C and we're using strings allocated in some other place with malloc [16:26:21] ottomata: we don't have access to the malloc size.. [16:26:45] ottomata: so there is no way of telling if we have enough memory to add another \n .. however ! [16:27:18] ottomata: the area string was produced in geo_lookup [16:27:26] ottomata: and it was produced inside static char area[MAX_BUF_LENGTH]; [16:27:32] ottomata, did you restart the hadoop cluster after enabling webhdfs? [16:28:03] ottomata: and MAX_BUF_LENGTH is 128 (udp-filter.h) [16:28:34] ottomata: so if every country code is 2 bytes (US , JP, DE , etc) and we add just one byte for '\n' , we still have 125 bytes left because MAX_BUF_LENGTH is 128 [16:28:45] ottomata: would you agree ? :) [16:29:43] so in my particular case it does work, but yes, I cannot guarantee that everyone using an append_field will not produce problems. but I can write in a comment "Use with care. Make sure you have an extra byte of memory in there for append_field to put a \n" [16:32:12] it looks like there's no standard way to answer the question "Given a pointer, how much memory was allocated for that pointer through malloc ?" http://stackoverflow.com/a/1281721/827519 [16:33:08] mornin [16:33:15] yo dschoon [16:33:20] howdy drdee [16:33:22] ottomata, "/tmp (on the local file system) must be world-writable, as Hive makes extensive use of it." [16:33:24] ok cool, average_drifter, if we add a comment, then I am cool with that fo sho [16:33:43] ottomata: great ! thanks [16:33:49] ohhh, that is just because you are snprintfing into the same place [16:33:52] hm [16:34:01] yeah i see, we coudl snprintf a new string, buuuuut, whatever [16:34:06] if you add a comment, this is less mem and easier [16:34:12] yeah [16:34:14] it is [16:34:26] local /tmp is usually world writeable, unless you make it not so [16:34:40] k [16:35:08] also, average_drifter, I appreciate the use of descriptive variable names, rather than 'i', etc. [16:35:10] much more readable, thank you :) [16:35:55] the only times I personally use single letters like k, j, etc. for variable names, is when doing short loops where the variables refer only to iterated indexes [16:36:27] if you ever need more meaning to a variable, like you do in this case (e.g. last_field_index, new_field_index), then it is really good to do like you did [16:36:53] so readers of your code don't have to keep the indexes and offsets associated with their meanings in their heads [16:37:03] since the variable names are descriptive, the meaning is there [16:37:06] so um, thanks! [16:37:25] ottomata: ok, I can get rid of the "i" [16:37:55] drdee: i see you sent out the avro schemas. i don't know that was a good idea. [16:38:00] oh, no I mean, that's ok, I wouldn't do it here like you did, but it doesn't really matter that much to me [16:38:03] it is pretty localized inside of this func [16:38:09] it's an implementation detail that people shouldn't really care about. why rfc? 
[16:38:14] i'm just rambling about preferences I have when I choose variable names [16:38:29] for 'i' here, i mean, it isn't really necessary, but it doesn't matter so much [16:38:35] ottomata: new patchset [16:38:47] dschoon: making sure we capture all the important fields [16:38:48] im just saying I appreciate your use of last_field_index = i-2 [16:38:49] etc. [16:38:52] sure. [16:39:55] and sharing what we want to capture so people can start imagining what kind of questions they can ask [16:41:06] i don't expect a ton of feedback anyways [16:41:43] *nod* [16:43:21] I had started a Data Storage Formats page when my internets ran out yesterday. [16:43:46] I was planning on discussing the various methods and schemas there. [16:43:53] Is your page actually linked from anywhere? [16:45:03] not yet [16:45:04] hmm, average_drifter, do you think you shoudl be safe and also add a null byte after you add the \n? [16:46:47] ottomata: yes, sorry [16:47:01] aye cool [16:48:48] drdee: you mind if i edit that page? [16:48:54] ottomata: new patchset [16:49:01] not at all [16:49:08] kk [16:49:14] have we heard from dan today? i hope he's ok [16:49:28] he seriously hurt his head. [16:50:04] no he is still out [16:50:10] he looked dizzy yesterday [16:50:12] dschoon: how ? [16:50:28] what happened ? [16:50:42] average_drifter: he walked quickly down some stairs in the place where he has lived for years and went headfirst into a beam. [16:50:51] (it is not the most flattering story.) [16:50:54] haa [16:51:50] I hope he gets better [16:51:59] i do also. [16:52:12] louisdang: how big was your input file that you used to run the query on? [16:52:25] i already told moeller that we won't be demoing the new limn stuff, so i think we should send him home if he attempts to think hard today. [16:53:09] drdee, I ran the 3 sample queries that came with hue [16:53:13] k [16:53:44] I'm gonna go wash some dishes [16:53:51] be back in 20m [16:54:14] and he washes dishes, too! [16:54:15] drdee, can't check the logs anymore it says: The result of this query has expired. [16:54:20] can we keep him? please? [16:54:27] hmm, I will try those sample queries too [16:55:34] do we have some unsampled data somewhere? [16:55:58] not really, no [16:56:01] we can get it easily [16:56:14] could you grab me maybe a gig or so? [16:56:25] ah i mean, we'd have to capture it, but yeah [16:56:27] yeah. [16:56:31] ok [16:56:35] if it's not a big deal, that'd be great. [16:56:54] i want to run some experiments to see what sort of savings we get if i do some bundling. [16:57:41] drdee: why did you call it "metadata"? [16:58:48] we have the 10 minute unsampled data file on stat1 [16:58:57] i thought we used metadata all the time [16:59:06] ottomata, sample files are not fully installed [16:59:11] what's "meta" about it? [16:59:19] isn't it just data? it's not data about data... [16:59:36] isn't a schema always meta? [16:59:38] drdee: can you give me the path to the unsampled file? [16:59:49] I mean, if you're talking about the schema itself. [16:59:59] But it's a Data Schema. [17:00:09] dschoon: stat1:/a/squid/archive/sampled-1.log.gz [17:00:20] that is 3.2G gz compressed [17:00:41] I think it's just a buzzword [17:00:48] Whereas the schema, for example, found in the header of Avro datafiles describes the Avro schema itself. [17:01:05] i just don't like making things sound more complicated than they are. :) [17:01:23] another example of a metadata schema would be the avro directory [17:01:27] since it describes the descriptions of data. 
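A sketch of the newline append being reviewed above, with the two agreed fixes folded in (write the trailing '\0' back, and don't snprintf a buffer into itself); passing the capacity explicitly sidesteps the "how much did malloc give me" question from [16:32:12]. This is illustrative, not the actual udp-filter append_field().

    #include <string.h>

    /* Append '\n' in place and restore the NUL terminator.  The caller
     * must pass the buffer's real capacity (e.g. the 128-byte
     * MAX_BUF_LENGTH area[] buffer holding a 2-byte country code). */
    static void append_newline(char *buf, size_t buf_size) {
        size_t len = strlen(buf);
        if (len + 1 < buf_size) {       /* room for '\n' plus '\0' */
            buf[len]     = '\n';
            buf[len + 1] = '\0';
        }
    }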
[17:01:53] standup time? [17:02:14] gotta be quick. [17:02:20] metrics meeting at 1030 [17:02:34] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 [17:02:40] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 [17:02:48] oh dan's is having jury duty [17:11:07] g2g interview [17:20:44] average_drifter, approved change. [17:20:53] shutting down hue for a brief moment [17:22:41] sure [17:29:47] ottomata, installed samples in hue again, ran sample query, same error [17:30:48] ok cool, good to know [17:33:53] ottomata: thanks ! :) [17:38:12] fyi errybody, meeting is starting [17:55:09] dschoon, do we want need request time in seconds in the event log data? [17:55:10] it will just be varnish 204ing [17:55:31] that's the time for it to reply, right? [17:55:39] Varnish:time_firstbyte [17:55:39] Time to the first byte from the backend arrived [17:55:45] not really any backend bytes [17:56:51] ottomata: i don't have it listed in the EventData schema [17:56:54] so we can drop it. [17:57:07] ok cool [17:57:17] https://github.com/wmf-analytics/kraken/blob/master/src/main/avro/EventData.avro.json [17:57:59] i'm reading media wiki page [17:57:59] cool [17:58:04] i've added comments there, btw [17:58:14] http://www.mediawiki.org/wiki/Talk:Analytics/Kraken/Metadata_Schemas [17:58:21] cool. [17:58:28] i'm actually editing it atm [17:58:29] cool [17:58:31] you posted to the talk page? [17:59:10] yes [17:59:28] help me figure out best varnish log format [17:59:31] http://etherpad.wikimedia.org/Analytics-Pixel-Service [18:00:45] you have a link to the varnish log dogs? [18:00:50] *docs? [18:00:58] https://www.varnish-cache.org/docs/trunk/reference/varnishncsa.html [18:01:06] ty [18:01:33] mainly, lets figure out if we need to add or remove fields, and if the order makes sense [18:04:39] be back in a min, lunchtime [18:04:44] half listening ot meeting too [18:05:33] let's use 'tabs' to delimit fields :D [18:32:06] yeah for sure [18:32:12] (re tabs) [19:23:23] dschoon, i am still waiting for your reply on the avro schema regarding UA parsing, handling of referrers and the Accept_Language header :) [19:23:40] yep. [19:23:42] eating food. [19:23:52] watching fundraising video atm [19:23:55] then back downstairs :) [19:26:49] anyone opposed to me rebooting kripke? [19:27:00] ottomata, drdee, erosen? [19:27:04] no [19:31:12] nm, unnecessaray [19:42:14] aiight, ottomata [19:42:15] ... [19:42:21] who is now gone. [19:42:21] hokay. [19:47:16] drdee, are you using pig 0.10 now? Can optimize ParseWikiURL by using booleans instead of chararrays [19:52:49] the total irony of this is that louisdang back ported it to make it work pig 0.9 [19:53:02] cdh 4.1 upgraded to pig 0.10 [19:53:37] yeah it's a simple fix though. Just remove .toString() on the Boolean object at the end [19:54:07] can also make the UDF accept a boolean at declaration if it's pig 0.10 then return boolean [19:54:27] yup [19:54:47] or maybe there's a reflection method to check the pig version calling the udf [19:54:51] drdee: we know where otto went? [19:54:55] or when he'll be back? [19:55:01] he wont'be back today [19:55:04] oh, derf [19:55:06] half day [19:55:06] duh. [19:55:07] he went to guardia airport [19:55:07] right. [19:55:43] i am also heading out to the airport, picking up my mom :) [19:55:43] okay. anyway. i have to bolt in ~2h. new housemates moving in today, and i'm switching rooms (finally!) so I need to get that done beforehand. [19:55:46] nice. 
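For reference, the shape of the Pig 0.10 change mentioned at [19:47:16]/[19:53:37]: with boolean as a native type, the UDF can extend EvalFunc<Boolean> and return the Boolean directly instead of calling .toString() on it. The class name and URL test below are made up, not the real ParseWikiURL source.

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Illustrative Pig 0.10 UDF returning a native boolean.
    public class IsWikiURL extends EvalFunc<Boolean> {
        @Override
        public Boolean exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) {
                return null;
            }
            String url = (String) input.get(0);
            // Pig 0.9 back-port would return Boolean.valueOf(...).toString();
            // on 0.10 the Boolean itself is returned.
            return url != null && url.contains("/wiki/");
        }
    }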
[19:55:57] i guess we all are taking a slow day [19:56:26] don't feel particularly guilty as worked whole weekend on hive [19:56:40] yeah, heh. [19:56:40] also we should talk about some more sqoop challenges [19:56:48] i usually work during the weekend. [19:57:12] and take a 5 day weekend :D :D :D :D ? [19:57:13] i also usually eat dinner after work, dick around for a few hours, and then write code at night. [19:57:26] about sqoop..... [19:57:26] hence me committing at weird times. [19:57:38] like what? [19:57:44] there are a number of important mediawiki tables that delete data [19:57:50] or delete rows i should say [19:57:52] interesting. [19:57:54] why? [19:57:55] that sucks [19:57:59] design [19:58:19] so incremental updates become complicated [19:58:35] and doing a bulk import every week is really hammering the db's [19:59:01] yeah. [19:59:06] well, you remember my plan, right? [19:59:11] yes [19:59:16] we could just move on that a little faster [19:59:19] i don't think it'd be that hard. [19:59:19] we might need to bump it up [19:59:30] and it only depends on the pixel service going live [19:59:35] so once we get it set up in varnish, bam [19:59:44] i like the plan [19:59:45] okay i gotta go now [19:59:53] then i figure out what to bribe roan with so i don't actually have to understand mediawiki to make it happen :) [20:02:13] promise him 12 bottles of 'Beerenburg' [20:02:26] should do the trick [20:02:52] and that's not a joke [20:04:26] heh [20:04:31] What is that? [20:10:14] drdee, figured out anything from my conf files? [20:27:09] drdee: what was the answer about unsampled data? [20:27:19] is there any just sitting around? [20:37:22] I have a proposition [20:37:40] in order for people on various mediawiki projects to be able to kickstart/jumpstart their work extremely fast [20:38:00] I propose the creation of vagrant/virtualbox boxes equipped with all necessary project-specific software to be available [20:38:37] this decreases dramatically the time it takes for someone new to get up to speed and get to grips with a particular project [20:39:51] so when someone new comes, he can download the box and then just hit a git pull on that box [20:40:00] and he can have the current state of the project within minutes [20:40:08] with all configuration taken care of by the box itself [20:41:43] just a thought :) [20:44:31] drdee: https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=blob;f=docs/hooks.txt [20:45:10] mediawiki has ~2200 lines worth of hooks. [20:45:19] and that is merely the documented ones. [20:45:28] we now need to wade through them and figure out what we care about. [20:46:37] average_drifter: github.com/wikimedia/wmf-vagrant [20:46:58] ori-l: wow [20:47:36] hehe [20:48:32] i paid him five bucks to ask that on this channel [20:49:25] *grin* [20:49:25] brb [20:49:30] what the WMF _really_ needs is someone handsome / dashing like ori. he works here, you say? wow! [20:49:54] average_drifter: j/k obviously [20:51:02] did I write that ? oh j/k [20:51:21] hey I really didn't know about the project btw [20:51:28] sorry about that :) [21:03:25] average_drifter: it isn't quite all that you hoped for. the vagrant puppet stuff is not the same puppet configs used to provision production server. more work is needed to get things to that state. [21:03:45] if you're familiar with vagrant and want to take over, go for it! 
[22:08:59] dschoon: answer about unsampled data: stat1:/a/squid/archive/sampled-1.log.gz or something like that [22:09:23] dschoon: about the hooks: i know what i'll be reading soon :) [22:10:59] dschoon: about the unsampled file, don't look in the sampled folder look one folder higher [22:12:15] hey, I'm trying to track down this dataset: SquidDataVisitsPerCountryDaily.csv [22:13:22] supposedly it should be on 'locke' [22:13:45] where did you get that information from? [22:13:59] I'm digging through the wikistats repo [22:14:07] it's best to look at wikistats.wikimedia.org [22:14:11] nothing is on locke anymore [22:14:21] it's on stat1 [22:14:28] ah, thx. The specific page I'm looking at is broken... and doesn't expose the granularity we (fundraising) needs [22:14:51] what exactly are you looking for? [22:15:07] pageviews by country, by hour or day [22:15:24] preferrably, broken out by project as well... [22:15:32] we are working on that, [22:15:36] sorry for not keeping you in the loop [22:15:40] will get back to it tomorrow [22:15:47] not at all... thanks for the tips [22:16:18] we were working today on improving the accuracy of geocoding [22:16:20] is this the place to lurk for current news, or is there a wiki page / listserv... [22:16:30] this is a good place to lurk [22:16:45] bigger things we usually announce on on the analytics mailinglist [22:17:02] can you reprocess a few days of archival stuff? We have an event from yesterday we would like to analyze [22:17:09] i am working on a breakdown by project / country / month / desktop or mobile [22:17:22] rad, that's *exactly* what we need [22:17:28] i know :) [22:17:32] sure, just send me an email with the specifics [22:17:56] ah, telling the future as well... ;) [22:29:44] awight: tell me more please [22:29:55] hehe [22:30:22] awight: if you already have made a bugreport for this, or we can do that right now if you want, I can take care of the exact problem you have [22:30:46] average_drifter: i am already working on the pig script to run this analysis, your help is appreciated though [22:31:33] drdee: oh I thought awight was mentioning a problem in the wikistats [22:31:36] 00:15 < awight> pageviews by country, by hour or day [22:31:36] 00:15 < awight> preferrably, broken out by project as well... [22:31:52] yes but that's outside the scope of wikistats [22:31:59] it's a special request for the fundraising folks [22:32:04] oh, alright [22:32:05] average_drifter: yes actually, here's the bug tracker: https://bugzilla.wikimedia.org/show_bug.cgi?id=41663 [22:34:25] drdee: so you're saying "sampled-1.log.gz" is unsampled? [22:34:34] because that is horrible naming if so. [22:36:55] fyi: i need to boogie now to move between rooms in my house now that $OLD_ROOMMATES have finally moved out [22:37:00] hopefully back online later. [22:37:45] drdee, anything for me to do?