[00:14:52] drdee: I finished writing a script for the star wars kid logs [00:15:01] awesome! [00:15:10] can you paste the script and the results in a gist? [00:15:38] drdee: https://gist.github.com/3869339 for the code [00:16:21] it's based on the code written by ottomata in the kraken repo [00:17:33] https://gist.github.com/3869351 for the results on the first 100 lines of the star wars kid logs [00:19:23] very cool, i am gonna take a look at it later tonight [00:19:28] first i need my dinner :) [00:19:31] drdee: report card down? [00:19:39] or rather, empty [00:19:47] are you guys working on it? [00:20:03] nothing urgent, have dinner first ;) [00:20:37] no not working on i [00:20:48] charts are empty? [00:20:59] erosen, I just had to pull data from outside the enwiki slave and I feel your pain about the fact that staging is not replicated (pinged you on trello) [00:21:17] JS error maybe? [00:21:21] indeed [00:21:25] drdee: forget it, for some reason they didn't show up in the browser [00:21:36] works fine now [00:22:21] erosen: I'll ask asher this Friday if he can think of a solution to help with that [00:22:29] great [00:22:37] let me know if I can be of use in that process [00:24:29] I'll talk to him this Fri at 11.30 if you want to join [00:25:31] not necessary for you to attend, but you can explain the non-enwiki pain better than I can [00:27:29] die Nicht-Englisch-wiki Schmerzen [00:27:37] as the germans call it [00:28:07] drdee: thanks for the email! [00:29:23] ori-l: you are welcome ;) [00:42:53] DarTar, didn't we have policy for publication at the Research Committee [00:43:04] or at least guidelines [00:43:30] there's an OA policy proposal, never enforced [00:43:59] maybe breath some new life into that? [00:44:02] but that's mostly for external researchers, not hard for us to self-inflict a policy [00:44:37] Daniel is still very active on that front, I am sure he could help draft some language for WMF staff/contractors [02:55:17] hey [02:59:55] ok got it [03:00:03] drdee: ok, so it depends on the changelog [03:00:42] I read a lot of stuff, tried to change Makefile.am with PACKAGE_VERSION or VERSION or PACKAGE_STRING or AC_* with all of the aforementioned variables [03:00:45] didn't work [03:00:54] so, the changelog dictates the latest version of the package [03:02:08] so although we generate our own versions in teh debianize.sh script [03:02:16] and we use them later on ... [03:02:29] the version of the package is still dictated by the debian/changelog [03:15:24] ottomata: [03:15:26] ottomata: ! [03:15:28] ottomata: !! [03:16:01] ottomata: oh hi, do you want to deploy a package together ? [03:16:17] hiiiiiiii [03:16:17] haha [03:16:18] not really [03:16:20] haha [03:16:23] deploy [03:16:26] whatchyouuuu talking bout? [03:18:26] webstatscollector [03:19:01] i think i do not know much about deployment [03:19:07] its a package upgrade? [03:19:17] needs installed on some machine and then restarted or somethign? [03:20:40] average_drifter: do you mean upload it to the debian repo? [03:20:51] ori-l: well not yet , just a test run [03:20:59] ottomata: just a test run, so for example, I can give you a package [03:21:02] do you have access to labs? [03:21:02] ottomata: and we can deploy together [03:21:08] ottomata: and we can see if it goes bad/wrong [03:21:47] heh [03:21:50] average_drifter: in wmf vernacular "deploy" typically means push code to the production mediawiki instances [03:22:01] sounds fun! 
buuuut, maybe not something I want to do right before my bedtime [03:22:22] ottomata: you live in brooklyn, your bed time is not for another 3-4 hours :P [03:22:38] j/k [03:23:10] geeeet outta heerreeee [03:23:15] i am a midnight bedtime kidna guy [03:23:18] on weeknights anyway :p [03:23:59] I switched my schedule to night-time [03:24:03] ah, well have a good night then [03:24:14] daytime sucks for me, I can't get my stuff together during the day [03:24:54] I mean I'm normal, it's just that I have neighbours and sometimes they go "OH I KNOW, I'll drill some holes and make a lot of noise so everyone knows that they have busy neighbours" [03:24:58] anyway.. [03:25:21] haha [03:26:51] where are you stefan? [03:26:57] Bucharest/Romania [03:27:00] (wait, you are stefan, right?) [03:27:02] aye ja [03:27:04] yes I am [03:27:08] hope to get out of here soon [03:27:35] yeah? [03:27:49] I'm waiting for some data from someone, I might be in .us at some point for some time [03:28:07] * average_drifter package is building.. [03:34:45] ottomata: where do I put the package ? [03:35:49] ottomata: please scp otto@build1.pmtpa.wmflabs:/home/diederik/webstatscollector/webstatscollector_0.2_amd64.deb . [03:36:03] ottomata: can we get on a machine and try this deb out ? [03:36:15] well, you can dpkg -i it [03:36:20] in labs no problem [03:36:30] right? [03:36:51] I can, but it can mess up the package system if it's not good [03:37:02] ottomata: is messing up build1 a concern ? [03:37:05] i like to check dpkg-deb --contents [03:37:13] that will show you ever file it is going to install and to where [03:37:20] ummm, it'd be annoying but not the end of the world [03:37:28] you could spawn up or use another labs instance to try it [03:37:55] ottomata: how do I spawn up another labs ? [03:38:05] ottomata: is it complicated ? [03:38:11] naw [03:38:56] can you access this? [03:38:57] https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaInstance&action=create&project=analytics&region=pmtpa [03:39:58] Sysadmin role required [03:39:59] You must be a member of the sysadmin role to perform this action. [03:40:21] aye [03:40:24] how about [03:40:47] yes ? [03:41:27] ah let's see i'll make a new instance [03:42:33] ok, i'm spawning up build2 [03:42:38] it'll take a few minutes before it comes online [03:42:50] no problem [03:42:58] ok I'll use build2 when it's ready [03:43:10] but, you should have the same access [03:43:14] also, it is ubuntu precise [03:43:15] not lucid [03:43:38] builddiederik@build1:~/webstatscollector$ cat /etc/lsb-release [03:43:38] DISTRIB_ID=Ubuntu [03:43:38] DISTRIB_RELEASE=10.04 [03:43:38] DISTRIB_CODENAME=lucid [03:43:38] DISTRIB_DESCRIPTION="Ubuntu 10.04.3 LTS" [03:44:02] says lucid there. maybe we can do something about that ? [03:44:10] should I change it to be something else ? [03:44:27] naw, you'll probably have to build the package for both OSes anyway [03:44:30] but it will mostly be the same [03:44:38] you just build 2 different debs, one on each OS [03:44:44] but [03:44:52] i'm sure for testing [03:44:53] it won't matter [03:45:00] you can probably jsut install it on build2 no problem [03:45:06] ok [03:45:30] i think apt likes to have things built for different distros specifically [03:45:44] and will give you trouble if you try to install one on the other [03:45:48] but i don't think dpkg cares [03:46:20] yeah, dpkg is more permissive definitely [04:02:27] ottomata: can you please ping me about build2 ? [04:05:51] shoudl be up by now.. [04:07:03] hmm, how do you log into labs?
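For reference, a minimal sketch of the packaging flow discussed above: the version apt and dpkg report comes from the top stanza of debian/changelog (not from Makefile.am or any AC_* macro), so that is where the bump has to land, and the built .deb can be inspected and test-installed on a scratch instance like build2. The version string, distribution and changelog message below are illustrative assumptions, not the exact commands that were run.

    # bump debian/changelog -- this is what dictates the package version
    dch -v 0.2-1 --distribution lucid "Rebuild webstatscollector via debianize.sh"
    # list every file the package will install, and where
    dpkg-deb --contents webstatscollector_0.2_amd64.deb
    # test-install on a scratch labs instance (build2) rather than build1
    sudo dpkg -i webstatscollector_0.2_amd64.deb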
[04:07:05] i haven't done it in a while [04:07:08] what bastion do you use? [04:07:10] user@garage:~/test-git-dch$ ping build2.pmtpa.wmflabs [04:07:10] ping: unknown host build2.pmtpa.wmflabs [04:07:18] ottomata: I ssh into build1 [04:07:36] ottomata: bastion1 [04:07:51] Host *.pmtpa.wmflabs ProxyCommand ssh -a -W %h:%p bastion1.pmtpa.wmflabs [04:07:54] Host *.eqiad.wmflabs ProxyCommand ssh -a -W %h:%p bastion1.eqiad.wmflabs [04:07:58] got that in my .ssh/config [04:08:07] ah bastion1 [04:10:10] ummm, but hm [04:10:18] bastion1.pmtpa.wmflabs is not a public addy [04:10:24] doesn't resolve from my lcoal [04:11:05] Host bastion1.pmtpa.wmflabs Hostname bastion.wmflabs.org ProxyCommand none [04:11:09] Host bastion1.eqiad.wmflabs Hostname bastion2.wmflabs.org ProxyCommand none [04:11:12] ottomata: also got that in my .ssh/config [04:11:18] ah [04:11:20] http://bastion.wmflabs.org/ [04:11:21] http://bastion.wmflabs.org/  [04:11:22] bastion1 is actually bastion.wmflabs.org [04:11:23] yeah [04:11:27] wasn't responding for me [04:12:25] try build2 [04:12:26] i can't get into it [04:12:30] but it responds to pings [04:12:36] build2.pmpta.wmflabs [04:13:49] can you get in? [04:15:16] trying [04:15:39] ottomata: yay ! I'm in ! thanks ! [04:16:13] ottomata: I have the same directories as in build1 [04:16:17] ottomata: did you clone it ? [04:17:36] no, labs uses nfs [04:17:38] for home [04:18:05] so the /home is mounted on both build1 and build2 [04:18:18] but the /usr or /otherstuff are separate [04:18:20] right ? [04:20:28] right [04:20:38] ok time for bed [04:20:52] have fuuuuuun! [04:20:54] laters! [04:21:05] ttyl , thanks ! [04:30:29] ottomata, average_drifter: you guys are crazy [04:30:56] hey drdee [04:31:12] hey ori-l [04:31:44] have you guys made a decision re: data serialization formats? thrift, protocol buffers, avro, whatever? [04:31:51] i guess if you go with kafka it's got its own thing [04:34:02] no we haven't, but looking at what we have tested so far, it seems that thrift comes along quite often [04:35:13] any reason why? [04:36:06] this is purely an observation, that i have seen thrift support quite often, but then again avro seems to be the default for apache projects [04:37:00] not sure if we should make that decision right now, so in the end i think it will be between thrift and avro but it might be a bit too early to make that call [04:38:52] i gotta go to bed, it's too late :( [04:44:00] oops, sorry [05:40:58] hey Eloquence [05:40:59] hey StevenW [05:41:24] ori-l: I worked extensively with protobuf [05:41:48] ori-l: hardcore protobuf, I was actually wanting to write a Perl XS(C++bindnings for Perl) module for Protobufs [05:42:27] ori-l: I'd love to discuss them with you if you're interested on using them on any project :) [05:44:44] average_drifter: definitely [05:44:48] any snags? [05:44:59] hey man [05:45:02] snags= ? [05:45:14] have you run into problems, annoyances, etc? [05:45:24] well I can tell you waht I've used from it [05:45:26] things that seemed great at first and turned out to suck? ways in which protobufs were limiting? etc [05:45:56] so basically I worked on a custom cache with business specific logic... 
it was written in C/C++ [05:46:08] this is about my previous dayjob(before working on wikipedia) [05:46:20] and they had a clumsy data transmission set up(like real clumsy) [05:46:22] anyway [05:46:39] so I researched msgpack, thrift, protobuf [05:46:56] actually I didn't have time to research thrift because they kept cutting my research time [05:47:13] I was planning to research avro too [05:47:36] but they cut my research time drastically(part of teh reasons I didn't like it there so I left) [05:47:48] anyway, so I ended up with just protobuf and msgpack [05:47:51] protobuf is really really cool [05:48:01] you get to basically write a .proto file which describes how your data looks [05:48:50] then there's a protoc compiler which turns what you wrote into Java/C++/Python classes(there are parsers/transformers for other languages as well, actually I was planning to write a fully complete code generation thing like that for Perl) [05:49:13] so that protoc will generate C++/Python/Java classes that describe your messages you wrote in the .proto file [05:49:37] which is really cool because it cuts down development time, and you don't have to do all that low-level serializing/deserializing stuff yourself [05:49:46] and it has some options for compressing stuff [05:49:58] you can choose from a variety of different data types [05:50:06] like string, int32, int64, you have enums [05:50:31] bool, uint32 [05:50:42] yeah, i'm aware of the fundamental selling points [05:50:43] you can even say that some things in your message are required or optional [05:50:46] stuff like that [05:50:53] and it generates setters and getters for you [05:51:14] and also ! if you have some fields marked as "required" , it will tell you that you didn't set it [05:51:27] but i have the following concerns: 1) i like text. text is readable. text is debuggable. text is greppable. [05:51:29] or if you have it marked as "optional" it will tell you that you didn't set it [05:51:50] if you're using it (you can call stuff like has_ and it'll tell you if the message has that) [05:52:00] ori-l: yes text is all that [05:52:10] ori-l: but protobuf is really meant if you want like hardcore performance [05:52:43] ori-l: it packs up the data very neatly, it takes care of all the htons ntohs network->host and host->network byte ordering [05:53:16] the required / optional thing is great [05:53:19] ori-l: you can always inherit from the classes Protobuf generates and add a "print" method if you wanna debug your data [05:53:27] ori-l: yeah that's awesome ! [05:53:37] ori-l: and also you get a "repeated" thing [05:53:42] i think we wanna use it for data validation [05:53:51] yeah, i read the docs today and played around with it a little [05:54:18] ori-l: and what this does is, you can define a field that repeats, so basically an array, and you have no limitation on the size of that, you can just push a lot of stuff into it [05:54:31] ori-l: on which project ? [05:54:39] event logging [05:55:05] ori-l: I've done 2-3 months of development with Protobuf and I would be really interested and ready to bang if you have a project that involved it [05:55:13] ori-l: what language do you plan to use ? [05:56:35] ori-l: also, the neat thing about it is that you can define messages like a custom message called say... CustomMessage, and of course you can include that in other messages like [05:56:43] well, our data analysts uniformly love python, so for data analysis things, probably python. as for server / routing / etc, dunno. 
probably build a prototype in python and re-implement in C / go / java iff necessary [05:56:48] OtherMessage { [05:56:56] repeated CustomMessage msg; [05:56:57] }; [05:57:10] yeah, that's awesome [05:57:52] that sounds interesting [05:58:01] I know some Python [05:58:19] ori-l: do we have a datasource to test stuff on ? [05:58:47] not yet, but by friday [05:58:50] ori-l: what I'm actually asking is if kraken provides us with some hose that delivers some data [05:59:01] not to my knowledge [05:59:08] ori-l: that's cool [06:00:23] event log data starts its life as a simple http request log [06:00:24] ori-l: what I didn't like about msgpack is that although my findings were that it was slower than Google protobufs [06:00:33] ori-l: on msgpack's website they say "oh, it's the best" [06:00:57] ori-l: I guess it's kind-of a status quo that every such piece of software has it's author's saying "it's the best" [06:01:40] yeah [06:01:45] so I wouldn't rely much on benchmarks on the official website if you wanted to choose between thrift vs. msgpack vs. agro vs. google protobufs [06:02:40] well, so, the data starts out as a simple http request log and at least initially no further processing is done, it'll be sent in plaintext/utf-8 via udp [06:03:32] so throughput will likely be bottlenecked there regardless of what things we use further down the pipeline [06:04:18] so compression isn't a good motivation for protocol buffers in this particular instance. but we can allow producers to define a data model and then hold them to it [06:04:43] so use protobufs as a DSL for declaring data models and as a kind of "contract" [06:05:01] yeah [06:05:14] I've used it on both client and server [06:05:33] so basically you write data in a format on one side and you use the same classes generated by protobuf on the other side [06:05:40] and it'll throw you exceptions when stuff gets bad [06:05:45] jay kreps talks about this in his kafka talk at airbnb [06:05:48] considering we're using UDP this will be a concern [06:05:51] but it isn't part of kafka [06:06:09] it's another piece of the stack at linkedin that they haven't open-sourced. [06:06:14] jay = jeremyb ? [06:06:15] but it's a good idea [06:06:30] ori-l: which piece haven't they OSS-ed yet ? [06:06:46] no no, the lead author of kafka, he gave a talk at the hq of airbnb, an SF company -- i sent an email about it last month to the analytics mailing list [06:07:26] ori-l: does he mention protobuf in his talk ? [06:08:02] i don't remember if he mentioned protobufs specifically, but he mentioned having some centralize data store with definitions of all the different event models [06:08:57] .proto files are just text, you can store them anywhere [06:09:39] yeah [06:10:06] anyways i ended up not being too impressed by kafka but liking some of the other stuff he described, like this approach [06:14:00] ori-l: so I guess on friday there's gonna be a release of kraken somewhere right ? [06:14:54] dunno [06:15:30] 08:56 < average_drifter> ori-l: do we have a datasource to test stuff on ? 
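For reference, a hedged sketch of the kind of .proto file being described here: CustomMessage and OtherMessage are just the placeholder names used above, and note that real proto2 syntax also needs a tag number after every field, which the snippet in the chat leaves out.

    // illustrative only; run e.g. `protoc --python_out=. events.proto` (or --cpp_out / --java_out)
    message CustomMessage {
      required string name  = 1;  // generated classes refuse to serialize if a required field is unset
      optional int64  value = 2;  // generated code exposes has_*-style checks for optional fields
    }

    message OtherMessage {
      repeated CustomMessage msg = 1;  // "repeated" = an unbounded list of CustomMessage
    }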
[06:15:31] this is a separate system designed for a much smaller scale, i think if it works well enough parts of it will be incorporated into kraken [06:15:34] 08:56 < ori-l> not yet, but by friday [06:15:35] ^^ [06:17:31] no, the data source is just a subset of the request log on bits.wikimedia.org [06:18:20] which is going to get broadcast via udp to some machine purposed for data collection [13:31:15] waammp waaammmp goodmorninnnnng [13:37:51] bam bam bam bam another big data day, hiiyayaaaaaaa!!!! [13:38:56] morning ottomata, milimetric, average_drifter [13:39:16] good morning drdee :) [13:39:50] haha, morning [13:39:56] man I was going on a week long coffee hiatus [13:39:59] just broke it [13:40:03] :) [13:40:07] only lasted 4 days [13:40:17] i coulda lasted longer but we were just served coffee soooo [13:40:28] counts as a work week 'cause of columbus day? [13:40:37] hmmmm, yeah! [13:40:38] hahaa, average_drifter asks ottomata to deploy a new package at 11:30 PM, which means it must have been 5:30AM for him [13:40:46] hah yup [13:41:02] he is crazy :) [13:41:07] big data does not sleep [13:41:17] apparently not [13:41:32] but we should test the packages today and try to get them deployed [13:41:36] i had a funny scary dream about this [13:41:51] do tell! [13:42:04] in the dream i was awoken by our pet horse who was eating napkins [13:42:16] i panicked and said to Stephanie - omg omg, we forgot to feed the horse it looks terrible [13:42:32] so i went and prepared the food and then couldn't find the horse [13:43:11] which is exactly how I feel about trying to change Limn. I keep realizing what's wrong and as I'm fixing it I can't find it any more. [13:43:31] :) [13:43:52] * drdee is rolling on the floor..... [13:43:54] i couldn't be happier that dschoon is back today [13:45:26] ottomata, what do you feel like doing today? [13:46:21] working on ganglia atm [13:46:30] gotta understand how we have it set up and how it works [13:47:21] this might be useful: http://www.ryangreenhall.com/2010/10/monitoring-hadoop-clusters-using-ganglia/ [13:47:26] but maybe slightly outdated [13:49:27] reading [14:23:24] ottomata, useful reference https://ccp.cloudera.com/display/CDH4DOC/Configuring+Ports+for+CDH4 [14:50:32] ottomata, can you access build1 instance on labs? [14:53:36] nm [14:53:40] it's back [15:24:43] ottomata, i am going tweak some more hadoop settings, cool? [15:25:18] yeah that's cool [15:28:27] drdee, i'm going to be making hadoop changes soon too [15:28:29] for ganglia [15:28:32] might be restarting t hings too [15:28:40] np [15:28:41] i'll check with you and job list to make sure you aren't running anythign [15:28:49] not running anything atm [15:31:00] can i run puppet on kraken? [15:31:30] yup [15:31:36] running.... [15:32:55] gonna run a quick benchmark [15:33:03] someday soon I'm going to bring the cdh4 and kraken-puppet stuff over to operations/puppet [15:33:06] ok [15:57:10] ottomata, can i rerun puppet again? [15:59:20] ottomata ^^ [16:03:59] is there a way to put out an APB for someone on IRC? [16:04:00] :) [16:04:06] yup [16:36:12] ottomata, milimetric: https://gist.github.com/559a5e9f6ffb7bd4cab9 [16:44:38] 2.5TB, is that more than expected? 
[16:45:03] i vaguely remember David's Kraken writeups talking about 1.5TB [16:45:16] but I thought that was compressed [16:47:07] so this is uncompressed data [16:47:47] and we can drop stuff from the logs like hostnames, sequence numbers, the type of request, so there is a lot of room for improvement [16:47:57] so i think there are two takeaways: [16:48:13] 1) it's more data then we expected (based on first 24 hour sample) [16:48:28] 2) there are very strong temporal dynamics going on [16:48:37] stronger than i expected [16:50:06] hm, seems pretty intuitive. It's saying that for roughly every 12 people generating traffic at 2pm there's one generating traffic at 6am [16:50:47] I'd expect less people to be wikiing at 6am :) [16:50:57] but this is worldwide traffic right [16:51:06] oh rly? [16:51:06] so you would expect asia to compensate [16:51:15] but it doesn't [16:51:20] oh that's a little wild [16:51:35] so the fact that the numbers make sense for the US is in fact illogical [16:52:17] wait, what timezone is that [16:52:25] UTC 0 [16:52:33] GMT [16:54:27] oh so traffic ramps up around 5-6EST and stops around 4-5EST. Are you sure these are all servers serving everyone? [16:54:37] Keep in mind spikes tend to follow EU timezones more than US. [16:55:28] paris wise that means the traffic starts around 11-12pm and stops around 10-11pm [16:55:56] i assume that this is all traffic, if it's not the case then we might have a bigger issue :) [16:56:16] this is the squid log data that we use to calculate total pageviews [16:56:21] so it should be all traffic [16:56:29] yup [16:57:10] moooornin [16:57:26] morning! [16:57:54] hiiiyaaaa! [16:58:24] dschoon, we were just looking at how much raw unfiltered uncompressed traffic data we can expect based on a 24 hour sample: [16:58:24] https://gist.github.com/559a5e9f6ffb7bd4cab9 [16:58:28] morning! [16:58:52] cool [16:58:54] i'll check it out [16:59:55] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 [16:59:56] nice [16:59:57] that's awesome [17:00:12] we will run the script for at least 7 days [17:00:14] so for, Beijing this means ramp up at 6-7pm and ramp-down around 5-6am [17:00:25] hm, makes sense if they're at work all day maybe :) [17:01:28] beijing? [17:12:37] brb 20-30, heading into the office. [17:17:24] Asanas just launched subtasks [17:22:20] hey Eloquence, welcome to the analytics hangout! [17:23:16] heyhey [17:28:30] louisdang, your gist only contained 10 lines, can you paste a slightly larger snippet with the output of your pig job? [17:29:36] drdee: one sec [17:30:20] k [17:37:13] drdee: right, so the code is still a work in progress. [17:37:35] from what I remember I still need to group the counts by day? [17:37:42] are there any other requirements? [17:42:22] here's the gist: https://gist.github.com/3869339 [17:45:31] drdee? [17:46:47] yo [17:48:20] * drdee is reading [17:50:24] louisdang, and can you group by year/month as well? [17:50:28] ok [18:01:36] drdee: is there a preferred format for the output? [18:03:10] ping ping [18:05:21] ottomata: morning [18:05:26] drdee: morning [18:05:34] morning milimetric [18:05:38] morning! [18:05:53] :) that would be kind of nice waking up around now [18:05:57] :) good morning [18:06:16] wait what? It's like 9pm for you [18:06:20] it is [18:06:32] got blinds set up on the windows, I just shut them and go to sleep when mornin comes [18:06:49] trying to regulate my sleep so I get up at ~15:00 [18:06:53] oh interesting. Don't you miss sunshine? 
[18:07:03] no, I have flashlights [18:07:17] hm, wonder how good those are for your vitamin D levels [18:07:38] I get plenty of fruits [18:07:47] haha [18:09:11] ottomata: are libanon and libcidr packaged ? [18:11:13] yes [18:11:24] they should be in wikimedia apt, so you should be able to install [18:11:53] currently, one is on github and one in gerrit, I will fix that discrepency as soon as gerrit github sync is set up for all repos [18:11:55] which should be soon [18:12:03] footnote [18:12:05] i will do that soon [18:13:24] do what? [18:13:29] github gerrit sync [18:13:32] oh, hm [18:13:37] chad is setting it up for all projects [18:13:40] in gerrit [18:13:44] eh? [18:13:44] not a two way sync though [18:13:50] yeeah [18:13:50] but [18:13:55] we want it in the other direction! [18:13:59] github -> gerrit [18:14:02] NOT gerrit -> github [18:16:19] hey dschoon, you're back! [18:16:33] what do you sync repos with ? is this chad an OSS project for that ? [18:16:39] when can I steal you away? [18:16:43] perhaps a line in cron can fix that ? [18:16:56] one sec [18:16:57] meeting [18:17:14] np, ping me when you're free [18:18:18] mwerr, i don't care so much for those to repos [18:18:24] haha [18:18:28] no chad is a dude [18:18:51] guys, i'm going to work on merging my secondary puppet stuff into origin/production today [18:19:08] i think i'm running into issues now, and I'd rather eliminate this one point of potential confusion [18:25:51] heya milimetric -- i added an invite so i could reserve a room [18:25:55] i'll be back in a sec [18:26:05] cool, thx [18:27:56] milimetric: i'm in the hangout. join whenever. [18:48:10] just fyi, ottomata, i think i'm gonna dedicate most of today to working with dan [18:57:54] das cool [19:14:27] ottomata, pig question [19:14:50] i am adjusting status_count.pig in my home folder [19:15:00] yessuh? [19:15:07] and i want to do an extra grouping: whether the site is mobile or not [19:15:14] how would you go about this? [19:15:25] do you know how to id that? [19:15:27] from the url? [19:15:35] i have a double grouping example [19:15:37] ... [19:15:52] yes, the url contains .m. [19:16:10] https://github.com/wmf-analytics/kraken/blob/master/src/pig/monthly_subdomain_counts.pig [19:16:27] mainly this bit: [19:16:27] MONTH_SUBDOMAIN_GROUP = GROUP MONTH_SUBDOMAIN BY (month, subdomain) PARALLEL 3; [19:16:40] ideally i would like to do something like if url contains .m. then output 'mobile' else 'desktop' [19:16:48] i don't want it for all the different subdomains [19:16:50] aye [19:17:01] i think you can do that, add a custom field to a bag [19:17:40] you could probably do it as part of one of your generate statements [19:18:54] somethign like [19:20:44] SITE = FOREACH LOG_FIELDS GENERATE FLATTEN (RegexExtract(uri, '\.m\.') as site:chararray; [19:20:44] SITE = FOREACH SITE GENERATE $0, ($0 == '.m.' ? 'mobile' : 'desktop'); [19:20:45] maybe? [19:20:47] then [19:20:56] GROUP SITE BY ($0, $1) [19:20:56] ? [19:20:59] sometjhing like that? [19:21:04] just guessing here [19:22:24] thx, i will give it a spin [19:26:32] drdee: you might also consider looking into the bitfield tools for these sorts of things [19:26:48] because in the distant future, it will be unpleasant to have an ever-expanding giant list of strings [19:26:51] for tags [19:26:53] brb lunch! 
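For reference, a cleaned-up sketch of the mobile/desktop grouping being guessed at above. It assumes a LOG_FIELDS relation that already carries uri and http_status fields and simply tests whether the URI contains '.m.'; the names are illustrative, not the actual status_count.pig.

    -- tag each request as mobile or desktop, then group and count
    SITE = FOREACH LOG_FIELDS GENERATE
        (uri MATCHES '.*\\.m\\..*' ? 'mobile' : 'desktop') AS site:chararray,
        http_status;
    SITE_GROUP = GROUP SITE BY (site, http_status) PARALLEL 3;
    SITE_COUNT = FOREACH SITE_GROUP GENERATE FLATTEN(group) AS (site, http_status), COUNT(SITE) AS num;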
[19:39:17] drdee: interesting analytics problem; if I wanted to know how long it took between a user loading a page; and then loading another page on the same request (ie: Article to Banner load) is there anything ready made for that? [19:39:49] Jeff_Green and I were tossing around a method of sampling on a wiki and then correlating to our banner logs on IP/User-Agent/Referrer [19:40:03] we would have to setup a new filter with very specify url's and 1:1 sampling to do this [19:40:12] once we have the data we can infer the loading time [19:40:47] but it's tricky, because browser rendering performance != server reply time [19:42:34] I think that's Ok though, because the banner load is one of the very last things that happens on a page request so it'll give us a minimum bound at least for how long a banner takes to load [19:44:49] it'll also tell you *whether* a banner loaded at all [19:45:10] now that I think of it, that's what I was looking for when I tried this before [19:45:51] so it was a good thing then that you got the bot s/n ratio [19:46:15] mostly i got confused and overwhelmed :-( [19:50:49] ah; in any case drdee; is this something that can be set up? or if it would be too much of a hassle; do we have any data (that would include the ip/user-agent/request url) on any wiki from the last month or so that I can run against fundraising's collection of banner logs? [19:51:27] it would involve setting up a new instance of udp--filter [19:51:38] that's not very hard an Jeff_Green knows how to do that :) [19:51:43] ottomata and I can be of assistance [19:52:15] so one of doing this I guess [19:52:41] would be to setup a filter that filters for en.wikipedia.org/wiki/Main_Page and for the URL that contains the banner [19:53:04] set that to 1:1 and you have sufficient data in a heartbeat [19:54:42] ok; so... the banner URL prefix (because we add the country on for some legacy reason) is http://meta.wikimedia.org/w/index.php?title=Special%3ABannerLoader&banner=B12_JimmyBlank&campaign=C12_bitest&userlang=en&db=enwiki&sitename=Wikipedia&country= [19:55:38] yeah so it would be something like: [19:56:26] udp-filter -d en.wikipedia.org -p /wiki/Main_Page,title=Special%3ABannerLoader&banner=B12_JimmyBlank&campaign=C12_bitest&userlang=en&db=enwiki&sitename=Wikipedia&country= [19:56:46] i assume that the banner url is unique and that the same banner is not being hosted on different wikis as well [19:57:08] this actually might not work :( [19:57:12] we can make that happen [19:57:24] I can target enwiki specifically [19:57:31] because you are trying to filter from two different domains [19:57:38] crud [19:57:39] can you run the banner from en.wikipedia.org? [19:57:45] sadly no [19:57:47] else we have to make a small change to udp-filter [19:58:07] ohhh wait [19:58:20] udp-filter -d en.wikipedia.org,meta.wikimedia.org -p /wiki/Main_Page,title=Special%3ABannerLoader&banner=B12_JimmyBlank&campaign=C12_bitest&userlang=en&db=enwiki&sitename=Wikipedia&country= [19:58:23] this might work [19:58:41] but please be careful with packet loss [19:58:59] ottomata, do you think that udp-filter would work [19:59:01] ^^ [20:00:02] the only downside is that this filter will also log meta.wikimedia.org/wiki/Main_Page [20:00:12] so you would have to discard those observations [20:00:32] that I can do; or, does it help if we run two different udp-filter instances? 
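A rough sketch of the correlation being described above: join 1:1-sampled Main_Page hits against banner-load hits on (IP, User-Agent) and take the timestamp difference as a lower bound on how long the banner took to load (and whether it loaded at all). The field positions, file names and timestamp format are assumptions about the log layout, not a tested script.

    from datetime import datetime

    def parse(line, ts_field=2, ip_field=4, ua_field=13):
        """Extract an (ip, user_agent) key and a timestamp from one log line; field indexes are guesses."""
        f = line.split(' ')
        ts = datetime.strptime(f[ts_field].split('.')[0], '%Y-%m-%dT%H:%M:%S')
        return (f[ip_field], f[ua_field]), ts

    page_views = {}
    with open('main_page.log') as pages:
        for line in pages:
            key, ts = parse(line)
            page_views.setdefault(key, ts)        # keep the earliest Main_Page hit per (ip, ua)

    with open('banner.log') as banners:
        for line in banners:
            key, ts = parse(line)
            if key in page_views:
                delta = (ts - page_views[key]).total_seconds()
                if delta >= 0:
                    print(delta)                  # minimum bound on banner load time, in seconds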
[20:00:44] I can deal with multiple log files [20:00:47] that's also an idea [20:01:10] however that might really trigger packet loss because you need to run both at 1:1 to make the joining possible [20:01:35] ottomata: sudo apt-get install curl on stat1, if/when you have a moment? [20:01:52] also my example assumes that your banner is shown on the main page :) [20:02:56] it is indeed shown on the main page [20:03:17] ori-l: done [20:03:19] I'm thinking actually that we get enough traffic on enwiki though; that if we downsample we will still get usable results [20:03:24] um, you know drdee [20:03:45] if you don't mind turning off our hourly bytecount thing on an01 [20:03:49] we can easily run a 1:1 filter there now [20:03:56] since log2udp is relaying everythign over [20:04:00] and not worry about breaking a thing [20:04:03] or run it next to it? [20:04:06] good point [20:04:14] can we run? [20:04:15] oh yeah why not? [20:04:22] next to it [20:04:23] let's do that! [20:04:24] hmm [20:04:26] drdee: did the libdb deps on the webstatscollector package work for you ? [20:04:37] average_drifter: i think they did [20:04:39] IIRC [20:04:41] drdee: I just tried to install the package and got errors because of the deps although I have it instaled [20:05:24] ahh, drdee, naw, its not a multicast doody, i can't run next to it [20:05:49] ottomata, me slightly confused [20:06:00] udp2log instance is running [20:06:07] why can' we deploy another filter on that box? [20:06:21] it isn't udp2log [20:06:25] it could be though…that would be cool [20:06:26] that would do it [20:06:38] log2udp is relaying to a udp port on an11 [20:06:44] and I am currently just netcating [20:07:03] i think my script as is wouldn't work though [20:07:11] k, wasn't aware of that [20:07:17] we can temporarily disable the script [20:07:35] k, uhhhhh, what filter do you want me to run? [20:07:42] average_drifter:mmmmmmmmmm sounds like more debugging :D [20:08:35] this? [20:08:35] udp-filter -d en.wikipedia.org,http://meta.wikimedia.org/ -p /wiki/Main_Page,title=Special%3ABannerLoader&banner=B12_JimmyBlank&campaign=C12_bitest&userlang=en&db=enwiki&sitename=Wikipedia&country= [20:09:27] i could do this: [20:09:27] udp-filter -d en.wikipedia.org,meta.wikimedia.org -p /wiki/Main_Page,title=Special%3ABannerLoader&banner=B12_JimmyBlank&campaign=C12_bitest&userlang=en&db=enwiki&sitename=Wikipedia&country= | grep -v meta.wikimedia.org/wiki/Main_Page [20:09:39] filter out the meta.wm.org/wiki/Main_Page right there [20:09:42] should I do that now? [20:09:45] drdee^ [20:10:25] yes [20:10:26] sounds good [20:10:38] probably only needs to run for a couple of minutes [20:10:40] k [20:10:42] ... if that [20:10:44] udp-filter needs support for full url filtering [20:18:54] ottomata: thanks! [20:21:55] reading [20:22:59] drdee: can you please give me more details in relation to the debugging ? [20:23:22] I just read what you guys wrote above but I don't know all the context [20:26:54] debugging what? [20:27:55] average_drifter: can you help me and ottomata fix a bug in udp-filter? [20:28:03] apparenty the -d option does not work accurately [20:28:34] nopy, i mean, i can awk + grep it or something [20:29:47] drdee: yes [20:30:06] the -d option is not working correctly [20:30:25] maybe you can have a look at that as well? [20:31:20] ummm, drdee, -p doesn't work either [20:31:32] uhhhhh?????? 
[20:31:35] drdee: please tell me how to replicate the bug [20:31:43] just run [20:31:58] udp-filter -d en.wikipedia.org and it should capture all url's with the domain en.wikipedia.org [20:32:00] $ cat /tmp/main.log | grep '/wiki/Main_Page' | wc -l [20:32:00] 5775 [20:32:04] $ cat /tmp/main.log | /home/diederik/udp-filter -p '/wiki/Main_Page' | wc -l [20:32:04] 0 [20:32:25] ottomata, not sure if you need apostrophes [20:32:39] same without [20:32:47] $ cat /tmp/main.log | /home/diederik/udp-filter -p /wiki/Main_Page | wc -l [20:32:47] 0 [20:32:52] k [20:33:18] hm [20:33:37] drdee: there were some rules we put as default [20:33:50] drdee: I think one of those is misbehaving [20:34:00] I'm checking it out [20:34:01] k [20:34:02] * mwalker amused [20:34:16] btw, it is also not working in the already packaged version of udp-filter [20:34:23] so unless you've recently deployed a new one to our apt [20:34:26] it is not something you broke! [20:34:28] ottomata: can I have that /tmp/main.log please ? [20:34:30] right? [20:34:31] sure [20:34:36] umm, where? [20:34:39] can I put it on stat1? [20:34:40] ottomata: build1 please [20:34:43] errggh [20:34:47] i can't get into labs stuff right now [20:34:53] but the weird thing is that the unit tests are passing [20:34:57] ottomata: stat1 [20:35:00] ok [20:35:28] what ip address does the hit come from ottomata? [20:35:42] average_drifter: stat1:/tmp/main.log [20:35:43] drdee: which of the unit tests is testing this ? I should look in to see [20:35:49] ottomata: thanks, I'm scp-ing it [20:35:52] th ehit? [20:35:56] the hit? [20:35:57] yes [20:35:58] yes [20:36:22] lots of them? [20:36:25] what hit? [20:36:38] ottomata: what do you mean by hit ? [20:37:06] i dunno, what' drdee mean by hit? [20:37:13] it's working fine on build1 [20:37:17] bwa [20:37:41] doesn't work on stat1 [20:38:01] it doesn't work on oxygen [20:38:03] drdee: skype ? [20:38:12] stat1 is precise, oxygen is lucid [20:38:13] drdee , ottomata skype ? [20:38:14] output format seems to have change [20:38:15] dd [20:39:52] it seems that there are two sequence numbers are present in the log file [20:39:57] field 1 and field 3 [20:40:05] obviously that would break udp-filer [20:40:15] field 1 should be hostname [20:40:32] ottomata ^^ [20:41:15] i am not sure what field 1 is..... [20:41:40] drdee: is that output produced by udp-filter ? [20:41:46] oh format of logs! [20:41:49] yes [20:41:50] drdee: you mean field1 in output of udp-filter ? [20:42:00] i mean field 1 of main.log [20:42:14] hm, interesting [20:42:23] i'm getting those out of netcat [20:42:30] i betcha udp2log removes them [20:42:44] it could be millisecond since epoch [20:42:46] or something like that [20:42:47] hmmm, noooooo, because i'm relaying [20:42:58] naw [20:42:58] 1349988174 [20:42:59] that' snow [20:43:20] no, the are seq numbers [20:43:26] maybe log2udp relay adds them? [20:43:44] sequence numbers are in field 3 [20:44:07] or at least, they used to be right after the hostname [20:44:16] yup [20:44:16] prefixLength = sprintf(outBuffer, "%llu ", counter); [20:44:25] user@garage:~/wikistats/udp-filters$ head -1 main.log | perl -MDateTime -ne '/^(\d+)/ && print DateTime->from_epoch(epoch => $1)' [20:44:32] looks like a timestamp to me [20:44:34] 2008-11-09T16:13:37u [20:44:39] :D [20:44:42] i think that is a coincidence [20:44:42] yep it is [20:44:57] if I look at the raw output from the relay [20:45:03] the numbers increment for each line [20:45:12] log2udp relay is prepending seq nums [20:45:14] so good to know! 
[20:45:17] it is not udp-filter fault [20:45:21] phew [20:45:24] sorry for the false alarm, i'll strip them out [20:45:24] pffeeww :) [20:45:27] and pipe them through [20:45:29] cool [20:45:45] and this is great way to test v 0.3.14 of udp-filtesr:) [20:45:56] :) [20:46:17] I'm here, you guys know all software has bugs. If any are my fault I commit to solving them [20:46:42] don't worry [20:46:55] let's fix webstatscollector :) [20:47:12] drdee: ok, working on deploying on a vm currently, in testing phase, will be ready soon [20:47:22] just use build1 [20:47:28] Actually, average_drifter, our software has no bugs. [20:47:41] we on the wmf analytics team are strongly committed to the theory of "user error" [20:48:22] fyi, ottomata, i talked with chad, and he's only planning to do gerrit sync for mediawiki/* [20:49:20] oh, hmm, i think maybe doing all of that by default [20:49:31] i think once he has that all set up he'll do custom requests [20:52:44] theerrreeee we go [20:52:48] how long do you want me to run this? [20:52:53] 20M so far [20:52:55] in a few seconds [20:53:17] where can i tail the data? [20:53:31] /home/otto/en.wikipedia_and_banner.log [20:53:50] on an11? [20:53:52] ja [20:53:57] 100M now [20:55:17] aaah! kill it! [20:55:23] 200M [20:55:24] hehe ok [20:55:35] but i don't see hits for the Main_Page [20:55:43] about 3 minutes of data [20:55:45] waaa [20:55:45] no? [20:55:53] oh; shit; ya, let me target just the US with a banner [20:56:19] you're getting worldwide; and almost every wiki has that banner enabled :p [20:56:22] what there are tons [20:56:46] okay there are en.wikipedia hits [20:56:53] but they are crowded out by the banners [20:56:59] there are both [20:57:10] grep http://en.wikipedia.org/wiki/Main_Page en.wikipedia_and_banner.log | wc -l [20:57:10] 41435 [20:57:24] right but 41k is not a lot [20:57:26] wait we could do this different [20:57:38] drdee, i gotta run kinda, i'm sorta here though [20:57:42] to get the firehose [20:57:47] just run this [20:57:48] on an11 [20:57:48] netcat -lu 10.64.36.111 8420 [20:57:51] use the -f option and set that to en.wikipedia.org/wiki/Main_Page [20:57:55] you can pipe that into whatever you want [20:58:05] k [21:01:15] mwalker, what is the url of the banner? [21:01:58] drdee: creating it now; hold 5 please [21:02:12] k [21:13:30] drdee: ok, the banner url will start with http://meta.wikimedia.org/w/index.php?title=Special%3ABannerLoader&banner=B12_JimmyBlank_MW [21:14:08] we have to wait for some cache to clear before we'll start seeing hits on it though [21:16:22] ok; should be seeing hits from that banner now [21:24:53] average_drifter: familiar with awk? [21:26:54] drdee: no, but Perl > awk [21:26:56] :D [21:27:03] i'll use cut [21:27:07] :) [21:28:23] drdee: you can easily replace awk/cut functionality with split in Perl [21:29:39] cut did the job [21:29:52] oh i forgot how to tell you to strip the seqs [21:29:52] i did [21:30:05] awk '{$1="";print}' | sed 's/^\s//' [21:30:11] cut -d ' ' -f2-15 [21:30:13] kinda hakcy, ahh [21:30:20] i wanted to delete, instead of select [21:30:23] that's cool [21:31:01] ottomata, it kills my second filter after launch.. [21:31:58] did you escape stuff? [21:37:12] wmalker: i am not seeing any traffic on the banners yet..... [21:37:18] mwalker ^^ [21:39:04] curious; it's definitely live [21:40:47] drdee ...and it's going to every page on enwiki served through any datacentre; so... 
there should be lotsa stuff coming at you [21:41:06] stuff is coming, just not that particular banner [21:42:38] so nothing at all with the prefix http://meta.wikimedia.org/w/index.php?title=Special%3ABannerLoader&banner=B12_JimmyBlank_MW [21:42:39] ? [21:42:58] a full US URL will always look like http://meta.wikimedia.org/w/index.php?title=Special%3ABannerLoader&banner=B12_JimmyBlank_MW&campaign=C12_1011_MW_test_FR&userlang=en&db=enwiki&sitename=Wikipedia&country=US [21:43:57] i am hitting all kinds of special pages but not your banner [21:45:07] ottomata: can you please tell me what to add to my /etc/apt/sources.list so I can have the wmf deb repo where libanon and libcidr reside on ? [21:47:17] ottomata: oh I can just make them debs from here https://github.com/wmf-analytics/libcidr.git [21:48:43] yep except for libanon which isnt there [21:48:55] is libanon packaged somewhere ? because we have it as a dep [21:49:03] check gerrit? [21:49:05] i think it's there. [21:49:16] https://gerrit.wikimedia.org/r/#/admin/projects/analytics/libanon [21:49:20] average_drifter: ^^ [21:49:47] yes but average_drifter is looking for the deb [21:50:08] I'll git clone this https://gerrit.wikimedia.org/r/p/analytics/libanon.git [21:50:13] I saw some debian files in there [21:55:47] eh? [21:55:51] well they are on apt.wikimedia.org [21:55:56] they should be available on labs [21:56:02] aptitude search libanon [21:56:02] no? [21:56:08] aptitude search libcidr [21:57:19] average_drifter: [21:57:24] http://apt.wikimedia.org/wikimedia/pool/main/libc/libcidr/ [21:57:24] http://apt.wikimedia.org/wikimedia/pool/main/liba/libanon/ [21:57:25] ottomata: yeah, adding it now [21:58:39] ottomata, what was your full command line [21:58:41] ? [21:58:50] i am barely seeing any banner requests [21:59:19] eh? [21:59:20] umm, this [21:59:26] netcat -lu 10.64.36.111 8420 | awk '{$1="";print}' | sed 's/^\s//' | udp-filter -d en.wikipedia.org,meta.wikimedia.org -p "/wiki/Main_Page,title=Special%3ABannerLoader&banner=B12_JimmyBlank&campaign=C12_bitest&userlang=en&db=enwiki&sitename=Wikipedia&country=" | grep -v meta.wikimedia.org/wiki/Main_Page > en.wikipedia_and_banner.log [21:59:37] no wait [21:59:39] that's wrong [21:59:50] wait again, yes that's right [21:59:55] thought I had a quote weird [21:59:57] that's the one [22:01:14] alrgiht, i'm outty [22:01:16] laters boys@ [22:02:51] lates [22:04:31] average_drifter: can you make filtering internal traffic a command line option in udp-filter? [22:05:10] drdee: yes [22:14:41] drdee: enabled/disabled by default ? [22:14:50] disabled by default [22:15:39] ok [22:22:56] drdee: that renders the test to fail, I will update them [22:24:44] or better yet, I can add -t to all tests so they can pass [22:24:52] -t is the internal traffic rules param [22:24:57] the new one [22:30:04] can someone add me as a collaborator to wikimedia/limn? I want to add a label to an issue [22:31:31] dschoon, drdee, milimetric ^^ [22:31:44] sure [22:31:47] one sec [22:32:47] thanks [22:32:50] it's not urgetn [22:32:51] what's your username? [22:32:54] embr [22:33:22] i verified you by limnpy :) [22:33:25] check if it's working [22:36:40] thanks [22:36:42] works [22:56:51] later dschoon, milimetric, erosen [22:57:00] adieu [23:05:22] later dee [23:16:20] drdee, alolita is looking into feature analytics for the i18n stuff to get some basic usage data .. I've suggested she talk to you & ori-l about it [23:16:53] Eloquence: sure, happy to. [23:17:07] not much we can do just yet, tho. 
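Re the sources.list question above, a sketch of pulling libanon and libcidr from apt.wikimedia.org; the pool URLs quoted earlier put both packages in component "main", but the suite name used here ("precise-wikimedia") is an assumption and should be matched to the instance's release.

    # /etc/apt/sources.list.d/wikimedia.list  (use lucid-wikimedia on a lucid instance)
    deb http://apt.wikimedia.org/wikimedia precise-wikimedia main

    sudo apt-get update
    aptitude search libanon libcidr   # confirm the exact binary package names before installing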
[23:47:36] drdee: got counts by day, month and year