[00:14:52] drdee: I finished writing a script for the star wars kid logs [00:15:01] awesome! [00:15:10] can you paste the script and the results in a gist? [00:15:38] drdee: https://gist.github.com/3869339 for the code [00:16:21] it's based on the code written by ottomata in the kraken repo [00:17:33] https://gist.github.com/3869351 for the results on the first 100 lines of the star wars kid logs [00:19:23] very cool, i am gonna take a look at it later tonight [00:19:28] first i need my dinner :) [00:19:31] drdee: report card down? [00:19:39] or rather, empty [00:19:47] are you guys working on it? [00:20:03] nothing urgent, have dinner first ;) [00:20:37] no not working on i [00:20:48] charts are empty? [00:20:59] erosen, I just had to pull data from outside the enwiki slave and I feel your pain about the fact that staging is not replicated (pinged you on trello) [00:21:17] JS error maybe? [00:21:21] indeed [00:21:25] drdee: forget it, for some reason they didn't show up in the browser [00:21:36] works fine now [00:22:21] erosen: I'll ask asher this Friday if he can think of a solution to help with that [00:22:29] great [00:22:37] let me know if I can be of use in that process [00:24:29] I'll talk to him this Fri at 11.30 if you want to join [00:25:31] not necessary for you to attend, but you can explain the non-enwiki pain better than I can [00:27:29] die Nicht-Englisch-wiki Schmerzen [00:27:37] as the germans call it [00:28:07] drdee: thanks for the email! [00:29:23] ori-l: you are welcome ;) [00:42:53] DarTar, didn't we have policy for publication at the Research Committee [00:43:04] or at least guidelines [00:43:30] there's an OA policy proposal, never enforced [00:43:59] maybe breath some new life into that? [00:44:02] but that's mostly for external researchers, not hard for us to self-inflict a policy [00:44:37] Daniel is still very active on that front, I am sure he could help draft some language for WMF staff/contractors [02:55:17] hey [02:59:55] ok got it [03:00:03] drdee: ok, so it depends on the changelog [03:00:42] I read a lot of stuff, tried to change Makefile.am with PACKAGE_VERSION or VERSION or PACKAGE_STRING or AC_* with all of the aforementioned variables [03:00:45] didn't work [03:00:54] so, the changelog dictates the latest version of the package [03:02:08] so although we generate our own versions in teh debianize.sh script [03:02:16] and we use them later on ... [03:02:29] the version of the package is still dictated by the debian/changelog [03:15:24] ottomata: [03:15:26] ottomata: ! [03:15:28] ottomata: !! [03:16:01] ottomata: oh hi, do you want to deploy a package together ? [03:16:17] hiiiiiiii [03:16:17] haha [03:16:18] not really [03:16:20] haha [03:16:23] deploy [03:16:26] whatchyouuuu talking bout? [03:18:26] webstatscollector [03:19:01] i think i do not know much about deployment [03:19:07] its a package upgrade? [03:19:17] needs installed on some machine and then restarted or somethign? [03:20:40] average_drifter: do you mean upload it to the debian repo? [03:20:51] ori-l: well not yet , just a test run [03:20:59] ottomata: just a test run, so for example, I can give you a package [03:21:02] do you have access to labs? [03:21:02] ottomata: and we can deploy together [03:21:08] ottomata: and we can see if it goes bad/wrong [03:21:47] heh [03:21:50] average_drifter: in wmf vernacular "deploy" typically means push code to the production mediawiki instances [03:22:01] sounds fun! 
buuuut, maybe not something I want to do right before my bedtime [03:22:22] ottomata: you live in brooklyn, your bed time is not for another 3-4 hours :P [03:22:38] j/k [03:23:10] geeeet outta heerreeee [03:23:15] i am a midnight bedtime kidna guy [03:23:18] on weeknights anyway :p [03:23:59] I switched my schedule to night-time [03:24:03] ah, well have a good night then [03:24:14] daytime sucks for me, I can't get my stuff together during the day [03:24:54] I mean I'm normal, it's just that I have neighbours and sometimes they go "OH I KNOW, I'll drill some holes and make a lot of noise so everyone knows that they have busy neighbours" [03:24:58] anyway.. [03:25:21] haha [03:26:51] where are you stefan? [03:26:57] Bucharest/Romania [03:27:00] (wait, you are stefan, right?) [03:27:02] aye ja [03:27:04] yes I am [03:27:08] hope to get out of here soon [03:27:35] yeah? [03:27:49] I'm waiting for some data from someone, I might be in .us at some point for some time [03:28:07] * average_drifter package is building.. [03:34:45] ottomata: where do I put the package ? [03:35:49] ottomata: please scp otto@build1.pmtpa.wmflabs:/home/diederik/webstatscollector/webstatscollector_0.2_amd64.deb . [03:36:03] ottomata: can we get on a machine and try this deb out ? [03:36:15] well, you can dpkg -i it [03:36:20] in labs no problem [03:36:30] right? [03:36:51] I can, but it can mess up the package system if it's not good [03:37:02] ottomata: is messing up build1 a concern ? [03:37:05] i like to check dpkg-deb --contents [03:37:13] that will show you ever file it is going to install and to where [03:37:20] ummm, it'd be annoying but not the end of the world [03:37:28] you could spawn up or use another labs instance to try it [03:37:55] ottomata: how do I spawn up another labs ? [03:38:05] ottomata: is it complicated ? [03:38:11] naw [03:38:56] can you access this? [03:38:57] https://labsconsole.wikimedia.org/w/index.php?title=Special:NovaInstance&action=create&project=analytics&region=pmtpa [03:39:58] Sysadmin role required [03:39:59] You must be a member of the sysadmin role to perform this action. [03:40:21] aye [03:40:24] how about [03:40:47] yes ? [03:41:27] ah let's see i'll make a new instance [03:42:33] ok, i'm spawning up build2 [03:42:38] it'll take a few minutes before it comes online [03:42:50] no problem [03:42:58] ok I'll use build2 when it's ready [03:43:10] but, you should have the same access [03:43:14] also, it is ubuntu precise [03:43:15] not lucid [03:43:38] builddiederik@build1:~/webstatscollector$ cat /etc/lsb-release [03:43:38] DISTRIB_ID=Ubuntu [03:43:38] DISTRIB_RELEASE=10.04 [03:43:38] DISTRIB_CODENAME=lucid [03:43:38] DISTRIB_DESCRIPTION="Ubuntu 10.04.3 LTS" [03:44:02] says lucid there. maybe we can do something about that ? [03:44:10] should I change it to be something else ? [03:44:27] naw, you'll probably have to build the package for both OSes anyway [03:44:30] but it will mostly be the same [03:44:38] you just build 2 different debs, one on each OS [03:44:44] but [03:44:52] i'm sure for testing [03:44:53] it won't matter [03:45:00] you can probably jsut install it on build2 no problem [03:45:06] ok [03:45:30] i think apt likes to have things built for different distros specifically [03:45:44] and will give you trouble if you try to install one on the other [03:45:48] but i don't think dpkg cares [03:46:20] yeah, dpkg is more permissive definitely [04:02:27] ottomata: can you please ping me about build2 ? [04:05:51] shoudl be up by now.. [04:07:03] hmm, how do you log into labs?
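For reference, a minimal sketch of the packaging flow discussed above: the version apt and dpkg report comes from the top stanza of debian/changelog (not from Makefile.am or any AC_* macro), so that is where the bump has to land, and the built .deb can be inspected and test-installed on a scratch instance like build2. The version string, distribution and changelog message below are illustrative assumptions, not the exact commands that were run.

    # bump debian/changelog -- this is what dictates the package version
    dch -v 0.2-1 --distribution lucid "Rebuild webstatscollector via debianize.sh"
    # list every file the package will install, and where
    dpkg-deb --contents webstatscollector_0.2_amd64.deb
    # test-install on a scratch labs instance (build2) rather than build1
    sudo dpkg -i webstatscollector_0.2_amd64.deb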
[04:07:05] i haven't done it in a while [04:07:08] what bastion do you use? [04:07:10] user@garage:~/test-git-dch$ ping build2.pmtpa.wmflabs [04:07:10] ping: unknown host build2.pmtpa.wmflabs [04:07:18] ottomata: I ssh into build1 [04:07:36] ottomata: bastion1 [04:07:51] Host *.pmtpa.wmflabs ProxyCommand ssh -a -W %h:%p bastion1.pmtpa.wmflabs [04:07:54] Host *.eqiad.wmflabs ProxyCommand ssh -a -W %h:%p bastion1.eqiad.wmflabs [04:07:58] got that in my .ssh/config [04:08:07] ah bastion1 [04:10:10] ummm, but hm [04:10:18] bastion1.pmtpa.wmflabs is not a public addy [04:10:24] doesn't resolve from my lcoal [04:11:05] Host bastion1.pmtpa.wmflabs Hostname bastion.wmflabs.org ProxyCommand none [04:11:09] Host bastion1.eqiad.wmflabs Hostname bastion2.wmflabs.org ProxyCommand none [04:11:12] ottomata: also got that in my .ssh/config [04:11:18] ah [04:11:20] http://bastion.wmflabs.org/ [04:11:21] http://bastion.wmflabs.org/  [04:11:22] bastion1 is actually bastion.wmflabs.org [04:11:23] yeah [04:11:27] wasn't responding for me [04:12:25] try build2 [04:12:26] i can't get into it [04:12:30] but it responds to pings [04:12:36] build2.pmpta.wmflabs [04:13:49] can you get in? [04:15:16] trying [04:15:39] ottomata: yay ! I'm in ! thanks ! [04:16:13] ottomata: I have the same directories as in build1 [04:16:17] ottomata: did you clone it ? [04:17:36] no, labs uses nfs [04:17:38] for home [04:18:05] so the /home is mounted on both build1 and build2 [04:18:18] but the /usr or /otherstuff are separate [04:18:20] right ? [04:20:28] right [04:20:38] ok time for bed [04:20:52] have fuuuuuun! [04:20:54] laters! [04:21:05] ttyl , thanks ! [04:30:29] ottomata, average_drifter: you guys are crazy [04:30:56] hey drdee [04:31:12] hey ori-l [04:31:44] have you guys made a decision re: data serialization formats? thrift, protocol buffers, avro, whatever? [04:31:51] i guess if you go with kafka it's got its own thing [04:34:02] no we haven't, but looking at what we have tested so far, it seems that thrift comes along quite often [04:35:13] any reason why? [04:36:06] this is purely an observation, that i have seen thrift support quite often, but then again avro seems to be the default for apache projects [04:37:00] not sure if we should make that decision right now, so in the end i think it will be between thrift and avro but it might be a bit too early to make that call [04:38:52] i gotta go to bed, it's too late :( [04:44:00] oops, sorry [05:40:58] hey Eloquence [05:40:59] hey StevenW [05:41:24] ori-l: I worked extensively with protobuf [05:41:48] ori-l: hardcore protobuf, I was actually wanting to write a Perl XS(C++bindnings for Perl) module for Protobufs [05:42:27] ori-l: I'd love to discuss them with you if you're interested on using them on any project :) [05:44:44] average_drifter: definitely [05:44:48] any snags? [05:44:59] hey man [05:45:02] snags= ? [05:45:14] have you run into problems, annoyances, etc? [05:45:24] well I can tell you waht I've used from it [05:45:26] things that seemed great at first and turned out to suck? ways in which protobufs were limiting? etc [05:45:56] so basically I worked on a custom cache with business specific logic... 
it was written in C/C++ [05:46:08] this is about my previous dayjob(before working on wikipedia) [05:46:20] and they had a clumsy data transmission set up(like real clumsy) [05:46:22] anyway [05:46:39] so I researched msgpack, thrift, protobuf [05:46:56] actually I didn't have time to research thrift because they kept cutting my research time [05:47:13] I was planning to research avro too [05:47:36] but they cut my research time drastically(part of teh reasons I didn't like it there so I left) [05:47:48] anyway, so I ended up with just protobuf and msgpack [05:47:51] protobuf is really really cool [05:48:01] you get to basically write a .proto file which describes how your data looks [05:48:50] then there's a protoc compiler which turns what you wrote into Java/C++/Python classes(there are parsers/transformers for other languages as well, actually I was planning to write a fully complete code generation thing like that for Perl) [05:49:13] so that protoc will generate C++/Python/Java classes that describe your messages you wrote in the .proto file [05:49:37] which is really cool because it cuts down development time, and you don't have to do all that low-level serializing/deserializing stuff yourself [05:49:46] and it has some options for compressing stuff [05:49:58] you can choose from a variety of different data types [05:50:06] like string, int32, int64, you have enums [05:50:31] bool, uint32 [05:50:42] yeah, i'm aware of the fundamental selling points [05:50:43] you can even say that some things in your message are required or optional [05:50:46] stuff like that [05:50:53] and it generates setters and getters for you [05:51:14] and also ! if you have some fields marked as "required" , it will tell you that you didn't set it [05:51:27] but i have the following concerns: 1) i like text. text is readable. text is debuggable. text is greppable. [05:51:29] or if you have it marked as "optional" it will tell you that you didn't set it [05:51:50] if you're using it (you can call stuff like has_ and it'll tell you if the message has that) [05:52:00] ori-l: yes text is all that [05:52:10] ori-l: but protobuf is really meant if you want like hardcore performance [05:52:43] ori-l: it packs up the data very neatly, it takes care of all the htons ntohs network->host and host->network byte ordering [05:53:16] the required / optional thing is great [05:53:19] ori-l: you can always inherit from the classes Protobuf generates and add a "print" method if you wanna debug your data [05:53:27] ori-l: yeah that's awesome ! [05:53:37] ori-l: and also you get a "repeated" thing [05:53:42] i think we wanna use it for data validation [05:53:51] yeah, i read the docs today and played around with it a little [05:54:18] ori-l: and what this does is, you can define a field that repeats, so basically an array, and you have no limitation on the size of that, you can just push a lot of stuff into it [05:54:31] ori-l: on which project ? [05:54:39] event logging [05:55:05] ori-l: I've done 2-3 months of development with Protobuf and I would be really interested and ready to bang if you have a project that involved it [05:55:13] ori-l: what language do you plan to use ? [05:56:35] ori-l: also, the neat thing about it is that you can define messages like a custom message called say... CustomMessage, and of course you can include that in other messages like [05:56:43] well, our data analysts uniformly love python, so for data analysis things, probably python. as for server / routing / etc, dunno. 
probably build a prototype in python and re-implement in C / go / java iff necessary [05:56:48] OtherMessage { [05:56:56] repeated CustomMessage msg; [05:56:57] }; [05:57:10] yeah, that's awesome [05:57:52] that sounds interesting [05:58:01] I know some Python [05:58:19] ori-l: do we have a datasource to test stuff on ? [05:58:47] not yet, but by friday [05:58:50] ori-l: what I'm actually asking is if kraken provides us with some hose that delivers some data [05:59:01] not to my knowledge [05:59:08] ori-l: that's cool [06:00:23] event log data starts its life as a simple http request log [06:00:24] ori-l: what I didn't like about msgpack is that although my findings were that it was slower than Google protobufs [06:00:33] ori-l: on msgpack's website they say "oh, it's the best" [06:00:57] ori-l: I guess it's kind-of a status quo that every such piece of software has it's author's saying "it's the best" [06:01:40] yeah [06:01:45] so I wouldn't rely much on benchmarks on the official website if you wanted to choose between thrift vs. msgpack vs. agro vs. google protobufs [06:02:40] well, so, the data starts out as a simple http request log and at least initially no further processing is done, it'll be sent in plaintext/utf-8 via udp [06:03:32] so throughput will likely be bottlenecked there regardless of what things we use further down the pipeline [06:04:18] so compression isn't a good motivation for protocol buffers in this particular instance. but we can allow producers to define a data model and then hold them to it [06:04:43] so use protobufs as a DSL for declaring data models and as a kind of "contract" [06:05:01] yeah [06:05:14] I've used it on both client and server [06:05:33] so basically you write data in a format on one side and you use the same classes generated by protobuf on the other side [06:05:40] and it'll throw you exceptions when stuff gets bad [06:05:45] jay kreps talks about this in his kafka talk at airbnb [06:05:48] considering we're using UDP this will be a concern [06:05:51] but it isn't part of kafka [06:06:09] it's another piece of the stack at linkedin that they haven't open-sourced. [06:06:14] jay = jeremyb ? [06:06:15] but it's a good idea [06:06:30] ori-l: which piece haven't they OSS-ed yet ? [06:06:46] no no, the lead author of kafka, he gave a talk at the hq of airbnb, an SF company -- i sent an email about it last month to the analytics mailing list [06:07:26] ori-l: does he mention protobuf in his talk ? [06:08:02] i don't remember if he mentioned protobufs specifically, but he mentioned having some centralize data store with definitions of all the different event models [06:08:57] .proto files are just text, you can store them anywhere [06:09:39] yeah [06:10:06] anyways i ended up not being too impressed by kafka but liking some of the other stuff he described, like this approach [06:14:00] ori-l: so I guess on friday there's gonna be a release of kraken somewhere right ? [06:14:54] dunno [06:15:30] 08:56 < average_drifter> ori-l: do we have a datasource to test stuff on ? 
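For reference, a hedged sketch of the kind of .proto file being described here: CustomMessage and OtherMessage are just the placeholder names used above, and note that real proto2 syntax also needs a tag number after every field, which the snippet in the chat leaves out.

    // illustrative only; run e.g. `protoc --python_out=. events.proto` (or --cpp_out / --java_out)
    message CustomMessage {
      required string name  = 1;  // generated classes refuse to serialize if a required field is unset
      optional int64  value = 2;  // generated code exposes has_*-style checks for optional fields
    }

    message OtherMessage {
      repeated CustomMessage msg = 1;  // "repeated" = an unbounded list of CustomMessage
    }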
[06:15:31] this is a separate system designed for a much smaller scale, i think if it works well enough parts of it will be incorporated into kraken [06:15:34] 08:56 < ori-l> not yet, but by friday [06:15:35] ^^ [06:17:31] no, the data source is just a subset of the request log on bits.wikimedia.org [06:18:20] which is going to get broadcast via udp to some machine purposed for data collection [13:31:15] waammp waaammmp goodmorninnnnng [13:37:51] bam bam bam bam another big data day, hiiyayaaaaaaa!!!! [13:38:56] morning ottomata, milimetric, average_drifter [13:39:16] good morning drdee :) [13:39:50] haha, morning [13:39:56] man I was going on a week long coffee hiatus [13:39:59] just broke it [13:40:03] :) [13:40:07] only lasted 4 days [13:40:17] i coulda lasted longer but we were just served coffee soooo [13:40:28] counts as a work week 'cause of columbus day? [13:40:37] hmmmm, yeah! [13:40:38] hahaa, average_drifter asks ottomata to deploy a new package at 11:30 PM, which means it must have been 5:30AM for him [13:40:46] hah yup [13:41:02] he is crazy :) [13:41:07] big data does not sleep [13:41:17] apparently not [13:41:32] but we should test the packages today and try to get them deployed [13:41:36] i had a funny scary dream about this [13:41:51] do tell! [13:42:04] in the dream i was awoken by our pet horse who was eating napkins [13:42:16] i panicked and said to Stephanie - omg omg, we forgot to feed the horse it looks terrible [13:42:32] so i went and prepared the food and then couldn't find the horse [13:43:11] which is exactly how I feel about trying to change Limn. I keep realizing what's wrong and as I'm fixing it I can't find it any more. [13:43:31] :) [13:43:52] * drdee is rolling on the floor..... [13:43:54] i couldn't be happier that dschoon is back today [13:45:26] ottomata, what do you feel like doing today? [13:46:21] working on ganglia atm [13:46:30] gotta understand how we have it set up and how it works [13:47:21] this might be useful: http://www.ryangreenhall.com/2010/10/monitoring-hadoop-clusters-using-ganglia/ [13:47:26] but maybe slightly outdated [13:49:27] reading [14:23:24] ottomata, useful reference https://ccp.cloudera.com/display/CDH4DOC/Configuring+Ports+for+CDH4 [14:50:32] ottomata, can you access build1 instance on labs? [14:53:36] nm [14:53:40] it's back [15:24:43] ottomata, i am going tweak some more hadoop settings, cool? [15:25:18] yeah that's cool [15:28:27] drdee, i'm going to be making hadoop changes soon too [15:28:29] for ganglia [15:28:32] might be restarting t hings too [15:28:40] np [15:28:41] i'll check with you and job list to make sure you aren't running anythign [15:28:49] not running anything atm [15:31:00] can i run puppet on kraken? [15:31:30] yup [15:31:36] running.... [15:32:55] gonna run a quick benchmark [15:33:03] someday soon I'm going to bring the cdh4 and kraken-puppet stuff over to operations/puppet [15:33:06] ok [15:57:10] ottomata, can i rerun puppet again? [15:59:20] ottomata ^^ [16:03:59] is there a way to put out an APB for someone on IRC? [16:04:00] :) [16:04:06] yup [16:36:12] ottomata, milimetric: https://gist.github.com/559a5e9f6ffb7bd4cab9 [16:44:38] 2.5TB, is that more than expected? 
[16:45:03] i vaguely remember David's Kraken writeups talking about 1.5TB [16:45:16] but I thought that was compressed [16:47:07] so this is uncompressed data [16:47:47] and we can drop stuff from the logs like hostnames, sequence numbers, the type of request, so there is a lot of room for improvement [16:47:57] so i think there are two takeaways: [16:48:13] 1) it's more data then we expected (based on first 24 hour sample) [16:48:28] 2) there are very strong temporal dynamics going on [16:48:37] stronger than i expected [16:50:06] hm, seems pretty intuitive. It's saying that for roughly every 12 people generating traffic at 2pm there's one generating traffic at 6am [16:50:47] I'd expect less people to be wikiing at 6am :) [16:50:57] but this is worldwide traffic right [16:51:06] oh rly? [16:51:06] so you would expect asia to compensate [16:51:15] but it doesn't [16:51:20] oh that's a little wild [16:51:35] so the fact that the numbers make sense for the US is in fact illogical [16:52:17] wait, what timezone is that [16:52:25] UTC 0 [16:52:33] GMT [16:54:27] oh so traffic ramps up around 5-6EST and stops around 4-5EST. Are you sure these are all servers serving everyone? [16:54:37] Keep in mind spikes tend to follow EU timezones more than US. [16:55:28] paris wise that means the traffic starts around 11-12pm and stops around 10-11pm [16:55:56] i assume that this is all traffic, if it's not the case then we might have a bigger issue :) [16:56:16] this is the squid log data that we use to calculate total pageviews [16:56:21] so it should be all traffic [16:56:29] yup [16:57:10] moooornin [16:57:26] morning! [16:57:54] hiiiyaaaa! [16:58:24] dschoon, we were just looking at how much raw unfiltered uncompressed traffic data we can expect based on a 24 hour sample: [16:58:24] https://gist.github.com/559a5e9f6ffb7bd4cab9 [16:58:28] morning! [16:58:52] cool [16:58:54] i'll check it out [16:59:55] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 [16:59:56] nice [16:59:57] that's awesome [17:00:12] we will run the script for at least 7 days [17:00:14] so for, Beijing this means ramp up at 6-7pm and ramp-down around 5-6am [17:00:25] hm, makes sense if they're at work all day maybe :) [17:01:28] beijing? [17:12:37] brb 20-30, heading into the office. [17:17:24] Asanas just launched subtasks [17:22:20] hey Eloquence, welcome to the analytics hangout! [17:23:16] heyhey [17:28:30] louisdang, your gist only contained 10 lines, can you paste a slightly larger snippet with the output of your pig job? [17:29:36] drdee: one sec [17:30:20] k [17:37:13] drdee: right, so the code is still a work in progress. [17:37:35] from what I remember I still need to group the counts by day? [17:37:42] are there any other requirements? [17:42:22] here's the gist: https://gist.github.com/3869339 [17:45:31] drdee? [17:46:47] yo [17:48:20] * drdee is reading [17:50:24] louisdang, and can you group by year/month as well? [17:50:28] ok [18:01:36] drdee: is there a preferred format for the output? [18:03:10] ping ping [18:05:21] ottomata: morning [18:05:26] drdee: morning [18:05:34] morning milimetric [18:05:38] morning! [18:05:53] :) that would be kind of nice waking up around now [18:05:57] :) good morning [18:06:16] wait what? It's like 9pm for you [18:06:20] it is [18:06:32] got blinds set up on the windows, I just shut them and go to sleep when mornin comes [18:06:49] trying to regulate my sleep so I get up at ~15:00 [18:06:53] oh interesting. Don't you miss sunshine? 
[18:07:03] no, I have flashlights [18:07:17] hm, wonder how good those are for your vitamin D levels [18:07:38] I get plenty of fruits [18:07:47] haha [18:09:11] ottomata: are libanon and libcidr packaged ? [18:11:13] yes [18:11:24] they should be in wikimedia apt, so you should be able to install [18:11:53] currently, one is on github and one in gerrit, I will fix that discrepency as soon as gerrit github sync is set up for all repos [18:11:55] which should be soon [18:12:03] footnote [18:12:05] i will do that soon [18:13:24] do what? [18:13:29] github gerrit sync [18:13:32] oh, hm [18:13:37] chad is setting it up for all projects [18:13:40] in gerrit [18:13:44] eh? [18:13:44] not a two way sync though [18:13:50] yeeah [18:13:50] but [18:13:55] we want it in the other direction! [18:13:59] github -> gerrit [18:14:02] NOT gerrit -> github [18:16:19] hey dschoon, you're back! [18:16:33] what do you sync repos with ? is this chad an OSS project for that ? [18:16:39] when can I steal you away? [18:16:43] perhaps a line in cron can fix that ? [18:16:56] one sec [18:16:57] meeting [18:17:14] np, ping me when you're free [18:18:18] mwerr, i don't care so much for those to repos [18:18:24] haha [18:18:28] no chad is a dude [18:18:51] guys, i'm going to work on merging my secondary puppet stuff into origin/production today [18:19:08] i think i'm running into issues now, and I'd rather eliminate this one point of potential confusion [18:25:51] heya milimetric -- i added an invite so i could reserve a room [18:25:55] i'll be back in a sec [18:26:05] cool, thx [18:27:56] milimetric: i'm in the hangout. join whenever. [18:48:10] just fyi, ottomata, i think i'm gonna dedicate most of today to working with dan [18:57:54] das cool [19:14:27] ottomata, pig question [19:14:50] i am adjusting status_count.pig in my home folder [19:15:00] yessuh? [19:15:07] and i want to do an extra grouping: whether the site is mobile or not [19:15:14] how would you go about this? [19:15:25] do you know how to id that? [19:15:27] from the url? [19:15:35] i have a double grouping example [19:15:37] ... [19:15:52] yes, the url contains .m. [19:16:10] https://github.com/wmf-analytics/kraken/blob/master/src/pig/monthly_subdomain_counts.pig [19:16:27] mainly this bit: [19:16:27] MONTH_SUBDOMAIN_GROUP = GROUP MONTH_SUBDOMAIN BY (month, subdomain) PARALLEL 3; [19:16:40] ideally i would like to do something like if url contains .m. then output 'mobile' else 'desktop' [19:16:48] i don't want it for all the different subdomains [19:16:50] aye [19:17:01] i think you can do that, add a custom field to a bag [19:17:40] you could probably do it as part of one of your generate statements [19:18:54] somethign like [19:20:44] SITE = FOREACH LOG_FIELDS GENERATE FLATTEN (RegexExtract(uri, '\.m\.') as site:chararray; [19:20:44] SITE = FOREACH SITE GENERATE $0, ($0 == '.m.' ? 'mobile' : 'desktop'); [19:20:45] maybe? [19:20:47] then [19:20:56] GROUP SITE BY ($0, $1) [19:20:56] ? [19:20:59] sometjhing like that? [19:21:04] just guessing here [19:22:24] thx, i will give it a spin [19:26:32] drdee: you might also consider looking into the bitfield tools for these sorts of things [19:26:48] because in the distant future, it will be unpleasant to have an ever-expanding giant list of strings [19:26:51] for tags [19:26:53] brb lunch! 
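For reference, a cleaned-up sketch of the mobile/desktop grouping being guessed at above. It assumes a LOG_FIELDS relation that already carries uri and http_status fields and simply tests whether the URI contains '.m.'; the names are illustrative, not the actual status_count.pig.

    -- tag each request as mobile or desktop, then group and count
    SITE = FOREACH LOG_FIELDS GENERATE
        (uri MATCHES '.*\\.m\\..*' ? 'mobile' : 'desktop') AS site:chararray,
        http_status;
    SITE_GROUP = GROUP SITE BY (site, http_status) PARALLEL 3;
    SITE_COUNT = FOREACH SITE_GROUP GENERATE FLATTEN(group) AS (site, http_status), COUNT(SITE) AS num;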
[19:39:17] drdee: interesting analytics problem; if I wanted to know how long it took between a user loading a page; and then loading another page on the same request (ie: Article to Banner load) is there anything ready made for that? [19:39:49] Jeff_Green and I were tossing around a method of sampling on a wiki and then correlating to our banner logs on IP/User-Agent/Referrer [19:40:03] we would have to setup a new filter with very specify url's and 1:1 sampling to do this [19:40:12] once we have the data we can infer the loading time [19:40:47] but it's tricky, because browser rendering performance != server reply time [19:42:34] I think that's Ok though, because the banner load is one of the very last things that happens on a page request so it'll give us a minimum bound at least for how long a banner takes to load [19:44:49] it'll also tell you *whether* a banner loaded at all [19:45:10] now that I think of it, that's what I was looking for when I tried this before [19:45:51] so it was a good thing then that you got the bot s/n ratio [19:46:15] mostly i got confused and overwhelmed :-( [19:50:49] ah; in any case drdee; is this something that can be set up? or if it would be too much of a hassle; do we have any data (that would include the ip/user-agent/request url) on any wiki from the last month or so that I can run against fundraising's collection of banner logs? [19:51:27] it would involve setting up a new instance of udp--filter [19:51:38] that's not very hard an Jeff_Green knows how to do that :) [19:51:43] ottomata and I can be of assistance [19:52:15] so one of doing this I guess [19:52:41] would be to setup a filter that filters for en.wikipedia.org/wiki/Main_Page and for the URL that contains the banner [19:53:04] set that to 1:1 and you have sufficient data in a heartbeat [19:54:42] ok; so... the banner URL prefix (because we add the country on for some legacy reason) is http://meta.wikimedia.org/w/index.php?title=Special%3ABannerLoader&banner=B12_JimmyBlank&campaign=C12_bitest&userlang=en&db=enwiki&sitename=Wikipedia&country= [19:55:38] yeah so it would be something like: [19:56:26] udp-filter -d en.wikipedia.org -p /wiki/Main_Page,title=Special%3ABannerLoader&banner=B12_JimmyBlank&campaign=C12_bitest&userlang=en&db=enwiki&sitename=Wikipedia&country= [19:56:46] i assume that the banner url is unique and that the same banner is not being hosted on different wikis as well [19:57:08] this actually might not work :( [19:57:12] we can make that happen [19:57:24] I can target enwiki specifically [19:57:31] because you are trying to filter from two different domains [19:57:38] crud [19:57:39] can you run the banner from en.wikipedia.org? [19:57:45] sadly no [19:57:47] else we have to make a small change to udp-filter [19:58:07] ohhh wait [19:58:20] udp-filter -d en.wikipedia.org,meta.wikimedia.org -p /wiki/Main_Page,title=Special%3ABannerLoader&banner=B12_JimmyBlank&campaign=C12_bitest&userlang=en&db=enwiki&sitename=Wikipedia&country= [19:58:23] this might work [19:58:41] but please be careful with packet loss [19:58:59] ottomata, do you think that udp-filter would work [19:59:01] ^^ [20:00:02] the only downside is that this filter will also log meta.wikimedia.org/wiki/Main_Page [20:00:12] so you would have to discard those observations [20:00:32] that I can do; or, does it help if we run two different udp-filter instances? 
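A rough sketch of the correlation being described above: join 1:1-sampled Main_Page hits against banner-load hits on (IP, User-Agent) and take the timestamp difference as a lower bound on how long the banner took to load (and whether it loaded at all). The field positions, file names and timestamp format are assumptions about the log layout, not a tested script.

    from datetime import datetime

    def parse(line, ts_field=2, ip_field=4, ua_field=13):
        """Extract an (ip, user_agent) key and a timestamp from one log line; field indexes are guesses."""
        f = line.split(' ')
        ts = datetime.strptime(f[ts_field].split('.')[0], '%Y-%m-%dT%H:%M:%S')
        return (f[ip_field], f[ua_field]), ts

    page_views = {}
    with open('main_page.log') as pages:
        for line in pages:
            key, ts = parse(line)
            page_views.setdefault(key, ts)        # keep the earliest Main_Page hit per (ip, ua)

    with open('banner.log') as banners:
        for line in banners:
            key, ts = parse(line)
            if key in page_views:
                delta = (ts - page_views[key]).total_seconds()
                if delta >= 0:
                    print(delta)                  # minimum bound on banner load time, in seconds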
[20:00:44] I can deal with multiple log files [20:00:47] that's also an idea [20:01:10] however that might really trigger packet loss because you need to run both at 1:1 to make the joining possible [20:01:35] ottomata: sudo apt-get install curl on stat1, if/when you have a moment? [20:01:52] also my example assumes that your banner is shown on the main page :) [20:02:56] it is indeed shown on the main page [20:03:17] ori-l: done [20:03:19] I'm thinking actually that we get enough traffic on enwiki though; that if we downsample we will still get usable results [20:03:24] um, you know drdee [20:03:45] if you don't mind turning off our hourly bytecount thing on an01 [20:03:49] we can easily run a 1:1 filter there now [20:03:56] since log2udp is relaying everythign over [20:04:00] and not worry about breaking a thing [20:04:03] or run it next to it? [20:04:06] good point [20:04:14] can we run? [20:04:15] oh yeah why not? [20:04:22] next to it [20:04:23] let's do that! [20:04:24] hmm [20:04:26] drdee: did the libdb deps on the webstatscollector package work for you ? [20:04:37] average_drifter: i think they did [20:04:39] IIRC [20:04:41] drdee: I just tried to install the package and got errors because of the deps although I have it instaled [20:05:24] ahh, drdee, naw, its not a multicast doody, i can't run next to it [20:05:49] ottomata, me slightly confused [20:06:00] udp2log instance is running [20:06:07] why can' we deploy another filter on that box? [20:06:21] it isn't udp2log [20:06:25] it could be though…that would be cool [20:06:26] that would do it [20:06:38] log2udp is relaying to a udp port on an11 [20:06:44] and I am currently just netcating [20:07:03] i think my script as is wouldn't work though [20:07:11] k, wasn't aware of that [20:07:17] we can temporarily disable the script [20:07:35] k, uhhhhh, what filter do you want me to run? [20:07:42] average_drifter:mmmmmmmmmm sounds like more debugging :D [20:08:35] this? [20:08:35] udp-filter -d en.wikipedia.org,http://meta.wikimedia.org/ -p /wiki/Main_Page,title=Special%3ABannerLoader&banner=B12_JimmyBlank&campaign=C12_bitest&userlang=en&db=enwiki&sitename=Wikipedia&country= [20:09:27] i could do this: [20:09:27] udp-filter -d en.wikipedia.org,meta.wikimedia.org -p /wiki/Main_Page,title=Special%3ABannerLoader&banner=B12_JimmyBlank&campaign=C12_bitest&userlang=en&db=enwiki&sitename=Wikipedia&country= | grep -v meta.wikimedia.org/wiki/Main_Page [20:09:39] filter out the meta.wm.org/wiki/Main_Page right there [20:09:42] should I do that now? [20:09:45] drdee^ [20:10:25] yes [20:10:26] sounds good [20:10:38] probably only needs to run for a couple of minutes [20:10:40] k [20:10:42] ... if that [20:10:44] udp-filter needs support for full url filtering [20:18:54] ottomata: thanks! [20:21:55] reading [20:22:59] drdee: can you please give me more details in relation to the debugging ? [20:23:22] I just read what you guys wrote above but I don't know all the context [20:26:54] debugging what? [20:27:55] average_drifter: can you help me and ottomata fix a bug in udp-filter? [20:28:03] apparenty the -d option does not work accurately [20:28:34] nopy, i mean, i can awk + grep it or something [20:29:47] drdee: yes [20:30:06] the -d option is not working correctly [20:30:25] maybe you can have a look at that as well? [20:31:20] ummm, drdee, -p doesn't work either [20:31:32] uhhhhh?????? 
[20:31:35] drdee: please tell me how to replicate the bug [20:31:43] just run [20:31:58] udp-filter -d en.wikipedia.org and it should capture all url's with the domain en.wikipedia.org [20:32:00] $ cat /tmp/main.log | grep '/wiki/Main_Page' | wc -l [20:32:00] 5775 [20:32:04] $ cat /tmp/main.log | /home/diederik/udp-filter -p '/wiki/Main_Page' | wc -l [20:32:04] 0 [20:32:25] ottomata, not sure if you need apostrophes [20:32:39] same without [20:32:47] $ cat /tmp/main.log | /home/diederik/udp-filter -p /wiki/Main_Page | wc -l [20:32:47] 0 [20:32:52] k [20:33:18] hm [20:33:37] drdee: there were some rules we put as default [20:33:50] drdee: I think one of those is misbehaving [20:34:00] I'm checking it out [20:34:01] k [20:34:02] * mwalker amused [20:34:16] btw, it is also not working in the already packaged version of udp-filter [20:34:23] so unless you've recently deployed a new one to our apt [20:34:26] it is not something you broke! [20:34:28] ottomata: can I have that /tmp/main.log please ? [20:34:30] right? [20:34:31] sure [20:34:36] umm, where? [20:34:39] can I put it on stat1? [20:34:40] ottomata: build1 please [20:34:43] errggh [20:34:47] i can't get into labs stuff right now [20:34:53] but the weird thing is that the unit tests are passing [20:34:57] ottomata: stat1 [20:35:00] ok [20:35:28] what ip address does the hit come from ottomata? [20:35:42] average_drifter: stat1:/tmp/main.log [20:35:43] drdee: which of the unit tests is testing this ? I should look in to see [20:35:49] ottomata: thanks, I'm scp-ing it [20:35:52] th ehit? [20:35:56] the hit? [20:35:57] yes [20:35:58] yes [20:36:22] lots of them? [20:36:25] what hit? [20:36:38] ottomata: what do you mean by hit ? [20:37:06] i dunno, what' drdee mean by hit? [20:37:13] it's working fine on build1 [20:37:17] bwa [20:37:41] doesn't work on stat1 [20:38:01] it doesn't work on oxygen [20:38:03] drdee: skype ? [20:38:12] stat1 is precise, oxygen is lucid [20:38:13] drdee , ottomata skype ? [20:38:14] output format seems to have change [20:38:15] dd [20:39:52] it seems that there are two sequence numbers are present in the log file [20:39:57] field 1 and field 3 [20:40:05] obviously that would break udp-filer [20:40:15] field 1 should be hostname [20:40:32] ottomata ^^ [20:41:15] i am not sure what field 1 is..... [20:41:40] drdee: is that output produced by udp-filter ? [20:41:46] oh format of logs! [20:41:49] yes [20:41:50] drdee: you mean field1 in output of udp-filter ? [20:42:00] i mean field 1 of main.log [20:42:14] hm, interesting [20:42:23] i'm getting those out of netcat [20:42:30] i betcha udp2log removes them [20:42:44] it could be millisecond since epoch [20:42:46] or something like that [20:42:47] hmmm, noooooo, because i'm relaying [20:42:58] naw [20:42:58] 1349988174 [20:42:59] that' snow [20:43:20] no, the are seq numbers [20:43:26] maybe log2udp relay adds them? [20:43:44] sequence numbers are in field 3 [20:44:07] or at least, they used to be right after the hostname [20:44:16] yup [20:44:16] prefixLength = sprintf(outBuffer, "%llu ", counter); [20:44:25] user@garage:~/wikistats/udp-filters$ head -1 main.log | perl -MDateTime -ne '/^(\d+)/ && print DateTime->from_epoch(epoch => $1)' [20:44:32] looks like a timestamp to me [20:44:34] 2008-11-09T16:13:37u [20:44:39] :D [20:44:42] i think that is a coincidence [20:44:42] yep it is [20:44:57] if I look at the raw output from the relay [20:45:03] the numbers increment for each line [20:45:12] log2udp relay is prepending seq nums [20:45:14] so good to know! 
[20:45:17] it is not udp-filter fault [20:45:21] phew [20:45:24] sorry for the false alarm, i'll strip them out [20:45:24] pffeeww :) [20:45:27] and pipe them through [20:45:29] cool [20:45:45] and this is great way to test v 0.3.14 of udp-filtesr:) [20:45:56] :) [20:46:17] I'm here, you guys know all software has bugs. If any are my fault I commit to solving them [20:46:42] don't worry [20:46:55] let's fix webstatscollector :) [20:47:12] drdee: ok, working on deploying on a vm currently, in testing phase, will be ready soon [20:47:22] just use build1 [20:47:28] Actually, average_drifter, our software has no bugs. [20:47:41] we on the wmf analytics team are strongly committed to the theory of "user error" [20:48:22] fyi, ottomata, i talked with chad, and he's only planning to do gerrit sync for mediawiki/* [20:49:20] oh, hmm, i think maybe doing all of that by default [20:49:31] i think once he has that all set up he'll do custom requests [20:52:44] theerrreeee we go [20:52:48] how long do you want me to run this? [20:52:53] 20M so far [20:52:55] in a few seconds [20:53:17] where can i tail the data? [20:53:31] /home/otto/en.wikipedia_and_banner.log [20:53:50] on an11? [20:53:52] ja [20:53:57] 100M now [20:55:17] aaah! kill it! [20:55:23] 200M [20:55:24] hehe ok [20:55:35] but i don't see hits for the Main_Page [20:55:43] about 3 minutes of data [20:55:45] waaa [20:55:45] no? [20:55:53] oh; shit; ya, let me target just the US with a banner [20:56:19] you're getting worldwide; and almost every wiki has that banner enabled :p [20:56:22] what there are tons [20:56:46] okay there are en.wikipedia hits [20:56:53] but they are crowded out by the banners [20:56:59] there are both [20:57:10] grep http://en.wikipedia.org/wiki/Main_Page en.wikipedia_and_banner.log | wc -l [20:57:10] 41435 [20:57:24] right but 41k is not a lot [20:57:26] wait we could do this different [20:57:38] drdee, i gotta run kinda, i'm sorta here though [20:57:42] to get the firehose [20:57:47] just run this [20:57:48] on an11 [20:57:48] netcat -lu 10.64.36.111 8420 [20:57:51] use the -f option and set that to en.wikipedia.org/wiki/Main_Page [20:57:55] you can pipe that into whatever you want [20:58:05] k [21:01:15] mwalker, what is the url of the banner? [21:01:58] drdee: creating it now; hold 5 please [21:02:12] k [21:13:30] drdee: ok, the banner url will start with http://meta.wikimedia.org/w/index.php?title=Special%3ABannerLoader&banner=B12_JimmyBlank_MW [21:14:08] we have to wait for some cache to clear before we'll start seeing hits on it though [21:16:22] ok; should be seeing hits from that banner now [21:24:53] average_drifter: familiar with awk? [21:26:54] drdee: no, but Perl > awk [21:26:56] :D [21:27:03] i'll use cut [21:27:07] :) [21:28:23] drdee: you can easily replace awk/cut functionality with split in Perl [21:29:39] cut did the job [21:29:52] oh i forgot how to tell you to strip the seqs [21:29:52] i did [21:30:05] awk '{$1="";print}' | sed 's/^\s//' [21:30:11] cut -d ' ' -f2-15 [21:30:13] kinda hakcy, ahh [21:30:20] i wanted to delete, instead of select [21:30:23] that's cool [21:31:01] ottomata, it kills my second filter after launch.. [21:31:58] did you escape stuff? [21:37:12] wmalker: i am not seeing any traffic on the banners yet..... [21:37:18] mwalker ^^ [21:39:04] curious; it's definitely live [21:40:47] drdee ...and it's going to every page on enwiki served through any datacentre; so... 
there should be lotsa stuff coming at you [21:41:06] stuff is coming, just not that particular banner [21:42:38] so nothing at all with the prefix http://meta.wikimedia.org/w/index.php?title=Special%3ABannerLoader&banner=B12_JimmyBlank_MW [21:42:39] ? [21:42:58] a full US URL will always look like http://meta.wikimedia.org/w/index.php?title=Special%3ABannerLoader&banner=B12_JimmyBlank_MW&campaign=C12_1011_MW_test_FR&userlang=en&db=enwiki&sitename=Wikipedia&country=US [21:43:57] i am hitting all kinds of special pages but not your banner [21:45:07] ottomata: can you please tell me what to add to my /etc/apt/sources.list so I can have the wmf deb repo where libanon and libcidr reside on ? [21:47:17] ottomata: oh I can just make them debs from here https://github.com/wmf-analytics/libcidr.git [21:48:43] yep except for libanon which isnt there [21:48:55] is libanon packaged somewhere ? because we have it as a dep [21:49:03] check gerrit? [21:49:05] i think it's there. [21:49:16] https://gerrit.wikimedia.org/r/#/admin/projects/analytics/libanon [21:49:20] average_drifter: ^^ [21:49:47] yes but average_drifter is looking for the deb [21:50:08] I'll git clone this https://gerrit.wikimedia.org/r/p/analytics/libanon.git [21:50:13] I saw some debian files in there [21:55:47] eh? [21:55:51] well they are on apt.wikimedia.org [21:55:56] they should be available on labs [21:56:02] aptitude search libanon [21:56:02] no? [21:56:08] aptitude search libcidr [21:57:19] average_drifter: [21:57:24] http://apt.wikimedia.org/wikimedia/pool/main/libc/libcidr/ [21:57:24] http://apt.wikimedia.org/wikimedia/pool/main/liba/libanon/ [21:57:25] ottomata: yeah, adding it now [21:58:39] ottomata, what was your full command line [21:58:41] ? [21:58:50] i am barely seeing any banner requests [21:59:19] eh? [21:59:20] umm, this [21:59:26] netcat -lu 10.64.36.111 8420 | awk '{$1="";print}' | sed 's/^\s//' | udp-filter -d en.wikipedia.org,meta.wikimedia.org -p "/wiki/Main_Page,title=Special%3ABannerLoader&banner=B12_JimmyBlank&campaign=C12_bitest&userlang=en&db=enwiki&sitename=Wikipedia&country=" | grep -v meta.wikimedia.org/wiki/Main_Page > en.wikipedia_and_banner.log [21:59:37] no wait [21:59:39] that's wrong [21:59:50] wait again, yes that's right [21:59:55] thought I had a quote weird [21:59:57] that's the one [22:01:14] alrgiht, i'm outty [22:01:16] laters boys@ [22:02:51] lates [22:04:31] average_drifter: can you make filtering internal traffic a command line option in udp-filter? [22:05:10] drdee: yes [22:14:41] drdee: enabled/disabled by default ? [22:14:50] disabled by default [22:15:39] ok [22:22:56] drdee: that renders the test to fail, I will update them [22:24:44] or better yet, I can add -t to all tests so they can pass [22:24:52] -t is the internal traffic rules param [22:24:57] the new one [22:30:04] can someone add me as a collaborator to wikimedia/limn? I want to add a label to an issue [22:31:31] dschoon, drdee, milimetric ^^ [22:31:44] sure [22:31:47] one sec [22:32:47] thanks [22:32:50] it's not urgetn [22:32:51] what's your username? [22:32:54] embr [22:33:22] i verified you by limnpy :) [22:33:25] check if it's working [22:36:40] thanks [22:36:42] works [22:56:51] later dschoon, milimetric, erosen [22:57:00] adieu [23:05:22] later dee [23:16:20] drdee, alolita is looking into feature analytics for the i18n stuff to get some basic usage data .. I've suggested she talk to you & ori-l about it [23:16:53] Eloquence: sure, happy to. [23:17:07] not much we can do just yet, tho. 
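Re the sources.list question above, a sketch of pulling libanon and libcidr from apt.wikimedia.org; the pool URLs quoted earlier put both packages in component "main", but the suite name used here ("precise-wikimedia") is an assumption and should be matched to the instance's release.

    # /etc/apt/sources.list.d/wikimedia.list  (use lucid-wikimedia on a lucid instance)
    deb http://apt.wikimedia.org/wikimedia precise-wikimedia main

    sudo apt-get update
    aptitude search libanon libcidr   # confirm the exact binary package names before installing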
[23:47:36] drdee: got counts by day, month and year