[00:00:50] drdee asana q when you get a chance
[00:01:04] shoot (will reply a bit later)
[00:01:26] basically I am going to push the grant making and programs team into using some sort of task management system for requests
[00:01:38] ideally we could just start formally using the analytics instance
[00:01:45] but I wanted to check if that seems like an okay idea
[00:09:49] totally
[00:09:50] np
[00:11:39] awesome
[00:11:46] thanks
[08:09:50] map-reduce :) https://gerrit.wikimedia.org/r/#/c/41979/9/pageviews_reports/lib/PageViews/ParallelModel.pm
[08:09:56] a poor man's map-reduce
[08:10:12] but it does work, it seems that 6-7 children are somewhere around the optimum
[08:10:31] stat1 is now processing a lot of stuff
[08:18:38] hmm, some problems
[08:18:41] but I'll fix them
[08:20:36] average_drifter: Are you working for the Wikimedia Foundation?
[08:21:42] Susan: yes
[08:21:50] Susan: my name is Stefan Petrea
[08:22:07] Hi. Nice to meet you.
[08:22:24] my pleasure Susan :)
[08:22:24] I looked at https://meta.wikimedia.org/wiki/Wikimedia_Foundation_contractors and https://wikimediafoundation.org/wiki/Staff?showall=1
[08:22:33] But maybe those pages haven't been updated yet.
[08:22:40] You're working with the analytics team?
[08:22:44] yes
[08:22:47] Cool.
[08:23:04] I'm a bit pessimistic about Wikimedia's analytics abilities lately.
[08:23:09] I was going to write a timeline one day.
[08:23:15] It has an interesting history.
[08:23:33] But maybe I'll wait for the pleasant chapters...
[13:13:01] morning everyone!
[13:13:21] milimetric: morning Dan :)
[13:13:38] hey Stefan, how's it going
[13:14:22] so Susan there's plenty of reason to be pessimistic, but I think this is the beginning of the reasons to be optimistic: http://dev-reportcard.wmflabs.org/graphs/pageviews_mobile_hourly
[13:14:34] milimetric: she's sleeping
[13:14:50] I think that's just MZMcbride :)
[13:15:00] yes, Susan is MZMcbride
[13:15:15] But some people read chat logs like me :)
[13:15:41] man, analytics is hard
[13:15:52] I think we're on the verge of something amazing
[13:16:05] that graph, using your dClass thing, and real-time updated data
[13:16:18] if I can slim down Limn and stuff
[13:16:18] :) is limn real-time?
[13:16:27] it can very very easily be
[13:16:33] nice :)
[13:16:43] the question would only be how often to pull the data
[13:17:03] 'cause obviously that's a big perf. thing
[13:17:16] i suppose it could consume fragments of data and save them back to the datafile
[13:17:19] that would be pretty slick
[13:18:03] hm, nah
[13:18:22] 'cause then you'd have to synchronize across all open instances of graphs pointing to that datafile
[13:18:34] the devil's in the details
[13:18:59] I'm off for a few hours
[13:19:10] bbl
[13:21:34] have fun :)
[14:13:33] drdee, pushed some small cleanups to funnel
[14:13:47] ty and good morning!
[14:13:51] morning
[14:14:04] look at http://meta.wikimedia.org/wiki/Research:Metrics#Funnel_metrics
[14:14:14] we need to talk a bit more, i spoke to dario
[14:14:21] but i think we are on the right track
[14:14:45] cool
[14:15:32] so do you think we should let them build funnels as defined here?
[14:15:36] that'd be fine by me
[14:16:00] the question is, would this be relevant for other people doing funnel work, do they all agree on this?
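
(A minimal Python sketch of the fork-a-fixed-pool-of-children pattern described at 08:09-08:10; the real implementation is the Perl ParallelModel.pm linked above, and the chunking, worker function, and merge step here are hypothetical stand-ins, not taken from that code.)

    # "Poor man's map-reduce": split the input, hand chunks to a small
    # pool of child processes (6-7 was reported as near the optimum on
    # stat1), and merge the partial results back in the parent.
    from multiprocessing import Pool

    NUM_WORKERS = 7

    def process_chunk(chunk):
        # map step: stand-in for the real per-chunk log processing
        return len(chunk)

    def poor_mans_map_reduce(lines, chunk_size=10000):
        chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
        with Pool(NUM_WORKERS) as pool:
            partials = pool.map(process_chunk, chunks)  # fan out to children
        return sum(partials)                            # reduce step

    if __name__ == "__main__":
        print(poor_mans_map_reduce(["a log line"] * 100000))
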
[14:18:28] dario's work is focused on having this become the standard way to think about funnels @ wmf
[14:19:03] so if we support that as well then we would accomplish that
[14:20:52] heh, this is an awesome link that Erik Zachte sent
[14:20:52] http://stats.wikimedia.org/wikimedia/animations/requests/AnimationEditsOneDayWp.html
[14:21:14] cool re: dario's funnels
[14:21:21] wanna hangout?
[14:21:23] we should call them darnels
[14:21:25] sure
[14:21:27] um
[14:22:16] yes
[14:22:27] had to kick out the cat and close the door 'cause stephanie's sleeping - her day off
[14:22:27] :)
[14:22:46] hangout link?
[14:22:59] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[14:54:13] drdee: ping
[14:54:16] drdee: back
[14:54:21] piong
[14:54:28] :D
[15:21:50] ottomata, still looking at the data loss issue?
[15:22:16] naw, playing with flume at the moment, it's pretty cool for this use case
[15:22:35] i'm playing with the timestamp extractor to do bucketing, it works pretty well, just trying to understand how it creates these file names
[15:23:29] drdee, check this out
[15:23:29] https://gist.github.com/4511451
[15:23:56] looking
[15:23:56] this is a flume-ng conf file, i'm using the netcat source right now
[15:24:06] so, this
[15:24:07] analytics1009.sources.udp2log.interceptors.request-timestamp.regex = ^.+\\s\\d+\\s(\\d\\d\\d\\d-\\d\\d-\\d\\dT\\d\\d:\\d\\d:\\d\\d)
[15:24:13] examines the content for the timestamp in the log
[15:24:24] and converts it to a timestamp header that flume events can have
[15:24:29] and then this
[15:24:40] analytics1009.sinks.hdfs-sink.hdfs.path = /user/otto/tmp/flume/%Y-%m-%d_%H.%M.%S
[15:24:40] analytics1009.sinks.hdfs-sink.hdfs.roundValue = 10
[15:24:40] analytics1009.sinks.hdfs-sink.hdfs.roundUnit = second
[15:24:48] says to create buckets every 10 seconds in hdfs
[15:25:00] buckets will be created based on the timestamp header in events
[15:25:04] that can get set by that regex
[15:25:07] sweeeeeeeet!!
[15:25:10] i *think*
[15:25:13] you figured that out very fast! awesome
[15:25:21] that I can make this listen directly to the udp2log multicast stream
[15:25:34] and then turn off kafka?
[15:25:34] not sure though, I might need to actually extend flume to do that, it has a udp syslog source
[15:25:35] and a netcat source
[15:25:44] ok
[15:25:44] but netcat won't do udp by default
[15:25:47] we'll see
[15:25:51] but also, check this bit out too:
[15:26:13] http://flume.apache.org/FlumeUserGuide.html#regex-filtering-interceptor
[15:26:28] lets you filter sources based on content as well
[15:26:50] so we could pretty easily set up the same udp2log filters, using just flume configs
[15:28:01] one question about the logs, is the x-carrier thing now working?
[15:28:24] no i haven't looked at it, i've never seen x-carriers in our log files
[15:30:45] we should fix that :)
[15:33:26] shouldn't ummmmmm someone who wrote it fix it?
[15:34:15] me?
[15:34:43] * average_drifter doesn't remember/know who did the x-carriers
[15:37:09] naw, i think preily
[15:41:13] can you ping him or asher about this?
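
(For illustration, a Python sketch of what the interceptor/sink configuration above expresses: pull the ISO-8601 timestamp out of each log line with the regex, round it down to a 10-second boundary, and derive the HDFS bucket path from it. This is not Flume code, just the same logic spelled out; the sample log line is made up.)

    import re
    from datetime import datetime, timedelta

    # the regex from the request-timestamp interceptor (single-escaped here)
    TS_RE = re.compile(r'^.+\s\d+\s(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})')

    ROUND_SECONDS = 10  # hdfs.roundValue = 10, hdfs.roundUnit = second

    def bucket_path(line):
        m = TS_RE.match(line)
        if m is None:
            return None  # no timestamp found in this line
        ts = datetime.strptime(m.group(1), '%Y-%m-%dT%H:%M:%S')
        ts -= timedelta(seconds=ts.second % ROUND_SECONDS)  # round down
        # same escape sequences as the hdfs.path setting above
        return ts.strftime('/user/otto/tmp/flume/%Y-%m-%d_%H.%M.%S')

    print(bucket_path('cp1043.example.org 1234 2013-01-11T15:24:07 ...'))
    # -> /user/otto/tmp/flume/2013-01-11_15.24.00
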
[16:15:11] sure, putting it on the todo list
[16:52:24] ottomata, two questions about how we do the eventlogging
[16:52:42] 1) it seems that the 'product' thing is not yet working, everything is stored in event-unknown
[16:52:53] yes, that is correct
[16:53:04] products have to be predetermined
[16:53:11] and set up as special filters
[16:53:15] ok
[16:53:15] right now there are no products
[16:53:39] but ideally we should store different event logging streams into different folders
[16:54:07] we could use the 'schema' as a product identifier
[16:54:37] 2) we are storing event logging data as a key,value pairs string, is it possible to store that as json?
[17:02:24] event.gif?product_code=whateeeverrrrrr
[17:02:26] key value pairs?
[17:02:31] it's https://www.mediawiki.org/wiki/Analytics/Kraken/Data_Formats#Event_Data_Format
[17:02:39] https://www.mediawiki.org/wiki/Analytics/Kraken/Data_Formats#Event_Data_Schema
[17:02:46] the data is whatever json data you want
[17:12:04] uhmmmm, not really :)
[17:12:16] if you look at (for example) http://hue.analytics.wikimedia.org/filebrowser/view/wmf/raw/event/event-unknown/2012-12-07/part-1353342609923_2187-m-00000?file_filter=any
[17:12:29] then you see that we store it as one single key,value string
[17:12:48] so we don't have any information about the data types of the values
[17:13:31] ori seems to be writing that as a json object, see for example stat1:/a/eventlogging/archive/client-side-events-json.log-20130111.gz
[17:14:02] that makes parsing much easier because you would know the datatype of each key,value pair
[17:14:08] brb, relocating to another place
[17:29:16] it's whatever you log
[17:29:20] you can log json if you want to
[17:29:32] he is converting an entirely different stream
[17:29:47] we consume arbitrary data, if you want to log json data, then log it
[17:51:19] back
[18:03:54] ottomata, coming?
[18:04:15] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[19:56:37] hey drdee, I'm gonna estimate what's there now and finish grooming for "must do" items
[19:56:52] sounds good!
[19:56:59] once I have a full sprint I'll let you know and you can change
[19:57:55] i gotta grab lunch, bbl
[19:58:49] aight
[20:58:29] :D
[20:58:48] what?
[20:58:49] it is totally busted and disorganized right now, but I just got udp2log multicast traffic into a custom flume source
[20:58:53] !!!! :D
[20:59:01] YOU ARE THE mAN!
[20:59:07] proof of concept works!
[20:59:38] that's awesome dude!
[21:00:28] man, i'm so glad, I'm going to clean this code up real nice then and try it out, hopefully will have some regular flume imports up early next week.
[21:00:38] i'll try to replicate the current kafka import structure we have now
[21:01:46] this is really really good news, mad props!
[21:02:26] yeah well, we'll see how it goes in practice with all that data
[21:09:04] that's so awesome ottomata, good job
[21:09:53] ja, thanks
[21:10:09] gonna clean it up, and make it a generic udp source, with multicast as a config option
[21:10:32] very cool, let us know the repo so we can help if you need
[21:10:42] ok cool, yeah actually
[21:10:45] that's something i'm not sure about,
[21:11:01] i coded this in the flume source itself
[21:11:06] the maven setup there made it real easy
[21:13:27] ah cool, so you can just fork flume and send them a pull request?
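
(A small Python illustration of the difference discussed at 16:54-17:14: a flat "key,value pairs" string loses the value types, while a JSON-encoded event keeps them. The field names and the flat-string separators here are invented for the example, not the actual Kraken event format.)

    import json

    # flat string: every value comes back as a string after parsing
    raw = "event_id=abc123 duration=42 logged_in=true"
    parsed = dict(pair.split("=", 1) for pair in raw.split())
    # {'event_id': 'abc123', 'duration': '42', 'logged_in': 'true'}

    # JSON: types survive the round trip, so no guessing downstream
    event = {"event_id": "abc123", "duration": 42, "logged_in": True}
    line = json.dumps(event)
    assert json.loads(line)["duration"] == 42  # an int, not the string "42"
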
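(And a minimal Python sketch of the core of a "generic udp source, with multicast as a config option" like the one described at 20:58-21:10: bind a UDP socket, join the multicast group, and read datagrams. The actual proof of concept was written inside Flume itself, in Java; the group address and port below are placeholders, not the production udp2log values.)

    import socket
    import struct

    GROUP = "239.128.0.112"  # placeholder multicast group
    PORT = 8420              # placeholder port

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    # join the group on all interfaces (INADDR_ANY)
    mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    while True:
        data, addr = sock.recvfrom(65535)  # one udp2log datagram per recv
        print(data.decode("utf-8", errors="replace").rstrip())
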
[21:15:37] yeah, or submit a jira
[21:15:37] but yeah
[21:19:01] guys, read this: http://blog.wikimedia.org/2013/01/11/mobile-beta-a-sandbox-for-new-experimental-features/
[21:19:12] this is one of the things that we should be tracking
[21:19:59] i will organize a meeting about this
[21:20:14] (with devs from mobile)
[22:12:04] later fellas, have an awesome weekend
[22:14:20] laterz