[11:52:08] hey milimetric
[13:38:04] mooooorrnniinininnininining
[13:38:18] average_drifter, milimetric
[13:38:25] morning
[13:38:32] average_drifter, can we finally merge your wikistats patch ;)
[13:38:39] heh
[13:38:45] * milimetric backs away quietly
[13:38:54] no no milimetric, this is also on your hands :D
[13:39:08] I know, I'm just playing around
[13:39:20] did you fire up repo by any chance
[13:39:21] morning drdee, working on it, we will have a working patch, I found a solution to do it but I'm still merging
[13:39:27] I can help but I'm lagging a bit because of yesterday's ruby nightmare
[13:39:41] not yet
[13:39:41] or were you stuck in ruby land?
[13:39:44] yep
[13:39:51] so what happened ?
[13:40:04] upgrade that went south?
[13:41:06] average_drifter, flv doesn't play
[13:41:22] oh it does
[13:42:45] yep, my upgrade to Ubuntu 12.10 messed with my vim and ruby interconnectedness
[13:42:45] drdee: :)
[13:42:59] and I went down the wrong path of using rvm, which average_drifter helped explain is a stinking pile of doo doo
[13:46:05] so what average_drifter has now done twice, and i *REALLY* enjoy it, is that he makes webcasts of stuff that he has worked on, demos his code and explains stuff
[13:46:09] really really NICE
[13:46:19] milimetric: http://garage-coding.com/wikipedia-diederik-anonymize.flv
[13:46:43] hey, that's awesome average_drifter
[13:46:51] i'll take a look in a sec, I'm obsessed with something right now
[13:47:45] k
[13:49:54] drdee: thanks
[13:49:59] if anyone wants to make a screencast: https://gist.github.com/3946150
[13:50:06] there are many other ways
[14:24:15] morning ottomata1
[14:30:10] morning!
[14:31:40] ottomata, can you please look at https://gerrit.wikimedia.org/r/#/c/29779/
[14:33:07] looking
[14:35:49] hm, ok in udp-filter.c, I don't think I wrote that part (did I?)
[14:35:57] i'm not really sure what is going on there
[14:36:11] average_drifter, you around?
[14:37:00] looks like he just removed the pointerness of field_count_this_line, which is good because I don't think there is a need to access it by reference
[14:37:16] ottomata: I'm here
[14:38:21] ottomata: so basically I just added some error handling
[14:38:42] hmm, ok yeah, i see the conditional
[14:38:45] ottomata: there are some leftover "else{}", I used those for debugging because I wanted to see what was going on
[14:38:56]         if( len_last_old_field-2 >= 0) {
[14:40:05] yeah uhhhh, i think the code is cool, i don't really understand what is going on at this point though (the comment isn't very helpful)
[14:40:09] but you didn't write this so I dunno
[14:40:14] looks good to me!
[14:40:18] i have a comment about the other file, will put it in the change
[14:40:28] ottomata: have a look at fix_geoip also please
[14:40:49] ottomata: so basically the if you wrote about above is involved in the following
[14:40:55] link?
[14:41:00] ottomata: yes, moment pls
[14:41:24] ottomata, i asked average_drifter to put the geocode in a separate field at the end of the line, not as part of the IP address
[14:41:38] oh! interesting
[14:41:41] because that makes it much harder to import those sample files into hadoop
[14:41:44] and work with them
[14:41:52] aye makes sense
[14:41:52] cool
[14:41:56] average_drifter can also fix the wikistats code
[14:42:07] that was my reasoning
[14:42:07] ottomata: https://gerrit.wikimedia.org/r/#/c/29469/
[14:42:26] ok, i think um
[14:42:28] if that is the case
[14:42:35] we should do some rearranging
[14:42:43] replace_ip_addr should only deal with anonymizing the IP, right?
[14:42:52] actually, this function could probably go away altogether then
[14:42:55] two functions:
[14:43:22] oh, you can just call anonymize_ip_address from the caller of replace_ip_addr
[14:43:32] and then have another function
[14:44:05] hmm, you already have the area
[14:44:07] maybe a simple function
[14:44:14] append_field()
[14:44:14] that you can just pass area to
[14:44:30] i like that
[14:44:42] ottomata: if I make a function append_field I also need to pass it a pointer to field_count_.. so I can increase it
[14:45:00] ottomata: right ?
[14:45:03] yeah probably
[14:45:06] makes sense
[14:45:12] is that what you are doing with field_count_this_line
[14:45:12] ?
[14:45:41] too bad each line isn't a struct that you can just pass around... :)
[14:46:00] the struct could have the fields and the field count
[14:46:11] then you could just pass that between funcs
[14:46:15] ottomata: I am increasing it in udp-filter.c line 992 "i++", then a couple lines below there's field_count_this_line = i;
[14:46:55] ah right
[14:46:56] see it
[14:47:11] yeah, i'd genericize this, and make the function names actually make sense
[14:47:43] there's no reason for replace_ip_addr to append the geocoded area
[14:47:55] other than a historical one, so we should clean that up :)
[14:48:06] ottomata: because the name replace_ip_addr does not say anything about appending, right ?
[14:48:11] right
[14:48:17] and
[14:48:20] these are really separate concepts anyway
[14:48:24] 1.
[14:48:26] anon the IP
[14:48:32] 2. geocode the IP
[14:48:35] (and sorta 3, append a field)
[14:48:53] agree
[14:49:28] i mean though, udp-filter is full of stuff like this, so you know, whatever :p
[14:49:32] if you don't want to refactor it right now
[14:49:37] because you tested this and it works
[14:49:45] then just add a really nice comment about why and what you plan to do
[14:49:50] to make it better later
[14:50:59] ottomata: the refactor will occur soon, I have to take care of some bugs in asana, but I'll get to it
[14:52:49] mmk cool
[14:53:02] you should def add a big ol' comment about it somewhere near the replace_ip_addr func
[14:53:15] comments about intentions are really important
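
A minimal sketch of the refactoring discussed above: an append_field() that bumps the field count itself, with the line bundled into a struct as suggested, so the fields and the count travel together. The names (logline_t, MAX_FIELDS) and the example values are illustrative assumptions, not the actual udp-filter code.

    #include <stdio.h>

    #define MAX_FIELDS 32

    /* Hypothetical: each line as a struct you can just pass around,
     * holding the fields and the field count together. */
    typedef struct {
        const char *fields[MAX_FIELDS];
        int field_count;
    } logline_t;

    /* Append one field (e.g. the geocoded area) to the end of the line
     * and increase the count. Returns 0 on success, -1 if the line is full. */
    static int append_field(logline_t *line, const char *value)
    {
        if (line->field_count >= MAX_FIELDS)
            return -1;
        line->fields[line->field_count++] = value;
        return 0;
    }

    int main(void)
    {
        logline_t line = { .fields = { "127.0.0.1", "GET" }, .field_count = 2 };
        /* the three separate concepts from above: 1. anonymize the IP,
         * 2. geocode the IP, 3. append the resulting area as a new field */
        append_field(&line, "US");
        printf("%d fields\n", line.field_count);
        return 0;
    }

This keeps replace_ip_addr out of the appending business entirely; without the struct, the same thing works with a flat signature like append_field(const char **fields, int *field_count, const char *value).
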
[14:53:48] drdee, is your MR job still running?
[14:53:58] nope
[14:54:06] cool, gonna mess with ganglia and will need to restart some services
[14:54:30] nice
[14:54:55] also, snappy byte count is running on an26
[14:55:19] but ummm there are negative byte counts occasionally?
[14:55:24] duhhhhhh
[14:55:38] whaaa
[14:57:35] is it overflowing?
[14:58:30] doesn't look like it, these are the hourly counts, not the totals, and the totals are bigger
[15:01:00] odd
[15:13:52] milimetric, do you use vim?
[15:14:00] as your main editor?
[15:14:00] yep
[15:14:09] vim is my sun and stars :)
[15:14:14] average_drifter was just asking me for tab/spaces vim advice
[15:14:34] k, i'll read up
[15:14:44] (that was in a PM)
[15:15:53] k, I pm-ed him
[15:26:53] dschoon, milimetric: another question on limn coloring
[15:27:00] shoot
[15:27:01] what is the default behavior when there are too many colors
[15:27:07] for a single colormap
[15:27:13] I can look in the code, I realize
[15:27:22] but if you know off the top of your head
[15:27:40] on new limn it would just start over and repeat colors
[15:27:43] the color brewer colormaps have a max of 11
[15:27:51] cool
[15:27:52] that is what I did
[15:27:59] I'm not sure on old limn. But I'll be changing that eventually to expand the palette
[15:28:00] I just wasn't sure if there was something smarter to do
[15:28:07] cool
[15:28:12] i'll just leave it as is then
[15:28:48] the thing is, once the palette goes over some 10-20 colors people can't really keep track any more, so at that point it's more important to have contextual cues
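
The wrap-around behavior described above ("start over and repeat colors") amounts to indexing the palette modulo its size. A minimal sketch; the hex values are placeholders and the small fixed array stands in for an 11-class ColorBrewer colormap:

    #include <stdio.h>

    /* Placeholder colors; a real palette would hold up to 11 entries. */
    static const char *palette[] = { "#1b9e77", "#d95f02", "#7570b3" };
    static const int palette_size = (int)(sizeof(palette) / sizeof(palette[0]));

    /* Wrap around: series 0,1,2,3,4,... -> colors 0,1,2,0,1,... */
    static const char *color_for_series(int i)
    {
        return palette[i % palette_size];
    }

    int main(void)
    {
        for (int i = 0; i < 7; i++)
            printf("series %d -> %s\n", i, color_for_series(i));
        return 0;
    }
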
[15:32:22] cool
[15:32:40] yeah I just noticed that there is no way to see the name of a curve by clicking on it
[15:32:51] i sort of assume you'll be doing something like that, but just in case...
[16:09:16] restarting hadoop
[16:46:05] on old limn, erosen milimetric, i think it just makes them all black
[16:46:09] heh
[16:46:15] hehe
[16:46:18] cool
[16:59:22] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[17:00:21] ottomata drdee erosen milimetric?
[17:00:29] coming
[17:00:53] finishing with robla
[17:01:06] one sec
[17:20:03] git config core.autocrlf and core.safecrlf can be set for maximum pleasure
[17:20:19] yea, i have those set in ~/.gitconfig
[17:20:23] that's what I was thinking of. It's not at ALL intuitive to me how they work, but basically as long as everyone has them set the same, all is good
[17:20:33] which is probably why i don't really follow what the big deal is
[17:21:02] well, if someone has them different from you, merges could be bad
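
For reference, a typical ~/.gitconfig stanza for the two settings mentioned; whether autocrlf should be "true", "input", or "false" depends on platform and team convention, which the log doesn't specify:

    [core]
        # input: convert CRLF to LF on commit, never rewrite on checkout
        # (a common choice on Linux/OS X; "true" also converts on checkout)
        autocrlf = input
        # warn if a line-ending conversion would not be reversible;
        # "true" aborts the operation instead of just warning
        safecrlf = warn
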
[17:24:17] owait. it appears i don't have them set any more.
[17:24:17] huh
[17:24:55] anyway
[17:25:10] this hacker needs a bagel
[17:25:24] i'll be back in a muni minute
[17:25:24] :)
[17:25:29] (by which i mean approximately 45 minutes.)
[17:25:53] i got caught up in n+1
[17:26:02] this is interesting, if not exactly good. http://nplusonemag.com/the-theory-generation
[17:26:07] i love this mag
[17:26:10] bbl
[17:29:41] hey drdee
[17:29:44] yo
[17:29:55] we didn't specify where the analyst scrum is happening, did we?
[17:30:11] the internetz
[17:30:17] for some at least
[17:30:18] there is a hangout invite attached
[17:30:28] oh right
[17:30:38] https://plus.google.com/hangouts/_/33bb08418c094ead3577050db709b808c2007e86
[17:30:39] f^&% I don't do Google hangouts on wmf.org
[17:31:05] stupid Google asks me to create a G+ profile in order to be able to use it
[17:31:33] hmm maybe not
[17:33:13] changing locations, back in a bit
[17:46:50] from erosen: https://github.com/embr/limnpy
[18:09:36] back
[18:15:19] back
[18:16:21] front
[18:16:40] right
[18:16:54] wrong
[18:17:23] beside
[18:18:25] contra
[18:21:41] band
[18:21:56] aid
[18:22:18] wrong
[18:22:27] the answer was dance
[18:22:48] haha
[18:22:58] i didn't feel that had a binary
[18:51:06] milimetric: http://michaelporath.com/projects/manifest-destiny/
[18:51:17] i like how the legend is implied off the side
[19:01:16] is gerrit down ?
[19:01:28] I get a 503
[19:01:52] ok, up again
[19:18:17] dschoon that's cool. I like the storytelling aspect of it. I wonder if we can create a simple "slideshow" maker or something along those lines.
[19:18:31] yeah, that'd be cool.
[19:18:31] have you seen timeline.js?
[19:18:45] absolutely gorgeous
[19:18:45] http://timeline.verite.co/
[19:22:23] i've actually used timeline js
[19:22:25] it really is cool
[19:22:41] it's really quite beautiful.
[19:25:01] dschoon, wanna hammer out another avro schema?
[19:25:08] in a sec.
[19:25:18] i've already started on it, but i wanted some syntax highlighting
[19:25:27] so i'm writing a quick language def for textmate
[19:27:46] haha
[19:41:01] https://github.com/dsc/avro-tmbundle
[19:41:01] okay.
[19:41:05] now that THAT's done
[19:54:56] drdee: i added an EventData schema http://etherpad.wikimedia.org/avro
[19:55:38] I'm not sure how I feel about this, though.
[19:56:11] It's a fine first pass, but if we're going to see heavy use of the event logging infrastructure -- the way GA uses it, for example -- then there are major changes we'd want to make.
[19:56:38] brb
[20:09:59] back
[20:10:39] so, drdee, here's my thought about event logging
[20:11:01] as more and more events are sent per page/session, more and more redundant data is generated.
[20:11:17] basically everything except the timestamp and event data is the same
[20:11:45] there are several problems though
[20:12:18] collecting all those related events together doesn't make sense if they're not temporally local, which is why i don't think it matters much to consolidate on any level less granular than per-pageload
[20:12:50] the process of collecting them together requires logic, and because they're temporally local, it really makes the most sense to do it as they come in
[20:12:58] hence the original design of the bundler
[20:13:06] check
[20:13:16] but that fell out of favor as we found that most of the important features could be handled by kafka
[20:13:16] and the best code is no code
[20:13:46] the only thing kafka doesn't do is consolidate messages that have common fields, because kafka doesn't give a shit about the structure, it only works at the byte level
[20:14:10] this turns out to be exactly the feature we're talking about mattering for high-volume events
[20:14:28] big data, big problems :D
[20:14:48] so i suspect we're going to want to think about replacing the front-line servers that do event logging as our first release *after* we go to production with kraken
[20:14:59] because i don't think we'll encounter a bottleneck until after adoption picks up
[20:15:43] but (for example) the pageload timing data collection that ops was talking about is exactly the sort of thing that would benefit from this
[20:15:55] each page load would generate multiple packets
[20:16:07] on-page time could be measured by a decaying heartbeat
[20:16:13] per-page scroll behavior could be measured
[20:16:35] average events per page would skyrocket -- 10-20 or more events per load
[20:16:35] IMHO I really don't think that is in our top 5 priorities
[20:16:50] of course, we're only talking about pages. so it'd still only be a small multiplier on traffic
[20:17:13] i disagree, it's intimately related to A/B testing and conversion funnel analysis.
[20:17:38] you need those pageload IDs and visit IDs to do those analyses
[20:17:53] and i think those two cases are high priority, are they not?
[20:18:15] no, they are not, we are focused on editors, not on readers
[20:18:20] they are, quite literally, what the ClickTracking extension does
[20:18:28] uh.
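
The actual EventData schema lives on the etherpad linked above. Purely as an illustration of the consolidation dschoon describes -- hoisting the fields that are identical across a pageload out of each event, leaving only the timestamp and event data per event -- a bundle schema might look like this; every name here is hypothetical:

    {
      "type": "record",
      "name": "PageloadEventBundle",
      "namespace": "org.wikimedia.analytics",
      "fields": [
        {"name": "pageloadId", "type": "string"},
        {"name": "visitId",    "type": "string"},
        {"name": "url",        "type": "string"},
        {"name": "userAgent",  "type": "string"},
        {"name": "events", "type": {
          "type": "array",
          "items": {
            "type": "record",
            "name": "Event",
            "fields": [
              {"name": "timestamp", "type": "long"},
              {"name": "data",      "type": "string"}
            ]
          }
        }}
      ]
    }

The per-pageload fields are stored once per bundle instead of once per event, which is exactly the redundancy the bundler was meant to squeeze out.
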
[20:18:42] neither of them is specific to either group.
[20:18:57] they're techniques to measure how users behave in a certain context.
[20:19:17] they're the essence of evaluating the success of a feature, no matter who a feature is aimed at.
[20:19:44] but i'll turn the question on you: what are the top 5 goals of the system?
[20:19:54] isn't that data collected client side and sent as part of the payload
[20:20:11] yes?
[20:20:58] i think we are miscommunicating then, i was responding to your remark regarding page load timing data for ops
[20:21:13] yes, reading what i type would help :P
[20:21:23] it was merely an example of something that would generate a lot of data quickly.
[20:21:50] right, and i said that example is not within the top 5 priorities
[20:21:50] sure.
[20:21:52] agreed on that.
[20:21:57] that's all
[20:22:00] how about the other four or five examples i gave?
[20:23:22] 1 sec
[20:23:49] ps https://github.com/wmf-analytics/kraken/tree/master/src/avro
[20:24:08] top 5 themes: 1) editor retention metrics, 2) endpoint for experimental data, 3) overall community health (report card), 4) WMF program impact analysis, 5) readers (unique visitors)
[20:24:31] those aren't really goals.
[20:24:44] no, they are themes
[20:24:44] or at least, they don't imply a plan of action
[20:25:13] my point in the question was just to make sure we agree that supporting A/B testing and conversion analysis is high-priority
[20:25:23] i'd argue that all those things except (3) require it.
[20:25:32] er, (3) and (5)
[20:26:19] depends on how you define supporting A/B testing and conversion; if you mean we should be able to store that type of data, then yes
[20:26:34] ...
[20:26:41] you tell me how to support it.
[20:27:10] if you mean we build those libraries as well, then i'll say 'not within our scope'
[20:27:24] because i don't see how you correlate user behavior with decision points without session IDs and such
[20:27:26] i totally disagree.
[20:27:40] i think those libraries are essential to what we're providing.
[20:27:53] we're providing infrastructure for people to instrument and measure things
[20:28:08] everyone shouldn't have to write their own library to do the same thing
[20:33:56] ottomata: the append_field, I've added it
[20:34:52] ottomata: https://gerrit.wikimedia.org/r/#/c/29779/
[20:34:56] ottomata: please have a look
[20:35:35] cool
[20:35:42] what's going on in anonimize.c?
[20:36:03] spaces ( :p ), and
[20:36:03] else {
[20:36:03] };
[20:36:03] ?
[20:40:04] ottomata: yes, that's true, the spaces I need to fix, and the elses
[20:40:56] looks like there aren't really any changes in that file, right?
[20:42:15] ok, added comments inline
[20:42:25] we can discuss them here
[20:47:39] brb
[20:50:29] mk
[21:21:07] Anyone around?
[21:23:12] what's up chad21?
[21:23:27] Hey drdee, how's it going
[21:23:53] doing good, how can we help you?
[21:24:16] Have been doing some research using wikipedia data dumps to try and help people suffering from diabetes, and had a quick question regarding some datasets
[21:24:28] Any chance that might be something you could help with?
[21:24:47] possibly
[21:24:54] shoot
[21:26:54] In the research I'm trying to build a case for needs diabetics have.. I've been at it for about 4-5 months now and just stumbled onto this: https://blog.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedia-search-data-now-available/
[21:28:06] Being able to map queries to wiki pages/categories, and also identify other non-semantically related queries performed by a user, would really be helpful
[21:28:34] yeah i totally agree
[21:28:47] but we had to withdraw the log files, i am sorry
[21:28:47] It looks like the dataset was pulled for some privacy issue. I wanted to find out if there was any possible way of getting the data set? Possibly applying for it or signing an NDA?
[21:29:03] no, not at the moment
[21:29:10] Shoot :|
[21:29:24] Any idea if there will be an opportunity sometime in the future?
[21:31:13] i certainly hope so, but it's not a high priority
[21:33:25] Yeah, I can imagine as much, but that's unfortunate... really hard to assemble correlated needs of people suffering without data like this. Any chance you know of any alternatives..? I've had no luck sadly
[21:33:56] no, i am afraid not
[21:36:12] I'd offer to volunteer my time for free to sanitize it of any privacy information, but somehow I don't think that would help you much as it's a low priority :)
[21:37:06] drdee, pagecounts-raw data currently being loaded into hdfs
[21:37:18] great!
[21:37:19] look in /user/diederik/pagecounts-raw/2012/{2012-09,2012-10}
[21:37:27] aigt
[21:37:32] it'll take a while to finish, but it is going
[21:37:52] Anyway, thanks a lot for the help Diederik, and to the rest of you for making this stuff available to the research community
[21:42:40] chad21: sorry for not being able to help you more right now, just keep an eye on the blog
[21:45:56] Thanks, will do
[22:35:25] diff --git a/squids/perl/EzLib.pm b/squids/perl/EzLib.pm
[22:35:25] damn
[22:35:25] that's not what I wanted to write
[22:35:25] ok so I have a question about gerrit
[22:35:30] let's suppose I have a git review which has not been reviewed yet
[22:35:45] hey drdee got your email
[22:36:02] and I want to make another one on a different topic, but it needs code from the one that needs review
[22:36:43] my question is: how do I do this ? do I branch out of my previous review so that I can make this new one, or do I have to wait for the existing one to be reviewed ?
[22:36:51] currently I'm blocked on this
[22:37:12] can you guys please help me with this ?
[22:38:10] tl;dr => I want to make a new change on a different topic that depends on a not-yet-reviewed code change. how can I work on the different topic without being blocked by the review in progress ?
[22:44:46] average_drifter: i am not even nearly an expert in gerrit trickery, but i'm pretty sure you're going to end up in trouble if you make dependent changes on different changesets
[22:53:45] dschoon: hmm, ok
[22:53:52] dschoon: that's what I also thought
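
One common way to do this, with dschoon's caveat in mind (stacked changes get painful if the parent is rebased or rejected): check out the pending change locally and commit the new work on top of it, so Gerrit records the new change as dependent on the old one. A sketch assuming the git-review tool, reusing the change number from earlier in the log as a stand-in:

    # download the pending change into a local branch
    git review -d 29779

    # start the new topic on top of it
    git checkout -b new-topic
    # ... hack on the new work, then commit it:
    git commit -a -m "New work that needs code from the pending change"

    # push for review; Gerrit records this change as dependent on the
    # earlier one, and it can only merge after its parent does
    git review
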