[00:07:13] drdee, dschoon: do you know if this can be closed? https://rt.wikimedia.org/Ticket/Display.html?id=3098 [00:07:19] it looks resolved to me [00:07:53] ottomata would be the man who knows [00:07:54] don't think so [00:08:02] there are still issues iirc [00:08:33] drdee: any idea what? should i just ask ottomata tomorrow? [00:08:48] you should not worry [00:09:07] the stuff that doesn't get backed up is some history stuff from wikistats [00:09:16] the rest is all fine on stat1 [00:09:35] there's /a/eventlogging, which is new -- if the config blacklists wikistats stuff, then i should be ok [00:09:44] if it's a whitelist, it might need to be updated [00:09:59] you should be fine, but i will double-check tomorrow with ottomata [00:10:04] kk, thanks [00:10:29] while i'm here, i owe you guys an update re: kafka -- i didn't get around to implementing the producer since i saw you the test one mostly stablized [00:10:47] still intend to do it, but have a few other things to plough through first [00:10:53] saw you got [00:12:29] ….? [00:40:05] ori-l: that sounds awesome [00:40:12] can we sync up in the office tomorrow at some point? [00:40:28] (and i'll relate things back to the team) [00:52:20] brb [01:34:29] mk [01:49:42] drdee: http://stat1.wikimedia.org/spetrea/new_pageview_mobile_reports/r4/pageviews.html [01:49:49] drdee: right now we have ranking logic in there [01:50:17] drdee: also talked to milimetric and he showed me the d3 color ramps and I was able to use them to replace the ramps that are currently in wikistats [01:50:47] drdee: milimetric and I have experimented a bit with the color ramps here https://github.com/wsdookadr/d3-playground [01:51:41] drdee: I kept it simple, if you look at the source the code for the color ramps it's like really short, couple of lines compared to hundreds of lines in wikistats [02:20:14] ok looking now [02:20:44] looks great! [02:51:23] I'll dig through the code to see how wikistats checks mobile/non-mobile [02:51:29] but probably webstatscollector does that for it [02:51:51] if an url is mobile, can it also have a language code ? [02:52:06] I mean I take the language from the url right ? [02:52:17] and I detect if it's mobile from the User Agent ? [02:52:58] I'm wondering how webstatscollector can know if something is mobile or not because it only knows that depending on stuff like http://m.wikipedia... [02:53:08] but then you don't have the language anymore , or am I wrong ? [02:53:38] webstatscollector does that by looking for .m. in the url [02:53:55] if it is mobile it will also have a language code [02:53:57] so for example [02:54:03] en.m.wikipedia.org [02:54:22] so only use the URL to determine whether it's mobile [02:54:28] http://de.m.wikipedia.org/ would be the german wikipedia [02:54:32] yes [02:54:34] ok, only the url then [02:54:37] yup [02:54:46] how about the mobile/non-mobile ? [02:54:56] can I write a quick XS wrapper over dclass for Perl ? [02:56:13] what do you mean "how about the mobile/non-mobile ?" [02:56:14] if I use the mobile/non-mobile from wikistats I'd have to decouple it from everything in there [02:56:33] drdee: I mean this report http://stats.wikimedia.org/EN/TablesPageViewsMonthlyMobile.htm [02:56:43] drdee: it shows mobile page views [02:56:47] yes [02:56:55] oh sorry ! [02:56:57] I got it [02:57:06] * average_drifter was slow [02:57:24] why do you need declass? [02:57:49] drdee: well I was just thinking that we have a classification like mobile/tablets/non-mobile [02:57:57] drdee: but if we detect mobile through the url then that's not needed [02:58:01] ok [02:58:02] drdee: I was overanalyzing I guess [02:58:20] :D [04:48:40] http://search.cpan.org/~rjh/Gzip-RandomAccess-0.91/lib/Gzip/RandomAccess.pm [04:48:51] I did what this guy did but a long time ago and for bz2 not gzip [04:48:54] heh [06:18:43] average_drifter: i thought you can't effectively seek in a bz2 file [06:19:15] existing implementations construct a table mapping compressed blocks to decompressed offsets [06:19:17] iirc [06:28:15] ori-l: yes, that's what I was doing too [06:28:34] milimetric: hey [06:29:05] milimetric: https://github.com/wsdookadr/d3-playground [06:29:13] milimetric: I was looking at some barcharts [06:29:19] milimetric: I took their example from the docs [06:29:33] milimetric: made a wrapper function called drawChart which does exactly that [06:29:41] milimetric: takes data array and a container id [06:29:57] milimetric: the problem is now that the smallest item is way too small in comparison to the other ones [06:30:42] probably me doing something wrong with their domain or rangeRound [06:32:06] ori-l: "This is achieved by streaming the gzip file in advance, building an index mapping compressed byte offsets to uncompressed offsets, and at each point storing the 32KB of data gzip needs to prime its decompression engine from that point" [06:32:17] ori-l: quote from Gzip::RandomAccess [06:32:24] ori-l: that's what the guy above is also doing [06:33:51] right [13:45:36] morning everyone [13:45:44] hey average_drifter, I tweaked your solution just a bit [13:46:05] I think you'd just need a constant for your output range, that way you don't have to mess with the domain [13:47:58] average_drifter: https://gist.github.com/4452715 [14:37:41] yoyoyo ottomata!!!! [14:38:00] did you see my patch for standardizing the filenames for kafka generated files? [14:42:38] morning milimetric, average_drifter [14:43:01] morning :) [14:43:24] average_drifter started doing d3 work, did you see? [14:44:14] not sure [14:44:32] mooorning! [14:44:33] rading it now [14:44:36] reading it [14:45:08] that would mean frequency would have to match the cron job, right? [14:45:15] can frequency be greater than 60? [14:45:52] oh i see, it just picks an even minute number [14:47:45] no it does not pick an even minute number and the frequency cannot be greater than 60 [14:47:58] event requests are consumed daily [14:48:23] ok, didn't know [14:48:29] have to add a fix for that [14:49:29] why aren't event requests stored hourly? [14:51:20] they could be, the data is just small [14:51:27] i was trying to keep it so file sizes were larger than block size [14:53:36] that's a good point [14:54:03] do you like the general approach of the patch? [14:57:48] how is the timestamp converted to a string? [14:57:51] output_path = "%s/%s/%s" % (output_dir, topic, timestamp) [14:57:57] its a dict, right? [14:59:47] no, it's not a dict [14:59:52] it's date time object [15:00:38] oh sorry, yeah doh [15:00:59] the function doc comment were throwing me off [15:01:17] i was trying to be clear in explaining how it worked but i was struggling [15:01:23] so it's not clear at all [15:01:33] there is dict generated for 'collapsing minutes' [15:01:42] but i am not even sure if collapsing is a clear term [15:02:04] aye, i'd keep function documentation to explaining how to use the functino and what it is for, and then put comments in the function about how it works [15:02:19] if you were reading generated python docs, you wouldn't really care about the internals of how a function worked [15:02:44] what does predictable minute and seconds component mean? [15:02:51] like [15:02:51] if the current time was 10:06 [15:02:52] and frequency was 15 [15:02:55] timestamp would be 10:00 [15:02:56] ? [15:03:10] 10:47 -> 10:45? [15:05:30] 10:47 should be 11:00 [15:05:45] 10:06 -> 10:15 [15:11:25] hm, why that way insead of the other? [15:11:57] there almost certainly won't be any logs in that import from between 10:06 and 10:15 [15:12:13] because it is very confusing to have a filename that says 2013-01-04_10.00.00 and then you have have data past 10 oclock [15:12:20] if your frequency was hourly [15:12:24] and your time was 10:01 [15:12:37] then you wouldn't really have any logs from the 10:00 hour [15:12:45] ok, well that's an easy fix [15:13:00] remove the +1 from collapsed_minutes = (dt.minute / frequency) + 1 [15:13:05] but, i don't think you should be relying on filenames for hard data boundaries anyway [15:13:13] even with the archived .gz files [15:13:15] they aren't reliable [15:13:27] not relying, indicating :) [15:13:29] aye ok [15:13:43] and i think with that hourly example, that is what would usually happen [15:13:45] since these are run by cron in a loop [15:13:49] ok [15:14:02] then the daily logs need a special case as well [15:15:16] q, what's this for? if filename timestamps aren't reliable anyway, isn't just using the cron's run date slightly more accurate? [15:15:50] hey, brb, i'm going to restart my computer [15:17:47] hookay back [15:19:50] drdee: got a new version [15:21:19] drdee: http://stat1.wikimedia.org/spetrea/new_pageview_mobile_reports/r5/pageviews.html [15:25:07] ottomata, this is for predictable filenames so we can use the coordinator in oozie [15:25:18] average_drifter:looking now [15:25:33] cool cool, making nice progress! [15:26:14] ottomata, because the minute and seconds components are arbitrary right now in the filenames we cannot tell oozie coordinator what input paths it should use [15:26:25] that's why we need to standardize them [15:26:37] oozie does not support globs in path matching [15:26:45] ah hm [15:27:18] so in a coordinator you can say frequency 5 minutes [15:27:21] i like this idea then, but another simpler one would be to take the dates out of the filenames [15:27:22] start time 10:00 [15:27:26] and just use sequence nums or whatever [15:27:32] you can see the dates on the mod times [15:27:45] no no oozie does not support that [15:27:46] or would oozie not like that either [15:27:48] aye ok [15:27:56] going back to the coordinator [15:28:03] example [15:28:07] frequency 5 minutes [15:28:12] start time 10'00 [15:28:32] then it will generate $MINUTE variable {0,5,10,15,20,25,30,35,40,45,50,55} [15:28:50] and there is no $SECONDS variable so that should always be hardcoded to 0 [15:29:50] ok cool, i like it, i think we should round down instead of up [15:29:58] and, could you explain this in comments in that function too? :) [15:30:09] include the reason why you are doing it [15:30:18] sure [15:30:31] i'll round down, improve docs [15:30:38] and then the daily filename [15:30:44] just make a special case? [15:31:06] naw, what if we had different intervals? [15:31:10] most webrequests are hourly [15:31:15] mobile is 15 minutes [15:31:18] event is daily [15:31:33] right so mobile and web requests will work [15:31:36] event won't right now [15:31:47] i wonder if there is a python cron syntax interpreter :p [15:32:09] for daily, what should the timestamp be? [15:32:46] 2013-01-03.00:00:00 or 2013-01-03.23:59:00 [15:32:48] or something else? [15:32:51] maybe helpful? [15:32:52] http://packages.python.org/APScheduler/cronschedule.html [15:33:09] round down, if you want, leave off the time [15:33:37] ok [15:51:03] ottomata, pushed fix [15:59:23] ok cool [15:59:24] looks good [15:59:25] oh [15:59:27] another thing [15:59:32] print datetime gives you a space, no? [15:59:35] >>> print datetime.datetime.now() [15:59:36] 2013-01-04 10:59:09.997872 [15:59:41] argh [15:59:44] good catch [15:59:55] will fix that as well [16:00:00] underscore, right? [16:00:05] you shoudl probably strftime or whatever you do in python to return the string intead of a datetime [16:00:10] yep [16:00:13] yeah, and I was using . instead of : in the time [16:00:22] ok, will fix that as well [16:00:41] http://hue.analytics.wikimedia.org/filebrowser/view/wmf/raw/webrequest/webrequest-blog/2012-12-17_18.30.02?file_filter=any [16:00:42] oops [16:00:54] http://hue.analytics.wikimedia.org/filebrowser/view/wmf/raw/webrequest/webrequest-blog/2012-12-17_18.30.02 [16:00:54] ack [16:00:56] 2012-12-17_18.30.02 [16:07:02] ottomata, fixed as wel [16:07:03] l [16:07:42] so then one more issue….. how do we rename all the existing folders? [16:07:51] ha do we have to? [16:08:06] uhhmmmm yes else we cannot use that data in oozie [16:08:08] can I just move them else where and archive them? [16:08:09] :p [16:08:14] and just start from now? [16:08:19] we've got holes in late dec anyway [16:08:29] ok, let's do that [16:08:41] actually, i don't mind scripting somethign to rename them [16:08:45] i could even use your function here [16:08:53] maybe we should do that, and just start from jan 01 [16:09:02] yes i like that [16:09:08] everything before that has had holes, and 2013 seems like a nice place to start [16:09:14] are you going to merge my patch now :D ? [16:09:15] ok [16:09:20] ha, sure [16:10:55] done [16:13:07] NameError: global name 'delta' is not defined [16:13:11] dt += delta [16:13:26] oh that line needs to be deleted [16:13:33] sorry [16:14:01] so [16:14:02] i already merged into master [16:14:05] so just make your change there [16:14:18] pushing now [16:14:22] oh its pushed [16:14:59] do i need to fix it? [16:15:02] i can [16:15:12] just remove that line? [16:15:35] yes [16:15:46] it's from the rounding up stuff [16:16:19] collapsed_minutes = (dt.minute / frequency) [16:16:21] ZeroDivisionError: integer division or modulo by zero [16:16:23] hehe, will fix [16:17:08] sorry, but i don't have test environment to properly test it :( [16:17:15] i'm running it on my mac [16:17:21] oohhhhh [16:17:22] don't need to do the hadoop stuff [16:17:27] just call the function and print the timestamp [16:17:41] i tested it using frequency=15 [16:17:42] :) [16:18:49] this ok? [16:18:49] # Special case were frequency=0 so we only return the date component [16:18:49] if frequency == 0: [16:18:49] return datetime.now().strftime('%Y-%m-%d') [16:19:29] you know, you could handle an arbitrary minute frequency [16:19:38] if you calculated the timestamp in epoch seconds [16:19:44] and then converted it to a datetime [16:19:45] you said to ignore the timestuff [16:19:52] yeah, no i mean [16:19:54] instead of doing a special case [16:20:01] then you could do every 2 days [16:20:03] or every 2 hours [16:20:05] or whatever you wanted [16:20:32] buuut, whatever, this will work for now [16:20:37] ok [16:24:55] ok, i just pulled on the consumer node [16:25:09] oh, i need to add freqency arg eh? [16:25:10] to cron jobs [16:26:33] it's default is 15 [16:26:58] so you would need to set it for the event and web requests jbos [16:27:32] yeah [16:27:36] is it 60 for webrequest jobs? [16:28:32] yes [16:29:41] k [16:53:33] ottomata, dario seems to have two accounts, dartar and DarTar [16:53:52] in hue? [16:54:35] yes [16:54:43] http://hue.analytics.wikimedia.org/filebrowser/view/user?file_filter=any [16:55:07] and we have a /users that is empty [17:06:39] hmm, drdee, I think that frequency == 15 isn't workign with > :45 timestamps [17:07:01] 2013-01-04_15.59.01 -> 2013-01-04_15.00.00 [17:12:49] drdee, shouldn't this: [17:12:49] blocks = 59 / frequency [17:12:49] be [17:12:49] blocks = 60 / frequency [17:12:51] ? [17:26:51] no :) [17:26:59] because 60 => 0 [17:27:09] that was a bug that i fixed [17:37:19] ? i changed it, because otherwise 10:50 -> 10:00 [17:37:22] with frequence=15 [17:37:35] blocks = 59 / 15 == 3 [17:37:50] 60 / 15 = 4 [17:38:10] ok [17:52:40] drdee, does this look good? i tested on fr-banner because it odesn't really have any data in it: [17:52:40] http://hue.analytics.wikimedia.org/filebrowser/view/wmf/raw/webrequest/webrequest-fr-banner?file_filter=any [17:52:43] the dir names? [17:52:55] YES PERFECT! [17:59:32] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 [18:15:14] ottomata, it seems that the blog web requests are no longer updated since december 19th….. [18:15:42] oh SORRY! [18:15:50] wrong and old tab [18:51:56] http://stat1.wikimedia.org/spetrea/new_pageview_mobile_reports/r6/pageviews.html [18:52:02] I plugged in some real data [18:52:04] but just for one day for each month [18:52:52] YIHAAAAA Oozie coordinator using blog page views is working! [18:53:47] average_drifter: cool, if you scroll all the way to the right then you see that some of your country detection stuff is not working properly [18:54:02] woot, i'm finishing up mobile stuff now [18:54:11] mobile log name standardization [18:54:15] aight [18:54:41] ottomata, so maybe we need a regular hdfs user that runs these type of jobs, i don't think it should run under our own accounts, what do you think? [18:54:55] hdfs too priviledged? [18:55:12] hey ottomata, do you know if /a/eventlogging on stat1 gets picked up by the tape backups? [18:55:18] oh right hdfs is already a user [18:55:26] hmmm [18:55:37] i mean like the wikistats user on stat1 [18:56:35] ori-l, i think it hsould be [18:56:56] ottomata: how can i know definitively? [18:57:04] where should i check / whom should i ask? [18:57:18] well, i'm looking at the amanda disklist file in puppet [18:57:27] and it backs up /a for sure, but only excludes certain dirs [18:57:31] and that one is not excluded [18:57:34] beyond that i'm not sure what happens [18:57:39] mutante might know more and be able to check [18:58:41] yeah, drdee, that might be good [18:59:10] ok [18:59:18] i don't mind using the same user [18:59:19] 'stats' [18:59:21] that is on stat1, etc. [18:59:32] or do you think we should have a new one? [18:59:33] ottomata: thanks [19:00:30] let's use the same one to make our lives easier [19:00:45] ok [19:03:03] this is also cool: https://issues.cloudera.org/browse/HUE-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel [19:03:12] hue now supports mr2 job browsing [19:05:16] nice! [19:06:04] waiting for CDH 4.2 -> pig 0.11 and job browsing with hue! [19:07:34] dschoon, erosen, milimetric [19:07:40] aye [19:07:41] what should I do about getting data into Limn format [19:07:43] ? [19:07:50] what is the current format? [19:07:51] right now, my data is 3 columns [19:08:02] timestamp, continent, count [19:08:21] i want to pivot so that each row is a timestamp, each column is a continent, and each cell is a count [19:08:28] i use pandas for pivoting [19:08:33] pandas [19:08:34] it works pretty well and is pretty simple [19:08:38] yeah one sec' [19:09:00] http://pandas.pydata.org/pandas-docs/stable/reshaping.html [19:11:45] yeah, ok cool, so it would be good to have a generic way to do this for any data [19:12:04] so maybe a python script in hadoop to do this at the end of the oozie job [19:12:21] the way I was doing it in pig was only going to work for my data [19:12:24] basically you would do something like: python -c "import pandas as pd; import sys; df=pd.read_table(sys.argv[1]).pivot().to_csv(sys.stderr,index=False)" test [19:12:42] test is name of file with data? [19:12:46] yeah [19:12:56] and the pivot args need to be filled in [19:13:26] with something like this: df.pivot(index='date', columns='variable', values='value') [19:13:43] i can get the pandas thing working if you want to give me a some data [19:15:59] yeah that would be great, maybe we should check it into limn repo as a helper script or somethign [19:16:02] could do kraken [19:16:03] for now [19:16:40] yeah I might actually put it as a command line script in limnpy [19:17:04] oh limnpy right! [19:17:04] yeah cool [19:21:42] erosen: [19:21:42] https://gist.github.com/4455163 [19:21:46] sample small dataset [19:21:49] with only two timestamps [19:22:14] tsv [19:25:15] great [19:33:50] ottomata, the only thing that we need to figure out is how oozie and timezones work :) [19:34:36] yeah, it lets you select them how everyou want, eh? [19:34:53] can we just always select UTC and not think about it? [19:35:20] i hope so :) [19:36:51] running my continent job on all of the mobile data right now :) [19:36:59] 2013 data [19:38:29] COOOOOOOOOOOOOOOOOOOOOL [19:52:59] erosen, here is a larger dataset: [19:53:00] http://analytics1001.wikimedia.org:8085/wmf/public/tmp/mobile-by-hour-by-continent0.out/part-r-00000 [19:53:04] that's all 2013 mobile by hour by continent [19:55:02] looks really good, only it seems that unknown continents is an empty field [19:55:53] yeah, i shoudl change that eh [19:55:58] why are there so many unknowns, too? [19:56:19] dunno :) [20:01:23] average_drifter: did you see my gist? [20:06:22] ottomata, when do you want to deploy the new wikipedia zero filter that uses the x-carrier header and turn off the other ones? [20:06:40] oh good q! let's make sure we've got good x carriers! [20:06:47] true :) [20:06:58] and maybe don't run the new filter on oxygen but just on kraken [20:07:03] http://hue.analytics.wikimedia.org/filebrowser/view/wmf/raw/webrequest/webrequest-wikipedia-zero/2013-01-04_19.00.00/part-1355947335321_2741-m-00000?file_filter=any [20:07:06] not looking like much [20:07:11] i see accept-language [20:07:13] no x-carrier [20:08:08] how is that data generatedd/ [20:08:36] ottomata: that continent dataset looks awesome. [20:11:13]