[00:00:46] k [00:02:52] so dudes [00:03:15] i think what you are talking about it relevant to my next todos [00:03:21] but now that the new data is coming in [00:03:26] new! data! [00:03:34] how should we make it available? [00:03:41] you guys are talking about the device stuff, right? [00:03:44] which data? [00:03:45] but the zero is pretty settled? [00:03:49] yeah, device-props [00:03:54] http://stats.wikimedia.org/kraken-public/webrequest/mobile_country_device_vendor/ [00:04:02] http://stats.wikimedia.org/kraken-public/webrequest/zero_carrier_country/ [00:04:40] yeah, i think zero is nailed down, afaik [00:04:44] *nod* [00:04:52] i can, make a cron job to cat out and sort, and probably even limnify [00:04:52] so, the output from that job is not close to right yet [00:04:57] lots of things missing [00:05:05] from zero? [00:05:05] or from device? [00:05:05] and things incorrect [00:05:05] device. [00:05:11] zero is hot shit, imo [00:05:14] ok, that's fine [00:05:19] i can work on zero [00:05:27] the work I do for that will be good for device once we get there too [00:05:54] date format on zero is shit [00:06:00] but otherwise i think it's fine? [00:09:13] device indeed needs some more love [00:09:24] meaning the user agent classifier pig udf [00:10:20] ok, well [00:10:26] what should I do with zero files at the moment then? [00:10:26] yep. [00:10:34] cron job to cat + limnify? [00:10:34] i made progress on it today. [00:10:34] sure. [00:10:56] drdee, for limnify, do we want to pivot on x-carrier name or x-cs? [00:11:02] that'll get us to testing, and then evan can take a crack at replacing his datastreams with this [00:11:47] ottomata: ask erosen, he needs to analyze the data in the end [00:11:59] erosen: around? [00:16:17] drdee, any idea what this might mean? [00:16:18] http://localhost:19888/jobhistory/logs/analytics1014:8041/container_1363221635790_1228_01_000002/attempt_1363221635790_1228_m_000000_0/stats [00:16:27] loooking [00:17:52] those logs are not revealing a lot [00:18:00] that might be me [00:18:02] strangely [00:18:08] i am mucking about in the grunt shell [00:18:14] and i see in that log: [00:18:26] :D [00:18:30] ERROR 2244: Job failed, hadoop does not return any error message [00:18:31] org.apache.pig.backend.executionengine.ExecException: ERROR 2244: Job failed, hadoop does not return any error message [00:18:31] at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140) [00:18:32] at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:193) [00:18:34] at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165) [00:18:36] at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84) [00:18:38] at org.apache.pig.Main.run(Main.java:430) [00:18:42] * drdee is off to diner, might look at things later [00:19:00] that happened for 3 different device hourly runs [00:19:44] oh dschoon: i am sending email about 61 now, ok? [00:20:04] 3 [00:20:05] 2 [00:20:05] 1 [00:20:06] ok [00:20:12] nooo [00:20:16] not done yet! [00:20:29] after all [00:20:31] there's only one metric [00:20:35] we can afford to not lie :) [00:21:53] yes there should only be 1 metric [00:22:15] sorry for the wait :) [00:22:17] just a few more minutes [00:22:37] yes there should only be 1 metric [00:22:40] milimetric! [00:22:46] :D [00:22:51] haha [00:25:30] there shall be only 1 metric [00:25:36] thou shalt have no other metrics before me [00:27:15] guys, dschoon, what if I make separate concat_sort coordinators? [00:27:23] that define input datasets that are the output of the others [00:27:36] that way the concat sort won't interfere with the actual data generation [00:27:37] hmmm [00:30:02] i think that's a good plan [00:30:10] wellll [00:30:10] hm [00:30:14] gimmee a sec [00:30:17] need to finish this for dieds [00:30:36] k [00:33:00] weird, those hours that failed for device, run fine as pig scripts, just not in oozie [00:33:04] even as individual workflows [00:37:58] lol ori-l [00:48:18] erosen! [00:48:22] you there? [00:49:26] ja [00:54:17] heyyy [00:54:18] i'm using limnify again! [00:54:20] can you help? [00:54:33] getting ValueError: time data '6' does not match format '%Y-%m-%d_%H' [00:54:34] erosen [00:54:40] yo [00:54:58] are you familiar with datetime.datetime.strptime? [00:55:08] a wee bit [00:55:09] but not really [00:55:10] basically does it match? [00:55:27] http://docs.python.org/2/library/datetime.html#strftime-strptime-behavior [00:55:37] i'm happy to help, i'm just pointing you to my first thoughts [00:55:39] ja, i mean, i read that to get my fromat [00:55:45] k [00:55:46] --datefmt "%Y-%m-%d_%H" [00:55:50] so it shoulllld be workign [00:55:52] ka [00:55:55] 2013-03-14_22 [00:55:55] k* [00:55:56] looks like that [00:56:03] but, i don't know where '6' is from [00:56:09] ValueError: time data '6' does not match format '%Y-%m-%d_%H' [00:56:15] aah [00:56:21] maybe a delimiter problem? [00:56:41] - [00:56:42] can you give me a line or two [00:56:48] --delim "\t" [00:56:50] log into an02 [00:56:53] k [00:56:54] i can show you where i'm working [00:57:01] oo, and the shared screen is still running! [00:57:03] mabye we can both join it [00:57:07] hehe [00:57:59] what is the name again [00:58:00] ? [00:58:03] otto/shared [00:58:35] i'm there [00:58:37] where should i go [00:59:20] we're both in the screen? [00:59:25] err [00:59:28] i thought so [00:59:47] k [00:59:47] now i am [01:00:14] cool [01:01:24] there we go [01:01:53] great [01:01:57] interesting [01:02:40] ottomata: i don't see a head here [01:02:47] how does it know what the val col is? [01:02:58] --valcol 3 [01:02:59] ? [01:03:02] -header [01:03:05] i just missed that [01:03:06] my b [01:03:35] btw, do you know what the 3rd field is? [01:03:37] iso? [01:03:42] i think so [01:03:45] what is that? [01:03:46] iso-2 [01:03:46] iso what? [01:03:50] country? [01:03:53] iso-3991....? [01:03:55] yeah [01:04:00] so it needs to pivot and sum? [01:04:14] so is this the zero data? [01:04:17] i ah yeah it is country [01:04:17] yes [01:04:30] because ironically my script take as input a file of this format [01:04:34] yeah, i guess also btw, what output to you want? [01:04:35] modulo some reordering [01:04:42] what script? [01:04:49] the script that make the dashboards [01:04:56] basically there is a lot of custom stuff in there [01:05:00] hm, welp [01:05:09] http://stats.wikimedia.org/kraken-public/webrequest/zero_carrier_country/ [01:05:12] this is updated every hour [01:05:27] nice [01:05:49] i'm trying to use limnify to create a datafile for limn again [01:05:55] yeah [01:05:55] is that useful? [01:06:05] i mean it might be for debugging purposes [01:06:08] yeah [01:06:14] well, i guess this is one of our deliverables anyway? [01:06:21] but for the end use, I'll need to group it and what not [01:06:42] group it? [01:06:49] by? [01:06:55] by carrier, i guess [01:06:58] which is what we're doing [01:07:12] i guess i would just as well not have to group it myself [01:08:31] limnify will sum with —pivot, right? [01:09:00] yup [01:09:03] it will do what we want [01:09:05] it is a good idea [01:09:16] i don't like the amount of custom munging that happens in my dashboard scripts [01:09:38] and a lot of that has to do with the fact that I start off with the data in a weird format [01:09:39] yours are working on the IP based files on oxygen still, right? [01:09:44] yup [01:09:53] yeah, this is unsampled filtering on x-cs [01:10:04] nice [01:10:08] yeah, htis is the future [01:10:10] i'm happy about that [01:10:28] you need 4 header names? [01:10:32] i thought header was for the output file [01:10:37] hrm [01:10:43] i don't think so [01:10:48] hm, k [01:11:00] the help should be clearer but i think it names the input file [01:11:01] eyah i guess not, it probably just picks the headers you give it based on the columns [01:11:05] and then you can use those names for pivoting [01:11:06] yeah you're right [01:11:12] cause you can put them in your inputs and not use --header [01:11:25] yeah [01:11:30] good deduction [01:12:33] i htink it is the quotes! [01:12:43] yay [01:12:45] help says datecol is int [01:12:54] ahhhhh! [01:13:02] yeah it is either [01:13:06] cool [01:13:11] coool thank you [01:13:15] if you specify a header, it will use the names [01:13:20] glad it worked [01:13:23] do you want this rolled up via country too? I could easily run both pivots [01:13:26] keep me updated when it acts up again [01:13:42] no country rollup [01:13:44] k [01:13:53] i need mobile views by country in order to make dashboards thats all [01:14:04] total mobile views? [01:14:26] we have been calling that page views so far [01:14:30] right [01:14:37] total mobile pageviews by country [01:14:47] that's probably a very easy one, [01:14:53] drdee, is that a card? [01:15:02] (sorry, I am not paying attention to the card discussion) [01:15:08] it's part of 61 [01:15:09] (maybe that's what you guys were talking about earlier) [01:15:10] aye cool [01:15:31] thanks erosen!~ [01:32:38] dschoon: okay if i send email about card 61/ [01:36:08] okay [01:36:20] just NOW have determined the pageview number is wrong in the script [01:36:28] but the definition is doable [01:36:37] hey brain bounce time [01:36:47] i can cat and sort and limnify via the cli [01:36:59] 1. where should I run that as a job (analytics cluster or stat1)? [01:37:02] to be clear, it will be the count of unique (timestamp, country, device_os) tuples [01:37:04] 2. where should the output files live? [01:37:14] drdee: ^^ [01:37:21] i vote you do it in the cluster [01:37:41] output can be where it pleases you [01:37:49] but it needs to be public, so [01:37:50] you could stuff it back in HDFS if you really wanted [01:37:52] i could put it back into hdfs [01:37:52] yeah [01:37:56] and then it would get synced to stats. [01:38:00] /wmf/public [01:38:02] i guess that is easiest [01:38:04] or stat1001, right? [01:38:06] yeah [01:38:09] dschoon: great news! and that matches with my intuition :) [01:38:11] stats.wikimedia.org/kraken-public [01:38:16] okay. [01:38:20] (which is on stat10010) [01:38:22] 1001* [01:38:36] oh. [01:38:40] it's an rsync right now [01:38:42] i forgot :( [01:38:45] ja [01:38:46] i thought it was still proxied. [01:38:48] derf. [01:39:36] hmm, ok [01:39:41] this won't be puppetized then…which I guess is ok [01:47:55] just fyi, drdee [01:48:24] i was talking with evan, and we both agree that the fields that appear in the output need to be the fields that generate the count [01:48:31] otherwise it's ... lying [01:48:55] this is why i wanted to make more files [01:49:18] which i still think is the right answer -- capture all the dimensional metrics of interest, and do it in 2D tables (CSVs) [01:49:43] we can talk more later, or tomorrow [01:49:52] i'm heading to the pizza & beer now [01:49:53] :) [01:49:54] bbl [01:52:04] ergh, this seems so hacky! [01:52:05] hadoop fs -cat /wmf/public/webrequest/zero_carrier_country/*/*/*/* | sort | ../bin/limnify --delim "\t" --header Date Carrier Country Pageviews --datefmt "%Y-%m-%d_%H" --valcol Pageviews --name "Pageviews by Wikipedia Zero Carriers" --id "zero_carriers" --datecol Date --pivot [01:52:09] every time! [01:52:11] cat and sort [01:52:15] cat and sort [02:00:41] milimetric, you still around? [02:24:58] [travis-ci] master/0f38c4b (#92 by Andrew Otto): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5516296 [02:26:19] [travis-ci] master/957cd96 (#93 by Andrew Otto): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5516323 [02:27:53] [travis-ci] master/d4a5441 (#94 by Andrew Otto): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5516351 [02:29:04] [travis-ci] master/e3ff8e1 (#95 by Andrew Otto): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5516394 [02:34:18] ottomata: yes, I'm here [02:34:23] cool! [02:34:28] what's up [02:34:31] http://stats.wikimedia.org/kraken-public/webrequest/zero_carrier_country/_limn/pageviews_by_zero_carrier/ [02:34:35] updated every hour! [02:34:42] woa! [02:34:43] dude [02:34:43] as long as the cron job doesn't fail me [02:34:55] it has run on its own yet :p [02:34:57] I shall beer thee next time we meet :) [02:35:02] its hacky and not puppetized [02:35:05] but ya know [02:35:10] no that's awesome [02:35:27] thank you sir, I will point some graphs at this tomorrow [02:35:31] enjoy your vacation [02:35:42] woohooo [02:35:43] cool [02:38:16] lattaas [03:56:53] (back) [03:57:55] (i'm here too) [04:25:41] btw, dschoon, yay! [04:25:42] http://stats.wikimedia.org/kraken-public/webrequest/zero_carrier_country/ [04:25:49] http://stats.wikimedia.org/kraken-public/webrequest/zero_carrier_country/_limn/ [05:07:48] nice! [11:09:33] got two shots of anaesthesics [11:09:40] @dentist [11:11:16] today that wisdom tooth is put to death [11:11:36] cant feel half of my fac lol [11:11:41] *face [12:11:44] got back [12:11:51] average_drifter -= wisdom_tooth; [12:12:32] average_drifter += pain + bandage; [12:20:29] so it's done [13:29:18] mooooooorning [13:51:55] drdee: hi [14:00:10] hi [14:05:20] average_drifter: check email [14:05:57] saw it [14:06:41] so you have tuples date,count [14:06:45] what are you counting? [14:06:54] pageviews overall ? [14:06:55] mobile pageviews? [14:07:00] mobile pageviews for english language? [14:07:12] pageviews overall [14:07:35] can you try with en.m.wikipedia.org/wiki [14:07:41] you will see the bump [14:08:08] what do you mean "can you try with en.m.wikipedia.org/wiki" [14:08:34] mobile pageviews of the english language [14:08:59] because you said in the e-mail you did pageviews overall [14:09:01] 16:07 < drdee> pageviews overall [14:09:07] including mobile [14:09:19] right i am using the regular sampled log files [14:10:10] i did the count for pre and post dec 14 and on average the page views are the same [14:10:20] so that suggests to me that the java implementation does not have the bump [14:10:42] but we are doing mobile pageviews, which are just a percentage of pageviews overall [14:11:06] you are using the sampled1000 files right? [14:11:09] yes [14:11:17] so that includes mobile, right? [14:11:24] it does [14:11:57] so the java implementation handles mobile page views accurately and hence there is no bump [14:12:23] but above you said that you did pageviews overall and not mobile pageviews [14:12:35] i meant page views overall including mobile pageviews [14:13:37] ok [14:14:42] if mobile is 10% of overall and the bump is on mobile and I was looking at overall, I wouldn't notice the bump [14:15:30] anyhow, so kraken doesn't have the bump [14:15:50] I guess we have to replicate all the logic in the pageview definition of kraken.. [14:16:20] or.. not sure if there's a point in having the same logic in Python [14:16:43] drdee: should I continue to look for the cause of the 500M bump in the code? [14:18:18] well look at the java code and see if you see clear differences and those clear differences to python so we are sure we fixed it [14:19:53] ok [15:51:31] mornin [15:57:38] ack! [15:59:50] is a pretty useful program! [16:21:32] re: the Features with Questions view, can we remove the qqq tags when the questions are resolved? I see 5 cards that still have the tag, and I want to make sure there's nothing outstanding on those... [16:47:26] hihi [16:47:41] I will be attending the standup [16:47:44] but I cannot talk [16:47:46] otto isn't in today, right? [16:47:52] because of tooth extractio [16:47:53] n [16:47:57] heh [16:48:01] s'all good, average_drifter [16:48:04] ok [16:48:06] we'll just put words in your mouth [16:48:12] lol [16:48:15] "yes, i do want to be responsible for everything!" [16:48:21] "of course i'll have that tomorrow!" [16:48:26] oh noes [16:48:30] :P [16:48:32] :D [16:48:34] worry not [16:50:06] it'll be an FFT -> SVD -> compressive sensing UDF [16:50:09] cake! [17:20:38] erosen: http://stats.wikimedia.org/kraken-public/webrequest/zero_carrier_country/_limn/pageviews_by_zero_carrier/ [17:45:57] dschoon: can you save card 319 so i can start editing it? [17:47:58] saved [17:48:14] thx [18:14:13] dschoon, want to sync up about zero dashboad card? [18:14:18] sure [18:14:22] gimme 10? [18:14:26] sounds good [18:14:27] ping me when ready [18:51:42] erosen, i can look at the zero dashboard card with you [18:51:47] cool [18:51:55] let's see if i understand correctly [18:52:03] right now it's filtering on x-cs != '-' [18:52:17] and that's ok for the carrier obviously [18:52:17] https://mingle.corp.wikimedia.org/projects/analytics/cards/244 [18:52:20] but not for the country [18:52:23] yup [18:52:25] ok [18:52:39] so we need a separate pig script that does the same exact thing without the filter [18:52:47] yup [18:52:56] ok, cool [18:53:08] also there is another requirement that I didn't realize until recently [18:53:10] i'll add that as soon as i can [18:53:15] uh oh [18:53:17]