[00:04:52] dr0ptp4kt: yes, just got back
[00:06:25] nuria: hi! i was wondering if the "TABLESAMPLE(BUCKET 1 OUT OF 1000 ON rand())" essentially just picks up every thousandth row from the data set. i ran some long hive jobs today and was thinking that i could maybe get away with sampling...of course being aware of sampling biases depending on the type of query
[00:06:55] tch. I told you to email! :P
[00:07:10] qchris tested teh sampling to make sure it was random
[00:07:15] *the sampling
[00:07:27] dr0ptp4kt: and it is (not all samplings are)
[00:07:34] what nuria said
[00:07:44] so, for example, TABLESAMPLE(N ROWS) grabs N rows from each partition
[00:08:02] ...which, of course, is not /exactly/ random.
[00:09:31] dr0ptp4kt: with amount of data we have 1/1000 seemed adequate for mobile
[00:10:14] dr0ptp4kt: zero traffic should be smaller but even if it is one order of magnitude smaller, 1/1000 will give you ~2 million records for last month
[00:10:42] dr0ptp4kt: makes sense?
[00:13:54] nuria and Ironholds, cool. makes sense. the stuff i was doing today had to do with the global homepage. i did some other queries on the mdot global homepage. on mdot redirects to en.m.wikipedia.org likely aren't the right thing to do - so we're going to try first in wikipedia zero land to see how Accept-Language sensitive redirects would cause fluctuations in pageviews. and depending on that, we may be interested in exploring mdot
[00:13:55] redirects in general that don't fall into the wikipedia zero bucket. as for the global homepage at www.wikipedia.org, it's more complicated. the top 20 primary langs of end users as determined through a dumb check on the Accept-Language header seems to suggest that most of the languages are already above the fold...i'll send you the link.
[00:14:23] Analytics / EventLogging: VE instrumentation is not showing up in databases - https://bugzilla.wikimedia.org/72173#c4 (nuria) Roan.. can you clarify a bit your last comment? me no comprendo.
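On the sampling question above, a rough Python simulation may help show why a random 1-in-1000 bucket is not the same as "every thousandth row", and why taking the first N rows of each partition is not random. This is only a sketch of the idea, not Hive's implementation: the function names `bucket_sample` and `first_n_per_partition` are hypothetical, and the random bucket assignment is an assumption standing in for what `ON rand()` does.

```python
import random


def bucket_sample(rows, buckets=1000, pick=1, seed=None):
    """Keep each row whose randomly drawn bucket equals `pick` (1-based).

    Assumption: every row gets an independent, uniformly random bucket,
    which approximates sampling ON rand().
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.randrange(buckets) == pick - 1]


def first_n_per_partition(partitions, n):
    """Take the first n rows of every partition, in partition order."""
    return [row for part in partitions for row in part[:n]]


if __name__ == "__main__":
    rows = list(range(200_000))
    sample = bucket_sample(rows, buckets=1000, pick=1, seed=42)
    # About 1/1000 of the rows, at irregular positions.
    print(len(sample), sample[:5])

    partitions = [list(range(0, 10)), list(range(100, 110))]
    # Always the head of each partition, so ordering bias carries over.
    print(first_n_per_partition(partitions, 3))
```

The sampled fraction comes out close to 1/1000, but the gaps between selected rows vary, so it is not a fixed stride through the data. The per-partition variant always returns the same leading rows, which matches the remark that it is "not /exactly/ random".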
[00:15:33] dr0ptp4kt: sounds good. remember there is only 30 days of data so you want to keep around results of your reports
[00:15:54] nuria: yeah, good reminder - thanks!
[00:17:38] nuria: i also have some familiarity with the collected sets type notation in the hive manuals, which i was using earlier, but the subqueries seemed easier and perhaps just as computationally efficient. don't laugh too hard when you look at the queries!
[01:36:56] (PS1) Nuria: i[WIP] Improving retrieval of user names on cvs report [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/167356 (https://bugzilla.wikimedia.org/71255)
[01:38:35] (PS2) Nuria: [WIP] Improving retrieval of user names on cvs report [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/167356 (https://bugzilla.wikimedia.org/71255)
[12:30:23] Analytics / Tech community metrics: Contributor pages without data should include an explanation - https://bugzilla.wikimedia.org/56111#c10 (Helder) Same problem with my new user account: http://korma.wmflabs.org/browser/people.html?id=3208538&name=He7d3r http://korma.wmflabs.org/browser/people.html?id...