[00:47:53] (PS1) Milimetric: survivor cleanup 2 [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/87300
[00:49:31] (CR) Milimetric: [C: 2 V: 2] survivor cleanup 2 [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/87300 (owner: Milimetric)
[01:01:01] yooo milimetric
[01:01:07] wassabi
[01:01:09] yoyoyooy
[01:01:12] oh man that's old :)
[01:01:15] haha
[01:01:18] so
[01:01:20] sup
[01:01:31] i think the order of entries in the kafka camus imports is going to be way off
[01:01:39] gotcha
[01:01:40] not sure if you were doing this, but we'll definitely need to sort first
[01:01:44] in the query
[01:01:45] so you found the missing numbers
[01:01:52] but they're not in order?
[01:01:56] yeah so, they weren't missing, just way out of order
[01:01:57] i was sorting
[01:02:01] but wasn't looking at the full stream
[01:02:06] k
[01:02:10] i thought i was looking at enough
[01:02:13] well, why would we need to sort?
[01:02:22] oh i guess not, hmmmm
[01:02:22] right
[01:02:23] won't hive magically figure out all that
[01:02:38] because you just say where exists seq + 1, or seq - 1
[01:02:39] kinda,
[01:02:40] right?
[01:02:41] i mean, I'm more curious to see how it handles the naive approach
[01:02:51] yeah, i'm joining on it now in my new version
[01:02:53] but yeah, basically
[01:04:09] i mean ideally it would just run through, create an index of that column, then join on it, then spit out the results
[01:04:19] that's what any normal DB would do
[01:04:28] ayek
[01:04:32] we will see, eh?
[01:05:50] ok, i think we can do at least some prelim testing with at least 10 mins of data
[01:05:58] i'm going to let this run for 10 mins and then run camus and create some tables
[01:23:53] ok milimetric! let's try your query!
[01:23:57] 210006 rows
[01:24:04] not that many
[01:24:05] only a few mins
[01:24:07] no that's small
[01:24:08] but we can at least try
[01:24:16] one sec, i'll paste it into an etherpad?
[01:24:19] ok
[01:24:31] this is with only one partition
[01:24:47] we'd have to wait another 36 mins before we get the next one
[01:25:03] here i'll run camus again anyway
[01:25:07] http://etherpad.wikimedia.org/p/swhYPuS04T
[01:25:42] lol, there's a w-h-y in the etherpad address and it triggered CommanderData
[01:25:48] drdee, your baby's naughty ^^
[01:25:51] hah
[01:27:22] it's running
[01:28:15] lots of different jobs
[01:29:46] ok milimetric got quite a few results
[01:30:09] 1704
[01:30:11] results
[01:30:30] yeah, one bad thing is it's not telling you how many are missing at each spot
[01:30:39] but that's cool that it at least ran
[01:30:42] and didn't choke
[01:30:58] so I'd just take the first result and grep for that +1
[01:31:17] so, what are the results?
[01:31:38] you said they should be the sequence numbers after which or before which there are missing sequence numbers
[01:31:40] and so they are
[01:31:46] a seq where one of the bordering numbers is not contiguous?
[01:31:47] (hopefully if I didn't screw up the query)
[01:31:52] yes
[01:32:07] so for example
[01:32:13] 764075
[01:32:13] 764077
[01:32:19] hm, but in the results i should never get contiguous numbers, right?
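A minimal sketch of the gap-detection idea discussed above ("where exists seq + 1, or seq - 1"), written against the webrequest_esams0 table and sequence column that come up later in the log; this is an illustration of the approach, not the actual etherpad query:

    -- Hedged sketch: flag sequence numbers whose successor (seq + 1) is absent.
    -- In a fully contiguous stream only the very last sequence number matches.
    SELECT a.sequence
    FROM webrequest_esams0 a
    LEFT JOIN webrequest_esams0 b
      ON b.sequence = a.sequence + 1
    WHERE b.sequence IS NULL;
    -- A symmetric check on seq - 1 finds the other side of each gap; the real
    -- query also excludes the overall min/max so the stream edges aren't flagged.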
[01:32:22] that means 764076 isn't in there
[01:32:31] right, that would be weird
[01:32:32] oh no, because before it might be
[01:32:36] 770168
[01:32:36] 770171
[01:32:36] 770172
[01:32:36] 770176
[01:32:38] for example
[01:32:42] 770170 is missing
[01:32:46] so is 770173
[01:32:49] so 171 and 172
[01:32:50] ok
[01:33:50] hm, so ideally the query would return two columns like this:
[01:35:06] I laid it out in the etherpad
[01:35:19] with that format, we could compute what exactly is missing and how many are missing
[01:35:34] but first let's see if the query is even close to right and if it performs reasonably well
[01:35:54] but i mean 200k rows should fly in any DB, we gotta try this with tens of millions
[01:46:57] I'm going to sign off for the night ottomata
[01:47:10] ok
[01:47:13] I updated the etherpad with the idea to improve the query
[01:47:18] ok cool
[01:47:18] thanks
[01:47:19] let me know if you run into problems
[01:47:21] i'm about to sign off too
[01:47:29] cool, good night
[01:47:32] thanks man, ttyl
[01:47:34] nighters
[01:54:43] * drdee is reading up
[03:49:13] grr, the reason survival was slow was because someone else was pounding enwiki_p
[03:49:20] how dare they use a public database!
[03:49:23] :)
[03:50:05] so I increased the timeout for celery tasks to 1 hour, but we should at some point look into implementing "kill task"
[13:51:36] milimetric: i think I have good news!
[13:51:46] now that we have complete hours imported
[13:51:50] i just checked one of them
[13:51:55] drums please. ...
[13:51:58] and there were 0 missing seqs returned from your query
[13:52:02] from esams
[13:52:03] so that is good!
[13:52:11] that's great news!
[13:52:18] i think the missing ones we saw last night were an artifact of the consumers consuming out of order
[13:52:20] which makes sense
[13:52:31] which, kind of makes me want to use kafka partitions
[13:52:33] keys
[13:52:38] partition key on machine hostname
[13:54:04] yay yay, thanks ottomata and milimetric for working on this!
[13:55:11] so, i'm going to do a query with all the complete hours
[14:13:40] ok, in a 7 hour period, it looks like i'm missing a single message
[14:13:49] from esams
[14:13:50] not bad
[14:14:04] wish it was 0
[14:16:41] (PS1) Milimetric: tab and navigation improvements [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/87336
[14:16:50] (CR) Milimetric: [C: 2 V: 2] tab and navigation improvements [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/87336 (owner: Milimetric)
[14:23:34] (PS1) Milimetric: increase refresh speed [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/87338
[14:23:42] (CR) Milimetric: [C: 2 V: 2] increase refresh speed [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/87338 (owner: Milimetric)
[14:26:23] oh cool ottomata, sorry just read the above
[14:26:37] so you think the query works though?
[14:26:46] like the one you found missing was actually missing?
[14:27:00] and was it a dog to run?
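To make the pair interpretation above concrete: if the query emits the two sequence numbers that border each gap, the size of each run of missing messages is just the difference minus one. A hypothetical post-processing step over such pairs (the sequence_gap_pairs table is made up for illustration):

    -- gap_start / gap_end are the bordering sequence numbers from the gap query,
    -- e.g. (770168, 770171) means 770169 and 770170 are missing.
    SELECT gap_start,
           gap_end,
           gap_end - gap_start - 1 AS missing_count
    FROM sequence_gap_pairs;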
[14:27:09] brb, getting some breakfast
[14:30:10] i'm running a bigger one now
[14:30:18] i'll tell you how much data and how long in a sec
[14:34:41] ok great, just checked eqiad data for 02 - 12 hours
[14:34:43] 10 hours of data
[14:34:45] 0 missing seqs
[14:35:06] took 15 minutes to execute that query
[14:35:25] 4 maps, one reduce
[14:41:38] milimetric: check bottom of etherpad
[14:41:40] http://etherpad.wikimedia.org/p/swhYPuS04T
[14:41:45] your query
[14:41:47] 121281266 records
[14:41:47] 73.9G
[14:41:52] 891.879 seconds / 15 minutes.
[14:42:15] hm... not terrible.. not great
[14:42:28] we can fine-tune it later :)
[14:42:42] important thing is that it seems to work!
[14:42:54] but does it actually work?
[14:43:14] i think so!
[14:43:28] you can see how I edited it
[14:43:31] did I do it wrong?
[14:43:35] i added L.HOUR BETWEEN 2 and 12
[14:43:37] for each subquery
[14:44:34] oops, i'm selecting min and max in both minmax subqueries
[14:44:36] lemme fix that
[14:45:27] how come hour has to be between 2 and 12?
[14:45:42] it doesn't have to be, but i wanted to make sure I didn't get any border cases, i wanted to select only 100% complete hours
[14:45:48] 1 and 13 are on the ends of what we have right now
[14:45:58] oh gotcha
[14:45:59] cool
[14:46:06] and it returned nothing this last run?
[14:46:12] right
[14:46:20] i've got two tables right now
[14:46:26] webrequest_eqiad0 and webrequest_esams0
[14:46:31] i did between 2 and 7 for esams0
[14:46:34] it returned a single seq
[14:46:38] well, 2 numbers from your query
[14:46:42] the seqs bordering the missing one
[14:46:52] and did you check that it was actually missing?
[14:47:10] no hm, will do
[14:47:10] like select * from ... where sequence = that_one?
[14:47:20] yeah, i'm going to select where in those 3 seqs
[14:47:38] the annoying thing is that it doesn't let us do cross joins
[14:47:58] Ah, i don't think I saved the seq :/
[14:47:58] otherwise i'd be able to output the better version I wrote under my first query
[14:48:17] fix your query with the minmax thing? and i'll run it again on esams
[14:48:18] :) ok, well, it did work against my test mysql table
[14:48:25] i fixed your version
[14:50:29] ok
[14:50:37] the one at the bottom?
[14:55:25] yes, ottomata, the one at the bottom, it's no longer selecting both min and max in both MaxSeq and MinSeq subqueries
[14:55:35] ok great, running it now on esams
[14:55:39] with between 2 and 12
[15:06:19] oof milimetric, i have more missing seqs this time from esams
[15:06:33] so let's check a couple of them
[15:06:44] and if they're really missing, then silver lining - the script works
[15:06:45] :)
[15:07:24] here are a few of the ending ones
[15:07:24] 109613334
[15:07:24] 109613354
[15:07:25] 109613359
[15:07:25] 109613464
[15:07:25] 109613474
[15:07:25] 109613549
[15:09:38] ooh, that seems to be quite a bit
[15:10:00] there are a lot more too
[15:10:02] so maybe do a select * from ...
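The spot-check being described ("select where in those 3 seqs") would look roughly like this; the sequence numbers and hour below are placeholders, since the actual value wasn't saved:

    -- Verify whether a reported gap is real: select the two bordering sequence
    -- numbers plus the one the query says is missing. Values are illustrative.
    SELECT hostname, sequence, dt
    FROM webrequest_esams0
    WHERE year = 2013 AND hour = 5
      AND sequence IN (12345678, 12345679, 12345680);
    -- Two rows back   => the middle one really is missing.
    -- Three rows back => the query flagged a false positive.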
where sequence between 109613335 and 109613353
[15:10:12] yeah
[15:10:28] so, the fact that most of the missing ones here are approximately at the same time
[15:10:35] makes me think they are not actually missing, just not consumed yet
[15:10:39] right
[15:10:43] interesting
[15:10:48] if I keyed on hostname
[15:10:51] which i think i should do
[15:10:58] then all requests from a single host would go to a single kafka partition
[15:11:01] and then guaranteed to be in order
[15:11:31] i'm going to run some tests on these numbers, and then restart varnishkafka in new topics partitioned by hostname
[15:11:44] as long as the traffic from the hosts we are consuming from is approximately equal
[15:11:47] it won't matter
[15:11:48] i'm gonna check if there's a way to get you a better output
[15:11:56] i just need a row_number function
[15:12:11] hive (test)> select hostname, sequence, dt from webrequest_esams0 where year = 2013 and hour = 12 and sequence between 109613335 and 109613353;
[15:12:15] hostname sequence dt
[15:12:15] cp3003.esams.wikimedia.org 109613340 2013-10-03T12:51:53
[15:12:15] cp3003.esams.wikimedia.org 109613347 2013-10-03T12:51:53
[15:12:15] cp3003.esams.wikimedia.org 109613353 2013-10-03T12:51:53
[15:13:13] hmm, that doesn't look right milimetric
[15:13:19] if there are actually records there, right?
[15:13:25] 109613334
[15:13:25] 109613354
[15:13:27] is what we have in the output
[15:13:39] hmmm
[15:13:39] no
[15:13:42] hm
[15:13:49] yeah
[15:13:52] ?
[15:14:04] you're confusing me
[15:14:10] haha you're confusing me!
[15:14:11] but yes, those numbers shouldn't be in there
[15:14:21] so the query failed imo
[15:14:24] right*
[15:14:25] unless i'm missing something
[15:14:26] hm
[15:14:55] ok, i'm going to consume with camus again
[15:14:57] and rerun this
[15:15:00] oh, did you send me consecutive pairs, because it might be from ...54 to ...59 instead of ...34 to ...54
[15:15:00] and see what pops out
[15:15:12] i sent you what your query pops out
[15:15:14] qchris around?
[15:15:20] the last few lines of it
[15:15:30] oh that's the wrong test then
[15:15:35] ?
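Since camus can consume out of order, a range that looks like a gap right after an import may fill in on the next run; a quick way to re-check a suspect range (a sketch, using the table from the log and the range pasted above):

    SELECT COUNT(*) AS rows_in_range
    FROM webrequest_esams0
    WHERE year = 2013 AND hour = 12
      AND sequence BETWEEN 109613335 AND 109613353;
    -- 0 rows: all 19 sequence numbers in the run really are missing.
    -- anything else: at least some were just consumed late, so re-check after
    -- the next camus run before calling them lost.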
[15:16:00] it'll output pairs of sequence numbers
[15:16:00] that border missing ones
[15:16:00] right
[15:16:14] so it's important to take them as pairs starting with the first one
[15:16:19] hm
[15:16:20] ok
[15:16:21] we just picked two random ones in the middle if what you pasted were the last ones
[15:16:27] so this is a better test (one sec)
[15:16:49] where sequence between 109613475 and 109613548
[15:16:53] here are the first 3 pairs
[15:16:55] in the list
[15:16:55] if those are actually the _very_ last 2
[15:16:56] 91965125
[15:16:56] 109579540
[15:16:56] 109579550
[15:16:56] 109579555
[15:16:56] 109579560
[15:16:56] 109579575
[15:17:00] oh :)
[15:17:03] those were the last 2
[15:17:16] where sequence between 109579561 and 109579574
[15:17:21] do that one ^
[15:18:05] whoa and it seems to think there were a lot missing in the first pair
[15:18:18] yeah
[15:19:19] hostname sequence dt
[15:19:20] cp3003.esams.wikimedia.org 109579573 2013-10-03T12:51:46
[15:19:20] cp3003.esams.wikimedia.org 109579568 2013-10-03T12:51:46
[15:19:20] cp3003.esams.wikimedia.org 109579569 2013-10-03T12:51:46
[15:19:20] cp3003.esams.wikimedia.org 109579571 2013-10-03T12:51:46
[15:19:20] cp3003.esams.wikimedia.org 109579572 2013-10-03T12:51:46
[15:19:20] cp3003.esams.wikimedia.org 109579561 2013-10-03T12:51:46
[15:19:21] cp3003.esams.wikimedia.org 109579566 2013-10-03T12:51:46
[15:19:21] cp3003.esams.wikimedia.org 109579562 2013-10-03T12:51:46
[15:19:22] cp3003.esams.wikimedia.org 109579564 2013-10-03T12:51:46
[15:19:22] cp3003.esams.wikimedia.org 109579565 2013-10-03T12:51:46
[15:19:23] cp3003.esams.wikimedia.org 109579567 2013-10-03T12:51:46
[15:19:23] cp3003.esams.wikimedia.org 109579570 2013-10-03T12:51:46
[15:19:36] gonna order by
[15:20:29] 109579561
[15:20:29] 109579562
[15:20:29] 109579564
[15:20:29] 109579565
[15:20:29] 109579566
[15:20:30] 109579567
[15:20:30] 109579568
[15:20:31] 109579569
[15:20:31] 109579570
[15:20:32] 109579571
[15:20:32] 109579572
[15:20:33] 109579573
[15:20:34] sort -n was faster :p
[15:20:57] ok, i'm going to run another camus import and see how it looks
[15:37:00] ottomata is this page accurate: https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Access ?
[15:41:14] yes, i updated it recently
[15:41:45] milimetric: does that mean your query isn't working though?
[15:41:48] those results?
[15:41:52] it looks like just 63 is missing
[15:41:57] ty
[15:42:21] sec, in talk with toby
[15:43:12] k
[15:48:06] oh and awesome, milimetric, after I consumed with kafka again, the missing seq (63) in that range is there
[15:48:19] camus*
[15:48:26] running your full query again
[15:48:29] just to see
[16:01:55] ok, after the latest camus run
[16:02:03] fewer results from your query, milimetric
[16:02:08] 91965125
[16:02:08] 103290991
[16:02:09] 91965123
[16:02:09] 1856454
[16:02:09] 103290989
[16:02:09] that's it
[16:02:18] also, that's 5 numbers :p so not even pairs
[16:04:00] ok, i'm now producing from kafka partitioning on hostname
[16:04:14] so, i betcha we'll see this behavior less now
[16:04:27] will wait a bit to collect more data
[16:12:54] ottomata: how can I run test queries, I want to fix the script because I can't tell whether it's broken or not
[16:13:14] brb
[16:14:00] --auxpath has been treating me weird
[16:14:00] so
[16:14:03] hive
[16:14:16] add jar /home/otto/hive-serdes-1.0-SNAPSHOT.jar;
[16:14:19] use test;
[16:14:21] (i'm using the test db)
[16:14:24] the tables are there
[16:14:48] (Abandoned) Milimetric: Add .gitreview [analytics/limn] - https://gerrit.wikimedia.org/r/86169 (owner: QChris)
[16:15:04] (Abandoned) Milimetric: Add .gitreview [analytics/limn] (develop) - https://gerrit.wikimedia.org/r/86170 (owner: QChris)
[16:17:06] (CR) Milimetric: [C: 2 V: 2] Remove "mobile" part [analytics/global-dev/dashboard] - https://gerrit.wikimedia.org/r/84310 (owner: QChris)
[16:17:50] (CR) Milimetric: [C: 2 V: 2] Remove "geowiki" part [analytics/global-dev/dashboard] - https://gerrit.wikimedia.org/r/84311 (owner: QChris)
[16:19:16] (CR) Milimetric: [C: 2 V: 2] Move editor fraction computations to geowiki [analytics/global-dev/dashboard] - https://gerrit.wikimedia.org/r/84312 (owner: QChris)
[16:20:04] (CR) Milimetric: [C: 2 V: 2] Add argument parsing skeleton for monitoring script [analytics/geowiki] - https://gerrit.wikimedia.org/r/85605 (owner: QChris)
[16:22:56] (CR) Milimetric: [C: 2 V: 2] "I don't really understand script code so my merges on these change sets are done under the assumption that the scripts work. I'm just loo" [analytics/geowiki] - https://gerrit.wikimedia.org/r/85606 (owner: QChris)
[16:25:05] (CR) Milimetric: [C: 2 V: 2] Rename DEBUG to USE_CACHE for monitoring [analytics/geowiki] - https://gerrit.wikimedia.org/r/85607 (owner: QChris)
[16:25:39] (CR) Milimetric: [C: 2 V: 2] Add caching option to monitoring script [analytics/geowiki] - https://gerrit.wikimedia.org/r/85608 (owner: QChris)
[16:27:28] ottomata: FAILED: Error in metadata: ERROR: The database test does not exist.
[16:27:42] where's the test database?
[16:27:58] i've got "database_name" and "default"
[16:29:04] (CR) Milimetric: [C: 2 V: 2] Add verbosity option to show downloaded urls [analytics/geowiki] - https://gerrit.wikimedia.org/r/85609 (owner: QChris)
[16:29:13] hm
[16:29:20] show databases;
[16:29:33] milimetric: ^
[16:30:07] yeah, I did, it gives me the two I listed above
[16:30:26] you're on kraken-namenode standby?
[16:30:28] in labs?
[16:32:37] no
[16:32:38] in prod
[16:32:41] :)
[16:32:47] of course
[16:32:49] use analytics1011.eqiad.wmnet
[16:32:50] it's real data
[16:32:51] for now
[16:32:54] ja
[17:02:34] scrum guys?
[17:02:42] milimetric
[17:02:57] milimetric: ^^
[17:03:06] https://plus.google.com/hangouts/_/bece4b70939f46134efa00bcb4dd4449bc197785
[17:53:26] milimetric around?
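Pulling the setup steps above together, a session on analytics1011.eqiad.wmnet would look something like this (the jar path and database name are taken from the log; the final select is just a smoke test against an assumed table):

    -- inside the hive CLI
    add jar /home/otto/hive-serdes-1.0-SNAPSHOT.jar;
    show databases;
    use test;
    show tables;
    select hostname, sequence, dt from webrequest_esams0 limit 10;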
[17:54:11] hey ottomata, I think the query works
[17:54:16] but I'm confused about your results from earlier
[17:54:33] so we should really test with a controlled file
[17:54:47] like, make up a fake file in hour 13 or something
[17:54:50] and run it against that
[17:55:50] I updated the top of the etherpad with two versions we could run
[17:56:17] the "find the missing sequences" version is the first one
[17:56:20] that took about 9 minutes
[17:56:37] the "just figure out whether there ARE any missing sequences" version is the second one
[17:56:41] that took about 5 minutes
[17:56:56] the second one returns just a count and if it's > 2, then there are missing sequences
[17:57:16] the first one shows you the ranges of missing sequences in a nice table along with a count of how many are missing.
[18:05:07] we're ready for the wikimetrics demo?
[18:05:28] i think we are; dan is in the hangout
[18:05:37] Jaime is here
[18:05:41] cool
[18:05:46] then we are good to go
[18:16:35] oo milimetric so fancy
[18:17:00] hm, i don't see the two versions, but that's ok?
[18:19:20] yay -- nice work guys
[18:19:23] :)
[18:24:08] ottomata: ugh, etherpad ate my code
[18:24:12] i'll make a gist
[18:25:00] k
[18:25:15] i'm importing more from the hostname partitioned stream from esams now
[18:25:25] we can run queries on it when it is done
[18:25:31] also, yeah we should get a standard set of data
[18:25:32] hm
[18:25:33] to test on
[18:25:34] hm
[18:25:56] i will consume from kafka and just create a file that should have all seqs
[18:26:03] and then I will remove some random lines from the file
[18:26:06] and just upload those to hdfs
[18:29:12] ottomata: https://gist.github.com/milimetric/6814636
[18:41:18] awesome, thanks milimetric
[18:41:44] does the second way print an accurate count of missing seqs? or just a number that if > 2 means that missing seqs exist?
[18:41:44] yeah if those queries don't work I have to eat my shorts :)
[18:41:47] no
[18:42:04] (n - 2)/2 is the number of *runs* of missing sequences
[18:42:41] oo UDFRowSequence!
[18:42:44] yea
[18:42:46] that's useful
[18:42:55] and cool that it's easily available
[19:19:14] ok milimetric running your queries on some standard data
[19:19:25] 100000 rows in a good table, no missing seqs
[19:19:33] 90000 rows in bad table, 10000 missing seqs
[19:19:40] good passed your count query
[19:19:42] returned 2
[19:19:48] running your count on bad now
[19:20:25] 17045
[19:22:16] seems good ottomata, except very weird that the count is odd
[19:22:30] maybe some of the runs start or end at the edges
[19:22:37] yeah maybe?
[19:22:40] but you can run the non-count one
[19:22:46] it isn't sorted
[19:22:50] yeah i'm running that on the good data now
[19:22:58] but sorting shouldn't matter, right?
[19:23:00] and the last column in the result should add up to 1000
[19:23:06] or 10000 rather
[19:23:10] sorting shouldn't, no
[19:23:11] great, 0 results on the good data
[19:23:12] running on bad now
[19:23:36] oh! other caveat - in order to make the first query work in "strict" mode I had to use limit
[19:23:56] so I just limited to 1 trillion
[19:24:02] ha, i see, nice
[19:24:06] that's fine
[19:24:15] yeah, it'll never bump into that limit (I hope!)
[19:27:30] coool milimetric
[19:27:31] cat bad.missing.txt | awk '{s+=$3} END {print s}'
[19:27:31] 10000
[19:27:41] looks good!
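For reference, a rough sketch of the count-only check described above (the real queries live in the linked gist; the table name and join shape here are assumptions): count the rows whose predecessor or successor is absent. With no gaps that count is exactly 2 (the overall min and max), and each run of missing sequence numbers adds 2 more, which is where (n - 2)/2 comes from.

    SELECT COUNT(*) AS border_count
    FROM (
      SELECT a.sequence
      FROM webrequest_good a
      LEFT JOIN webrequest_good p ON p.sequence = a.sequence - 1
      LEFT JOIN webrequest_good n ON n.sequence = a.sequence + 1
      WHERE p.sequence IS NULL
         OR n.sequence IS NULL
    ) borders;
    -- border_count = 2          -> no gaps
    -- runs of missing sequences = (border_count - 2) / 2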
[19:27:44] great, ok
[19:31:29] nice
[19:31:35] psh, hive is great
[19:31:56] yay -- hive is great
[19:31:59] I'm glad you like it
[19:32:06] ;)
[19:32:20] ok, running the count query on 2 hours of recent hostname partitioned esams input
[19:32:36] milimetric: if we use these queries to validate imports in production
[19:32:43] we're going to have to add a filter or group by hostname
[19:33:03] gotcha
[19:33:16] yeah, that should be fine
[19:33:27] i've gotta go eat now
[19:33:29] ok cool
[19:33:37] but we can sync up on this tomorrow at the latest
[19:33:45] #bigdataiscoming
[19:34:45] later everyone, I've gotta go not starve
[19:37:56] laaata
[19:39:18] what's the new active editors URL?
[21:05:17] ottomata: one possible way that query could mess up - if map reduce screws with order by or rowSequence() but from my tests on the full data it doesn't seem affected
[21:05:39] were you able to add group by hostname?
[21:05:52] or should I give it a shot (just tell me what table I should hit)
[21:06:23] yeah give it a shot!
[21:06:33] um, let me get you one, i was about to combine topics
[21:06:38] was just writing this up
[21:06:38] https://wikitech.wikimedia.org/wiki/Analytics/Kraken/ETL
[21:07:10] hm, so I ran your query on more recent hostname partitioned data
[21:07:16] looks like i'm missing two seqs!
[21:07:18] dunno why
[21:07:31] https://gist.github.com/ottomata/6816039#file-gistfile1-txt
[21:08:34] milimetric: i'll get you a standard data file with 2 hosts in it
[21:08:35] gimme just a few
[21:13:27] ok milimetric, webrequest_good and webrequest_bad
[21:13:35] those tables now have 2 hosts in them
[21:13:50] webrequest_good has 1000000 messages, and webrequest_bad has 90000, 10K randomly removed
[21:14:07] 100000 / 90000
[21:24:08] ottomata: the sequence numbers are sourced from vk (varnishkafka)?
[21:26:53] yes
[21:27:07] it is a capture from kafka
[21:27:28] ottomata: with multiple vk instances, how will you be using the seq?
[21:28:27] together with the hostname
[21:28:34] ah, neat
[21:29:10] ja milimetric is going to either filter or group by with his fancy queries
[21:30:19] k, updating fancy queries now :)
[21:33:41] ottomata: so hostname is the column I'm concerned with, right?
[21:34:49] yes
[21:36:33] hm, this definitely kills these queries
[21:38:39] heheh
[21:39:26] milimetric: one sec, i need to rename the location of those standard tables I gave you
[21:39:42] are you running a query right now?
[21:39:55] nope
[21:39:57] k
[21:40:14] just thinking about the massive overhead that grouping this by hostname will cause :)
[21:40:24] ok they're back
[21:40:32] i'm ok with filtering by a hostname, if you like
[21:40:42] adding that to the where
[21:42:22] nvm, that's dumb :)
[21:42:29] not your idea
[21:42:32] my ponderance
[21:42:48] oh ok, btw, the right way to do this is to keep an index of just the sequence numbers somewhere
[21:42:55] and run these queries on it periodically
[21:44:00] just the sequence numbers with the hostnames of course
[21:49:04] ooook..?
[21:49:22] like, extract them out elsewhere as part of ETL and then do the queries on those?
[21:49:34] or does hive have a different indexing mechanism you mean?
[21:49:39] milimetric: ?
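One way to get both the per-hostname grouping and the "nice table" output is window functions (available from Hive 0.11): number the rows per host in sequence order, join each row to the next, and report any jump larger than 1. This is a sketch of that approach, not the query in the gist; the table and column names are assumptions based on the log.

    SELECT cur.hostname,
           cur.sequence                    AS gap_start,
           nxt.sequence                    AS gap_end,
           nxt.sequence - cur.sequence - 1 AS missing_count
    FROM (
      SELECT hostname, sequence,
             ROW_NUMBER() OVER (PARTITION BY hostname ORDER BY sequence) AS rn
      FROM webrequest_esams0
    ) cur
    JOIN (
      SELECT hostname, sequence,
             ROW_NUMBER() OVER (PARTITION BY hostname ORDER BY sequence) AS rn
      FROM webrequest_esams0
    ) nxt
      ON cur.hostname = nxt.hostname
     AND nxt.rn = cur.rn + 1
    WHERE nxt.sequence - cur.sequence > 1;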
[21:50:06] yeah, i meant your first option
[21:50:26] but I was looking at some presentation where you can store the files in ORC (Optimized Row Columnar)
[21:50:31] and that mentioned something about indexing
[21:50:40] I'm totally new at this though, so we'll see
[21:51:08] the way I'd do it is just write the (hostname, sequence) tuples somewhere for super basic raw analytics
[21:51:24] and maybe that's where we can do things like check for the canary events
[21:51:32] (which could have a unique hostname)
[21:51:53] i think that's a great idea, at least for now
[21:51:59] hmmm, yeah!
[21:52:07] because then we can partition on hostname
[21:52:12] we might want to write the timestamp too
[21:52:22] or at least partition by that too
[21:53:35] milimetric: i was just messing with this
[21:53:35] https://github.com/mate1/camus2hive/
[21:53:39] i am totally going to rewrite that thing :p
[21:53:43] maybe in python :p
[21:53:52] the idea is good though
[21:54:55] i'll check in a sec, I think I have a workable query
[21:55:27] yeah no worries, just excited about it and I have to run soon
[21:55:34] also, i'm taking the day off tomorrow
[21:57:47] oh cool, I'll look and write you an email ottomata
[21:57:57] you don't happen to know how to do variables in hive?
[21:58:03] i did set tablename=blah;
[21:58:11] and then tried to "select * from ${tablename}"
[21:58:14] but no luck yet
[21:58:55] ah!
[21:59:04] "select * from ${hiveconf:tablename}"
[21:59:13] hmmm
[21:59:17] interesting
[21:59:18] ok also, btw
[21:59:25] i just set up a hacky cron job to run camus hourly
[21:59:26] if it works
[21:59:34] it should import data from both hosts into the varnish6 directory
[21:59:41] i don't have automatic partition adding yet
[21:59:46] but you can add them if you want to query it
[22:00:00] see here:
[22:00:00] https://wikitech.wikimedia.org/wiki/Analytics/Kraken/ETL#Hive_queries
[22:00:14] for a quick and dirty way to ls the dirs and create hive add partition queries
[22:00:18] hadoop fs -ls -d '/wmf/raw/webrequest/test/data/varnish6/hourly/*/*/*/*' | grep -v Found | awk '{print $NF}' | awk -F '/' '{print "ALTER TABLE webrequest_test0 ADD PARTITION (year=" $9 ",month=" $10 ",day=" $11 ",hour=" $12 ") location \"" $0 "\";"}' > partitions.sql
[22:00:19] sudo -u hdfs hive --auxpath /home/otto/hive-serdes-1.0-SNAPSHOT.jar --database test -f ./partitions.sql
[22:00:33] ooo, i think the table is called varnish6
[22:02:39] oo, and camus is running now :)
[22:04:20] oook, laters
[22:06:00] later ottomata
[22:06:09] latr ottomata
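A sketch of the "(hostname, sequence) tuples" idea above: a small side table, partitioned by hour and stored as ORC (the format mentioned in the discussion), that the gap-check queries can run against cheaply. All table names and partition values here are assumptions for illustration, not an agreed design.

    CREATE TABLE IF NOT EXISTS webrequest_sequences (
      hostname STRING,
      sequence BIGINT,
      dt       STRING
    )
    PARTITIONED BY (year INT, month INT, day INT, hour INT)
    STORED AS ORC;

    -- populate one hour from the raw webrequest table
    INSERT OVERWRITE TABLE webrequest_sequences
      PARTITION (year = 2013, month = 10, day = 3, hour = 12)
    SELECT hostname, sequence, dt
    FROM webrequest_esams0
    WHERE year = 2013 AND month = 10 AND day = 3 AND hour = 12;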