[00:47:53] (PS1) Milimetric: survivor cleanup 2 [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/87300 [00:49:31] (CR) Milimetric: [C: 2 V: 2] survivor cleanup 2 [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/87300 (owner: Milimetric) [01:01:01] yooo milimetric [01:01:07] wassabi [01:01:09] yoyoyooy [01:01:12] oh man that's old :) [01:01:15] haha [01:01:18] so [01:01:20] sup [01:01:31] i think the order of entries in the kafka camus imports is going to be way off [01:01:39] gotcha [01:01:40] not sure if you were doing this, but we'll definitely need to sort first [01:01:44] in the query [01:01:45] so you found the missing numbers [01:01:52] but they're not in order? [01:01:56] yeah so, they weren't missing, just way out of order [01:01:57] i was sorting [01:02:01] but wasn't looking at the full stream [01:02:06] k [01:02:10] i thought i was looking at enough [01:02:13] well, why would we need to sort? [01:02:15] That error means it was successful [01:02:22] oh i guess not, hmmmm [01:02:22] right [01:02:23] won't hive magically figure out all that [01:02:38] because you just say where exists seq + 1, or seq −1 [01:02:39] kinda, [01:02:40] right/ [01:02:41] i mean, I'm more curious to see how it handles the naive approach [01:02:51] yeah, i'm joining on it now in my new version [01:02:53] but yeah, basically [01:04:09] i mean ideally it would just run through, create an index of that column, then join on it, then spit out the results [01:04:19] that's what any normal DB would do [01:04:28] ayek [01:04:32] we will see, eh? [01:05:50] ok, i thikn we can do some at least prelim testing with at least 10 mins of data [01:05:58] i'm going to let this run for 10 mins and then run camus and create some tables [01:23:53] ok milimetric! let's try your query! [01:23:57] 210006 rows [01:24:04] not that many [01:24:05] only a few mins [01:24:07] no that's small [01:24:08] but we can at least try [01:24:16] one sec, i'll paste it into an etherpad? [01:24:19] ok [01:24:31] this is with only one partition [01:24:47] we'd ahve to wait another 36 mins before we get the next one [01:25:03] here i'll run camus again anyway [01:25:07] http://etherpad.wikimedia.org/p/swhYPuS04T [01:25:07] I'm not familiar with it so I didn't fix it in case I made it worse [01:25:42] lol, there's a w-h-y in the etherpad address and it triggered CommanderData [01:25:48] drdee, your baby's naughty ^^ [01:25:51] hah [01:27:22] its ruunnning [01:28:15] lots of different jobs [01:29:46] ok milimetric got quite a few results [01:30:09] 1704 [01:30:11] results [01:30:30] yeah, one bad thing is it's not telling you how many there are missing at each spot [01:30:39] but that's cool that it at least ran [01:30:42] and didn't choke [01:30:58] so I'd just take the first result and grep for that +1 [01:31:17] so, what are the results? [01:31:38] you said they should be the sequence numbers after which or before which there are missing sequence numbers [01:31:40] and so they are [01:31:46] a seq where one of the bordering numbers is not contiguous? [01:31:47] (hopefully if I didn't screw up the query) [01:31:52] yes [01:32:07] so for example [01:32:13] 764075 [01:32:13] 764077 [01:32:19] hm, but in the results i should never get contingous numbers, right? [01:32:22] that means 764076 isn't in there [01:32:31] right, that would be weird [01:32:32] oh no, because before it might be [01:32:36] 770168 [01:32:36] 770171 [01:32:36] 770172 [01:32:36] 770176 [01:32:38] for example [01:32:42] 770170 is missing [01:32:46] so is 770173 [01:32:49] so 171 and 172 [01:32:50] ok [01:33:50] hm, so ideally the query would return two columns like this: [01:35:06] I laid it out in the etherpad [01:35:19] with that format, we could compute what exactly is missing and how many are missing [01:35:34] but first let's see if the query is even close to right and if it performs reasonably well [01:35:54] but i mean 200k rows should fly in any DB, we gotta try this with tens of millions [01:46:57] I'm going to sign off for the night ottomata [01:47:10] ok [01:47:13] I updated the etherpad with the idea to improve the query [01:47:18] ok cool [01:47:18] tahnks [01:47:19] let me know if you run into problems [01:47:21] i'm about to sign off too [01:47:29] cool, good night [01:47:32] thanks man, ttyl [01:47:34] nighters [01:54:43] * drdee is reading up [03:49:13] grr, the reason survival was slow was because someone else was pounding enwiki_p [03:49:20] how dare they use a public database! [03:49:23] :) [03:50:05] so I increased the timeout for celery tasks to 1 hour, but we should at some point look into implementing "kill task" [13:51:36] milimetric: i thikn I have good news! [13:51:46] now that we have complete hours imported [13:51:50] i just checked one of them [13:51:55] drums please. ... [13:51:58] and there were 0 missing seqs returned from your query [13:52:02] from esams [13:52:03] so that is good! [13:52:11] that's great news! [13:52:18] i think the missing ones we saw last night were an extension of the consumers consuming out of order [13:52:20] which makes sense [13:52:31] which, kind of makes me want to use kafka partitions [13:52:33] keys [13:52:38] partition key on machien hostname [13:54:04] yay yay, thanks ottomata and milimetric for working on this! [13:55:11] so, i'm going to do a query witih all the complete hours [14:13:40] ok, in a 7 hour period, it looks like i'm missing a single message [14:13:49] from esams [14:13:50] not bad [14:14:04] which it was 0 [14:16:41] (PS1) Milimetric: tab and navigation improvements [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/87336 [14:16:50] (CR) Milimetric: [C: 2 V: 2] tab and navigation improvements [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/87336 (owner: Milimetric) [14:23:34] (PS1) Milimetric: increase refresh speed [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/87338 [14:23:42] (CR) Milimetric: [C: 2 V: 2] increase refresh speed [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/87338 (owner: Milimetric) [14:26:23] oh cool ottomata, sorry just read the above [14:26:37] so you think the query works though? [14:26:46] like the one you found missing was actually missing? [14:27:00] and was it a dog to run? [14:27:09] brb, getting some breakfast [14:30:10] i'm running a bigger one now [14:30:18] i'll tell you how much data and how long in a sec [14:34:41] ok great, just checked eqiad data for 02 − 12 hours [14:34:43] 10 hours of data [14:34:45] 0 missing seqs [14:35:06] took 15 minutes to execute that query [14:35:25] 4 maps, one reduce [14:41:38] milimetric: , check bottom of etherpad [14:41:40] http://etherpad.wikimedia.org/p/swhYPuS04T [14:41:41] I'm not familiar with it so I didn't fix it in case I made it worse [14:41:45] your query [14:41:47] 121281266 records [14:41:47] 73.9G [14:41:47] 891.879 seconds 15 minutes. [14:41:52] 891.879 seconds / 15 minutes. [14:42:15] hm... not terrible.. not great [14:42:28] we can fine-tune it later :) [14:42:42] impotent thing is that it seems to work! [14:42:54] but doesit actually work? [14:43:14] i think so! [14:43:28] you can see how I edited it [14:43:31] did I do it wrong? [14:43:35] i added L.HOUR BETWEEN 2 and 12 [14:43:37] for each subquery [14:44:34] oops, i'm selecting min and max in both minmax subqueries [14:44:36] lemme fix that [14:45:27] how come hour has to be between 2 and 12? [14:45:42] it doesn't have to be, but i wanted to make sure I didn't get any border cases, i wanted to select only 100% complete hours [14:45:48] 1 and 13 are on the ends of what we have right now [14:45:58] oh gotcha [14:45:59] cool [14:46:06] and it returned nothing this last run? [14:46:12] right [14:46:20] i've got two tables right now [14:46:26] webrequest_eqiad0 and webrequest_esams0 [14:46:31] i did between 2 and 7 for esams0 [14:46:34] it returned a single seq [14:46:38] well, 2 numbers from your query [14:46:42] the seqs bordering the missing one [14:46:52] and did you check that it was actually missing? [14:47:10] no hm, will do [14:47:10] like select * from ... where sequence = that_one? [14:47:20] yeah, 'im going to select where in those 3 seqs [14:47:38] the annoying thing is that it doesn't let us do cross joins [14:47:58] Ah, i don't think I saved the seq :/ [14:47:58] otherwise i'd be able to output the better version I wrote under my first query [14:48:17] fix your query with the minmax thing? and i'll run it again on esams [14:48:18] :) ok, well, it did work against my test mysql table [14:48:25] i fixed your version [14:50:29] ok [14:50:37] the one at the bottom? [14:55:25] yes, ottomata, the one at the bottom, it's no longer selecting both min and max in both MaxSeq and MinSeq subqueries [14:55:35] ok great, running it now on esams [14:55:39] with between 2 and 12 [15:06:19] oof milimetric, i have more missing seqs this time from esams [15:06:33] so let's check a couple of them [15:06:44] and if they're really missing, then silver lining - the script works [15:06:45] :) [15:07:24] here are a few of the ending ones [15:07:24] 109613334 [15:07:24] 109613354 [15:07:25] 109613359 [15:07:25] 109613464 [15:07:25] 109613474 [15:07:25] 109613549 [15:09:38] ooh, that seems to be quite a bit [15:10:00] there are a lot more too [15:10:02] so maybe do a select * from ... where sequence between 109613335 and 109613353 [15:10:12] yeah [15:10:28] so, the fact that most of the missing ones here are approximately at the same time [15:10:35] makes me think they are not actually missing, just not consumed yet [15:10:39] right [15:10:43] interesting [15:10:48] if I keyed on hostname [15:10:51] which i think i should do [15:10:58] then all requests from a single host would go to a single kafka partition [15:11:01] and then guarunteed to be in order [15:11:31] i'm going to run some tests on these numbers, and then restart varnishkafka in new topics partitioned by hostname [15:11:44] as long as the traffic from the hosts we are consuming from is approximately equal [15:11:47] it won't matter [15:11:48] i'm gonna check if there's a way to get you a better output [15:11:56] i just need a row_number function [15:12:11] hive (test)> select hostname, sequence, dt from webrequest_esams0 where year = 2013 and hour = 12 and sequence between 109613335 and 109613353; [15:12:15] hostname sequence dt [15:12:15] cp3003.esams.wikimedia.org 109613340 2013-10-03T12:51:53 [15:12:15] cp3003.esams.wikimedia.org 109613347 2013-10-03T12:51:53 [15:12:15] cp3003.esams.wikimedia.org 109613353 2013-10-03T12:51:53 [15:13:13] hmm, that doesn't look write milimetric [15:13:19] if there are actually records there, right? [15:13:25] 109613334 [15:13:25] 109613354 [15:13:27] is what we have in the output [15:13:39] hmmm [15:13:39] no [15:13:42] hm [15:13:49] yeah [15:13:52] ? [15:14:04] you're confusing me [15:14:10] haha you're confusing me! [15:14:11] but yes, those numbers shouldn't be in there [15:14:21] so the query failed imo [15:14:24] right* [15:14:25] unless i'm missing something [15:14:26] hm [15:14:55] ok, i'm going to consume with camus again [15:14:57] and rerun this [15:15:00] oh, did you send me consecutive pairs, because it might be from ...54 to ...59 instead of ...34 to ...54 [15:15:00] and see what pos out [15:15:12] i sent you what your query pops out [15:15:14] qchris around? [15:15:20] the last few lines of it [15:15:30] oh that's the wrong test then [15:15:35] ? [15:16:00] it'll output pairs of sequence numbers [15:16:00] that border missing ones [15:16:00] right [15:16:14] so it's important to take them as pairs starting with the first one [15:16:19] hm [15:16:20] ok [15:16:21] we just picked two random ones in the middle if what you pasted were the last ones [15:16:27] so this is a better test (one sec) [15:16:49] where sequence between 109613475 and 109613548 [15:16:53] here are the first 3 pairs [15:16:55] in the list [15:16:55] if those are actually the _very_ last 2 [15:16:56] 91965125 [15:16:56] 109579540 [15:16:56] 109579550 [15:16:56] 109579555 [15:16:56] 109579560 [15:16:56] 109579575 [15:17:00] oh :) [15:17:03] those were the last 2 [15:17:16] where sequence between 109579561 and 109579574 [15:17:21] do that one ^ [15:18:05] woa and it seems to think there were a lot missing in the first pair [15:18:18] yeah [15:19:19] hostname sequence dt [15:19:20] cp3003.esams.wikimedia.org 109579573 2013-10-03T12:51:46 [15:19:20] cp3003.esams.wikimedia.org 109579568 2013-10-03T12:51:46 [15:19:20] cp3003.esams.wikimedia.org 109579569 2013-10-03T12:51:46 [15:19:20] cp3003.esams.wikimedia.org 109579571 2013-10-03T12:51:46 [15:19:20] cp3003.esams.wikimedia.org 109579572 2013-10-03T12:51:46 [15:19:20] cp3003.esams.wikimedia.org 109579561 2013-10-03T12:51:46 [15:19:21] cp3003.esams.wikimedia.org 109579566 2013-10-03T12:51:46 [15:19:21] cp3003.esams.wikimedia.org 109579562 2013-10-03T12:51:46 [15:19:22] cp3003.esams.wikimedia.org 109579564 2013-10-03T12:51:46 [15:19:22] cp3003.esams.wikimedia.org 109579565 2013-10-03T12:51:46 [15:19:23] cp3003.esams.wikimedia.org 109579567 2013-10-03T12:51:46 [15:19:23] cp3003.esams.wikimedia.org 109579570 2013-10-03T12:51:46 [15:19:36] gonna order by [15:20:29] 109579561 [15:20:29] 109579562 [15:20:29] 109579564 [15:20:29] 109579565 [15:20:29] 109579566 [15:20:30] 109579567 [15:20:30] 109579568 [15:20:31] 109579569 [15:20:31] 109579570 [15:20:32] 109579571 [15:20:32] 109579572 [15:20:33] 109579573 [15:20:34] sort -n was faster :p [15:20:57] ok, i'm going to run another camus import and see how it looks [15:37:00] ottoman is this page accurate: https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Access ? [15:41:14] yes, i updated it recently [15:41:45] milimetric: does that mean your query isn't working though? [15:41:48] those results? [15:41:52] it looks like just 63 is missing [15:41:57] ty [15:42:21] sec, in talk with toby [15:43:12] k [15:48:06] oh and awesome, milimetric, after I consumed with kafka again, the missing seq (63) in that range is there [15:48:19] camus* [15:48:26] running your full query again [15:48:29] just to see [16:01:55] ok, after the latest camus run [16:02:03] fewer results from your query, milimetric [16:02:08] 91965125 [16:02:08] 103290991 [16:02:09] 91965123 [16:02:09] 1856454 [16:02:09] 103290989 [16:02:09] that's it [16:02:18] also, that's 5 numbers :p so not even pairs [16:04:00] ok, i'm now producing from kafka partitioning on hostname [16:04:14] so, i betcha we'll see this behavior less now [16:04:27] will wait a bit to collect more data [16:12:54] ottomata: how can I run test queries, I want to fix the script because I can't tell whether it's broken or not [16:13:14] brb [16:14:00] —auxpath has been treating me weird [16:14:00] so [16:14:03] hive [16:14:16] add jar /home/otto/hive-serdes-1.0-SNAPSHOT.jar; [16:14:19] use test; [16:14:21] (i'm using the test db) [16:14:24] the tables are there [16:14:48] (Abandoned) Milimetric: Add .gitreview [analytics/limn] - https://gerrit.wikimedia.org/r/86169 (owner: QChris) [16:15:04] (Abandoned) Milimetric: Add .gitreview [analytics/limn] (develop) - https://gerrit.wikimedia.org/r/86170 (owner: QChris) [16:17:06] (CR) Milimetric: [C: 2 V: 2] Remove "mobile" part [analytics/global-dev/dashboard] - https://gerrit.wikimedia.org/r/84310 (owner: QChris) [16:17:50] (CR) Milimetric: [C: 2 V: 2] Remove "geowiki" part [analytics/global-dev/dashboard] - https://gerrit.wikimedia.org/r/84311 (owner: QChris) [16:19:16] (CR) Milimetric: [C: 2 V: 2] Move editor fraction computations to geowiki [analytics/global-dev/dashboard] - https://gerrit.wikimedia.org/r/84312 (owner: QChris) [16:20:04] (CR) Milimetric: [C: 2 V: 2] Add argument parsing skeleton for monitoring script [analytics/geowiki] - https://gerrit.wikimedia.org/r/85605 (owner: QChris) [16:22:56] (CR) Milimetric: [C: 2 V: 2] "I don't really understand script code so my merges on these change sets are done under the assumption that the scripts work. I'm just loo" [analytics/geowiki] - https://gerrit.wikimedia.org/r/85606 (owner: QChris) [16:25:05] (CR) Milimetric: [C: 2 V: 2] Rename DEBUG to USE_CACHE for monitoring [analytics/geowiki] - https://gerrit.wikimedia.org/r/85607 (owner: QChris) [16:25:39] (CR) Milimetric: [C: 2 V: 2] Add caching option to monitoring script [analytics/geowiki] - https://gerrit.wikimedia.org/r/85608 (owner: QChris) [16:27:28] ottomata: FAILED: Error in metadata: ERROR: The database test does not exist. [16:27:42] where's the test database? [16:27:58] i've got "database_name" and "default" [16:29:04] (CR) Milimetric: [C: 2 V: 2] Add verbosity option to show downloaded urls [analytics/geowiki] - https://gerrit.wikimedia.org/r/85609 (owner: QChris) [16:29:13] hm [16:29:20] show databases; [16:29:33] milimetric: ^ [16:30:07] yeah, I did, it gives me the two I listed above [16:30:26] you're on kraken-namenode standby? [16:30:28] in labs? [16:32:37] no [16:32:38] in prod [16:32:41] :) [16:32:47] of course [16:32:49] use analytics1011.eqiad.wmnet [16:32:50] it's real data [16:32:51] for now [16:32:54] ja [17:02:34] scrum guys? [17:02:42] milmetric [17:02:57] milimetric: ^^ [17:03:06] https://plus.google.com/hangouts/_/bece4b70939f46134efa00bcb4dd4449bc197785 [17:53:26] milimetric around? [17:54:11] hey ottomata, I think the query works [17:54:16] but I'm confused about your results from earlier [17:54:33] so we should really test with a controlled file [17:54:47] like, make up a fake file in hour 13 or something [17:54:50] and run it against that [17:55:50] I updated the top of the etherpad with two versions we could run [17:56:17] the "find the missing sequences" version is the first one [17:56:20] that took about 9 minutes [17:56:37] the "just figure out whether there ARE any missing sequences" version is the second one [17:56:41] that took about 5 minutes [17:56:56] the second one returns just a count and if it's > 2, then there are missing sequences [17:57:16] the first one shows you the ranges of missing sequences in a nice table along with a count of how many are missing. [18:05:07] we're ready for the wikimetrics demo? [18:05:28] i think we are; dan is in the hangout [18:05:37] Jaime is here [18:05:41] cool [18:05:46] then we are good to go [18:16:35] oo milimetric so fancy [18:17:00] hm, i don't see the too versions, but that's ok? [18:19:20] yay -- nice work guys [18:19:23] :) [18:24:08] ottomata: ugh, etherpad ate my code [18:24:12] i'll make a gist [18:25:00] k [18:25:15] i'm importing from more hostname partitioned stream from esams now [18:25:25] we can run queries on it when it is done [18:25:31] also, yeah we should get a standard set of data [18:25:32] hm [18:25:33] to test on [18:25:34] hm [18:25:56] i will consume from kafka and just create a file that should ahve all seqs [18:26:03] and then I will remove some random lines from the file [18:26:06] and just upload those to hdfs [18:29:12] ottomata: https://gist.github.com/milimetric/6814636 [18:41:18] awesome, thanks milimetric [18:41:44] does the second way print an accurate count of missing seqs? or just a number that if > 2 means that missing seqs exist? [18:41:44] yeah if those queries don't work I have to eat my shorts :) [18:41:47] no [18:42:04] (n - 2)/2 is the number of *runs* of missing sequences [18:42:41]