[10:43:52] (CR) Nuria: [WIP] Add cohort class hierarchy, refactor CohortService (9 comments) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/134743 (owner: Milimetric)
[11:46:50] (PS1) Nuria: Upgrading Alembic [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/135015 (https://bugzilla.wikimedia.org/64001)
[12:30:30] (CR) QChris: [C: -1] "Just a few typos and questions. Looks good." (9 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/134377 (owner: Ottomata)
[12:33:24] (CR) QChris: [C: -1] "Just marking that we postpone merging, as the referenced" (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/133519 (owner: Ottomata)
[12:35:37] qchris, good point about the float
[12:35:40] checking double... seems to be better
[12:35:49] got 99.99999666666668 for your test case
[12:35:58] that's with double.
[12:36:02] isn't it?
[12:36:03] yes
[12:36:05] with double i mean
[12:36:13] Ja. Ok.
[12:36:32] These kinds of issues were why I wanted to move away from floating point numbers.
[12:36:43] double will at least buy us more margin.
[12:36:49] decimal type also gives 99.99999666666668
[12:37:04] Decimal types are needed for use cases in which the (very close) approximation of a DOUBLE is insufficient, such as financial applications, equality and inequality checks, and rounding operations.
[12:37:18] The precision of a Decimal type is limited to 38 digits in Hive.
[12:38:21] Decimals are even harder for me :-)
[12:38:40] float and double have a C history.
[12:38:52] But if you prefer decimal, decimal it is.
[12:39:05] nono
[12:39:08] haha, i'm just testing
[12:39:12] Anything with considerably more precision around 100 than float is fine by me.
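[Editor's note: the float-vs-double rounding problem discussed here can be sketched in HiveQL using the literals quoted later in the log. This is a minimal illustration with constant expressions only (newer Hive versions allow SELECT without FROM); the double result is the value quoted in the conversation.]

```sql
-- Sketch: why FLOAT loses the "loss percentage" signal near 100.
-- In Hive, `/` on integers already returns a DOUBLE.
SELECT
    -- FLOAT has ~24 significand bits; a value 3.3e-11 below 100
    -- is far inside float spacing (~7.6e-6 near 100), so it rounds to 100.0:
    CAST((2999999999999 / 3000000000000) * 100 AS FLOAT) AS percent_float,
    -- DOUBLE (~52 significand bits) keeps the difference visible,
    -- giving ~99.99999999996668 per the discussion:
    (2999999999999 / 3000000000000) * 100                AS percent_double;
```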
[12:39:17] i think double is good for our case
[12:39:22] i'm testing with larger values, and we seem fine
[12:40:09] i just went 5 orders of magnitude up, still didn't get rounded to 100
[12:40:23] (2999999999999/3000000000000)*100 == 99.99999999996668
[12:40:29] With double, we should get ~50 bits of significance. That's good for most things.
[12:40:32] aye
[12:43:43] qchris: NULLS?
[12:44:03] MAX(sequence) with NULLs in sequence might bite us.
[12:44:17] if sequence is null?
[12:44:22] yes.
[12:44:26] as I wrote:
[12:44:32] I am not sure how likely that is.
[12:44:47] But then again ... we're preparing against the unknown error.
[12:44:59] null sequence seems very very unlikely, since it is generated by varnishkafka itself, not something coming from the request
[12:44:59] And guarding against NULLs is easy here.
[12:45:05] but, who knows
[12:45:05] bugs
[12:45:09] in camus, vk, whatever could do it
[12:45:25] Like the n/a that we saw for the response time?
[12:45:34] Yes. Totally.
[12:45:49] you are suggesting to add a column to count null sequences?
[12:45:58] Yes.
[12:46:04] cool w me, the more checks the better :)
[12:46:15] The HQL I gave is checked and works for me.
[12:46:18] cool
[12:46:26] so
[12:46:30] (because a simple COUNT(sequence IS NULL) does not work as expected)
[12:46:37] as count_null_sequence?
[12:46:48] count_null_sequences
[12:46:53] ok
[12:47:07] didn't column names start with sequence ... let me check.
[12:47:31] What about "sequence_null_count"?
[12:47:32] oh they did
[12:47:36] but i'm thinking of removing it :p
[12:47:41] :-D
[12:47:46] seems silly now that i'm looking at it, since we are really counting records here,
[12:47:46] so
[12:47:50] count_actual
[12:47:50] count_expected
[12:47:50] etc.
[12:47:54] If the column names change anyway, I do not care about it.
[12:48:23] so, +1 for count(*)
[12:48:29] \o/
[12:48:53] I thought about commenting on the common "sequence_" prefix, but then ... we are focused on sequences only.
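[Editor's note: a minimal sketch of the null-sequence check discussed above; the table name `webrequest` is an assumption. `COUNT(sequence IS NULL)` "does not work as expected" because `COUNT(expr)` only skips rows where `expr` itself is NULL, and the boolean `sequence IS NULL` is never NULL, so it would count every row. Summing an `IF()` gives the intended count.]

```sql
-- Count NULL sequence numbers per varnish host (table name assumed).
SELECT
    hostname,
    SUM(IF(sequence IS NULL, 1, 0)) AS count_null_sequences
FROM webrequest
GROUP BY hostname;
```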
So I did not comment, as
[12:49:03] it does not come out of thin air.
[12:49:19] OO, i like your duplicate count suggestion!
[12:49:20] cool!
[12:49:26] will add that
[12:49:42] i don't understand your MIN/MAX/COUNTs comment
[12:50:01] reusing them
[12:50:04] Hive does not cache the result for MAX.
[12:50:15] So it has to recompute MAX each time we use it.
[12:50:21] were you able to use the alias?
[12:50:48] Work around it by using a UDF that takes MAX as input once, and the UDF generates the required fields.
[12:50:58] oh a UDF!
[12:50:59] Thereby Hive has to compute MAX only once
[12:51:06] we get the same columns.
[12:51:11] awesome
[12:51:13] I am for it
[12:51:15] And the payoff is ~50 calls to a UDF.
[12:51:23] that will be useful for other things too, i'm sure
[12:51:31] i wonder what happens if I used a subquery...
[12:51:40] But all in all, I do not think it's worth it, as the benefit is too small, and the UDF just hides the logic.
[12:51:48] i could wrap the main select in a subquery, and then use the aliases in the outer query
[12:52:00] but a UDF like that will probably be useful
[12:52:04] Yes, you can do that, but that's harder to read for me than a UDF.
[12:52:07] haha
[12:52:17] nawww, barely! it's right in front of you in the query!
[12:52:18] Upgrade to CDH5.0. There we get macros :-)
[12:52:20] it looks the same!
[12:52:22] oh?
[12:52:25] linky?
[12:52:35] I checked yesterday ... let me go through the history...
[12:53:19] that's this?
[12:53:19] • Stats-based short cuts of aggregated queries (e.g. min, max and count)
[12:54:31] nope.
[12:54:44] Those were just for constant expressions, weren't they?
[12:54:44] hm k
[12:54:50] i dunno, i'm just googling
[12:57:39] oh, qchris
[12:57:42] duplicates
[12:57:48] we need to count distinct hostname AND sequence
[12:57:49] no?
[12:58:04] or is the groupby enough
[12:58:07] we group by hostname.
[12:58:12] jajaj, ok sorry
[12:58:12] ja
[12:58:17] count is limited to the group.
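[Editor's note: the subquery workaround mentioned above can be sketched as follows. Column and table names (`webrequest`, `min_seq`, `max_seq`) are illustrative assumptions; the point is that the inner query computes each aggregate once, and the outer query reuses the aliases instead of repeating MIN/MAX/COUNT expressions that Hive would recompute.]

```sql
-- Aggregate once in an inner query, reuse the aliases in the outer query.
SELECT
    hostname,
    max_seq - min_seq + 1                  AS count_expected,
    count_actual,
    (max_seq - min_seq + 1) - count_actual AS count_different
FROM (
    SELECT
        hostname,
        MIN(sequence) AS min_seq,
        MAX(sequence) AS max_seq,
        COUNT(*)      AS count_actual
    FROM webrequest
    GROUP BY hostname
) seq;
```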
[12:58:39] But Hive is doing strange things ... so I'll double-check.
[12:58:46] http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Release-Notes/cdh5rn_new_in_5b1.html?scroll=concept_cqm_gvc_xl_unique_1
[12:59:01] ^ there you'll find that they backported 2655
[12:59:03] https://issues.apache.org/jira/browse/HIVE-2655
[12:59:44] And looking through
[12:59:46] https://issues.apache.org/jira/secure/attachment/12586562/HIVE-2655-10.patch
[12:59:59] that's just the thing that would make UDFs more transparent here.
[13:00:11] So let's upgrade to CDH5 :-P
[13:02:22] ok, we are doing that anyway! :)
[13:02:54] ok, in that case, we just leave the multiple MAXes there for now?
[13:02:59] or should I try subquery? :)
[13:03:17] for now?
[13:05:02] btw, qchris, thanks for doing such an awesome review of this! you are awesome!
[13:05:23] I'd just leave the multiple MAXs.
[13:05:46] ottomata: review is easy. Coming up with the first solution is the hard part. :-P
[13:05:52] So thanks for doing that \o/
[13:06:43] (CR) Ottomata: Add bin/sequence-file wrapper for refinery-tools (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/133519 (owner: Ottomata)
[13:08:20] hey qchris, since we are being very nit-y about this, just curious
[13:08:29] do you have an opinion on uppercase vs lowercase in SQL?
[13:08:33] e.g. SELECT vs select
[13:08:51] i go back and forth on it
[13:08:51] Not too much opinion.
[13:08:51] we should certainly be consistent
[13:08:53] I like uppercase a tad more.
[13:09:12] But if you prefer lowercase, that's totally fine by me.
[13:09:13] but it seems unnecessary and a little annoying
[13:09:19] i would prefer it if it was convention
[13:09:35] i certainly don't type uppercase for ad-hoc queries
[13:09:41] I think our MediaWiki usage uppercases it.
[13:09:46] yeah, let's stick with convention then
[13:09:47] k
[13:12:34] ottomata: I double checked, and grouping /is/ respected for counting distincts.
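[Editor's note: a sketch of the duplicate-count point settled above — within `GROUP BY hostname`, `COUNT(DISTINCT sequence)` is scoped to the group, so hostname does not need to be added to the DISTINCT. The table name `webrequest` is an assumption.]

```sql
-- Per-host duplicate count: rows minus distinct sequence numbers.
-- The DISTINCT is evaluated per group, so hostname is already accounted for.
SELECT
    hostname,
    COUNT(*)                            AS count_actual,
    COUNT(*) - COUNT(DISTINCT sequence) AS count_duplicate
FROM webrequest
GROUP BY hostname;
```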
so no need to add hostname there.
[13:12:50] ha, ok thanks!
[13:43:20] hm, qchris, i'm not so sure about the name 'presence'
[13:43:30] what's better?
[13:43:30] hm
[13:43:35] 'check'?
[13:43:40] percent_different
[13:43:44] Neither am I, but I thought I nagged enough already :-)
[13:43:57] naw, bring on all the nags, especially naming ones
[13:44:00] those are important to me
[13:44:01] :)
[13:44:30] You said that you might rename other columns as well ... what will they be called?
[13:45:06] (or was that just getting rid of the sequence_ prefix?)
[13:45:06] here, will submit patch
[13:45:11] just that mainly
[13:45:57] Ok.
[13:46:32] So about 'presence' ... mhmmm (I guess the 'presence' in the file name 'presence.hql') ...
[13:46:56] It's a check for holes / duplicates in the sequence numbers.
[13:47:04] sequence_check.hql
[13:47:16] check_for_dense_sequence_numbers.hql
[13:47:37] aggregate_sequence_stats.hql
[13:47:40] (PS4) Ottomata: Add hive/webrequest/presence.hql to help monitor webrequest loss and duplication [analytics/refinery] - https://gerrit.wikimedia.org/r/134377
[13:48:09] (CR) Ottomata: Add hive/webrequest/presence.hql to help monitor webrequest loss and duplication (8 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/134377 (owner: Ottomata)
[13:48:52] ... I guess I like "aggregate_sequence_stats.hql", that's really descriptive
[13:49:08] and not too long (although long indeed)
[13:49:33] hmm
[13:49:43] what should the percent field be called?
[13:50:13] yeah i was thinking about check too
[13:50:21] sequence_stats.hql is good
[13:50:30] but i think naming that percent field might inform the name
[13:50:38] Sure.
[13:50:40] About the percent field ... mhmm ...
[13:51:14] The absolute one is called count_different ...
[13:51:24] so what about percentage_different
[13:51:37] percent_different
[13:51:45] count_different_percent
[13:51:57] i was thinking percent_different too
[13:52:18] so, table name:
[13:52:26] webrequest_sequence_stats?
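[Editor's note: pulling the naming decisions above together — `count_different` as the absolute column and `percent_different` as the relative one. The formula and the inner aggregation are assumptions sketched from the discussion; only the column names and the `webrequest` grouping come from the log. Computed as DOUBLE arithmetic, per the earlier precision discussion.]

```sql
-- Sketch of the sequence-stats columns, including percent_different.
SELECT
    hostname,
    max_seq - min_seq + 1                  AS count_expected,
    count_actual,
    (max_seq - min_seq + 1) - count_actual AS count_different,
    -- `/` yields DOUBLE in Hive, so small losses stay visible near 0%:
    (((max_seq - min_seq + 1) - count_actual) /
     (max_seq - min_seq + 1)) * 100        AS percent_different
FROM (
    SELECT hostname,
           MIN(sequence) AS min_seq,
           MAX(sequence) AS max_seq,
           COUNT(*)      AS count_actual
    FROM webrequest
    GROUP BY hostname
) seq;
```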
[13:52:36] Sounds good to me.
[13:52:38] hmmm
[13:52:38] ok
[13:53:38] We put everything into the wmf database for now. Right?
[13:53:58] (Or did we already split things into separate databases at some point?)
[13:54:44] right now, yeah
[13:54:48] Ok.
[13:55:01] i'm not sure about that either, i think we need a wmf raw database for stuff like this maybe
[13:55:07] and a different one for the refined data
[13:55:35] Having two separate databases for raw and refined data sounds convincing to me.
[13:56:27] yeah
[13:56:32] ok that is a todo
[13:56:37] maybe after the reinstall...
[13:56:40] and cdh5 upgrade
[13:56:42] maybe before!
[13:56:52] (PS5) Ottomata: Add hive/webrequest/seqeunce_stats.hql to help monitor webrequest loss and duplication [analytics/refinery] - https://gerrit.wikimedia.org/r/134377
[17:10:31] (PS1) Milimetric: Fix running tests through ipython [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/135062
[17:26:44] (PS8) Milimetric: Add cohort class hierarchy, refactor CohortService [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/134743
[17:26:49] (CR) Milimetric: Add cohort class hierarchy, refactor CohortService (9 comments) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/134743 (owner: Milimetric)
[19:03:21] milimetric: hiii
[19:03:23] yt?
[19:27:34] need brain bounce! :)
[20:17:02] ah! ottomata I missed your ping
[20:17:05] you around still?
[20:17:28] yup
[20:17:36] cool, what's up?
[20:17:41] oo man fan is going crazy, getting headphones, going to batcave
[20:17:46] omw
[22:16:48] milimetric: if you are still around, i think i got something good! but now just trying to get python module stuff working right...
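[Editor's note: a sketch of the agreed raw/refined database split and the `webrequest_sequence_stats` table. The database names (`wmf_raw`, `wmf`), column types, and exact column set are assumptions pieced together from the column names discussed in this log; the patch under review would define the real schema.]

```sql
-- Separate databases for raw imports and refined data (names assumed).
CREATE DATABASE IF NOT EXISTS wmf_raw;
CREATE DATABASE IF NOT EXISTS wmf;

-- Hypothetical schema for the table name agreed on above.
CREATE TABLE IF NOT EXISTS wmf_raw.webrequest_sequence_stats (
    hostname             STRING,
    count_actual         BIGINT,
    count_expected       BIGINT,
    count_null_sequences BIGINT,
    count_duplicate      BIGINT,
    count_different      BIGINT,
    percent_different    DOUBLE   -- DOUBLE per the precision discussion
);
```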
[22:17:13] ottomata: I do not envy you
[22:17:22] and sadly can't help either, people are over now
[22:17:25] ok cool, np
[22:17:36] gl though, or just put it off till tuesday
[23:15:53] (PS1) Ottomata: [WIP] Add code to auto-drop old hive partitions and remove partition directories [analytics/refinery] - https://gerrit.wikimedia.org/r/135128
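[Editor's note: the final patchset (https://gerrit.wikimedia.org/r/135128) concerns auto-dropping old Hive partitions. The HQL side of that, with hypothetical partition keys and table name, is roughly the following; the accompanying removal of partition directories would happen outside Hive, e.g. via HDFS commands in the patch's Python code.]

```sql
-- Drop one old hourly partition (table and partition spec are assumptions).
ALTER TABLE wmf_raw.webrequest
    DROP IF EXISTS PARTITION (year=2014, month=4, day=1, hour=0);
```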