[10:28:51] (PS1) Gilles: Fix typo on Finnish wikipedia [analytics/multimedia/config] - https://gerrit.wikimedia.org/r/130056
[10:32:18] (PS1) Gilles: Fix typo on Finnish wikipedia [analytics/multimedia] - https://gerrit.wikimedia.org/r/130057
[10:43:47] (CR) Nuria: [WIP]Adding test for cohort uploading for cohort with cyrilic and arabic usernames. (3 comments) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129672 (https://bugzilla.wikimedia.org/63933) (owner: Nuria)
[10:51:23] (PS4) Nuria: Adding test for cohort uploading for cohort with cyrilic and arabic usernames. [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129672 (https://bugzilla.wikimedia.org/63933)
[14:48:34] (PS1) Erik Zachte: new comScore files [analytics/wikistats] - https://gerrit.wikimedia.org/r/130081
[14:55:25] (PS8) Milimetric: Fix user name display in CSV files [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129025 (https://bugzilla.wikimedia.org/64026)
[14:56:43] (PS9) Milimetric: Fix user name display in CSV files [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129025 (https://bugzilla.wikimedia.org/64026)
[15:01:40] (CR) Nuria: [C: 2] "We tested on development sandbox that for two users with the same id in different projects reports do return distinct results per user." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129025 (https://bugzilla.wikimedia.org/64026) (owner: Milimetric)
[15:24:56] (PS5) Milimetric: Adding test for cohort uploading for cohort with cyrilic and arabic usernames. [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129672 (https://bugzilla.wikimedia.org/63933) (owner: Nuria)
[15:25:03] (CR) Milimetric: [C: 2] Adding test for cohort uploading for cohort with cyrilic and arabic usernames. [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/129672 (https://bugzilla.wikimedia.org/63933) (owner: Nuria)
[15:25:36] ori, yt?
[16:16:15] ottomata: hey
[16:18:37] heya, curious
[16:18:58] do you know how the eventlogging / kafka python module is partitioning the messages?
[16:19:01] is it just random?
[16:19:10] or is it using a keyed partitioner to select the partition?
[16:22:19] ottomata: keyed partitioner iirc
[16:22:31] keyed on what, do you know?
[16:23:11] the schema name
[16:24:53] does that change?
[16:24:55] per message?
[16:26:41] ottomata: not sure what you mean. there's a finite number of schemas that are active at any one time (around 30-40 atm), but the volume of messages per schema is uneven
[16:29:55] ah ok
[16:30:03] that is what i was wondering, and actually had not realized
[16:30:15] i had noticed that the partitions weren't balanced
[16:30:42] so the json schema in each message in the eventlogging stream is going to be different for each message?
[16:31:11] ori^?
[16:32:29] yeah, they're not balanced because they're logged at different volumes
[16:32:58] we could partition by uuid
[16:33:00] that's probably saner
[16:33:03] the uuid is unique per message
[16:33:43] or just random even
[16:33:45] would be fine
[16:34:00] but this is a little annoying, because I was hoping to map hive tables on top of the stream
[16:34:05] didn't realize they were different schemas
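A rough Python sketch of the partitioning behaviour discussed above. This is not the actual eventlogging/kafka-python producer code: the schema names and message volumes are invented, and a plain hash() stands in for whatever keyed partitioner the kafka client really uses. It just shows why keying on the schema name funnels each schema's whole volume into a single partition, while uuid-style or random assignment spreads messages roughly evenly.

    # Illustrative sketch only -- not the eventlogging producer. Schema names and
    # volumes are made up; hash() stands in for the client's keyed partitioner.
    import random
    from collections import Counter

    NUM_PARTITIONS = 10

    def keyed_partition(key, num_partitions=NUM_PARTITIONS):
        # Same key -> same partition, so a handful of schemas with very uneven
        # volumes produces very uneven partitions.
        return hash(key) % num_partitions

    def random_partition(num_partitions=NUM_PARTITIONS):
        # Random (or uuid-keyed) assignment spreads messages roughly evenly.
        return random.randrange(num_partitions)

    # Made-up per-schema volumes to mimic the skew described above.
    volumes = {'SchemaA': 50000, 'SchemaB': 12000, 'SchemaC': 200}

    by_schema, by_uuid = Counter(), Counter()
    for schema, count in volumes.items():
        for _ in range(count):
            by_schema[keyed_partition(schema)] += 1
            by_uuid[random_partition()] += 1

    print('keyed on schema name:', dict(by_schema))  # at most 3 busy partitions
    print('random / uuid-like:  ', dict(by_uuid))    # roughly balanced across 10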
[16:34:24] perhaps each schema should be its own topic in kafka? or i guess they are on-demand schemas, huh?
[16:34:54] each (schema, revision id) pair defines a mapping
[16:35:08] the mariadb consumer creates a table for each one of them
[16:35:49] like, new revision_id gets you an automatic new table?
[16:35:55] yes
[16:35:58] hmmmmm
[16:36:00] well,
[16:36:03] you need to log an event
[16:36:07] and it needs to be well-formed
[16:36:36] aye
[16:36:37] hm
[16:36:50] ottomata: https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/server/eventlogging/jrm.py#L111
[16:36:54] not sure how I can accommodate that automatically in hive
[16:37:04] i think we'd have to etl it after it is in hadoop
[16:37:12] all camus does is write the data to a directory named by the topic
[16:37:29] then we have a custom script that looks for tables in hive and adds partitions to them as they are imported
[16:37:35] so each scid (=schema + rev id) should be a topic
[16:37:41] yeah
[16:37:45] in order to use it this way
[16:37:57] are topics expensive?
[16:37:59] but that's pretty annoying, as there are 30-40 of them!
[16:38:05] no, not really, no more than partitions
[16:38:12] but they are not automatic
[16:38:22] they can be auto-created...
[16:38:24] hm
[16:38:47] basically the sql writer assumes there's a table and just issues an INSERT statement
[16:38:58] if the server comes back with an error indicating the table doesn't exist, it is created
[16:39:16] there is a potential for DDOS there, but tables are only created for legitimate (schema, rev id) pairs
[16:39:27] and creating revisions requires editing things on metawiki
[16:39:37] so you'd get caught pretty quickly
[16:40:01] i.e., if you were trying to trigger the creation of a large volume of tables by editing schemas and logging a well-formed event for each edit
[16:40:38] aye hm
[16:40:39] db1048 has 159 tables (=159 schema / rev id combinations), representing a year's worth of eventlogging data
[16:40:57] so yeah, we could autocreate them in kafka...
[16:40:58] i think we can do that
[16:41:10] we'd have to set some things to make sure partition and replica settings are correct for new topics
[16:41:11] but ungh
[16:41:21] still, going to be really annoying
[16:41:58] would you like to punt on that for now and chat about it in person at zurich?
[16:43:13] yeah sounds good
[16:43:29] should I skip the camus imports for now then?
[16:43:43] would the data be usable in hadoop even if it was all in one directory?
[16:43:54] i.e. didn't have a hive table mapped onto it?
[16:44:10] you could still write code to query it, via pig or whatever
[16:44:23] but you'd need to filter out schemas you didn't care about
[16:45:05] i think so. hive is sql-like, sql is sql. so if you really want to do straightforward relational querying, there's always the database there. hadoop is useful for more specialized data processing
[16:45:57] so i think it would still be usable
[16:46:19] ok, cool, i kinda want to get it working anyway, since I sent linkedin a pull request today... IT SHOULD WORK, everything says it is working
[16:46:21] but it isn't writing the data!
[16:46:32] ok, i gotta move to a cafe, gonna bang this out real quick when I get there!! GRRR
[16:46:33] bbl
[16:46:36] thanks ori!
[16:49:02] thanks ottomata
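The insert-first, create-on-error pattern described around 16:38 can be sketched roughly as follows. This is an illustrative Python sketch only: sqlite3 stands in for MariaDB, and the column layout, helper names, and revision id are invented; the real consumer is jrm.py (linked above), which derives each table's columns from the JSON schema.

    # Illustrative sketch only: the real mariadb consumer (jrm.py) builds tables
    # from the JSON schema. Here sqlite3 stands in for MariaDB, and the columns
    # and revision id are invented.
    import sqlite3

    def table_name(schema, rev_id):
        # One table per (schema, revision id) pair, i.e. per scid.
        return '%s_%s' % (schema, rev_id)

    def store_event(conn, schema, rev_id, event):
        table = table_name(schema, rev_id)
        insert = 'INSERT INTO %s (uuid, timestamp) VALUES (?, ?)' % table
        values = (event['uuid'], event['timestamp'])
        try:
            # Assume the table exists and just insert.
            conn.execute(insert, values)
        except sqlite3.OperationalError:
            # Server says the table is missing: create it for this scid, retry.
            # Only well-formed events for real schema revisions reach this point,
            # which is what bounds the table-creation abuse risk mentioned above.
            conn.execute('CREATE TABLE %s (uuid TEXT, timestamp INTEGER)' % table)
            conn.execute(insert, values)
        conn.commit()

    conn = sqlite3.connect(':memory:')
    store_event(conn, 'SomeSchema', 1234567, {'uuid': 'abc123', 'timestamp': 1398787200})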
[19:37:27] hey ori, i just figured out why camus and eventlogging weren't working!
[19:37:27] ah!
[19:37:34] need a real quick brainbounce
[19:38:52] ottomata: hey! what's up?
[19:40:05] hey so
[19:40:10] https://github.com/linkedin/camus/pull/65/files
[19:40:16] just thinking about the best way to support this
[19:40:18] ok so
[19:40:27] there's two properties that control this in the camus configs
[19:40:51] camus.message.timestamp.format=unix
[19:40:51] camus.message.timestamp.field=timestamp
[19:41:03] so, Camus expects timestamps returned to be in milliseconds
[19:41:06] that's why this wasn't working
[19:41:19] i didn't realize that, because everything else I had done so far had been returned by SimpleDateFormat
[19:41:20] so
[19:41:21] wondering
[19:41:27] whatcha think
[19:41:37] better to have two separate properties for unix formats:
[19:41:45] 'unix_seconds', 'unix_milliseconds'
[19:41:52] or, to be smart and somehow infer which the timestamp is
[19:41:53] like
[19:42:08] if timestamp > some big number (since there are many more milliseconds) then assume timestamp is in milliseconds
[19:42:11] else multiply by 1000
[19:43:09] i just caught up on your chat from earlier about this
[19:43:37] so right now camus is getting the json from eventlogging, as if it was doing a zsub?
[19:45:20] ottomata: ^
[19:45:42] kinda, there's a consumer implementation
[19:45:52] it can transform the data for kafka's sake
[19:46:05] so it can multiply by 1000 if that's what kafka expects
[19:46:14] oh.
[19:46:15] hm
[19:46:24] i mean, it is easy to do in camus too, ori
[19:46:31] and yeah, i think inferring could be ok
[19:46:33] i could just check
[19:46:39] if timestamp > 1000000000000, assume in milliseconds
[19:46:43] in seconds
[19:46:48] that would be in the year 33658 :p
[19:47:21] works for me
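A minimal sketch of the seconds-vs-milliseconds inference agreed on above. The actual change went into Camus's JsonStringMessageDecoder in Java; this Python version (with an example timestamp picked purely for illustration) just shows the logic.

    # Minimal sketch of the inference: anything below 1,000,000,000,000 is
    # treated as seconds, since a unix timestamp that large in seconds would
    # fall in the year 33658.
    MS_THRESHOLD = 1000000000000

    def to_millis(timestamp):
        if timestamp > MS_THRESHOLD:
            return timestamp           # already in milliseconds
        return timestamp * 1000        # seconds -> milliseconds

    assert to_millis(1398787200) == 1398787200000        # seconds in
    assert to_millis(1398787200000) == 1398787200000     # milliseconds in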
[19:48:00] milimetric|lunch: i should fill you in on details from my conversation w/springle about uuids
[19:48:07] but i have to run now, might have to catch you tomorrow
[19:48:28] i'll leave irc on ori, feel free to fill me in whenever and i'll check tonight
[19:48:32] but i'm around for another few hours now
[19:50:22] ottomata: if you wanted it to work forever you could just check that converted to years it's not more than a hundred years away from the current year or something like that.
[19:50:57] but your partition puzzle from above is what nags at me
[19:51:25] yeah
[19:53:22] ottomata: I don't think it matters that the partitions would be uneven if you partitioned by schema_revid
[19:53:41] that's the most natural partition, anything else would hinder queryability of the data
[19:54:18] because basically you want selects where schema_revid == something-specific and timestamp between A and B
[19:54:32] doing "timestamp between A and B" is really hard if you partition by time, for example
[19:55:05] ah, i mean, ok sorry
[19:55:07] in that convo
[19:55:13] there are two meanings of partition
[19:55:18] hive partitions is what you are talking about
[19:55:20] and yeah, that is a problem
[19:55:21] right
[19:55:29] i was talking about kafka partitions
[19:55:32] oh
[19:55:40] which has more to do with balancing data than anything else
[19:55:57] yep, true, but does that matter since this data is miniscule compared to the other things kafka has to deal with?
[19:57:11] not really, just annoying because everything else is balanced!
[19:57:11] it means that some disks work harder than the others
[19:57:11] not a LOT harder
[19:57:11] but enough for me to notice
[19:57:11] some camus import mappers take longer to run than others
[19:57:11] etc.
[20:07:42] gotcha, makes sense
[20:16:50] (PS1) Ottomata: Adding support for integer unix timestamp in JsonStringMessageDecoder [analytics/camus] (wmf) - https://gerrit.wikimedia.org/r/130204
[20:16:52] (PS1) Ottomata: Automatically converting to millisecond timestamp if the value is large enough [analytics/camus] (wmf) - https://gerrit.wikimedia.org/r/130205
[20:17:45] (CR) Ottomata: "These are being merged upstream by LinkedIn, going ahead and merging. Let me know if you have comments or objections." [analytics/camus] (wmf) - https://gerrit.wikimedia.org/r/130204 (owner: Ottomata)
[20:17:49] (CR) Ottomata: "These are being merged upstream by LinkedIn, going ahead and merging. Let me know if you have comments or objections." [analytics/camus] (wmf) - https://gerrit.wikimedia.org/r/130205 (owner: Ottomata)
[20:17:57] (CR) Ottomata: [C: 2 V: 2] Adding support for integer unix timestamp in JsonStringMessageDecoder [analytics/camus] (wmf) - https://gerrit.wikimedia.org/r/130204 (owner: Ottomata)
[20:18:03] (CR) Ottomata: [C: 2 V: 2] Automatically converting to millisecond timestamp if the value is large enough [analytics/camus] (wmf) - https://gerrit.wikimedia.org/r/130205 (owner: Ottomata)
[20:18:55] (PS1) Ottomata: Releasing 0.1.0-wmf4 with support for integer timestamps in JsonStringMessageDecoder [analytics/camus] (wmf) - https://gerrit.wikimedia.org/r/130206
[20:19:17] (CR) Ottomata: [C: 2 V: 2] Releasing 0.1.0-wmf4 with support for integer timestamps in JsonStringMessageDecoder [analytics/camus] (wmf) - https://gerrit.wikimedia.org/r/130206 (owner: Ottomata)
[20:27:29] (PS1) Ottomata: Deploying camus-wmf-0.1.0-wmf4.jar [analytics/kraken/deploy] - https://gerrit.wikimedia.org/r/130208
[20:28:13] (CR) Ottomata: [C: 2 V: 2] Deploying camus-wmf-0.1.0-wmf4.jar [analytics/kraken/deploy] - https://gerrit.wikimedia.org/r/130208 (owner: Ottomata)
[20:33:47] (PS1) Ottomata: Importing eventlogging via Camus into HDFS [analytics/kraken] - https://gerrit.wikimedia.org/r/130209
[20:34:42] (CR) Ottomata: [C: 2 V: 2] Importing eventlogging via Camus into HDFS [analytics/kraken] - https://gerrit.wikimedia.org/r/130209 (owner: Ottomata)
[20:35:15] (PS1) Ottomata: Updating kraken, now importing eventlogging via Camus into HDFS [analytics/kraken/deploy] - https://gerrit.wikimedia.org/r/130210
[20:35:28] (CR) Ottomata: [C: 2 V: 2] Updating kraken, now importing eventlogging via Camus into HDFS [analytics/kraken/deploy] - https://gerrit.wikimedia.org/r/130210 (owner: Ottomata)