[00:01:24] Analytics-Backlog, Discovery, Reading-Infrastructure-Team: Determine proper encoding for structured log data sent to Kafka by MediaWiki - https://phabricator.wikimedia.org/T114733#1815581 (bd808) >>! In T114733#1812461, @EBernhardson wrote: > Avro's json format might be a better choice for writing to ka... [00:18:41] Analytics-Backlog, Fundraising research, Research-and-Data: FR tech hadoop onboarding - https://phabricator.wikimedia.org/T118613#1815658 (madhuvishy) @atgo - Do you need to be able to run Hive queries? You were added to statistics-privatedata-users that doesn't allow for that. You'd have to request to... [01:51:12] (PS2) Madhuvishy: [WIP] Setup celery task workflow to handle running reports for the Global API [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/253750 (https://phabricator.wikimedia.org/T118308) [01:52:00] (CR) jenkins-bot: [V: -1] [WIP] Setup celery task workflow to handle running reports for the Global API [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/253750 (https://phabricator.wikimedia.org/T118308) (owner: Madhuvishy) [02:32:19] (PS3) Madhuvishy: [WIP] Setup celery task workflow to handle running reports for the Global API [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/253750 (https://phabricator.wikimedia.org/T118308) [02:33:11] (CR) jenkins-bot: [V: -1] [WIP] Setup celery task workflow to handle running reports for the Global API [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/253750 (https://phabricator.wikimedia.org/T118308) (owner: Madhuvishy) [02:54:09] (PS4) Madhuvishy: [WIP] Setup celery task workflow to handle running reports for the Global API [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/253750 (https://phabricator.wikimedia.org/T118308) [03:43:42] randomly...any ideas what an .sbt file would look like for building hive udf's? i need a super simple udf that sum's an array so figured scala would be easy...the scala part was easy, but now sbt is a pain in my rear :P [08:08:19] Analytics-Backlog: Wikimedia "top" pageviews API has problematic double-encoded JSON - https://phabricator.wikimedia.org/T118931#1816459 (whym) [09:11:56] Analytics-Backlog, Database: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856#1816566 (jcrespo) As an update, this task showed to be more complex than initially thought. The complex setup of the eventlogging schema means that it is very prone to break, as [[ https... [10:20:08] (PS1) Addshore: Load WikimediaCurl in twitter file [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/254127 [10:20:22] (CR) Addshore: [C: 2 V: 2] Load WikimediaCurl in twitter file [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/254127 (owner: Addshore) [10:40:33] Hi all! Does anyone have any idea where these files come from? https://metrics.wmflabs.org/static/public/datafiles/Pageviews/ [10:41:35] Hi addshore [10:41:38] Yes we do :) [10:41:55] * addshore is just trying to explain the massive jump in https://vital-signs.wmflabs.org/#projects=wikidatawiki/metrics=Pageviews [10:42:25] addshore: Those files are compute using the aggregator tool (https://github.com/wikimedia/analytics-aggregator) [10:42:26] from roughly the 9th of this month to present day [10:42:26] apparently coming from the desktop site per the csv [10:43:09] hm [10:43:19] Is it explicitly page views? or domain requests? Is there a limit on the domains for wikidata? or is it a wildcard? 
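[Editor's note: a minimal sketch answering the 03:43:42 question above about an .sbt build for a Scala Hive UDF that sums an array. The project name, organization, versions and file layout are illustrative assumptions, not taken from the channel; if Hive's reflection-based UDF bridging proves too strict for array arguments, the class would need to be rewritten as a GenericUDF.]

// build.sbt -- hypothetical minimal build definition; versions are assumptions
name := "hive-udfs"
organization := "org.example"
version := "0.1.0"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
  // "provided" keeps the cluster-supplied Hive/Hadoop jars out of the packaged artifact
  "org.apache.hive" % "hive-exec" % "1.1.0" % "provided",
  "org.apache.hadoop" % "hadoop-common" % "2.6.0" % "provided"
)

// src/main/scala/SumArray.scala -- sums the non-null elements of an array<double> column
import org.apache.hadoop.hive.ql.exec.UDF
import scala.collection.JavaConverters._

class SumArray extends UDF {
  def evaluate(values: java.util.List[java.lang.Double]): java.lang.Double = {
    if (values == null) null
    else java.lang.Double.valueOf(values.asScala.filter(_ != null).map(_.doubleValue).sum)
  }
}

[After "sbt package", the jar would be registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION sum_array AS 'SumArray'.]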
[10:43:42] it is what is considered as pageviews [10:44:09] as a random stab in the dark might query.wikidata.org accidentally be bulked into this? [10:44:24] http://discovery.wmflabs.org/wdqs/#wdqs_usage << note the big usage spike of the query service at the same point [10:45:43] * joal looks at pageview definition for wikidata domain [10:46:02] link? :D [10:47:11] addshore: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L66 [10:47:37] yeh, we need to exclude query from that too! [10:47:45] =] [10:47:54] at least for wikidata [10:47:58] now the question addshore : what is the mime type of those requests [10:48:24] *checks* [10:49:03] content-type:application/sparql-results+json [10:50:02] addshore: and a path example ? [10:50:17] Because the logic behind pageview def is taking all that into account :) [10:50:38] http://tinyurl.com/nz8nvb7 << too long for irc ;) [10:50:53] https://query.wikidata.org/bigdata/namespace/wdq/sparql?query=stuffhere.. [10:51:34] although requests to https://query.wikidata.org/* should be ignored :) [10:51:37] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L297 [10:51:49] These requests shouldn't be included [10:52:00] hmmm [10:52:09] But ok, let's make a ticket to remove query.wikidata.org from pageview [10:52:16] *will do* [10:52:24] filed against analytics-backlog? [10:52:28] Let me double check on hive before [10:52:33] okay! [10:52:43] Because it might not even be that specific case [10:57:52] Analytics-Backlog, Analytics-General-or-Unknown, WMDE-Analytics-Engineering, Wikidata, Story: [Story] Statistics for Special:EntityData usage - https://phabricator.wikimedia.org/T64874#1816685 (JAllemandou) [10:58:19] also addshore, added the task about Special:EntityData to our backlog --^ [10:58:28] awesome :) [10:58:42] No promise on deadline though ;) [10:58:54] the end of the year would be amazing ;) [10:58:59] Analytics-Backlog, WMDE-Analytics-Engineering, Wikidata: Remove query.wikidata.org from pageview definition (for wikidata) - https://phabricator.wikimedia.org/T119054#1816686 (Addshore) NEW [10:59:06] also joal created this one for tracking this thing ^^ [10:59:11] k :) [10:59:23] addshore: can you add the comment in the task about deadline ? [11:01:09] sure! [11:01:43] Analytics-Backlog, Analytics-General-or-Unknown, WMDE-Analytics-Engineering, Wikidata, Story: [Story] Statistics for Special:EntityData usage - https://phabricator.wikimedia.org/T64874#1816693 (Addshore) It would also be great to have this running (perhaps with all possible historical data (I th... [11:02:31] basically we are trying to get a bundle of stuff done for the dev summit [11:02:52] addshore: on one hour (2015-11-14T19:00 UTC) --> No query.wikidata.org domain [11:03:13] These don't seem to show up in our webrequest [11:03:23] Looking for the full day [11:03:49] hmm, I am sure I have seen them in webrequest raw before!
[11:04:13] as I was looking at the response codes [11:04:41] ohhh, maybe response code thing: I only kept 200 and 304 [11:05:23] hmm, well there should be lots of 200s in there [11:06:09] So for the given hour, only www.wikidata.org or m.wikidata.org [11:06:51] for the given hour: 2317120 on www, not pageview, and 146672 on www pageview [11:07:17] on mobile, it's negligible (4576 and 2410 respectively) [11:07:19] addshore: --^ [11:07:44] for 2015-11-14T19:00 ? [11:08:32] yep, mobile has minimal usage on wikidata currently [11:08:50] addshore: https://gist.github.com/jobar/e8954d5cac5b7c605620 [11:09:55] ahh, *didn't know about the is_pageview value there* [11:11:05] addshore: it's the one on which we filter to get the pageview_hourly table :) [11:11:38] addshore: but still, no query.wikidata.org :( [11:11:47] I'm still mildly confused then, as this massive spike doesn't show in reqstats at all [11:12:26] well, I guess reqstats !== page views but [11:12:39] I don't know about reqstats [11:13:03] this still seems very odd to me ;) [11:13:34] addshore: maybe some bot managing to get through our filtering (not very difficult, not to say very easy) [11:19:03] *keeps digging* [11:26:59] addshore: daily results seem reasonable (3M pageviews) [11:27:11] addshore: digging into user agent [11:27:57] wait, 3M pageviews? :P that's even more than is showing in the csv! :P [11:28:20] Analytics-Backlog, WMDE-Analytics-Engineering, Wikidata: Investigate wikidata pageview sipke on 2015-11-14 - https://phabricator.wikimedia.org/T119054#1816728 (JAllemandou) [11:28:34] addshore: not filtered by agent_type = 'user' :) [11:28:50] addshore: changed the title of the task you created --^ [11:28:55] awesome! [11:29:15] not filtered by agent_type ahh! [11:29:41] So I assume half/half roughly [11:29:47] (user / spider) [11:43:14] so odd, and looking at the pageview api it doesn't look like any 1 page has been accessed an extreme amount [11:43:54] addshore: I think I have the culprit: http-kit/2.0 not considered as spider :( [11:44:22] Is that the whole user agent? >.> [11:44:44] yesir [11:44:50] hah [11:45:19] makes about half of the pageview requests for the hour I study [11:45:22] HTTP client/server for Clojure [11:45:27] wow [11:45:32] Th [11:45:36] This explains that [11:45:46] Let's change again the task you filed [11:45:47] is that have of all pageviews, or half of non bot pageviews? [11:45:53] *half [11:46:02] half all pageviews [11:46:07] hah [11:46:18] coming from the same IP as well at a guess? [11:46:24] http-kit/2.0 84311 [11:46:41] second row: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) 19648 [11:46:47] A good power law :) [11:46:53] Didn't check the ip [11:48:34] I just looked at a few live requests and it looks like it's loading all revisions... :P [11:50:14] k [11:52:48] Now addshore that thing is a concern for me because of that: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java#L58 [11:53:13] hmmmm [11:53:15] addshore: Normally, useragents containing http show up as spiders [11:54:08] And I have counter examples in the list I have generated for wikidata [11:54:11] :( [11:55:19] Analytics-Kanban, Datasets-Webstatscollector: Wikimedia "top" pageviews API weirdness with the "Paul_Elio" article - https://phabricator.wikimedia.org/T118933#1816777 (Nemo_bis) Please remember to add a specific blue project to all tasks related to pageviews data.
[11:55:58] Analytics-Backlog, Datasets-Webstatscollector, Language-Engineering: Investigate anomalous views to pages with replacement characters - https://phabricator.wikimedia.org/T117945#1816779 (Nemo_bis) [11:56:40] joal: :/ [11:57:55] this thing is making requests to /all/ sites it would seem [11:58:17] and doing the same thing, seemingly requesting random old versions of random pages [11:59:33] joal: how heavy / long would a query looking at all sites take? for this useragnt / the IP? [12:01:12] not sure I understand: what time-frame, and what restriction/data: when user_agent = http-kit/2.0 ? [12:01:27] Analytics-Backlog, Datasets-General-or-Unknown: Wikimedia "top" pageviews API has problematic double-encoded JSON - https://phabricator.wikimedia.org/T118931#1816785 (Nemo_bis) [12:01:37] well, just a sample of an hour would be fine *is writing it* [12:08:40] https://www.irccloud.com/pastebin/aRLv4npA/ [12:09:07] joal: ^^ requests from the useragent & IP for 1 hour per site [12:09:21] wow, that guy crawl heavily [12:09:42] yeh, but why the hell is it crawling old revisions? [12:15:22] k addshore, found the full explanation: I messed up at deploy time of the change for user agent to exclude http [12:15:27] I'm gonna fix that today [12:16:08] :D Cool! Just stick stuff in that ticket :) [12:16:17] I will :) [12:16:20] epic! [12:16:41] also, off the top of your head does the top page views api bit exclude crawlers? ;) [12:18:02] Analytics-Kanban, WMDE-Analytics-Engineering, Wikidata: Fix '.*http.*' not being tagged as spiders in webrequest - https://phabricator.wikimedia.org/T119054#1816809 (JAllemandou) p:Triage>Unbreak! a:JAllemandou [12:18:35] addshore: from 2015-11-17, yes, before, data need to be backfilled (https://phabricator.wikimedia.org/T118991) [12:19:03] okay! [12:22:13] Analytics-Kanban, WMDE-Analytics-Engineering, Wikidata: Fix '.*http.*' not being tagged as spiders in webrequest - https://phabricator.wikimedia.org/T119054#1816815 (JAllemandou) I messed up a deploy about a month ago, preventing the change merged here: https://gerrit.wikimedia.org/r/#/c/244465/ to act... [12:23:37] Analytics-Kanban: Investigate cassandra daily top job [5 pts] {slug} - https://phabricator.wikimedia.org/T118449#1816817 (JAllemandou) [12:38:18] (PS1) Joal: Upgrade refine oozie job to jar v0.0.20 [analytics/refinery] - https://gerrit.wikimedia.org/r/254133 (https://phabricator.wikimedia.org/T119054) [12:38:55] (CR) Joal: [C: 2] "Self merging bug" [analytics/refinery] - https://gerrit.wikimedia.org/r/254133 (https://phabricator.wikimedia.org/T119054) (owner: Joal) [12:39:19] (CR) Joal: [V: 2] "Self merging bug" [analytics/refinery] - https://gerrit.wikimedia.org/r/254133 (https://phabricator.wikimedia.org/T119054) (owner: Joal) [12:43:01] !log Deploying refinery [12:44:04] Analytics-Kanban, WMDE-Analytics-Engineering, Wikidata, Patch-For-Review: Fix '.*http.*' not being tagged as spiders in webrequest [5 pts] {hawk} - https://phabricator.wikimedia.org/T119054#1816854 (JAllemandou) [12:53:37] !log Restart refine bundle [13:34:29] Analytics, Beta-Cluster-Infrastructure: deployment-fluorine fails puppet '/usr/sbin/usermod -u 10003 datasets' returned 4: usermod: UID '10003' already exists - https://phabricator.wikimedia.org/T117028#1816912 (faidon) a:ArielGlenn This is a long-known problem that @ArielGlenn and I had discussed bef... 
[14:17:15] Analytics-Backlog, Analytics-General-or-Unknown, WMDE-Analytics-Engineering, Wikidata, Story: [Story] Statistics for Special:EntityData usage - https://phabricator.wikimedia.org/T64874#1817016 (Addshore) [14:21:28] Analytics-Backlog, MediaWiki-API, Reading-Infrastructure-Team, Research-and-Data, and 4 others: Publish detailed Action API request information to Hadoop - https://phabricator.wikimedia.org/T108618#1817055 (Addshore) [14:40:08] halfak__: give me a ping when you're up halfak__ :) [14:40:17] exit [14:40:21] oops :) [14:41:06] o/ joal [14:44:14] ping joal [14:49:15] addshore: http-kit/2.0 is considered a spider from now on ! [14:49:18] addshore: http-kit/2.0 is considered a spider from now on ! [14:49:29] amazing! :) [14:49:42] halfak: I'm sorry, it actually didn't ping :( [14:50:00] No worries [14:50:13] I was just about to restart my client ;) [14:50:17] so halfak, I was pissed yesterday, and rewrote the job as a map reduce with secondary sorting [14:50:40] Tested on a single one that had failed previously, --> worked [14:50:50] now running on the full stuff [14:50:53] Seems ok [14:51:10] Best programming == "Goddamn it this should work" [14:51:11] Right *goes to go and do a comparison of influxdb vs opentsdb vs graphite.... [14:51:13] :D [14:51:28] yessir :) [14:51:42] halfak: finished late yesterday night, but happy :) [14:52:05] Cool! Thanks joal. I also started up some work. It was pretty easy to change my scripts to handle XML, but I needed to get my python environment set up on the IA/Research cluster. [14:52:13] So I'm glad you beat me to it :D [14:52:28] halfak: I have also figured out an interesting improvement: pushing predicates down into the input format -- Not having to decode the json to filter / generate sorting keys [14:52:54] Not yet fully implemented, but will soon [14:53:13] Yes! I would love to have this in streaming. [14:53:25] I do a lot of json2tsv, tsv2json work. [14:53:52] halfak: well, I still have to parse the xml though :) [14:54:26] halfak: But since this step is inevitable, I prefer to have as much of the predicate work as possible done after the first parse step [14:55:13] +1 [14:56:26] dcausse: Hi Sir :) [14:56:39] dcausse: To let you know that your oozie has been deployed [14:56:43] joal: hi! [14:56:49] thanks! :) [14:56:49] Ready to launch whenever you want dcausse [14:56:58] well we still have issues with avro :( [14:57:09] dcausse: I know that you still have schema management issues, but still wanted to let you know :) [14:57:12] yup [14:57:54] I'll wait for the team to agree, but will probably move your task to done and let you start the job when you're ready dcausse, ok ? [14:58:08] joal: sounds good [14:58:15] awesome dcausse :) [15:00:33] Analytics-Kanban, Datasets-Webstatscollector: Wikimedia "top" pageviews API weirdness with the "Paul_Elio" article [5 pts] {slug} - https://phabricator.wikimedia.org/T118933#1817184 (JAllemandou) [15:03:14] joal, I see a succeeded job on the research cluster with 18 maps and 1 reduce. [15:03:26] halfak: was my test bed :) [15:03:36] currently running job is the real big one [15:03:42] Gotcha. [15:03:48] halfak: 6% into reduce [15:03:59] halfak: usual 20h map [15:04:03] How many reducers? [15:04:09] 2000 halfak [15:04:13] Great :) [15:04:20] \o/ [15:04:24] I tried to remember the number you asked me ;) [15:04:40] Well halfak, until it's finished, I won't say anything [15:04:55] But I feel more confident with that run [15:05:04] Still. It's starting to feel like we're building with Legos instead of atoms.
:) [15:05:44] hm halfak, for me it's more: since the lego I had failed, I went back to atoms to build my own lego piece :) [15:06:14] But getting dirty hands on this kind of problem is where you learn more, I think :) [15:06:18] halfak: --^ [15:06:31] It's painful and frustrating, but I learnt :) [15:08:08] I look forward to the point at which we have high-functioning, battle-tested legos to write some docs about. [15:09:00] halfak: I am actually quite happy with the formatting I have in scala, I'll show it to you in a while and see how we can promote that if research needs it :) [15:15:42] :) It'll be good that I get a handle on this at least so that I can modify and re-use it. [15:20:07] Analytics-Backlog, Discovery, Reading-Infrastructure-Team: Determine proper encoding for structured log data sent to Kafka by MediaWiki - https://phabricator.wikimedia.org/T114733#1817270 (Ottomata) Could be worth it. We had a meeting yesterday and discussed at least trying this. We want to see if we... [15:32:24] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1817294 (Ottomata) FYI, the repo is here, waiting for some schemas! :) https://gerrit.wikimedia.org/r/#/admin/projects/mediawik... [15:45:58] (PS9) DCausse: Add support for custom timestamp and schema rev id in avro message decoders [analytics/refinery/source] - https://gerrit.wikimedia.org/r/251267 (https://phabricator.wikimedia.org/T117873) [15:52:56] Analytics-Backlog, Discovery, Reading-Infrastructure-Team: Determine proper encoding for structured log data sent to Kafka by MediaWiki - https://phabricator.wikimedia.org/T114733#1817361 (dcausse) Unfortunately AvroJson won't help to resolve the issues we have with AvroBinary, if the schema used by the... [15:55:30] bd808: sure, let's talk today [15:57:22] addshore: you can also contribute code patches/tests if you want your changes to be done sooner [15:57:46] nuria: of course :) [15:58:08] dcausse: yt? [15:59:12] dcausse: how did you work around the offset issues? [15:59:59] nuria: thanks. I have a magically meeting-free day today so I can talk whenever you have time. [16:00:30] let me reach dcausse before cause he is on european tz, what is your tz? [16:00:48] GMT-7 (MST) [16:01:23] nuria: hi! [16:01:31] you were right AvroJson won't work :( [16:01:44] hi yalls! [16:01:51] dcausse: loads of work.. eh? [16:02:04] it's exactly the same requirements, the writerSchema needs to be known even with AvroJson [16:02:05] dcausse: i read your super handy workaround [16:02:18] nuria, dcausse, i'm looking at eventbus stuff too, really not sure how all this is going to work, but i'm leaning more and more to making meta.wm.org http schemas be the way we share schemas...not sure though [16:02:21] avro is very strict :/ [16:02:25] dcausse: but 1st.. how did you get around the offset stuff? [16:02:33] would like to see if we can make it work with file based repo and with avro [16:02:53] ottomata: I wrote a workaround that uses the classpath for now [16:02:53] ottomata: did you read dcausse 's e-mail on analytics-internal [16:02:56] ?
[16:03:28] I think it will be easy to change it to use an external service (file or http) [16:03:29] dcausse: that is a workaround we can use I think (need to think about cases but do read it and let me know) [16:03:32] yes [16:04:08] ok, let's 1st work on unblocking search team [16:04:08] the format in kafka will be magicbyte + long(8 bytes) [16:04:28] hm, dcausse do you need a reader schema for camus? [16:04:33] if you have the writer schema, that should be enough, no? [16:04:37] ottomata: 1st unblocking and later a solution that fetches schemas in a more solid fashion [16:04:40] evolution can be handled when reading with hive? [16:04:51] ottomata: probably [16:05:30] i mean, i guess it's nice, but it does make it a little more complicated [16:05:36] with the property, etc. [16:05:40] mmmm..let's work the case of reading old data from kafka [16:05:49] if you prefer I can remove this latest schema code [16:05:56] nuria: reading old data would just use the given writer schema [16:06:02] and use that to write to hadoop [16:06:15] then when reading historical data, say, via hive, you'd use the latest schema [16:06:31] and hive would have each writer schema from the header of each binary file [16:06:48] ottomata: you still need to be sure that writerSchema is < readerSchema [16:07:13] yeah, you'd have to evolve properly [16:07:18] that's the only advantage of specifying readerSchema in camus [16:07:21] but that would be true for kafka too, right? [16:07:25] oh, hm [16:07:29] yeah you'd validate that before writing? [16:07:43] I have not tested your idea [16:07:54] i mean, yeah it is a little nicer to have camus always writing with the same schema [16:07:55] but I can remove some code and test if you want [16:08:07] it probably will normalize things a bit in hadoop [16:08:46] the schema used to write a file will be the same for each camus run, which means the time at which the binary file schema changes in hadoop will always be the same for all data [16:09:09] rather than the case where say there are 2 producers, one of which has schema A, the other schema B, and camus' latest schema C [16:09:22] in that case, if you just use the writer schema, you'd have both A and B in the same hourly partition [16:09:23] hm. [16:09:34] i am getting lost [16:09:57] hah, i'm arguing pros and cons of using a reader schema at all with camus (that is, doing schema evolution in camus) [16:10:01] I think you're right it's not necessary to have the readerSchema, if the writerSchema is not known in camus it will fail anyway... [16:10:21] right, there are some pros to having it though, with adding just a little complexity [16:10:32] i guess that can be optional dcausse? if the property isn't set it'll just use the writer? [16:10:48] not sure but I can make it I think [16:10:56] good idea [16:12:56] joal / ottomata I have some oozie stuff for this geo-coded pageview stuff [16:13:04] dcausse: give us a few minutes as ottomata and i need to discuss another project we got going on. I will get back to you [16:13:04] where do you think it makes sense in refinery? [16:13:18] milimetric: new stuff totally ?
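[Editor's note: a sketch of the "magicbyte + long(8 bytes)" Kafka message framing described at 16:04:08: one marker byte, an 8-byte schema revision id, then the Avro binary payload. The magic byte value and the object/method names are assumptions for illustration, not the code under review in Gerrit.]

import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

object AvroKafkaFraming {
  val MagicByte: Byte = 0x0 // hypothetical marker meaning "binary Avro with schema rev id"

  // Producer side: magic byte + 8-byte rev id + Avro binary body
  def encode(record: GenericRecord, writerSchema: Schema, schemaRevId: Long): Array[Byte] = {
    val body = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(body, null)
    new GenericDatumWriter[GenericRecord](writerSchema).write(record, encoder)
    encoder.flush()
    val payload = body.toByteArray
    ByteBuffer.allocate(1 + 8 + payload.length)
      .put(MagicByte)
      .putLong(schemaRevId)
      .put(payload)
      .array()
  }

  // Consumer side: returns (schemaRevId, avroBinaryPayload); the caller looks up the
  // writer schema by rev id and may apply a newer compatible reader schema on decode.
  def decode(message: Array[Byte]): (Long, Array[Byte]) = {
    val buf = ByteBuffer.wrap(message)
    require(buf.get() == MagicByte, "unexpected magic byte")
    val revId = buf.getLong()
    val payload = new Array[Byte](buf.remaining())
    buf.get(payload)
    (revId, payload)
  }
}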
[16:13:19] oozie/pageview/geographical [16:13:22] dcausse: I still want to know how did you deal with the offset stuff [16:13:33] well, yeah, it's new [16:13:45] depends on pageview_hourly (not confirmed yet, but probably) [16:13:48] nuria: I did not test with camus in prod yet [16:14:01] dcausse: ah ok, just with unit tests [16:14:04] hm, wonder --> we have projectview at same level as pageview [16:14:05] maybe oozie/pageview/hourly/geographical ? [16:14:08] I've just tested with unit test and found that the problem remains the same :( [16:14:58] milimetric: I'd go for pageview/geo/hourly (or other) [16:15:00] dcausse: i see, that is actually a better path than the one i took, it is too bad we do not have a better test env for this [16:15:01] oops, I guess I mean it depends on projectview [16:15:15] milimetric: project based only, huh [16:15:16] cool, so oozie/projectview/geo/hourly [16:15:20] yes [16:15:27] if ottomata says so :) [16:15:49] ok, ottomata: bless this path or condemn it: "oozie/projectview/geo/hourly" [16:21:18] seems that ottomata is in a "no decision" mood milimetric ;) [16:21:45] it's ok, IRC does not come with a "must answer right away" contract :) [16:33:10] haha [16:33:13] was in batcave with nuria [16:33:22] milimetric: what is this data? [16:33:47] it's data for Erik's geographic breakdown of pageviews by project [16:33:51] hourly [16:33:53] dcausse: back, then, my preference will be to eventually use confluent schema registry but in order to unblock you we can use your suggested interim solution, we need to wrap it such that the schema retrieval [16:33:59] (he currently has it at 15 minute intervals [16:34:01] ) [16:34:07] but I'm seeing if he's ok with hourly [16:34:37] dcausse: can be swapped to use schema registry later (or even our meta json extension for schemas) [16:34:43] milimetric: where will this data be saved in hdfs? [16:34:54] will there also be a new hive table? [16:34:55] nuria: yes I think it will be relatively easy [16:34:57] what will it be called? [16:35:05] dcausse: https://meta.wikimedia.org/wiki/Schema:EditorActivation [16:35:07] no hive table, just files [16:35:13] Erik said it's up to us where we put the files [16:35:33] do you want to use meta now? [16:35:38] dcausse: either one will work later i think but deploying the schema registry now will delay you guys too much i think [16:36:08] dcausse: and so will using meta [16:36:12] milimetric: hm [16:36:17] what do you think though? [16:36:21] ok, will use the classpath as a repo for now and change it later, what's important is to make sure we won't need to change the format in kafka to encode the schema rev id [16:36:41] dcausse: right, exactly that would not change [16:36:43] milimetric: where does the projectview data live now? [16:36:54] well there's a projectview table for that [16:37:08] and then it's archived in ...../archive/pageview/2015... etc. [16:37:14] dcausse: I think we need to do a bit more work on the code though, let me think about it a bit [16:37:18] hm [16:37:21] milimetric i'm looking at [16:37:24] ok [16:37:26] /mnt/hdfs/wmf/data/archive/projectview on stat1002 [16:37:30] legacy webstatscollector ? [16:37:42] one sec lemme reread that oozie [16:37:49] i think i am remembering [16:37:59] legacy is the legacy format, not legacy data def [16:38:11] right joal?
(i know we decided on this together...:p) [16:38:35] right, https://github.com/wikimedia/analytics-refinery/blob/master/oozie/projectview/hourly/coordinator.properties#L70 [16:38:41] OH and webstatscollector is not used anymore, right? we were going to call it that [16:38:45] but decided not to? [16:38:47] archive/projectview/legacy/hourly [16:39:14] yeah, ok, joal, can we remove the archive/projectview/webstatscollector directory? [16:39:16] yes, that folder was wsc instead of legacy in my patch, and we decided against it [16:39:18] i think it is not used [16:39:44] i don't remember ever wanting to call webstatscollector spelled out [16:39:50] milimetric: would it make sense to store this data at archive/projectview/geo/hourly [16:39:50] ? [16:40:02] yes, that makes sense to me [16:40:35] I have read the thread - sounds good :) [16:40:57] i'm going to put a README file in the legacy/ dir for projectview and pageview linking to the oozie readme file, ok? [16:41:11] ok, milimetric i'm fine with that then [16:41:18] cool, thx [16:41:18] IIRC we changed from legacy to webstatscollector with milimetric when he made the change about pageviews [16:41:48] the oozie path seems fine too, a little weird since we'll have projectview/{hourly,geo/hourly} [16:41:58] no, the other way. It was wsc and we changed it to legacy, lemme find the gerrit change [16:42:02] that's right [16:42:07] (PS10) DCausse: Add support for custom timestamp and schema rev id in avro message decoders [analytics/refinery/source] - https://gerrit.wikimedia.org/r/251267 (https://phabricator.wikimedia.org/T117873) [16:42:07] k milimetric [16:42:14] memory .... [16:43:50] joal: you're right and I'm right, we both remembered half of the story [16:43:51] https://gerrit.wikimedia.org/r/#/c/246149/6..7/oozie/pageview/hourly/coordinator.properties [16:44:00] we changed it from webstatscollector to legacy [16:44:13] k [16:44:52] k, so i'm going to delete the webstatscollector dirs there [16:44:55] there isn't new data in them [16:47:43] dcausse: see https://github.com/linkedin/camus/blob/master/camus-kafka-coders/src/main/java/com/linkedin/camus/etl/kafka/coders/JSONToAvroMessageDecoder.java#L102 [16:47:47] (CR) Ottomata: "I'd recommend to not use schema id 0 (I may be using it to mean 'latest known schema' in EventBus stuff), and also to make any schema rev" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/251267 (https://phabricator.wikimedia.org/T117873) (owner: DCausse) [16:48:16] dcausse: that is the built in support to read schema id from incoming stream correct? [16:48:32] nuria: this one is for avro json only I think [16:49:25] huh, that is kinda cool though [16:49:42] that means you don't have to do magic integer thing with avro json [16:49:47] because the json can be parsed either way [16:49:51] and the schema id can be read out of the json [16:49:53] nuria, ottomata : are you ok if we move dcausse task https://phabricator.wikimedia.org/T117575 to done (I deployed today) [16:50:07] joal: yes, thank you [16:50:10] nuria, ottomata : No job has been started but the code is there [16:50:45] and not sure it will handle timestamp correctly [16:51:03] Analytics-Kanban: JsonRevisionsSortedPerPage failed on enwiki-20150901-pages-meta-history [13 pts] {paon} - https://phabricator.wikimedia.org/T114359#1817522 (JAllemandou) [16:51:08] it uses its own CamusWrapper that supports only timestamp in millisec [16:51:22] dcausse: yes, timestamp part will not work.
[16:51:51] Analytics-Kanban: JsonRevisionsSortedPerPage failed on enwiki-20150901-pages-meta-history [13 pts] {paon} - https://phabricator.wikimedia.org/T114359#1693022 (JAllemandou) I tested various memory, each failed. I finally went and rewrote the job using core mapreduce API instead of using scrunch. Job is still... [16:52:30] ottomata: concerning rev 0, I need something to handle the case where the kafka message does not have any magic bytes [16:52:37] dcausse: given the issues with schemas, do we wnat to produce json from mw, or avro? [16:52:42] *want [16:53:19] nuria: it depends, if you want to use the patch I've made avro seems easier as we don't have to convert to AvroJson [16:53:50] if you prefer to use the classes from linkedin then AvroJson seems easier, but we still need to re-work this timestamp issue [16:54:28] dcausse: avro sounds good. that is what i thought. In that case, the int for the schema should not need to be added as it is already the avro convention [16:54:52] ? [16:54:54] dcausse: and if your logging utility does avro binary it is probably already set up to add that int, correct? [16:55:11] I don't think so... [16:55:25] maybe you're talking about the ObjectContainer format? [16:55:36] dcausse: in the producer side [16:55:48] https://avro.apache.org/docs/1.7.7/spec.html#Object+Container+Files [16:56:21] the schema rev id is not part of the avro spec... but maybe I'm wrong [16:56:29] Analytics-Kanban: Prepare Pageview API lightning talk {melc} - https://phabricator.wikimedia.org/T119091#1817541 (mforns) NEW a:mforns [16:57:28] nuria: no [16:57:36] the php producer they are using will need to be modified to add the byte [16:57:39] pretty sure [16:57:40] dcausse: I thought it was .. but no sounds like i am the one that is wrong [16:57:52] that is just a special convention that everyone in the kafka world is using [16:57:56] it is not built into avro [16:58:12] ottomata: ahhhh [16:58:31] dcausse: we have standup, we'll be back [16:58:40] k [16:58:42] hm MHMMm HMMH [16:59:07] dcausse: if this is built into the camus avro json encoder... i would lean towards using avro json...but i don't want to keep going back and forth on you [16:59:26] :) [16:59:33] i guess it depends on how difficult it is to modify the php stuff to do avro json [16:59:42] i mean, you'll have to modify it anyway to do the integer byte [16:59:45] hm. [16:59:53] and the timestamp problem [16:59:54] i dunno. [16:59:55] yeah [17:00:03] but the timestamp problem is easy, we know what to do. [17:00:17] but you'll have to rewrite the built-in [17:00:26] yeah i guess so... [17:00:39] yeah standup... hm will keep thinking, if you really want to do the binary thing, i think it's ok [17:02:01] Analytics-EventLogging, Analytics-Kanban, EventBus, Patch-For-Review: Deploy eventlogging from new repository.
[5 pts] - https://phabricator.wikimedia.org/T118863#1817566 (Ottomata) [17:02:42] Analytics-Kanban, RESTBase, Services, RESTBase-API: configure RESTBase pageview proxy to Analytics' cluster {slug} [34 pts] - https://phabricator.wikimedia.org/T114830#1817569 (kevinator) Open>Resolved [17:03:19] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1817571 (kevinator) Open>Resolved [17:03:37] Analytics-Kanban: Pageview API Press release {slug} [2 pts] - https://phabricator.wikimedia.org/T117225#1817572 (kevinator) Open>Resolved [17:03:50] Analytics-Backlog: Traffic Breakdown Report - Visiting Country {lama} - https://phabricator.wikimedia.org/T115605#1817576 (kevinator) [17:03:51] Analytics-Kanban: Understand the Perl code for this report - visiting country {lama} - https://phabricator.wikimedia.org/T117243#1817574 (kevinator) Open>Resolved [17:04:02] Analytics-Kanban: Understand the Perl code for this report - Visiting Country per Wikipedia Language {lama} - https://phabricator.wikimedia.org/T117244#1817578 (kevinator) Open>Resolved [17:04:03] Analytics-Backlog: Traffic Breakdown Report - Visiting Country per Wikipedia Language {lama} - https://phabricator.wikimedia.org/T115608#1817580 (kevinator) [17:04:20] Analytics-Kanban: Missing Pageview API data for one article {slug} [3 pts] - https://phabricator.wikimedia.org/T118785#1817581 (kevinator) Open>Resolved [17:04:32] Analytics-Kanban: Backfill cassandra pageview data - September [5 pts] {slug} - https://phabricator.wikimedia.org/T118450#1817583 (kevinator) Open>Resolved [17:07:05] Analytics-Kanban, Datasets-Webstatscollector: Wikimedia "top" pageviews API weirdness with the "Paul_Elio" article [5 pts] {slug} - https://phabricator.wikimedia.org/T118933#1817585 (Milimetric) @Nemo_bis: you need us to add Datasets-Webstatscollector on all pageview data tasks? I'm not familiar... [17:07:55] Analytics-Kanban: Pageview API documentation for end users {slug} [8 pts] - https://phabricator.wikimedia.org/T117226#1817587 (kevinator) [17:08:05] Analytics-Kanban: Pageview API documentation for end users {slug} [8 pts] - https://phabricator.wikimedia.org/T117226#1817588 (kevinator) Open>Resolved [17:08:12] is there an existing repo for hive udf's? I needed something that could sum up array and hive didn't have anything built in [17:08:25] Analytics-Kanban: Troubleshoot Hebrew characters in Wikimetrics {dove} [2 pts] - https://phabricator.wikimedia.org/T118574#1817591 (kevinator) Open>Resolved [17:08:39] Analytics-EventLogging, Analytics-Kanban, EventBus, Patch-For-Review: Deploy eventlogging from new repository [5 pts] - https://phabricator.wikimedia.org/T118863#1817593 (kevinator) [17:08:50] Analytics, Analytics-Kanban, Discovery, EventBus, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1817596 (kevinator) [17:08:54] Analytics-EventLogging, Analytics-Kanban, EventBus, Patch-For-Review: Deploy eventlogging from new repository [5 pts] - https://phabricator.wikimedia.org/T118863#1811200 (kevinator) Open>Resolved [17:09:06] Analytics-Kanban: Research avro schema evolution, do we need a write and reader schema? 
- https://phabricator.wikimedia.org/T119092#1817597 (Nuria) NEW [17:09:15] Analytics-Backlog: Traffic Breakdown Report - Visiting Country per Wiki {lama} - https://phabricator.wikimedia.org/T115607#1817606 (kevinator) [17:09:17] Analytics-Kanban: Understand the Perl code for "Visiting Country per Wiki" report {lama} - https://phabricator.wikimedia.org/T117247#1817604 (kevinator) Open>Resolved [17:09:23] ebernhardson: analytics/refinery/source/refinery-hive [17:09:33] I added a new UDF in https://gerrit.wikimedia.org/r/#/c/253046/ [17:10:02] bd808: ok thanks! [17:12:20] Analytics-Kanban: Research avro schema evolution, do we need a write and reader schema? - https://phabricator.wikimedia.org/T119092#1817608 (Nuria) See encoding of the same data with schema1 and schema1 with an additional "optional" field. Note that the absence of the "additional " field is represented by an... [17:19:19] bd808: :Wq [17:19:22] ignore that :) [17:19:39] * bd808 force saves and quits [17:19:53] (CR) Ottomata: "Hm, I'm unsure about the terms 'internal' and 'external'. They are very subjective based on point of view. Perhaps it won't matter...and" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/253046 (https://phabricator.wikimedia.org/T118592) (owner: BryanDavis) [17:21:44] Analytics-Backlog, RESTBase, Services: configure RESTBase pageview proxy to Analytics' cluster on wiki-specific domains - https://phabricator.wikimedia.org/T119094#1817629 (Milimetric) NEW [17:22:14] Analytics-Kanban, RESTBase, Services, RESTBase-API: configure RESTBase pageview proxy to Analytics' cluster {slug} [34 pts] - https://phabricator.wikimedia.org/T114830#1707320 (Milimetric) @mobrovac: I opened up T119094 to continue the work. [17:22:19] dcausse: how will you know what schema to use at all if the kafka message doesn't have the magic byte? [17:22:51] (CR) BryanDavis: "> I'm unsure about the terms 'internal' and 'external'." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/253046 (https://phabricator.wikimedia.org/T118592) (owner: BryanDavis) [17:22:59] Analytics-Kanban, Datasets-General-or-Unknown: Wikimedia "top" pageviews API has problematic double-encoded JSON - https://phabricator.wikimedia.org/T118931#1817646 (Milimetric) p:High>Unbreak! a:Milimetric [17:26:44] Analytics-Backlog, Database: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856#1817656 (Nuria) @jcrespo: Super thanks for your work on this. We understand that priority wise this ticket is not urgent, while being important. We will watch this ticket for any upcomi... [17:27:58] ottomata: it was the purpose of this rev 0 [17:28:23] just to correctly support existing stream, but I can drop backward compat support [17:28:30] right but, what info do you have to even know the schema name? [17:28:33] oh, from that topic config? [17:29:16] topic+rev, it's how the camus/kafka SchemaRegistry works [17:29:17] bd808: "Reasonable point. We can easily change the color of the shed at this point." -> nice, thanks for being flexible [17:29:28] right, our camus thing, hm. [17:29:40] and by using rev 0, if all is backwards compat, i'tll just drop the new fields, or whatever [17:29:41] hm [17:29:58] dcausse: i think if you are going to build in this binary support, we should not support to message formats [17:30:00] two* [17:30:02] not really "our" camus thing but com.linkedin.camus.schemaregistry.SchemaRegistry [17:30:08] nuria, ottomata: how about "wikimedia", "wikimedia_labs", and "internet" for the labels? 
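[Editor's note: a small self-contained sketch for T119092 just above, showing standard Avro schema resolution: a record written with an old writer schema is read back with a newer reader schema that adds an optional (nullable, defaulted) field. The SearchEvent schema is invented for illustration.]

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object SchemaEvolutionDemo extends App {
  // Old (writer) schema: one field
  val writerSchema = new Schema.Parser().parse(
    """{"type":"record","name":"SearchEvent","fields":[{"name":"query","type":"string"}]}""")

  // New (reader) schema: adds an optional field with a default
  val readerSchema = new Schema.Parser().parse(
    """{"type":"record","name":"SearchEvent","fields":[{"name":"query","type":"string"},{"name":"hits","type":["null","int"],"default":null}]}""")

  // Encode a record with the old schema
  val rec = new GenericData.Record(writerSchema)
  rec.put("query", "pageview api")
  val out = new ByteArrayOutputStream()
  val enc = EncoderFactory.get().binaryEncoder(out, null)
  new GenericDatumWriter[GenericRecord](writerSchema).write(rec, enc)
  enc.flush()

  // Decode with both schemas: Avro fills the missing field from its default
  val dec = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
  val read = new GenericDatumReader[GenericRecord](writerSchema, readerSchema).read(null, dec)
  println(read) // {"query": "pageview api", "hits": null}
}

[As long as added fields are nullable with defaults, data written with the older schema stays readable under the newer one, which is the property the writer-vs-reader schema discussion above depends on.]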
[17:30:15] it uses topic? [17:30:17] ottomata: ok [17:30:21] ottomata: yes [17:30:25] i thought it just used a unique schema id [17:30:30] looking [17:30:35] no, the way it works is bit messy [17:30:47] huh! [17:31:35] imho these classes are designed for a very specific use case at linkedin [17:31:41] aye [17:33:43] I guess it's one of the reason we wrote our own decoder based on https://github.com/linkedin/camus/blob/master/camus-kafka-coders/src/main/java/com/linkedin/camus/etl/kafka/coders/JSONToAvroMessageDecoder.java#L102 [17:33:52] and not use them directly [17:36:21] dcausse: am looking at https://github.com/linkedin/camus/blob/master/camus-schema-registry/src/main/java/com/linkedin/camus/schemaregistry/FileSchemaRegistry.java [17:41:29] Analytics-Tech-community-metrics, DevRel-November-2015: Fix 404s (VizGrimoireJS entirely broken) on korma's mediawiki.html - https://phabricator.wikimedia.org/T118167#1817689 (Lcanasdiaz) Open>Resolved Guys, this panel is deprecated. I've replace the content with HTML code to redirect to wiki.html.... [17:41:30] Analytics-Tech-community-metrics: MediaWiki.org stats should also consider discussion activity (Talk/Thread namespaces) - https://phabricator.wikimedia.org/T62074#1817691 (Lcanasdiaz) [17:42:11] hm, dcausse, doesn't look that useful, the schema ids are just sha1s, so not incrementing [17:43:24] ottomata: schemas in the jar is a blocker? [17:47:16] ? [17:49:44] ottomata: the solution could work without too much effort if the schemas are in the jar (at least for now). But if you think it's a big issue then maybe we should leaves this out for now and wait for EventBus? [17:52:13] Analytics-Backlog, Analytics-Cluster, Easy: Add client IP to webrequest tables - https://phabricator.wikimedia.org/T116772#1817768 (Nuria) [17:53:06] Analytics-Backlog, Analytics-Cluster, Easy: Add client IP to webrequest tables - https://phabricator.wikimedia.org/T116772#1817780 (Milimetric) p:Triage>Normal [17:56:12] Analytics-Backlog, Analytics-Cluster, Improving access, Research-and-Data: Hashed IP addresses in refined webrequest logs - https://phabricator.wikimedia.org/T118595#1817808 (Milimetric) p:Triage>Normal [17:58:57] Analytics-Backlog: Wikimedia Analytics Refinery Jobs TestCamusPartitionChecker test failure when running as bd808 on stat1002 - https://phabricator.wikimedia.org/T119101#1817812 (bd808) NEW [17:59:58] Analytics-Backlog, RESTBase, Services: configure RESTBase pageview proxy to Analytics' cluster on wiki-specific domains - https://phabricator.wikimedia.org/T119094#1817820 (Milimetric) p:Triage>Normal [18:05:06] Analytics-Backlog, Analytics-General-or-Unknown, WMDE-Analytics-Engineering, Wikidata, Story: [Story] Statistics for Special:EntityData usage - https://phabricator.wikimedia.org/T64874#1817825 (Nuria) @addshore: Do you have access to cluster 1002 to run querys yourself? Timeline wise if you need... [18:06:47] Analytics-Backlog: Wikimedia Analytics Refinery Jobs TestCamusPartitionChecker test failure when running as bd808 on stat1002 - https://phabricator.wikimedia.org/T119101#1817828 (Milimetric) p:Triage>Normal [18:11:34] Analytics-Backlog: Track stats for outreach.wikimedia.org in pageview_hourly - https://phabricator.wikimedia.org/T118987#1817853 (Nuria) In the pageview definition this domain is excluded on purpose at research's team request. Since pageview_hourly only stores what we consider pageviews as per pageview defin... 
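[Editor's note: a hypothetical sketch of the interim "schemas shipped in the jar" lookup discussed above (16:36:21, 17:49:44). The resource layout (avro/<topic>/<revId>.avsc) and the object name are invented; this is neither the com.linkedin.camus SchemaRegistry interface nor the actual refinery/source code.]

import scala.io.Source
import org.apache.avro.Schema

object ClasspathSchemaRepo {
  // Loads the Avro writer schema for a topic at a given revision from the jar's resources.
  def schemaFor(topic: String, revId: Long): Schema = {
    val path = s"/avro/$topic/$revId.avsc"
    val stream = Option(getClass.getResourceAsStream(path))
      .getOrElse(throw new IllegalArgumentException(s"No schema resource at $path"))
    try new Schema.Parser().parse(Source.fromInputStream(stream, "UTF-8").mkString)
    finally stream.close()
  }
}

[Combined with the rev id read out of the message framing sketched earlier, a lookup like this could later be swapped for a file- or HTTP-backed registry without changing the byte format of the Kafka messages.]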
[18:17:19] Analytics-Backlog: Track stats for outreach.wikimedia.org in pageview_hourly - https://phabricator.wikimedia.org/T118987#1817876 (Milimetric) p:Triage>Normal [18:17:41] Analytics-Kanban: Backfill daily-top-articles in cassandra [2015-09-01 - 2015-11-16 (included)] {slug} - https://phabricator.wikimedia.org/T118991#1817879 (JAllemandou) [18:17:42] Analytics-Backlog: Backfill data on cassandra removing spiders from top endpoint - https://phabricator.wikimedia.org/T118972#1817878 (JAllemandou) [18:20:33] Analytics-Kanban, Datasets-General-or-Unknown: Wikimedia "top" pageviews API has problematic double-encoded JSON - https://phabricator.wikimedia.org/T118931#1817890 (Nemo_bis) [18:21:02] Analytics-Cluster, Analytics-Kanban: {slug} Pageview API - https://phabricator.wikimedia.org/T101792#1817892 (Nemo_bis) [18:21:16] Analytics-Backlog: Create a dedicated hive table with pageview API only requests for reporting - https://phabricator.wikimedia.org/T118938#1817898 (Milimetric) p:Triage>Normal [18:22:34] (PS1) BryanDavis: Rename network_origin UDF partitions [analytics/refinery/source] - https://gerrit.wikimedia.org/r/254170 (https://phabricator.wikimedia.org/T118592) [18:59:57] Analytics-Backlog, Analytics-Cluster, Improving access, Research-and-Data: Hashed IP addresses in refined webrequest logs - https://phabricator.wikimedia.org/T118595#1818121 (csteipp) I talked with @ellery about this briefly. I'd prefer that we don't permanently make this connection between our we... [19:07:27] Analytics-Backlog, Analytics-Cluster, Improving access, Research-and-Data: Hashed IP addresses in refined webrequest logs - https://phabricator.wikimedia.org/T118595#1818162 (ellery) @csteipp Otto mentioned that there is the potential to introduce a request ID. We could associate eventlogging recor... [19:09:16] dcausse: wait, schemas ARE in teh jar [19:10:08] dcausse, ottomata : did you guys talked further about avro issues? [19:10:18] nuria: yes, in the solution I use exclusively the jar, this is not very clean... [19:11:01] nuria: in fact I don't know, sometimes I feel so close but sometimes I feel that I'll never finish this job :) [19:11:41] dcausse: maybe we can do a hangout today too ? cc ottomata [19:12:02] as you want [19:18:01] dcausse: let me know what will be a good time [19:18:17] nuria: now would be perfect if possible :) [19:18:34] dcausse: ok, give me 5 mins to get setup [19:21:06] (CR) Nuria: [C: 2 V: 2] "Thank you for doing changes, terminology is a lot better and I am sorry we didi not suggested this earlier. Tests run clean." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/254170 (https://phabricator.wikimedia.org/T118592) (owner: BryanDavis) [19:34:49] can do another hangout if yall want [19:37:18] (CR) Ottomata: "thank you!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/254170 (https://phabricator.wikimedia.org/T118592) (owner: BryanDavis) [19:37:46] joal: still there? [19:38:32] joal: i could use some help with the offset issue i was running into when testing camus [19:45:58] nuria: maybe I can help you? [19:46:56] ottomata: k , my error is this one: [19:48:18] ^ ottomata [19:48:46] ja? [19:49:52] https://www.irccloud.com/pastebin/kz8rSR5T/ [19:50:01] sorry, my irc dropped ottomata [19:50:31] full log at /home/nuria/avro-kafka/log_camus_avro_test.txt [19:53:35] nuria: can you just remove your camus offset files [19:53:36] ? [19:53:45] from hdfs? [19:53:48] from my /tmp/dir? 
[19:53:57] yes, I tried that but will do again [19:57:12] from yeah wherever you are having camus write offsets [19:58:40] nuria: I'm back [19:58:57] have you managed with ottomata ? [19:59:13] joal: I am doing another run, let me see [20:02:31] milimetric: should i get this patch merged and then start one for the logic of each node one? https://phabricator.wikimedia.org/T118309 [20:02:59] * milimetric looks [20:03:03] or multiple patches based off of each other - i don't know that much git magic [20:03:38] oh, if you wanna chain gerrit changesets it's not too bad. [20:03:49] it gets a little harder if you have to update changes earlier in the chain [20:03:55] yeah [20:03:57] but it's not bad, I can help [20:04:03] it's just a matter of rebasing [20:04:39] joal: me no compredou, i have deleted all my offsets under /tmp/nuria/history [20:04:43] so if you're on a branch where you worked on this first change, just git checkout -b next-change-branch-name-whatever [20:04:44] and it still says: 15/11/18 22:38:50 INFO kafka.CamusJob: Previous execution: hdfs://analytics-hadoop/tmp/nuria/history/2015-10-05-18-32-57 [20:04:58] milimetric: so i just continue working in my same branch, push the new changes to a new patch? [20:05:12] what happens if i want to fix things in the old patch? [20:05:13] nuria [20:05:15] what about base? [20:05:30] madhuvishy: my suggestion is to make a separate branch for each change [20:05:32] that is where it picks up the last run from [20:05:34] nuria: and dest too - i would clear all [20:05:36] your main camus path [20:05:38] delete it and history [20:05:40] i deleted all /tmp/nuria ottomata ( i think) lemme seee [20:05:43] oh [20:06:03] milimetric: okay [20:06:05] i can do that [20:06:05] ottomata: [20:06:10] https://www.irccloud.com/pastebin/tEpVQndL/ [20:06:15] milimetric: and then keep rebasing - okay [20:06:25] so clean as of now [20:06:26] yeah, and let me know if you run into trouble [20:06:36] there are lots of ways to do the same thing in git, you just gotta find one that you like [20:07:25] ottomata, joal: [20:07:30] but camus must be writing this info elsewhere to know: " [20:07:30] milimetric: okay sure. also, should I make a separate GlobalMetricUpload or sth form to provide for the cohort stuff and start date, end date? [20:07:36] https://www.irccloud.com/pastebin/61Uq8z2L/ [20:08:08] madhuvishy: sth form? [20:08:17] you can just push the form you had in the same change if you have it [20:08:28] milimetric: sth=something sorry [20:08:28] or if you want, you can organize your commits with git add -p [20:08:43] :) oh, right [20:08:59] so in gerrit, one commit == one change [20:09:19] not really :D [20:09:21] so managing a chain of changes is managing a chain of commits [20:09:42] milimetric: no, this is not about gerrit - I'm asking otherwise - should we reuse CohortUpload, or make a new form, because start and end date are not in cohort upload [20:09:46] (repo, branch, change-id) == change [20:10:04] careful! hashar is WATCHING.... [20:10:18] oh right, sorry, yes, new form is cool. You can subclass CohortUpload I guess [20:10:19] heh [20:10:32] hashar: nah, those are just artifacts [20:10:34] milimetric: yeah I'll do that [20:10:46] and congratulations folks for the PageView RESTBase entry point. I am sure we will see creative uses of that api \O/ [20:11:12] thanks hashar :) I agree with you that someone should put little graphs of pageviews on each article [20:11:52] joal, ottomata : do any of you know where camus stores offsets ? in zookeeper? 
[20:12:00] joal: maybe taht makes no sense. [20:12:09] no, just in hdfs [20:12:16] nuria, i'm going to try to run your stuff.. [20:12:33] avro-kafka, right? [20:12:36] the properties in there? [20:12:38] ok, it is on /home/nuria/avro-kafka/ launch_camus_job_no_wrapper.sh [20:12:49] k [20:12:50] ottomata: ya, see sh file [20:13:27] milimetric: a first step would be Special:Statistics . Then maybe that can be done on the backend and cached in memcached to avoid calls to restbase [20:13:37] anyway kudos [20:14:12] hashar: it's behind varnish anyway, so doesn't need to be babied [20:14:28] we might need to work on the cache expirations for it [20:16:21] hashar: ya, caching should not be an issue with daily expiration [20:16:22] nuria: I'll have a look at that tomorrow if I have no news from you :) [20:18:28] nuria: i get a different error [20:18:34] null record [20:18:52] or is that what you were getting [20:27:38] ottomata: nah, i was getting offset one [20:28:20] joal:ok [20:31:54] bye a-team, tomorrow [20:32:04] nite [20:32:08] salut! [20:34:06] ottomata: lemme re-run [20:34:19] not sure how to get past this error though it is weird too [20:36:26] ottomata: i am just going to use "kafka.move.to.earliest.offset" [20:36:36] setting and see if it works [20:37:09] earliest will probably be bad, as there is weird data in test topic, no [20:37:11] i am now trying [20:37:13] kafka.move.to.last.offset.list=test [20:37:38] oh but there isn't actual data in this tocpi at the end, right? hm [20:38:56] nuria: job succeeded when Iset that [20:39:22] but no data imported, obvi [20:39:30] ottomata: there is data , i just pushed it [20:39:36] ok [20:39:38] running again [20:39:44] ottomata: using cat data.json | kafkacat -b kafka1012.eqiad.wmnet:9092 -t test [20:39:58] aye just saw that in your .txt file :) [20:40:38] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1818524 (ezachte) Resolved>Open [20:41:39] ottomata: i think i had a job running you might need to retry [20:41:41] null recored! :) [20:41:47] naw shouldn't matter, i'm importing elsewhere [20:42:17] yargh, not sure nuria, this seems to not be an offset problem though [20:43:08] nuria: https://gist.github.com/ottomata/6d5aea160e53104345e0 [20:43:21] milimetric: if you are still around, the PageView demo page at https://analytics.wmflabs.org/demo/pageview-api/ doesn't work for me in either chrome or safari :( [20:43:34] milimetric: i just see cats and dogs but no graph / bar whatever :} [20:43:42] hashar: are you using chromium? [20:43:56] ottomata: we have gotten that before,that is a red herring [20:44:04] milimetric: tried with v46 [20:44:09] nuria: of? [20:44:18] hashar: there's some default security setting on chromium (not chrome) that doesn't let you work with CORS [20:44:23] ottomata: the important thing is [20:44:26] https://www.irccloud.com/pastebin/6h1J3mLQ/ [20:44:43] hashar: so basically the only problem is your browser has to allow CORS [20:44:52] milimetric: ahh cors I was suspecting that but couldn't not find any error to report it. Works in firefox for some reason [20:45:07] milimetric: so +1 on the demo :-))))) [20:45:31] nuria: yes beacus it failed reading? 
[20:46:03] ottomata: right, on my prior runs with madhuvishy we got the ETL key error every time [20:46:19] ottomata: but when things were working bytes read was bigger than 0 [20:46:37] yeah, that was because we had non avro nonsense test messages in there [20:47:00] right, i think maybe the job isn't able to read whatever you are piping to kafka nuria? [20:47:42] ottomata: so this error you think it doesn't matter [20:47:53] https://www.irccloud.com/pastebin/j8TObwWd/ [20:48:26] i think it does but i don't get that error [20:48:29] ottomata: ok, let's try changing things [20:58:33] ottomata: and there is no way to flush a topic right? [21:00:49] no [21:01:02] but nuria, you can create another test topic if it would help, might be good to have a couple [21:01:03] test1 [21:01:04] maybe [21:01:08] you can just produce to it, and it will be created [21:01:14] ohhhhh [21:03:10] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1818566 (ezachte) =Done Update diagram, a.o. to show new file names + added missing report {F2976774} Added docs on data1001 U... [21:04:32] (CR) Deskana: "Discovery could use some feedback on this, as it's sitting our review queue. Thanks!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247601 (https://phabricator.wikimedia.org/T115919) (owner: OliverKeyes) [21:07:20] (CR) Nuria: "I have reviewed patch, suggesting we change string constants by enums some time ago, I can work together with this with oliver as needed b" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247601 (https://phabricator.wikimedia.org/T115919) (owner: OliverKeyes) [21:13:41] dcausse: FYI that creating a new topic make offset issues disappear. cc ottomata , run still fails. [21:13:56] what now? [21:26:27] (PS2) Mforns: Add sum aggregate by user report [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/254068 (https://phabricator.wikimedia.org/T117287) [21:27:52] (PS3) Mforns: Add sum aggregate by user report [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/254068 (https://phabricator.wikimedia.org/T117287) [21:49:43] Analytics-Backlog, Database: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856#1818703 (mforns) @jcrespo Sorry for pinging you via email, I will avoid that in the future. Thanks! [21:49:58] ottomata: it just processes no data but i guess it could also be schema errors on my end (although i validated that data with schema before sending it) . i am going to move to david's patch to see if i can add a bit of structure to the registry [21:57:13] nuria: aye, ok [21:57:16] sorry i couldn't be more help [22:16:16] Analytics-Backlog, Fundraising research, Research-and-Data: FR tech hadoop onboarding - https://phabricator.wikimedia.org/T118613#1818828 (atgo) I think it's fine as is and will let you know if that changes. Thanks! [22:19:08] milimetric: is there some special knockout magic that makes the start and end date fields appear on the UI? [22:19:31] on the report creation page that is [22:33:21] madhuvishy: no magic, but the full story is a bit confusing [22:33:39] madhuvishy: I suspect the part that you're missing is that the various Metrics inherit from TimeseriesMetric [22:33:50] which defines start_date and end_date as WTForm fields [22:33:57] milimetric: aah [22:34:00] ummm [22:34:07] i copied all that over [22:34:30] hm? 
:) [22:34:46] milimetric: as in, in my new form [22:34:53] i included those fields [22:34:58] may be i missed something [22:35:44] https://www.irccloud.com/pastebin/OAGW81dg/ [22:36:06] madhuvishy: how are you generating the HTML? [22:36:11] milimetric: I don't see anything else i'd need [22:36:27] milimetric: I put in [22:36:31] https://www.irccloud.com/pastebin/nZkkqU9H/ [22:36:49] although probably this value: default.start_date I have to define in the js [22:37:04] but i don't get the input boxes at all on the UI [22:37:55] knockout.util.js is included on the page [22:39:15] oh! [22:39:25] sorry, right if you're using that whole mess... [22:39:32] but wait, why do you have to use that? [22:40:04] milimetric: use what? are you saying no need to use data-bind? [22:40:28] no, that's a custom binding meant to implement all the absolutely crazy things they wanted us to do with dates on that page [22:40:35] aah [22:40:45] so i can directly use the datetimepicker [22:40:50] lemme see ... maybe we can use it by hardcoding some stuff [22:40:56] they always want it to be UTC [22:41:08] so we can maybe hardcode the UTC zone [22:41:11] oh wait [22:41:14] but but [22:41:16] no they want the output to be UTC.... [22:41:20] its not even showing up [22:41:34] right, that makes sense, it would probably throw errors right now [22:41:41] is that because of the params? no errors too [22:41:42] that may or may not be swallowed up by that insane binding [22:41:49] aahhh [22:41:50] ok [22:41:56] it does get swallowed [22:42:00] (they'd be swallowed 'cause that thing handles infinitely many use cases or something) [22:42:11] alright [22:42:20] https://eonasdan.github.io/bootstrap-datetimepicker/ [22:42:47] i should just use the plugin directly may be? [22:45:31] uh... [22:45:40] so they'll probably want the same time zone support and crap [22:46:27] this basically makes it so they can enter dates in any time zone they want and select a time zone from a drop down and it'll create a hidden input with the correct name and id and keep the date in sync with what's selected in the picker [22:46:56] they'll probably want that... so maybe just copy the timezone dropdown from the report page [22:47:25] oh [22:47:42] and you can pass in ko.observable() for value: [22:47:56] if you want to test and just make sure that works [22:48:16] okay [22:48:52] try passing in ko.observable("{name: 'Central European Time', value: '+01:00'}") for timezone [22:49:03] (instead of copying that) [22:49:57] milimetric: like this? [22:50:00] https://www.irccloud.com/pastebin/Xu1lLJdd/ [22:50:53] yes, does that still not work? [22:51:05] oh wait... [22:51:05] sorry [22:51:21] lose the quotes on the thing inside the observable, madhuvishy [22:51:30] that thing just wants a plain object [22:51:48] ah okay [22:51:56] milimetric: still doesn't show up though [22:52:00] grrr [22:52:14] gimme a sec, I'll try it [22:52:24] milimetric: is there stuff specific to this in the report creation js? [22:52:35] I thought they were all in knockout util [22:52:55] i don't think so, no [22:53:00] okay [22:53:00] if there is, that's a bug [22:53:49] milimetric: hmmm, its probably just me missing something [22:54:03] yeah, but this is stupidly hard :( sorry [22:54:58] madhuvishy: what I'm gonna do is try adding this on a random page and step through the datetimepicker binding's update call (https://github.com/wikimedia/analytics-wikimetrics/blob/master/wikimetrics/static/js/knockout.util.js#L25) [22:55:40] milimetric: can you batcave? 
I wanna see you debug it [22:55:55] sure [22:56:23] ok i'm there [22:59:40] Analytics-Backlog: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1819140 (Tbayer) NEW [23:13:16] Analytics-Backlog: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1819183 (Tbayer) [23:27:52] hey a-team, signing off, have a good end of day, bye! [23:28:03] nite! [23:28:09] good night mforns, I'll review your code later today :) [23:28:24] thanks madhuvishy! [23:37:56] (CR) Madhuvishy: "This is cool, just one comment." (1 comment) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/254068 (https://phabricator.wikimedia.org/T117287) (owner: Mforns) [23:45:04] Analytics-Backlog, Fundraising research, Research-and-Data: FR tech hadoop onboarding - https://phabricator.wikimedia.org/T118613#1819359 (DarTar) @atgo I'll assign this to you so you can coordinate with @madhuvishy and Analytics for data access as needed.