[00:01:24] Analytics-Backlog, Discovery, Reading-Infrastructure-Team: Determine proper encoding for structured log data sent to Kafka by MediaWiki - https://phabricator.wikimedia.org/T114733#1815581 (bd808) >>! In T114733#1812461, @EBernhardson wrote: > Avro's json format might be a better choice for writing to ka... [00:18:41] Analytics-Backlog, Fundraising research, Research-and-Data: FR tech hadoop onboarding - https://phabricator.wikimedia.org/T118613#1815658 (madhuvishy) @atgo - Do you need to be able to run Hive queries? You were added to statistics-privatedata-users that doesn't allow for that. You'd have to request to... [01:51:12] (PS2) Madhuvishy: [WIP] Setup celery task workflow to handle running reports for the Global API [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/253750 (https://phabricator.wikimedia.org/T118308) [01:52:00] (CR) jenkins-bot: [V: -1] [WIP] Setup celery task workflow to handle running reports for the Global API [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/253750 (https://phabricator.wikimedia.org/T118308) (owner: Madhuvishy) [02:32:19] (PS3) Madhuvishy: [WIP] Setup celery task workflow to handle running reports for the Global API [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/253750 (https://phabricator.wikimedia.org/T118308) [02:33:11] (CR) jenkins-bot: [V: -1] [WIP] Setup celery task workflow to handle running reports for the Global API [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/253750 (https://phabricator.wikimedia.org/T118308) (owner: Madhuvishy) [02:54:09] (PS4) Madhuvishy: [WIP] Setup celery task workflow to handle running reports for the Global API [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/253750 (https://phabricator.wikimedia.org/T118308) [03:43:42] randomly...any ideas what an .sbt file would look like for building hive udf's? i need a super simple udf that sum's an array so figured scala would be easy...the scala part was easy, but now sbt is a pain in my rear :P [08:08:19] Analytics-Backlog: Wikimedia "top" pageviews API has problematic double-encoded JSON - https://phabricator.wikimedia.org/T118931#1816459 (whym) [09:11:56] Analytics-Backlog, Database: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856#1816566 (jcrespo) As an update, this task showed to be more complex than initially thought. The complex setup of the eventlogging schema means that it is very prone to break, as [[ https... [10:20:08] (PS1) Addshore: Load WikimediaCurl in twitter file [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/254127 [10:20:22] (CR) Addshore: [C: 2 V: 2] Load WikimediaCurl in twitter file [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/254127 (owner: Addshore) [10:40:33] Hi all! Does anyone have any idea where these files come from? https://metrics.wmflabs.org/static/public/datafiles/Pageviews/ [10:41:35] Hi addshore [10:41:38] Yes we do :) [10:41:55] * addshore is just trying to explain the massive jump in https://vital-signs.wmflabs.org/#projects=wikidatawiki/metrics=Pageviews [10:42:25] addshore: Those files are compute using the aggregator tool (https://github.com/wikimedia/analytics-aggregator) [10:42:26] from roughly the 9th of this month to present day [10:42:26] apparently coming from the desktop site per the csv [10:43:09] hm [10:43:19] Is it explicitly page views? or domain requests? Is there a limit on the domains for wikidata? or is it a wildcard? 
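[Editor's note: a minimal sketch answering the 03:43:42 question above about an .sbt build for a Scala Hive UDF that sums an array. The project name, organization, versions and file layout are illustrative assumptions, not taken from the channel; if Hive's reflection-based UDF bridging proves too strict for array arguments, the class would need to be rewritten as a GenericUDF.]

// build.sbt -- hypothetical minimal build definition; versions are assumptions
name := "hive-udfs"
organization := "org.example"
version := "0.1.0"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
  // "provided" keeps the cluster-supplied Hive/Hadoop jars out of the packaged artifact
  "org.apache.hive" % "hive-exec" % "1.1.0" % "provided",
  "org.apache.hadoop" % "hadoop-common" % "2.6.0" % "provided"
)

// src/main/scala/SumArray.scala -- sums the non-null elements of an array<double> column
import org.apache.hadoop.hive.ql.exec.UDF
import scala.collection.JavaConverters._

class SumArray extends UDF {
  def evaluate(values: java.util.List[java.lang.Double]): java.lang.Double = {
    if (values == null) null
    else java.lang.Double.valueOf(values.asScala.filter(_ != null).map(_.doubleValue).sum)
  }
}

[After "sbt package", the jar would be registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION sum_array AS 'SumArray'.]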
[10:43:42] it is what is considered as pageviews [10:44:09] as a random stab in the dark might query.wikidata.org accidentally be bulked into this? [10:44:24] http://discovery.wmflabs.org/wdqs/#wdqs_usage << note the big usage spike of the query service at the same point [10:45:43] * joal looks at pageview definition for wikidata domain [10:46:02] link? :D [10:47:11] addshore: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L66 [10:47:37] yeh, we need to exclude query from that too! [10:47:45] =] [10:47:54] at least for wikidata [10:47:58] now the question addshore : what is the mime type of those requests [10:48:24] *checks* [10:49:03] content-type:application/sparql-results+json [10:50:02] addshore: and a path example ? [10:50:17] Because the logic behind pageview def is taking all that into account :) [10:50:38] http://tinyurl.com/nz8nvb7 << too long for irc ;) [10:50:53] https://query.wikidata.org/bigdata/namespace/wdq/sparql?query=stuffhere.. [10:51:34] although requests to https://query.wikidata.org/* should be ignored :) [10:51:37] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L297 [10:51:49] These requests shouldn't be included [10:52:00] hmmm [10:52:09] But ok, let's make a ticket to remove query.wikidata.org from pageview [10:52:16] *will do* [10:52:24] filed against analytics-backlog? [10:52:28] Let me double check on hive before [10:52:33] okay! [10:52:43] Because it might not even be that specific case [10:57:52] Analytics-Backlog, Analytics-General-or-Unknown, WMDE-Analytics-Engineering, Wikidata, Story: [Story] Statistics for Special:EntityData usage - https://phabricator.wikimedia.org/T64874#1816685 (JAllemandou) [10:58:19] also addshore, added the task about Special:EntityData to our backlog --^ [10:58:28] awesome :) [10:58:42] No promise on deadline though ;) [10:58:54] the end of the year would be amazing ;) [10:58:59] Analytics-Backlog, WMDE-Analytics-Engineering, Wikidata: Remove query.wikidata.org from pageview definition (for wikidata) - https://phabricator.wikimedia.org/T119054#1816686 (Addshore) NEW [10:59:06] also joal created this one for tracking this thing ^^ [10:59:11] k :) [10:59:23] addshore: can you add the comment in the task about deadline ? [11:01:09] sure! [11:01:43] Analytics-Backlog, Analytics-General-or-Unknown, WMDE-Analytics-Engineering, Wikidata, Story: [Story] Statistics for Special:EntityData usage - https://phabricator.wikimedia.org/T64874#1816693 (Addshore) It would also be great to have this running (perhaps with all possible historical data (I th... [11:02:31] basically we are trying to get a bundle of stuff done for the dev summit [11:02:52] addshore: on one hour (2015-11-14T19:00 UTC) --> No query.wikidata.org domain [11:03:13] These don't seem to show up in our webrequest [11:03:23] Looking for the full day [11:03:49] hmm, I am sure I have seen them in webrequest raw before!
[11:04:13] as I was looking at the response codes [11:04:41] ohhh, maybe response code thing: I only kept 200 and 304 [11:05:23] hmm, well there should be lots of 200s in there [11:06:09] So for the given hour, only www.wikidata.org or m.wikidata.org [11:06:51] for the given hour: 2317120 on www, not pageview, and 146672 on www pageview [11:07:17] on mobile, it's negligible (4576 and 2410 respectively) [11:07:19] addshore: --^ [11:07:44] for 2015-11-14T19:00 ? [11:08:32] yep, mobile has minimal usage on wikidata currently [11:08:50] addshore: https://gist.github.com/jobar/e8954d5cac5b7c605620 [11:09:55] ahh, *didn't know about the is_pageview value there* [11:11:05] addshore: it's the one on which we filter to get the pageview_hourly table :) [11:11:38] addshore: but still, no query.wikidata.org :( [11:11:47] I'm still mildly confused then, as this massive spike doesn't show in reqstats at all [11:12:26] well, I guess reqstats !== page views but [11:12:39] I don't know about reqstats [11:13:03] this still seems very odd to me ;) [11:13:34] addshore: maybe some bot managing to get through our filtering (not very difficult, not to say very easy) [11:19:03] *keeps digging* [11:26:59] addshore: daily results seem reasonable (3M pageviews) [11:27:11] addshore: digging into user agent [11:27:57] wait, 3M pageviews? :P that's even more than is showing in the csv! :P [11:28:20] Analytics-Backlog, WMDE-Analytics-Engineering, Wikidata: Investigate wikidata pageview sipke on 2015-11-14 - https://phabricator.wikimedia.org/T119054#1816728 (JAllemandou) [11:28:34] addshore: not filtered by agent_type = 'user' :) [11:28:50] addshore: changed the title of the task you created --^ [11:28:55] awesome! [11:29:15] not filtered by agent_type ahh! [11:29:41] So I assume half/half roughly [11:29:47] (user / spider) [11:43:14] so odd, and looking at the pageview api it doesn't look like any 1 page has been accessed an extreme amount [11:43:54] addshore: I think I have the culprit: http-kit/2.0 not considered as spider :( [11:44:22] Is that the whole user agent? >.> [11:44:44] yesir [11:44:50] hah [11:45:19] makes about half of the pageview requests for the hour I study [11:45:22] HTTP client/server for Clojure [11:45:27] wow [11:45:32] Th [11:45:36] This explains that [11:45:46] Let's change again the task you filed [11:45:47] is that have of all pageviews, or half of non bot pageviews? [11:45:53] *half [11:46:02] half all pageviews [11:46:07] hah [11:46:18] coming from the same IP as well at a guess? [11:46:24] http-kit/2.0 84311 [11:46:41] second row: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) 19648 [11:46:47] A good power law :) [11:46:53] Didn't check the ip [11:48:34] I just looked at a few live requests and it looks like it's loading all revisions... :P [11:50:14] k [11:52:48] Now addshore that thing is a concern for me because of that: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java#L58 [11:53:13] hmmmm [11:53:15] addshore: Normally, useragents containing http show up as spiders [11:54:08] And I have counter examples in the list I have generated for wikidata [11:54:11] :( [11:55:19] Analytics-Kanban, Datasets-Webstatscollector: Wikimedia "top" pageviews API weirdness with the "Paul_Elio" article - https://phabricator.wikimedia.org/T118933#1816777 (Nemo_bis) Please remember to add a specific blue project to all tasks related to pageviews data.
[11:55:58] Analytics-Backlog, Datasets-Webstatscollector, Language-Engineering: Investigate anomalous views to pages with replacement characters - https://phabricator.wikimedia.org/T117945#1816779 (Nemo_bis) [11:56:40] joal: :/ [11:57:55] this thing is making requests to /all/ sites it would seem [11:58:17] and doing the same thing, seemingly requesting random old versions of random pages [11:59:33] joal: how heavy / long would a query looking at all sites take? for this useragnt / the IP? [12:01:12] not sure I understand: what time-frame, and what restriction/data: when user_agent = http-kit/2.0 ? [12:01:27] Analytics-Backlog, Datasets-General-or-Unknown: Wikimedia "top" pageviews API has problematic double-encoded JSON - https://phabricator.wikimedia.org/T118931#1816785 (Nemo_bis) [12:01:37] well, just a sample of an hour would be fine *is writing it* [12:08:40] https://www.irccloud.com/pastebin/aRLv4npA/ [12:09:07] joal: ^^ requests from the useragent & IP for 1 hour per site [12:09:21] wow, that guy crawl heavily [12:09:42] yeh, but why the hell is it crawling old revisions? [12:15:22] k addshore, found the full explanation: I messed up at deploy time of the change for user agent to exclude http [12:15:27] I'm gonna fix that today [12:16:08] :D Cool! Just stick stuff in that ticket :) [12:16:17] I will :) [12:16:20] epic! [12:16:41] also, off the top of your head does the top page views api bit exclude crawlers? ;) [12:18:02] Analytics-Kanban, WMDE-Analytics-Engineering, Wikidata: Fix '.*http.*' not being tagged as spiders in webrequest - https://phabricator.wikimedia.org/T119054#1816809 (JAllemandou) p:Triage>Unbreak! a:JAllemandou [12:18:35] addshore: from 2015-11-17, yes, before, data need to be backfilled (https://phabricator.wikimedia.org/T118991) [12:19:03] okay! [12:22:13] Analytics-Kanban, WMDE-Analytics-Engineering, Wikidata: Fix '.*http.*' not being tagged as spiders in webrequest - https://phabricator.wikimedia.org/T119054#1816815 (JAllemandou) I messed up a deploy about a month ago, preventing the change merged here: https://gerrit.wikimedia.org/r/#/c/244465/ to act... [12:23:37] Analytics-Kanban: Investigate cassandra daily top job [5 pts] {slug} - https://phabricator.wikimedia.org/T118449#1816817 (JAllemandou) [12:38:18] (PS1) Joal: Upgrade refine oozie job to jar v0.0.20 [analytics/refinery] - https://gerrit.wikimedia.org/r/254133 (https://phabricator.wikimedia.org/T119054) [12:38:55] (CR) Joal: [C: 2] "Self merging bug" [analytics/refinery] - https://gerrit.wikimedia.org/r/254133 (https://phabricator.wikimedia.org/T119054) (owner: Joal) [12:39:19] (CR) Joal: [V: 2] "Self merging bug" [analytics/refinery] - https://gerrit.wikimedia.org/r/254133 (https://phabricator.wikimedia.org/T119054) (owner: Joal) [12:43:01] !log Deploying refinery [12:44:04] Analytics-Kanban, WMDE-Analytics-Engineering, Wikidata, Patch-For-Review: Fix '.*http.*' not being tagged as spiders in webrequest [5 pts] {hawk} - https://phabricator.wikimedia.org/T119054#1816854 (JAllemandou) [12:53:37] !log Restart refine bundle [13:34:29] Analytics, Beta-Cluster-Infrastructure: deployment-fluorine fails puppet '/usr/sbin/usermod -u 10003 datasets' returned 4: usermod: UID '10003' already exists - https://phabricator.wikimedia.org/T117028#1816912 (faidon) a:ArielGlenn This is a long-known problem that @ArielGlenn and I had discussed bef... 
[14:17:15] Analytics-Backlog, Analytics-General-or-Unknown, WMDE-Analytics-Engineering, Wikidata, Story: [Story] Statistics for Special:EntityData usage - https://phabricator.wikimedia.org/T64874#1817016 (Addshore) [14:21:28] Analytics-Backlog, MediaWiki-API, Reading-Infrastructure-Team, Research-and-Data, and 4 others: Publish detailed Action API request information to Hadoop - https://phabricator.wikimedia.org/T108618#1817055 (Addshore) [14:40:08] halfak__: give me a ping when you're up halfak__ :) [14:40:17] exit [14:40:21] oops :) [14:41:06] o/ joal [14:44:14] ping joal [14:49:15] addshore: http-kit/2.0 is considered a spider from now on ! [14:49:18] addshore: http-kit/2.0 is considered a spider from now on ! [14:49:29] amazing! :) [14:49:42] halfak: I'm sorry, it actually didn't ping :( [14:50:00] No worries [14:50:13] I was just about to restart my client ;) [14:50:17] so halfak, I was pissed yesterday, and rewrote the job as a map reduce with secondary sorting [14:50:40] Tested on a single one that had failed previously, --> worked [14:50:50] now running on the full stuff [14:50:53] Seems ok [14:51:10] Best programming == "Goddamn it this should work" [14:51:11] Right *goes to go and do a comparison of influxdb vs opentsdb vs graphite.... [14:51:13] :D [14:51:28] yessir :) [14:51:42] halfak: finished late yesterday night, but happy :) [14:52:05] Cool! Thanks joal. I also started up some work. It was pretty easy to change my scripts to handle XML, but I needed to get my python environment set up on the IA/Research cluster. [14:52:13] So I'm glad you beat me to it :D [14:52:28] halfak: I have also figured out an interesting improvement: pushing predicates down into the input format -- Not having to decode the json to filter / generate sorting keys [14:52:54] Not yet fully implemented, but will soon [14:53:13] Yes! I would love to have this in streaming. [14:53:25] I do a lot of json2tsv, tsv2json work. [14:53:52] halfak: well, I still have to parse the xml though :) [14:54:26] halfak: But since this step is inevitable, I prefer to have as much of the predicate work as possible done after the first parse step [14:55:13] +1 [14:56:26] dcausse: Hi Sir :) [14:56:39] dcausse: To let you know that your oozie has been deployed [14:56:43] joal: hi! [14:56:49] thanks! :) [14:56:49] Ready to launch whenever you want dcausse [14:56:58] well we still have issues with avro :( [14:57:09] dcausse: I know that you still have schema management issues, but still wanted to let you know :) [14:57:12] yup [14:57:54] I'll wait for the team to agree, but will probably move your task to done and let you start the job when you're ready dcausse, ok ? [14:58:08] joal: sounds good [14:58:15] awesome dcausse :) [15:00:33] Analytics-Kanban, Datasets-Webstatscollector: Wikimedia "top" pageviews API weirdness with the "Paul_Elio" article [5 pts] {slug} - https://phabricator.wikimedia.org/T118933#1817184 (JAllemandou) [15:03:14] joal, I see a succeeded job on the research cluster with 18 maps and 1 reduce. [15:03:26] halfak: was my test bed :) [15:03:36] currently running job is the real big one [15:03:42] Gotcha. [15:03:48] halfak: 6% into reduce [15:03:59] halfak: usual 20h map [15:04:03] How many reducers? [15:04:09] 2000 halfak [15:04:13] Great :) [15:04:20] \o/ [15:04:24] I tried to remember the number you asked me ;) [15:04:40] Well halfak, until it's finished, I won't say anything [15:04:55] But I feel more confident with that run [15:05:04] Still. It's starting to feel like we're building with Legos instead of atoms.
:) [15:05:44] hm halfak, for me it's more: since the lego I had failed, I went back to atoms to build my own lego piece :) [15:06:14] But getting dirty hands on this kind of problem is where you learn more, I think :) [15:06:18] halfak: --^ [15:06:31] It's painful and frustrating, but I learnt :) [15:08:08] I look forward to the point at which we have high-functioning, battle-tested legos to write some docs about. [15:09:00] halfak: I am actually quite happy with the formatting I have in scala, I'll show it to you in a while and see how we can promote that if research needs it :) [15:15:42] :) It'll be good that I get a handle on this at least so that I can modify and re-use it. [15:20:07] Analytics-Backlog, Discovery, Reading-Infrastructure-Team: Determine proper encoding for structured log data sent to Kafka by MediaWiki - https://phabricator.wikimedia.org/T114733#1817270 (Ottomata) Could be worth it. We had a meeting yesterday and discussed at least trying this. We want to see if we... [15:32:24] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1817294 (Ottomata) FYI, the repo is here, waiting for some schemas! :) https://gerrit.wikimedia.org/r/#/admin/projects/mediawik... [15:45:58] (PS9) DCausse: Add support for custom timestamp and schema rev id in avro message decoders [analytics/refinery/source] - https://gerrit.wikimedia.org/r/251267 (https://phabricator.wikimedia.org/T117873) [15:52:56] Analytics-Backlog, Discovery, Reading-Infrastructure-Team: Determine proper encoding for structured log data sent to Kafka by MediaWiki - https://phabricator.wikimedia.org/T114733#1817361 (dcausse) Unfortunately AvroJson won't help to resolve the issues we have with AvroBinary, if the schema used by the... [15:55:30] bd808: sure, let's talk today [15:57:22] addshore: you can also contribute code patches/tests if you want your changes to be done sooner [15:57:46] nuria: of course :) [15:58:08] dcausse: yt? [15:59:12] dcausse: how did you work around the offset issues? [15:59:59] nuria: thanks. I have a magically meeting-free day today so I can talk whenever you have time. [16:00:30] let me reach dcausse before cause he is on european tz, what is your tz? [16:00:48] GMT-7 (MST) [16:01:23] nuria: hi! [16:01:31] you were right AvroJson won't work :( [16:01:44] hi yalls! [16:01:51] dcausse: loads of work.. eh? [16:02:04] it's exactly the same requirements, the writerSchema needs to be known even with AvroJson [16:02:05] dcausse: i read your super handy workaround [16:02:18] nuria, dcausse, i'm looking at eventbus stuff too, really not sure how all this is going to work, but i'm leaning more and more to making meta.wm.org http schemas be the way we share schemas...not sure though [16:02:21] avro is very strict :/ [16:02:25] dcausse: but 1st.. how did you get around the offset stuff? [16:02:33] would like to see if we can make it work with file based repo and with avro [16:02:53] ottomata: I wrote a workaround that uses the classpath for now [16:02:53] ottomata: did you read dcausse 's e-mail on analytics-internal [16:02:56] ?
[16:03:28] I think it will be easy to change it to use an external service (file or http) [16:03:29] dcausse: that is a workaround we can use I think (need to think about cases but do read it and let me know) [16:03:32] yes [16:04:08] ok, let's 1st work on unblocking search team [16:04:08] the format in kafka will be magicbyte + long(8 bytes) [16:04:28] hm, dcausse do you need a reader schema for camus? [16:04:33] if you have the writer schema, that should be enough, no? [16:04:37] ottomata: 1st unblocking and later a solution that fetches schemas in a more solid fashion [16:04:40] evolution can be handled when reading with hive? [16:04:51] ottomata: probably [16:05:30] i mean, i guess it's nice, but it does make it a little more complicated [16:05:36] with the property, etc. [16:05:40] mmmm..let's work the case of reading old data from kafka [16:05:49] if you prefer I can remove this latest schema code [16:05:56] nuria: reading old data would just use the given writer schema [16:06:02] and use that to write to hadoop [16:06:15] then when reading historical data, say, via hive, you'd use the latest schema [16:06:31] and hive would have each writer schema from the header of each binary file [16:06:48] ottomata: you still need to be sure that writerSchema is < readerSchema [16:07:13] yeah, you'd have to evolve properly [16:07:18] that's the only advantage of specifying readerSchema in camus [16:07:21] but that would be true for kafka too, right? [16:07:25] oh, hm [16:07:29] yeah you'd validate that before writing? [16:07:43] I have not tested your idea [16:07:54] i mean, yeah it is a little nicer to have camus always writing with the same schema [16:07:55] but I can remove some code and test if you want [16:08:07] it probably will normalize things a bit in hadoop [16:08:46] the schema used to write a file will be the same for each camus run, which means the time at which the binary file schema changes in hadoop will always be the same for all data [16:09:09] rather than the case where say there are 2 producers, one of which has schema A, the other schema B, and camus' latest schema C [16:09:22] in that case, if you just use the writer schema, you'd have both A and B in the same hourly partition [16:09:23] hm. [16:09:34] i am getting lost [16:09:57] hah, i'm arguing pros and cons of using a reader schema at all with camus (that is, doing schema evolution in camus) [16:10:01] I think you're right it's not necessary to have the readerSchema, if the writerSchema is not known in camus it will fail anyway... [16:10:21] right, there are some pros to having it though, with adding just a little complexity [16:10:32] i guess that can be optional dcausse? if the property isn't set it'll just use the writer? [16:10:48] not sure but I can make it I think [16:10:56] good idea [16:12:56] joal / ottomata I have some oozie stuff for this geo-coded pageview stuff [16:13:04] dcausse: give us a few minutes as ottomata and i need to discuss another project we got going on. I will get back to you [16:13:04] where do you think it makes sense in refinery? [16:13:18] milimetric: new stuff totally ?
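[Editor's note: a sketch of the "magicbyte + long(8 bytes)" Kafka message framing described at 16:04:08: one marker byte, an 8-byte schema revision id, then the Avro binary payload. The magic byte value and the object/method names are assumptions for illustration, not the code under review in Gerrit.]

import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

object AvroKafkaFraming {
  val MagicByte: Byte = 0x0 // hypothetical marker meaning "binary Avro with schema rev id"

  // Producer side: magic byte + 8-byte rev id + Avro binary body
  def encode(record: GenericRecord, writerSchema: Schema, schemaRevId: Long): Array[Byte] = {
    val body = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(body, null)
    new GenericDatumWriter[GenericRecord](writerSchema).write(record, encoder)
    encoder.flush()
    val payload = body.toByteArray
    ByteBuffer.allocate(1 + 8 + payload.length)
      .put(MagicByte)
      .putLong(schemaRevId)
      .put(payload)
      .array()
  }

  // Consumer side: returns (schemaRevId, avroBinaryPayload); the caller looks up the
  // writer schema by rev id and may apply a newer compatible reader schema on decode.
  def decode(message: Array[Byte]): (Long, Array[Byte]) = {
    val buf = ByteBuffer.wrap(message)
    require(buf.get() == MagicByte, "unexpected magic byte")
    val revId = buf.getLong()
    val payload = new Array[Byte](buf.remaining())
    buf.get(payload)
    (revId, payload)
  }
}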
[16:13:19] oozie/pageview/geographical [16:13:22] dcausse: I still want to know how did you deal with the offset stuff [16:13:33] well, yeah, it's new [16:13:45] depends on pageview_hourly (not confirmed yet, but probably) [16:13:48] nuria: I did not test with camus in prod yet [16:14:01] dcausse: ah ok, just with unit tests [16:14:04] hm, wonder --> we have projectview at same level as pageview [16:14:05] maybe oozie/pageview/hourly/geographical ? [16:14:08] I've just tested with unit test and found that the problem remains the same :( [16:14:58] milimetric: I'd go for pageview/geo/hourly (or other) [16:15:00] dcausse: i see, that is actually a better path than the one i took, it is too bad we do not have a better test env for this [16:15:01] oops, I guess I mean it depends on projectview [16:15:15] milimetric: project based only, huh [16:15:16] cool, so oozie/projectview/geo/hourly [16:15:20] yes [16:15:27] if ottomata says so :) [16:15:49] ok, ottomata: bless this path or condemn it: "oozie/projectview/geo/hourly" [16:21:18] seems that ottomata is in a "no decision" mood milimetric ;) [16:21:45] it's ok, IRC does not come with a "must answer right away" contract :) [16:33:10] haha [16:33:13] was in batcave with nuria [16:33:22] milimetric: what is this data? [16:33:47] it's data for Erik's geographic breakdown of pageviews by project [16:33:51] hourly [16:33:53] dcausse: back, then, my preference will be to eventually use confluent schema registry but in order to unblock you we can use your suggested interim solution, we need to wrap it such that the schema retrieval [16:33:59] (he currently has it at 15 minute intervals [16:34:01] ) [16:34:07] but I'm seeing if he's ok with hourly [16:34:37] dcausse: can be swapped to use schema registry later (or even our meta json extension for schemas) [16:34:43] milimetric: where will this data be saved in hdfs? [16:34:54] will there also be a new hive table? [16:34:55] nuria: yes I think it will be relatively easy [16:34:57] what will it be called? [16:35:05] dcausse: https://meta.wikimedia.org/wiki/Schema:EditorActivation [16:35:07] no hive table, just files [16:35:13] Erik said it's up to us where we put the files [16:35:33] do you want to use meta now? [16:35:38] dcausse: either one will work later i think but deploying the schema registry now will delay you guys too much i think [16:36:08] dcausse: and so will using meta [16:36:12] milimetric: hm [16:36:17] what do you think though? [16:36:21] ok, will use the classpath as a repo for now and change it later, what's important is to make sure we won't need to change the format in kafka to encode the schema rev id [16:36:41] dcausse: right, exactly that would not change [16:36:43] milimetric: where does the projectview data live now? [16:36:54] well there's a projectview table for that [16:37:08] and then it's archived in ...../archive/pageview/2015... etc. [16:37:14] dcausse: I think we need to do a bit more work on the code though, let me think about it a bit [16:37:18] hm [16:37:21] milimetric i'm looking at [16:37:24] ok [16:37:26] /mnt/hdfs/wmf/data/archive/projectview on stat1002 [16:37:30] legacy webstatscollector ? [16:37:42] one sec lemme reread that oozie [16:37:49] i think i am remembering [16:37:59] legacy is the legacy format, not legacy data def [16:38:11] right joal?
(i know we decided on this together...:p) [16:38:35] right, https://github.com/wikimedia/analytics-refinery/blob/master/oozie/projectview/hourly/coordinator.properties#L70 [16:38:41] OH and webstatscollector is not used anymore, right? we were going to call it that [16:38:45] but decided not to? [16:38:47] archive/projectview/legacy/hourly [16:39:14] yeah, ok, joal, can we remove the archive/projectview/webstatscollector directory? [16:39:16] yes, that folder was wsc instead of legacy in my patch, and we decided against it [16:39:18] i think it is not used [16:39:44] i don't remember ever wanting to call webstatscollector spelled out [16:39:50] milimetric: would it make sense to store this data at archive/projectview/geo/hourly [16:39:50] ? [16:40:02] yes, that makes sense to me [16:40:35] I have read the thread - sounds good :) [16:40:57] i'm going to put a README file in the legacy/ dir for projectview and pageview linking to the oozie readme file, ok? [16:41:11] ok, milimetric i'm fine with that then [16:41:18] cool, thx [16:41:18] IIRC we changed from legacy to webstatscollector with milimetric when he made the change about pageviews [16:41:48] the oozie path seems fine too, a little weird since we'll have projectview/{hourly,geo/hourly} [16:41:58] no, the other way. It was wsc and we changed it to legacy, lemme find the gerrit change [16:42:02] that's right [16:42:07] (PS10) DCausse: Add support for custom timestamp and schema rev id in avro message decoders [analytics/refinery/source] - https://gerrit.wikimedia.org/r/251267 (https://phabricator.wikimedia.org/T117873) [16:42:07] k milimetric [16:42:14] memory .... [16:43:50] joal: you're right and I'm right, we both remembered half of the story [16:43:51] https://gerrit.wikimedia.org/r/#/c/246149/6..7/oozie/pageview/hourly/coordinator.properties [16:44:00] we changed it from webstatscollector to legacy [16:44:13] k [16:44:52] k, so i'm going to delete the webstatscollector dirs there [16:44:55] there isn't new data in them [16:47:43] dcausse: see https://github.com/linkedin/camus/blob/master/camus-kafka-coders/src/main/java/com/linkedin/camus/etl/kafka/coders/JSONToAvroMessageDecoder.java#L102 [16:47:47] (CR) Ottomata: "I'd recommend to not use schema id 0 (I may be using it to mean 'latest known schema' in EventBus stuff), and also to make any schema rev" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/251267 (https://phabricator.wikimedia.org/T117873) (owner: DCausse) [16:48:16] dcausse: that is the built in support to read schema id from incoming stream correct? [16:48:32] nuria: this one is for avro json only I think [16:49:25] huh, that is kinda cool though [16:49:42] that means you don't have to do magic integer thing with avro json [16:49:47] because the json can be parsed either way [16:49:51] and the schema id can be read out of the json [16:49:53] nuria, ottomata : are you ok if we move dcausse task https://phabricator.wikimedia.org/T117575 to done (I deployed today) [16:50:07] joal: yes, thank you [16:50:10] nuria, ottomata : No job has been started but the code is there [16:50:45] and not sure it will handle timestamp correctly [16:51:03] Analytics-Kanban: JsonRevisionsSortedPerPage failed on enwiki-20150901-pages-meta-history [13 pts] {paon} - https://phabricator.wikimedia.org/T114359#1817522 (JAllemandou) [16:51:08] it uses its own CamusWrapper that supports only timestamp in millisec [16:51:22] dcausse: yes, timestamp part will not work.
[16:51:51] Analytics-Kanban: JsonRevisionsSortedPerPage failed on enwiki-20150901-pages-meta-history [13 pts] {paon} - https://phabricator.wikimedia.org/T114359#1693022 (JAllemandou) I tested various memory, each failed. I finally went and rewrote the job using core mapreduce API instead of using scrunch. Job is still... [16:52:30] ottomata: concerning rev 0, I need something to handle the case where the kafka message does not have any magic bytes [16:52:37] dcausse: given the issues with schemas, do we wnat to produce json from mw, or avro? [16:52:42] *want [16:53:19] nuria: it depends, if you want to use the patch I've made avro seems easier as we don't have to convert to AvroJson [16:53:50] if you prefer to use the classes from linkedin then AvroJson seems easier, but we still need to re-work this timestamp issue [16:54:28] dcausse: avro sounds good. that is what i thought. In that case, the int for the schema should not need to be added as it is already the avro convention [16:54:52] ? [16:54:54] dcausse: and if your logging utility does avro binary it is probably already set up to add that int, correct? [16:55:11] I don't think so... [16:55:25] maybe you're talking about the ObjectContainer format? [16:55:36] dcausse: in the producer side [16:55:48] https://avro.apache.org/docs/1.7.7/spec.html#Object+Container+Files [16:56:21] the schema rev id is not part of the avro spec... but maybe I'm wrong [16:56:29] Analytics-Kanban: Prepare Pageview API lightning talk {melc} - https://phabricator.wikimedia.org/T119091#1817541 (mforns) NEW a:mforns [16:57:28] nuria: no [16:57:36] the php producer they are using will need to be modified to add the byte [16:57:39] pretty sure [16:57:40] dcausse: I thought it was .. but no sounds like i am the one that is wrong [16:57:52] that is just a special convention that everyone in the kafka world is using [16:57:56] it is not built into avro [16:58:12] ottomata: ahhhh [16:58:31] dcausse: we have standup, we'll be back [16:58:40] k [16:58:42] hm MHMMm HMMH [16:59:07] dcausse: if this is built into the camus avro json encoder... i would lean towards using avro json...but i don't want to keep going back and forth on you [16:59:26] :) [16:59:33] i guess it depends on how difficult it is to modify the php stuff to do avro json [16:59:42] i mean, you'll have to modify it anyway to do the integer byte [16:59:45] hm. [16:59:53] and the timestamp problem [16:59:54] i dunno. [16:59:55] yeah [17:00:03] but the timestamp problem is easy, we know what to do. [17:00:17] but you'll have to rewrite the built-in [17:00:26] yeah i guess so... [17:00:39] yeah standup... hm will keep thinking, if you really want to do the binary thing, i think it's ok [17:02:01] Analytics-EventLogging, Analytics-Kanban, EventBus, Patch-For-Review: Deploy eventlogging from new repository.
[5 pts] - https://phabricator.wikimedia.org/T118863#1817566 (Ottomata) [17:02:42] Analytics-Kanban, RESTBase, Services, RESTBase-API: configure RESTBase pageview proxy to Analytics' cluster {slug} [34 pts] - https://phabricator.wikimedia.org/T114830#1817569 (kevinator) Open>Resolved [17:03:19] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1817571 (kevinator) Open>Resolved [17:03:37] Analytics-Kanban: Pageview API Press release {slug} [2 pts] - https://phabricator.wikimedia.org/T117225#1817572 (kevinator) Open>Resolved [17:03:50] Analytics-Backlog: Traffic Breakdown Report - Visiting Country {lama} - https://phabricator.wikimedia.org/T115605#1817576 (kevinator) [17:03:51] Analytics-Kanban: Understand the Perl code for this report - visiting country {lama} - https://phabricator.wikimedia.org/T117243#1817574 (kevinator) Open>Resolved [17:04:02] Analytics-Kanban: Understand the Perl code for this report - Visiting Country per Wikipedia Language {lama} - https://phabricator.wikimedia.org/T117244#1817578 (kevinator) Open>Resolved [17:04:03] Analytics-Backlog: Traffic Breakdown Report - Visiting Country per Wikipedia Language {lama} - https://phabricator.wikimedia.org/T115608#1817580 (kevinator) [17:04:20] Analytics-Kanban: Missing Pageview API data for one article {slug} [3 pts] - https://phabricator.wikimedia.org/T118785#1817581 (kevinator) Open>Resolved [17:04:32] Analytics-Kanban: Backfill cassandra pageview data - September [5 pts] {slug} - https://phabricator.wikimedia.org/T118450#1817583 (kevinator) Open>Resolved [17:07:05] Analytics-Kanban, Datasets-Webstatscollector: Wikimedia "top" pageviews API weirdness with the "Paul_Elio" article [5 pts] {slug} - https://phabricator.wikimedia.org/T118933#1817585 (Milimetric) @Nemo_bis: you need us to add Datasets-Webstatscollector on all pageview data tasks? I'm not familiar... [17:07:55] Analytics-Kanban: Pageview API documentation for end users {slug} [8 pts] - https://phabricator.wikimedia.org/T117226#1817587 (kevinator) [17:08:05] Analytics-Kanban: Pageview API documentation for end users {slug} [8 pts] - https://phabricator.wikimedia.org/T117226#1817588 (kevinator) Open>Resolved [17:08:12] is there an existing repo for hive udf's? I needed something that could sum up array and hive didn't have anything built in [17:08:25] Analytics-Kanban: Troubleshoot Hebrew characters in Wikimetrics {dove} [2 pts] - https://phabricator.wikimedia.org/T118574#1817591 (kevinator) Open>Resolved [17:08:39] Analytics-EventLogging, Analytics-Kanban, EventBus, Patch-For-Review: Deploy eventlogging from new repository [5 pts] - https://phabricator.wikimedia.org/T118863#1817593 (kevinator) [17:08:50] Analytics, Analytics-Kanban, Discovery, EventBus, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1817596 (kevinator) [17:08:54] Analytics-EventLogging, Analytics-Kanban, EventBus, Patch-For-Review: Deploy eventlogging from new repository [5 pts] - https://phabricator.wikimedia.org/T118863#1811200 (kevinator) Open>Resolved [17:09:06] Analytics-Kanban: Research avro schema evolution, do we need a write and reader schema? 
- https://phabricator.wikimedia.org/T119092#1817597 (Nuria) NEW [17:09:15] Analytics-Backlog: Traffic Breakdown Report - Visiting Country per Wiki {lama} - https://phabricator.wikimedia.org/T115607#1817606 (kevinator) [17:09:17] Analytics-Kanban: Understand the Perl code for "Visiting Country per Wiki" report {lama} - https://phabricator.wikimedia.org/T117247#1817604 (kevinator) Open>Resolved [17:09:23] ebernhardson: analytics/refinery/source/refinery-hive [17:09:33] I added a new UDF in https://gerrit.wikimedia.org/r/#/c/253046/ [17:10:02] bd808: ok thanks! [17:12:20] Analytics-Kanban: Research avro schema evolution, do we need a write and reader schema? - https://phabricator.wikimedia.org/T119092#1817608 (Nuria) See encoding of the same data with schema1 and schema1 with an additional "optional" field. Note that the absence of the "additional " field is represented by an... [17:19:19] bd808: :Wq [17:19:22] ignore that :) [17:19:39] * bd808 force saves and quits [17:19:53] (CR) Ottomata: "Hm, I'm unsure about the terms 'internal' and 'external'. They are very subjective based on point of view. Perhaps it won't matter...and" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/253046 (https://phabricator.wikimedia.org/T118592) (owner: BryanDavis) [17:21:44] Analytics-Backlog, RESTBase, Services: configure RESTBase pageview proxy to Analytics' cluster on wiki-specific domains - https://phabricator.wikimedia.org/T119094#1817629 (Milimetric) NEW [17:22:14] Analytics-Kanban, RESTBase, Services, RESTBase-API: configure RESTBase pageview proxy to Analytics' cluster {slug} [34 pts] - https://phabricator.wikimedia.org/T114830#1707320 (Milimetric) @mobrovac: I opened up T119094 to continue the work. [17:22:19] dcausse: how will you know what schema to use at all if the kafka message doesn't have the magic byte? [17:22:51] (CR) BryanDavis: "> I'm unsure about the terms 'internal' and 'external'." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/253046 (https://phabricator.wikimedia.org/T118592) (owner: BryanDavis) [17:22:59] Analytics-Kanban, Datasets-General-or-Unknown: Wikimedia "top" pageviews API has problematic double-encoded JSON - https://phabricator.wikimedia.org/T118931#1817646 (Milimetric) p:High>Unbreak! a:Milimetric [17:26:44] Analytics-Backlog, Database: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856#1817656 (Nuria) @jcrespo: Super thanks for your work on this. We understand that priority wise this ticket is not urgent, while being important. We will watch this ticket for any upcomi... [17:27:58] ottomata: it was the purpose of this rev 0 [17:28:23] just to correctly support existing stream, but I can drop backward compat support [17:28:30] right but, what info do you have to even know the schema name? [17:28:33] oh, from that topic config? [17:29:16] topic+rev, it's how the camus/kafka SchemaRegistry works [17:29:17] bd808: "Reasonable point. We can easily change the color of the shed at this point." -> nice, thanks for being flexible [17:29:28] right, our camus thing, hm. [17:29:40] and by using rev 0, if all is backwards compat, i'tll just drop the new fields, or whatever [17:29:41] hm [17:29:58] dcausse: i think if you are going to build in this binary support, we should not support to message formats [17:30:00] two* [17:30:02] not really "our" camus thing but com.linkedin.camus.schemaregistry.SchemaRegistry [17:30:08] nuria, ottomata: how about "wikimedia", "wikimedia_labs", and "internet" for the labels? 
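[Editor's note: a small self-contained sketch for T119092 just above, showing standard Avro schema resolution: a record written with an old writer schema is read back with a newer reader schema that adds an optional (nullable, defaulted) field. The SearchEvent schema is invented for illustration.]

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object SchemaEvolutionDemo extends App {
  // Old (writer) schema: one field
  val writerSchema = new Schema.Parser().parse(
    """{"type":"record","name":"SearchEvent","fields":[{"name":"query","type":"string"}]}""")

  // New (reader) schema: adds an optional field with a default
  val readerSchema = new Schema.Parser().parse(
    """{"type":"record","name":"SearchEvent","fields":[{"name":"query","type":"string"},{"name":"hits","type":["null","int"],"default":null}]}""")

  // Encode a record with the old schema
  val rec = new GenericData.Record(writerSchema)
  rec.put("query", "pageview api")
  val out = new ByteArrayOutputStream()
  val enc = EncoderFactory.get().binaryEncoder(out, null)
  new GenericDatumWriter[GenericRecord](writerSchema).write(rec, enc)
  enc.flush()

  // Decode with both schemas: Avro fills the missing field from its default
  val dec = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
  val read = new GenericDatumReader[GenericRecord](writerSchema, readerSchema).read(null, dec)
  println(read) // {"query": "pageview api", "hits": null}
}

[As long as added fields are nullable with defaults, data written with the older schema stays readable under the newer one, which is the property the writer-vs-reader schema discussion above depends on.]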
[17:30:15] it uses topic? [17:30:17] ottomata: ok [17:30:21] ottomata: yes [17:30:25] i thought it just used a unique schema id [17:30:30] looking [17:30:35] no, the way it works is bit messy [17:30:47] huh! [17:31:35] imho these classes are designed for a very specific use case at linkedin [17:31:41] aye [17:33:43] I guess it's one of the reason we wrote our own decoder based on https://github.com/linkedin/camus/blob/master/camus-kafka-coders/src/main/java/com/linkedin/camus/etl/kafka/coders/JSONToAvroMessageDecoder.java#L102 [17:33:52] and not use them directly [17:36:21] dcausse: am looking at https://github.com/linkedin/camus/blob/master/camus-schema-registry/src/main/java/com/linkedin/camus/schemaregistry/FileSchemaRegistry.java [17:41:29] Analytics-Tech-community-metrics, DevRel-November-2015: Fix 404s (VizGrimoireJS entirely broken) on korma's mediawiki.html - https://phabricator.wikimedia.org/T118167#1817689 (Lcanasdiaz) Open>Resolved Guys, this panel is deprecated. I've replace the content with HTML code to redirect to wiki.html.... [17:41:30] Analytics-Tech-community-metrics: MediaWiki.org stats should also consider discussion activity (Talk/Thread namespaces) - https://phabricator.wikimedia.org/T62074#1817691 (Lcanasdiaz) [17:42:11] hm, dcausse, doesn't look that useful, the schema ids are just sha1s, so not incrementing [17:43:24] ottomata: schemas in the jar is a blocker? [17:47:16] ? [17:49:44] ottomata: the solution could work without too much effort if the schemas are in the jar (at least for now). But if you think it's a big issue then maybe we should leaves this out for now and wait for EventBus? [17:52:13] Analytics-Backlog, Analytics-Cluster, Easy: Add client IP to webrequest tables - https://phabricator.wikimedia.org/T116772#1817768 (Nuria) [17:53:06] Analytics-Backlog, Analytics-Cluster, Easy: Add client IP to webrequest tables - https://phabricator.wikimedia.org/T116772#1817780 (Milimetric) p:Triage>Normal [17:56:12] Analytics-Backlog, Analytics-Cluster, Improving access, Research-and-Data: Hashed IP addresses in refined webrequest logs - https://phabricator.wikimedia.org/T118595#1817808 (Milimetric) p:Triage>Normal [17:58:57] Analytics-Backlog: Wikimedia Analytics Refinery Jobs TestCamusPartitionChecker test failure when running as bd808 on stat1002 - https://phabricator.wikimedia.org/T119101#1817812 (bd808) NEW [17:59:58] Analytics-Backlog, RESTBase, Services: configure RESTBase pageview proxy to Analytics' cluster on wiki-specific domains - https://phabricator.wikimedia.org/T119094#1817820 (Milimetric) p:Triage>Normal [18:05:06] Analytics-Backlog, Analytics-General-or-Unknown, WMDE-Analytics-Engineering, Wikidata, Story: [Story] Statistics for Special:EntityData usage - https://phabricator.wikimedia.org/T64874#1817825 (Nuria) @addshore: Do you have access to cluster 1002 to run querys yourself? Timeline wise if you need... [18:06:47] Analytics-Backlog: Wikimedia Analytics Refinery Jobs TestCamusPartitionChecker test failure when running as bd808 on stat1002 - https://phabricator.wikimedia.org/T119101#1817828 (Milimetric) p:Triage>Normal [18:11:34] Analytics-Backlog: Track stats for outreach.wikimedia.org in pageview_hourly - https://phabricator.wikimedia.org/T118987#1817853 (Nuria) In the pageview definition this domain is excluded on purpose at research's team request. Since pageview_hourly only stores what we consider pageviews as per pageview defin... 
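[Editor's note: a hypothetical sketch of the interim "schemas shipped in the jar" lookup discussed above (16:36:21, 17:49:44). The resource layout (avro/<topic>/<revId>.avsc) and the object name are invented; this is neither the com.linkedin.camus SchemaRegistry interface nor the actual refinery/source code.]

import scala.io.Source
import org.apache.avro.Schema

object ClasspathSchemaRepo {
  // Loads the Avro writer schema for a topic at a given revision from the jar's resources.
  def schemaFor(topic: String, revId: Long): Schema = {
    val path = s"/avro/$topic/$revId.avsc"
    val stream = Option(getClass.getResourceAsStream(path))
      .getOrElse(throw new IllegalArgumentException(s"No schema resource at $path"))
    try new Schema.Parser().parse(Source.fromInputStream(stream, "UTF-8").mkString)
    finally stream.close()
  }
}

[Combined with the rev id read out of the message framing sketched earlier, a lookup like this could later be swapped for a file- or HTTP-backed registry without changing the byte format of the Kafka messages.]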
[18:17:19] Analytics-Backlog: Track stats for outreach.wikimedia.org in pageview_hourly - https://phabricator.wikimedia.org/T118987#1817876 (Milimetric) p:Triage>Normal [18:17:41] Analytics-Kanban: Backfill daily-top-articles in cassandra [2015-09-01 - 2015-11-16 (included)] {slug} - https://phabricator.wikimedia.org/T118991#1817879 (JAllemandou) [18:17:42] Analytics-Backlog: Backfill data on cassandra removing spiders from top endpoint - https://phabricator.wikimedia.org/T118972#1817878 (JAllemandou) [18:20:33] Analytics-Kanban, Datasets-General-or-Unknown: Wikimedia "top" pageviews API has problematic double-encoded JSON - https://phabricator.wikimedia.org/T118931#1817890 (Nemo_bis) [18:21:02] Analytics-Cluster, Analytics-Kanban: {slug} Pageview API - https://phabricator.wikimedia.org/T101792#1817892 (Nemo_bis) [18:21:16] Analytics-Backlog: Create a dedicated hive table with pageview API only requests for reporting - https://phabricator.wikimedia.org/T118938#1817898 (Milimetric) p:Triage>Normal [18:22:34] (PS1) BryanDavis: Rename network_origin UDF partitions [analytics/refinery/source] - https://gerrit.wikimedia.org/r/254170 (https://phabricator.wikimedia.org/T118592) [18:59:57] Analytics-Backlog, Analytics-Cluster, Improving access, Research-and-Data: Hashed IP addresses in refined webrequest logs - https://phabricator.wikimedia.org/T118595#1818121 (csteipp) I talked with @ellery about this briefly. I'd prefer that we don't permanently make this connection between our we... [19:07:27] Analytics-Backlog, Analytics-Cluster, Improving access, Research-and-Data: Hashed IP addresses in refined webrequest logs - https://phabricator.wikimedia.org/T118595#1818162 (ellery) @csteipp Otto mentioned that there is the potential to introduce a request ID. We could associate eventlogging recor... [19:09:16] dcausse: wait, schemas ARE in teh jar [19:10:08] dcausse, ottomata : did you guys talked further about avro issues? [19:10:18] nuria: yes, in the solution I use exclusively the jar, this is not very clean... [19:11:01] nuria: in fact I don't know, sometimes I feel so close but sometimes I feel that I'll never finish this job :) [19:11:41] dcausse: maybe we can do a hangout today too ? cc ottomata [19:12:02] as you want [19:18:01] dcausse: let me know what will be a good time [19:18:17] nuria: now would be perfect if possible :) [19:18:34] dcausse: ok, give me 5 mins to get setup [19:21:06] (CR) Nuria: [C: 2 V: 2] "Thank you for doing changes, terminology is a lot better and I am sorry we didi not suggested this earlier. Tests run clean." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/254170 (https://phabricator.wikimedia.org/T118592) (owner: BryanDavis) [19:34:49] can do another hangout if yall want [19:37:18] (CR) Ottomata: "thank you!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/254170 (https://phabricator.wikimedia.org/T118592) (owner: BryanDavis) [19:37:46] joal: still there? [19:38:32] joal: i could use some help with the offset issue i was running into when testing camus [19:45:58] nuria: maybe I can help you? [19:46:56] ottomata: k , my error is this one: [19:48:18] ^ ottomata [19:48:46] ja? [19:49:52] https://www.irccloud.com/pastebin/kz8rSR5T/ [19:50:01] sorry, my irc dropped ottomata [19:50:31] full log at /home/nuria/avro-kafka/log_camus_avro_test.txt [19:53:35] nuria: can you just remove your camus offset files [19:53:36] ? [19:53:45] from hdfs? [19:53:48] from my /tmp/dir? 
[19:53:57] yes, I tried that but will do again [19:57:12] from yeah wherever you are having camus write offsets [19:58:40] nuria: I'm back [19:58:57] have you managed with ottomata ? [19:59:13] joal: I am doing another run, let me see [20:02:31] milimetric: should i get this patch merged and then start one for the logic of each node one? https://phabricator.wikimedia.org/T118309 [20:02:59] * milimetric looks [20:03:03] or multiple patches based off of each other - i don't know that much git magic [20:03:38] oh, if you wanna chain gerrit changesets it's not too bad. [20:03:49] it gets a little harder if you have to update changes earlier in the chain [20:03:55] yeah [20:03:57] but it's not bad, I can help [20:04:03] it's just a matter of rebasing [20:04:39] joal: me no compredou, i have deleted all my offsets under /tmp/nuria/history [20:04:43] so if you're on a branch where you worked on this first change, just git checkout -b next-change-branch-name-whatever [20:04:44] and it still says: 15/11/18 22:38:50 INFO kafka.CamusJob: Previous execution: hdfs://analytics-hadoop/tmp/nuria/history/2015-10-05-18-32-57 [20:04:58] milimetric: so i just continue working in my same branch, push the new changes to a new patch? [20:05:12] what happens if i want to fix things in the old patch? [20:05:13] nuria [20:05:15] what about base? [20:05:30] madhuvishy: my suggestion is to make a separate branch for each change [20:05:32] that is where it picks up the last run from [20:05:34] nuria: and dest too - i would clear all [20:05:36] your main camus path [20:05:38] delete it and history [20:05:40] i deleted all /tmp/nuria ottomata ( i think) lemme seee [20:05:43] oh [20:06:03] milimetric: okay [20:06:05] i can do that [20:06:05] ottomata: [20:06:10] https://www.irccloud.com/pastebin/tEpVQndL/ [20:06:15] milimetric: and then keep rebasing - okay [20:06:25] so clean as of now [20:06:26] yeah, and let me know if you run into trouble [20:06:36] there are lots of ways to do the same thing in git, you just gotta find one that you like [20:07:25] ottomata, joal: [20:07:30] but camus must be writing this info elsewhere to know: " [20:07:30] milimetric: okay sure. also, should I make a separate GlobalMetricUpload or sth form to provide for the cohort stuff and start date, end date? [20:07:36] https://www.irccloud.com/pastebin/61Uq8z2L/ [20:08:08] madhuvishy: sth form? [20:08:17] you can just push the form you had in the same change if you have it [20:08:28] milimetric: sth=something sorry [20:08:28] or if you want, you can organize your commits with git add -p [20:08:43] :) oh, right [20:08:59] so in gerrit, one commit == one change [20:09:19] not really :D [20:09:21] so managing a chain of changes is managing a chain of commits [20:09:42] milimetric: no, this is not about gerrit - I'm asking otherwise - should we reuse CohortUpload, or make a new form, because start and end date are not in cohort upload [20:09:46] (repo, branch, change-id) == change [20:10:04] careful! hashar is WATCHING.... [20:10:18] oh right, sorry, yes, new form is cool. You can subclass CohortUpload I guess [20:10:19] heh [20:10:32] hashar: nah, those are just artifacts [20:10:34] milimetric: yeah I'll do that [20:10:46] and congratulations folks for the PageView RESTBase entry point. I am sure we will see creative uses of that api \O/ [20:11:12] thanks hashar :) I agree with you that someone should put little graphs of pageviews on each article [20:11:52] joal, ottomata : do any of you know where camus stores offsets ? in zookeeper? 
[20:12:00] joal: maybe taht makes no sense. [20:12:09] no, just in hdfs [20:12:16] nuria, i'm going to try to run your stuff.. [20:12:33] avro-kafka, right? [20:12:36] the properties in there? [20:12:38] ok, it is on /home/nuria/avro-kafka/ launch_camus_job_no_wrapper.sh [20:12:49] k [20:12:50] ottomata: ya, see sh file [20:13:27] milimetric: a first step would be Special:Statistics . Then maybe that can be done on the backend and cached in memcached to avoid calls to restbase [20:13:37] anyway kudos [20:14:12] hashar: it's behind varnish anyway, so doesn't need to be babied [20:14:28] we might need to work on the cache expirations for it [20:16:21] hashar: ya, caching should not be an issue with daily expiration [20:16:22] nuria: I'll have a look at that tomorrow if I have no news from you :) [20:18:28] nuria: i get a different error [20:18:34] null record [20:18:52] or is that what you were getting [20:27:38] ottomata: nah, i was getting offset one [20:28:20] joal:ok [20:31:54] bye a-team, tomorrow [20:32:04] nite [20:32:08] salut! [20:34:06] ottomata: lemme re-run [20:34:19] not sure how to get past this error though it is weird too [20:36:26] ottomata: i am just going to use "kafka.move.to.earliest.offset" [20:36:36] setting and see if it works [20:37:09] earliest will probably be bad, as there is weird data in test topic, no [20:37:11] i am now trying [20:37:13] kafka.move.to.last.offset.list=test [20:37:38] oh but there isn't actual data in this tocpi at the end, right? hm [20:38:56] nuria: job succeeded when Iset that [20:39:22] but no data imported, obvi [20:39:30] ottomata: there is data , i just pushed it [20:39:36] ok [20:39:38] running again [20:39:44] ottomata: using cat data.json | kafkacat -b kafka1012.eqiad.wmnet:9092 -t test [20:39:58] aye just saw that in your .txt file :) [20:40:38] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1818524 (ezachte) Resolved>Open [20:41:39] ottomata: i think i had a job running you might need to retry [20:41:41] null recored! :) [20:41:47] naw shouldn't matter, i'm importing elsewhere [20:42:17] yargh, not sure nuria, this seems to not be an offset problem though [20:43:08] nuria: https://gist.github.com/ottomata/6d5aea160e53104345e0 [20:43:21] milimetric: if you are still around, the PageView demo page at https://analytics.wmflabs.org/demo/pageview-api/ doesn't work for me in either chrome or safari :( [20:43:34] milimetric: i just see cats and dogs but no graph / bar whatever :} [20:43:42] hashar: are you using chromium? [20:43:56] ottomata: we have gotten that before,that is a red herring [20:44:04] milimetric: tried with v46 [20:44:09] nuria: of? [20:44:18] hashar: there's some default security setting on chromium (not chrome) that doesn't let you work with CORS [20:44:23] ottomata: the important thing is [20:44:26] https://www.irccloud.com/pastebin/6h1J3mLQ/ [20:44:43] hashar: so basically the only problem is your browser has to allow CORS [20:44:52] milimetric: ahh cors I was suspecting that but couldn't not find any error to report it. Works in firefox for some reason [20:45:07] milimetric: so +1 on the demo :-))))) [20:45:31] nuria: yes beacus it failed reading? 
[20:46:03] ottomata: right, on my prior runs with madhuvishy we got the ETL key error every time [20:46:19] ottomata: but when things were working bytes read was bigger than 0 [20:46:37] yeah, that was because we had non avro nonsense test messages in there [20:47:00] right, i think maybe the job isn't able to read whatever you are piping to kafka nuria? [20:47:42] ottomata: so this error you think it doesn't matter [20:47:53] https://www.irccloud.com/pastebin/j8TObwWd/ [20:48:26] i think it does but i don't get that error [20:48:29] ottomata: ok, let's try changing things [20:58:33] ottomata: and there is no way to flush a topic right? [21:00:49] no [21:01:02] but nuria, you can create another test topic if it would help, might be good to have a couple [21:01:03] test1 [21:01:04] maybe [21:01:08] you can just produce to it, and it will be created [21:01:14] ohhhhh [21:03:10] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1818566 (ezachte) =Done Update diagram, a.o. to show new file names + added missing report {F2976774} Added docs on data1001 U... [21:04:32] (CR) Deskana: "Discovery could use some feedback on this, as it's sitting our review queue. Thanks!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247601 (https://phabricator.wikimedia.org/T115919) (owner: OliverKeyes) [21:07:20] (CR) Nuria: "I have reviewed patch, suggesting we change string constants by enums some time ago, I can work together with this with oliver as needed b" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/247601 (https://phabricator.wikimedia.org/T115919) (owner: OliverKeyes) [21:13:41] dcausse: FYI that creating a new topic make offset issues disappear. cc ottomata , run still fails. [21:13:56] what now? [21:26:27] (PS2) Mforns: Add sum aggregate by user report [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/254068 (https://phabricator.wikimedia.org/T117287) [21:27:52] (PS3) Mforns: Add sum aggregate by user report [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/254068 (https://phabricator.wikimedia.org/T117287) [21:49:43] Analytics-Backlog, Database: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856#1818703 (mforns) @jcrespo Sorry for pinging you via email, I will avoid that in the future. Thanks! [21:49:58] ottomata: it just processes no data but i guess it could also be schema errors on my end (although i validated that data with schema before sending it) . i am going to move to david's patch to see if i can add a bit of structure to the registry [21:57:13] nuria: aye, ok [21:57:16] sorry i couldn't be more help [22:16:16] Analytics-Backlog, Fundraising research, Research-and-Data: FR tech hadoop onboarding - https://phabricator.wikimedia.org/T118613#1818828 (atgo) I think it's fine as is and will let you know if that changes. Thanks! [22:19:08] milimetric: is there some special knockout magic that makes the start and end date fields appear on the UI? [22:19:31] on the report creation page that is [22:33:21] madhuvishy: no magic, but the full story is a bit confusing [22:33:39] madhuvishy: I suspect the part that you're missing is that the various Metrics inherit from TimeseriesMetric [22:33:50] which defines start_date and end_date as WTForm fields [22:33:57] milimetric: aah [22:34:00] ummm [22:34:07] i copied all that over [22:34:30] hm? 
:) [22:34:46] milimetric: as in, in my new form [22:34:53] i included those fields [22:34:58] may be i missed something [22:35:44] https://www.irccloud.com/pastebin/OAGW81dg/ [22:36:06] madhuvishy: how are you generating the HTML? [22:36:11] milimetric: I don't see anything else i'd need [22:36:27] milimetric: I put in [22:36:31] https://www.irccloud.com/pastebin/nZkkqU9H/ [22:36:49] although probably this value: default.start_date I have to define in the js [22:37:04] but i don't get the input boxes at all on the UI [22:37:55] knockout.util.js is included on the page [22:39:15] oh! [22:39:25] sorry, right if you're using that whole mess... [22:39:32] but wait, why do you have to use that? [22:40:04] milimetric: use what? are you saying no need to use data-bind? [22:40:28] no, that's a custom binding meant to implement all the absolutely crazy things they wanted us to do with dates on that page [22:40:35] aah [22:40:45] so i can directly use the datetimepicker [22:40:50] lemme see ... maybe we can use it by hardcoding some stuff [22:40:56] they always want it to be UTC [22:41:08] so we can maybe hardcode the UTC zone [22:41:11] oh wait [22:41:14] but but [22:41:16] no they want the output to be UTC.... [22:41:20] its not even showing up [22:41:34] right, that makes sense, it would probably throw errors right now [22:41:41] is that because of the params? no errors too [22:41:42] that may or may not be swallowed up by that insane binding [22:41:49] aahhh [22:41:50] ok [22:41:56] it does get swallowed [22:42:00] (they'd be swallowed 'cause that thing handles infinitely many use cases or something) [22:42:11] alright [22:42:20] https://eonasdan.github.io/bootstrap-datetimepicker/ [22:42:47] i should just use the plugin directly may be? [22:45:31] uh... [22:45:40] so they'll probably want the same time zone support and crap [22:46:27] this basically makes it so they can enter dates in any time zone they want and select a time zone from a drop down and it'll create a hidden input with the correct name and id and keep the date in sync with what's selected in the picker [22:46:56] they'll probably want that... so maybe just copy the timezone dropdown from the report page [22:47:25] oh [22:47:42] and you can pass in ko.observable() for value: [22:47:56] if you want to test and just make sure that works [22:48:16] okay [22:48:52] try passing in ko.observable("{name: 'Central European Time', value: '+01:00'}") for timezone [22:49:03] (instead of copying that) [22:49:57] milimetric: like this? [22:50:00] https://www.irccloud.com/pastebin/Xu1lLJdd/ [22:50:53] yes, does that still not work? [22:51:05] oh wait... [22:51:05] sorry [22:51:21] lose the quotes on the thing inside the observable, madhuvishy [22:51:30] that thing just wants a plain object [22:51:48] ah okay [22:51:56] milimetric: still doesn't show up though [22:52:00] grrr [22:52:14] gimme a sec, I'll try it [22:52:24] milimetric: is there stuff specific to this in the report creation js? [22:52:35] I thought they were all in knockout util [22:52:55] i don't think so, no [22:53:00] okay [22:53:00] if there is, that's a bug [22:53:49] milimetric: hmmm, its probably just me missing something [22:54:03] yeah, but this is stupidly hard :( sorry [22:54:58] madhuvishy: what I'm gonna do is try adding this on a random page and step through the datetimepicker binding's update call (https://github.com/wikimedia/analytics-wikimetrics/blob/master/wikimetrics/static/js/knockout.util.js#L25) [22:55:40] milimetric: can you batcave? 
I wanna see you debug it [22:55:55] sure [22:56:23] ok i'm there [22:59:40] Analytics-Backlog: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1819140 (Tbayer) NEW [23:13:16] Analytics-Backlog: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1819183 (Tbayer) [23:27:52] hey a-team, signing off, have a good end of day, bye! [23:28:03] nite! [23:28:09] good night mforns, I'll review your code later today :) [23:28:24] thanks madhuvishy! [23:37:56] (CR) Madhuvishy: "This is cool, just one comment." (1 comment) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/254068 (https://phabricator.wikimedia.org/T117287) (owner: Mforns) [23:45:04] Analytics-Backlog, Fundraising research, Research-and-Data: FR tech hadoop onboarding - https://phabricator.wikimedia.org/T118613#1819359 (DarTar) @atgo I'll assign this to you so you can coordinate with @madhuvishy and Analytics for data access as needed.