[00:23:00] nite everyone!
[02:50:16] (CR) Madhuvishy: Add get pageview_info udf and underlying functions (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/214349 (owner: Joal)
[02:57:37] (CR) Madhuvishy: Add pageview aggregation and parquet merger. (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/212541 (owner: Joal)
[06:58:21] Analytics-Tech-community-metrics, ECT-June-2015, ECT-May-2015, Patch-For-Review: "Age of unreviewed changesets by affiliation" shows negative number of changesets - https://phabricator.wikimedia.org/T72600#1319849 (Qgil) There are still negative numbers.
[07:03:37] Analytics-Tech-community-metrics, ECT-May-2015: Key performance indicator: Gerrit review queue - https://phabricator.wikimedia.org/T39463#1319865 (Qgil) Open>Resolved Alright, I finally went through http://korma.wmflabs.org/browser/gerrit_review_queue.html and I think the page is good overall. Ther...
[07:09:48] Analytics-Tech-community-metrics: Median time to review for Gerrit Changesets, per month - https://phabricator.wikimedia.org/T97715#1319876 (Qgil) The two graphs shown in F13952 at T68265 correspond to what is today "Age of open changesets (monthly snapshots)" and "Age of open changesets by affilation (month...
[07:11:50] Analytics-Tech-community-metrics, ECT-June-2015, ECT-May-2015, Patch-For-Review: "Age of unreviewed changesets by affiliation" shows negative number of changesets - https://phabricator.wikimedia.org/T72600#1319881 (Qgil)
[07:12:34] Analytics-Tech-community-metrics, ECT-May-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1319887 (Qgil)
[07:12:36] Analytics-Tech-community-metrics, ECT-June-2015, ECT-May-2015, Patch-For-Review: "Age of unreviewed changesets by affiliation" shows negative number of changesets - https://phabricator.wikimedia.org/T72600#748491 (Qgil)
[07:21:48] Analytics-Tech-community-metrics, ECT-June-2015: "Median time to review for Gerrit Changesets, per month": External vs. WMF/WMDE/etc patch authors - https://phabricator.wikimedia.org/T100189#1319915 (Qgil) In http://korma.wmflabs.org/browser/gerrit_review_queue.html we have "Age of open changesets by affi...
[07:22:23] Analytics-Tech-community-metrics, ECT-June-2015: "Median time to review for Gerrit Changesets, per month": External vs. WMF/WMDE/etc patch authors - https://phabricator.wikimedia.org/T100189#1319921 (Qgil)
[07:24:05] Analytics-Tech-community-metrics, ECT-June-2015, ECT-May-2015: Fine tune "Code Review overview" metrics page in Korma - https://phabricator.wikimedia.org/T97118#1319923 (Qgil)
[07:25:08] Analytics-Tech-community-metrics, ECT-June-2015, ECT-May-2015: Key performance indicator: analyze who contributes code - https://phabricator.wikimedia.org/T55485#1319928 (Qgil)
[08:55:30] Analytics-Tech-community-metrics, ECT-June-2015, ECT-May-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1320017 (Aklapper)
[09:15:51] Analytics-General-or-Unknown, CA-team, Community-Liaison, Wikimedia-Extension-setup: enable Piwik on ru.wikimedia.org - https://phabricator.wikimedia.org/T91963#1320044 (Multichill) >>! In T91963#1318815, @Tgr wrote: > It is not clear (to me, anyway) that the WMF privacy policy covers the WM-RU sit...
[09:53:24] Analytics-Tech-community-metrics, ECT-June-2015, ECT-May-2015, Patch-For-Review: "Volume of open changesets" graph should show reviews pending every month - https://phabricator.wikimedia.org/T72278#1320087 (Aklapper) According to Dicortazar in our meeting this needs some more discussion with @qgil....
[09:53:48] Analytics-Tech-community-metrics, ECT-June-2015, ECT-May-2015: Maniphest backend for Metrics Grimoire - https://phabricator.wikimedia.org/T96238#1320098 (Aklapper) Work ongoing, hence adding ECT-June-2015; looking forward to the results!
[09:57:31] Analytics-Tech-community-metrics, ECT-June-2015, ECT-May-2015, Epic, Google-Summer-of-Code-2015: Allow contributors to update their own details in tech metrics directly - https://phabricator.wikimedia.org/T60585#1320109 (Aklapper)
[10:01:04] Analytics-Tech-community-metrics, ECT-June-2015: "Median time to review for Gerrit Changesets, per month": External vs. WMF/WMDE/etc patch authors - https://phabricator.wikimedia.org/T100189#1320120 (Aklapper) a:Dicortazar There are plans to exclude obvious Freemail providers like GMX from the stats. W...
[10:05:12] Analytics-Tech-community-metrics, ECT-May-2015: Provide list of open Gerrit changesets with most activity which aren't -1/-2'ed - https://phabricator.wikimedia.org/T94036#1320134 (Aklapper) >>! In T94036#1308877, @Dicortazar wrote: > have a look at http://korma.wmflabs.org/browser/scr-backlog.html Thanks...
[10:06:06] Analytics-Tech-community-metrics, ECT-June-2015, ECT-May-2015: Tech metrics should talk about "Affiliation" instead of organizations or companies - https://phabricator.wikimedia.org/T62091#1320141 (Aklapper) This still needs to be pulled - adding ECT-June-2015 after talking to @Dicortazar
[12:08:37] Analytics-Kanban, Analytics-Wikimetrics: Wikimetrics crashes when cohort description has special characters - https://phabricator.wikimedia.org/T100781#1320438 (mforns) NEW
[12:59:09] Ironholds: Heya, you there ?
[13:00:47] joal, yep
[13:01:03] Got some data for you (didn't make it yesterday, but today :)
[13:01:12] https://gist.github.com/jobar/9450b29096cd193a0f15
[13:01:18] https://gist.github.com/jobar/abdc496bb025f9b8fc89
[13:01:44] Would like your opinion on project names correctness, and on the global scale as well (does it look correct ?)
[13:04:30] And last (but not least) - https://gist.github.com/jobar/59eab22201685bd4c68f
[13:08:32] Ironholds: Another question then: https://gist.github.com/jobar/3baee2c69db8a40758af
[13:08:46] joal, cool!
[13:08:52] (sorry, didn't get pinged, got distracted by http://uproxx.com/music/2015/05/single-ladies-set-to-the-ducktales-theme-song-is-the-best-thing-on-the-internet-right-now/ )
[13:08:56] Do you think we can say that every project in the form '%-%.%' is a dialect one ?
[13:09:54] * joal thanks Ironholds for this link :D
[13:10:38] I don't know what you mean by "a dialect one"
[13:11:30] a domain from which I should extract a dialect
[13:11:51] For the moment, I use /dialect/ when not /wiki/ nor /w/
[13:12:11] then that is the heuristic you should keep using; it's the right heuristic ;p
[13:12:29] huhu
[13:12:55] arbcom-de.wikipedia is the German-language Arbitration Committee, for example
[13:13:25] that's the big fuzzing element there. Alternately, you could do a negative check (if it's got a dash and doesn't match 'arbcom' then...)
[13:13:52] I guess my question would just be whether the outcome is identical in both cases, and if so, do whichever is more performant.
[13:13:55] arbcom is the only wrong one here ?
[13:15:44] that I can see, but I might be wrong :D
[13:15:46] for instance, fiu-vro.wikipedia.org exists, but fiu.wikipedia.org does not
[13:16:01] So it's not really a dialect ... Or is it ?
[13:16:01] yeah, that's actually a foible of our naming schemes
[13:16:08] hum
[13:16:26] Ok, I'll keep my heuristic, and leave projects as projects (much easier)
[13:16:39] fiu-vro is the Voro language - it's a dialect under the ISO rules
[13:17:14] our naming scheme does not FOLLOW this ISO convention :/
[13:17:26] hum ... I guess what makes sense here is the pair (project/dialect)
[13:17:30] so yeah, looks like a full language. No idea where the fiu came from
[13:17:48] I'll leave it as it is now
[13:18:14] Anything you want me to change before I start backfilling ?
[13:18:20] * Ironholds thinks
[13:18:28] Nothing teeny ;p
[13:18:54] ottomata, Meestah Otto, where did the cirrus logs end up on 1002?
[13:18:59] I know they're there but not the dir ;p
[13:19:34] Ironholds: anything big then ;-P
[13:19:37] ?
[13:20:04] I'll still wait for analytics standup before starting, to get peer acknowledgement
[13:20:17] ?
[13:20:48] Ironholds: remind me...
[13:21:13] ottomata, you offered to sync up the cirrussearch logs from fluorine with stat1002, they ended up there, I was very happy and totally forgot where they lived
[13:21:28] joal, I'm good, I think :). Unless you have 40 new machines in your basement.
[13:22:14] Ironholds: Would love that so much ...
[13:22:27] hahah
[13:22:27] Thanks a lot for having looked at that :)
[13:22:42] You'll know when the data is accessible
[13:24:44] ah yes
[13:24:46] Ironholds:
[13:24:50] /a/mw-log/archive/
[13:26:47] ottomata: the dialect term comes from Ironholds
[13:26:52] I guess it's ok ?
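A minimal sketch of the heuristic being discussed above, assuming hostnames of the usual project form; the function name and the arbcom special case are illustrative only, not code from analytics/refinery:

    // Hypothetical helper: treat a hyphenated subdomain such as "fiu-vro.wikipedia.org"
    // as a dialect/variant wiki, with the negative check Ironholds suggests so that
    // committee wikis like "arbcom-de.wikipedia.org" are left alone.
    def looksLikeDialectWiki(uriHost: String): Boolean = {
      val subdomain = uriHost.toLowerCase.stripPrefix("www.").split("\\.").headOption.getOrElse("")
      subdomain.contains("-") && !subdomain.startsWith("arbcom")
    }

    // looksLikeDialectWiki("fiu-vro.wikipedia.org")   // true  (kept as-is per the discussion)
    // looksLikeDialectWiki("arbcom-de.wikipedia.org") // false
    // looksLikeDialectWiki("en.wikipedia.org")        // false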
[13:27:12] (PS2) Joal: Add get pageview_info udf and underlying functions [analytics/refinery/source] - https://gerrit.wikimedia.org/r/214349
[13:28:13] Ironholds: I had not seen your comment on consistency
[13:28:38] ottomata, Ironholds : Do you prefer me to change that to '--' to match the geodata one ?
[13:28:42] no
[13:28:48] - is more consistent
[13:28:52] I'd just like things to be consistent everywhere
[13:28:58] -- i assume exists because the country code is always 2 letters
[13:29:08] hm ... Maybe
[13:29:18] i
[13:29:21] i'm not sure which is best
[13:29:27] but, - is better than -- :)
[13:29:29] having read the underlying libmaxminddb source code
[13:29:37] it exists because the people we depend on write horrifying code
[13:30:02] C written by Perl people *shudders*
[13:30:28] * joal hides behind Ironholds
[13:30:50] :P
[13:31:04] anyway; I don't mind what it is as long as it's sensible
[13:31:59] (CR) Joal: Add get pageview_info udf and underlying functions (9 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/214349 (owner: Joal)
[13:41:48] Analytics-Tech-community-metrics, ECT-June-2015, ECT-May-2015, Patch-For-Review: "Volume of open changesets" graph should show reviews pending every month - https://phabricator.wikimedia.org/T72278#1320628 (Qgil) >>! In T72278#1320087, @Aklapper wrote: > According to Dicortazar in our meeting this n...
[13:52:07] milimetric: yt?
[13:52:13] hey ottomata
[13:52:22] re your EL load email
[13:52:38] yes
[13:52:43] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=eventlog1001.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1432907489&g=load_report&z=large&c=Miscellaneous%20eqiad
[13:53:13] that node has 24 processors
[13:53:22] oh :)
[13:53:29] way beefier than vanadium, I didn't know
[13:53:30] good
[13:53:42] 32G ram
[13:53:59] if you were cpu bound
[13:54:03] you could parallelize
[13:54:08] not so easy to do with zmq though
[13:54:11] but with kafka, easy!
[13:54:19] well, easier.
[13:54:36] ottomata: Why would it be difficult using 0mq ?
[13:54:51] because all messages are in a single stream?
[13:54:56] you could farm them out, but you'd have to do that
[13:55:06] Multiple consumers can use the same queue
[13:55:17] and get different messages?
[13:55:31] We would lose schema partitioning
[13:55:37] I think so yes
[13:55:59] really? so zmq deletes a message from the queue when it is read?
[13:56:06] wouldn't that mean that only one consumer could read?
[13:56:08] or i mean
[13:56:11] one consumer per message
[13:56:18] we use zsub and other tools to check the stream
[13:56:23] 1 consumer reads 1 message, the other reads another ?
[13:56:24] wouldn't that mean we are draining the stream when we do that?
[13:56:41] if the message is not deleted when read
[13:57:00] I don't get it, ottomata :(
[13:57:01] then i guess those consumers that need distinct messages could be in the same 'consumer group' or 'socket id' or whatever zmq does
[13:57:16] Yeah that's the idea
[13:57:22] but, how would they coordinate?
[13:57:29] badly :)
[13:57:31] uh, if zsub steals the messages from the rest of the pipeline, that'd be... bad
[13:57:33] zmq just keeps track of a high water mark?
[13:57:40] for a given group
[13:57:47] and each time a consumer asks for a message it increments?
[13:57:52] how big is zmq's buffer?
[13:58:19] ottomata, thanks for the link above! Sorry I missed it ;p
[13:58:54] Analytics-EventLogging, Analytics-Kanban: Load Test Event Logging [8 pts] {oryx} - https://phabricator.wikimedia.org/T100667#1320658 (Milimetric)
[14:00:48] joal: i gotta run a quick errand, but today i was thinking of working on madhu's spark problems, and also, if you like, maybe bob's column pruning? i think you are busy with other things so I can take that on
[14:01:07] I am currently trying bob's pruning
[14:01:10] ah ok :)
[14:01:11] cool.
[14:01:11] :)
[14:01:23] Code is ready to launch for starting backfilling
[14:01:27] i will focus on spark then, i betcha that will be a headache
[14:01:35] I'll wait for standup
[14:01:42] I have some comments on madhu's code
[14:01:48] To optimize
[14:01:57] I'll take some time for that
[14:02:01] as well ;)
[14:02:14] k
[14:04:09] ottomata: zmq sockets are different depending on pub/sub or push/pull
[14:04:25] parallelization depends on that
[14:04:43] pub/sub --> broadcast (no parallelization)
[14:05:12] push/pull --> queue (parallelization)
[14:05:15] ottomata: --^
[14:05:23] ah, so EL as is is not easily parallelizable
[14:05:29] would have to use a different socket type?
[14:05:30] ?
[14:06:16] I think so yeah
[14:06:37] also, column pruning --> almost no gain if we keep parquet + snappy
[14:06:57] if we use gz instead of snappy --> divide size by 2
[14:07:02] I'll check bz2
[14:07:08] hm
[14:07:15] Since it's cold storage, we could go that way
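A spark-shell sketch of the snappy-vs-gzip idea above; spark.sql.parquet.compression.codec is the standard Spark SQL setting, while the input path, output path and repartition factor are only placeholders (joal mentions 16 vs 64 blocks a bit later in the log):

    // Rewrite one day of data with gzip instead of snappy to compare sizes;
    // repartition so the output lands in fewer, larger files.
    sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

    val dayData = sqlContext.parquetFile(
      "/wmf/data/wmf/webrequest/webrequest_source=mobile/year=2015/month=5/day=20")
    dayData.repartition(16).saveAsParquetFile("/tmp/webrequest_day_gzip")  // hypothetical output path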
[14:40:27] joal: madhuvishy is only running this on mobile data? is that right?
[14:40:51] hmm, yeah I think so
[14:40:59] Let me check
[14:41:46] yes, mobile only
[14:42:02] Quickly, some optims --> Use is_pageview field instead of compute
[14:42:07] filter before union
[14:42:31] extract wmfuuid from map in query instead of compute
[14:42:45] And I think it's already not too bad :)
[14:42:50] aye
[14:43:02] . The Parquet data source is now able to discover and infer partitioning information automatically.
[14:43:02] ???
[14:43:15] By passing path/to/table to either SQLContext.parquetFile or SQLContext.load, Spark SQL will automatically extract the partitioning information from the paths. Now the schema of the returned DataFrame becomes:
[14:43:15] ?
[14:43:19] https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#parquet-files
[14:43:22] ottomata, the domain meeting is clashing with another meeting I've got, I'm afraid
[14:43:24] If you give a hive-like path
[14:43:44] joal, we do!
[14:43:55] indeed
[14:44:03] I don't get your point :(
[14:44:03] kinda coooool, does that mean it is smart enough to prune partitions?
[14:44:14] oh, my point is, if we use parquetFile to just load
[14:44:22] '/wmf/data/wmf/webrequest'
[14:44:22] I don't think so
[14:44:35] You still can't load multiple folders at once
[14:44:39] hm. you think it just gets the partition keys as in the schema?
[14:44:48] Ironholds: uhhhh
[14:44:55] i think you should be there, as you have the most informed opinion i think
[14:44:56] no?
[14:45:00] I think it basically lets you query data based on the partition key
[14:45:09] hm
[14:45:20] Ironholds: I think so too
[14:45:26] well, then it'll have to be next week
[14:45:41] Even if my opinion is nothing compared with ottomata's ;)
[14:45:50] because if you schedule meetings in the only meeting-free gap I have today, which is 2 hours long, I will be Very Unhappy
[14:46:03] (although not as unhappy as my boss who will be all "Oliver why have you got nothing done this week")
[14:46:22] sooooo, so
[14:47:13] soo, monday at the earliest if you need me in it. If you can do it without me, go right ahead.
[14:47:23] joal: up to you
[14:47:40] Let's reschedule
[14:50:16] done, next tuesday
[14:51:06] ottomata: about data pruning, I also repartition to ensure a better block match
[14:51:20] gzip --> 16 blocks instead of 64
[14:51:24] ok ?
[14:52:06] sure
[14:52:12] don't mind that at all
[14:52:17] k
[14:54:50] joal, i am reading the ParquetRelation2 source in spark
[14:54:55] it looks like it should do partition pruning
[14:55:09] going to test...
[14:58:06] wow :)
[14:58:10] That would be nice !
[14:58:45] So you'd load and union, then filter based on partition data ?
[14:59:40] union?
[15:00:13] I mean, you could load data in subfolders?
[15:00:22] because if not, it means union
[15:00:22] looks like it, yeah
[15:00:32] hmm, makes much more sense then :)
[15:00:36] https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#partition-discovery
[15:01:55] joal: https://github.com/apache/spark/blob/v1.3.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L402
[15:02:19] testing it live on our cluster --> seems to work great :)
[15:03:19] Youhou !
[15:03:24] That's really cool :)
[15:03:29] No more unioning needed !
[15:03:40] Thanks for having double-checked that :)
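The partition-discovery behaviour from the Spark 1.3 docs linked above, as a spark-shell sketch against the webrequest layout mentioned in this conversation (paths and column names taken from the log):

    // Load one day of mobile webrequest data; Spark 1.3 infers the remaining
    // partition column (hour) from the directory names and exposes it in the schema.
    val wr = sqlContext.parquetFile(
      "/wmf/data/wmf/webrequest/webrequest_source=mobile/year=2015/month=5/day=20")
    wr.registerTempTable("webrequest_day")

    // Filtering on the partition column should only touch the matching sub-directory.
    val sample = sqlContext.sql(
      "SELECT uri_host, uri_path FROM webrequest_day WHERE hour = 0 LIMIT 10")
    sample.collect().foreach(println)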
[15:05:24] oh, you have tested already?
[15:05:27] you can tell that it works?
[15:05:38] joal: check this out too:
[15:05:39] https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#configuration
[15:05:42] spark.sql.parquet.filterPushdown
[15:05:48] defaults to false because of a bug with nullable and binary columns
[15:05:54] not sure if we have any of those columns
[15:06:20] but parquet pushdown means that it can filter based on conditionals when the parquet reader is actually returning data
[15:06:43] i think that lets it read much less data if we are filtering on a non-partition column
[15:07:03] say, we only wanted where uri_host = pl.wikipedia.org
[15:07:25] it would read the uri_host column file, filter on that, and then only read rows out of the other column files that match
[15:07:48] sounds really great
[15:08:34] so, let's see if we can make this work as is, test how long it takes, and then turn that on and test and compare?
[15:09:17] why not :)
[15:09:40] I wonder about the null bug tho
[15:11:13] ja not sure if it affects us or not
[16:19:14] (PS3) Mforns: Add stacked bars component to compare layout [analytics/dashiki] - https://gerrit.wikimedia.org/r/214036 (https://phabricator.wikimedia.org/T91123)
[16:19:35] (PS4) Mforns: Add stacked bars component to compare layout [analytics/dashiki] - https://gerrit.wikimedia.org/r/214036 (https://phabricator.wikimedia.org/T91123)
[16:23:22] (PS5) Mforns: Add stacked bars component to compare layout [analytics/dashiki] - https://gerrit.wikimedia.org/r/214036 (https://phabricator.wikimedia.org/T91123)
[16:24:08] (PS6) Mforns: Add stacked bars component to compare layout [analytics/dashiki] - https://gerrit.wikimedia.org/r/214036 (https://phabricator.wikimedia.org/T91123)
[17:23:22] joal: how did you test the partition pruning?
[17:24:33] ottomata: loaded a dataset for a day, and filtered data over an hour
[17:24:50] how did you load?
[17:24:59] my shell is hanging on
[17:24:59] val webrequestParqet = sqlContext.parquetFile("/wmf/data/wmf/webrequest")
[17:25:14] well, that's a little too big ;)
[17:25:40] why big?
[17:25:43] it is supposed to prune, no?
[17:25:48] It will possibly load, but it tries to read parquet defs for every single folder
[17:25:49] everything below that is a partition
[17:25:54] OH
[17:25:57] :)
[17:26:03] because it has to find the schema, HMMM
[17:26:14] too big
[17:26:58] still, not what we want though
[17:27:05] i was hoping we could treat that like a table
[17:27:13] if we just loaded a dataframe there and registered a temp table
[17:27:32] I get the point, but it's not feasible as is
[17:27:43] that means we'll still need code to go from date range to file paths, no?
[17:27:55] hm. unless...hm, unless we can specify the schema up front
[17:27:55] It will still be useful to prevent us having to load and union every single hour
[17:27:56] hm
[17:27:58] yes
[17:28:30] for regular day and month aggregations / queries, it is useful
[17:29:49] wow, I had never seen the production queue that loaded :)
[17:30:13] memory used: 1.03TB
[17:30:16] :D
[17:30:48] The memory change changes things :)
[17:34:32] hehe :)
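A sketch of the with-and-without comparison ottomata proposes above for spark.sql.parquet.filterPushdown, using the uri_host example from the log; the flag stays off by default in Spark 1.3 because of the nullable/binary bug mentioned, and the timing approach here is just one simple way to compare:

    // Same non-partition-column filter, run with pushdown off and then on.
    val wrDay = sqlContext.parquetFile(
      "/wmf/data/wmf/webrequest/webrequest_source=mobile/year=2015/month=5/day=20")
    wrDay.registerTempTable("wr_day")

    def timedCount(): Long = {
      val start = System.currentTimeMillis()
      sqlContext.sql("SELECT COUNT(*) FROM wr_day WHERE uri_host = 'pl.wikipedia.org'").collect()
      System.currentTimeMillis() - start
    }

    sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
    val withoutPushdownMs = timedCount()

    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
    val withPushdownMs = timedCount()

    println(s"without pushdown: ${withoutPushdownMs}ms, with pushdown: ${withPushdownMs}ms")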
[17:34:47] Analytics-Kanban, Analytics-Wikimetrics: Wikimetrics crashes when cohort description has special characters - https://phabricator.wikimedia.org/T100781#1321297 (kevinator) p:Triage>Unbreak!
[17:35:09] ottomata: Will you have an impala for me to ride next week ?
[17:36:54] hm, i haven't been focusing on that, but it is working, no?
[17:37:12] just annoying default queue names, and have to point it at a worker manually?
[17:37:12] right?
[17:37:27] I have not tested yet
[17:37:30] But will !
[17:37:35] You remember
[17:37:51] parquet issues or something ?
[17:37:55] I can't recall
[17:38:01] I'll test next week :)
[17:39:53] caching in spark is really REALLY cool :)
[17:42:14] first run (caching data) -> 210s
[17:42:14] second run (data cached) -> 4s !
[17:42:14] Yay !
[17:48:01] joal: ottomata Hi
[17:48:07] heya madhuvishy
[17:48:22] just got to the office. reading the thread above
[17:49:17] joal: are there recommendations for code changes I should try
[17:49:33] There are some :)
[17:50:11] 16:42:01 < joal> Quickly, some optims --> Use is_pageview field instead of compute
[17:50:14] 16:42:07 < joal> filter before union
[17:50:17] 16:42:30 < joal> extract wmfuuid from map in query instead of compute
[17:50:20] 16:42:44 < joal> And I think it's already not too bad :)
[17:50:30] madhuvishy: --^
[17:50:30] joal: hmmm, let me look at these.
[17:50:33] :)
[17:50:35] Sure !
[17:50:43] Please ask if you have any questions :)
[17:53:50] (CR) Madhuvishy: Add get pageview_info udf and underlying functions (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/214349 (owner: Joal)
[17:53:59] joal: sure
[17:58:32] joal: so is_pageview true is the only check for mobile app pageviews? or should i also filter for access_method = 'mobile app'
[17:58:50] You can do both
[17:59:12] But it's not necessary in that specific case because wmfuuid is only set for interesting rows
[17:59:28] madhuvishy: --^
[17:59:34] joal: aaah
[17:59:35] right
[17:59:38] :)
[17:59:46] data-optimised code ;)
[18:00:26] joal: so - there's this filterBadData method - do you think I can get rid of that?
[18:00:48] rather, I'm asking how to replace that with a dataframe-based filter
[18:00:58] because that's depending on the rdd way
[18:01:04] SchemaRDD*
[18:01:06] right
[18:01:13] I get it :)
[18:02:06] Let's make an etherpad
[18:02:32] phew, hey you two, i'm poking around in spark trying to get it to do what I want, but am not getting very far.
[18:02:35] how can I help you right now?
[18:02:44] http://etherpad.wikimedia.org/p/spark_pair_coding
[18:04:18] joal: I'm on line 172
[18:04:37] Madhu, can you follow me? ok
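A sketch of joal's three suggestions above (use the is_pageview field, extract wmfuuid from the map in the query, filter before any union), with assumed column names — is_pageview, access_method, x_analytics_map, ts — since the actual webrequest schema and the app-session job code aren't shown in the log:

    // One hour of mobile data, filtered as early as possible so that any later
    // union only ever sees the rows we actually need.
    val hourDf = sqlContext.parquetFile(
      "/wmf/data/wmf/webrequest/webrequest_source=mobile/year=2015/month=5/day=20/hour=0")
    hourDf.registerTempTable("wr_hour")

    // Assumes sqlContext is backed by a HiveContext (as in the cluster spark-shell),
    // so that map access with ['wmfuuid'] is available in SQL.
    val appPageviews = sqlContext.sql("""
      SELECT x_analytics_map['wmfuuid'] AS uuid, ts
      FROM wr_hour
      WHERE is_pageview
        AND access_method = 'mobile app'
        AND x_analytics_map['wmfuuid'] IS NOT NULL
    """)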
[18:05:11] joal: madhuvishy? i was going to help yall with stuff today, am sorta not getting very far with what I was aiming at. so i'm going to give up. The stuff joal and I discovered about partition pruning is sorta helpful, but i'm not sure how helpful
[18:05:51] so, my q is: is there something I can do that is not what you are doing now? I know you are trying to make your code more efficient, but we still shouldn't be getting OOMs when just loading up data, right?
[18:05:58] ottomata: Hmmm, we're pairing on the optimizations joal suggested
[18:06:00] is that still happening now that we know how to use partition pruning?
[18:06:03] or does it not matter?
[18:06:34] it does matter
[18:06:41] ottomata: hmmm, i haven't tried it - can we just load data up to /webrequest and it'll figure it all out?
[18:06:49] Because loading too much data in RAM is not gonna make it :)
[18:07:08] madhuvishy: unfortunately it doesn't work :(
[18:07:22] joal: aah
[18:07:54] Well, it could work, but it first tries to load every single parquet schema, and reconstruct a common one from that
[18:08:06] And we have MANY partitions to read :)
[18:08:43] joal: hmmm right. maybe i can just give it two months
[18:08:43] yeah, i was just trying to see if i could specify the schema ahead of time
[18:08:51] i can't figure out a way to do that
[18:09:11] ottomata: hmmm, i don't think i understand that
[18:09:37] ok so, the new problem is this
[18:09:39] when doing something like
[18:09:57] we want to do:
[18:09:58] val webrequestParquet = sqlContext.parquetFile("/wmf/data/wmf/webrequest/")
[18:10:14] webrequestParquet.registerTempTable("webrequest")
[18:10:22] val d = hiveContext.sql("SELECT uri_host, uri_path FROM webrequest WHERE webrequest_source='misc' and year=2015 and month=5 and day=27 and hour=0 limit 10");
[18:10:32] and have it figure out which data to load, right?
[18:10:35] yes
[18:10:37] it should only have to look at that one hourly partition
[18:10:56] but, spark parquet stuff is trying to be really fancy
[18:11:07] parquet supports a limited schema evolution
[18:11:12] we can also do /wmf/data/wmf/webrequest/webrequest_source='misc' (not important)
[18:11:14] parquet schemas are stored in each parquet file
[18:11:38] so, spark is going into each final partition and reading the schema out
[18:11:50] and it has to do this before it can do any partition pruning
[18:15:53] joal: i'm still at a loss for how you tested
[18:15:57] partition pruning
[18:16:10] loaded one day of partitions
[18:16:20] like
[18:16:24] sqlContext.parquetFile("/wmf/data/wmf/webrequest/webrequest_source=mobile/year=2015/month=5/day=20")
[18:16:25] ?
[18:16:28] right
[18:16:34] then filter
[18:16:36] ok, hm, was hoping to do a month
[18:16:41] you can try
[18:16:46] that gave me Premature EOF: no length prefix available
[18:16:55] :(
[18:16:56] which i *think* is due maybe to a datanode timeout? not sure where that comes from
[18:17:06] more driver memory ?
[18:17:15] that isn't an OOM
[18:17:26] i'm running with 1024M
[18:20:24] :(
[18:20:31] Madhu
[18:20:36] madhuvishy: sorry ;)
[18:20:41] You good with that ?
[18:20:55] joal: yeah, i can remove all those lines now?
[18:21:54] Now your userSessionAll is straight from the data :)
[18:22:04] Yes you have it
[18:22:12] I'll let you test
[18:22:18] And go for dinner in the meantime !
[18:22:40] You see, now, for each file read, you first remove A LOT of its data :)
[18:22:48] and then union with the others :)
[18:22:59] Arf :)
[18:23:06] Going to eat !
[18:23:09] Later :)
[18:23:15] joal: cool thanks :)
[18:24:25] joal: i even get Premature EOF: no length prefix available for just one day :(
[18:25:13] i am very worried about spark :(
[18:25:27] was talking to my twitter friend, he was recommending we don't use it for production :(
[18:33:28] ok, madhuvishy, let me know how I can help I guess
[18:33:47] if you have something 100% runnable right now, I can try to run it, but if you are already adjusting it for performance i should wait I guess?
[18:34:04] ottomata: :O have you seen this? https://spark-summit.org/2015/
[18:34:21] ottomata: Hmmm, we changed a bunch of things and I have some errors because of it
[18:34:27] ah cooool
[18:34:28] still trying to figure it out
[18:34:38] i will be in sf june 3 - 11
[18:34:40] will miss it!
[18:35:27] ottomata: ha. i want to go but too much monies
[18:35:31] ayyyye
[18:35:45] but yay you are in SF
[18:35:48] HMMM
[18:35:55] i think my Premature EOFs are just warnings
[18:36:50] COOL
[18:36:51] actually
[18:36:53] madhuvishy:
[18:36:55] i think it does work.
[18:37:03] it is just really slow to read the schemas
[18:37:03] ottomata: oh?
[18:37:08] but, not so slow as to be unusable for batch jobs
[18:37:11] yeah, maybe 5 minutes to read?
[18:37:16] i did
[18:37:17] val webrequestParquet = sqlContext.parquetFile("/wmf/data/wmf/webrequest/webrequest_source=mobile/year=2015/month=5")
[18:37:19] webrequestParquet.registerTempTable("wr")
[18:37:21] val d = sqlContext.sql("SELECT uri_host, uri_path FROM wr WHERE webrequest_source='mobile' and year=2015 and month=5 and day=20 and hour=0 limit 10");
[18:37:23] and
[18:37:26] d.take(10)
[18:37:27] and see
[18:37:29] 15/05/29 18:36:45 INFO ParquetRelation2: Reading 0.14534883720930233% of partitions
[18:37:31] so that is cool!
[18:37:41] got results
[18:38:10] ottomata: ooh. nice. we still need to go from that to 30 days!
[18:38:17] but yeah, nice!
[18:38:49] yeah, but at least you don't have to do the union stuff now
[18:38:50] if that works
[18:38:56] and you don't have to map to partition file paths
[18:39:05] lemme see how long it takes for a year of mobile
[18:40:00] if this actually works
[18:40:05] you could just do
[18:40:15] basePath = "..."
[18:40:23] parquetFile(basePath)
[18:40:36] and then rely on partition predicates for the rest of the pruning
[18:41:07] don't have to figure out how to map something like "where day between 1 and 30" to actual directories
[18:41:14] ottomata: yup.
[18:43:16] ottomata: joal I'm stepping out for lunch and will be back to work on this in a bit.
[18:43:25] ok. cool
[18:43:49] joal: I have more errors from line 265 on in the etherpad. still trying to figure out how to change that. brb
[19:12:59] nope, can't load the full thing.
[19:13:01] hm.
[19:13:26] well at least we know you can load a month
[19:14:21] hmm, well no sorry, it is still running...
[19:14:21] hm
[19:15:21] madhuvishy: with dataframes, repartition instead of coalesce :)
[19:15:52] ottomata: do you prefer us not to go too much in the spark direction ?
[19:16:07] I read the comment from your friend :(
[19:18:22] madhuvishy: or extract the rdd at userSessionAll
[19:19:14] joal: i dunno.
[19:19:26] i think their data is much larger than ours, but even so, we are already having this much trouble
[19:19:41] ottomata: True ...
[19:19:53] We could go with scalding if you prefer
[19:19:58] maybe for aggregations that have to work on larger datasets, if they are not too complex, we should stick with hive
[19:20:03] haha, i actually have been trying that some today
[19:20:10] scalding documentation is SO BAD
[19:20:16] never read it
[19:20:24] I have used cascading
[19:23:49] joal, i'm going to poke around a little more with scalding maybe right now, see if i can actually load a webrequest parquet file
[19:23:54] it is not clear to me how to do that atm
[19:23:59] fabian and i played with this a while ago
[19:24:02] before we used parquet
[19:24:07] k
[19:24:09] https://github.com/declerambaul/WikiScalding
[19:24:19] worked on the raw json stuff
[19:24:29] k
[19:24:52] For the moment, I'll stick to spark (works fine for me, except for the brand new sql stuff)
[19:24:58] yeah, keep it for now
[19:25:04] would be a big investment to change
[19:25:16] i think there's a lot more code infrastructure we'd need for scalding
[19:25:23] need case classes for data sources, etc.
[19:25:59] hm
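A hypothetical helper for the "map a day range to actual directories" problem mentioned above, producing one path per daily partition to hand to parquetFile; the layout string matches the paths seen in the log, and java.time is just one convenient way to do the date arithmetic:

    import java.time.LocalDate

    // One path per day, counting back nDays from endDate (inclusive).
    def webrequestDayPaths(source: String, endDate: LocalDate, nDays: Int): Seq[String] =
      (0 until nDays).map { i =>
        val d = endDate.minusDays(i)
        s"/wmf/data/wmf/webrequest/webrequest_source=$source/year=${d.getYear}/month=${d.getMonthValue}/day=${d.getDayOfMonth}"
      }

    val paths = webrequestDayPaths("mobile", LocalDate.of(2015, 5, 25), 30)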
[19:26:28] joal: the fact that I can load a month of data using parquetFile is good.
[19:26:35] Yes, agreed
[19:26:42] it took like 5ish minutes to get back to me as it was reading the schemas
[19:26:43] but that's not so bad
[19:26:49] We still need to ensure it can work with it without failing
[19:27:07] 'Cause I wonder how it handles filtering and so on
[19:30:00] btw joal, in case you didn't know, you can pass multiple paths to parquetFile
[19:30:12] Didn't know that, no
[19:30:16] That's nice !
[19:30:24] yup, so hopefully, no unioning needed at all
[19:30:32] just pass all the paths of the last 30 days
[19:33:50] ottomata: absolutely not sure
[19:33:57] depends on how filtering is done
[19:34:03] if it's done file by file, ok
[19:34:08] if not, no good
[19:35:23] ?
[19:35:37] filtering?
[19:35:39] do you mean the partition pruning?
[19:35:42] or the filter pushdown?
[19:35:51] jo^
[19:35:54] filter pushdown I think
[19:35:55] joa^
[19:35:57] ah
[19:36:00] we haven't tried that yet, right?
[19:36:06] ideally we could use it either way
[19:36:17] so far the trouble i've been having is actually reading the schemas
[19:36:22] meaning: if spark tries to load all data before filtering it, no good
[19:36:48] i guess, hopefully it can just do the filtering in executors though
[19:36:51] i don't think it will try to load the data
[19:37:02] I do hope so
[19:37:08] But before testing ...
[19:40:59] joal: ottomata back
[19:41:11] hey madhuvishy
[19:41:35] check the backlog a bit, you should find it :)
[19:44:20] joal: yeah reading :)
[19:50:41] joal: yeah, so if i keep it as an rdd - it says combineByKey can't be resolved
[19:51:54] have more info ?
[19:53:01] joal: yeah
[19:53:04] https://www.irccloud.com/pastebin/qE3zY7o9
[19:53:39] it says combineByKey cannot be resolved. same for foldLeft.
[19:55:54] ottomata: so parquetFile takes a comma-separated String of paths?
[19:58:14] yes
[19:58:34] ottomata: cool i will try that.
[20:02:03] rats, joal, i don't think spark.sql.parquet.filterPushdown works
[20:02:14] hmmm
[20:02:34] it doesn't like it when I use partition columns in the predicate
[20:02:45] java.lang.IllegalArgumentException: Column [day] was not found in schema!
[20:02:56] trying it without those, seeing if it filters on is_pageview
[20:16:41] joal: so userSessionsAll used to be a list of tuples. now it's a dataframe
[20:20:16] Analytics-Kanban, Analytics-Wikimetrics: Wikimetrics crashes when cohort description has special characters - https://phabricator.wikimedia.org/T100781#1321861 (Halibutt) Just to make it clear, "special characters" means most likely any non-Latin 1 character.
[20:22:47] madhuvishy: sorry, baby called :(
[20:23:00] joal: yeah i thought so :) it's okay!
[20:23:20] So for rdd, ask for the rdd part of the dataframe :)
[20:23:34] val userSessions = userSessionsAll.rdd.coalesce(100)
[20:23:37] or
[20:23:43] val userSessionsAll = getAllParquetData(webrequestPaths, sqlContext).rdd
[20:23:56] is what I have
[20:24:19] That's not what there is on the etherpad
[20:24:24] But I trust you :)
[20:24:36] joal: oh yeah, I changed the code though
[20:24:57] So now, with the rdd, what happens?
[20:25:23] many other errors. do you want to batcave real quick?
[20:25:31] sharing my screen will help maybe
[20:25:39] k
[20:45:33] ottomata: Joseph helped me get all the optimizations done. I also removed the union and am passing parquetFile a list of paths. Gonna test it now.
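A sketch of the .rdd workaround discussed above, reusing the wr_hour temp table registered in the earlier sketch; the query shape (uuid plus timestamp per app pageview) and the column names are assumptions about the app-session job, not its actual code:

    import org.apache.spark.sql.Row

    // combineByKey and friends live on pair RDDs, not on DataFrames,
    // so drop down to the underlying RDD[Row] first.
    val userSessionsAll = sqlContext.sql("""
      SELECT x_analytics_map['wmfuuid'] AS uuid, CAST(ts AS STRING) AS ts
      FROM wr_hour
      WHERE is_pageview AND x_analytics_map['wmfuuid'] IS NOT NULL
    """)

    val sessionsByUser = userSessionsAll.rdd
      .coalesce(100)
      .map { case Row(uuid: String, ts: String) => (uuid, ts) }
      .combineByKey(
        (t: String) => List(t),                          // start a new per-user list
        (acc: List[String], t: String) => t :: acc,      // add a timestamp to an existing list
        (a: List[String], b: List[String]) => a ::: b)   // merge lists from different partitions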
[20:48:22] (PS4) Madhuvishy: [WIP] Productionize app session metrics - Parse args using scopt - Move away from HiveContext to reading Parquet files directly - Change reports to run for last n days instead of daily or monthly (not sure if this is gonna work yet) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/212573 (https://phabricator.wikimedia.org/T97876)
[21:01:48] ottomata: this is what i get when i pass it a comma-separated string of paths
[21:01:51] Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://analytics-hadoop/wmf/data/wmf/webrequest/webrequest_source=mobile/year=2015/month=5/day=16, /wmf/data/wmf/webrequest/webrequest_source=mobile/year=2015/month=5/day=17, /wmf/data/wmf/webrequest/webrequest_source=mobile/year=2015/month=5/day=18,
[21:01:51] /wmf/data/wmf/webrequest/webrequest_source=mobile/year=2015/month=5/day=19, /wmf/data/wmf/webrequest/webrequest_source=mobile/year=2015/month=5/day=20
[21:11:31] hmmm
[21:11:59] ah
[21:12:01] madhuvishy:
[21:12:03] madhuvishy: Pass a list of strings instead of one string
[21:12:05] it takes variable args
[21:12:07] no, not a list
[21:12:12] if your paths is a List
[21:12:14] so
[21:12:16] normally you should do
[21:12:23] parquetFile("file1", "file2"
[21:12:25] )
[21:12:27] if you have a List
[21:12:30] or Seq or something
[21:12:31] you can do
[21:12:38] parquetFile(paths:_*)
[21:12:46] wo
[21:12:55] parquetFile(paths:_*)
[21:12:56] gr
[21:13:01] paths:_*
[21:13:07] ottomata: ooh i didn't know how to do that.
[21:13:08] parquetFile( paths:_* )
[21:13:09] yeah
[21:13:18] i had to look it up earlier today :)
[21:13:32] ottomata: :) cool i'll try that in a bit
[21:14:09] even without that, i'm able to run it now for 2 days without heap space errors. so moving towards victory
[21:16:13] woot!
[21:16:21] Nice madhuvishy :)
[21:16:30] Victory before the weekend :)
[21:16:37] \o/
[21:17:00] ottomata: backfilling started in spark-shell
[21:17:03] ottomata: joal|night :D Gonna hit it with 15/30 days next. let's see :)
[21:17:11] please don't kill it :)
[21:17:27] :)
[21:19:04] real bed time now :)
[21:19:09] Ciao folks
[21:21:32] (PS3) Joal: Add get pageview_info udf and underlying functions [analytics/refinery/source] - https://gerrit.wikimedia.org/r/214349
[21:26:22] cya joal|night
[21:52:01] Analytics-Kanban, Analytics-Wikimetrics, Community-Wikimetrics, Easy, Need-volunteer: "Create Report" button does not appear when uploading a new cohort - https://phabricator.wikimedia.org/T95456#1322197 (madhuvishy) Hi @Abit, could you check if this problem still happens - I did a quick test an...
[22:24:26] ottomata: tried with 30 days - failed. (application_1430945266892_38545). I tried 4 days before that - which was fine. trying 15 now.
[22:25:30] spark-submit --master yarn --driver-memory 1500M --num-executors=12 --executor-cores=2 --executor-memory=2g --class org.wikimedia.analytics.refinery.job.AppSessionMetrics --verbose /home/madhuvishy/workplace/refinery-source/source/refinery-job/target/refinery-job-0.0.12-SNAPSHOT.jar -o /user/madhuvishy/tmp/ -y 2015 -m 5 -d 25 -n 30
[22:25:37] is what i ran.
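The parquetFile(paths: _*) fix ottomata spells out above, as a small sketch; the day paths are the same ones from the FileNotFoundException, and could equally come from a helper like the one sketched earlier:

    // parquetFile is a varargs method, so expand the Seq with : _*
    // instead of joining the paths into one comma-separated string.
    val paths: Seq[String] = Seq(
      "/wmf/data/wmf/webrequest/webrequest_source=mobile/year=2015/month=5/day=16",
      "/wmf/data/wmf/webrequest/webrequest_source=mobile/year=2015/month=5/day=17",
      "/wmf/data/wmf/webrequest/webrequest_source=mobile/year=2015/month=5/day=18")

    val webrequest = sqlContext.parquetFile(paths: _*)   // not parquetFile(paths.mkString(","))
    webrequest.registerTempTable("webrequest_days")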