[03:25:16] (CR) Milimetric: "Some javascript gotchas pointed out in the bindings. Setting the id attribute is the only real problem, the rest could wait. The perform" (23 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/214036 (https://phabricator.wikimedia.org/T91123) (owner: Mforns) [03:29:35] Analytics-Cluster, Fundraising Sprint Kraftwerk, Fundraising Sprint Lou Reed, Fundraising Tech Backlog, operations: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1328015 (Ottomata) In those cases, there are more requests in kafkatee than in udp2log... [05:24:37] Analytics, Engineering-Community, ECT-June-2015, ECT-May-2015: Analytics Team Offsite - Before Wikimania - https://phabricator.wikimedia.org/T90602#1328076 (Rfarrand) [05:49:37] Quarry: Add list of query executions to the query page side-bar - https://phabricator.wikimedia.org/T100982#1328147 (Abarcomb) This request arose out of a discussion at a workshop. In my drawing, I saw modifications of the same query as branches, while completely new queries formed new top-level nodes. Of c... [08:44:11] Analytics, Tool-Labs-tools-Other: Work on Metrics tools wm-metrics and MediaCollectionDB, refactoring and code quality. - https://phabricator.wikimedia.org/T100710#1328387 (Qgil) [08:44:22] Analytics, Tool-Labs-tools-Other: Work on Metrics tools wm-metrics and MediaCollectionDB, refactoring and code quality. - https://phabricator.wikimedia.org/T100710#1328388 (Qgil) a:JeanFred [11:27:15] hi pginer. [11:27:49] hi lzia [11:28:24] pginer, do you know if we have a page that very briefly explains how users should use ContentTranslation, something like a super simple tutorial? [11:29:00] I want to make one for the recommendation test, since users will go to ContentTranslation directly without much background. I thought I'd double-check to make sure it doesn't already exist [11:29:49] I think amir created something [11:30:00] Let me paste what I can find here [11:30:07] thanks pginer. [11:31:38] Amir created this document: https://www.mediawiki.org/wiki/Content_translation/Documentation/User_guide [11:31:58] We also have a screencast: https://youtu.be/nHTDeKW3hV0 [11:32:46] That includes enabling the beta feature. There is a shorter version focusing just on the translation editor: https://youtu.be/Ed2Ke_RLqOo [11:32:56] perfect. thank you pginer. [11:33:11] ok, no problem [11:47:00] (CR) Mforns: [C: 2 V: 2] "LGTM" [analytics/dashiki] - https://gerrit.wikimedia.org/r/212454 (owner: Milimetric) [12:45:58] (PS5) Joal: Add get pageview_info udf and underlying functions [analytics/refinery/source] - https://gerrit.wikimedia.org/r/214349 [12:46:27] (CR) Mforns: [C: 2 V: 2] "LGTM" [analytics/dashiki] - https://gerrit.wikimedia.org/r/212467 (owner: Milimetric) [12:46:52] (CR) Joal: Add get pageview_info udf and underlying functions (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/214349 (owner: Joal) [12:51:05] halfak: I have no update either, didn't have time to push the metadata analysis further [12:51:12] Shall we cancel ? [13:00:36] morning all [13:03:24] Hi milimetric [13:03:33] hi joal [13:04:41] hey joal, if I wanted to run that top articles job for 15 days instead of 30, do you think that's feasible with the cluster situation right now? [13:05:22] it is feasible for sure :) [13:05:27] I'm asking, I think, both against the raw data, and against the hourly aggregate once that finishes [13:05:48] I mean, will you guys let me run it and will it not bring down the cluster?
[13:06:39] milimetric: let's batcave and talk about that [13:09:28] o/ milimetric & joal [13:09:39] I'm very briefly inbetween things and wanted to say "hi" :) [13:09:59] Hi "inbetween-man" :) [13:10:40] I hope to have time to make some progress on metadata analysis by the end of week [13:10:46] Depending on how pageviews work [13:10:50] We'll see [13:11:04] I've lost milimetric :( [13:11:25] * joal is hanging around like a dreadful soul [13:12:48] joal: :( sorry [13:12:57] I was spacing out and my pings aren't working [13:12:59] huhu :) [13:13:30] ok, restored pings. It's 2015 and we still have to "did you try reloading the page?" [13:13:41] joal: coming to the batcave [14:15:03] (CR) Mforns: [C: 2 V: 2] "LGTM! There are 2 console.logs, but I've seen you removed them in the next change." [analytics/dashiki] - https://gerrit.wikimedia.org/r/212800 (owner: Milimetric) [14:21:52] thanks mforns_ I checked the last commit in that string again and there are no loose console.logs anywhere [14:22:04] btw, the ag command line code search tool is Awesome [14:22:09] "silver searcher" [14:22:17] i'm not exactly sure how I lived without it [14:22:23] milimetric, yes, no problem [14:22:36] ? [14:23:16] milimetric, looking to ag [14:25:09] MORNING [14:25:11] hellloooo [14:26:32] hi! [14:30:36] Hullllo ottomata [14:30:43] hyeaaa [14:30:45] Have a minute for meeeee real quick ? [14:30:48] sure! [14:30:52] for you i have many minutes [14:30:55] batcave :) [14:44:08] joal, i think you missed my question about using a lib to extract the query params rather than parsing them ourselves [14:44:18] hm, nope :) [14:44:25] I added a comment about that :) [14:44:39] ja but I recommented because i think we are misunderstanding each other [14:44:46] i'm asking about actually extracting the commands [14:44:47] sorry [14:44:49] the params [14:44:52] as in [14:44:58] map = getParams(uri_query) [14:45:06] map['page_title'] [14:45:10] I need to decode the parameters in a specific way, and extracting includes decoding [14:45:15] oh [14:45:18] hm [14:45:22] makes sense ? [14:45:27] I know, it's ugly [14:45:28] because you need to decode special you can't extract? [14:45:32] using lib? [14:45:34] because lib does decoding? [14:45:40] That's the thing [14:45:50] Or at least the lib I was using [14:46:05] apache.httpclient [14:46:12] something like that [14:46:48] hm, ok [14:46:53] :( [14:46:57] hm, well you still have some unused imports [14:47:04] org.apache.http.NameValuePair; [14:47:05] A LOT of data cleansing in there [14:47:11] URLEncodedUtils [14:47:16] Ah shit, forgot those [14:47:20] Will patch [14:50:09] Anything else before I resubmit ? [14:50:14] ottomata: --^ [14:51:31] hm, just so i understand the dialect thing [14:51:48] if the uri_path looks like [14:52:03] /xx-xxx/yyyy [14:52:12] yup [14:52:13] return xx-xxx [14:52:16] if it looks like [14:52:18] /xx/yyyy [14:52:22] return default value? [14:52:27] yessir [14:52:32] iiinteresting [14:52:36] so will [14:52:41] examples I have are mostly for zh [14:52:42] /zh/Wikipedia... [14:52:53] will that have zh.wikipedia.org as uri host? [14:52:55] as well as [14:53:02] /zh-hk/... [14:53:12] likely those will both be in zh.wikipedia project? [14:53:13] zh is really strange from a host perspective [14:53:38] there are dialects in hosts for zh, as well as dialects in folders [14:53:41] hey milimetric, yt? [14:53:50] weird, ok!
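
For readers following along: the folder-based dialect rule joal confirms above (/xx-xxx/yyyy yields the dialect, /xx/yyyy falls back to a default) can be sketched as a tiny function. This is a minimal illustration, not the actual refinery UDF; the regex and the default value here are assumptions.

    object DialectSketch {
      // First path segment counts as a dialect only when it contains a hyphen,
      // e.g. "/zh-hk/..." -> "zh-hk"; a plain "/zh/..." falls back to a default.
      private val FolderDialect = "/([a-z]+-[a-z]+)/.*".r

      def getDialect(uriPath: String, default: String = "default"): String =
        uriPath match {
          case FolderDialect(dialect) => dialect
          case _                      => default
        }

      def main(args: Array[String]): Unit = {
        println(getDialect("/zh-hk/Wikipedia"))  // zh-hk
        println(getDialect("/zh/Wikipedia"))     // default
      }
    }
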
[14:53:51] heh [14:54:08] hey ottomata [14:54:44] So decision was taken with Ironholds to use only folder as dialect info, and leave host info as project [14:55:27] ok cool [14:55:33] milimetric: just a sanity check for me [14:55:37] is page_title the best name? [14:55:47] that's what I told joal, but i want to confirm tht [14:55:48] that [14:55:53] vs. [14:56:03] article, page, page_name, whatever [14:56:03] lemme think about it for a sec too [14:56:10] what's the field in the db? [14:56:14] just page? or page_title [14:56:21] i had remembered page_title, but i want to be sure [14:56:47] page_title is cool 'cause it's the same in the mediawiki db [14:56:56] good, that's kinda what i wanted [14:57:07] we are going to use that more, and i think it would be good to standardize it where we can [14:57:11] I don't always love the legacy names, but i think in this case it makes sense [14:57:14] yeah [14:57:16] its a fine name [14:57:27] cool, ok [14:57:30] joal, ja, lgtm [14:57:36] send your patch and I will +2 [14:57:42] cool [14:57:56] Let's wait for Ironholds review as well before merging :) [14:58:15] (PS6) Joal: Add get pageview_info udf and underlying functions [analytics/refinery/source] - https://gerrit.wikimedia.org/r/214349 [15:00:35] joal: madhuvishy, i think the increase in executors solved exactly this: https://spark.apache.org/docs/1.3.0/tuning.html#memory-usage-of-reduce-tasks [15:02:07] (CR) Ottomata: [C: 2 V: 2] Add get pageview_info udf and underlying functions [analytics/refinery/source] - https://gerrit.wikimedia.org/r/214349 (owner: Joal) [15:02:24] Thx ottomata [15:03:28] joal: could you check out my comments here when you get a chance: [15:03:29] https://gerrit.wikimedia.org/r/#/c/212573/4/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/AppSessionMetrics.scala [15:03:36] mostly about -h for scopt and hdfsUriRoot [15:04:02] i could be wrong, but it seems like we should be able to get a FileSystem object from defaults in .xml files, dunno [15:05:23] ottomata: will double check [15:06:45] joal, you can do [15:06:46] https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/Path.html [15:06:54] p = new Path("hdfs://pathtofile") [15:07:05] fs = p.getFileSystem(conf) [15:07:13] ah i will paste that into review for madhu [15:07:20] Makes sense [15:07:50] I used the hdfsRoot thing to mimic what we do with Oozie [15:08:34] ja but with oozie we usually just pass input and output to workflows or actions [15:08:49] the input is built from parts (hdfs uri, etc.) but the job takes just a single path [15:09:33] cool [15:09:39] I don't really mind :) [15:09:53] Simplet path expression = Happy [15:10:09] s/t/r [15:10:25] (CR) Ottomata: [WIP] Productionize app session metrics - Parse args using scopt - Move away from HiveContext to reading Parquet files directly - Change rep (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/212573 (https://phabricator.wikimedia.org/T97876) (owner: Madhuvishy) [15:12:44] Analytics-Tech-community-metrics, ECT-June-2015: Gerrit changes reviewed per month (on scr.html) - https://phabricator.wikimedia.org/T97716#1329770 (Aklapper) (Adding #ECT-June-2015 because this blocks T94578 which is a hard goal for this month) [15:12:46] Analytics-Tech-community-metrics, ECT-June-2015: Active changeset *authors* and changeset *reviewers* per month - https://phabricator.wikimedia.org/T97717#1329772 (Aklapper) (Adding #ECT-June-2015 because this blocks T94578 which is a hard goal for this month) [15:17:53] (CR) Mforns: "LGTM!
Just a comment on a comment." (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/212821 (owner: Milimetric) [15:21:53] ottomata: level of parallelism here refers to number of partitions, not number of cores [15:22:06] (PS4) Milimetric: Refactor Wikimetrics layout to use TimeseriesData [analytics/dashiki] - https://gerrit.wikimedia.org/r/212821 [15:22:17] nawww [15:22:19] how? [15:22:28] (CR) Milimetric: Refactor Wikimetrics layout to use TimeseriesData (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/212821 (owner: Milimetric) [15:22:30] i mean, both would matter, but without more executors you reach a limit [15:22:39] (PS3) Milimetric: Refactor Compare layout to use TimeseriesData [analytics/dashiki] - https://gerrit.wikimedia.org/r/213967 [15:22:44] (PS4) Milimetric: Use Dygraphs in Vital Signs [analytics/dashiki] - https://gerrit.wikimedia.org/r/214270 (https://phabricator.wikimedia.org/T96339) [15:22:53] well, you could have 100 partitions with 4 workers, or the opposite :) [15:23:04] it is talking about increasing the number of parallel tasks [15:24:02] it says: "the working set of one of your tasks" [15:24:11] task === executor [15:24:17] working set === partition [15:24:35] Spark can efficiently support tasks as short as 200 ms, because it reuses one executor JVM across many tasks and it has a low task launching cost, so you can safely increase the level of parallelism to more than the number of cores in your clusters. [15:25:01] that would be a lot of executors though, > # cores in cluster [15:25:17] I guess by default spark uses the number of executors as number of partitions to reduce on if not specified [15:25:30] Which would make sense [15:25:49] And, task != executor ! [15:25:58] right that is true [15:26:04] but, more executors == more parallel tasks [15:26:05] task == unit of work for an executor [15:26:29] (CR) Mforns: [C: 2 V: 2] "LGTM" [analytics/dashiki] - https://gerrit.wikimedia.org/r/212821 (owner: Milimetric) [15:26:34] "because it reuses one executor JVM across many tasks " [15:26:45] 1 executor JVM --- Many tasks [15:27:57] partitioning into small bits is better than into big ones --> facilitates the parallelization in case of non-homogeneous tasks, and prevents OOM issues [15:28:01] ottomata: --^ [15:28:26] Only concern --> writing results into small files [15:28:32] aye [15:28:46] well, can coalescing help there? [15:28:48] So, partition into small bits for execution, and then repartition into big enough bits for writing [15:29:01] For sure :) [15:29:33] aye [15:29:35] I thought there was one, but it seems to have been removed [15:29:40] that could be something to optimize [15:29:46] repartition(parallelism) [15:29:46] re-reading [15:29:57] there are a lot of knobs here, hard to know [15:30:06] so, i ran over 10 days with 40 executors and watched how busy they were [15:30:11] line 260 [15:30:16] most of the time, they each had 1 task, so that is good [15:30:38] They can't get more --> 1 task per executor per moment [15:31:41] right, but [15:31:42] joal / ottomata: standup :) [15:31:45] I'd like to know where, in the execution path, the OOM happened [15:31:51] oops sorry [15:31:53] just wanted to make sure there weren't empty executors [15:31:54] whoops [15:32:10] ottomata: makes sense [15:42:17] Analytics-Cluster, hardware-requests, operations: Hadoop worker node procurement - 2015 - https://phabricator.wikimedia.org/T100442#1329868 (Ottomata) Ok cool, noted for the future danke. How goes?
:) [15:42:58] Analytics-Cluster, hardware-requests, operations: Hadoop worker node procurement - 2015 - https://phabricator.wikimedia.org/T100442#1329872 (Ottomata) Oh, also, same number of cores please :) [15:44:39] Analytics-Cluster, Analytics-Kanban: Ooziefy and parquetize pageview intermediate aggregation using refined table fields [13 pts] {wren} - https://phabricator.wikimedia.org/T99931#1329876 (ggellerman) a:JAllemandou [15:52:44] Analytics-Tech-community-metrics, ECT-June-2015: Active changeset *authors* and changeset *reviewers* per month - https://phabricator.wikimedia.org/T97717#1329919 (Aklapper) p:Normal>High a:Dicortazar [15:52:46] Analytics-Tech-community-metrics, ECT-June-2015: Gerrit changes reviewed per month (on scr.html) - https://phabricator.wikimedia.org/T97716#1329922 (Aklapper) p:Normal>High a:Dicortazar [16:03:12] halfak: you there ? [16:06:16] Analytics-Cluster: Create current-definition/projectcounts [13 pts] {musk} - https://phabricator.wikimedia.org/T101118#1330039 (kevinator) NEW [16:07:29] Analytics-Cluster, Analytics-Kanban: Add Pageview aggregation to Python [13 pts] {musk} - https://phabricator.wikimedia.org/T95339#1330065 (kevinator) [16:08:33] Analytics-Cluster, Analytics-Kanban: Create current-definition/projectcounts [13 pts] {musk} - https://phabricator.wikimedia.org/T101118#1330076 (kevinator) [16:19:03] Analytics-Cluster, Analytics-Kanban: Add Pageview aggregation to Python [13 pts] {musk} - https://phabricator.wikimedia.org/T95339#1330118 (kevinator) [16:19:05] Analytics-Cluster, Analytics-Kanban: Create current-definition/projectcounts [13 pts] {musk} - https://phabricator.wikimedia.org/T101118#1330117 (kevinator) [16:23:04] milimetric, madhuvishy, joal, can anyone of you please review this patch: https://gerrit.wikimedia.org/r/#/c/215200/ ? It fixes a quite critical bug that we introduced in the last utf-8 changes and that affects a lot of wikimetrics users. [16:23:25] i'll do it mforns [16:23:28] When you merge it, I will deploy it asap! [16:23:42] thx milimetric [16:23:45] thanks milimetric [16:23:51] I can go for it, but will take longer ! [16:24:45] (PS6) Madhuvishy: [WIP] Productionize app session metrics - Parse args using scopt - Move away from HiveContext to reading Parquet files directly - Change reports to run for last n days instead of daily or monthly (not sure if this is gonna work yet) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/212573 (https://phabricator.wikimedia.org/T97876) [16:24:55] ottomata: pushed latest code. [16:24:58] mforns: you've got a flake8 error, besides that it looks good to me [16:25:08] i'm gonna run the migration and test locally [16:25:14] milimetric, oh gosh [16:25:47] ottomata: and the command I ran was: spark-submit --master yarn --driver-memory 1500M --num-executors=60 --executor-cores=1 --executor-memory=2g --class org.wikimedia.analytics.refinery.job.AppSessionMetrics --verbose /home/madhuvishy/workplace/refinery-source/source/refinery-job/target/refinery-job-0.0.12-SNAPSHOT.jar -o /user/madhuvishy/tmp/ -y 2015 -m 5 [16:25:47] -d 26 -n 15 [16:26:22] Analytics-Tech-community-metrics, ECT-June-2015: Present most basic community metrics from T94578 on one page - https://phabricator.wikimedia.org/T100978#1330159 (Aklapper) p:High>Low > I think this task is a nice to have but not a blocker. That makes things more relaxing, thanks. Lowering priorit... 
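
The review comments referenced earlier (a help flag for scopt, and deriving the FileSystem from the output path instead of passing a separate hdfsUriRoot) would combine roughly as below. A sketch only, assuming scopt 3.x: the flags mirror the -o/-y/-m/-d/-n options in madhuvishy's spark-submit line, but the real AppSessionMetrics code may differ.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scopt.OptionParser

    // Hypothetical params mirroring the flags in the spark-submit command above.
    case class Params(outputDir: String = "", year: Int = 0, month: Int = 0,
                      day: Int = 0, numDays: Int = 1)

    object AppSessionArgsSketch {
      val parser = new OptionParser[Params]("AppSessionMetrics") {
        help("help") text "print this usage text"  // adds a --help option
        opt[String]('o', "output-dir") required() action { (x, p) =>
          p.copy(outputDir = x) } text "full output path, e.g. hdfs://... or /user/..."
        opt[Int]('y', "year") required() action { (x, p) => p.copy(year = x) }
        opt[Int]('m', "month") required() action { (x, p) => p.copy(month = x) }
        opt[Int]('d', "day") required() action { (x, p) => p.copy(day = x) }
        opt[Int]('n', "num-days") action { (x, p) => p.copy(numDays = x) }
      }

      def main(args: Array[String]): Unit =
        parser.parse(args, Params()).foreach { params =>
          // Per the review: no separate hdfsUriRoot -- the FileSystem comes
          // from the path itself (falling back to defaults in the .xml configs).
          val path = new Path(params.outputDir)
          val fs: FileSystem = path.getFileSystem(new Configuration())
          println(s"writing under $path via $fs")
        }
    }
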
[16:26:33] (PS2) Mforns: Fix cohort description utf8 bug [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/215200 (https://phabricator.wikimedia.org/T100781) [16:26:58] milimetric, pushed a new patch fixing flake8 [16:28:19] mforns: I have a call now, so not able to pull and test it, but looked at the code and it looks good to me [16:28:37] madhuvishy, thanks! don't worry [16:29:07] I will deploy it in staging before production [16:29:34] (CR) Milimetric: [C: 2] Fix cohort description utf8 bug [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/215200 (https://phabricator.wikimedia.org/T100781) (owner: Mforns) [16:29:47] milimetric, thank you :] [16:29:48] mforns: nice, tests fail before migration, work after [16:29:50] I love alembic [16:30:01] cool [16:30:19] ottomata, in in 1 [16:30:21] ...okay, 2 [16:31:20] madhuvishy: thanks, will try after this meeting.. [16:31:47] Ironholds: helloooo [16:32:13] https://plus.google.com/hangouts/_/wikimedia.org/domain-info [16:34:35] Ironholds: ? [16:34:41] ottomata, see above ;p [16:34:52] oh you'd be here [16:34:52] sorry [16:34:56] thought you were counting me down [16:44:06] Analytics-Tech-community-metrics, ECT-June-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1330263 (Aklapper) [16:44:07] Analytics-Tech-community-metrics, ECT-June-2015: Present most basic community metrics from T94578 on one page - https://phabricator.wikimedia.org/T100978#1330264 (Aklapper) [16:44:18] Analytics-Tech-community-metrics, ECT-June-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1167361 (Aklapper) [16:44:19] Analytics-Tech-community-metrics, ECT-June-2015: Present most basic community metrics from T94578 on one page - https://phabricator.wikimedia.org/T100978#1325337 (Aklapper) [17:01:39] Analytics-Kanban: Add cache headers to the datasets.wikimedia.org/limn-public-data/metrics folder - https://phabricator.wikimedia.org/T101125#1330382 (Milimetric) NEW a:Milimetric [17:02:07] Analytics-Kanban: Add cache headers to the datasets.wikimedia.org/limn-public-data/metrics folder {lion} - https://phabricator.wikimedia.org/T101125#1330391 (kevinator) [17:11:43] (CR) Ottomata: [C: 2] [WIP] Productionize app session metrics - Parse args using scopt - Move away from HiveContext to reading Parquet files directly - Change rep (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/212573 (https://phabricator.wikimedia.org/T97876) (owner: Madhuvishy) [17:11:50] (CR) Ottomata: [WIP] Productionize app session metrics - Parse args using scopt - Move away from HiveContext to reading Parquet files directly - Change rep [analytics/refinery/source] - https://gerrit.wikimedia.org/r/212573 (https://phabricator.wikimedia.org/T97876) (owner: Madhuvishy) [17:16:05] Ironholds: plop [17:16:09] Forgot to ask you [17:16:23] ? 
[17:16:39] Do you mind having quick look at that code review: https://gerrit.wikimedia.org/r/#/c/214349/5 [17:22:51] Analytics-Cluster, Analytics-Kanban: Ooziefy and parquetize pageview intermediate aggregation using refined table fields [13 pts] {wren} - https://phabricator.wikimedia.org/T99931#1330548 (ggellerman) [17:22:53] Analytics-Cluster, Analytics-Kanban: Compute pageviews aggregates daily and monthly from April {wren} - https://phabricator.wikimedia.org/T96067#1330547 (ggellerman) [17:33:33] joal, got meetings and research spikes all day :( [17:33:45] mwarf :( [17:34:14] Ok nonetheless, no merge needed before some other modification from a colleague [17:34:27] Can you get a quick look tomorrow ? [17:34:33] Ironholds: --^ [17:44:09] (CR) Madhuvishy: [WIP] Productionize app session metrics - Parse args using scopt - Move away from HiveContext to reading Parquet files directly - Change rep (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/212573 (https://phabricator.wikimedia.org/T97876) (owner: Madhuvishy) [17:47:05] joal, totally! [17:47:19] madhuvishy: Can you tell me where in the execution process you got that OOM issue ? [17:47:25] Thx Ironholds :) [17:47:55] joal: Hmmm I tried looking at yarn logs but it doesn't tell me much. I can run it again and see. [17:50:22] Analytics-Kanban, Analytics-Wikimetrics, Community-Wikimetrics, Easy, Need-volunteer: "Create Report" button does not appear when uploading a new cohort - https://phabricator.wikimedia.org/T95456#1330687 (Abit) Hullo @madhuvishy, I just started another batch of Wikimetrics reports and got the sa... [17:56:32] Analytics-Tech-community-metrics: Weekly report for "Allow contributors to update their own details in tech metrics directly" - https://phabricator.wikimedia.org/T101134#1330724 (Sarvesh.onlyme) NEW a:Dicortazar [17:57:29] Analytics-Cluster, operations: Build Kafka 0.8.1.1 package for Jessie and upgrade Brokers to Jessie. - https://phabricator.wikimedia.org/T98161#1330742 (faidon) Yeah, what @Ottomata said, it's just a handful of packages and we have this running on Ubuntu as well, so how hard can it be... On the above: -... [17:58:54] Analytics-Tech-community-metrics: Weekly report for "Allow contributors to update their own details in tech metrics directly" - https://phabricator.wikimedia.org/T101134#1330752 (Sarvesh.onlyme) [17:58:56] Analytics-Tech-community-metrics, ECT-June-2015, Epic, Google-Summer-of-Code-2015: Allow contributors to update their own details in tech metrics directly - https://phabricator.wikimedia.org/T60585#1330751 (Sarvesh.onlyme) [18:00:06] Analytics-Tech-community-metrics: Weekly report for "Allow contributors to update their own details in tech metrics directly" - https://phabricator.wikimedia.org/T101134#1330724 (Sarvesh.onlyme) [18:01:21] Analytics-Tech-community-metrics: Weekly report for "Allow contributors to update their own details in tech metrics directly" - https://phabricator.wikimedia.org/T101134#1330774 (Sarvesh.onlyme) [18:02:14] Analytics-Tech-community-metrics: Weekly report for "Allow contributors to update their own details in tech metrics directly" - https://phabricator.wikimedia.org/T101134#1330724 (Sarvesh.onlyme) [18:05:43] milimetric, do you have 3 minutes? 
[18:05:57] hey mforns, I'm in a meeting now [18:05:59] but what's up [18:06:02] they're just doing intros [18:06:15] milimetric, I had to change the migration to work in staging [18:06:41] milimetric, I wanted to ask you a couple of things to make sure this won't mess up production [18:07:24] milimetric, but I can wait until you finish the meeting, no rush! [18:08:35] mforns: sure, ask away [18:08:38] i'll answer as i can [18:08:45] ok [18:10:02] so, when I run the migrations on staging, the values of the description column that contained special characters were migrated like this "abc???def", so there was an encoding problem in migrating them [18:10:35] so I looked at the current collation of the column, and it was latin_swedish_ci [18:11:20] I added a couple lines of code to the migration to convert the collation of the column first and then change to binary [18:11:30] and this worked in the end [18:12:06] but looking at the collation in production, the description column is a utf8_general_ci already [18:12:30] so, I wonder if the lines that I added to the migration will be necessary/harmful in production? [18:12:53] as the dbs of staging vs production differ in that aspect [18:32:28] EOD time ! [18:32:36] Have a good end of day guys :) [18:32:43] See you tomorrow [18:34:58] night joal|night ! [18:37:10] ottomata: https://spark.apache.org/docs/latest/tuning.html I was reading this yesterday [18:37:38] and wondering if my kryo serializer needs more memory [18:38:09] it was 24. I bumped it up to 1024 for fun, to see what'd happen. [18:44:48] mforns: you have two choices [18:45:00] 1. restore the production database from the backup zip on /data/project/... [18:45:04] * mforns is hearing [18:45:19] 2. try it in a controlled test on staging [18:45:31] (downgrade, change the collation, add some weird names, upgrade) [18:45:46] milimetric, I already did number 2. [18:45:54] Personally, I'd be happy with 2. because this isn't critical data and we have it backed up anyway [18:46:14] but if you want to do 1. I can help you [18:46:35] madhuvishy: did it make a difference? [18:46:44] milimetric, I'd say that the original migration will work in production [18:46:58] mforns: that sounds very likely to me [18:47:12] ottomata: nah. i'm getting these errors for 15 days - http://pastebin.com/ZuehW4uC [18:47:16] why the hell that was latin_swedish_ci ... lol, I'm sure I'll never find out [18:47:18] milimetric, cool, I'll go for the deployment then, thanks! [18:47:34] np, let me know if you need any help [18:50:36] ottomata: this looks kiiinda similar - http://apache-spark-user-list.1001560.n3.nabble.com/setting-heap-space-td16245.html [18:52:04] ya i see that too madhuvishy [18:52:44] madhuvishy: this may help [18:52:45] - Always specify the level of parallelism while doing a groupBy, reduceBy, join, sortBy etc. [18:52:48] ottomata: hmmm. so it failed.
these are the 4 errors i got - http://pastebin.com/AzNpd4ps (joal|night) [18:53:03] ottomata: yeah, i saw that too [18:56:11] ottomata: although i dont see any of that in the code [18:56:26] there is a reduce [18:56:38] but ya it doesn't seem to be OOMing there [18:56:39] hm [18:56:45] oh [18:56:46] yes it is [18:56:54] Job 0 failed: reduce at AppSessionMetrics.scala:54, too [18:57:43] (PS7) Madhuvishy: [WIP] Productionize app session metrics - Parse args using scopt - Move away from HiveContext to reading Parquet files directly - Change reports to run for last n days instead of daily or monthly (not sure if this is gonna work yet) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/212573 (https://phabricator.wikimedia.org/T97876) [18:57:44] aah [18:58:52] not sure though madhuvishy, not exactly sure how to get more partitions during reduce, how many keys do you think there are here? [19:00:34] ottomata: i'm not sure either. also may be the combineByKey is problematic? [19:02:02] hm, maybe, why lower partitions to 100? [19:02:04] just curious [19:02:23] milimetric, it went well, will send an email to the list [19:02:46] ottomata: i actually dont know, it says so in the code. [19:02:58] ha ok [19:03:22] i dint touch those parts - so i dont have a lot of context too [19:03:37] mforns: do you know [19:03:38] ? [19:03:53] // lowering in 1 order of magnitude the number of partitions for this job [19:03:54] // logs list 1500 partitions for the original dataset [19:03:54] val userSessions = userSessionsAll.coalesce(100) [19:04:45] ottomata, reading [19:05:04] madhuvishy: i will try to look into this a lot more soon too: [19:05:05] https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation [19:05:14] looks like it requires some nodemanager daemon changes [19:05:18] won't get to that today though [19:05:56] ottomata, madhuvishy, I remember Nuria reducing the number of partitions to improve performance maybe? [19:06:25] ja, not sure at the moment, reducing parallelism might reduce performance because more mem needed [19:06:31] buuut, that is not currently where we are getting OOMs [19:07:25] hmmm [19:07:50] weird, madhuvishy, i'm running with 10 days now and also OOMing [19:07:53] but it worked yesterday... [19:08:29] ottomata: hmmm, wondering if getting the timestamps this way is not helping [19:09:05] they say java objects are costly - so may be java.sql.Timestamp to long conversion is causing it [19:10:18] vs. the thing we were doing before with the udf? [19:10:26] that is a change, eh [19:10:26] ? [19:10:28] hm [19:11:08] madhuvishy: maybe also [19:11:08] http://spark.apache.org/docs/1.0.0/programming-guide.html#which-storage-level-to-choose [19:12:50] we should try MEMORY_ONLY_SER? [19:13:17] ottomata: ^ [19:14:00] ja i would try it [19:14:05] not sure where to set that though [19:14:09] .persist()? [19:14:13] ya i'm looking too [19:14:43] (CR) Ottomata: [WIP] Productionize app session metrics - Parse args using scopt - Move away from HiveContext to reading Parquet files directly - Change rep (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/212573 (https://phabricator.wikimedia.org/T97876) (owner: Madhuvishy) [19:17:41] ottomata: should we persist userSessions?
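
The knobs under discussion here (many small partitions for shuffle-heavy stages, explicit parallelism on wide operations, serialized caching, and coalescing only at write time) look roughly like this in Spark's RDD API. The numbers and paths are illustrative, not recommendations, and the real job reads parquet rather than a text file.

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    object SparkTuningSketch {
      def run(sc: SparkContext): Unit = {
        // Many small partitions keep each reduce task's working set small
        // (the OOM pattern in the tuning guide linked above).
        val raw = sc.textFile("hdfs:///tmp/example", 1500)

        // Wide operations accept an explicit partition count, per the
        // "always specify the level of parallelism" advice.
        val counts = raw.map(line => (line, 1L)).reduceByKey(_ + _, 1500)

        // Serialized caching trades CPU for memory (MEMORY_ONLY_SER).
        counts.persist(StorageLevel.MEMORY_ONLY_SER)

        // Coalesce only at write time so output files aren't tiny --
        // the trade-off behind userSessionsAll.coalesce(100) above.
        counts.coalesce(100).saveAsTextFile("hdfs:///tmp/example-out")
      }
    }
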
[19:18:06] well, at the moment we see OOMs at the reduce of nums.map(QTree(_)) [19:18:15] but, i barely even know what that is :) [19:18:28] ottomata: he he me too [19:19:33] ottomata: well i guess it's just summing up a list of nums in some special way [19:19:40] ja [19:20:07] ottomata: https://spark.apache.org/docs/latest/tuning.html#tuning-data-structures [19:20:38] may be making all those QTree objects is not good [19:21:23] ja i mean, who knows, try persist there i guess? [19:21:25] can't hurt to try :) [19:21:51] madhuvishy: btw, i'm going back a few patches to the UDF one and trying it [19:21:55] on 10 days with the same settings [19:22:06] ottomata: yeah cool. [19:28:12] ottomata: let me know if that succeeds. i'm trying the persist [19:28:16] k [19:36:45] madhuvishy: fabian suggests reducing depth of qtree [19:36:48] to 6 maybe [19:37:10] this job is running much longer with your previous patch, btw, i think you might be right about the primitive type [19:38:05] madhuvishy: can you map the ts returned by the sql to a long again? [19:38:08] in scala? [19:44:13] ottomata: yeah [19:44:43] ottomata: oh, i dont know how to do that without doing gettime [19:46:45] hm, welp, gotta be some way to get it out of the Timestamp object, right? [19:47:47] ottomata: without doing - r.getAs[java.sql.Timestamp](1).getTime ? [19:50:18] that should be ok maybe? if not we can do a udf, although i'm not sure why it would matter if we did it in sql or in scala [19:50:53] ottomata: this is what i'm already doing [19:51:02] oh... [19:51:21] oh! [19:51:22] you are. [19:51:23] hm. [19:51:35] getTime returns a Long? [19:51:40] ottomata: yeah [19:51:43] hm [19:52:01] btw, ja, madhuvishy, the code with the udf from dt finished! [19:52:05] gonna try lots of days [19:52:23] ottomata: yeah looks like java.sql.Timestamp casting is not helping [19:56:08] aye, hm. i mean, can we cast the ts in sql then? [19:58:22] madhuvishy: the job you are currently running is OOMing, yes? [19:58:39] ottomata: yeah, it's still running though - and very different errors [19:58:59] which change was that? [19:59:01] persist? [19:59:46] ottomata: yup [20:00:27] ottomata: http://pastebin.com/Zk9582vc [20:01:24] I'm persisting userSessions. [20:04:57] ja different error for sure. madhuvishy, lets back away from persist for a moment [20:05:08] will you try reducing the qtree size? [20:05:12] ottomata: yeah alright. [20:05:34] ottomata: also wondering if i can read timestamp as string and do toLong on that [20:05:53] ja, its confusing because it is a hive or parquet timestamp [20:07:32] ja it is so strange that this job seems to be ok when we convert the dt in a udf [20:07:33] so strange [20:07:37] mforns: thanks very much for following up on both those threads [20:07:48] ottomata: yeah! [20:08:12] milimetric, np, I should do that more often [20:08:18] ottomata: this thing's still running. dint say failed 4 times bubye [20:08:51] with persist? [20:08:58] ottomata: yup [20:09:02] huh, weird [20:09:19] https://yarn.wikimedia.org/proxy/application_1430945266892_47749/stages/stage?id=0&attempt=0 [20:09:21] so many failed stages! [20:09:25] /tasks [20:10:02] ottomata: yeah [20:10:35] ottomata: wondering if i should kill it or see what happens [20:10:52] hard to say, it doesn't look good though :) [20:11:07] it doesn't really seem to be proceeding atm either though [20:11:17] ottomata: yeah, i dont think it'll succeed.
interesting that it dint exit yet [20:11:18] this is during the coalesce too [20:11:23] that would be another thing to try to get rid of [20:14:08] hm. [20:14:17] my 30 day one isn't moving i think... [20:14:18] hm [20:14:40] ottomata: okay, so i'm getting rid of coalesce, reducing the qtree size to 6, and seeing if timestamps can be read as strings and converted to long [20:14:47] oo, ok, big change! [20:14:48] heheh [20:17:53] ottomata: nah, it doesn't let you read timestamps as strings [20:18:01] i should go back to udf i guess [20:18:04] hm [20:18:22] i dont know how to get sql to cast it [20:18:39] hm [20:19:03] madhuvishy: i'll see if i can find something with that [20:19:25] ottomata: okay, until then reverting to udf [20:20:08] k [20:20:17] i mean, you could udf the ts instead of the dt, right? [20:20:24] then you at least don't need to do dateformat conversion [20:20:32] ha, but how the crap is that different! [20:20:33] ?! [20:20:47] i guess, because it runs as part of the sql...somehow it is lower level and doesn't end up in an RDD anywhere [20:20:48] dunno. [20:23:31] ottomata: yeah that's what i thought. i'll udf the ts [20:41:38] mforns: you still around? [20:41:44] milimetric, yep [20:43:06] milimetric, what's up :] [20:44:02] mforns: https://meta.wikimedia.org/wiki/User_talk:Milimetric_(WMF)/Work_log/2015-06-02 [20:44:09] I am trying to understand this load testing [20:44:11] it's making no sense [20:44:20] I added simple code in the processor and forwarder, looks like this: [20:44:33] aha [20:45:06] https://www.irccloud.com/pastebin/Lwe3iy91/ [20:45:17] (very similar in the forwarder) [20:45:28] and I logged what I got on my talk page above [20:45:37] forwarder: around 10k events [20:45:49] processor: 3975 [20:45:50] aha [20:45:57] in the database: 4000 [20:46:19] and the total generated was over 15k [20:46:29] this was over 32 seconds [20:46:37] so the forwarder on beta chokes at around 300 per second [20:46:50] milimetric, I have a guess of what is happening between the forwarder and the processor [20:46:52] and the processor at around 124 per second [20:47:21] mmm [20:48:23] yes, it makes no sense hehehe, I can't see it [20:48:53] we know that the processor supports more than 120/s, at least in prod [20:50:10] yeah, so that's ok if this scales up with a beefier machine, that makes sense from experience [20:50:16] but this shows us kind of the ratio [20:50:30] basically, it seems the processor is the actual bottleneck, not the consumer [20:50:32] milimetric, aha, yes if we consider that beta has only one core... [20:50:47] not just one core, but probably a very weak core compared [20:51:04] milimetric, aha, yes, and it makes sense, because the validation is the heaviest block of code [20:52:02] milimetric, what schemas did you use? [20:53:26] milimetric, and by the way, it is possible to insert more events to the DB than processed, because of the consumer buffers. [20:53:49] mforns: just the Edit schema [20:54:14] mforns: yeah, the buffer thing makes sense. There were probably already 25 events in the buffer, makes sense [20:54:15] the edit schema is quite large, if you use a shorter schema, maybe validation takes less time? [20:54:24] ah, good idea! [20:54:27] ok, I'll try and see what changes [20:54:31] do you have a favorite short one? [20:54:32] cool!
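
The reduce that keeps OOMing merges one QTree per data point. Assuming the QTree here is Twitter Algebird's (which is what refinery-source uses for percentile estimates), the depth suggestion amounts to building shallower sketches before merging. A rough sketch; the level and numbers are illustrative:

    import com.twitter.algebird.{QTree, QTreeSemigroup}

    object QTreeSketch {
      def main(args: Array[String]): Unit = {
        val level = 6  // shallower tree => smaller merged structures, coarser bounds
        val semigroup = new QTreeSemigroup[Long](level)

        val nums: Seq[Long] = 1L to 1000L
        val merged = nums.map(QTree(_)).reduce(semigroup.plus)

        // quantileBounds gives (lower, upper) estimates for a quantile
        println(merged.quantileBounds(0.5))
      }
    }
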
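The timestamp thread above weighs two options, which ottomata settles a bit later by suggesting CAST(ts as int) directly in the spark sql. A sketch against the Spark 1.3-era DataFrame API, with placeholder paths and column names:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    object TimestampSketch {
      def run(sc: SparkContext): Unit = {
        val sqlContext = new SQLContext(sc)
        sqlContext.parquetFile("/wmf/data/example").registerTempTable("requests")

        // Option 1: pull a java.sql.Timestamp per row and convert in Scala --
        // the per-row object allocation madhuvishy suspected of bloating the RDD:
        val millis = sqlContext.sql("SELECT ts FROM requests")
          .map(r => r.getAs[java.sql.Timestamp](0).getTime)

        // Option 2, ottomata's suggestion: cast inside the SQL so rows carry a
        // primitive from the start (casting a Hive timestamp to INT yields
        // epoch seconds):
        val seconds = sqlContext.sql("SELECT CAST(ts AS INT) FROM requests")
          .map(r => r.getInt(0))

        println(millis.count() == seconds.count())
      }
    }
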
[20:54:37] mmm [20:54:47] there is this timing schema [20:54:55] navigation timing [20:55:30] milimetric, no sorry, mistaken, this is a large one [20:56:15] milimetric, this one is short: MobileWebUIClickTracking [21:00:31] I'm also doing optional values never btw, so that doesn't affect the batching [21:03:23] milimetric, yes, the random generator isn't good enough for using this feature, it ends up always generating lots of partitions in the inserts [21:04:25] yes, i noticed that early on [21:06:48] mforns: https://meta.wikimedia.org/wiki/User_talk:Milimetric_(WMF)/Work_log/2015-06-02 [21:06:52] it forwarded a higher percentage [21:07:00] and processed a higher percentage too [21:07:18] but there are clearly bottlenecks [21:07:47] ottomata: I'd love to run this load test on an eventlogging server that's much closer in capacity to eventlog1001. How hard would that be to arrange? [21:08:51] milimetric, your results make total sense to me, note that this time, the consumer couldn't cope with all the events passed by the processor [21:09:01] milimetric, or maybe they are still in the buffer??? [21:09:34] yeah, lemme check [21:10:18] no, still 4000 in the db [21:10:31] that number is creepy, why is it always the same [21:10:59] hehehe yes [21:11:17] milimetric, well the buffer is 400 in size [21:11:25] so 10 * buffer [21:11:43] well, total integer divided by buffer [21:11:54] but then it should flush the rest within 300 seconds, no? [21:12:22] milimetric, if another event comes for the same schema, yes [21:12:27] but if no event comes, no [21:12:29] oh, right! [21:12:31] I forgot [21:12:34] I'll send one :) [21:12:37] ok [21:12:54] yep, it inserted 222 more :) [21:13:00] ok [21:13:51] milimetric, but still, there were 4466 processed, and 4000 + 221 = 4221 consumed [21:14:20] yes [21:14:26] so still weird [21:14:28] milimetric: eh? [21:14:34] milimetric, which makes me think that the processor and the consumer have a similar limit [21:14:52] ottomata: I'm running a load test against the beta cluster's event logging setup [21:14:57] but that machine that runs EL has only 1 core [21:15:02] so it's not a great test [21:15:33] and it's dying at around 300 events / second on the forwarder and 100 events / second on the processor. And we know it can do a lot better than that in prod [21:15:52] so I was wondering if we can get another box that matches eventlog1001 in performance more closely so we can try and push it to its limits [21:16:16] madhuvishy: [21:16:16] CAST(ts as int) [21:16:25] you can do that in your spark sql [21:16:31] no udf needed [21:16:42] ottomata: I'm running for 30 days with my changes now [21:16:45] it's actually running [21:16:52] cool! [21:16:53] no heapspace errors :D [21:17:07] milimetric, maybe what I said is wrong, because we're testing with one single schema, so in reality the consumer will be slower, now it needs 1 insert every 400 events, but in reality it will need couple moew [21:17:10] *more [21:17:54] mforns: yeah, but either way, it's clear we need a beefier server to test [21:18:00] milimetric, sure [21:18:09] otherwise it's not very useful other than to get a general idea of where the bottleneck is [21:18:45] I think it's still the processor, because the consumer might do a lot of other inserts but we can still increase the batch size a lot, and I think that's where we can squeeze a lot of performance [21:19:00] it should have no problem inserting 10,000 events at a time.
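
EventLogging's consumer is Python, but the buffering behavior mforns describes (per-schema buffers of around 400 events, with the flush check only running when a new event arrives) can be modeled in a few lines. Scala is used here only to keep one language across these sketches; all names and sizes come from the discussion above, not the real code.

    import scala.collection.mutable

    // Toy model of per-schema batching: events accumulate per schema and are
    // only written out when the buffer fills or ages out -- and the check only
    // happens on arrival, which is why sending one extra Edit event above is
    // what flushed the 222 stragglers.
    class BatchingConsumerModel(batchSize: Int = 400, maxAgeMs: Long = 300000L) {
      private case class Buf(events: mutable.Buffer[String] = mutable.Buffer.empty,
                             var oldestMs: Long = 0L)
      private val buffers = mutable.Map.empty[String, Buf]
      var inserted = 0  // stand-in for rows written by multi-row INSERTs

      def consume(schema: String, event: String, nowMs: Long): Unit = {
        val buf = buffers.getOrElseUpdate(schema, Buf())
        if (buf.events.isEmpty) buf.oldestMs = nowMs
        buf.events += event
        if (buf.events.size >= batchSize || nowMs - buf.oldestMs > maxAgeMs) {
          inserted += buf.events.size
          buf.events.clear()
        }
      }
    }
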
[21:20:07] I see [21:21:22] milimetric, I think general ideas are very useful, though :] [21:22:17] Analytics-EventLogging, Analytics-Kanban: Load Test Event Logging [8 pts] {oryx} - https://phabricator.wikimedia.org/T100667#1317914 (Milimetric) Pausing this task as we need a separate box to continue testing. I ran tests on beta labs and it seems clear that the virtual instance is our bottleneck. Two... [21:23:04] indeed :) just not useful enough to get this task done, so I paused it. So, anyone need any help before I go diving through the tasked column? [21:23:13] kevinator: any input on my next task? [21:26:48] I guess I'll do the current-definition project counts [21:29:14] milimetric, paused: ok [21:43:19] madhuvishy: is it actually running? [21:43:24] i had one that looked like it was but never did anything [21:43:25] for 30 days [21:43:39] ottomata: yup. it's running [21:43:47] application_1430945266892_47860 [21:44:00] oh whoa yeah it is [21:44:01] cool [21:45:38] ottomata: I'm running it with the persist [21:45:51] when this finishes will try without [21:48:53] k [21:49:18] ottomata: but yay [21:58:32] yay indeed, fingers crossed [22:00:37] (CR) Mforns: "LGTM, just two small comments, I'm OK with ignoring them if you like." (2 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/213967 (owner: Milimetric) [22:01:32] mforns: -1 me for that, goodness. those are definitely mistakes [22:02:11] milimetric, O.o [22:05:05] Analytics-EventLogging, Analytics-Kanban: Event Logging doesn't seem to handle unicode strings {oryx} [8 pts] - https://phabricator.wikimedia.org/T99572#1331999 (kevinator) p:Triage>Normal [22:06:09] Analytics-EventLogging, Analytics-Kanban: Code to write a new Camus consumer and store the data in two Hive tables [21 pts] {oryx} - https://phabricator.wikimedia.org/T98784#1332000 (kevinator) p:Triage>Normal [22:06:54] Analytics-Cluster, Analytics-Kanban: Assess how to extract Mobile App info in webrequest [8 pts] {hawk} - https://phabricator.wikimedia.org/T99932#1332002 (kevinator) [22:07:39] (PS4) Milimetric: Refactor Compare layout to use TimeseriesData [analytics/dashiki] - https://gerrit.wikimedia.org/r/213967 [22:08:22] (CR) Milimetric: Refactor Compare layout to use TimeseriesData (2 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/213967 (owner: Milimetric) [22:09:50] Analytics-Cluster, Analytics-Kanban: Assess how to extract Mobile App info in webrequest [8 pts] {hawk} - https://phabricator.wikimedia.org/T99932#1332011 (kevinator) [22:18:43] ottomata: nope, it failed. [22:20:45] ottomata: http://pastebin.com/hLHn6t2D [22:22:25] different place though! [22:22:46] hm, madhuvishy, i think that might be the persist thing [22:22:58] ottomata: yeah [22:23:06] let me get rid of persist [22:23:11] i guess try without, yeah [22:30:13] (CR) Mforns: Refactor Compare layout to use TimeseriesData (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/213967 (owner: Milimetric) [22:30:50] (CR) Mforns: [C: 2 V: 2] Refactor Compare layout to use TimeseriesData [analytics/dashiki] - https://gerrit.wikimedia.org/r/213967 (owner: Milimetric) [22:35:05] (PS5) Mforns: Use Dygraphs in Vital Signs [analytics/dashiki] - https://gerrit.wikimedia.org/r/214270 (https://phabricator.wikimedia.org/T96339) (owner: Milimetric) [22:36:00] good night everyone! [22:36:06] see ya [23:04:07] Analytics-EventLogging: Send graphite metrics for Schema.
as well - https://phabricator.wikimedia.org/T95780#1332301 (Tgr) A nicely generic approach would be to define a new top-level property for EL schemas, say, a "statsKey" string (or "statsKeys" array) which would be appended to `eventloggi... [23:51:36] Analytics, MediaWiki-extensions-ExtensionDistributor: Set up graphs and dumps for ExtensionDistributor download statistics - https://phabricator.wikimedia.org/T101194#1332414 (Legoktm) NEW