[00:39:52] Analytics / Wikimetrics: Cannot validate a new cohort - https://bugzilla.wikimedia.org/71099#c1 (nuria) Very sorry about this. The labs cluster had some maintenance done this weekend and seems like things are not working as they should quite yet. Once issues with labs are solved we will update this bug,...
[01:33:31] (CR) Nuria: "I reviewed the code but could not test in labs due to the db situation." (2 comments) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/161521 (owner: Milimetric)
[03:45:04] (PS3) QChris: Add Oozie setup to aggregate Webstatscollector pagecount files [analytics/refinery] - https://gerrit.wikimedia.org/r/160636
[03:45:06] (PS1) QChris: Stop tying Oozie's “data_directory” to webrequests table [analytics/refinery] - https://gerrit.wikimedia.org/r/161904
[03:45:08] (PS1) QChris: Move Oozie's “marking directory done” into separate workflow [analytics/refinery] - https://gerrit.wikimedia.org/r/161905
[03:45:10] (PS1) QChris: Add Oozie setup to rendering webstatscollector files [analytics/refinery] - https://gerrit.wikimedia.org/r/161906
[07:13:38] Analytics / Tech community metrics: Graphs for median/average should report absolute numbers - https://bugzilla.wikimedia.org/66266#c4 (Quim Gil) NEW>ASSI a:Alvaro Looks good, thank you!
[07:20:07] Analytics / Tech community metrics: Code review metrics should not include [WIP] changesets - https://bugzilla.wikimedia.org/66283#c4 (Quim Gil) In case it's useful, now apps/android/commons made it to the top of the list, but it has only two WIP open changesets: https://gerrit.wikimedia.org/r/#/q/sta...
[11:33:10] Analytics / General/Unknown: Packetloss_Average alarm on udp2log machines on 2014-09-20 - https://bugzilla.wikimedia.org/71116 (christian) NEW p:Unprio s:normal a:None There have been Packetloss_Average alerts on 2014-09-20 for a few minutes on erbium and oxygen [1]. What happened and how...
[11:36:53] Analytics / General/Unknown: Packetloss_Average alarm on udp2log machines on 2014-09-20 - https://bugzilla.wikimedia.org/71116#c1 (christian) Ops reported an ULSO outage [1] that matches the time period, and Ops said to have a proper incident report today (2014-09-22). Once that is out, we'll see how...
[12:04:40] (CR) QChris: "> > However, we do it nonetheless, as webstatscollector is doing it too." [analytics/refinery] - https://gerrit.wikimedia.org/r/160636 (owner: QChris)
[14:01:04] ottomata: Standup time :-)
[14:01:08] ooF
[14:01:13] sneaky meeting!
[14:23:49] (PS3) Milimetric: Update NamespaceEdits metric [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/161470 (https://bugzilla.wikimedia.org/71008)
[14:24:13] (PS3) Milimetric: Update PagesCreated metric [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/161472
[14:24:42] (PS4) Milimetric: Update PagesCreated metric [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/161472 (https://bugzilla.wikimedia.org/71009)
[14:25:24] (PS2) Milimetric: Add Rolling Recurring Old Active Editor [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/161521 (https://bugzilla.wikimedia.org/69569)
[14:52:38] (CR) Ottomata: [C: 2 V: 2] Stop tying Oozie's “data_directory” to webrequests table [analytics/refinery] - https://gerrit.wikimedia.org/r/161904 (owner: QChris)
[14:54:26] (CR) Ottomata: Move Oozie's “marking directory done” into separate workflow (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/161905 (owner: QChris)
[14:57:06] (PS2) QChris: Add Oozie setup to rendering webstatscollector files [analytics/refinery] - https://gerrit.wikimedia.org/r/161906
[14:57:08] (PS2) QChris: Move Oozie's “marking directory done” into separate workflow [analytics/refinery] - https://gerrit.wikimedia.org/r/161905
[14:57:10] (PS4) QChris: Add Oozie setup to aggregate Webstatscollector pagecount files [analytics/refinery] - https://gerrit.wikimedia.org/r/160636
[14:57:50] (CR) QChris: Move Oozie's “marking directory done” into separate workflow (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/161905 (owner: QChris)
[15:06:16] (CR) Ottomata: [C: 2 V: 2] Move Oozie's “marking directory done” into separate workflow [analytics/refinery] - https://gerrit.wikimedia.org/r/161905 (owner: QChris)
[15:06:29] (CR) Ottomata: Add Oozie setup to aggregate Webstatscollector pagecount files (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/160636 (owner: QChris)
[15:08:50] whoaa oozie switch
[15:08:53] not seen this one yet...
[15:09:16] :-D
[15:12:59] whoa, there is a fork!
[15:13:01] that's cool
[15:13:10] cool
[15:23:10] qchris: quick q, your hive query right now runs as two queries
[15:23:14] one for the .mw counts
[15:23:20] right.
[15:23:22] won't that create multiple output files?
[15:23:28] Yes.
[15:23:29] (i'm reading your prepare_job_output change now)
[15:23:55] But there is an order by when preparing the {prajec,page}counts.
[15:24:02] s/prajec/project/
[15:24:07] This order by merges them again.
[15:24:33] ah select from destination table
[15:24:34] hm ok
[15:25:44] confused about how that works, but i will come back to it
[15:26:12] k
[15:27:46] another quick q:
[15:28:02] would it not be simpler to write a shell wrapper for this, rather than all the oozie cases?
[15:28:06] just to get different error messages?
[15:28:24] i understand not shelling out for each of the fs checks
[15:28:28] but what about the whole thing?
[15:28:32] I thought a lot about that.
[15:28:43] Did you read the comment in the middle of the switch?
[15:28:45] yes
[15:28:52] so
[15:29:06] So to me, output parsing of "hdfs dfs -ls" is likely to break sooner.
[15:29:06] i think your cases are good, in that shelling out in each one to hdfs -ls would be cumbersome
[15:29:19] but i mean a wrapper for the whole thing
[15:29:38] When you read the help page of "hdfs dfs -ls", the given description already does not match the actual output.
[15:29:38] that takes in source_directory, target_file, expecting_ending, etc
[15:29:41] as parameters
[15:30:04] (CR) QChris: Add Oozie setup to aggregate Webstatscollector pagecount files (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/160636 (owner: QChris)
[15:30:31] By wrapper, you mean a ...say ... bash script. Right?
[15:30:43] yes
[15:30:46] or python, or whatever
[15:30:58] haha, python + snakebite :p
[15:30:59] jk
[15:31:04] that would be complicated too :/
[15:31:09] How would you parse the output of "hdfs dfs -ls" into something like "find"?
[15:31:13] but would sidestep the issue of hdfs -ls
[15:31:18] https://github.com/spotify/snakebite
[15:31:22] By just splitting?
[15:31:46] hm?
[15:31:56] naw, you'd just be using an API, rather than parsing text output
[15:31:57] http://spotify.github.io/snakebite/
[15:32:22] Sorry. The latest question was not meant about snakebite, but about "hdfs dfs -ls".
[15:32:23] At Spotify we use the http://github.com/spotify/luigi that relies on doing a lot of existence checks and moving data around in HDFS. And since calling hadoop from python is expensive, we decided to write a pure python HDFS client that only relies on protobuf.
[15:32:57] ah, my (halfhearted) idea is to write a python wrapper that uses snakebite to do this
[15:33:29] That is heavy tooling for the tiny use case we currently have.
[15:33:52] Could we kick that down the road, until we have more users of it?
[15:34:05] yeah for sure
[15:34:09] i agree
[15:34:18] we'd have to package that thing up too, that would delay us a lot
[15:34:30] Everything is kept in a separate workflow, so improving that workflow, would improve all the uses of it.
[15:34:33] yeah
[15:34:45] Cool.
[15:36:29] ok, another q
[15:36:46] can't the webstatscollector table data itself be used for the pagecount file?
[15:36:54] if we store it as textfile tsv?
[15:37:50] An hour in the webstatscollector table might be represented by more than 1 file (*), so we need to merge them into a single file.
[15:37:59] AHHHH
[15:38:03] (*) currently 2 files. But once you mess with the table, the number increases.
[15:38:22] ahhh ok, my previous question about the two queries makes sense now
[15:38:22] ok ok
[15:38:24] got it.
[15:38:43] Awesome.
[15:43:01] (CR) Ottomata: Add Oozie setup to rendering webstatscollector files (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/161906 (owner: QChris)
[15:51:10] (CR) Ottomata: Add Oozie setup to aggregate Webstatscollector pagecount files (5 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/160636 (owner: QChris)
[15:57:46] qchris: am going into a meeting now, and then again in 2 hours
[15:57:57] I am there already :-)
[15:57:58] will be kinda avail to keep reviewing in 1 hour for 1 hour :)
[15:58:03] oh yeah you are in that one too! :p
[16:00:23] oh qchris, just thought of something re kafkatee,
[16:00:29] fundraising needs it
[16:00:33] or probably
[16:00:36] ah, meeting starting ok later
[17:26:54] (CR) QChris: Add Oozie setup to aggregate Webstatscollector pagecount files (5 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/160636 (owner: QChris)
[17:40:55] Analytics / Wikistats: One column header in browsers per country report only contains '---' - https://bugzilla.wikimedia.org/71127 (Erik Zachte) NEW p:Unprio s:minor a:None http://stats.wikimedia.org/wikimedia/squids/SquidReportCountryBrowser.htm reported by Atul Vaidya
[17:43:28] (CR) QChris: Add Oozie setup to rendering webstatscollector files (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/161906 (owner: QChris)
[19:12:52] Analytics / Wikimetrics: Cannot validate a new cohort - https://bugzilla.wikimedia.org/71099#c2 (nuria) Issues in production are solved, we are taking a few minutes to make sure everything is OK and wikimetrics should be back online in a few minutes.
[19:21:07] Analytics / Tech community metrics: Code review metrics should not include [WIP] changesets - https://bugzilla.wikimedia.org/66283#c5 (Alvaro) REOP>PATC Quim, WIP reviews are now filtered. And as expected, the two repos above don't appear anymore.
[20:52:19] qchris: quick comment about the _sum vs _count
[20:52:25] _count is a suffix, as is _sum, right?
[20:52:31] so
[20:52:45] view_count and response_size_sum kinda make sense...although I don't love response_size_sum
[20:52:54] oo size_total
[20:52:57] would be fine with me
[20:53:06] response_size_total
[20:53:10] Mhmm.
[20:53:17] So the two would get a different suffix?
[20:53:27] yes, aren't they computed differently...?
[20:53:33] checking...
[20:53:47] No. SUM(...)
[20:53:50] yeah
[20:53:51] COUNT(*) count,
[20:53:51] SUM(response_size) response_size
[20:54:18] ah
[20:54:19] Oh. OH!
[20:54:23] qchris, in sequence stats table
[20:54:23] You're right.
[20:54:26] we prefixed
[20:54:27] count_*
[20:54:30] so
[20:54:46] count_views
[20:54:46] (sum|total)_response_size
[20:54:47] or maybe
[20:54:53] bytes_served
[20:54:53] ?
[20:55:11] the column in hive is called "response_size"
[20:55:14] although, i do like sum|total_response_size, as...y
[20:55:15] eah
[20:55:18] what i was gonna say. :)
[20:57:03] even if they are computed different ... we might also express the count via sum.
[20:57:11] For me it's confusing to call them differently.
[20:57:19] But meh.
[20:57:50] So ... "count_views" and "sum_response_size"?
[20:58:05] "count_views" and "total_response_size"?
[21:03:24] i kinda like total_response_size
[21:03:27] very descriptive
[21:04:34] Ok.
[21:04:45] Then "count_views" and "total_response_size" it is.
[21:05:34] cool.
[21:05:45] About the other column names ...
[21:06:01] Should we discuss right away too, or do you prefer to do it in gerrit?
[21:07:07] lemme keep going in gerrit, am responding to recent comments now.
[21:07:29] ok.
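A minimal Python sketch of the naming settled on above, assuming made-up sample rows. Both columns are plain aggregates over the same group; the count_ and total_ prefixes only record how each value is computed. The function name, the tuple layout and the sample data are illustrative assumptions and not the Hive query under review.

```python
from collections import defaultdict


def aggregate_requests(rows):
    """Aggregate (project, page, response_size) rows per (project, page).

    Hypothetical helper: mirrors COUNT(*) and SUM(response_size) from the
    discussion, using the agreed column names as dictionary keys.
    """
    totals = defaultdict(lambda: {"count_views": 0, "total_response_size": 0})
    for project, page, response_size in rows:
        entry = totals[(project, page)]
        entry["count_views"] += 1                      # COUNT(*)
        entry["total_response_size"] += response_size  # SUM(response_size)
    return dict(totals)


# Made-up sample data, for illustration only.
sample_rows = [
    ("en", "Main_Page", 1200),
    ("en", "Main_Page", 800),
    ("de", "Hauptseite", 500),
]
result = aggregate_requests(sample_rows)
```

With this sample, the ("en", "Main_Page") group ends up with two views and a total response size of 2000 bytes.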
[21:08:06] (CR) QChris: Add Oozie setup to aggregate Webstatscollector pagecount files (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/160636 (owner: QChris)
[21:12:10] Analytics / Wikimetrics: Story: AnalyticsEng uses connection pooling on database URL - https://bugzilla.wikimedia.org/71140 (Dan Andreescu) NEW p:Unprio s:normal a:None Currently we're hitting a different connection string for each different labsdb database name. We could change this to a...
[21:15:12] (CR) Ottomata: Add Oozie setup to aggregate Webstatscollector pagecount files (5 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/160636 (owner: QChris)
[21:19:17] wait, qchris, explain to me why we need this special prepare hdfs mv step?
[21:19:35] because INSERT OVERWRITE DIRECTORY might generate multiple files?
[21:19:43] Let me reread the code.
[21:21:01] i.e., in render_hourly_files/workflow.xml
[21:21:04] why can't we just do
[21:21:12] https://gerrit.wikimedia.org/r/#/c/161906/2/oozie/util/prepare_job_output_for_plainfs_rsync/workflow.xml
[21:21:16] destination_directory=path/to/final/output/directory
[21:21:17] ?
[21:21:36] for the
[21:21:36] action
[21:22:46] I guess I do not understand the question.
[21:22:52] Sorry.
[21:23:26] So are you asking why extract_data_into_single_file is a hive query instead of just a set of hdfs commands?
[21:23:38] uh no
[21:23:46] i am asking why the hive query doesn't just insert into the final directory
[21:23:54] Ah!
[21:24:03] why do we need the
[21:24:03]
[21:24:03] action
[21:24:04] ?
[21:24:11] Because a job output is always a directory.
[21:24:20] hive job?
[21:24:21] hm
[21:24:22] hm
[21:24:23] hm
[21:24:24] right
[21:24:25] hm
[21:24:27] And the file within that directory is typically 00000_0.gz
[21:24:34] It's hard to control that reliably.
[21:24:39] GOT it.
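The reasoning above (a job always writes a directory, the data file inside has a name that is hard to control, so a separate step moves it to its final name) can be sketched in Python. This is only an illustration on the local filesystem, not the Oozie workflow under review: the function name, the rule of skipping entries that start with "_" or ".", and the target file name are assumptions. The parameter names loosely follow the source_directory / target_file / expecting_ending idea mentioned earlier in the log.

```python
import shutil
import tempfile
from pathlib import Path


def move_single_job_output(source_directory, expected_filename_ending, target_file):
    """Move the only data file of a job output directory to target_file.

    Hypothetical helper working on the local filesystem. Entries whose names
    start with "_" or "." are treated as marker files and ignored (assumption).
    """
    source = Path(source_directory)
    candidates = sorted(
        p for p in source.iterdir()
        if p.is_file() and not p.name.startswith(("_", "."))
    )
    if len(candidates) != 1:
        raise RuntimeError(
            "expected exactly one data file in %s, found %d" % (source, len(candidates))
        )
    data_file = candidates[0]
    if not data_file.name.endswith(expected_filename_ending):
        raise RuntimeError(
            "%s does not end in %r" % (data_file.name, expected_filename_ending)
        )
    target = Path(target_file)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(data_file), str(target))
    return target


# Demo in a throwaway directory; all file names here are made up.
with tempfile.TemporaryDirectory() as tmp:
    job_output = Path(tmp) / "job_output"
    job_output.mkdir()
    (job_output / "00000_0.gz").write_bytes(b"demo")
    (job_output / "_SUCCESS").touch()
    archived = move_single_job_output(
        job_output, ".gz", Path(tmp) / "archive" / "pagecounts-demo.gz"
    )
    archived_exists = archived.exists()
    archived_name = archived.name
```

Failing loudly when the directory holds zero or several data files keeps the step from silently archiving a partial result, which is the case the merge into a single file is meant to rule out.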
[21:24:57] ok, the comment I have then
[21:25:07] i don't like calling this the 'rsync directory'
[21:25:08] or whatever
[21:25:13] no reason to tie it to a tool
[21:25:16] :-D
[21:25:33] arhive?
[21:25:34] archive?
[21:25:38] archive_directory?
[21:25:47] prepare_job_output_for_directory
[21:25:48] ?
[21:25:49] oops
[21:25:53] prepare_job_output_for_archive_directory
[21:25:53] ?
[21:25:54] archive_directory sounds good to me.
[21:26:28] What about "prepare_job_output_for_archive_directory" -> "archive_job_output" ?
[21:26:35] +1
[21:26:56] Ok. I will rework the patches then.
[21:27:01] k
[21:27:03] Thanks for your reviews!
[21:27:12] (CR) Ottomata: Add Oozie setup to rendering webstatscollector files (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/161906 (owner: QChris)
[21:27:31] thank you for your work!
[21:27:35] looking as awesome as ever :)
[21:29:49] ok, out for the eve, think I'm going to go get a new iphone, ok byyeyeyeye
[21:29:57] :-)
[21:30:01] have fun.
[21:31:54] (CR) QChris: Add Oozie setup to rendering webstatscollector files (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/161906 (owner: QChris)
[22:07:51] Analytics / Tech community metrics: Code review metrics should not include [WIP] changesets - https://bugzilla.wikimedia.org/66283#c6 (Quim Gil) PATC>RESO/FIX I've checked the top 10 and everything was perfect. Thank you!
[22:33:44] (CR) QChris: Add Oozie setup to aggregate Webstatscollector pagecount files (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/160636 (owner: QChris)