[06:08:56] Analytics-Tech-community-metrics, Engineering-Community: A new events/meet-ups extension - https://phabricator.wikimedia.org/T99809#1324858 (Qgil)
[06:09:12] Analytics-Tech-community-metrics, Engineering-Community, MediaWiki-Extension-Requests: A new events/meet-ups extension - https://phabricator.wikimedia.org/T99809#1299500 (Qgil)
[06:13:41] Analytics-Cluster, Fundraising Sprint Kraftwerk, Fundraising Sprint Lou Reed, operations: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1324871 (AndyRussG) Hi! I've checked this by comparing, from erbium: `/a/log/fundraising/logs/buffer/2015/bannerImp...
[06:30:44] Analytics: Work on Metrics tools wm-metrics and MediaCollectionDB, refactoring and code quality. - https://phabricator.wikimedia.org/T100710#1324921 (Qgil) Can this task be resolved now? Also, I tried associating it to a project other than #Wikimedia-Hackathon-2015, but I could not find a clear candidate.
[06:51:47] Analytics-Tech-community-metrics, ECT-June-2015, ECT-May-2015: Key performance indicator: analyze who contributes code - https://phabricator.wikimedia.org/T55485#1324950 (Qgil) p:Low>Normal
[11:23:49] Analytics-Tech-community-metrics, ECT-June-2015, ECT-May-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1325303 (Aklapper)
[11:24:09] Analytics-Tech-community-metrics, ECT-June-2015, ECT-May-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1167361 (Aklapper)
[11:26:05] Analytics-Tech-community-metrics, ECT-June-2015, ECT-May-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1325307 (Aklapper)
[11:26:07] Analytics-Tech-community-metrics, ECT-June-2015: "Median time to review for Gerrit Changesets, per month": External vs. WMF/WMDE/etc patch authors - https://phabricator.wikimedia.org/T100189#1325308 (Aklapper)
[11:26:09] Analytics-Tech-community-metrics: Median time to review for Gerrit Changesets, per month - https://phabricator.wikimedia.org/T97715#1325305 (Aklapper) Open>Invalid >>! In T97715#1319876, @Qgil wrote: > Until now, we have been calculating the median age of the open changesets. If this is what you want,...
[11:29:00] Analytics-Tech-community-metrics, ECT-June-2015, ECT-May-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1325318 (Aklapper)
[11:34:25] Analytics-Tech-community-metrics, ECT-June-2015: Present most basic community metrics from T94578 on one page - https://phabricator.wikimedia.org/T100978#1325337 (Aklapper) NEW
[11:34:38] Analytics-Tech-community-metrics, ECT-June-2015, ECT-May-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1167361 (Aklapper)
[11:34:40] Analytics-Tech-community-metrics, ECT-June-2015: Present most basic community metrics from T94578 on one page - https://phabricator.wikimedia.org/T100978#1325337 (Aklapper)
[12:31:23] Analytics-Cluster, operations: Build Kafka 0.8.1.1 package for Jessie and upgrade Brokers to Jessie. - https://phabricator.wikimedia.org/T98161#1325431 (Ottomata) I think we should either backport or stick them manually in lib/. I don't think we should use Archiva for building this if we don't have to.
[12:57:42] (CR) Ottomata: Add get pageview_info udf and underlying functions (7 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/214349 (owner: Joal)
[12:59:01] Analytics, Security, Traffic, operations: Purge > 90 days stat1002:/a/squid/archive/api - https://phabricator.wikimedia.org/T92338#1325520 (Ottomata) AFAIK, not intentionally, but who knows what kind of stuff users send in POST data.
[13:03:10] ottomata: Thanks for the review :)
[13:03:17] And gooooood morning :)
[13:04:44] moorning!
[13:06:02] Just for fun ottomata : https://gist.github.com/jobar/0b8aabc39189e75a4544
[13:07:02] :) what's that from?
[13:07:28] Intermediate aggregations :)
[13:08:15] Long computation for top, but works :)
[13:08:35] brgb
[13:08:39] brb sorry
[13:18:13] Quarry: Add list of query executions to the query page side-bar - https://phabricator.wikimedia.org/T100982#1325563 (Halfak) NEW
[13:20:42] Quarry: Add list of query executions to the query page side-bar - https://phabricator.wikimedia.org/T100982#1325577 (Halfak)
[13:22:39] Analytics-Tech-community-metrics, ECT-June-2015: Present most basic community metrics from T94578 on one page - https://phabricator.wikimedia.org/T100978#1325579 (Qgil) At least when it comes to our quarterly goal T94578, I think this task is a nice to have but not a blocker. I understood the "how they ar...
[13:37:58] Analytics-Cluster, Fundraising Sprint Kraftwerk, Fundraising Sprint Lou Reed, operations: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1325639 (Ottomata) > My only concern so far is a difference in the number of entries for that 15-minute period: the...
[13:38:51] madhuvishy: you awake/workin?
[13:39:15] Oof, actually, it is way too cold in this cafe, i'm going to move and find a warmer spot, maybe back home..>.>.
[13:41:11] Hmm, naw, stayin for now
[13:41:12] blabla
[13:41:21] joal: do you know how madhuvishy's spark stuff is going?
[14:06:55] ottomata: I have not talked to her since last friday
[14:07:36] But at that time, she had something working for 2/3 days if I remember correctly
[14:09:00] aye, ok.
[14:09:09] will wait for her
[14:09:51] (PS1) Yuvipanda: Make quarry.wsgi file executable standalone [analytics/quarry/web] - https://gerrit.wikimedia.org/r/215023
[14:09:52] k
[14:09:53] (CR) jenkins-bot: [V: -1] Make quarry.wsgi file executable standalone [analytics/quarry/web] - https://gerrit.wikimedia.org/r/215023 (owner: Yuvipanda)
[14:10:20] ottomata: currently double checking results of intermediate aggregation with the one I did by projects
[14:10:33] If fine, I'll get into trying to use impala
[14:10:41] Will let you know if I get that far
[14:10:47] k cool
[14:16:32] (CR) Joal: Add get pageview_info udf and underlying functions (7 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/214349 (owner: Joal)
[14:18:24] (PS4) Joal: Add get pageview_info udf and underlying functions [analytics/refinery/source] - https://gerrit.wikimedia.org/r/214349
[14:40:25] (PS2) Yuvipanda: Make quarry.wsgi file executable standalone [analytics/quarry/web] - https://gerrit.wikimedia.org/r/215023
[14:40:36] (CR) Yuvipanda: [C: 2 V: 2] Make quarry.wsgi file executable standalone [analytics/quarry/web] - https://gerrit.wikimedia.org/r/215023 (owner: Yuvipanda)
[14:45:11] ottomata: have you seen "The improbable rise and fall of Couchsurfing"?
http://kernelmag.dailydot.com/issue-sections/features-issue-sections/13124/life-and-death-couchsurfing/
[14:45:22] milimetric: yes
[14:45:28] very good article
[14:45:58] relevant too, for our work: we keep wanting more editors but we're not spending any time thinking about growing the right way
[14:46:18] ya, more does not always equal better
[15:21:08] Analytics-Kanban, Analytics-Wikimetrics: Wikimetrics crashes when cohort description has special characters - https://phabricator.wikimedia.org/T100781#1326018 (mforns) a:mforns
[15:21:39] Analytics-Kanban, Analytics-Wikimetrics: Wikimetrics crashes when cohort description has special characters [5pts] - https://phabricator.wikimedia.org/T100781#1326024 (mforns)
[15:42:41] Analytics-Kanban, Analytics-Wikimetrics: Wikimetrics crashes when cohort description has special characters {dove} [5pts] - https://phabricator.wikimedia.org/T100781#1326076 (mforns)
[15:53:04] ottomata: Heya, would you mind coming for 15/20 minutes to the tasking meeting, to discuss a bit more about the spark issue and alternatives?
[15:55:59] sure
[15:56:10] thanks dude
[16:00:41] ottomata: tasking?
[16:01:10] ooo k
[16:25:29] Analytics-Cluster, Analytics-Kanban: Assess how to extract Mobile App info in webrequest [8 pts] - https://phabricator.wikimedia.org/T99932#1326222 (kevinator) a:JAllemandou>mforns
[16:31:02] Analytics-Kanban, Need-volunteer: Top Articles ad-hoc Report - https://phabricator.wikimedia.org/T99083#1326263 (kevinator)
[16:33:57] Analytics-Kanban, Need-volunteer: Top Articles ad-hoc Report for Wikipedia Zero [5 pts] - https://phabricator.wikimedia.org/T99083#1326274 (kevinator)
[16:34:36] Analytics-Kanban, Need-volunteer: Top Articles ad-hoc Report for Wikipedia Zero [5 pts] - https://phabricator.wikimedia.org/T99083#1284977 (kevinator) p:Triage>Normal
[16:43:06] AH! Rain has stopped, time to run home!
[16:43:08] bbsoon
[16:47:32] Analytics-Kanban, Analytics-Wikimetrics, Patch-For-Review: Deal with non-timeboxed queries recomputing too much data for the mobile report-card - https://phabricator.wikimedia.org/T98979#1326303 (kevinator) p:Triage>Normal
[17:05:07] Analytics-Kanban, Analytics-Wikimetrics, Patch-For-Review: Deal with non-timeboxed queries recomputing too much data for the mobile report-card - https://phabricator.wikimedia.org/T98979#1326348 (kevinator)
[17:08:36] Analytics-Cluster: Analyst has a table of Last-Access counts {bear} - https://phabricator.wikimedia.org/T101004#1326355 (kevinator) NEW
[17:08:57] Analytics-Cluster: Analyst has a table of Last-Access counts {bear} - https://phabricator.wikimedia.org/T101004#1326362 (kevinator)
[17:46:51] Analytics-Cluster, hardware-requests, operations, procurement: Hadoop worker node procurement - 2015 - https://phabricator.wikimedia.org/T100442#1326488 (RobH) #hardware-requests, not #procurement for requesting hardware in phabricator, as outlined on https://wikitech.wikimedia.org/wiki/Operations...
[17:47:45] Analytics-Cluster, hardware-requests, operations: Hadoop worker node procurement - 2015 - https://phabricator.wikimedia.org/T100442#1326496 (RobH)
[18:02:48] madhuvishy: we think we're going to keep tasking
[18:10:28] milimetric: alright
[18:11:30] madhuvishy: you're of course welcome to join
[18:14:09] milimetric: thanks, i've to step out in a bit to meet a friend so i'm gonna pass. sorry!
[18:14:20] Analytics-Cluster, Fundraising Sprint Kraftwerk, Fundraising Sprint Lou Reed, Fundraising Tech Backlog, operations: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1326616 (atgo)
[18:14:26] ok madhuvishy, no prob
[18:20:10] milimetric: Will read the paper you sent, seems very interesting :)
[18:21:18] it's not super applicable, I don't think. I think Marcel's approach is a little more direct and to the point, and less error-prone
[18:21:50] Can't remember what it was :(
[18:21:56] me neither :]
[18:21:58] Removing too-small points?
[18:22:02] basically - override SUM
[18:22:09] the PostgreSQL / Mondrian hack
[18:22:10] Right
[18:22:13] I'll reply to his email now
[18:22:15] oh yea
[18:22:31] madhuvishy:
[18:22:37] Hi all, I have a community member who is trying to access the searchindex table; he wants to try to query the content of wiki articles; I know this can be complicated, but it would be great if someone could help him?
[18:23:00] I'll finish my day soon
[18:23:15] Do you have more info on the errors you get?
[18:23:24] I'll have a look tomorrow if so
[18:23:35] Guys, time for me to call it a day
[18:23:40] Will talk tomorrow!
[18:25:36] the override SUM thing sounded hacky to me at first mforns, but I love it because if we do it well and test and make sure it works, it basically means nobody can ever aggregate from that database without getting k-anonymized data.
[18:26:15] milimetric, aha
[18:26:42] joal|night alright let's talk tomorrow
[18:26:43] milimetric, and we could send aggregated data to Mondrian, and wouldn't necessarily need an api for that at least
[18:26:48] bye joal|night !
[18:27:11] mforns: yeah, but papers like this make me feel like it may be more complicated: http://www.public.asu.edu/~ynsilva/publications/kAnonymity.pdf
[18:27:49] k-anonymity is NP-hard :(
[18:27:50] nevertheless, it seems a great path to follow. Even if Impala doesn't support this, I think it's ok. It would mean we just need to constrain the dimensions for now to fit into our Postgres server
[18:28:10] I know! But what if instead of dealing with it, we just never returned any value lower than k from any aggregator in Postgres!
[18:28:17] seems like a loophole, I can't find fault with it right now
[18:28:58] milimetric, joal|night, yes that's the initial idea, aggregate normally, and then truncate
[18:29:25] I'm gonna go grab lunch, but I'll keep thinking about it.
[18:29:49] milimetric, joal|night I think we should try it, as it is a "short" amount of work, and see
[18:29:51] k
[18:29:54] milimetric: My view of it is to preprocess data, and truncate it at that time
[18:30:11] Have the cube work done
[18:30:19] but then we lose data in the way you showed
[18:30:19] Mwarf, anyway :)
[18:30:38] hehe, we could go on like this for hours :]
[18:30:46] No no, truncate at the cube level
[18:30:48] oh the whole cube, at all levels
[18:30:49] after aggregation
[18:30:49] gotcha
[18:30:53] :)
[18:31:04] yeah, that would be a thing to do too, but a lot of pre-computation
[18:31:10] http://www.kylin.io/document.html
[18:31:15] the vast majority of it may not be used
[18:31:19] ?
[18:31:52] Let's continue to think about it ;)
[18:31:57] G'd night :)
[18:32:10] like, if you materialize the whole cube, a lot of the combinations of dimensions will never be queried
[18:32:15] so you did all those pre-computations for nothing
[18:32:22] true
[18:32:22] but yeah, nite, get some sleep!
[18:32:23] :)
[18:32:32] That's why I think we need to carefully think through our dimensions ;)
[18:32:34] bye!
[18:32:36] Kylin looks cool!
[18:32:38] article should not be one
[18:32:58] sadly we need article, as parity with stats.grok.se is what we're going after
[18:33:29] grok only provides time series, no?
[18:33:47] top and time series
[18:34:01] top articles per project, time series per article
[18:34:07] right
[18:34:17] And then, cube without article
[18:34:31] it's ok, we can talk tomorrow
[18:34:42] Like that, at article level, we have manageable data, and we can still have a cube on interesting points
[18:34:50] right ;)
[18:34:52] Ciao!
[18:34:58] ciao!
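[The "override SUM, then truncate" idea from the exchange above, sketched as a minimal Python example: aggregate normally, then suppress any group backed by fewer than k rows. The function names and the threshold value are illustrative only; they are not from the actual Wikimetrics, Mondrian, or PostgreSQL code being discussed.]

    from collections import defaultdict

    K = 5  # anonymity threshold; an arbitrary example value

    def k_anonymized_sum(rows, key_fn, value_fn, k=K):
        """Group rows, sum values per group, and suppress any sum built
        from fewer than k rows, so small groups can never leak out."""
        sums = defaultdict(int)
        counts = defaultdict(int)
        for row in rows:
            group = key_fn(row)
            sums[group] += value_fn(row)
            counts[group] += 1
        # The truncation step: aggregate normally, then censor below k.
        return {g: (s if counts[g] >= k else None) for g, s in sums.items()}

[In the PostgreSQL variant discussed above, the same check would live inside a custom aggregate, so every consumer querying the database would get k-anonymized results by construction rather than by convention.]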
[19:01:29] madhuvishy: helLOoOOOO
[19:04:20] Quarry: Add list of query executions to the query page side-bar - https://phabricator.wikimedia.org/T100982#1326793 (Capt_Swing) What if the query is changed between executions? Do we save the version of the query that produced each result set?
[19:09:11] (CR) Ottomata: Add get pageview_info udf and underlying functions (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/214349 (owner: Joal)
[19:09:17] (CR) Ottomata: Add get pageview_info udf and underlying functions (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/214349 (owner: Joal)
[19:10:46] Quarry: Add list of query executions to the query page side-bar - https://phabricator.wikimedia.org/T100982#1326827 (yuvipanda) Yes we already do have revisions - they are saved each time the query is executed :) We do need to expose them.
[19:16:35] Analytics-Kanban, Analytics-Wikimetrics, Community-Wikimetrics: Give the option of using the same parameters for all reports for a given cohort {dove} [21 pts] - https://phabricator.wikimedia.org/T74117#1326859 (Capt_Swing) @kevinator is this task going to make it into any upcoming sprints? If so, jus...
[19:28:10] madhuvishy: let me know when you are back
[19:32:56] (CR) Ottomata: [WIP] Productionize app session metrics - Parse args using scopt - Move away from HiveContext to reading Parquet files directly - Change rep (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/212573 (https://phabricator.wikimedia.org/T97876) (owner: Madhuvishy)
[19:43:33] hm, joal|night, you are gone, but maybe you could backfill less after all
[19:43:38] i can't even run madhu's job for a single day
[19:43:59] job sorta gets accepted but it doesn't get enough executors or something, so the app master keeps restarting
[20:58:12] ottomata: Hey
[20:58:27] was away for lunch with a friend.
[21:01:04] (PS1) Mforns: Fix cohort description utf8 bug [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/215200 (https://phabricator.wikimedia.org/T100781)
[21:01:11] (CR) jenkins-bot: [V: -1] Fix cohort description utf8 bug [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/215200 (https://phabricator.wikimedia.org/T100781) (owner: Mforns)
[21:01:38] madhuvishy: s'ok
[21:01:50] ah! i just got your job to run on 10 days!
[21:01:52] gonna try 30 :)
[21:01:58] woah! How!
[21:02:04] --num-executors=40
[21:02:10] so, here's what I think is happening
[21:02:13] (PS5) Madhuvishy: [WIP] Productionize app session metrics - Parse args using scopt - Move away from HiveContext to reading Parquet files directly - Change reports to run for last n days instead of daily or monthly (not sure if this is gonna work yet) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/212573 (https://phabricator.wikimedia.org/T97876)
[21:02:15] ha
[21:02:27] hive/mapreduce launches TONS of mappers
[21:02:35] 6 executors is not very many
[21:02:45] ottomata: right. i tried 12.
[21:02:56] the mobile data is like 80G per day
[21:03:11] now, i still think spark should not OOM with 6 executors here
[21:03:17] it should just take a long time to run :)
[21:03:24] ottomata: right
[21:03:40] but, i was just comparing the number of concurrent spark jobs to that of one of our refine jobs
[21:03:51] each of which only works on a single hour of json data
[21:03:52] so much less
[21:04:01] those were launching 100+ mappers at once
[21:04:28] 10 days took me 18 minutes
[21:04:37] ottomata: yeah.
[21:04:38] i'm not sure what the right combination of executors and memory is
[21:04:44] but i think we can do it.
[21:05:22] btw, i had to tweak your code to get it to run, mainly because of the udf name
[21:05:25] that is registered
[21:05:34] ottomata: okay awesome. it has to run every week though. will that take a hit on our cluster?
[21:05:45] the get_time_from_date thing?
[21:06:00] yeah, you are referring to it as getTimeFromDate in the sql
[21:06:09] at least, in the latest patch on gerrit
[21:06:09] ottomata: oh, yeah i fixed it
[21:06:12] aye cool
[21:06:17] but then i got rid of it
[21:06:24] btw, that is awesome!
[21:06:26] cool UDF!
[21:06:28] so simple!
[21:06:30] go spark!
[21:06:31] heheh
[21:06:38] i think that was before the ts field existed on the refine table
[21:06:40] aye
[21:06:43] ah
[21:06:44] k
[21:06:53] now that we have it, no need to make it ourselves
[21:07:08] also, i did --executor-cores 1
[21:07:10] well, removed it
[21:07:24] i don't think there is a need to have more than 1 core per executor
[21:07:47] anyway 40 executors isn't really that many
[21:07:56] ottomata, question?
[21:08:04] it really depends on how much ram each needs, because yarn is scheduling things based on memory
[21:08:32] cluster has >1TB ram
[21:08:42] 2g * 40 executors is 80G
[21:08:43] ottomata: right.
[21:09:05] let's say the job takes an hour to run, or less
[21:09:10] it'll be fine to do that once a week
[21:09:15] we can schedule that on the weekends or something
[21:09:19] yup, cool
[21:09:36] the annoying thing with spark though, is that it will wait until it has all the requested executors allocated by yarn before it starts doing anything
[21:09:44] that's a nice thing about map reduce: it will do what it can when it can
[21:10:20] ottomata: oh yeah, i noticed all the memory messages come before the task starts running.
[21:10:21] anyway, let's talk with joal this week about tuning this thing. i'm going to peace out soon, if you are working on it more today think about it, do some experiments
[21:10:29] really?
[21:10:41] i know that can happen, if driver memory is too low
[21:10:47] but i didn't see that with --driver-memory 1500
[21:10:49] m
[21:11:13] ottomata: aah, maybe i don't understand what it's really saying, i'll look again :)
[21:11:44] k, pretty sure i saw OOMs from executors when i was doing --driver-memory 1500m --num-executors 6
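[For reference, the combination that made the job run, collected into one sketch. This is a PySpark illustration only, assuming Spark on YARN; the actual job under discussion is the Scala refinery job launched with spark-submit, and the application name below is made up.]

    from pyspark import SparkConf, SparkContext

    # The knobs from the conversation: 40 executors with 1 core and 2g
    # each (40 * 2g = 80G, against a cluster with >1TB of RAM). Driver
    # memory has to be set at submit time (spark-submit --driver-memory
    # 1500m ...); setting it from inside an already-running driver has
    # no effect, so it is not included here.
    conf = (
        SparkConf()
        .setAppName("app-session-metrics-sketch")  # hypothetical name
        .set("spark.executor.instances", "40")
        .set("spark.executor.cores", "1")
        .set("spark.executor.memory", "2g")
    )
    sc = SparkContext(conf=conf)

[Since YARN schedules on memory, the trade-off sketched above is executor count versus per-executor heap, which is why 40 small executors worked where 6 larger ones hit OOMs.]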
[21:11:47] Ironholds: hi!
[21:11:48] whassup?
[21:12:11] ottomata, I'm about to say a sentence that'll make you really happy
[21:12:22] "I've been looking into oozie as a templating and scheduling engine for our data needs"
[21:12:25] haha
[21:12:29] nice!
[21:12:34] i mean, that kinda makes me happy?
[21:12:42] and I'd like to use it for the hadoop stuffs we want (dw about the actual stuffs; there'll be a thread)
[21:12:44] i mean, i feel sad for what you are about to deal with
[21:12:48] haha
[21:12:55] will it abstract away having to manually specify dates? :P
[21:13:06] yes
[21:13:10] then idgaf
[21:13:13] if you want to run something regular, then yes
[21:13:16] grand!
[21:13:21] you can do
[21:13:31] startTime = ... stopTime = year 3000 like we do :)
[21:13:35] haha
[21:13:38] ottomata: okay. I'll look at it more. thanks :)
[21:13:42] madhuvishy: yup!
[21:13:52] so I was wondering; is oozie not an appreciated thing, or..?
[21:14:07] Ironholds: it is the best thing of its kind that I know of
[21:14:12] because it looks like it works with MySQL too, which makes it sound valuable for general-purpose regular data scooping, so I was wondering why it isn't utilised more
[21:14:14] ergo "least awful"? ;p
[21:14:18] once you get it polished and working, it is great.
[21:14:29] but, developing your jobs is terribly cumbersome, and uh, XML
[21:14:54] XML is genuinely the main reason I dislike Java
[21:14:57] like, I like Java!
[21:14:59] Ironholds: it is the only thing that I know of that has 2 cool features:
[21:15:02] I strongly dislike the frameworks around it
[21:15:10] Ironholds: ottomata the thing that's worse than xml is having to manually specify the properties, making sure it's all there in all the different files
[21:15:29] i wish it would just generate a properties file when you write a workflow
[21:15:29] - dataset definition based on aggregate frequencies of data
[21:15:29] - can schedule jobs based on existence of datasets
[21:15:36] cron is dumb
[21:15:41] madhuvishy, when I were a lad we had to engrave start times on to the HDDs with a tiny needle
[21:15:42] it says "run this command every hour"
[21:16:02] haha
[21:16:09] Ironholds: ha ha. then you're going to enjoy oozie
[21:16:12] oozie says: "run this command when the last 30 days of this data is present"
[21:16:15] hahaha
[21:16:17] ottomata, ooh, clever!
[21:16:42] "Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions." dude, just say "cron that isn't terrible"
[21:16:44] Ironholds: I think the oozie organization in the refinery repo is relatively good
[21:17:00] Ironholds: that feature, the DAG of actions
[21:17:01] awesome. I was going to steal from the pageviews datasets
[21:17:06] is present in a lot of cooler things than oozie
[21:17:09] Luigi, uhhhh
[21:17:13] or rather, their jobs
[21:17:14] i am forgetting another one
[21:17:24] there are lots of job flow DSLs out there
[21:17:38] that are much more pleasant to work with
[21:17:53] but they don't have this awesome dataset feature
[21:18:00] which imo is the whole point :)
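[A sketch of the dataset-availability triggering described above ("run this command when the last 30 days of this data is present"), written in Python purely for illustration; Oozie itself expresses this declaratively in coordinator XML, and the HDFS path and done flag below are hypothetical.]

    import datetime
    import subprocess

    def dataset_ready(day):
        # Oozie-style readiness check: look for a done flag in the
        # day's directory. The path here is hypothetical.
        path = day.strftime("/wmf/data/example/daily/%Y/%m/%d/_SUCCESS")
        result = subprocess.run(["hdfs", "dfs", "-test", "-e", path])
        return result.returncode == 0

    def maybe_run(workflow, today):
        # Unlike cron's fixed timer, fire only once every one of the
        # last 30 days of input data actually exists.
        last_30 = (today - datetime.timedelta(days=n) for n in range(1, 31))
        if all(dataset_ready(d) for d in last_30):
            workflow()

[This is the "dataset definition plus schedule on existence" pair of features named above: the dataset describes where each day's data lives, and the coordinator only launches the workflow once the required instances are all present.]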
[21:18:41] ottomata: also, if you put in a start time for last month and schedule a daily job, does it start running from the start date?
[21:18:47] *thumbs up*
[21:19:33] as in, backfilling?
[21:19:33] yes
[21:19:42] also rerunning jobs in oozie is really good too
[21:19:53] aah nice
[21:21:07] Ironholds: if you are looking for an oozie example to work from, i think i'd look at oozie/webrequest/refine
[21:21:13] it has a lot of nice abstractions
[21:21:20] and uses lots of oozie features
[21:22:32] perfect; thanks!
[23:14:06] Analytics-Cluster, Fundraising Sprint Kraftwerk, Fundraising Sprint Lou Reed, Fundraising Tech Backlog, operations: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1327485 (AndyRussG) Thanks, interesting! >Sampling is done on the full incoming strea...