[00:06:27] Analytics-Backlog, Analytics-Wikistats, DevRel-October-2015: Clean the code review queue of analytics/wikistats - https://phabricator.wikimedia.org/T113695#1764384 (Aklapper) >>! In T113695#1705220, @ezachte wrote: > This week I will have to focus on T114379, as that one has been prioritized and other... [00:28:43] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Create new Hive / Oozie server from old analytics Dell {hawk} - https://phabricator.wikimedia.org/T110090#1764469 (kevinator) [00:28:44] Analytics-Cluster, Analytics-Kanban: {mule} Hadoop Cluster Expansion - https://phabricator.wikimedia.org/T99952#1764468 (kevinator) [00:29:11] Analytics-Cluster, Analytics-Kanban: {mule} Hadoop Cluster Expansion - https://phabricator.wikimedia.org/T99952#1764472 (kevinator) Open>Resolved a:kevinator This project is done, so I am going to close it :-) [00:36:27] hey, guys, disk space on stat1002 is starting to get low [00:36:41] some of you have dozens of gigabytes in their /homes [00:36:53] if you know of something you dont need, would be nice to delete [00:40:51] Analytics-Backlog, Research management, Research-and-Data: Pipeline for data-intensive applications from research to productization to integration - https://phabricator.wikimedia.org/T105815#1764495 (DarTar) https://etherpad.wikimedia.org/p/things-revscoring-has-now-in-labs @leila @yuvipanda added the... [00:44:42] Analytics-Backlog: Data Store hardware procurement - 2015 - https://phabricator.wikimedia.org/T117008#1764497 (kevinator) NEW [00:46:33] Analytics-Engineering, operations: disk space on stat1002 getting low - https://phabricator.wikimedia.org/T117009#1764505 (Dzahn) NEW [00:46:51] Analytics-Backlog, Analytics-Kanban: Test Elastic search pageview data loading/retrieval on labs {slug} - https://phabricator.wikimedia.org/T116763#1764515 (kevinator) [00:46:52] Analytics-Backlog, Analytics-Kanban: Druid testing on labs to asses whether is a suitable Cassandra replacement. 
{slug} - https://phabricator.wikimedia.org/T116409#1764516 (kevinator) [00:46:53] Analytics-Backlog: Data Store hardware procurement - 2015 - https://phabricator.wikimedia.org/T117008#1764514 (kevinator) [00:47:17] Analytics-Engineering, operations: disk space on stat1002 getting low - https://phabricator.wikimedia.org/T117009#1764517 (Dzahn) /dev/mapper/tank-home 1008G 979G 30G 98% /home 398G ellery 73G nuria 64G west1 58G ironholds 54G mforns 48G ezachte 47G spetrea 46G jamesur 43G milimetric 40G madhuvishy 3... [00:47:45] uh... oopsey [00:51:01] Analytics-Engineering, operations: disk space on stat1002 getting low - https://phabricator.wikimedia.org/T117009#1764524 (Milimetric) I just deleted mine, you can safely delete /home/spetrea (he no longer works here). The stuff in ellery's folder looks hard to delete but I'll try to get him to store the... [00:51:19] nuria: can you check your /home/nuria folder on stat1002, it looks like we're running out of space ^ [00:53:56] Analytics-Engineering, operations: disk space on stat1002 getting low - https://phabricator.wikimedia.org/T117009#1764525 (Dzahn) @milimetric thank you! i deleted spetrea's data. Yuvi and Leila have mailed ellery. 17:55 < icinga-wm> RECOVERY - Disk space on stat1002 is OK: DISK OK we have 10% free agai... [01:06:12] milimetric: will do now [03:08:02] Analytics-Backlog, Analytics-Kanban, MediaWiki-API: Add Application errors for Mediawiki API to x-analytics - https://phabricator.wikimedia.org/T116658#1764646 (DarTar) [03:50:23] Analytics-Backlog: Reformat pageview API responses to allow for status reports and messages - https://phabricator.wikimedia.org/T117017#1764669 (Ironholds) NEW [03:51:16] Analytics-Backlog: Are the per-article or top-100 lists meant to be working in the pageviews API yet? 
- https://phabricator.wikimedia.org/T117018#1764677 (Ironholds) NEW [07:50:38] Analytics-Tech-community-metrics, Outreachy-Round-11: Outreachy Proposal for Improving MediaWikiAnalysis - https://phabricator.wikimedia.org/T116733#1764830 (Anmolkalia) [09:37:45] (PS1) Joal: Compress output of pageview legacy files (gzip) [analytics/refinery] - https://gerrit.wikimedia.org/r/249698 [09:38:20] (CR) Joal: [C: 2 V: 2] "Self merging for fast deploy." [analytics/refinery] - https://gerrit.wikimedia.org/r/249698 (owner: Joal) [09:52:54] Analytics, Beta-Cluster-Infrastructure: deployment-fluorine fails puppet '/usr/sbin/usermod -u 10003 datasets' returned 4: usermod: UID '10003' already exists - https://phabricator.wikimedia.org/T117028#1764903 (hashar) NEW [10:04:46] (CR) Matthias Mullie: Measure the user responsiveness to notifications over time (6 comments) [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/249394 (https://phabricator.wikimedia.org/T108208) (owner: Matthias Mullie) [10:04:59] (PS2) Matthias Mullie: Measure the user responsiveness to notifications over time [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/249394 (https://phabricator.wikimedia.org/T108208) [10:27:47] (PS1) Joal: Correct bug in gzipped legacy pageview filename [analytics/refinery] - https://gerrit.wikimedia.org/r/249715 [10:28:04] (CR) Joal: [C: 2 V: 2] "Self merging." 
[analytics/refinery] - https://gerrit.wikimedia.org/r/249715 (owner: Joal) [10:34:29] !log refinery deployed [10:34:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [10:34:40] !log restarted pageview job to archive gzipped files [10:34:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [10:35:03] !log Gzipped already archived pageview files [10:35:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [10:37:29] (CR) Joal: [C: 1] "Looks good to me Marcel, let me know when you want it deployed." [analytics/refinery] - https://gerrit.wikimedia.org/r/249659 (https://phabricator.wikimedia.org/T116931) (owner: Mforns) [11:11:28] Analytics-Tech-community-metrics, DevRel-October-2015: Correct affiliation for code review contributors of the past 30 days - https://phabricator.wikimedia.org/T112527#1764971 (Lcanasdiaz) >>! In T112527#1762001, @Aklapper wrote: >>>! In T112527#1757945, @Luiscanasdiaz wrote: >> #5 we have a bug in our qu... 
[12:03:09] * joal will be back in a few hours [12:43:50] (CR) Milimetric: Measure the user responsiveness to notifications over time (2 comments) [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/249394 (https://phabricator.wikimedia.org/T108208) (owner: Matthias Mullie) [13:22:55] (PS1) Matthias Mullie: Exclude moderated data [analytics/limn-flow-data] - https://gerrit.wikimedia.org/r/249731 (https://phabricator.wikimedia.org/T116797) [13:25:34] (CR) Milimetric: [C: -1] Exclude moderated data (1 comment) [analytics/limn-flow-data] - https://gerrit.wikimedia.org/r/249731 (https://phabricator.wikimedia.org/T116797) (owner: Matthias Mullie) [13:30:05] (PS2) Matthias Mullie: Exclude moderated data [analytics/limn-flow-data] - https://gerrit.wikimedia.org/r/249731 (https://phabricator.wikimedia.org/T116797) [13:30:23] (CR) Matthias Mullie: Exclude moderated data (1 comment) [analytics/limn-flow-data] - https://gerrit.wikimedia.org/r/249731 (https://phabricator.wikimedia.org/T116797) (owner: Matthias Mullie) [13:30:41] (CR) Milimetric: [C: 2 V: 2] Exclude moderated data [analytics/limn-flow-data] - https://gerrit.wikimedia.org/r/249731 (https://phabricator.wikimedia.org/T116797) (owner: Matthias Mullie) [13:35:24] (PS3) Matthias Mullie: Measure the user responsiveness to notifications over time [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/249394 (https://phabricator.wikimedia.org/T108208) [13:35:26] (CR) Matthias Mullie: Measure the user responsiveness to notifications over time (2 comments) [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/249394 (https://phabricator.wikimedia.org/T108208) (owner: Matthias Mullie) [13:37:08] (CR) Matthias Mullie: "Oh, 1 more question: will this cause old data to be regenerated, or will it only affect new data?" 
[analytics/limn-flow-data] - https://gerrit.wikimedia.org/r/249731 (https://phabricator.wikimedia.org/T116797) (owner: Matthias Mullie) [13:38:26] (CR) Milimetric: Measure the user responsiveness to notifications over time (1 comment) [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/249394 (https://phabricator.wikimedia.org/T108208) (owner: Matthias Mullie) [13:39:49] (CR) Milimetric: "It will only affect new data, unless we manually delete the old results from the static files. In that case reportupdater will think it m" [analytics/limn-flow-data] - https://gerrit.wikimedia.org/r/249731 (https://phabricator.wikimedia.org/T116797) (owner: Matthias Mullie) [13:39:52] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765167 (Halfak) > I think understanding the semantics of an event primarily requires knowledge of the topic. This is true if y... [13:48:16] (CR) Matthias Mullie: "Ah ok. Could you please clear out those old results? (no rush)" [analytics/limn-flow-data] - https://gerrit.wikimedia.org/r/249731 (https://phabricator.wikimedia.org/T116797) (owner: Matthias Mullie) [13:50:22] (CR) Milimetric: "if you have access to stat1003, you can go to /a/limn-public-data/flow/datafiles/ and clear what you like. But, if not, I can easily do i" [analytics/limn-flow-data] - https://gerrit.wikimedia.org/r/249731 (https://phabricator.wikimedia.org/T116797) (owner: Matthias Mullie) [14:01:54] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [8 pts] - https://phabricator.wikimedia.org/T114379#1765212 (ezachte) Dan, here is a comparison of data for one hour in webstatscollector 1/2/3 Most counts are similar, or understand... 
[15:03:27] Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765348 (Ottomata) > @Ottomata, I think understanding the semantics of an event primarily requires knowledge of the topic. Hm,... [15:04:49] master/bdea350 (#179 by milimetric): The build has errored. https://travis-ci.org/wikimedia/limn/builds/88129986 [15:06:28] develop/bdea350 (#180 by milimetric): The build has errored. https://travis-ci.org/wikimedia/limn/builds/88130551 [15:11:33] o/ milimetric [15:11:54] It's late notice, but would you be interested in attending a research group meeting today in 1.5 hours? [15:12:10] 12:30 EDT [15:12:30] halfak: I would... but :( I have another meeting then [15:12:38] We're going to have a talk from some external researchers about what they are doing with pageviews. [15:12:49] dammit [15:12:51] I suppose anyone who wanted to attend would be welcome. [15:13:08] halfak: mind sending me any notes afterwards? [15:13:20] I'm attending this epic meeting I can't skip at 12:30 [15:13:34] Gotcha. now worries. Any chance you'll be done by 1:00? [15:13:44] no, it's 1 hour [15:13:46] NOW WORRIES :S [15:13:52] :) [15:13:55] Gotcha. OK. Notes it is :D [15:14:06] halfak: so you know about the pageview API? [15:14:11] Yes. [15:14:17] Waiting on per article stuffs :) [15:14:25] halfak: it's there, but only daily [15:14:32] hourly's too big, cassandra puked and died [15:14:34] that's pretty good for my needs. [15:14:37] lol [15:14:47] October is backfilled, and being kept up to date I think? [15:14:54] and we're filling more months as we speak [15:14:56] How is it that hourly is intractable! [15:15:06] ha! too many dimensions [15:15:21] it'd work if we just had article and view count and hour [15:15:25] It's one of those engineering problems where the gap between how simple it sounds and how complicated it really is -- is super duper vast. 
[15:15:41] yes, exactly [15:15:46] milimetric, that would be super useful. [15:15:50] Article/viewcount/hour [15:16:13] right, but when we said we wanted to do that the whole thread revolted and said they NEEDED all these other dimensions [15:16:14] :) [15:16:32] so, I'm curious, what kinds of use cases are there for hourly that can't be done with daily [15:16:38] I'd love it if you asked that in the meeting [15:17:13] joal: would you wanna go this pageview research meeting ^ [15:17:23] hey milimetric [15:17:27] * joal reads [15:17:39] Hi halfak :) [15:18:43] Hey joal [15:19:13] milimetric, this presentation will be all about studying circadian patterns. So sub-daily is essential. [15:19:53] milimetric, halfak : meeting as well at that time (for 1h) [15:20:10] Arg! [15:20:15] :s [15:21:48] damn meetings. Ok, so hourly is important, circadian patterns!!! [15:22:46] right halfak milimetric [15:23:02] * halfak forwards email [15:23:11] It'll have some plots and discussion. [15:23:26] Analytics-Cluster, Database: Replicate Echo tables to analytics-store - https://phabricator.wikimedia.org/T115275#1765426 (jcrespo) >>! In T115275#1722695, @Neil_P._Quinn_WMF wrote: > This is similar to {T75047}. Last word on that was in August—Ops said it should wait on some rearchitecting they're doing.... [15:23:40] joal: so if we didn't have access and agent, the hourly data would be about the same size as the daily data, right? [15:23:53] 24/16 times bigger [15:23:56] Analytics-Cluster, Database: Replicate Echo tables to analytics-store - https://phabricator.wikimedia.org/T115275#1765427 (jcrespo) a:jcrespo [15:23:58] hm, thinking milimetric [15:24:06] Analytics-Cluster, Database: Replicate Echo tables to analytics-store - https://phabricator.wikimedia.org/T115275#1719744 (jcrespo) p:Triage>Normal [15:24:50] so like, pageviews user only, all access methods milimetric [15:25:45] /<hour>/# of pageviews [15:26:03] <milimetric> halfak: but all-access all-agents or all-access and "user"? 
[15:26:28] <halfak> I don't know what those words mean. [15:26:31] <joal> halfak: as many spider removal as we can :) [15:26:44] <halfak> Yes. As much cleanup as makes sense. [15:27:13] <joal> halfak: project\tarticle\thour\tcount [15:29:22] <halfak> joal, yeah [15:29:42] <halfak> This is what Wikipedians have been asking for. [15:30:01] <halfak> Actually, I take that back. Daily would work for most Wikipedian use-cases. [15:30:10] <halfak> I think this is more on the research side of things. [15:30:42] <milimetric> right, so the question is whether the research is widespread enough to warrant a public API or it's ok to just sign an NDA and get data from our cluster [15:30:44] <halfak> We can learn a lot about what various part of the world are interested in by correlating the rotation of the planet with page view rates. [15:31:20] <halfak> milimetric, I'd say its very widespread. EZ has been doing this type of analysis in the past too. [15:31:31] <halfak> Ironholds has done some work on circadian patterns too. [15:31:42] <milimetric> right, but they both have access to the cluster [15:32:04] <halfak> Yeah. Just using example people. [15:32:05] <milimetric> I'd say if there are 10 people that want to do this research and would have to sign an NDA, then it warrants a public API, because that means there are hundreds of others who might be interested [15:32:23] <halfak> There's more than 10 for sure. [15:32:23] <milimetric> but if it's just us internally and like 4 others externally, we can get them to sign an NDA, not a big deal [15:32:30] <milimetric> k, then we can make a case [15:32:55] <milimetric> so joal, then, 24/16 the size of daily? [15:33:22] <milimetric> how many days do we have in AQS right now? 
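A minimal sketch of how the `project\tarticle\thour\tcount` layout joal proposes at 15:27 could be consumed for the circadian analysis halfak describes. The sample rows, the `YYYYMMDDHH` hour format, and the function name are assumptions made for illustration; the chat does not define any of them.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; titles, counts and the hour format are invented.
SAMPLE = (
    "en.wikipedia\tCoffee\t2015102900\t12\n"
    "en.wikipedia\tCoffee\t2015102901\t7\n"
    "de.wikipedia\tKaffee\t2015102900\t3\n"
)


def hour_of_day_profile(tsv_text):
    """Sum view counts per hour of day (0-23) across all rows."""
    profile = defaultdict(int)
    # QUOTE_NONE so that quote characters inside article titles stay literal.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for project, article, hour, count in reader:
        hour_of_day = int(hour[-2:])  # assumes the stamp ends in HH
        profile[hour_of_day] += int(count)
    return dict(profile)


profile = hour_of_day_profile(SAMPLE)
```

With only one value per article and hour, a file like this carries none of the access or agent dimensions, which is the trade-off being discussed.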
[15:34:53] <joal> milimetric: currently running a query to have proper number :) [15:35:37] <joal> milimetric: about 500G / month in aqs (but replication downsize should make it smaller) [15:36:05] <milimetric> ok, so if we added hourly we'd have like 1100 / month [15:36:13] <milimetric> which means we would run out of space much faster [15:36:23] <milimetric> halfak: how far back do you think the hourly data should go? [15:36:33] <milimetric> like, to be useful, would you need 1 month? 6 months? a year? [15:36:39] <milimetric> that'd be very very interesting to know [15:37:31] <halfak> Hmm... I'll ask in the meeting today. [15:37:45] <halfak> I don't think I can answer that as well as EZ, David and Ironholds can [15:37:47] <wikibugs> Analytics-Engineering, operations: disk space on stat1002 getting low - https://phabricator.wikimedia.org/T117009#1765475 (chasemp) p:Triage>Normal [15:38:14] <Ironholds> what did I do? [15:38:26] <halfak> CIRCADIANISM [15:38:35] <Ironholds> milimetric, from a consumer perspective I would like as much data as humanly possible [15:38:54] <wikibugs> Analytics-Engineering, operations: disk space on stat1002 getting low - https://phabricator.wikimedia.org/T117009#1765481 (chasemp) a:Ottomata @ottomata I am going to toss your way as even though we OK on the check for the moment you have the best idea of whether this needs longer term attention or not [15:38:56] <halfak> But from an engineering perspective, how much is still useful [15:39:14] <halfak> I'd like to say that a year would be important. [15:39:22] <milimetric> Ironholds: if I could I would eat nothing but pure honey all day long. But I don't want to die. 
So there are other considerations :) [15:39:22] <nuria> milimetric: let's please not change anything on the cassandra cluster for loading now [15:39:36] <nuria> milimetric: it is not only about space but also failures of loading jobs [15:39:42] <milimetric> nuria: not changing anything, just talking about size / numbers / feasibility so I can give Aaron good questions to ask the researchers [15:40:25] <nuria> milimetric: sounds good. [15:40:44] <wikibugs> Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765493 (GWicke) @ottomata: Based on our backwards-compatibility rules, the latest schema will be a superset of previous schemas... [15:40:51] <nuria> milimetric: let's capture the use case and talk about it in terms of $$$ and hardware [15:40:55] <Ironholds> milimetric, with halfak on this. 12 months. [15:41:12] <Ironholds> and I have a very obvious use case the people with budgetary control will agree with: [15:41:16] <wikibugs> Analytics-Engineering, operations: disk space on stat1002 getting low - https://phabricator.wikimedia.org/T117009#1765495 (Ottomata) Open>Resolved Looks like all is well: /dev/mapper/tank-home 1008G 561G 448G 56% /home Thanks! [15:41:27] <Ironholds> Lila comes up to you and says 'pageviews went down in April and then up in May. Does this always happen?' [15:41:44] <halfak> Ironholds, ^ can do that with daily [15:41:47] <nuria> Ironholds: but you can answer taht with daily resolution [15:41:47] <Ironholds> we always get those kinds of questions. If we have less data than is necessary to answer them we have less data than is necessary. [15:41:51] <Ironholds> halfak, oh wait, this is hourly? [15:41:52] <nuria> Ironholds: not hourly [15:41:56] <Ironholds> sorry, my bad ;p [15:41:57] <halfak> yeah. 
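A back-of-envelope version of the sizing exchange between 15:23 and 15:36 above, as a hedged sketch. It reads joal's "24/16" as 24 hourly rows per article against 16 daily rows per article (one per access and agent combination), which is an interpretation and not something the log states outright. The 500G figure is joal's rough number for one month in AQS.

```python
def hourly_size_estimate(daily_gb_per_month,
                         hours_per_day=24,
                         daily_rows_per_article=16):
    """Estimate storage for a single-dimension hourly dataset.

    Returns (hourly_gb, combined_gb), where combined_gb is the existing
    daily data plus the new hourly data for the same month.
    """
    ratio = hours_per_day / daily_rows_per_article  # 24/16 = 1.5
    hourly_gb = daily_gb_per_month * ratio
    return hourly_gb, daily_gb_per_month + hourly_gb


hourly_gb, combined_gb = hourly_size_estimate(500)
```

Under these assumptions the hourly data alone comes to 750G per month and the combined total to 1250G, the same ballpark as milimetric's "like 1100 / month"; replication settings would change the real figure.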
[15:41:59] <halfak> :D [15:41:59] * Ironholds thinks [15:42:02] <halfak> CIRCADIANISM [15:42:03] <Ironholds> hourly project or page counts? [15:42:15] <halfak> project/title/count [15:42:20] <Ironholds> ohh [15:42:22] <halfak> *project/title/hour/count [15:42:30] <nuria> right, it is -like milimetric said- a trade of [15:42:34] <Ironholds> 6 months? More than 1 month but you probably won't need a year. [15:42:43] <joal> nuria: reading your TL;DR - Don't agree with everything [15:42:45] <Ironholds> Like, this is the sort of thing I would almost run by our direct consumers - people like Mako. [15:42:52] <halfak> What about comparing circadian patterns this year to the same time last year? [15:42:54] <nuria> if we need to spend 1 million dollars in hardware is easier to grant ndas to 20 people (far fetched example) [15:42:54] <halfak> Ironholds, ^ [15:43:10] <halfak> nuria, is it easier to grant NDAs to 20 people? [15:43:12] <nuria> joal: ah please, EDIT [15:43:15] <halfak> I don't know if that is as clear [15:43:18] <joal> k nuria :) [15:43:25] <halfak> We'd need a whole new people manager for NDAs at that scale. [15:43:26] <nuria> joal: forgot to send e-mail about that [15:43:30] <Ironholds> halfak, tbh I would't use these dumps for that because they're not localised or in a form that allows for localisation [15:43:43] <milimetric> halfak: circadian patterns have everything to do with light, so year over year matters in most of the world except like the equator [15:43:50] <nuria> halfak: agreed, but likely will make less than 1million a year on my -far fetched- example [15:44:10] <halfak> milimetric, even at the equator, they have seasons [15:44:18] <halfak> Sport seasons. Vacation seasons. [15:44:41] <halfak> nuria, everyone at the WMF isn't making 1 million / year ;D [15:44:43] <halfak> ? [15:44:50] <halfak> OK. Fair point. 
[15:44:53] <nuria> right right [15:44:59] <nuria> thus my "far fectched" [15:45:00] <halfak> It would be great if we could have a public hadoop [15:45:07] <halfak> Maybe an external provider [15:45:14] <halfak> And we'd only have this public data in it. [15:45:30] <halfak> The same stuff that might be made available via the API [15:45:40] <halfak> That'd be far less than 1 mill per year [15:45:57] <nuria> halfak: jaja, right, that is another way to solve use case [15:46:11] <halfak> OK. I'll collect what I can from this meeting and report back. [15:47:51] <nuria> halfak: i am al for capturing what we need ideally and after engineering can work with that and say: "this is what we can give you for this much" (not only money but manpower and such). At this time hourly resolution it is not only a capacity problem but also a scale one at the time of loading data. Solvable, but an issue. [15:49:30] <halfak> nuria, fair enough. But I'm showing up with use-cases and we wouldn't be discussing this unless milimetric & joal thought it might be possible if we drop all other dimensions -- which I think would still retain the primary value. [15:49:39] <chasemp> interesting topic [15:49:49] <halfak> Either way, I think you're point is totally fair. [15:49:52] <halfak> Everyone costs something. [15:49:56] <halfak> *everything [15:49:57] <nuria> halfak: space wise probably, loading wise I have to see it to believe it [15:50:03] * halfak is not planning to sell people yet [15:50:13] <nuria> halfak: two different issues [15:50:57] * halfak totally knows the difference between "your" and "you're" but needs more coffee to type the right one automatically. [15:51:01] <nuria> joal, milimetric : added TLDR here: https://wikitech.wikimedia.org/wiki/Analytics/PageviewAPI/DataStore [15:51:10] <nuria> milimetric: please edit. correct as needed. [15:51:12] <joal> nuria: actually I updated it, please review ;) [15:51:18] <halfak> ha [15:52:15] <nuria> joal: thank you! 
[15:52:44] <joal> nuria, milimetric, given the trouble we go through with this API and data size thing, I have been reading about Kylin today [15:52:50] <joal> Maybe another way to go [15:53:24] <nuria> Kylin.. [15:53:39] <joal> The reason why it might, is because it is hadoop based - Less storage concerns [15:53:54] <nuria> need to read more about it [15:53:54] <joal> nuria: http://kylin.incubator.apache.org/ [15:54:07] <joal> nuria: will present it after standup [15:54:28] <nuria> joa: : will stop working in a bit for today, will be back tomorrow [15:54:37] <Ironholds> milimetric, I threw my open questions and suggestions in as phab tickets, btw [15:54:37] <nuria> sorry, joal [15:54:37] <joal> oh yeah, forgot about that :) [15:54:41] <joal> Seen your email nuria [15:54:53] <joal> Have a good day, we'll talk about that tomorrow nuria [15:55:10] <joal> If everybody agrees, I'll invest some time testing Kylin [15:57:49] <milimetric> joal: do you think Kylin is sufficiently isolated from Hadoop? Back to our old discussion of not allowing direct public access to the cluster for performance and availability reasons [15:58:28] <joal> milimetric: Kylin is hive (cube building) + hbase (data serving) [15:58:38] <joal> milimetric: with a front end [15:59:14] <milimetric> I thought we hated hbase with the force of a thousand suns [15:59:22] <joal> milimetric: so yes, it is based on hadoop, but the question is then, is that worse to have to maintain hadoop (which we maintain anyway), or a full Druid / cassandra / ES ? [15:59:46] <joal> Meaning, whatever we'll use, seems like there'll be difficulties :) [16:00:01] <milimetric> well, that's obviously true :) [16:00:09] <milimetric> like, that's why we have a job [16:01:55] <madhuvishy> joal: coming to standup? 
:) [16:04:39] <wikibugs> Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Use Burrow for Kafka Consumer offset lag monitoring {hawk} [8 pts] - https://phabricator.wikimedia.org/T115669#1765582 (kevinator) [16:06:51] <wikibugs> Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765592 (Ottomata) Have we decided that defaults will be filled in for missing fields? [16:09:37] <wikibugs> Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1765597 (kevinator) [16:13:39] <wikibugs> Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765600 (GWicke) @ottomata, they will be filled in somewhere, but I think we haven't necessarily decided on filling them in at p... [16:13:44] <wikibugs> Analytics-Backlog, Analytics-Kanban: Test Elastic search pageview data loading/retrieval on labs {slug} [8 pts] - https://phabricator.wikimedia.org/T116763#1765601 (kevinator) [16:14:57] <grrrit-wm> (PS2) Milimetric: Improve details in browser report [analytics/refinery] - https://gerrit.wikimedia.org/r/249659 (https://phabricator.wikimedia.org/T116931) (owner: Mforns) [16:15:03] <grrrit-wm> (CR) Milimetric: [C: 2 V: 2] Improve details in browser report [analytics/refinery] - https://gerrit.wikimedia.org/r/249659 (https://phabricator.wikimedia.org/T116931) (owner: Mforns) [16:21:11] <wikibugs> Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765626 (Ottomata) Producer A has schema version 1. Producer B has schema version 2, which has added field "name" with default... 
[16:22:09] <grrrit-wm> (CR) Matthias Mullie: "I thought I didn't have access to stat1003, but apparently I do." [analytics/limn-flow-data] - https://gerrit.wikimedia.org/r/249731 (https://phabricator.wikimedia.org/T116797) (owner: Matthias Mullie) [16:23:32] <grrrit-wm> (CR) Milimetric: "Cool. The files will be re-filled whenever the report runs. If it's weekly, it'll be Sunday. If it's hourly, it'll be soon." [analytics/limn-flow-data] - https://gerrit.wikimedia.org/r/249731 (https://phabricator.wikimedia.org/T116797) (owner: Matthias Mullie) [16:24:48] <wikibugs> Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765642 (GWicke) @ottomata, you are basically making the case for filling in the defaults at consumption time. [16:33:05] <milimetric> can someone tell mforns to log on? [16:33:05] <milimetric> :) [16:33:33] <mforns> milimetric, hey [16:33:38] <milimetric> hey hey [16:33:42] <mforns> sorry [16:33:54] <mforns> what's up? [16:34:07] <milimetric> ok so the lag parameter thing [16:34:11] <mforns> aha [16:34:24] <milimetric> (you can ignore this, I'll just type it here so you know what I meant) [16:34:36] <mforns> is it to permit the data sources to finish being collected? [16:34:40] <milimetric> what mathias wants is to be able to run the report on Nov 1st, but get data from Sep. 1st to Oct 1st. [16:34:51] <mforns> aha [16:34:53] <mforns> why? [16:34:53] <milimetric> so ideally the "label" as in "date" for that data would be September [16:34:57] <milimetric> because it's September data [16:35:14] <milimetric> (because the full data is not available until a month later, it's about how long notifications go without being read) [16:35:25] <mforns> I see [16:35:42] <mforns> can I see the query? [16:37:51] <mforns> is it limn-wikidata-data? [16:37:59] <milimetric> limn-ee-data [16:38:39] <mforns> mmm, the github repo is empty... 
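The "lag" idea milimetric describes at 16:34 (run the report on Nov 1st, query Sep 1st to Oct 1st, and label the result as September) can be sketched as plain date arithmetic. This is not reportupdater's actual code or configuration; the function name and the `lag_months` parameter are hypothetical.

```python
from datetime import date


def lagged_month_window(run_date, lag_months=1):
    """Return (start, end) for the month to report on, end exclusive.

    With lag_months=1, a run during November covers September: the month
    that closed one full month before the current one began.
    """
    # Count months from year 0 so that year boundaries need no special case.
    month_index = run_date.year * 12 + (run_date.month - 1)
    end_index = month_index - lag_months
    start_index = end_index - 1

    def first_of_month(index):
        return date(index // 12, index % 12 + 1, 1)

    return first_of_month(start_index), first_of_month(end_index)


start, end = lagged_month_window(date(2015, 11, 1))
label = start.strftime("%Y-%m")  # the "date" the data would be filed under
```

The label is taken from the window start, so the row lands under September even though the job ran in November.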
[16:39:26] <wikibugs> Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765677 (Ottomata) Or produce time. But really, even if we fill in defaults during production or consumption, this will still b... [16:40:40] <joal> mforns: shall I deploy the change for the browser report ? [16:40:50] <mforns> joal, oh btw, thanks for the review [16:40:57] <joal> np mfori [16:41:01] <joal> mforns sorry [16:41:04] <mforns> yes, please, if you think it is good to go [16:41:26] <joal> It has been merged I think, so must ge good :) [16:41:27] <mforns> joal, is that something you always will do/want to do, or I can also do that? [16:41:35] <mforns> ok :] [16:41:45] <joal> mforns: You could do that, no problem for me [16:41:56] <joal> Only think is you need hdfs user rights to restart the jobs [16:42:03] <mforns> ok, do you want to do that in the batcave so that I can see? [16:42:08] <joal> sure ! [16:42:17] <mforns> cool, so next time I can do that [16:42:32] <mforns> I'm there [16:52:05] <ottomata> going to cafe, back shortly [16:56:00] <mforns> milimetric, I think we don't need any change in reportupdater [16:56:22] <mforns> just use granularity=months and frequency=months [16:57:15] <mforns> lie, frequency does not accept the value "months" [16:57:56] <mforns> but it could, and I think this would be sufficient [16:58:44] <mforns> mmmm, I'm not sure, will look at it [17:04:29] <mforns> a-team, are we having backlog grooming? [17:04:41] <joal> currently joining mforns [17:04:51] <mforns> batcave or backlog? [17:04:57] <joal> But if we only are the two of us, maybe we'll cancel? 
[17:05:03] <mforns> yes [17:10:38] <grrrit-wm> (PS1) Christopher Johnson (WMDE): updates local data files updates README [wikidata/analytics/dashboard] - https://gerrit.wikimedia.org/r/249784 [17:33:37] <wikibugs> Analytics-Backlog, Analytics-Kanban, MediaWiki-API: Add Application errors for Mediawiki API to x-analytics - https://phabricator.wikimedia.org/T116658#1765858 (bd808) a:bd808>None [17:41:53] <milimetric> mforns: batcave? [17:42:00] <mforns> milimetric, sure [17:42:59] <joal> milimetric: when you have finished with mforns, if not too late, let's spend a minute in cave :) [17:56:30] <milimetric> omw joal [17:57:28] <joal> omw as well :) [18:08:07] <wikibugs> Analytics-Cluster, Database: Replicate Echo tables to analytics-store - https://phabricator.wikimedia.org/T115275#1765958 (Neil_P._Quinn_WMF) Thanks for the update, @jcrespo! [18:11:39] <wikibugs> Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765962 (GWicke) @ottomata: If you fill in the defaults at consumption time, then you have a choice of how you want to treat old... [18:13:29] <wikibugs> Analytics, Discovery, EventBus, MediaWiki-General-or-Unknown, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1765969 (Ottomata) Events will be consumed into Hadoop close to production time (within an hour usually). Schema changes made y... [18:55:52] <joal> hey halfak, are you around ? [18:56:00] <halfak> Yeah. What's up? [18:56:28] <joal> I'll have some time tomorrow and possibly next week to (finally) add dumps data in hive :) [18:56:47] <halfak> Yay! I think we should plan to work together so that I can learn your process. 
[18:56:47] <Ironholds> halfak, you just missed out on a great "sorry, I'm not around, I'm actually ahypergeometric" joke opportunity [18:57:05] <halfak> lol [18:57:23] * joal looks through Ironholds with incomprehension [18:57:52] <joal> halfak: let's take a minute now to plan ? [18:58:42] <halfak> I have time between 1530 and 1700 UTC tomorrow [18:59:55] <joal> halfak: only 1/2 hour overlap (scrum + 1-1 between 16:00 and 17:00) [19:00:04] <Ironholds> joal, a-round, as in a round shape, compared to a shape existing on a hypergeometric plane, which contains >3 dimensions [19:00:28] <halfak> Oh! Could do 1300-1430 too [19:00:51] <joal> early for you isn't it? [19:01:06] <joal> thanks Ironholds :) [19:01:12] <halfak> Na. I'm usually online at 1300 [19:01:21] <joal> I'll to look through you in 3d then ;-P [19:01:29] <joal> ok halfak [19:01:33] <halfak> If you're willing to tolerate 5-10 minute windows of uncertainty [19:01:34] <joal> Let's take that time [19:01:39] <halfak> Also known as coffee [19:01:42] <halfak> Great :) [19:01:44] <joal> I'll plan the collaboration having scripts ready [19:01:51] <joal> :) [19:02:15] * halfak makes sure that data is in the right place. [19:02:21] <joal> :) [19:02:40] <joal> we'll use the research cluster, right ? [19:02:43] <halfak> joal, BTW, we'll be on a different cluster. The "ia" one. [19:02:44] <halfak> Yeah [19:02:46] <halfak> lol [19:02:52] <joal> :) [19:03:18] <joal> halfak: I am not sure I have access to that machine [19:03:25] * joal checks [19:03:52] <joal> I do ! [19:04:13] <joal> hmmmm, actually, not sure [19:04:46] <joal> I have access to 2 workbenches, but with the same address: ia.z42.altiscale.com [19:05:30] <joal> ok, halfak, I confirm I have access to IA [19:05:55] <mforns> milimetric, I think I have something working but have some difficulties testing, do you have a couple minutes to help me?
[19:06:14] <milimetric> joal: fyi in case you missed it, aqs1001 just went down for hardware maintenance, it was about to catch fire due to old thermal paste we think [19:06:20] <milimetric> mforns: yes! [19:06:25] <milimetric> tothebatcave! [19:06:26] <mforns> milimetric, batcave? [19:06:35] <mforns> :] [19:06:36] <joal> milimetric: just read about that yeah ! [19:07:00] <joal> milimetric: those machines do some work indeed :) [19:07:11] <joal> not useful work, but the work ;) [19:08:19] <joal> ok halfak and a-team, I'm gone for tonight, will see you all tomorrow :) [19:08:52] <halfak> o/ [19:13:49] <milimetric> nite! [19:19:26] <halfak> milimetric, where's the best reference for what the pageview API will likely look like and do? [19:19:44] <milimetric> halfak: our collective brain [19:19:47] <halfak> Boo [19:19:48] <halfak> BOOOO [19:19:50] <halfak> :P [19:19:52] <milimetric> well, we don't know! [19:20:00] <halfak> Maybe the current proposal? [19:20:04] <milimetric> it depends on soo many things [19:20:05] <halfak> Gotcha. [19:20:14] <halfak> Trying to communicate what it would be to the researchers who presented today [19:20:15] <milimetric> the last thing we talked about was like 5 minutes ago [19:20:21] <halfak> Gotcha. [19:20:41] <milimetric> so obviously we'd love to just put all dimensions out for public use [19:20:47] <milimetric> but that doesn't seem possible [19:21:14] <milimetric> what we are leaning towards now is putting like 2 months of data with ALL dimensions into Druid so you can query it however you want internally [19:21:27] <milimetric> externally we would expose some endpoints to query it in a limited way [19:21:37] <milimetric> and that "limited" way can include a LOT of things [19:21:41] <halfak> Gotcha.
[19:21:44] <milimetric> because we have everything as the source [19:21:52] <milimetric> and it depends almost entirely on what use cases people bring up [19:22:15] <milimetric> but after a lot of years of listening to what people want, the line between "ooh shiney I want that" and "I need this for my research" is still very blurry [19:22:25] <halfak> Do you think it is too early to ask researchers for their use-cases? I mean, if we can't even say what they might get. [19:22:49] <halfak> I mean, I'm hoping to pull in researchers who are currently working with our hourly dumps. [19:23:07] <halfak> So, "ooh shiney" would be "if I had this right now, I would <foo>" [19:24:21] <milimetric> halfak: not too early to ask for use cases, no, we definitely need to know that stuff [19:24:45] <milimetric> and yeah, if they can tie to "if I had this I would do <foo>" then that's great, we can say - <foo> is really important, we should do that [19:27:50] <halfak> OK. I'll do my best with imagining what I think the API will look like and go from there [19:28:55] <milimetric> halfak: oh, so the space of possibilities is anonymized data with any of the dimensions of pageview_hourly: https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly [19:29:09] <milimetric> (if we want more dimensions in there we can do it, but we didn't see a pressing need) [19:29:52] <milimetric> by anonymized we mean that if you're pulling geographic data it can't be used in combination with other sources to de-anonymize and locate users [19:56:58] <wikibugs> Analytics-Kanban: Add lag option to reportupdater - https://phabricator.wikimedia.org/T117091#1766341 (mforns) NEW a:mforns [19:59:03] <grrrit-wm> (PS1) Mforns: Add lag option to reportupdater [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/249813 (https://phabricator.wikimedia.org/T117091) [20:01:47] <grrrit-wm> (CR) Milimetric: [C: 2 V: 2] Add lag option to reportupdater [analytics/limn-mobile-data] - 
https://gerrit.wikimedia.org/r/249813 (https://phabricator.wikimedia.org/T117091) (owner: Mforns) [20:33:16] <wikibugs> Analytics-Cluster, Analytics-Kanban, Easy, Patch-For-Review: PM sees reports on browsers (Weekly or Daily) {lama} [8 pts] - https://phabricator.wikimedia.org/T88504#1766501 (kevinator) [20:33:37] <wikibugs> Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Browser Report. Small bugfixes {lama} - https://phabricator.wikimedia.org/T116931#1766503 (kevinator) [20:34:19] <wikibugs> Analytics-Kanban, Patch-For-Review: Create deb package for Burrow {hawk} [5 pts] - https://phabricator.wikimedia.org/T116084#1766505 (kevinator) Open>Resolved [20:34:20] <wikibugs> Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Use Burrow for Kafka Consumer offset lag monitoring {hawk} [8 pts] - https://phabricator.wikimedia.org/T115669#1766507 (kevinator) [20:39:10] <wikibugs> Analytics-Cluster, Analytics-Kanban, Easy, Patch-For-Review: PM sees reports on browsers (Weekly or Daily) {lama} [8 pts] - https://phabricator.wikimedia.org/T88504#1766522 (kevinator) [20:39:25] <wikibugs> Analytics-Cluster, Analytics-Kanban, Easy, Patch-For-Review: PM sees reports on browsers (Weekly or Daily) {lama} [8 pts] - https://phabricator.wikimedia.org/T88504#1766524 (kevinator) Open>Resolved [20:39:33] <wikibugs> Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Use Burrow for Kafka Consumer offset lag monitoring {hawk} [8 pts] - https://phabricator.wikimedia.org/T115669#1766525 (kevinator) Open>Resolved [20:40:08] <wikibugs> Analytics-Cluster, Analytics-Kanban, Patch-For-Review: camus offset fails/continues Load job {hawk} [13 pts] - https://phabricator.wikimedia.org/T113252#1766527 (kevinator) Open>Resolved [20:40:09] <wikibugs> Analytics-Backlog, Analytics-Cluster: Implement better Webrequest load monitoring {hawk} - https://phabricator.wikimedia.org/T109192#1766528 (kevinator) [20:47:22] <wikibugs> Analytics-Backlog, Analytics-Kanban: Test Elastic search pageview data 
loading/retrieval on labs {slug} [8 pts] - https://phabricator.wikimedia.org/T116763#1766585 (kevinator) Open>Resolved a:kevinator [20:47:23] <wikibugs> Analytics-Backlog: Data Store hardware procurement - 2015 - https://phabricator.wikimedia.org/T117008#1766587 (kevinator) [20:47:48] <wikibugs> Analytics-Backlog, Analytics-Kanban: Improve record size on cassandra storage for pageview API data {slug} - https://phabricator.wikimedia.org/T116209#1766589 (kevinator) [20:50:10] <kevinator> milimetric: did you see ezachte's comment on https://phabricator.wikimedia.org/T114379 ? [20:50:35] <kevinator> I was about to close the ticket, but want to know if we're going to do something about the small deltas [21:00:44] <wikibugs> Analytics-Cluster, Analytics-Kanban: Research whether no cookie header numbers improve Last access uniques {bear} [13 pts] - https://phabricator.wikimedia.org/T115350#1766691 (kevinator) [21:00:46] <wikibugs> Analytics-Cluster, Analytics-Kanban, Epic: {bear} Last Access Counts - https://phabricator.wikimedia.org/T88647#1766690 (kevinator) [21:05:01] <wikibugs> Analytics, MediaWiki-extensions-Gadgets: Gadget usage statistics - https://phabricator.wikimedia.org/T21288#1766719 (NiharikaKohli) [21:10:09] <wikibugs> Analytics-EventLogging, Analytics-Kanban: {stag} EventLogging on Kafka - https://phabricator.wikimedia.org/T102225#1766754 (kevinator) [21:11:05] <wikibugs> Analytics-Cluster, Analytics-Kanban: {wren} PV Aggregates - https://phabricator.wikimedia.org/T96314#1766755 (kevinator) Open>Resolved a:kevinator [21:11:11] <wikibugs> Analytics-Cluster, Analytics-Kanban: {musk} Pageviews in Vital Signs - https://phabricator.wikimedia.org/T101120#1766757 (kevinator) Open>Resolved a:kevinator [21:12:36] <wikibugs> Analytics-Cluster, Analytics-Kanban: {mule} Hadoop Cluster Expansion - https://phabricator.wikimedia.org/T99952#1766763 (kevinator) [21:26:26] <milimetric> kevinator: reading now [21:39:37] <wikibugs> Analytics-Kanban, Analytics-Wikistats, 
Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1766800 (Milimetric) The .mw missing from "wsc 3.0" data. Hm, I missed [[ https://github.com/wikimedia/analytics-refinery/blob/f... [21:39:51] <milimetric> ok kevinator, added comments, I'll move it to In Review [21:48:09] <grrrit-wm> (CR) Mforns: [C: -1] "Hi Matthias," (9 comments) [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/249394 (https://phabricator.wikimedia.org/T108208) (owner: Matthias Mullie) [23:06:48] <grrrit-wm> (CR) Milimetric: Measure the user responsiveness to notifications over time (1 comment) [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/249394 (https://phabricator.wikimedia.org/T108208) (owner: Matthias Mullie) [23:12:58] <wikibugs> Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts] - https://phabricator.wikimedia.org/T114379#1767209 (Tnegrin) I wonder if we redacted the W0 numbers in 2.0. I seem to remember some concerns about sharing those publicly.