[08:23:39] ironholds can now login on analytics1011-1020. access request resolved
[14:02:07] morning guys
[14:02:21] milimetric……
[14:02:31] qchris...
[14:02:46] morning drdee :-)
[14:03:01] i would like to scope down the M from minimum in regard to the page view api
[14:03:10] considering that the largest labs instance has 160GB
[14:03:21] and that we have to write to an NFS partition
[14:03:42] i would like to propose to have a 3 month rolling window of data
[14:03:56] and drop the bytes transferred variable
[14:03:56] morning
[14:03:58] morning
[14:04:08] just to get a better understanding of the write performance
[14:04:29] hm
[14:04:32] let's drop labs then
[14:04:40] we're never going to get anything useful in 160GB
[14:04:52] it's a prototype we are building
[14:04:53] can we make a db publicly available in prod?
[14:05:04] sure but not in 2 weeks
[14:05:11] yeah but when your prototype for a building is made of sheep poop, it's not very useful :)
[14:05:11] MVP
[14:05:18] minimum viable prototype
[14:05:48] i agree with all of your and qchris's technical concerns
[14:06:00] then my vote is for the MVP to be the data ready to query in Hive
[14:06:01] but remember -- build one to throw one away
[14:06:11] this one we will throw away
[14:06:23] but we have something to show and to build on
[14:06:31] if you're 100% sure you're going to throw it away and it doesn't help you get any information, then you shouldn't build it
[14:06:40] but we are going to get information
[14:06:44] like what
[14:06:46] please trust me on this one :)
[14:07:18] in the end labs is just a storage backend for us
[14:07:25] we will replace it with something that scales
[14:07:40] but for now it allows us to get the data stream up and running
[14:07:45] and get feedback on use cases
[14:07:53] experiment with the db schemas
[14:07:59] that's all very useful information
[14:08:11] and we signal that we are working on it
[14:08:16] that's also valuable
[14:08:41] we can get feedback on use cases independent of where we store the data
[14:08:42] I agree with milimetric that 160 GB is not enough.
[14:08:55] according to yuvipanda 160GB is not a hard cap
[14:08:58] we can experiment with db schemas independent of where we store as well
[14:09:02] Our concern is MySQL at >200GB.
[14:09:08] and reporting that we're working on it is separate from building prototypes
[14:09:18] I am more than willing to trust you, but you have to make the argument
[14:09:32] but guys this only needs to work for a couple of months in its initial setup
[14:09:53] my point is, let's figure out if we would gain any information that would be useful to us in the long run
[14:09:57] so far I can't think of anything
[14:10:07] like, we could learn "160GB is not enough" :)
[14:10:16] I agree with milimetric
[14:11:02] i don't understand the resistance
[14:11:09] *might* not be a hard cap. you'll probably need to give a *really* good use case to Ryan Lane for him to change it
[14:11:21] we will build the data flows which we need anyways
[14:11:38] we get the aggregation working in hive which we need anyways
[14:11:55] we get people to use it and give feedback on dimensions and use cases which we need anyways
[14:12:17] it's super likely that 3 months from now we will be using a mysql instance in prod with a 1TB hard drive or more
[14:12:18] A prototype should show us how to proceed on the hard parts. The easy parts are never a problem. One of our hard problems is MySQL >200GB. If the prototype does not let us gain knowledge there, it does not buy us anything.
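(For reference, a minimal sketch of the daily table the scoped-down proposal implies. Every name and type here is hypothetical, since no schema was agreed on in this conversation; it mainly shows what dropping the bytes-transferred variable amounts to: one BIGINT column, roughly 8 bytes per row before indexes.)

    # Hypothetical schema for the scoped-down prototype; nothing here was
    # decided in the discussion above.
    DAILY_PAGEVIEWS_DDL = """
    CREATE TABLE daily_pageviews (
        project VARCHAR(32)    NOT NULL,  -- e.g. 'en' for en.wikipedia
        page    VARBINARY(255) NOT NULL,  -- title as bytes, sidesteps collation issues
        day     DATE           NOT NULL,
        views   INT UNSIGNED   NOT NULL,
        -- bytes_transferred BIGINT UNSIGNED,  -- the dropped variable
        PRIMARY KEY (project, page, day),
        KEY by_day (day)  -- needed to prune a rolling window cheaply
    );
    """

    if __name__ == "__main__":
        print(DAILY_PAGEVIEWS_DDL)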
[14:12:58] right qchris
[14:13:01] we have a whole battery of MySQL experts in ops who can help us with that
[14:13:08] we don't need to solve that problem from the get-go
[14:13:16] we anticipate the problem
[14:13:19] it will hit us
[14:13:22] but not right away
[14:13:28] so let's not solve it right away
[14:13:39] it's ok, nobody's decided one way or another, these are all just propositions
[14:13:44] let's talk them through logically
[14:13:55] build data flows which we'll need anyways
[14:14:06] sorry, let's number
[14:14:08] 1. build data flows which we'll need anyways
[14:14:17] 2. aggregation in hive
[14:14:30] 3. get people to use it and give feedback on dimensions and use cases
[14:14:43] s/give/get
[14:14:52] right, sorry :)
[14:15:13] any more things we want to learn, independent of solution?
[14:15:39] in a 2 week iteration i would like to add
[14:15:48] MySQL size concerns are kind of inherent to the solution to me, as MySQL is part of the requirement.
[14:16:08] that's why i said use a 3 month rolling window
[14:16:09] 4. find out if MySQL deals with data > 200GB well
[14:16:20] We know that I disagree about the second of the three points :-)
[14:16:55] we should not solve the mysql scaling problem
[14:17:03] it's something that we'd like to learn, but I agree not all people on the team want to learn it
[14:17:17] we're not solving a problem, it's a question
[14:17:35] basically, we're building the prototype in order to answer questions that we currently can't answer
[14:17:53] milimetric: Oh... I want to learn it. I'd like to learn it right now. But it's not about me. It's about requirements. And it's not a requirement of the card.
[14:18:12] I got it qchris, that is an important distinction
[14:18:21] in that case let's strike 2. from the list
[14:18:31] and strike labs
[14:18:34] so what's left?
[14:18:40] labs wasn't on the list
[14:18:41] we pipe the data flows to /dev/null?
[14:18:43] 1. data flows
[14:18:57] 2. get people to use it and get feedback
[14:19:03] use what?
[14:19:11] 3. does MySQL work for > 200 GB
[14:19:25] i don't see why that is a relevant question
[14:19:44] the answer is obviously yes btw
[14:19:46] why is it relevant to get people to use it?
[14:19:47] milimetric: sounds good to me.
[14:20:01] there are many SQL databases that are larger than 200GB
[14:20:32] sure, but what we want to know specifically is what the query performance is for this specific data, and whether we can build indices fast enough to keep up with the insert/delete stream
[14:20:44] we won't / we know that
[14:20:53] we are in labs on an nfs partition
[14:21:09] that's why i started this conversation with scoping down the prototype
[14:21:16] drdee: Where will the final product run?
[14:21:23] not in labs
[14:21:30] but we are building a prototype
[14:21:33] qchris: On dedicated machines?
[14:21:43] s/qchris/drdee/
[14:21:43] highly likely i would say
[14:21:58] Do we have those machines at hand already? When can we get them?
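(The insert/delete stream question above is concrete enough to sketch. A minimal version of the daily maintenance cycle, assuming the hypothetical daily_pageviews table sketched earlier and pymysql as the client library; none of these names come from the discussion itself:)

    import datetime
    import pymysql  # assumed client library, not something the team specified

    WINDOW_DAYS = 90  # the proposed 3-month rolling window

    def roll_window(conn, day, rows):
        """Insert one day of (project, page, views) aggregates, then prune.

        The DELETE, and the index maintenance it triggers, is exactly the
        part whose speed on a labs NFS partition is the open question.
        """
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO daily_pageviews (project, page, day, views) "
                "VALUES (%s, %s, %s, %s)",
                [(project, page, day, views) for (project, page, views) in rows],
            )
            cutoff = day - datetime.timedelta(days=WINDOW_DAYS)
            cur.execute("DELETE FROM daily_pageviews WHERE day < %s", (cutoff,))
        conn.commit()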
[14:22:00] finally, someone who properly ends his regexes with a /
[14:22:02] ;)
[14:22:11] no we don't
[14:22:15] but i think the route is
[14:22:19] build the prototype
[14:22:20] demo it
[14:22:22] show limits
[14:22:25] ask for hardware
[14:22:31] deploy on new hardware
[14:22:32] win
[14:22:47] sigh, ok, we're not done talking over the points
[14:22:49] Ok. Then we have to find and identify the limits.
[14:22:51] 1,2,3 I listed them
[14:23:08] do we disagree on 3?
[14:23:13] Sorry, milimetric. You are right. Back to the list.
[14:23:22] yes 3 is not the question
[14:23:26] 3 should be:
[14:24:02] does the largest labs instance give reasonable performance to end-users and enough data to query
[14:24:28] but we know the answer to that, and it's no
[14:24:29] drdee: That question does not seem relevant to the product.
[14:24:44] we don't know the answer at all
[14:24:52] we have no clue how fast queries will be
[14:24:57] our prototype compromised to one year of daily data
[14:25:07] that was the compromise, not even what our stakeholders wanted
[14:25:13] and i've been saying we should scale it back even further
[14:25:21] right, but that's no longer a compromise
[14:25:21] let people ask for more data
[14:25:28] sure it is
[14:25:29] they did ask for more data though
[14:25:32] i know
[14:25:44] and they will get it but not with the first iteration of the prototype
[14:26:01] I'm not sure they're getting anything for the first iteration, is my point
[14:26:11] because they didn't ask for what you're proposing
[14:26:23] i will take the blame
[14:26:30] it's really not about that at all
[14:26:31] and that's a great input
[14:26:40] I agree with milimetric. If we want to scale down, let's ask the community before we decide upon it.
[14:26:56] because then we can say: look, we need more resources to deliver what the community wants
[14:26:58] We are building it for them, aren't we?
[14:27:00] i have an idea
[14:27:10] well, it's a bad idea
[14:27:11] my plan was to get you guys on board first and then ask it on the mailing list
[14:27:21] And let's also make sure that the prototype answers our questions as well.
[14:27:22] but it seems that basically we *all* have a different thing we want to build
[14:27:24] i would not make such a decision unilaterally, obviously
[14:27:30] and a different way we understand what people are asking for
[14:27:48] and everyone (including me) is being very difficult about giving up their own particular interpretation
[14:27:50] :-)
[14:28:00] so why don't we just collaboratively build all the different ways we are interpreting the problem
[14:28:03] * qchris is difficult as well.
[14:28:04] then we can compare results
[14:28:30] milimetric: That sounds great from a learning point of view, but it burns resources.
[14:28:33] sure, but this burns resources too
[14:28:39] :-)
[14:28:54] and I'm asking my stubborn co-processor and it's saying "full steam ahead"
[14:30:13] in short, with the new constraint on labs, this has turned from a potentially publicly useful experiment into a pure experiment
[14:30:22] disagree
[14:30:24] You know my answer to this question. And it's not constructive. So I'll not interfere.
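("we have no clue how fast queries will be" becomes measurable the moment any instance holds data. A sketch of how that could be timed; the queries, table, and connection details are all stand-ins, not anything specified in this conversation:)

    import time
    import pymysql  # assumed client library

    # Stand-in queries against the hypothetical daily_pageviews table;
    # the real ones would come from community use cases.
    QUERIES = [
        "SELECT views FROM daily_pageviews"
        " WHERE project = 'en' AND page = 'Main_Page' AND day = '2013-06-01'",
        "SELECT day, SUM(views) FROM daily_pageviews"
        " WHERE project = 'en' AND page = 'Main_Page' GROUP BY day",
    ]

    def benchmark(**connect_args):
        conn = pymysql.connect(**connect_args)
        try:
            for sql in QUERIES:
                start = time.time()
                with conn.cursor() as cur:
                    cur.execute(sql)
                    cur.fetchall()
                print("%7.2fs  %s" % (time.time() - start, sql))
        finally:
            conn.close()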
[14:30:51] how about this:
[14:31:19] i will propose the scope reduction on the mailing list and we see how the community responds and we take it from there
[14:31:33] but drdee
[14:31:48] one of the major pieces of work that'll be involved here
[14:31:53] will be getting the data from prod to labs
[14:32:01] no
[14:32:02] making sure people sign off on it, that it's not private data, etc
[14:32:08] it's as simple as a wget from dumps
[14:32:16] private data?
[14:32:20] wait what?
[14:32:25] when did dumps get involved
[14:32:31] dumps.wikimedia.org
[14:32:36] that's where the data is hosted
[14:32:39] Yes.
[14:32:48] I took the data from there for my experiments.
[14:32:57] i thought we were moving that through production through some kind of crunching (not necessarily hive)
[14:33:19] that data is ready to be imported as of now
[14:33:34] so we're not crunching it?
[14:33:45] http://dumps.wikimedia.org/other/pagecounts-raw/
[14:33:48] We're aggregating it from hourly (that's what is available) to daily.
[14:33:56] right, i thought we were doing that in production
[14:33:58] yes that's the only step we would do
[14:34:06] but not in labs right...?
[14:34:11] no, data as-is is at hourly level
[14:34:29] Where we do that is not part of the card. We pick whatever fits best.
[14:34:29] not sure what you mean by production
[14:34:31] so we'd take hourly level data, crunch it in labs, then put it in a mysql labs db
[14:34:38] that's the proposal?
[14:34:41] for a 3 month sliding window?
[14:34:42] Yes, for example.
[14:34:43] no
[14:34:47] my proposal would involve hive
[14:34:54] oh for the love of
[14:35:01] that's production
[14:35:04] i don't think crunching that in labs is going to work
[14:35:08] hive == production cluster
[14:35:10] No sliding window. That would complicate things unnecessarily, and we do not need it in the final product.
[14:35:29] drdee introduced the sliding window as the only way to solve the 160GB limit in labs
[14:35:38] I agree that a sliding window is indeed the only way to solve that
[14:35:41] it was a suggestion
[14:35:44] if the data is to be even remotely useful
[14:35:56] because otherwise we just have a 3 month span of useless pageviews :)
[14:36:20] Narrowing from all of 2013 to the second half of 2013 would also reduce scope and not buy us new requirements.
[14:36:38] sure
[14:36:52] ok, so not sliding, >= 6/2013
[14:37:18] well, minus bytes_served
[14:37:35] I am not sure if that's the way to go. It was only an example ....
[14:37:40] right, true
[14:37:40] I think we're moving too fast.
[14:37:50] right, agreed
[14:39:23] i am curious to hear what you guys think is the MVP
[14:39:34] and what the technical implications are of that
[14:39:47] qchris, you go first
[14:40:00] qchris.
[14:40:16] Ok.
[14:40:32] I do not understand the card well enough,
[14:40:38] so take it with a grain of salt.
[14:40:53] Take the pagecount data from 2013.
[14:40:53] no problem, this card is all salt right now
[14:41:04] Get it into /some/ MySQL instance
[14:41:19] Give read access to this data to the community.
[14:41:32] Test performance on real queries.
[14:41:53] That's about it.
[14:41:54] Q about "Give read access to this data to the community."
[14:42:11] is that through an HTTP request or a CLI, or something else
[14:42:25] I forgot to add that we aggregate to daily /somehow/.
[14:42:36] drdee: That's undefined for me.
[14:42:53] what would you propose?
[14:43:07] They are a handful of trusted users?
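(The hourly-to-daily aggregation step is small enough to show end to end. A sketch that pulls one day of pagecounts-raw files from the URL quoted above and sums them; the file naming and the "project page views bytes" line format match what pagecounts-raw serves, but exact file names can drift, so treat this as illustrative:)

    import gzip
    import io
    import urllib.request
    from collections import defaultdict

    BASE = "http://dumps.wikimedia.org/other/pagecounts-raw"

    def aggregate_day(year, month, day):
        """Sum the 24 hourly pagecounts files for one day into daily counts."""
        daily = defaultdict(int)
        for hour in range(24):
            url = "%s/%d/%d-%02d/pagecounts-%d%02d%02d-%02d0000.gz" % (
                BASE, year, year, month, year, month, day, hour)
            raw = urllib.request.urlopen(url).read()
            for line in gzip.GzipFile(fileobj=io.BytesIO(raw)):
                parts = line.split(b" ")
                if len(parts) != 4:
                    continue
                project, page, views, _bytes = parts  # bytes column gets dropped
                daily[(project, page)] += int(views)
        return daily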
[14:43:24] trust no-one :)
[14:43:27] :-D
[14:43:37] i would not assume trusted users
[14:43:42] I would say not a handful - the general analytics-l at large
[14:44:05] I could imagine giving them access at the MySQL level.
[14:44:10] k
[14:44:19] But as said ... that's undefined and only a first approach.
[14:44:29] i think that rules out running it in prod -- that we can only do in labs
[14:44:40] but his first requirement rules out running it in labs
[14:44:45] exactly
[14:44:50] this is the problem
[14:44:55] i don't see it as a problem
[14:44:57] my turn?
[14:45:00] sure
[14:45:01] go ahead
[14:45:16] import all hourly data from Domas
[14:45:23] aggregate to daily in Hive
[14:45:29] dump into MySQL in prod
[14:45:49] make a website that takes in sql queries and executes them through a read-only user on that MySQL instance
[14:45:58] publish the site on stat1001
[14:46:38] when you say "sql queries" you mean actual raw sql queries?
[14:46:44] (reduce to 2013 data if MySQL chokes)
[14:46:46] yes, raw sql
[14:46:51] injection be damned
[14:46:56] that makes me very scared
[14:47:04] why? it's read only
[14:47:18] until it isn't
[14:47:30] also easy for people to choke the machine
[14:47:45] but this is something we'll have to deal with
[14:47:52] regardless of solution
[14:47:56] drdee: Let them choke the machine! We want to learn about that.
[14:48:01] in prod?
[14:48:02] exactly
[14:48:03] no thanks
[14:48:13] that's why we have labs
[14:48:14] Oh, you said you DIDN'T want that to happen?
[14:48:21] drdee: It's a prototype. Isn't it?
[14:48:31] right but not in prod
[14:48:49] We're still talking about the same card?
[14:48:57] yes :)
[14:48:59] i think we are
[14:49:04] Ok.
[14:49:55] i mean, we all said eventually we'd expose a public database in prod somewhere
[14:50:00] so the real question is not how to get the data in
[14:50:01] and allow read-only access
[14:50:03] but how to get the data out
[14:50:26] so my proposal is meant to learn as much as possible about that use case, within the 21 points
[14:50:29] milimetric is slightly more opinionated about how to get data in compared to qchris
[14:50:43] sorry, I'll rephrase
[14:50:57] s/aggregate to daily in Hive/aggregate to daily in the fastest way possible/
[14:50:57] but getting the data is the real challenge, if I read both of your proposals right
[14:51:33] fastest from a development point of view, not performance
[14:51:55] in my mind we would expose the data in prod through an API
[14:51:57] and yes, I think we all agree drdee that giving access to this data is the tough problem here
[14:52:26] but that's the core benefit of labs
[14:52:34] that we can give access to the data
[14:52:43] the price we pay is in performance and scalability
[14:52:44] that's one technical solution
[14:52:50] which is a dead end
[14:52:58] but people can access the data
[14:53:17] but it's not a dead end for a prototype
[14:53:18] Why don't we do the obvious thing and ask for a >160GB labs instance? We have a good use case, don't we?
[14:53:35] why ask for it now?
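(milimetric's "website that takes in sql queries" is small enough to sketch too. Flask and pymysql are assumptions, as are all names here; the SELECT-only MySQL account is carrying the whole "injection be damned" security model, and the row cap is one crude answer to the choke-the-machine worry:)

    from flask import Flask, jsonify, request  # assumed web framework
    import pymysql

    app = Flask(__name__)

    def connect():
        # 'pageviews_ro' stands in for a MySQL account that was GRANTed
        # SELECT only; read-only access is the entire security model here.
        return pymysql.connect(host="localhost", user="pageviews_ro",
                               password="...", database="pageviews")

    @app.route("/query", methods=["POST"])
    def query():
        sql = request.form["sql"]  # raw SQL straight from the user
        conn = connect()
        try:
            with conn.cursor() as cur:
                cur.execute(sql)
                rows = cur.fetchmany(1000)  # cap output so one query can't choke the box
            return jsonify([list(row) for row in rows])
        finally:
            conn.close()

    if __name__ == "__main__":
        app.run()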
[14:53:36] I don't think that's true from Ryan's point of view qchris
[14:53:48] why not first use 120GB and show the rate of growth
[14:53:51] and then ask for more
[14:53:53] because this is definitely a production type of thing
[14:54:02] yes that's very likely the answer
[14:54:11] drdee: because we can show the rate of growth without doing anything
[14:54:17] qchris already did it on the back of a napkin
[14:54:25] imagine what we could do with proper pen and paper :)
[14:55:18] ok, refocus time guys
[14:55:20] it's been over an hour
[14:55:32] what do you guys suggest?
[14:55:34] so - key question
[14:55:44] drdee: Getting something to eat :-)
[14:55:55] do we need to move on this card now to potentially support Lane Rasberry's recent email?
[14:55:56] let me attempt to summarize this discussion at a very high level
[14:56:12] or do we still all try to help qchris with his stuff and do this card at the end of the sprint
[14:56:26] i liked your suggestion milimetric yesterday
[14:56:32] let's first crank out the other cards
[14:56:39] because this one is a time sink
[14:56:44] 1) Diederik proposed a further reduction in scope
[14:56:51] but since then Lane mentioned his Oct. 11th deadline which is new info
[14:57:07] i gave him the answer to his question
[14:57:11] that was awesome of you to give him the queries
[14:57:21] 2) qchris & milimetric do not think that would be a viable prototype
[14:57:25] but there's a high chance he won't be able to do much with it
[14:57:36] 3) qchris & milimetric do not think the MVP fits in the constraints of labs
[14:57:38] Oh. I am working on my cards. And I can work on them alone. That's not a problem. I am still mostly stuck on clarifying them.
[14:57:52] do you guys agree with my summary?
[14:57:57] yes
[14:58:08] I agree with 1+2.
[14:58:29] how would you phrase 3?
[14:58:30] qchris: you can definitely work on them alone, you're pretty awesome
[14:58:36] I do not take the 160GB constraint as a hard constraint until we hear back from Ryan_Lane.
[14:58:38] but the point is, let's all help out so we can get to this card sooner together
[14:59:02] qchris: what would you say is the minimum amount of storage that we require?
[14:59:22] 3+) qchris & milimetric do not think the MVP fits in a 160GB constraint of labs
[14:59:34] That would work for me.
[14:59:39] k
[14:59:50] qchris: what would you say is the minimum amount of storage that we require?
[15:00:14] We have 480GB of data that should go in.
[15:00:23] So probably ~700GB?
[15:00:35] But do not nail me down on that.
[15:01:00] I have not done computations on that.
[15:01:55] there is no way ryan_lane will give us that
[15:02:33] Not even timeboxed?
[15:02:40] but i really don't see what the problem is with an n-month rolling window approach
[15:02:50] drdee: You do not have to implement it :-)
[15:03:07] timeboxed: you mean we will give it back?
[15:03:11] Sure.
[15:03:15] It's a prototype.
[15:03:28] That was one of the key ingredients of the card, wasn't it?
[15:03:53] remember scrum; conversation over process; cards are not set in stone
[15:04:12] But we want to have it in prod at the end of the day, don't we?
[15:04:46] yes, I think so
[15:08:16] Guys ... the hour is over, and I am starving :-)
[15:08:24] gogo qchris
[15:08:26] I'll read the backlog.
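(The napkin math behind those numbers, using only figures quoted in this log. The 13-month span is an assumption: it is what makes the 480GB total and the "3 months = 111GB" figure quoted a bit further down agree.)

    RAW_DATA_GB = 480      # "We have 480GB of data that should go in."
    TOTAL_GUESS_GB = 700   # qchris's "probably ~700GB", i.e. ~1.5x for indexes etc.
    LABS_CAP_GB = 160      # largest labs instance
    MONTHS_COVERED = 13    # assumed span of the 480GB (not stated in the log)

    raw_per_month = RAW_DATA_GB / MONTHS_COVERED               # ~36.9 GB
    three_months_raw = 3 * raw_per_month                       # ~111 GB, matching drdee
    on_disk = three_months_raw * TOTAL_GUESS_GB / RAW_DATA_GB  # ~162 GB with overhead
    print("3-month window: %.0f GB raw, ~%.0f GB on disk vs a %d GB cap"
          % (three_months_raw, on_disk, LABS_CAP_GB))

On these rough figures, even a 3-month window sits right at the 160GB cap once index overhead is counted, which is consistent with both the rolling-window push and the skepticism about labs.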
[15:08:36] never choose us over food :)
[15:21:22] milimetric: using qchris's back-of-the-envelope estimates
[15:21:38] if we go for 3 months of data then we would need 111GB
[15:23:28] and together with a rolling window i believe we would have something viable
[15:23:42] if only i could have the access logs of stats.grok.se to see what people are querying :)
[15:25:07] also if people really need to query historic data then they can always fall back to stats.grok.se
[15:25:19] they can always do that anyway
[15:25:31] but that's fine
[15:25:38] we can do both things
[15:25:51] we can push 3 months worth of rolling data into labs
[15:26:09] and we can put all of it in a mysql database in production to test the performance
[15:26:19] sure
[15:26:28] i like that
[15:26:38] we can even say the labs mysql is a place for people to write queries and if they want access to more data, just send an email and attach the query
[15:26:41] then we'll run it in prod
[15:26:49] that'll help us understand what people are asking
[15:27:02] and how a "real" MySQL database performs for them
[15:27:07] true
[15:27:31] but mostly I think my point is, if getting data to labs from hive is hard, let's skip it
[15:27:35] since it's not all that useful anyway
[15:28:06] i am not sure if it's hard -- i think some ports need to be opened by ottomata
[15:28:09] and by hard I mean like we're sitting there banging our heads into walls for like 3 days
[15:29:40] and to q-c-h-r-i-s's point, if hive is hard in the same way, we should drop it, and in general we should just learn what exactly is hard
[15:29:42] (not using his screenname because he's eating)
[15:34:11] so what's next?
[16:05:05] average are you around?
[16:46:17] (PS1) Stefan.petrea: Adding one more test to survival metric [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/87588
[16:46:38] ok, added one more test to the survival metric
[16:54:46] milimetric: ^^
[16:55:03] cool
[17:01:45] average scrum
[17:03:37] (PS1) Milimetric: Merge "Adding one more test to survival metric" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/87595
[17:03:38] (PS1) Milimetric: cleanup of flake8 problems [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/87596
[17:03:39] (PS1) Milimetric: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/analytics/wikimetrics [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/87597
[17:36:02] (PS2) QChris: Allow to reuse local checkouts of the data repositories [analytics/geowiki] - https://gerrit.wikimedia.org/r/85610
[17:36:37] (CR) Milimetric: [C: 2 V: 2] Allow to reuse local checkouts of the data repositories [analytics/geowiki] - https://gerrit.wikimedia.org/r/85610 (owner: QChris)
[17:40:21] (PS2) Milimetric: Clarify that format applies to data file when checking data sources [analytics/geowiki] - https://gerrit.wikimedia.org/r/85611 (owner: QChris)
[17:44:01] thanks for the patchset average, I re-deployed survival after making sure everything was clean
[17:44:10] I had to "git push" to gerrit because something got screwed up
[17:44:30] but I think Christian just taught me how to rebase and the problem shouldn't happen again
[17:44:37] in any case, git pull and that's all you need
[17:44:46] drdee, what should average work on next?
[17:46:43] batcave?
[17:46:52] milimetric: ^^
[17:47:02] sure
[18:53:06] activity
[18:56:18] qchris, around?
[19:03:43] he's at an FSF meeting
[19:08:38] k
[20:04:24] (PS1) Stefan.petrea: Implementing Threshold metric [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/87619
[20:10:49] (PS2) Stefan.petrea: Implementing Threshold metric [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/87619
[20:16:47] yo drdee
[20:35:55] drdee: are you talking to tnegrin by any chance?
[20:36:02] trying to reach him