[14:38:12] (CR) Milimetric: "This looks great at a glance Andrew, I've been out of the office for the last few days but I'll take a look at it soon." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/138151 (owner: AndyRussG)
[15:45:52] milimetric: Morning! Thanks for commenting on the patch... :) I have a quick question about project_host_map, could you maybe let me know sometime when you're not too busy? Thx...
[15:46:08] hi AndyRussG, shoot
[15:47:07] Ah K... So I see it builds up a dictionary with the dbname as the key and something, I think about the cluster shard or something, as the value
[15:47:27] as in {'enwiki': 's1'}
[15:47:41] but I don't see the value part being read anywhere in the code
[15:47:47] it's not
[15:47:51] Ah OK
[15:48:01] Or is the cached json file read from anywhere else?
[15:48:04] it used to be when we needed to use production back when it was user-metrics
[15:48:24] Ah OK
[15:48:39] so no, right now project_host_map should probably be a set
[15:48:46] The reason I'm wondering is that I have some additional data I want to store for each project: the API url
[15:49:20] So I was thinking of how to add that in to the same data structure, and change whatever code was needed to maintain compatibility
[15:49:49] But if it's not used I can just hose it, then, and have the URLs as the value, I think
[15:50:09] Would that be OK?
[15:50:21] oh, yeah, that would be fine. Maybe call it project_info and store a new object in there for each key, like wikimetrics.models.ProjectInformation
[15:50:43] or just use the value for the URLs and we'll do that next time we need to expand it
[15:50:57] yeah, that'd be fine
[15:51:16] ooooh the first option sounds elegant and I'm quite tempted, though maybe I should go for the more expeditious route
[15:52:44] I'm also adding this to db_config.yaml: SITE_MATRIX_URL : 'http://meta.wikimedia.org/w/api.php?action=sitematrix[...]'
[15:55:05] Thanks for the speedy answer :)
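
(For context, a minimal sketch of the project_info idea discussed above, assuming the simpler option where each dbname maps straight to its API URL rather than to a wikimetrics.models.ProjectInformation object. The function name and the exact SiteMatrix query parameters are illustrative, not the actual wikimetrics patch; the response layout follows the public sitematrix API format.)

    # Hypothetical sketch: build {dbname: api_url} from the SiteMatrix API.
    # build_project_info() and the format=json parameter are assumptions for
    # this example; the real change belongs in analytics/wikimetrics.
    import requests

    SITE_MATRIX_URL = 'http://meta.wikimedia.org/w/api.php?action=sitematrix&format=json'

    def build_project_info():
        """Map each project dbname to its api.php URL (cf. {'enwiki': 's1'} above)."""
        matrix = requests.get(SITE_MATRIX_URL).json()['sitematrix']
        project_info = {}
        for key, group in matrix.items():
            if key == 'count':
                continue  # bookkeeping field, not a language group
            # language groups keep their wikis under 'site'; 'specials' is a flat list
            sites = group['site'] if isinstance(group, dict) else group
            for site in sites:
                project_info[site['dbname']] = site['url'] + '/w/api.php'
        return project_info
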
[16:55:57] bleek
[16:56:00] ottomata, yo :)
[16:56:28] yoyo
[16:56:31] ok so
[16:56:36] quick update and q
[16:56:40] last weekend a kafka broker died
[16:56:45] this past weekend*
[16:56:58] we knew that one broker couldn't handle all of the traffic we are currently sending
[16:57:05] hence the upcoming node procurements
[16:57:21] *nods*
[16:57:27] DC datacenter techs are in dallas this week
[16:57:30] so we aren't going to get things fixed
[16:57:47] the plan was to order new hadoop datanodes, and take a couple of the old datanodes to use as kafka brokers
[16:57:51] now that one is dead
[16:57:54] gotcha
[16:57:56] and things are basically broken anyway
[16:58:05] we are thinking of taking 3 datanodes out of the hadoop cluster now
[16:58:09] to make into kafka brokers
[16:58:10] now
[16:58:20] so that we can go ahead and put the full traffic stream (including upload) into kafka
[16:58:23] and do failover tests
[16:58:25] and make sure that we can handle it
[16:58:40] that would limit the amount of data we can currently keep in hadoop even further
[16:58:44] need to do some numbers as to how much
[16:58:57] we can keep more if we don't import things like bits and upload into hdfs for now
[16:58:59] so
[16:59:00] q:
[16:59:12] do you need bits traffic for your research right now?
[16:59:12] and
[16:59:27] how much would it hurt if we only kept say, a week or two of historical data?
[16:59:34] (have to run the numbers to see how much we can keep)
[16:59:46] I do not, and not at all.
[17:00:14] ok cool
[17:00:21] not importing bits would enable us to keep more mobile and text
[17:00:27] ok, will run numbers and start working on that then
[17:00:41] hm
[17:00:42] one more q
[17:01:02] how would you feel if we wiped hadoop and installed CDH5 sooner rather than later? (like, within a week or two instead of a month or more?)
[17:01:07] absolutely fine
[17:01:21] In fact, if we want to only keep a week or 2 of mobile and text data, that's also fine.
[17:01:36] so, I have been thinking about why queries first worked, and then didn't work, and the conclusion is 'time'.
[17:01:55] When I ran them over month=05 and that consisted of 15 days of data, fine. 30 days of data, ah... not so fine.
[17:02:17] So, if we can't efficiently retrieve datasets of that size anyway, and we know we're going to have to wipe it all for CDH5... only keeping a couple of weeks seems reasonable.
[17:02:39] Realistically the only reason we need data over a longer period of time is for dealing with production-side errors and needing to reconstruct, or looking at longer trends.
[17:02:53] We have no production systems dependent on this at the moment, and when we say 'longer trends' we mean a year ;).
[17:02:56] well, and for things like you are doing
[17:03:02] we *should* be able to run the queries you are doing
[17:03:06] i see no reason why we can't support that
[17:03:08] yeah, but I can do that just as well on 2 weeks of data as on 4.
[17:03:13] yeah
[17:03:17] there are surely ways to make it more efficient
[17:03:24] and cdh5 won't immediately solve your problem
[17:03:26] me either! But in the short term, I would rather we have something that lets me get 2 weeks of data than something that doesn't let me get 4.
[17:03:29] sure
[17:03:30] and i'm sure there are many knobs that I don't yet know how to twewak
[17:03:31] tweak
[17:03:34] my thinking is this is sort of an interim measure.
[17:03:36] aye
[17:04:14] In the long-term we need 30 days of storage because, if nothing else, we need cover for "there was some flaw in how we were handling PVs and we need to reconstruct it". For the time being I need enough data to not be thrown off by behavioural differences on weekends.
[17:10:07] yeah
[17:11:20] so yeah, that sounds awesome :)
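
(Purely as an illustration of the "two weeks, not a month" point above: a sketch that builds a partition predicate covering only the last 14 days, so a query scans just those partitions. The webrequest table name and the year/month/day partition columns are assumptions for the example, not the actual schema on the cluster.)

    # Hypothetical sketch: restrict a query to the most recent two weeks of
    # year/month/day partitions instead of a whole month=05 of data.
    from datetime import date, timedelta

    def partition_predicate(days=14, end=None):
        """Return a WHERE clause covering the last `days` days of partitions."""
        end = end or date.today()
        clauses = []
        for offset in range(days):
            d = end - timedelta(days=offset)
            clauses.append('(year=%d AND month=%d AND day=%d)' % (d.year, d.month, d.day))
        return ' OR '.join(clauses)

    # e.g. embedded in a query string:
    query = 'SELECT COUNT(*) FROM webrequest WHERE %s' % partition_predicate()
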
[18:22:48] * greg-g waves
[18:22:54] drive by question:
[18:23:00] do we have similar data/graphs? https://twitter.com/anto_l/status/475990277967339520/photo/1
[18:23:18] "Desktop usage vanishes during the weekend in favor of mobile, and soon during the week"
[18:30:18] yes
[18:30:25] I just got to use run-length encoding in data analysis
[18:30:27] * Ironholds throws horns
[18:30:39] greg-g, yes, I generated them and showed them at metrics ;).
[18:30:49] I wouldn't say 'vanishes', but it's definitely a noticeable and consistent trend.
[18:30:53] you have better ones now right Ironholds
[18:31:07] well, the better ones show the variation in behaviour.
[18:31:11] Not in volume or proportion, though.
[18:31:14] ok
[18:31:17] Since I was using equally sampled mobile and desktop datasets
[19:45:39] (PS1) QChris: Make dammit aggregation try to grab its semaphore only once [analytics/wikistats] - https://gerrit.wikimedia.org/r/138405 (https://bugzilla.wikimedia.org/65627)
[20:03:19] (CR) QChris: Make dammit aggregation try to grab its semaphore only once (1 comment) [analytics/wikistats] - https://gerrit.wikimedia.org/r/138405 (https://bugzilla.wikimedia.org/65627) (owner: QChris)
[21:33:05] hm, i don't think i've gotten any instructions about how to do peer reviews yet...
[21:33:16] i'm searching my email
[21:33:19] but I don't see anything
[21:33:24] should I have received something?
[21:46:34] * marktraceur is introducing bistenes to interesting teams
[21:46:44] He's been exploring our code
[21:46:48] (don't worry, I'm apologizing on our behalf)
[21:47:01] Hi Analytics folks!
[21:47:30] Yeah, I’m just sort of exploring what y’all have going on right now
[21:47:36] * terrrydactyl waves
[21:47:52] big data is kind of a thing for me, so this looks like a pretty sweet opportunity
[21:48:36] * marktraceur turns on [[Big Data (band)]]
[21:50:19] cool. right now is kind of a dead-ish time for people, i think. since most people are in different time zones. but feel free to pop in here and hang. :)
[21:50:44] But luckily Ironholds is here and at the ready with terrible puns.
[21:54:25] * qchris waves at new faces ... Hi
[21:54:50] qchris: feels like i haven’t talked to you in forever. i keep missing the meetings or they get cancelled
[21:55:34] terrrydactyl: Ja. The tasking meetings were not too regular lately :-)
[21:56:19] this team isn’t all too regular :)
[21:56:23] If you miss the meetings ... just crash our post-standup discussions in the batcave.
[21:56:31] terrrydactyl: Right :-D
[21:56:38] * YuviPanda|zz waves at bistenes
[21:56:39] i didn’t know you had post-standup discussions.. :O
[21:56:44] bistenes: you should check out our mailing list as well
[21:56:49] feels a bit like fight club now
[21:56:50] hahaha
[21:56:56] Well ... it just happens.
[21:57:07] bistenes: https://lists.wikimedia.org/mailman/listinfo/analytics
[21:57:15] qchris: i’ll keep it in mind. :)
[21:57:43] (PS1) Terrrydactyl: Added projects to csv output [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/138474
[22:05:30] hi bistenes, I take it you know Analytics is hiring and you've seen the job postings on jobvite
[22:05:43] but either way, hello :)
[22:05:57] yep!
[22:05:58] thanks
[22:06:12] i'm making dinner so can't be too chatty atm, but I'm usually around normal hours on EST
[22:06:38] milimetric: just a fyi, the android app is out, so I'll probably build the dashboard a month or so out :) Will keep you informed
[23:08:47] Anyone have recommendations for an analytics Hello World?
[23:11:08] bistenes: https://bugzilla.wikimedia.org/buglist.cgi?keywords=easy&keywords_type=allwords&list_id=320607&product=Analytics&query_format=advanced&resolution=---&resolution=LATER&resolution=DUPLICATE has the easy bugs in the analytics product
[23:11:15] Not sure if that will contain helpful things.