[15:02:06] csalvia, charles-salvia, we are in standup if you wanna join, but you don't have to
[15:02:13] it is an informal holiday stand up meeting :)
[16:03:56] (Abandoned) Milimetric: [ready for review] Added jumpstart [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/105893 (owner: Stefan.petrea)
[18:00:51] (CR) Ottomata: "I agree. But! Stefan this is really great work! You should maybe host it on your own github and point people to it." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/105893 (owner: Stefan.petrea)
[18:31:35] milimetric:
[18:31:36] you back?
[20:00:41] ottomata: back
[20:01:08] I'm not actually working today, but what's up, wikimetrics puppet stuff?
[20:01:25] oh! ottomata, you goin to San Fran this week?
[20:02:28] san fran?
[20:02:29] oh for thing
[20:02:30] naw, are you?
[20:06:14] yeah, I'll be in san fran
[20:42:53] (PS1) Ottomata: Upgrading pip to install wikimetrics dependencies in scripts/install [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/108625
[20:43:12] (CR) Ottomata: [C: 2 V: 2] Upgrading pip to install wikimetrics dependencies in scripts/install [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/108625 (owner: Ottomata)
[20:53:15] milimetric, awesome!
[20:53:33] (unrelated; I'm about to make a commit to the hadoop puppet files. Please be gentle, etc, etc.)
[20:55:16] ooooooOOOo cool
[20:56:19] ottomata, you say that now, you haven't seen it ;p
[21:06:15] (PS1) Ottomata: Upgrading setuptools with scripts/install [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/108629
[21:06:32] (CR) Ottomata: [C: 2 V: 2] Upgrading setuptools with scripts/install [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/108629 (owner: Ottomata)
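
(The two installer commits above follow a common bootstrap pattern: upgrade the packaging toolchain before installing dependencies, so newer packages build cleanly. A minimal sketch of that pattern, assuming a requirements.txt-style dependency list; the actual contents of the wikimetrics scripts/install may differ:)

    #!/bin/bash
    # Upgrade the packaging toolchain first so that newer sdists/wheels
    # install without errors, then pull in the project's dependencies.
    set -e
    pip install --upgrade pip
    pip install --upgrade setuptools
    pip install -r requirements.txt   # hypothetical dependency list
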
[21:07:11] Ironholds: just curious, what hadoop changes are you submitting?
[21:08:52] ottomata, just adding r-base to the worker machines
[21:08:58] (for reasons that will be explained in the commit)
[21:10:46] r-base hmmm
[21:10:50] is that a debian package?
[21:11:10] debian and ubuntu both
[21:11:15] it's the base language/compiler for R.
[21:11:26] was this manifest written by python nerds?
[21:11:29] everything has four spaces ;p
[21:11:34] ah ok
[21:11:34] cool
[21:11:58] haha, it is ops style choice
[21:12:04] i prefer 2 spaces myself
[21:12:08] what?!
[21:12:08] but ops makes us use 4 :/
[21:12:09] madness
[21:12:58] you guys know that tabs exist, right?
[21:13:04] they're LIKE spaces but less uncertain
[21:13:04] ;p
[21:14:36] 4 spaces is truth
[21:14:49] 4 spaces is beauty
[21:15:00] 4 spaces is everything
[21:15:16] 4 spaces is stupid, but so is the spaces/tabs debate, so ;p
[21:15:29] I thought we were just reciting poetry, not having a debate
[21:17:24] they're like spaces but more uncertain
[21:17:33] editors get all confuuuuused
[21:17:39] how fat should this be?
[21:17:40] iunnooo?
[21:17:43] alignment never works out
[21:18:04] the heathen who liketh fewer spaces speaks truth
[21:24:50] ottomata, milimetric, https://gerrit.wikimedia.org/r/#/c/108633/ if you have any interest :)
[21:27:52] (hang on, amending the commit for stylistic changes)
[21:30:41] so Ironholds, I'm not opposed to this, but in this case the issue isn't exactly with R or other python stuff not being available on hadoop workers
[21:30:51] it is the fact that the libs aren't available on your hadoop client
[21:30:56] you happen to be using analytics1011
[21:31:03] which, btw, I'd prefer if you used analytics1026
[21:31:06] these days
[21:31:10] but
[21:31:20] i think that we should probably install a hadoop client on stat1 or stat1002
[21:31:28] but that would be an issue i'd have to bring up with other ops folks
[21:31:31] stat1002 would be better
[21:31:38] stat1 means a hadoop link to something with a public IP
[21:31:39] that would also be easier to convince opsen of
[21:31:41] yeah
[21:31:43] (which, hey, I'm down with, but security)
[21:32:12] so the long-term solution is connecting the cluster up to a machine with analytics packages already on it; in the short-term, does this work for you?
[21:32:38] how are you expecting to work with this?
[21:32:40] run a hadoop job
[21:32:47] hadoop fs -get ../whatever to/whatever?
[21:32:57] then work with r locally?
[21:33:06] I think what Ironholds is actually looking for is RHive ..
[21:33:09] am I right Ironholds ?
[21:33:14] http://www.slideshare.net/SeonghakHong/r-hive-tutorial-udf-udaf-udtf-functions
[21:33:33] RHive would be _good_, but probably more of a pita to set up
[21:33:50] ottomata, pretty much. RHive would be better, of course
[21:33:58] also
[21:33:59] http://hortonworks.com/blog/using-r-and-other-non-java-languages-in-mapreduce-and-hive/
[21:34:08] although that'd presumably necessitate r-base anyway ;p
[21:34:24] yeah
[21:34:27] if we want to use r in that way
[21:34:31] then I am all for installing on workers
[21:34:40] that and pandas and numpy or whatever else
[21:34:45] as long as it has debian packages :)
[21:34:49] yep ;p
[21:36:03] i think the nicer solution is to add RHadoop to R and then run R jobs on the hadoop cluster from within R, I think the newest version of RHadoop supports YARN (not sure though)
[21:36:03] average, re rhive; see the "allows us to experiment with distributing the post-processing" comment in the commit
[21:36:34] well, rhadoop has been all split up, but yes
[21:36:45] we probably want rhdfs
[21:36:55] right rhadoop plus rhdfs
[21:37:15] well, no
[21:37:22] rhadoop is a collection of packages, of which rhdfs is one
[21:37:43] but, semantics
[21:39:20] my worry would be whether the rhadoop collection is dependent on RevolutionR rather than R proper - I haven't investigated it in detail yet
[21:39:26] in any case, r-base is presumably a prerequisite
[21:40:13] I get the impression it is compatible
[21:40:42] although it has Rcpp as a dependency. I always found that a pain to set up.
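
(The Hortonworks post linked above covers running R and other non-Java languages via Hadoop streaming, which is what makes r-base on every worker necessary: each map task shells out to the R interpreter locally. A rough sketch of such a job; the streaming jar path, the HDFS paths, and mapper.R are all illustrative assumptions, not taken from the actual commit:)

    # Ship an R script to the cluster and run it as a streaming mapper.
    # Every worker that executes a map task needs Rscript (shipped by
    # r-base) on its PATH, hence the package in the worker manifest.
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -input /tmp/example_input \
        -output /tmp/example_output \
        -mapper 'Rscript mapper.R' \
        -file mapper.R
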
[21:50:12] Ironholds: all the rhadoop/rhdfs/rcpp/revolutionr you mentioned, you can actually try adding that to the manifest too, and if it gets merged, then you can have it. For example, ottomata taught me about "vagrant enable-role analytics" (which you use with mediawiki-vagrant) and you can have a hadoop/hive test environment with it, and you can play with said R-related packages until they work on your mediawiki-vagrant and then you can submit all of them
[21:50:32] maybe the only impediment is that rcpp/rhdfs/rhadoop are not debianized yet, or maybe they are
[21:50:46] in some package repo somewhere
[21:51:08] they probably are; I know rcpp is (r-cran-rcpp)
[21:52:24] so, my thinking on this at the moment is: I'd like to do some in-depth research into the pros and cons of each option, precisely what's needed, so on and so forth, before throwing a load of requirements into the manifest (nobody likes clutter)
[21:52:45] and then ideally have someone from the analytics side review it for stupid (volunteers? ;p)
[21:53:05] ...and /then/ throw it in. Uninstalling is trivial, but I'd like to avoid having to keep tweaking and redoing the setup if at all possible.
[21:53:28] Ironholds: use labs or mediawiki-vagrant?
[21:53:32] to play and see what works best
[21:53:36] and then we can work together to productionize it
[21:53:55] makes sense
[21:54:00] thanks all :)
[21:55:03] so yeah, ideally all analysis packages that we install on hadoop workers would be hadoop related
[21:55:05] but! they don't have to be
[21:55:18] but, also ideally, if they aren't, we'd install a hadoop client in a place like stat1002
[21:55:21] indeed
[21:55:21] that is not part of the hadoop cluster
[21:55:25] which would totally make sense
[21:55:27] and is meant for general purpose computation
[21:55:46] but that sounds like a longer discussion, and I have mobile requestlog information requests /now/ ;p
[21:56:11] hopefully everything will settle in a few months. fwiw, I'm totally supportive of leaving the analytics machines relatively un-clogged and installing a client on 1002.
[21:56:19] Ironholds: if you want to debianize yourself, you may pick up one of the already debianized R modules here http://cran.r-project.org/bin/linux/ubuntu/precise/ and try out debianizing what you need
[21:56:21] so, if you need people on the research side arguing for that, poke me
[21:59:05] I'll experiment with the hadoop/R options and write something up
[21:59:54] I suspect that if the answer to a question is "just debianise a package", the question is fundamentally flawed ;p. As said, keeping the analytics machines uncluttered is something I am supportive of, so I'll avoid adding things like ggplot2 or plyr.
[22:00:43] at the moment it's just "having a way of post-processing data and breaking it down into an easily exported format, which can be taken to a machine with ggplot2". The client-on-stat1002 long-term solution should, well, solve this in the long term.
[22:06:55] Ironholds: cool, thanks
[22:07:12] i think we want to put a client on stat1002 anyway, (or at least, I do), so having you write something up with some motivation will make that happen
[22:07:38] shall do
[22:07:43] can you drop me an email so I don't forget?
[22:07:55] oliver@wikimedia.org has actually replaced my short-term memory. It's sad.
[22:08:12] and then GO ENJOY YOUR DAY OFF ;P
[22:18:26] done
[22:19:25] danke!
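
(For reference, the mediawiki-vagrant workflow being suggested above looks roughly like this; "analytics" is the role name quoted in the channel, and re-provisioning is the standard mediawiki-vagrant way to apply a newly enabled role:)

    # From a mediawiki-vagrant checkout: turn on the analytics role,
    # then re-provision so puppet builds the hadoop/hive test environment.
    vagrant enable-role analytics
    vagrant provision
    # Experiment with the R-related packages inside the VM; once they
    # work, submit the equivalent puppet change to gerrit.
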
[22:36:53] heyyy milimetric :)
[22:37:00] i am very close! having oauth problems
[22:39:09] https://github.com/wikimedia/analytics-wikimetrics/blob/master/wikimetrics/config/web_config.yaml#L21
[22:39:38] had a question about line 21, can it be made such that it uses a local mediawiki instead of meta.mediawiki.org ?
[22:39:44] meta.wikimedia.org, I mean
[22:41:11] ottomata, milimetric: maybe this is related to the oauth
[22:41:55] oo
[22:41:57] that would be nice if it could
[22:42:01] dunno
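
(The question above is about making wikimetrics' OAuth handshake target a local MediaWiki rather than meta.wikimedia.org. A hypothetical sketch of what such an override in web_config.yaml could look like; the key names below are assumptions for illustration only, not read from line 21 of the linked file:)

    # Hypothetical web_config.yaml override; key names are assumed,
    # not confirmed against the linked file.
    META_MW_BASE_URL: http://localhost/w/index.php
    # An OAuth consumer would also need registering on the local wiki:
    META_MW_CONSUMER_KEY: local-consumer-key
    META_MW_CLIENT_SECRET: local-consumer-secret
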