[19:36:58] * jeremyb waves milimetric
[19:37:04] howdy
[19:37:09] * jeremyb wonders where the EST comes from?
[19:37:12] * jeremyb is NYC
[19:37:17] yeah, Philly
[19:37:24] isn't our timezone called EST?
[19:37:24] :)
[19:37:40] oh like what city
[19:37:46] yes, philly
[19:38:01] yes, that :)
[19:38:57] milimetric: so, i think i could set aside a few hours this wednesday to play with stuff...
[19:41:23] cool :)
[19:41:23] i mean, we're happy to have you, jeremyb
[19:41:23] you're coming at a bit of a transitional time
[19:41:24] we're adding some big features to wikimetrics
[19:41:24] and I foresee big merges ahead of us
[19:41:25] but wait, what specifically did you want to get involved with
[19:42:50] well i did click the hadoop/zero dash link because it interested me
[19:43:07] idk if it's feasible for a non-WMF person to do that one though
[19:43:35] well, hm, i'm sure we could get you the access you'd need if you really wanted to work on it
[19:43:44] but just so you know, it's not any new development
[19:43:58] we have all the scripts and jobs in place running in hadoop local mode right now
[19:44:00] and they're creating reports
[19:44:27] this task is basically taking that process, moving it to the actual hadoop cluster, and running it on the raw logs instead of the sampled logs
[19:44:30] i thought it was all running on sampled data and you wanted to duplicate it on hadoop (so i guess 2 concurrent dashboards)
[19:44:36] and then later find a time to cut over
[19:44:37] yep
[19:44:51] but we're not going to change the process at all
[19:45:13] there's some python and some bash, hang on, getting the gerrit repo for it
[19:45:35] yeah, that was another reason to like it, i like python and bash :)
[19:45:40] https://git.wikimedia.org/summary/analytics%2Fwp-zero
[19:46:03] anyway, maybe there's some better fit elsewhere. idk
[19:46:18] well, no, I don't mean to discourage you from this
[19:46:23] just want to explain what it is
[19:46:42] so right now there's a bunch of stuff not yet committed to this repo that's waiting for review
[19:46:50] and qchris_away is the author
[19:47:35] but it's in gerrit?
[19:48:14] he's the best one to talk to about it, and about potentially doing the work to make it run in the hadoop world
[19:48:14] yes, the code reviews are in gerrit
[19:48:15] i'll link you to one (they're in progress)
[19:48:19] are you guys responsible for https://git.wikimedia.org/metrics/analytics%2Fwp-zero ?
[19:48:28] or do the metrics come for free with gitblit?
[19:48:48] i think that comes with gitblit
[19:49:03] i was just thinking we could dedupe the evans
[19:49:17] qchris_away: ^^
[19:49:21] right, not sure how
[19:49:28] yeah, average, i'm sure he'll see this tomorrow
[19:49:41] ah yeah, I guess he's busy right now
[19:49:44] so, we've tasked it out as a team and the first step is basically trying to figure out if Apache Pig will work on top of our compressed JSON files in HDFS
[19:49:47] he's +0100 ?
[19:49:53] I think +2
[19:50:06] because right now the sampled files are uncompressed and not json :)
[19:50:10] a lot of these tools tend to accept a list of mappings from alias to canonical name
[19:50:15] +0100 at this time of the year
[19:50:35] heh, darn farmers
[19:51:07] yeah, darn subsidies too
[19:51:30] were you interested in other tasks so you can compare the work?
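The spike milimetric lays out above, checking whether Pig can read the compressed JSON files in HDFS at all, amounts to a smoke test along these lines. This is only a sketch under assumptions: elephant-bird's JsonLoader is one candidate loader (the actual wp-zero scripts may use something else entirely), and the jar and data paths are placeholders, not real cluster paths.

    # minimal smoke test for the Pig spike -- not a known-working recipe;
    # jar path, loader class, and input path are assumptions for illustration
    cat > /tmp/snappy_json_smoke.pig <<'PIG'
    -- elephant-bird's JsonLoader is one candidate for reading JSON in Pig;
    -- whether it copes with these particular snappy files is exactly the open question
    REGISTER '/usr/lib/pig/elephant-bird.jar';
    logs = LOAD '/wmf/data/zero' USING com.twitter.elephantbird.pig.load.JsonLoader();
    -- any trivial computation is enough to prove Pig could read the input at all
    grouped = GROUP logs ALL;
    counted = FOREACH grouped GENERATE COUNT(logs);
    DUMP counted;
    PIG
    pig /tmp/snappy_json_smoke.pig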
[19:51:53] btw, our current thoughts on this are in the first section of this etherpad: http://etherpad.wikimedia.org/p/analytics-tasking
[19:51:58] (it's card 1453)
[19:52:40] milimetric, yes pig will work on compressed json :)
[19:52:50] woo!
[19:52:55] that saves us 40 hours
[19:52:55] i'm not entirely sure what you guys are talking about, but! hive will too and maybe hive is the way we want to go?
[19:52:58] not sure
[19:53:16] yeah, we don't want to move this to hive
[19:53:19] it's the wp zero stuff
[19:53:22] ok, lemme replace that strong 'will' in that sentence with 'should'
[19:53:24] that's already implemented in pig
[19:53:27] as in, I haven't tried, but it should!
[19:53:35] right, i thought so too
[19:53:40] but people wanted a spike to figure it out
[19:53:45] and allocated 40 hours to it
[19:53:50] (which I objected to slightly)
[19:54:17] that would be fantastic if that gets reduced
[19:54:54] how many hours are in a mingle week?
[19:55:09] i.e. not including time spent working on e.g. privacy policy
[19:55:43] we don't really keep track of that, jeremyb
[19:55:50] spikes are the only exception
[19:56:03] they're when we don't know something and we want to figure it out but not waste unbounded time on it
[19:56:08] i never even heard of a spike until ~3 mins ago :)
[19:56:21] the rest of the work is estimated in points
[19:56:21] that's about what i guessed
[19:56:37] and since the team is basically brand new at working together, we don't know how many points we can deliver per week
[19:56:45] last sprint, we delivered 8 points
[19:56:55] which sounds crazy low but it's because we had a ton of things we were doing in parallel
[19:57:11] this sprint we're trying to push for finishing more stuff and focusing more
[19:57:15] well i don't know what a point really means
[19:57:23] it is a relative measure of complexity
[19:57:25] how many FTEs are in a typical week?
[19:57:30] and our cards in mingle have them as "estimate"
[19:57:58] 3-4 FTE?
[19:58:00] roughly
[19:58:05] k
[20:01:01] jeremyb, if you wanted to work on figuring that out
[20:01:07] that would be a really awesome and easy way to help
[20:01:27] there should be some snappy compressed json files like this in the labs hadoop cluster
[20:03:57] and if there aren't, we can put some there
[20:04:04] oof, it might be time to make a new labs hadoop cluster
[20:04:08] this one is getting crufty
[20:04:10] maybe in eqiad!
[20:04:16] ok. how does one get to the labs hadoop?
[20:04:25] are you in the analytics labs project?
[20:04:31] isn't eqiad waiting on firewall stuff or something?
[20:04:37] i don't think so? let me check
[20:04:48] yeah you aren't
[20:04:53] are you jeremyb there?
[20:04:55] in wikitech?
[20:05:03] ottomata: https://rt.wikimedia.org/Ticket/Display.html?id=6860
[20:05:07] ottomata: yeah
[20:05:35] ok, just added you
[20:05:35] ummmm
[20:05:37] try
[20:05:42] kraken-namenode.pmtpa.wmflabs
[20:06:32] ugh, i wish there were a way to turn off these warnings.
> If you are having access problems, please see: https://wikitech.wikimedia.org/wiki/Access#Accessing_public_and_private_instances
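A note on that connection step: labs instances like kraken-namenode.pmtpa.wmflabs weren't directly reachable from the outside; SSH went through a bastion host, as the Access page quoted above describes. A minimal sketch, assuming bastion.wmflabs.org as the bastion hostname and a labs SSH key already registered in wikitech:

    # hop through the labs bastion to reach the instance
    # (bastion hostname assumed; your wikitech shell name goes in ~/.ssh/config or -l)
    ssh -o ProxyCommand='ssh -a -W %h:%p bastion.wmflabs.org' kraken-namenode.pmtpa.wmflabs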
[20:06:51] jeremyb: they are in puppet :-]
[20:06:52] ok, in
[20:07:29] ok
[20:07:29] hashar: well for me, not everyone :)
[20:07:35] hmmm, ok yeah, not sure if I have a file there
[20:07:35] hmmm
[20:07:36] will have to get one
[20:07:38] hmmmmmmm
[20:07:42] ok
[20:08:06] ottomata: no rush, maybe you missed it, i was planning to set aside time wednesday (not so much today)
[20:09:03] ok
[20:11:24] well, anyway
[20:11:29] you should be able to issue hadoop and hive commands
[20:11:40] hdfs dfs -ls /user
[20:11:41] etc.
[20:12:25] yup, that worked
[20:12:28] and the task / unknown is whether Pig works with snappy compressed JSON format files
[20:12:35] we know that Hive does
[20:12:42] ok
[20:13:10] qchris_away: when you read this, be advised - jeremyb may be looking into this
[20:13:20] and jeremyb: thank you for the interest :)
[20:13:35] so, do some arbitrary/trivial calculation just to see if it can read them at all
[20:13:43] the output should be snappy too?
[20:14:14] output can be anything, there doesn't even have to be output
[20:14:25] if pig can read them, then how we're outputting in the current scripts will work
[20:14:35] and the current scripts are... one moment
[20:15:15] I wanna say this: https://git.wikimedia.org/blob/analytics%2Fkraken/35cfc89027b0220fe9ff42a626332e9d32813610/pig%2Fzero_carrier.pig
[20:15:21] but I could be wrong
[20:15:35] and they could be dramatically modified in the local checkout qchris_away has
[20:15:48] k
[20:16:25] is there a default place to look for analytics docs? wikitech?
[20:18:02] jeremyb: in the repo where you place the code
[20:18:07] AFAIK
[20:18:32] k
[20:18:33] wikitech
[20:18:54] https://wikitech.wikimedia.org/wiki/Analytics/Kraken
[20:18:58] this isn't pig, but it might be useful:
[20:18:58] https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Hive
[20:19:00] jeremyb: ignore me, listen to ottomata
[20:19:05] also
[20:19:05] https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Hive/Compression
[20:19:18] oh also that :)
[20:19:21] * jeremyb waits for ottomata to link to https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Hive/Compression/snappy
[20:19:31] haha
[20:19:36] jeremyb: https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Hive/Compression#Snappy_compressed_Sequence_Files
[20:19:50] * average hides
[20:20:25] * jeremyb runs some analytics on average's CPNI. can't hide!
[20:21:05] Checklist of Nonverbal Pain Indicators ?
[20:21:21] https://en.wikipedia.org/wiki/Customer_proprietary_network_information
[20:21:42] woah
[20:22:09] I rarely use a phone nowadays. I use Mumble
[20:22:13] * jeremyb will bbl
[20:47:25] milimetric, jeremyb: Thanks for the heads up.
[20:47:57] hello CET :)
[20:47:59] And sure ... whoever picks up the Pig spike is fine by me :-)
[20:48:06] Hi jeremyb.
[21:00:18] will you groom with me?
[21:01:23] oops -- meant to be a message to milimetric
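Before the Pig smoke test, the "arbitrary/trivial calculation just to see if it can read them at all" that milimetric describes can start even simpler, from the shell: hdfs dfs -text decompresses SequenceFiles with whatever codec they declare, so it's a quick way to confirm the snappy files are intact. Paths here are placeholders, not the real locations:

    # list candidate snappy-compressed input files (path is hypothetical)
    hdfs dfs -ls /wmf/data/zero
    # -text understands SequenceFiles and compression codecs, so intact files
    # should print raw JSON records
    hdfs dfs -text /wmf/data/zero/part-00000 | head -n 3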