[00:01:40] back [00:04:04] i guess otto has disappeared [00:06:22] (back and forth making dinner) [00:32:54] woa that's a lot of pig mappers [00:32:57] you're a pig mapper farmer [01:23:18] i am here dschoon [01:23:27] and ottomata [01:24:09] ok [01:24:23] I'm afk dinner atm :) [01:24:26] back in a few, tho [01:24:35] last i checked, the device job was green [01:24:37] which is good. [01:27:19] hiii [01:27:21] ok cool [01:27:24] yeah, device is running [01:27:34] and i'm running a giant single workflow for the existing data for zero [01:27:42] and will start a coordinator wherever it last works [01:29:46] you're a giant workflow alright! [01:29:54] oh. running. [01:29:55] right. [01:30:02] sounds awesome. [01:30:42] ottomata, we'll touch base in 30 to see where we stand, and what's lefT? [01:31:14] ok, i might be at dinner in 30 [01:32:27] s'ok [01:32:31] ottomata, did you see https://gist.github.com/dsc/5148082 ? [01:32:35] dunno if it helps [01:33:23] i kinda want to add something like that to hue. it's super-simple and it'd be a help. [01:34:40] (a simple table to browse conf, lookahead find, sort by col, etc) [01:35:39] aye [01:35:43] ja that'd be cool [01:35:54] esp since each job can override conf, yes? [01:42:05] cool! [01:42:10] and useful :D [02:01:40] the megajob is still running but shouldn't take too much time (i hope) [02:07:22] okay, so this basically leaves the alpha/beta pig script [02:07:31] which i can easily derive from the zero script [02:07:40] but since we have no data, it can't be verified [02:07:53] so i'll write the script, but we will await their deploy. [02:08:11] as a note, i reviewed the change and said they need to be url-encoding both keys and values [02:08:20] so we need to be sure to URL-decode the fields. [02:09:13] saw that! [02:09:23] very awesome, ty [02:09:39] good to hear that sounds right :) [02:09:59] and you are right about the other script that is now blocked [02:10:03] but should be trivial to adjust [02:10:18] very happy that we made a lot of progress toay [02:10:35] same. [02:10:50] btw check http://localhost:8088/cluster [02:10:52] sorry if i get snippy, but i worry whenever we're behind [02:11:08] this is the yarn view? [02:11:22] yes, of all the jobs finished /queued/running [02:11:41] this shows we need another scheduler than FIFO [02:12:03] something nice for the base cluster :D [02:12:15] because all jobs are now blocked by the big one [02:21:13] ...why? [02:21:20] because it has allocated all slots/resources? [02:21:24] (i will look into it) [02:22:27] the cluster has >500G of ram, so it's not memory [02:22:54] (note /scheduler gives a 500) [02:24:19] it says 0kb available memory on every node!? [02:24:36] this is despite the workers having at least 40G free [02:24:42] okay, clearly that's a conf var somewhere [02:30:20] FIFO = first in first out [02:30:32] so it schedule one job after another [02:30:54] yeah, but even so, it's not using all RAM [02:31:12] right, but i think that's why we need FairScheduler [02:31:15] ah, there it is [02:31:15] yarn.nodemanager.resource.memory-mb 8192 [02:31:23] so each resource is limited to 8G [02:31:30] okay, that explains [02:31:31] and since it's FIFO, it's one resource at a time [02:31:50] kraken definitely needs to be finetuned [02:31:59] and i am sure there is a lot of low hanging fruit [02:32:11] well, at the very least, we need to eliminate the obvious bottlenecks like this [02:32:14] :D [02:32:19] tuning is an ongoing thing, right? 
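The bottleneck pinned down above maps onto two yarn-site.xml properties: yarn.nodemanager.resource.memory-mb stuck at the 8192 MB default while the workers have 40G+ free, and the default FIFO scheduler letting the megajob block everything behind it. The sketch below is hedged: the property names and the 8192 figure come from the log, but the 40960 value and the FairScheduler choice are illustrative assumptions, not what was actually deployed (the chat itself later leans toward CapacityScheduler).

```xml
<!-- Sketch of the yarn-site.xml changes discussed above; values are
     illustrative, not the settings that ended up on the cluster. -->

<!-- Raise the memory a NodeManager may hand out to containers.
     The log shows this left at the 8192 MB default even though the
     workers have 40G+ free; 40960 is an assumed example value. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>40960</value>
</property>

<!-- Replace the FIFO scheduler so one large job cannot starve the queue.
     FairScheduler is drdee's suggestion here; CapacityScheduler (discussed
     further down) is the alternative. Whether FairScheduler works on this
     particular CDH4 build would need to be verified first. -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```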
[02:32:21] == low hanging fruit [02:32:27] definitely [02:33:18] indeed. [02:34:03] i find a lot of things in here somewhat mysterious. [02:34:13] both mapred.child.java.opts and mapreduce.child.java.opts are defined? [02:34:38] that's odd [02:34:43] it only should be mapreduce [02:34:48] (afaik) [02:34:54] dunno [02:34:57] why that is [02:35:15] also: where does it say only one resource at a time? [02:35:30] the fifo thing? [02:36:00] that just dictates the order [02:36:03] not the parallelism [02:36:16] i don't understand your question [02:36:37] well, the "first out" part of the FIFO makes me agree with drdee [02:36:42] the scheduler could schedule many jobs at the same time [02:36:45] based on their resource needs [02:36:49] it is choosing "1" [02:37:01] that is bogus, as we have tons of free memory and cpu [02:37:06] i really think that's the FIFO [02:37:11] it will just schedule one job [02:37:18] wait for it to fnish [02:37:19] yeah, like what you're saying david would just be called FI [02:37:30] and then move on to the next job [02:37:35] "fifo" doesn't (to me, in hadoop) necessarily imply a strict total ordering based on time-in [02:37:42] if it does, then yes. we should switch immediately [02:37:46] yep :) [02:37:52] what are the possible values there? [02:38:36] and where would one find docs on something like that (I like the gist btw) [02:39:15] milimetric: i'd start with looking at the docs on YARN [02:39:16] reading source http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-yarn-server-resourcemanager/2.0.0-alpha/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java [02:39:32] http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-yarn/hadoop-yarn-site/YARN.html [02:39:50] that's what i'm looking at next [02:41:04] there is a fair scheduler and a capacity scheduler milimetric [02:41:18] those are part of cdh4 [02:41:22] yeah, looks like Capacity gets recommended [02:41:24] there are also some custom scheduler [02:41:41] but not sure if that one already works with yarn [02:42:34] shouldn't they all? [02:42:40] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler [02:42:40] but yeah. we'll try out capacity tomorrow. [02:42:52] that's the setting for capacity ^^ [02:42:59] in yarn-site [02:43:22] there are capacity tuning params, also [02:45:34] dschoon, milimetric: please add your notes to https://mingle.corp.wikimedia.org/projects/analytics/cards/309 [02:45:45] cool, will do [02:45:57] most of those schedulers were not working with yarn last time i checked [02:46:03] (that was december IIRC) [02:46:11] we should have a general "cluster improvement" card, you know? [02:47:02] i think feature card is fine, there already enough card types to wrap my head around :D [02:47:29] i was just thinking a task that holds other tasks [02:47:36] or we could start adding tags? [02:47:42] i just find it hard to navigate things [02:48:24] i think the idea is to only do tasks that are related to features championed by people [02:48:42] i would just use the search box for now and let's wait until the dust settles around the actual workflows [02:48:49] so in this case, if we're trying to deploy a new analysis and kraken is slow as molasses, we'd add a task to look into the scheduler [02:49:16] i don't think we should immediately jump into this and try to fix it [02:49:27] i think that's the agile idea. but i think it's not crazy to think we'll be making modifications based on our needs. 
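The "both mapred.child.java.opts and mapreduce.child.java.opts are defined?" observation above is easy to check mechanically. Below is a minimal sketch, assuming you have a job or cluster configuration dumped as Hadoop-style XML (for example, saved from the job-conf view in the gist mentioned earlier); it is a hypothetical helper, not part of Hue or Hadoop, and the old-to-new prefix swap is only a rough heuristic since many deprecated names do not map by prefix alone.

```python
#!/usr/bin/env python
"""Heuristic check for properties defined under both the old mapred.*
prefix and a mapreduce.* counterpart, as seen above with
mapred.child.java.opts / mapreduce.child.java.opts.

Assumes a Hadoop-style conf XML (name/value <property> elements) saved
locally; the prefix swap is a heuristic, not the real deprecation map.
"""
import sys
import xml.etree.ElementTree as ET


def load_conf(path):
    """Return {name: value} for every <property> in a Hadoop conf XML file."""
    conf = {}
    for prop in ET.parse(path).getroot().iter('property'):
        name = prop.findtext('name')
        if name:
            conf[name] = prop.findtext('value', default='')
    return conf


def find_prefix_duplicates(conf):
    """Yield (old_key, new_key) pairs where both spellings are defined."""
    for key in sorted(conf):
        if key.startswith('mapred.'):
            candidate = 'mapreduce.' + key[len('mapred.'):]
            if candidate in conf:
                yield key, candidate


if __name__ == '__main__':
    conf = load_conf(sys.argv[1])  # e.g. job_conf.xml (hypothetical path)
    for old, new in find_prefix_duplicates(conf):
        print('{} = {!r}'.format(old, conf[old]))
        print('{} = {!r}'.format(new, conf[new]))
        print()
```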
[02:49:37] right, only if "don't be slow as molasses" was in scope would we do it [02:49:37] exactly. we should just have a bucket so we can see it all at a glance [02:49:46] (that's merely what i'm thinking) [02:50:26] right, i see the bucket point, it's a fair one [02:50:42] drdee will hopefully help us figure the best way to do that before we all get mingle-dizzzy [02:50:50] i find the card-hierarchy stuff to be a bit annoying atm [02:51:15] but i don't really care what the means is so long as the ends is that all the ideas are grouped [02:53:04] i think the idea is that diederik and kraig are doing a lot of grouping and nesting for their own purposes [02:53:13] sure. [02:53:16] ideally, all we'd ever have to worry about would be features and tasks [02:53:21] features are on WIP - Features [02:53:25] tasks is an unsolved problem [02:53:27] i just want "all the cards related to making kraken better" as a thing [02:53:31] with an url [02:53:37] and a way to easily add a new thing [02:53:44] I don't think that's a thing :/ [02:53:47] call it a "task" if you wish :) [02:54:11] i do not think this is a crazy request [02:54:28] neither do I, I just don't think it's possible right now [02:54:34] we could use tags [02:54:39] we could use card-heirarchy [02:54:44] we could use a new card-type (bad!) [02:54:50] we could use a new property [02:54:53] lots of ways [02:54:54] yeah, me not like any of those things [02:54:56] it's possible [02:55:00] we just have to pick something [02:55:09] not possible within the constraints they picked already is what I meant [02:55:24] and personally, I'd like to remove some of the complexity not keep adding more [02:56:41] i think everyone (inc drdee and kraig) wants to find a solution that makes us ALL productive [02:56:49] we want to work TOGETHER after all [02:57:04] so if you find something good with mingle, i think you should let us know, milimetric [02:57:08] rather than suffer in silence :) [02:57:44] heh, believe me, if I find something good in Mingle, I'll be pretty loud about it [03:04:02] dschoon: i just want "all the cards related to making kraken better" as a thing [03:04:04] https://mingle.corp.wikimedia.org/projects/analytics/cards?favorite_id=722&view=MVC+Value+Path [03:04:11] it's incomplete but it's in progress [03:04:19] those are deliverables [03:04:23] this is non-feature work [03:04:25] right? [03:04:26] no [03:04:30] look at the page first [03:04:37] i did! [03:04:42] it's the "releases" [03:04:48] which (i thought) were feature-driven [03:04:54] plus the features for each release [03:06:34] we're looking for a low-friction way to both see and capture stuff to make X better (where X, in this case, is Kraken, but might also be Limn) [09:30:27] I would like to join the Analytics Team to help. [09:32:58] anyone is online? [09:33:02] I would like to join the Analytics Team to help. [09:33:37] gabrielchihongle: hey! Most of them are in the US, and hence sleeping [09:33:47] oic... [09:33:52] gabrielchihongle: better ask there :-] [09:34:03] which "there? [09:34:04] most people are in San Francisco so they are sleeping right now [09:34:18] I understand [09:34:35] I live in HK [09:34:42] So where should I put my reply? 
[14:21:18] ./bite hashar [14:28:01] morning [14:28:03] ping ottomata [14:28:17] we lost again 6 hadoop nodes [14:28:42] i see this [14:28:57] log-dirs turned bad: /var/lib/hadoop/data/c/yarn/logs,/var/lib/hadoop/data/d/yarn/logs,/var/lib/hadoop/data/e/yarn/logs,/var/lib/hadoop/data/f/yarn/logs,/var/lib/hadoop/data/g/yarn/logs,/var/lib/hadoop/data/h/yarn/logs,/var/lib/hadoop/data/i/yarn/logs,/var/lib/hadoop/data/j/yarn/logs,/var/lib/hadoop/data/k/yarn/logs,/var/lib/hadoop/data/l/yarn/logs 0 0 KB 0 KB [15:34:29] [travis-ci] master/29576ab (#84 by Diederik van Liere): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5474152 [15:40:13] morning milimetric [15:40:25] power back online? :D [15:40:44] yep, just got it back [15:40:52] my battery died right as it was coming back :) [15:41:15] i spoke with maarten from glam btw [15:41:28] great! [15:41:33] helping him with his Limn install, they seem ready to go [15:41:41] awesome [15:43:36] we lost for the 2nd time this week 6 hadoop nodes [15:43:49] i think something is off with the iptables [15:43:55] we lost for the 2nd time this week 6 hadoop nodes [15:43:59] ottomata ^^ [15:44:15] HM [15:44:15] hi [15:44:20] morning :D [15:44:25] morning! [15:46:35] did you restart them yet? [15:46:37] i am looking into it [15:46:39] no [15:46:44] ok good, lemme see [15:46:49] it would be weird if this was iptables, [15:46:54] because restarting is not a structural solution [15:46:54] thats usually an all or nothing thing [15:46:57] not intermtient [15:47:03] yeah but [15:47:18] look at http://localhost:8088/cluster/nodes/lost [15:47:29] (on the job server) [15:47:49] LOSt and Healthy? [15:48:03] yes [15:48:13] that's weird huh ? [15:48:18] also look at http://localhost:8088/cluster/nodes/unhealthy [15:51:17] that is more interesting, yeah i'm seeing weird stuff from namenode too, I the namenode web ui isn't responding [15:51:44] hmmm, yeahhh, hmmm [15:52:29] restarting namenoe [15:56:42] HMM [15:56:47] nodemanagers on those datanodes are dead [15:56:50] at least on an13 [15:58:46] hm [15:58:47] 2013-03-13 06:28:06,703 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread [15:58:47] java.lang.NullPointerException [15:59:08] can you paste the stack trace in a gist? [16:00:06] https://gist.github.com/ottomata/5153524 [16:00:14] this is from nodemanager logs on an13 [16:01:02] ai ai ai, that might be a bug in yarn [16:01:15] same error on an11 [16:01:28] starting nodemanater on an11 [16:01:46] probably same on all those 6 nodes [16:01:49] yeha [16:02:05] ask question on ch4 user group? [16:02:20] HM [16:02:24] it didn't start back up! [16:05:35] yeah totally weird [16:05:38] same error on startup [16:05:39] HMMM [16:06:12] oink [16:06:15] that's weird [16:08:40] i think this is related [16:08:40] http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-issues/201207.mbox/%3C612798421.59963.1342466795250.JavaMail.jiratomcat@issues-vm%3E [16:09:10] https://issues.apache.org/jira/browse/MAPREDUCE-4448 [16:09:32] not sure though [16:10:20] that's easy to test, disable log aggregation [16:10:41] ha, is that like the history thing? [16:10:57] yarn.log-aggregation-enable [16:10:57] ok will try [16:12:12] cool [16:12:36] bnope still tied [16:12:38] same error [16:13:01] i'm going to restart resource manager on an10 and see what happens [16:13:13] good morning [16:13:20] GRR, why is nodemanager running on an10>…. [16:13:22] hi! 
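The "log-dirs turned bad" report above lists every /var/lib/hadoop/data/*/yarn/logs directory at 0 KB, which is what the NodeManager's disk-health checker reports when a directory is missing, unwritable, or on a full disk. Here is a small local diagnostic sketch along those lines, to run on one of the affected workers; the directory list is taken from that log message, while the free-space threshold is an arbitrary assumption, not the NodeManager's actual cutoff.

```python
#!/usr/bin/env python
"""Quick local check of the YARN log dirs the NodeManager marked bad.

Directory list comes from the "log-dirs turned bad" message above; the
1 GB free-space threshold is an assumption, not what the NodeManager's
health checker actually uses.
"""
import os

LOG_DIRS = ['/var/lib/hadoop/data/%s/yarn/logs' % d for d in 'cdefghijkl']
MIN_FREE_BYTES = 1 * 1024 ** 3  # assumed threshold


def check_dir(path):
    """Return a list of human-readable problems for one log dir."""
    if not os.path.isdir(path):
        return ['missing']
    problems = []
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append('not writable by this user')
    st = os.statvfs(path)
    free = st.f_bavail * st.f_frsize
    if free < MIN_FREE_BYTES:
        problems.append('only %d MB free' % (free // 1024 ** 2))
    return problems


if __name__ == '__main__':
    for path in LOG_DIRS:
        problems = check_dir(path)
        print('%-45s %s' % (path, ', '.join(problems) if problems else 'ok'))
```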
[16:13:23] morning [16:13:53] morning kraigparkinson [16:13:57] mornin [16:15:15] WEIRD [16:15:18] still not starting [16:16:14] yeah what that crap crackers [16:18:26] HM [16:18:36] drdee, could this be related to me changing the jvm.reuse setting? [16:18:40] doesn't seem related. [16:18:44] but that's the only thing that has changed [16:18:47] don't think so [16:18:53] no the problem was prior to that change [16:18:54] hm, that and I'm running more jobs concurrenlty these days [16:18:56] yeah probably not [16:19:00] it happened on monday for the first time [16:19:03] yeah that's true, well we don't know if this is that same problem [16:19:06] but it probably is [16:19:10] pretty sure it is [16:19:21] we didn't check the logs then right, we just restarted hadoop? [16:19:25] yes [16:19:38] maybe check hdfs health [16:19:59] the unhealthy name node is worrying [16:20:05] welll [16:20:05] sorta [16:20:08] and that was the exact same error on monday [16:20:15] that is werid though, i thikn that is a consequence of nodemanager running on an10 [16:20:18] it shouldn't be [16:20:34] have you tried disabling the nodemananger on an10? [16:20:39] i just stopped it [16:20:41] ok [16:20:41] still same thing [16:20:46] :( [16:20:46] cause [16:20:53] these dirs [16:20:53] /var/lib/hadoop/data/c/yarn/local [16:20:53] etc. [16:20:54] shouldn't exist on an10 [16:20:55] and they don't [16:20:58] so that is correct [16:21:03] if a nodemanager tries to run there it won't work [16:21:51] i just removed the nodemanager package too [16:21:56] to make sure it doesn't try to start upa gain [16:22:18] k [16:23:44] drdee, let me know when you have a sec. want to talk mingle again. :p [16:23:54] sure give me 3 minutes [16:25:07] ok ready [16:25:47] kraigparkinson: [16:25:47] https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc [16:37:50] SIGGHHGHGH last night everything was working so well! [16:49:29] WAH, nodemanager started this time [16:49:33] I didn't change anything. [16:49:58] based on Mingle, it looks like we don't have anything to showcase today. any objections to freeing up our stakeholders? [16:50:19] no objections from me [16:50:22] no objecttions [16:51:10] ah, but now it died, it just lasted longer, hm. [16:57:43] UHHH, am I running different versions of CDH4 on different nodes????? [16:57:49] waaatttt? [17:00:01] scrum? [17:00:08] ottomata, dschoon? [17:00:12] omw [17:11:00] fyi, i'll be in the office after scrum [17:11:16] (was finishing up the task for legal) [17:22:01] which task for legal? [17:22:08] maybe PM me. :) [17:23:00] i'll forward the email. [17:23:01] brb [17:23:09] then office :) [17:23:58] k, heading out [17:24:02] brb15 [17:24:03] thanks [17:33:42] hey drdee [17:33:48] I was gonna ask [17:33:49] yo [17:33:53] canceled? [17:34:00] wrong button [17:34:06] ha [17:34:12] but nobody was there a couple of minutes ago [17:34:26] I'm here if you want to chat for a moment [17:34:44] I just need to wrap up something, give me 2 mins [17:34:52] back [17:35:02] i think I fixed the nodes dying problem [17:35:11] drdee and I noticed there were different versions of cdh4 running... [17:35:13] i do not know why [17:35:28] the nodes that died were running 4.1.2, all other 4.2.0 [17:35:29] yeah….that's odd indeed [17:35:34] DarTar: link? 
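The "am I running different versions of CDH4 on different nodes?????" moment at the end of this stretch (confirmed later: the dying nodes were on 4.1.2, the rest on 4.2.0) is the kind of drift that is easy to detect mechanically. A sketch follows: it ssh'es to an assumed list of worker hostnames and diffs the installed hadoop package versions via dpkg-query. The host range is a placeholder, and it assumes Debian/Ubuntu packaging plus passwordless ssh.

```python
#!/usr/bin/env python
"""Compare installed Hadoop package versions across cluster nodes.

Assumes Debian/Ubuntu packaging (dpkg-query) and passwordless ssh to
each host; the host list is a placeholder, not the real inventory.
"""
import subprocess
from collections import defaultdict

HOSTS = ['analytics10%02d.eqiad.wmnet' % n for n in range(11, 23)]  # placeholder
QUERY = "dpkg-query -W -f '${Package} ${Version}\\n' 'hadoop*'"


def package_versions(host):
    """Return {package: version} for hadoop* packages installed on host."""
    out = subprocess.check_output(['ssh', host, QUERY]).decode()
    pairs = (line.split(None, 1) for line in out.splitlines() if line.strip())
    return {pkg: ver for pkg, ver in pairs}


if __name__ == '__main__':
    by_package = defaultdict(dict)  # package -> {version: [hosts]}
    for host in HOSTS:
        for pkg, ver in package_versions(host).items():
            by_package[pkg].setdefault(ver, []).append(host)
    for pkg, versions in sorted(by_package.items()):
        if len(versions) > 1:  # same package, different versions => drift
            print(pkg)
            for ver, hosts in sorted(versions.items()):
                print('  %-20s %s' % (ver, ', '.join(hosts)))
```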
[17:35:44] hopefully that will fix the issue [17:35:47] i upgraded the 4.1.2s to 4.2.0 [17:35:49] but only data nodes [17:35:57] i'd like to upgrade an27 soon too [17:36:01] maybe new hue is awesome [17:36:12] fingers crossed [17:36:15] definitely hope so [17:36:28] DarTar: hangout link? [17:36:50] drdee: yep, 1 sec [17:37:41] yeah, sorry drdee, I was totally wrong about 4.1.2 vs 4.2.0. the only upgrade I knew that I had done was a minor one [17:37:55] so when you saw that we were runnign 4.2.0 I assumed that it was just that [17:38:04] totally wrong, pester me more next time when you are right :p [17:41:12] i am now confused :) [17:42:32] ottomata: could you help me add jessie on this ticket? https://rt.wikimedia.org/Ticket/Display.html?id=4726 [17:42:50] i keep getting permission denied when I try to add her as a cc [17:43:01] hmm [17:43:10] what's her email? [17:43:10] jwild@ [17:44:16] ok cool, iet let me [17:46:10] drdee, I updated the feature status of card #312 to not set, so that we could review it in our meeting this afternoon… [17:46:19] ok [17:46:29] I want to make sure that when we put things into the backlog, we're assigning a release to it. [17:46:51] I want to make sure we have at least some vague sense of priority when agreeing to take work on. [17:46:59] grr, duplication [17:47:44] kraigparkinson: I called you kyle in my email to you just now [17:47:46] sorry about that ;] [17:47:54] but yer all setup for RT access now [17:47:55] lol, no worries. [17:48:09] I was talking to a kyle at digicert ;] [17:48:14] happens a lot with K names. [17:52:15] ottomata: so, uh. [17:52:20] multiple versions of CDH? [17:57:47] yeah [17:57:51] i know right [17:57:52] dunno why [18:04:14] http://etherpad.wmflabs.org/pad/p/AnalyticsSpringShowcaseRetroPlanning [18:08:41] hey, I need access to kripke to update the mobile dashboard (http://mobile-reportcard.wmflabs.org/), can someone help me with that? [18:09:11] milimetric ^^ [18:10:53] jgonera: yeah, we're in a meeting at the moment. will ping you after [18:14:25] http://etherpad.wmflabs.org/pad/p/AnalyticsSpringShowcaseRetroPlanning [18:17:15] ok, sure [18:19:31] kraigparkinson, ottomata, milimetric, DarTar, robla: http://analytics-pad.wmflabs.org/p/AnalyticsSpringShowcaseRetroPlanning [18:20:05] hey jgonera, I'll help you personally in about 1 hr. when this meeting is over [18:20:32] are you in an OK timezone to work on it in 1hr? [18:20:58] milimetric: he's in the office :) [18:21:08] cool, how's your eye? [18:23:30] milimetric: much better! but when I woke up it wouldn't open for about 15 minutes, so that was terrifying [18:23:47] glad it's better then [18:23:54] managed to keep calm and have a shower, and then it opened up. Generally been a lot better after that - blinking or closing no longer hurts [18:27:49] milimetric, yes, ~12:20 should be ok [18:27:55] thank you [18:27:56] cool [18:32:11] afk 1 min [18:37:09] back [19:17:57] The new pope is anti udp2log [19:35:56] wat?! [19:36:22] I wonder what Tim has to say about that [19:36:38] also, I nominate Tim to be the new pope [19:40:47] milimetric, I am going to get lunch now, can we reschedule this? [19:40:56] sure [19:41:01] ping me when you're back [19:41:31] ok [19:49:03] greetings all [19:49:20] where can i see the last 2 hours of hrly logs for en wiki ? 
[19:49:35] i want to confirm that the vote of the new pope is causing our spike in requests [19:50:33] welp, in an effort to fix the version mismatch [19:50:35] i upgraded hue :/ [19:50:36] :) [19:50:40] job browser there works now [19:50:43] mabye that will be helpful? [19:52:26] drdee: ottomata dschoon : thoughts on my message above ? [19:53:09] tfinc: stat1/a/squid/archive/sampled [19:53:24] but today's file is probably not yet copied [19:53:25] drdee that won't have last two hours [19:53:30] k [19:53:31] yeah [19:53:32] right [19:53:46] drdee: thats what i was concerned about . when does it get delivered ? [19:53:49] although it's utc, so the file should be copied soon [19:54:19] afaik, otttomata confirm, files get copied at the start of new day using UTC time [19:55:47] drdee: they get copied around 6:30 am UTC [19:56:49] tfinc: 1 sec let me check something [19:56:54] drdee: thanks [19:57:04] the narrative is pretty compelling that this is pope related [19:57:05] they are also on emery [19:57:09] the live ones [19:57:12] yes [19:57:15] /var/log/squid/sampled-1000.log [19:57:17] hat's were i am looking [19:58:17] tfinc: check emery:/var/log/sampled-1000.tab.log [19:58:17] dschoon, you might be able to help me find this, since you were looking at hadoop configurations [19:58:25] that's the current log file [19:58:31] org.apache.oozie.action.ActionExecutorException: JA001: Invalid host name: local host is: (unknown); destination host is: "analytics1010.wikimedia.org":8032; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost [19:58:34] Permission denied (publickey). :( [19:58:42] tfinc: 1sec [19:58:43] I have no idea where it is getting analytics1010.wikimedia.org from [19:58:53] that is not a valid hostname [19:58:57] it should be .eqiad.wmnet [19:59:01] and all configs say that :/ [19:59:05] drdee [19:59:07] you there? [19:59:14] yes [19:59:18] 1 sec [19:59:22] can we hangout please? [19:59:42] the narrative is compelling enough to not have to dive into a deep analysis if we don't have the data files available. [20:00:11] tfinc: sampled-1000.tab.log is now on stat1:/home/diederik, can you read that? [20:00:25] dschoon and I are trying to understand how the difference between https://mingle.corp.wikimedia.org/projects/analytics/cards/240?version=11 and the present version came about; he thinks the intention is very different between the two descriptions. [20:00:41] what hangout are you? [20:01:16] kraigparkinson: ^^ [20:01:19] about to send one, give me a ec [20:01:22] k [20:17:11] drdee: yes. i'll take a look at it after grabbing lunch. thanks [20:19:18] tfinc, yesterday 32667000 hits on 'Pope', so far today 71812000 (using a very naive grep on "Pope" in the entire log line) [20:19:54] more than twice as many requests [20:20:05] thats a lot of pope requests to spread around [20:20:27] (this includes images and pages [20:28:28] ungh, guys i'm probably going to get lunch soon, but I am totally stuck atm [20:28:31] would love brain bouncers [20:29:49] dschoon, just updated https://mingle.corp.wikimedia.org/projects/analytics/cards/92 [20:36:11] thanks, drdee [20:36:13] will check it out [20:36:24] ottomata -- i'll be back in 10 or so, and then we can bounce [20:37:00] ok, i might be at lunch, but ja ping me [20:37:28] dschoon: 239 should be deleted as well [20:37:45] dschoon ^^ [20:41:40] drdee, mind if I reschedule our meeting planned for now? [20:41:45] dschoon, can i delete mingle card 239? 
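The 32,667,000-vs-71,812,000 comparison above is described as "a very naive grep on 'Pope' in the entire log line" over the 1:1000 sampled request logs. Here is the same naive count as a Python sketch; the paths are examples based on the locations mentioned in the channel (stat1's /a/squid/archive/sampled and the copy dropped in a home directory), and, like the grep, it matches 'Pope' anywhere in the line, so images and unrelated titles count too.

```python
#!/usr/bin/env python
"""Naive 'Pope' hit count over 1:1000 sampled request logs.

Mirrors the quick grep from the channel: any line containing 'Pope'
counts, including images and unrelated matches. File paths are examples
based on the ones mentioned above, not a canonical location.
"""
import glob
import gzip
import sys


def open_log(path):
    # Archived sampled logs are typically gzipped; live ones are plain text.
    if path.endswith('.gz'):
        return gzip.open(path, 'rt', errors='replace')
    return open(path, errors='replace')


def count_hits(path, needle='Pope'):
    with open_log(path) as f:
        return sum(1 for line in f if needle in line)


if __name__ == '__main__':
    # e.g. /a/squid/archive/sampled/sampled-1000*.gz on stat1 (example glob)
    paths = sys.argv[1:] or sorted(glob.glob('/a/squid/archive/sampled/sampled-1000*'))
    for path in paths:
        hits = count_hits(path)
        # each line is a 1:1000 sample, so x1000 gives a rough request estimate
        print('%s  %d sampled hits (~%d requests)' % (path, hits, hits * 1000))
```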
[20:41:48] dont' mind [20:41:53] thanks, I need to eat [20:41:58] me too :D [20:42:44] erosen: can you please add some detail to https://mingle.corp.wikimedia.org/projects/analytics/cards/244 ? [20:43:00] in particular the acceptance criteria from amit? [20:43:07] sure [20:43:34] ty [20:44:22] dschoon: i deleted card 239 as a source of confusion ;) [20:44:28] drdee: just to clarify, is this intended to be the entire services provided to zero? [20:44:38] drdee: like the numbers and the dashboard? [20:44:44] yes [20:44:52] k [20:56:32] dschoon, nm [20:56:33] FIXED IT [20:56:37] ! [20:56:42] (just got back) [20:57:08] dschoon: i think in card 62 the task links are incorrect [20:59:34] ottomata: what was it? [21:01:23] ok so, i fixed one thing but not everything [21:01:56] somehow, after upgrading to 4.2.0 everywhere [21:02:11] i'm getting errors where some things think that the resource manager address is analytics1010.wikimedia.org [21:02:13] which is not a valid name [21:02:18] and I don't have that ocnfigured anywhere [21:02:50] ottomata: you check /etc/hostname ? [21:03:09] analytics1010 [21:03:11] no wikimedia.org [21:03:28] I grepped all of /etc on namenode and an27 (oozie host) for something relevant [21:03:30] didn't find anything [21:03:40] and facter [21:03:41] says [21:03:41] otto@analytics1010:/etc$ facter fqdn [21:03:41] analytics1010.eqiad.wmnet [21:04:00] drdee: updated the zero mingle card [21:04:35] drdee: not sure it is complete, but can't think of other things / I'm not totally sure about the degree of detail we should have on these cards [21:04:36] ty erosen! [21:04:39] dschoon, example: [21:04:39] http://localhost:19888/jobhistory/logs/analytics1014:8041/container_1363208055308_0001_01_000002/attempt_1363208055308_0001_m_000000_0/stats [21:04:44] Caused by: java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "analytics1010.wikimedia.org":8032; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost [21:05:08] ahhh [21:05:17] ottomata: it's the stupid JVM host thing [21:05:18] sec [21:05:28] oh? [21:09:26] dschoon, i'm going to run for a few hours and then be back on tonight to work for a couple more hours [21:09:39] can we hangout real quick so I can show you how to reproduce this? [21:12:36] dschoooooon come back to me! [21:15:36] ottomata! [21:15:39] i am here for youuuu [21:15:41] also: http://docs.oracle.com/javase/7/docs/api/java/net/InetAddress.html [21:15:45] and http://docs.oracle.com/javase/7/docs/api/java/net/doc-files/net-properties.html [21:16:16] worth reading: http://stackoverflow.com/questions/7348711/recommended-way-to-get-hostname-in-java [21:16:28] ottomata: https://plus.google.com/hangouts/_/3a5e13f228d3e2c037ed195fcb67c1f6af887a20 [21:20:14] milimetric, where can I find you? 6th floor? [21:20:29] hi jgonera, no, I'm in Michigan :) [21:20:34] I work from Philadelphia usually [21:20:35] oooh [21:20:37] ok [21:20:41] all right [21:20:56] so, the first problem I have is what YuviPanda also experienced [21:21:05] access to stat1001? [21:21:20] ok, maybe lets start with access [21:21:22] milimetric: you aren't in the office? I thought the hangout was with you at the office? [21:21:27] is stat1001 kripke? [21:21:32] nope, I'm remote [21:21:40] i was visiting for a week [21:22:02] no jgonera, they're different [21:22:06] here, I'm PM you [21:35:52] milimetric: aah, okay :) [23:21:55] [travis-ci] master/a77422f (#85 by Diederik van Liere): The build has errored. 
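The UnknownHostException being chased here comes down to some component resolving the ResourceManager as analytics1010.wikimedia.org rather than analytics1010.eqiad.wmnet, even though /etc/hostname and facter agree. The InetAddress links above cover how the JVM does its lookups; below is a small Python sketch of the equivalent checks on a node — short hostname, resolver FQDN, and forward/reverse DNS — with the expected .eqiad.wmnet suffix treated as an assumption for the warning.

```python
#!/usr/bin/env python
"""Show the different answers a host gives for 'what is my name',
roughly the lookups the JVM's InetAddress goes through (see links above).

The expected '.eqiad.wmnet' suffix is an assumption used only for the
warning; adjust it for the actual cluster domain.
"""
import socket
import sys

EXPECTED_SUFFIX = '.eqiad.wmnet'


def report(host=None):
    short = socket.gethostname()
    print('gethostname():   %s' % short)
    print('getfqdn():       %s' % socket.getfqdn())

    target = host or short
    try:
        _, aliases, addrs = socket.gethostbyname_ex(target)
    except socket.gaierror as exc:
        print('forward lookup of %r failed: %s' % (target, exc))
        return
    print('forward lookup:  %s -> %s (aliases: %s)' % (target, addrs, aliases))

    for addr in addrs:
        try:
            rev = socket.gethostbyaddr(addr)[0]
        except socket.herror as exc:
            print('reverse lookup of %s failed: %s' % (addr, exc))
            continue
        print('reverse lookup:  %s -> %s' % (addr, rev))
        if not rev.endswith(EXPECTED_SUFFIX):
            print('  WARNING: reverse name is outside %s' % EXPECTED_SUFFIX)


if __name__ == '__main__':
    # e.g. ./hostcheck.py analytics1010.eqiad.wmnet (hypothetical usage)
    report(sys.argv[1] if len(sys.argv) > 1 else None)
```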
http://travis-ci.org/wikimedia/kraken/builds/5485841 [23:25:55] this weekend GSP vs. Diaz [23:26:25] :D [23:26:30] can't wait [23:26:48] doing those charts now, I have some charts for small data, I'm polishing it up and then time for a full run on stat1 [23:27:08] we'll have charts for every day in 2012-11 ===> 2012-12