[00:01:40] back [00:04:04] i guess otto has disappeared [00:06:22] (back and forth making dinner) [00:32:54] woa that's a lot of pig mappers [00:32:57] you're a pig mapper farmer [01:23:18] i am here dschoon [01:23:27] and ottomata [01:24:09] ok [01:24:23] I'm afk dinner atm :) [01:24:26] back in a few, tho [01:24:35] last i checked, the device job was green [01:24:37] which is good. [01:27:19] hiii [01:27:21] ok cool [01:27:24] yeah, device is running [01:27:34] and i'm running a giant single workflow for the existing data for zero [01:27:42] and will start a coordinator wherever it last works [01:29:46] you're a giant workflow alright! [01:29:54] oh. running. [01:29:55] right. [01:30:02] sounds awesome. [01:30:42] ottomata, we'll touch base in 30 to see where we stand, and what's lefT? [01:31:14] ok, i might be at dinner in 30 [01:32:27] s'ok [01:32:31] ottomata, did you see https://gist.github.com/dsc/5148082 ? [01:32:35] dunno if it helps [01:33:23] i kinda want to add something like that to hue. it's super-simple and it'd be a help. [01:34:40] (a simple table to browse conf, lookahead find, sort by col, etc) [01:35:39] aye [01:35:43] ja that'd be cool [01:35:54] esp since each job can override conf, yes? [01:42:05] cool! [01:42:10] and useful :D [02:01:40] the megajob is still running but shouldn't take too much time (i hope) [02:07:22] okay, so this basically leaves the alpha/beta pig script [02:07:31] which i can easily derive from the zero script [02:07:40] but since we have no data, it can't be verified [02:07:53] so i'll write the script, but we will await their deploy. [02:08:11] as a note, i reviewed the change and said they need to be url-encoding both keys and values [02:08:20] so we need to be sure to URL-decode the fields. [02:09:13] saw that! [02:09:23] very awesome, ty [02:09:39] good to hear that sounds right :) [02:09:59] and you are right about the other script that is now blocked [02:10:03] but should be trivial to adjust [02:10:18] very happy that we made a lot of progress toay [02:10:35] same. [02:10:50] btw check http://localhost:8088/cluster [02:10:52] sorry if i get snippy, but i worry whenever we're behind [02:11:08] this is the yarn view? [02:11:22] yes, of all the jobs finished /queued/running [02:11:41] this shows we need another scheduler than FIFO [02:12:03] something nice for the base cluster :D [02:12:15] because all jobs are now blocked by the big one [02:21:13] ...why? [02:21:20] because it has allocated all slots/resources? [02:21:24] (i will look into it) [02:22:27] the cluster has >500G of ram, so it's not memory [02:22:54] (note /scheduler gives a 500) [02:24:19] it says 0kb available memory on every node!? [02:24:36] this is despite the workers having at least 40G free [02:24:42] okay, clearly that's a conf var somewhere [02:30:20] FIFO = first in first out [02:30:32] so it schedule one job after another [02:30:54] yeah, but even so, it's not using all RAM [02:31:12] right, but i think that's why we need FairScheduler [02:31:15] ah, there it is [02:31:15] yarn.nodemanager.resource.memory-mb 8192 [02:31:23] so each resource is limited to 8G [02:31:30] okay, that explains [02:31:31] and since it's FIFO, it's one resource at a time [02:31:50] kraken definitely needs to be finetuned [02:31:59] and i am sure there is a lot of low hanging fruit [02:32:11] well, at the very least, we need to eliminate the obvious bottlenecks like this [02:32:14] :D [02:32:19] tuning is an ongoing thing, right? 
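The bottleneck pinned down above maps onto two yarn-site.xml properties: yarn.nodemanager.resource.memory-mb stuck at the 8192 MB default while the workers have 40G+ free, and the default FIFO scheduler letting the megajob block everything behind it. The sketch below is hedged: the property names and the 8192 figure come from the log, but the 40960 value and the FairScheduler choice are illustrative assumptions, not what was actually deployed (the chat itself later leans toward CapacityScheduler).

```xml
<!-- Sketch of the yarn-site.xml changes discussed above; values are
     illustrative, not the settings that ended up on the cluster. -->

<!-- Raise the memory a NodeManager may hand out to containers.
     The log shows this left at the 8192 MB default even though the
     workers have 40G+ free; 40960 is an assumed example value. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>40960</value>
</property>

<!-- Replace the FIFO scheduler so one large job cannot starve the queue.
     FairScheduler is drdee's suggestion here; CapacityScheduler (discussed
     further down) is the alternative. Whether FairScheduler works on this
     particular CDH4 build would need to be verified first. -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```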
[02:32:21] == low hanging fruit [02:32:27] definitely [02:33:18] indeed. [02:34:03] i find a lot of things in here somewhat mysterious. [02:34:13] both mapred.child.java.opts and mapreduce.child.java.opts are defined? [02:34:38] that's odd [02:34:43] it only should be mapreduce [02:34:48] (afaik) [02:34:54] dunno [02:34:57] why that is [02:35:15] also: where does it say only one resource at a time? [02:35:30] the fifo thing? [02:36:00] that just dictates the order [02:36:03] not the parallelism [02:36:16] i don't understand your question [02:36:37] well, the "first out" part of the FIFO makes me agree with drdee [02:36:42] the scheduler could schedule many jobs at the same time [02:36:45] based on their resource needs [02:36:49] it is choosing "1" [02:37:01] that is bogus, as we have tons of free memory and cpu [02:37:06] i really think that's the FIFO [02:37:11] it will just schedule one job [02:37:18] wait for it to fnish [02:37:19] yeah, like what you're saying david would just be called FI [02:37:30] and then move on to the next job [02:37:35] "fifo" doesn't (to me, in hadoop) necessarily imply a strict total ordering based on time-in [02:37:42] if it does, then yes. we should switch immediately [02:37:46] yep :) [02:37:52] what are the possible values there? [02:38:36] and where would one find docs on something like that (I like the gist btw) [02:39:15] milimetric: i'd start with looking at the docs on YARN [02:39:16] reading source http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-yarn-server-resourcemanager/2.0.0-alpha/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java [02:39:32] http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-yarn/hadoop-yarn-site/YARN.html [02:39:50] that's what i'm looking at next [02:41:04] there is a fair scheduler and a capacity scheduler milimetric [02:41:18] those are part of cdh4 [02:41:22] yeah, looks like Capacity gets recommended [02:41:24] there are also some custom scheduler [02:41:41] but not sure if that one already works with yarn [02:42:34] shouldn't they all? [02:42:40] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler [02:42:40] but yeah. we'll try out capacity tomorrow. [02:42:52] that's the setting for capacity ^^ [02:42:59] in yarn-site [02:43:22] there are capacity tuning params, also [02:45:34] dschoon, milimetric: please add your notes to https://mingle.corp.wikimedia.org/projects/analytics/cards/309 [02:45:45] cool, will do [02:45:57] most of those schedulers were not working with yarn last time i checked [02:46:03] (that was december IIRC) [02:46:11] we should have a general "cluster improvement" card, you know? [02:47:02] i think feature card is fine, there already enough card types to wrap my head around :D [02:47:29] i was just thinking a task that holds other tasks [02:47:36] or we could start adding tags? [02:47:42] i just find it hard to navigate things [02:48:24] i think the idea is to only do tasks that are related to features championed by people [02:48:42] i would just use the search box for now and let's wait until the dust settles around the actual workflows [02:48:49] so in this case, if we're trying to deploy a new analysis and kraken is slow as molasses, we'd add a task to look into the scheduler [02:49:16] i don't think we should immediately jump into this and try to fix it [02:49:27] i think that's the agile idea. but i think it's not crazy to think we'll be making modifications based on our needs. 
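The "both mapred.child.java.opts and mapreduce.child.java.opts are defined?" observation above is easy to check mechanically. Below is a minimal sketch, assuming you have a job or cluster configuration dumped as Hadoop-style XML (for example, saved from the job-conf view in the gist mentioned earlier); it is a hypothetical helper, not part of Hue or Hadoop, and the old-to-new prefix swap is only a rough heuristic since many deprecated names do not map by prefix alone.

```python
#!/usr/bin/env python
"""Heuristic check for properties defined under both the old mapred.*
prefix and a mapreduce.* counterpart, as seen above with
mapred.child.java.opts / mapreduce.child.java.opts.

Assumes a Hadoop-style conf XML (name/value <property> elements) saved
locally; the prefix swap is a heuristic, not the real deprecation map.
"""
import sys
import xml.etree.ElementTree as ET


def load_conf(path):
    """Return {name: value} for every <property> in a Hadoop conf XML file."""
    conf = {}
    for prop in ET.parse(path).getroot().iter('property'):
        name = prop.findtext('name')
        if name:
            conf[name] = prop.findtext('value', default='')
    return conf


def find_prefix_duplicates(conf):
    """Yield (old_key, new_key) pairs where both spellings are defined."""
    for key in sorted(conf):
        if key.startswith('mapred.'):
            candidate = 'mapreduce.' + key[len('mapred.'):]
            if candidate in conf:
                yield key, candidate


if __name__ == '__main__':
    conf = load_conf(sys.argv[1])  # e.g. job_conf.xml (hypothetical path)
    for old, new in find_prefix_duplicates(conf):
        print('{} = {!r}'.format(old, conf[old]))
        print('{} = {!r}'.format(new, conf[new]))
        print()
```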
[02:49:37] right, only if "don't be slow as molasses" was in scope would we do it [02:49:37] exactly. we should just have a bucket so we can see it all at a glance [02:49:46] (that's merely what i'm thinking) [02:50:26] right, i see the bucket point, it's a fair one [02:50:42] drdee will hopefully help us figure the best way to do that before we all get mingle-dizzzy [02:50:50] i find the card-hierarchy stuff to be a bit annoying atm [02:51:15] but i don't really care what the means is so long as the ends is that all the ideas are grouped [02:53:04] i think the idea is that diederik and kraig are doing a lot of grouping and nesting for their own purposes [02:53:13] sure. [02:53:16] ideally, all we'd ever have to worry about would be features and tasks [02:53:21] features are on WIP - Features [02:53:25] tasks is an unsolved problem [02:53:27] i just want "all the cards related to making kraken better" as a thing [02:53:31] with an url [02:53:37] and a way to easily add a new thing [02:53:44] I don't think that's a thing :/ [02:53:47] call it a "task" if you wish :) [02:54:11] i do not think this is a crazy request [02:54:28] neither do I, I just don't think it's possible right now [02:54:34] we could use tags [02:54:39] we could use card-heirarchy [02:54:44] we could use a new card-type (bad!) [02:54:50] we could use a new property [02:54:53] lots of ways [02:54:54] yeah, me not like any of those things [02:54:56] it's possible [02:55:00] we just have to pick something [02:55:09] not possible within the constraints they picked already is what I meant [02:55:24] and personally, I'd like to remove some of the complexity not keep adding more [02:56:41] i think everyone (inc drdee and kraig) wants to find a solution that makes us ALL productive [02:56:49] we want to work TOGETHER after all [02:57:04] so if you find something good with mingle, i think you should let us know, milimetric [02:57:08] rather than suffer in silence :) [02:57:44] heh, believe me, if I find something good in Mingle, I'll be pretty loud about it [03:04:02] dschoon: i just want "all the cards related to making kraken better" as a thing [03:04:04] https://mingle.corp.wikimedia.org/projects/analytics/cards?favorite_id=722&view=MVC+Value+Path [03:04:11] it's incomplete but it's in progress [03:04:19] those are deliverables [03:04:23] this is non-feature work [03:04:25] right? [03:04:26] no [03:04:30] look at the page first [03:04:37] i did! [03:04:42] it's the "releases" [03:04:48] which (i thought) were feature-driven [03:04:54] plus the features for each release [03:06:34] we're looking for a low-friction way to both see and capture stuff to make X better (where X, in this case, is Kraken, but might also be Limn) [09:30:27] I would like to join the Analytics Team to help. [09:32:58] anyone is online? [09:33:02] I would like to join the Analytics Team to help. [09:33:37] gabrielchihongle: hey! Most of them are in the US, and hence sleeping [09:33:47] oic... [09:33:52] gabrielchihongle: better ask there :-] [09:34:03] which "there? [09:34:04] most people are in San Francisco so they are sleeping right now [09:34:18] I understand [09:34:35] I live in HK [09:34:42] So where should I put my reply? 
[14:21:18] ./bite hashar [14:28:01] morning [14:28:03] ping ottomata [14:28:17] we lost again 6 hadoop nodes [14:28:42] i see this [14:28:57] log-dirs turned bad: /var/lib/hadoop/data/c/yarn/logs,/var/lib/hadoop/data/d/yarn/logs,/var/lib/hadoop/data/e/yarn/logs,/var/lib/hadoop/data/f/yarn/logs,/var/lib/hadoop/data/g/yarn/logs,/var/lib/hadoop/data/h/yarn/logs,/var/lib/hadoop/data/i/yarn/logs,/var/lib/hadoop/data/j/yarn/logs,/var/lib/hadoop/data/k/yarn/logs,/var/lib/hadoop/data/l/yarn/logs 0 0 KB 0 KB [15:34:29] [travis-ci] master/29576ab (#84 by Diederik van Liere): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5474152 [15:40:13] morning milimetric [15:40:25] power back online? :D [15:40:44] yep, just got it back [15:40:52] my battery died right as it was coming back :) [15:41:15] i spoke with maarten from glam btw [15:41:28] great! [15:41:33] helping him with his Limn install, they seem ready to go [15:41:41] awesome [15:43:36] we lost for the 2nd time this week 6 hadoop nodes [15:43:49] i think something is off with the iptables [15:43:55] we lost for the 2nd time this week 6 hadoop nodes [15:43:59] ottomata ^^ [15:44:15] HM [15:44:15] hi [15:44:20] morning :D [15:44:25] morning! [15:46:35] did you restart them yet? [15:46:37] i am looking into it [15:46:39] no [15:46:44] ok good, lemme see [15:46:49] it would be weird if this was iptables, [15:46:54] because restarting is not a structural solution [15:46:54] thats usually an all or nothing thing [15:46:57] not intermtient [15:47:03] yeah but [15:47:18] look at http://localhost:8088/cluster/nodes/lost [15:47:29] (on the job server) [15:47:49] LOSt and Healthy? [15:48:03] yes [15:48:13] that's weird huh ? [15:48:18] also look at http://localhost:8088/cluster/nodes/unhealthy [15:51:17] that is more interesting, yeah i'm seeing weird stuff from namenode too, I the namenode web ui isn't responding [15:51:44] hmmm, yeahhh, hmmm [15:52:29] restarting namenoe [15:56:42] HMM [15:56:47] nodemanagers on those datanodes are dead [15:56:50] at least on an13 [15:58:46] hm [15:58:47] 2013-03-13 06:28:06,703 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread [15:58:47] java.lang.NullPointerException [15:59:08] can you paste the stack trace in a gist? [16:00:06] https://gist.github.com/ottomata/5153524 [16:00:14] this is from nodemanager logs on an13 [16:01:02] ai ai ai, that might be a bug in yarn [16:01:15] same error on an11 [16:01:28] starting nodemanater on an11 [16:01:46] probably same on all those 6 nodes [16:01:49] yeha [16:02:05] ask question on ch4 user group? [16:02:20] HM [16:02:24] it didn't start back up! [16:05:35] yeah totally weird [16:05:38] same error on startup [16:05:39] HMMM [16:06:12] oink [16:06:15] that's weird [16:08:40] i think this is related [16:08:40] http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-issues/201207.mbox/%3C612798421.59963.1342466795250.JavaMail.jiratomcat@issues-vm%3E [16:09:10] https://issues.apache.org/jira/browse/MAPREDUCE-4448 [16:09:32] not sure though [16:10:20] that's easy to test, disable log aggregation [16:10:41] ha, is that like the history thing? [16:10:57] yarn.log-aggregation-enable [16:10:57] ok will try [16:12:12] cool [16:12:36] bnope still tied [16:12:38] same error [16:13:01] i'm going to restart resource manager on an10 and see what happens [16:13:13] good morning [16:13:20] GRR, why is nodemanager running on an10>…. [16:13:22] hi! 
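The "log-dirs turned bad" report above lists every /var/lib/hadoop/data/*/yarn/logs directory at 0 KB, which is what the NodeManager's disk-health checker reports when a directory is missing, unwritable, or on a full disk. Here is a small local diagnostic sketch along those lines, to run on one of the affected workers; the directory list is taken from that log message, while the free-space threshold is an arbitrary assumption, not the NodeManager's actual cutoff.

```python
#!/usr/bin/env python
"""Quick local check of the YARN log dirs the NodeManager marked bad.

Directory list comes from the "log-dirs turned bad" message above; the
1 GB free-space threshold is an assumption, not what the NodeManager's
health checker actually uses.
"""
import os

LOG_DIRS = ['/var/lib/hadoop/data/%s/yarn/logs' % d for d in 'cdefghijkl']
MIN_FREE_BYTES = 1 * 1024 ** 3  # assumed threshold


def check_dir(path):
    """Return a list of human-readable problems for one log dir."""
    if not os.path.isdir(path):
        return ['missing']
    problems = []
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append('not writable by this user')
    st = os.statvfs(path)
    free = st.f_bavail * st.f_frsize
    if free < MIN_FREE_BYTES:
        problems.append('only %d MB free' % (free // 1024 ** 2))
    return problems


if __name__ == '__main__':
    for path in LOG_DIRS:
        problems = check_dir(path)
        print('%-45s %s' % (path, ', '.join(problems) if problems else 'ok'))
```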
[16:13:23] morning [16:13:53] morning kraigparkinson [16:13:57] mornin [16:15:15] WEIRD [16:15:18] still not starting [16:16:14] yeah what that crap crackers [16:18:26] HM [16:18:36] drdee, could this be related to me changing the jvm.reuse setting? [16:18:40] doesn't seem related. [16:18:44] but that's the only thing that has changed [16:18:47] don't think so [16:18:53] no the problem was prior to that change [16:18:54] hm, that and I'm running more jobs concurrenlty these days [16:18:56] yeah probably not [16:19:00] it happened on monday for the first time [16:19:03] yeah that's true, well we don't know if this is that same problem [16:19:06] but it probably is [16:19:10] pretty sure it is [16:19:21] we didn't check the logs then right, we just restarted hadoop? [16:19:25] yes [16:19:38] maybe check hdfs health [16:19:59] the unhealthy name node is worrying [16:20:05] welll [16:20:05] sorta [16:20:08] and that was the exact same error on monday [16:20:15] that is werid though, i thikn that is a consequence of nodemanager running on an10 [16:20:18] it shouldn't be [16:20:34] have you tried disabling the nodemananger on an10? [16:20:39] i just stopped it [16:20:41] ok [16:20:41] still same thing [16:20:46] :( [16:20:46] cause [16:20:53] these dirs [16:20:53] /var/lib/hadoop/data/c/yarn/local [16:20:53] etc. [16:20:54] shouldn't exist on an10 [16:20:55] and they don't [16:20:58] so that is correct [16:21:03] if a nodemanager tries to run there it won't work [16:21:51] i just removed the nodemanager package too [16:21:56] to make sure it doesn't try to start upa gain [16:22:18] k [16:23:44] drdee, let me know when you have a sec. want to talk mingle again. :p [16:23:54] sure give me 3 minutes [16:25:07] ok ready [16:25:47] kraigparkinson: [16:25:47] https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc [16:37:50] SIGGHHGHGH last night everything was working so well! [16:49:29] WAH, nodemanager started this time [16:49:33] I didn't change anything. [16:49:58] based on Mingle, it looks like we don't have anything to showcase today. any objections to freeing up our stakeholders? [16:50:19] no objections from me [16:50:22] no objecttions [16:51:10] ah, but now it died, it just lasted longer, hm. [16:57:43] UHHH, am I running different versions of CDH4 on different nodes????? [16:57:49] waaatttt? [17:00:01] scrum? [17:00:08] ottomata, dschoon? [17:00:12] omw [17:11:00] fyi, i'll be in the office after scrum [17:11:16] (was finishing up the task for legal) [17:22:01] which task for legal? [17:22:08] maybe PM me. :) [17:23:00] i'll forward the email. [17:23:01] brb [17:23:09] then office :) [17:23:58] k, heading out [17:24:02] brb15 [17:24:03] thanks [17:33:42] hey drdee [17:33:48] I was gonna ask [17:33:49] yo [17:33:53] canceled? [17:34:00] wrong button [17:34:06] ha [17:34:12] but nobody was there a couple of minutes ago [17:34:26] I'm here if you want to chat for a moment [17:34:44] I just need to wrap up something, give me 2 mins [17:34:52] back [17:35:02] i think I fixed the nodes dying problem [17:35:11] drdee and I noticed there were different versions of cdh4 running... [17:35:13] i do not know why [17:35:28] the nodes that died were running 4.1.2, all other 4.2.0 [17:35:29] yeah….that's odd indeed [17:35:34] DarTar: link? 
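The "am I running different versions of CDH4 on different nodes?????" moment at the end of this stretch (confirmed later: the dying nodes were on 4.1.2, the rest on 4.2.0) is the kind of drift that is easy to detect mechanically. A sketch follows: it ssh'es to an assumed list of worker hostnames and diffs the installed hadoop package versions via dpkg-query. The host range is a placeholder, and it assumes Debian/Ubuntu packaging plus passwordless ssh.

```python
#!/usr/bin/env python
"""Compare installed Hadoop package versions across cluster nodes.

Assumes Debian/Ubuntu packaging (dpkg-query) and passwordless ssh to
each host; the host list is a placeholder, not the real inventory.
"""
import subprocess
from collections import defaultdict

HOSTS = ['analytics10%02d.eqiad.wmnet' % n for n in range(11, 23)]  # placeholder
QUERY = "dpkg-query -W -f '${Package} ${Version}\\n' 'hadoop*'"


def package_versions(host):
    """Return {package: version} for hadoop* packages installed on host."""
    out = subprocess.check_output(['ssh', host, QUERY]).decode()
    pairs = (line.split(None, 1) for line in out.splitlines() if line.strip())
    return {pkg: ver for pkg, ver in pairs}


if __name__ == '__main__':
    by_package = defaultdict(dict)  # package -> {version: [hosts]}
    for host in HOSTS:
        for pkg, ver in package_versions(host).items():
            by_package[pkg].setdefault(ver, []).append(host)
    for pkg, versions in sorted(by_package.items()):
        if len(versions) > 1:  # same package, different versions => drift
            print(pkg)
            for ver, hosts in sorted(versions.items()):
                print('  %-20s %s' % (ver, ', '.join(hosts)))
```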
[17:35:44] hopefully that will fix the issue [17:35:47] i upgraded the 4.1.2s to 4.2.0 [17:35:49] but only data nodes [17:35:57] i'd like to upgrade an27 soon too [17:36:01] maybe new hue is awesome [17:36:12] fingers crossed [17:36:15] definitely hope so [17:36:28] DarTar: hangout link? [17:36:50] drdee: yep, 1 sec [17:37:41] yeah, sorry drdee, I was totally wrong about 4.1.2 vs 4.2.0. the only upgrade I knew that I had done was a minor one [17:37:55] so when you saw that we were runnign 4.2.0 I assumed that it was just that [17:38:04] totally wrong, pester me more next time when you are right :p [17:41:12] i am now confused :) [17:42:32] ottomata: could you help me add jessie on this ticket? https://rt.wikimedia.org/Ticket/Display.html?id=4726 [17:42:50] i keep getting permission denied when I try to add her as a cc [17:43:01] hmm [17:43:10] what's her email? [17:43:10] jwild@ [17:44:16] ok cool, iet let me [17:46:10] drdee, I updated the feature status of card #312 to not set, so that we could review it in our meeting this afternoon… [17:46:19] ok [17:46:29] I want to make sure that when we put things into the backlog, we're assigning a release to it. [17:46:51] I want to make sure we have at least some vague sense of priority when agreeing to take work on. [17:46:59] grr, duplication [17:47:44] kraigparkinson: I called you kyle in my email to you just now [17:47:46] sorry about that ;] [17:47:54] but yer all setup for RT access now [17:47:55] lol, no worries. [17:48:09] I was talking to a kyle at digicert ;] [17:48:14] happens a lot with K names. [17:52:15] ottomata: so, uh. [17:52:20] multiple versions of CDH? [17:57:47] yeah [17:57:51] i know right [17:57:52] dunno why [18:04:14] http://etherpad.wmflabs.org/pad/p/AnalyticsSpringShowcaseRetroPlanning [18:08:41] hey, I need access to kripke to update the mobile dashboard (http://mobile-reportcard.wmflabs.org/), can someone help me with that? [18:09:11] milimetric ^^ [18:10:53] jgonera: yeah, we're in a meeting at the moment. will ping you after [18:14:25] http://etherpad.wmflabs.org/pad/p/AnalyticsSpringShowcaseRetroPlanning [18:17:15] ok, sure [18:19:31] kraigparkinson, ottomata, milimetric, DarTar, robla: http://analytics-pad.wmflabs.org/p/AnalyticsSpringShowcaseRetroPlanning [18:20:05] hey jgonera, I'll help you personally in about 1 hr. when this meeting is over [18:20:32] are you in an OK timezone to work on it in 1hr? [18:20:58] milimetric: he's in the office :) [18:21:08] cool, how's your eye? [18:23:30] milimetric: much better! but when I woke up it wouldn't open for about 15 minutes, so that was terrifying [18:23:47] glad it's better then [18:23:54] managed to keep calm and have a shower, and then it opened up. Generally been a lot better after that - blinking or closing no longer hurts [18:27:49] milimetric, yes, ~12:20 should be ok [18:27:55] thank you [18:27:56] cool [18:32:11] afk 1 min [18:37:09] back [19:17:57] The new pope is anti udp2log [19:35:56] wat?! [19:36:22] I wonder what Tim has to say about that [19:36:38] also, I nominate Tim to be the new pope [19:40:47] milimetric, I am going to get lunch now, can we reschedule this? [19:40:56] sure [19:41:01] ping me when you're back [19:41:31] ok [19:49:03] greetings all [19:49:20] where can i see the last 2 hours of hrly logs for en wiki ? 
[19:49:35] i want to confirm that the vote of the new pope is causing our spike in requests [19:50:33] welp, in an effort to fix the version mismatch [19:50:35] i upgraded hue :/ [19:50:36] :) [19:50:40] job browser there works now [19:50:43] mabye that will be helpful? [19:52:26] drdee: ottomata dschoon : thoughts on my message above ? [19:53:09] tfinc: stat1/a/squid/archive/sampled [19:53:24] but today's file is probably not yet copied [19:53:25] drdee that won't have last two hours [19:53:30] k [19:53:31] yeah [19:53:32] right [19:53:46] drdee: thats what i was concerned about . when does it get delivered ? [19:53:49] although it's utc, so the file should be copied soon [19:54:19] afaik, otttomata confirm, files get copied at the start of new day using UTC time [19:55:47] drdee: they get copied around 6:30 am UTC [19:56:49] tfinc: 1 sec let me check something [19:56:54] drdee: thanks [19:57:04] the narrative is pretty compelling that this is pope related [19:57:05] they are also on emery [19:57:09] the live ones [19:57:12] yes [19:57:15] /var/log/squid/sampled-1000.log [19:57:17] hat's were i am looking [19:58:17] tfinc: check emery:/var/log/sampled-1000.tab.log [19:58:17] dschoon, you might be able to help me find this, since you were looking at hadoop configurations [19:58:25] that's the current log file [19:58:31] org.apache.oozie.action.ActionExecutorException: JA001: Invalid host name: local host is: (unknown); destination host is: "analytics1010.wikimedia.org":8032; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost [19:58:34] Permission denied (publickey). :( [19:58:42] tfinc: 1sec [19:58:43] I have no idea where it is getting analytics1010.wikimedia.org from [19:58:53] that is not a valid hostname [19:58:57] it should be .eqiad.wmnet [19:59:01] and all configs say that :/ [19:59:05] drdee [19:59:07] you there? [19:59:14] yes [19:59:18] 1 sec [19:59:22] can we hangout please? [19:59:42] the narrative is compelling enough to not have to dive into a deep analysis if we don't have the data files available. [20:00:11] tfinc: sampled-1000.tab.log is now on stat1:/home/diederik, can you read that? [20:00:25] dschoon and I are trying to understand how the difference between https://mingle.corp.wikimedia.org/projects/analytics/cards/240?version=11 and the present version came about; he thinks the intention is very different between the two descriptions. [20:00:41] what hangout are you? [20:01:16] kraigparkinson: ^^ [20:01:19] about to send one, give me a ec [20:01:22] k [20:17:11] drdee: yes. i'll take a look at it after grabbing lunch. thanks [20:19:18] tfinc, yesterday 32667000 hits on 'Pope', so far today 71812000 (using a very naive grep on "Pope" in the entire log line) [20:19:54] more than twice as many requests [20:20:05] thats a lot of pope requests to spread around [20:20:27] (this includes images and pages [20:28:28] ungh, guys i'm probably going to get lunch soon, but I am totally stuck atm [20:28:31] would love brain bouncers [20:29:49] dschoon, just updated https://mingle.corp.wikimedia.org/projects/analytics/cards/92 [20:36:11] thanks, drdee [20:36:13] will check it out [20:36:24] ottomata -- i'll be back in 10 or so, and then we can bounce [20:37:00] ok, i might be at lunch, but ja ping me [20:37:28] dschoon: 239 should be deleted as well [20:37:45] dschoon ^^ [20:41:40] drdee, mind if I reschedule our meeting planned for now? [20:41:45] dschoon, can i delete mingle card 239? 
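The 32,667,000-vs-71,812,000 comparison above is described as "a very naive grep on 'Pope' in the entire log line" over the 1:1000 sampled request logs. Here is the same naive count as a Python sketch; the paths are examples based on the locations mentioned in the channel (stat1's /a/squid/archive/sampled and the copy dropped in a home directory), and, like the grep, it matches 'Pope' anywhere in the line, so images and unrelated titles count too.

```python
#!/usr/bin/env python
"""Naive 'Pope' hit count over 1:1000 sampled request logs.

Mirrors the quick grep from the channel: any line containing 'Pope'
counts, including images and unrelated matches. File paths are examples
based on the ones mentioned above, not a canonical location.
"""
import glob
import gzip
import sys


def open_log(path):
    # Archived sampled logs are typically gzipped; live ones are plain text.
    if path.endswith('.gz'):
        return gzip.open(path, 'rt', errors='replace')
    return open(path, errors='replace')


def count_hits(path, needle='Pope'):
    with open_log(path) as f:
        return sum(1 for line in f if needle in line)


if __name__ == '__main__':
    # e.g. /a/squid/archive/sampled/sampled-1000*.gz on stat1 (example glob)
    paths = sys.argv[1:] or sorted(glob.glob('/a/squid/archive/sampled/sampled-1000*'))
    for path in paths:
        hits = count_hits(path)
        # each line is a 1:1000 sample, so x1000 gives a rough request estimate
        print('%s  %d sampled hits (~%d requests)' % (path, hits, hits * 1000))
```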
[20:41:48] dont' mind [20:41:53] thanks, I need to eat [20:41:58] me too :D [20:42:44] erosen: can you please add some detail to https://mingle.corp.wikimedia.org/projects/analytics/cards/244 ? [20:43:00] in particular the acceptance criteria from amit? [20:43:07] sure [20:43:34] ty [20:44:22] dschoon: i deleted card 239 as a source of confusion ;) [20:44:28] drdee: just to clarify, is this intended to be the entire services provided to zero? [20:44:38] drdee: like the numbers and the dashboard? [20:44:44] yes [20:44:52] k [20:56:32] dschoon, nm [20:56:33] FIXED IT [20:56:37] ! [20:56:42] (just got back) [20:57:08] dschoon: i think in card 62 the task links are incorrect [20:59:34] ottomata: what was it? [21:01:23] ok so, i fixed one thing but not everything [21:01:56] somehow, after upgrading to 4.2.0 everywhere [21:02:11] i'm getting errors where some things think that the resource manager address is analytics1010.wikimedia.org [21:02:13] which is not a valid name [21:02:18] and I don't have that ocnfigured anywhere [21:02:50] ottomata: you check /etc/hostname ? [21:03:09] analytics1010 [21:03:11] no wikimedia.org [21:03:28] I grepped all of /etc on namenode and an27 (oozie host) for something relevant [21:03:30] didn't find anything [21:03:40] and facter [21:03:41] says [21:03:41] otto@analytics1010:/etc$ facter fqdn [21:03:41] analytics1010.eqiad.wmnet [21:04:00] drdee: updated the zero mingle card [21:04:35] drdee: not sure it is complete, but can't think of other things / I'm not totally sure about the degree of detail we should have on these cards [21:04:36] ty erosen! [21:04:39] dschoon, example: [21:04:39] http://localhost:19888/jobhistory/logs/analytics1014:8041/container_1363208055308_0001_01_000002/attempt_1363208055308_0001_m_000000_0/stats [21:04:44] Caused by: java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "analytics1010.wikimedia.org":8032; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost [21:05:08] ahhh [21:05:17] ottomata: it's the stupid JVM host thing [21:05:18] sec [21:05:28] oh? [21:09:26] dschoon, i'm going to run for a few hours and then be back on tonight to work for a couple more hours [21:09:39] can we hangout real quick so I can show you how to reproduce this? [21:12:36] dschoooooon come back to me! [21:15:36] ottomata! [21:15:39] i am here for youuuu [21:15:41] also: http://docs.oracle.com/javase/7/docs/api/java/net/InetAddress.html [21:15:45] and http://docs.oracle.com/javase/7/docs/api/java/net/doc-files/net-properties.html [21:16:16] worth reading: http://stackoverflow.com/questions/7348711/recommended-way-to-get-hostname-in-java [21:16:28] ottomata: https://plus.google.com/hangouts/_/3a5e13f228d3e2c037ed195fcb67c1f6af887a20 [21:20:14] milimetric, where can I find you? 6th floor? [21:20:29] hi jgonera, no, I'm in Michigan :) [21:20:34] I work from Philadelphia usually [21:20:35] oooh [21:20:37] ok [21:20:41] all right [21:20:56] so, the first problem I have is what YuviPanda also experienced [21:21:05] access to stat1001? [21:21:20] ok, maybe lets start with access [21:21:22] milimetric: you aren't in the office? I thought the hangout was with you at the office? [21:21:27] is stat1001 kripke? [21:21:32] nope, I'm remote [21:21:40] i was visiting for a week [21:22:02] no jgonera, they're different [21:22:06] here, I'm PM you [21:35:52] milimetric: aah, okay :) [23:21:55] [travis-ci] master/a77422f (#85 by Diederik van Liere): The build has errored. 
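The UnknownHostException being chased here comes down to some component resolving the ResourceManager as analytics1010.wikimedia.org rather than analytics1010.eqiad.wmnet, even though /etc/hostname and facter agree. The InetAddress links above cover how the JVM does its lookups; below is a small Python sketch of the equivalent checks on a node — short hostname, resolver FQDN, and forward/reverse DNS — with the expected .eqiad.wmnet suffix treated as an assumption for the warning.

```python
#!/usr/bin/env python
"""Show the different answers a host gives for 'what is my name',
roughly the lookups the JVM's InetAddress goes through (see links above).

The expected '.eqiad.wmnet' suffix is an assumption used only for the
warning; adjust it for the actual cluster domain.
"""
import socket
import sys

EXPECTED_SUFFIX = '.eqiad.wmnet'


def report(host=None):
    short = socket.gethostname()
    print('gethostname():   %s' % short)
    print('getfqdn():       %s' % socket.getfqdn())

    target = host or short
    try:
        _, aliases, addrs = socket.gethostbyname_ex(target)
    except socket.gaierror as exc:
        print('forward lookup of %r failed: %s' % (target, exc))
        return
    print('forward lookup:  %s -> %s (aliases: %s)' % (target, addrs, aliases))

    for addr in addrs:
        try:
            rev = socket.gethostbyaddr(addr)[0]
        except socket.herror as exc:
            print('reverse lookup of %s failed: %s' % (addr, exc))
            continue
        print('reverse lookup:  %s -> %s' % (addr, rev))
        if not rev.endswith(EXPECTED_SUFFIX):
            print('  WARNING: reverse name is outside %s' % EXPECTED_SUFFIX)


if __name__ == '__main__':
    # e.g. ./hostcheck.py analytics1010.eqiad.wmnet (hypothetical usage)
    report(sys.argv[1] if len(sys.argv) > 1 else None)
```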
http://travis-ci.org/wikimedia/kraken/builds/5485841 [23:25:55] this weekend GSP vs. Diaz [23:26:25] :D [23:26:30] can't wait [23:26:48] doing those charts now, I have some charts for small data, I'm polishing it up and then time for a full run on stat1 [23:27:08] we'll have charts for every day in 2012-11 ===> 2012-12