[08:20:09] [travis-ci] master/6aac3bc (#97 by dsc): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5619391 [12:27:56] no prob YuviPanda, let me know when you're around and maybe I can help with the script. Sorry I couldn't stay later yesterday, I hung out with Lawrence Lessig!! [14:41:24] mornin [14:41:27] ottomata about? [14:41:30] morning [14:41:31] yuppers [14:42:13] howdy [14:42:22] so, i was trying to create a new oozie job yesterday [14:42:38] to test my version of the zero script without disrupting the current one [14:42:48] i have workflow and coordinator xml files [14:43:04] but i totally could not figure out what to do with them (even once they were in hdfs) [14:43:19] the oozie gui did not really help me beyond trying to trick me into duplicating my work [14:43:45] ottomata: so like, does it require trickery? [14:43:48] perhaps cli trickery? [14:43:50] it requires cli [14:43:50] yeah [14:43:52] or database trickery? [14:43:54] okay. [14:44:08] so, i would start with testing the workflow [14:44:08] i will read docs, unless you have a snippet handy [14:44:20] since those are one off jobs [14:44:30] check out my tests on an02 [14:44:37] /home/otto/scr/oozie [14:44:43] i've got some workflows defined there [14:44:52] with job.properties files manually setting the variables that the workflow.xml script gets [14:45:07] once you've got that [14:45:10] to submit a workflow: [14:45:11] https://www.mediawiki.org/wiki/Analytics/Kraken/Oozie#Workflow [14:45:20] # upload the workflow.xml file to your oozie.wf.application.path [14:45:20] hadoop fs -put workflow.xml /user/dummy/oozie/webrequest_loss_by_hour/ [14:45:20] # submit the job.properties file to oozie and start it (-run == -submit && -start) [14:45:20] oozie job -oozie http://analytics1027.eqiad.wmnet:11000/oozie -run -config ./job.properties [14:45:44] okay. [14:45:52] that way you can check your job in oozie against a single manually defined dataset [14:46:06] rather than submitting a coordinator and waiting for it to run against a bunch of datasets to test [14:46:22] make OUTPUT go to your hdfs user dir somewhere [14:46:31] oh, also, sometimes it is helpful to run these as the stats user [14:46:33] sudo -u stats [14:46:42] (make sure your hdfs output dir is writeable by stats if you do this) [14:47:03] that way you can be more sure that it will run properly as the stats user, and that you don't have some personal config or file that stats doesn't have [14:47:09] right [14:47:21] aiight. i'll let you know how it goes. [14:47:25] cool [14:48:29] ah, see, i tried to do this with only pig [14:48:36] and it couldn't find my jars no matter where i put them [14:48:47] i forgot you outlined workflow-only testing in this doc [14:48:51] it's great, ty! [14:49:21] yup!
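Pulling the one-off workflow test described above into a single runnable sketch. The hadoop and oozie commands are the ones quoted from the wiki page; the job.properties keys and values here are illustrative assumptions, not the real Kraken config:

    # job.properties: manually pin the variables that workflow.xml would
    # normally get from the coordinator (keys/values below are assumptions)
    cat > job.properties <<'EOF'
    nameNode=hdfs://namenode.example.wmnet:8020
    jobTracker=resourcemanager.example.wmnet:8032
    oozie.wf.application.path=/user/dummy/oozie/webrequest_loss_by_hour
    outputDir=/user/dummy/oozie_test_output
    EOF

    # upload the workflow.xml file to your oozie.wf.application.path
    hadoop fs -put workflow.xml /user/dummy/oozie/webrequest_loss_by_hour/

    # submit the job.properties file to oozie and start it (-run == -submit && -start),
    # optionally as the stats user so permission/config surprises show up early
    sudo -u stats oozie job -oozie http://analytics1027.eqiad.wmnet:11000/oozie -run -config ./job.properties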
[14:55:32] brb a moment [14:59:37] analytics! [14:59:37] http://www.grantland.com/story/_/id/9068903/the-toronto-raptors-sportvu-cameras-nba-analytical-revolution [15:00:03] relevant re: data driven _____ [15:14:36] hokay, office soon. [15:24:39] yo ottomata [15:24:40] yoyo [15:25:50] is the metrics api stuff ready? [15:27:01] yup, afaik, i signed off while rfaulkner was checking it out [15:27:02] but [15:27:16] flask-login has a .deb, is puppetized and installed [15:27:42] if it is good then ryan will probably want me to remove the .htaccess [15:27:51] but i'm waiting for his go ahead on that [15:28:10] cool, yeah i saw the chats about the deb stuff [15:29:14] yeah faidon was super available and helpful yesterday [15:29:24] and now he wants me to do some reviews and help build more python module .debs :p :) [15:29:31] a good deal I think [15:29:36] that's how it goes :) [15:29:44] i'm going to see if I can get locke at least mostly replaced today [15:29:53] some of the filters moved to gadolinium [15:30:07] and then i'm going to step it up on my RT duty for this week, including that stuff [15:30:17] drdee, ottomata: metrics.wikimedia.org looks good [15:30:45] yeehaw [15:30:52] is flask-login working? [15:30:56] at most I may ask andrew to update the code base for some small fixes but it's sufficient for any potential demoing [15:31:08] i haven't checked that yet [15:31:13] aye ok [15:31:21] yeah so, puppet is configured to automatically pull master [15:31:31] that should happen every ~30 mins [15:32:46] ok, that's perfect ottomata. looks like one of the views is missing login but that's something for me to fix in the view module i think [15:32:54] i can hit the login page and login though [15:34:21] cool, well you test around, lemme know how it goes and if I can help [15:34:25] when you are ready I can remove the .htaccess [15:34:48] hopefully we should need minimal maintenance for this going forward. Cool, that's great thanks andrew [15:36:24] Dario and I should probably sit down and think about some usage profile so we can get a better idea of what support it actually will need down the road so we can better predict what to optimize for [15:37:14] I'm testing locally or on stat and some of the behavior doesn't always seem to replicate the prod instance … but the diffs are usually pretty minor [15:37:28] anyway, looking good now. will keep you posted on anything [15:40:52] yeah [15:41:08] we should actually probably modularize the puppet stuff a bit more, and get you a labs instance that is almost identical [15:50:31] true. I had this instance created some time back https://wikitech.wikimedia.org/wiki/Nova_Resource:I-0000054b [15:50:50] i think i had an old version of the code base deployed there [15:51:01] can check in a bit. for the moment gtg [16:19:28] ottomata: i'm rather puzzled [16:19:45] here, http://localhost:8888/jobbrowser/jobs/job_1363221635790_3169 -- my job claims to have succeeded [16:20:55] yeah, you can't trust that i think [16:21:15] here, it clearly did not: http://localhost:8888/oozie/list_oozie_workflow/0000588-130314023321019-oozie-oozi-W/ [16:21:24] got a sec to help? [16:21:30] i see almost no debugging output [16:21:36] this is more relevant [16:21:36] http://localhost:8888/oozie/list_oozie_workflow/0000588-130314023321019-oozie-oozi-W/ [16:21:42] logs, i think http://localhost:19888/jobhistory/logs/analytics1012:8041/container_1363221635790_3169_01_000001/job_1363221635790_3169/stats/syslog/?start=0 [16:21:53] yeah, but look at the logs.. [16:22:31] 2013-03-19 16:09:20,944 INFO PigActionExecutor:539 - USER[stats] GROUP[-] TOKEN[] APP[dsc_zero_workflow] JOB[0000588-130314023321019-oozie-oozi-W] ACTION[0000588-130314023321019-oozie-oozi-W@dsc_zero_new] action completed, external ID [job_1363221635790_3169] [16:22:31] 2013-03-19 16:09:20,973 WARN PigActionExecutor:542 - USER[stats] GROUP[-] TOKEN[] APP[dsc_zero_workflow] JOB[0000588-130314023321019-oozie-oozi-W] ACTION[0000588-130314023321019-oozie-oozi-W@dsc_zero_new] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.PigMain], exit code [2] [16:23:10] http://localhost:19888/jobhistory/logs/analytics1018:8041/container_1363221635790_3169_01_000002/attempt_1363221635790_3169_m_000000_0/stats [16:23:12] maybe take a look at the job conf and see if it makes sense? [16:23:13] that's confusing but you need to dive into the mapper logs [16:23:26] the oozie logs don't tell anything [16:23:30] ahh [16:23:36] 1. how did you get to that? [16:23:36] so, to get that [16:23:40] 2. okay, that's helpful. [16:23:43] yeah! [16:23:44] start here, right? [16:23:44] http://localhost:19888/jobhistory/job/job_1363221635790_3169 [16:23:50] click on the 'Map' link [16:24:02] then click on the task id link [16:24:09] then click on 'logs' link for the attempt [16:24:14] ahh [16:24:15] okay. [16:24:17] gotcha. [16:24:21] i know, obvious, right?! :p [16:25:04] also, you can probably find these in hdfs in /var/log/hadoop-yarn/apps/stats// [16:25:37] ty, ottomata [16:25:47] yup!
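The click-path and the HDFS fallback just described, condensed into a sketch. The IDs and HDFS path are the ones quoted in the conversation; whether this cluster's yarn CLI ships the logs subcommand is an assumption:

    # web route: job history server, then 'Map', then the task id, then 'logs'
    #   http://localhost:19888/jobhistory/job/job_1363221635790_3169

    # hdfs route: aggregated container logs, one directory per application
    hadoop fs -ls /var/log/hadoop-yarn/apps/stats/logs/
    hadoop fs -ls /var/log/hadoop-yarn/apps/stats/logs/application_1363221635790_3169/

    # cli route, if this yarn version ships it: a job_* id maps to an
    # application_* id with the same numeric suffix
    yarn logs -applicationId application_1363221635790_3169 | less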
[16:39:32] ah, drdee! we didn't build a precise version of the new udp-filter! [16:39:45] :( [16:39:55] maybe stefan can help you out [16:40:02] he has a build env for that [16:42:19] average_drifter, you around? [16:42:23] ottomata: yes [16:42:39] can you build a udp-filter 0.3.21 package for precise? [16:42:44] yes [16:43:04] or hm, maybe 0.3.22? we have 0.3.22 in apt, but locke has 0.3.21 installed [16:43:09] ottomata: x86 ? amd64 ? [16:43:47] I'll build both [16:44:02] just do amd64 [16:44:10] and ottomata why 0.3.21 and not 0.3.22 ? [16:44:16] looks like x86_64? [16:44:21] root@gadolinium:~# uname -m [16:44:21] x86_64 [16:44:35] locke has 0.3.21 installed [16:44:36] dunno [16:44:38] only 0.3.22 in apt [16:45:20] also, average_drifter, i think you want to build from the field_delim_param branch [16:45:34] yes! :) [16:45:43] and probably best to merge that back into master [16:46:20] ottomata: ok, I'll build field_delim_param [16:46:36] i think master is farther ahead than that, with other stuff [16:46:41] we might want to rename master or something [16:47:13] average_drifter: ask ^demon to help you with that [16:47:20] we did the same trick for webstatscollector [16:47:27] basically current master becomes old master [16:47:34] and field_delim_param becomes new master [16:48:36] ok [17:01:20] dschoon, kraigparkinson: scrum [17:01:32] we're on our way [17:02:31] average_drifter ^^ [17:04:05] I'm on in 20s [17:31:15] grooming hangout: https://plus.google.com/hangouts/_/e920454707b309ebf66331e55e9d557eb5ed1caa [17:32:53] I'm out to get something quick and I'll be back [18:41:25] okay figured out how to fix my mingle issues [18:41:30] sudo killall -HUP mDNSResponder [18:41:34] that resets the local dns cache [18:47:25] drdee: have you seen anything like this before? http://localhost:19888/jobhistory/logs/analytics1015:8041/container_1363221635790_3210_01_000002/attempt_1363221635790_3210_m_000000_0/stats/stdout/?start=0 [18:47:37] 1 sec [18:49:29] that hadoop finished the job successfully but oozie failed?
[18:50:03] yes many times, it's retarded, errors are not properly propagated [18:50:05] you've got to look into the logs of the mappers [18:50:12] easiest is to do that through hue [18:50:32] go to /var/log/hadoop-yarn/dsc/logs/ or something like that [18:50:36] and look for the jobid [18:50:45] yeah [18:50:50] that *is* the mapper output. [18:50:59] all it tells me is "pig exited 2" [18:51:04] no it's not [18:51:21] 1 sec [18:51:55] hm! [18:52:20] check here http://localhost:8888/filebrowser/#/var/log/hadoop-yarn/apps/dsc/logs [18:52:22] i thought that's what the _m_ was saying [18:52:31] but jobid 3210 is missing [18:53:55] interesting. [18:54:03] the syslog is the one that contains weird shit [18:54:07] http://localhost:19888/jobhistory/logs/analytics1012:8041/container_1363221635790_3213_01_000002/attempt_1363221635790_3213_m_000000_0/stats/syslog/?start=0 [18:54:20] 2013-03-19 18:51:29,269 ERROR [main] org.apache.pig.Main: ERROR 2999: Unexpected internal error. unable to read pigs manifest file [18:54:28] oh, buh [18:54:31] i bet i know why. [18:54:31] okay. [18:54:33] hihi [18:54:37] still jar problems. [19:08:46] it appears the right answer is to build our jars to exclude the hadoop framework deps, as they're provided [19:18:01] never seen this error before [19:23:32] yeah, i'm almost certain it's a conflict in the framework versions [19:24:00] are you sure? [19:24:02] this is weird: [19:24:03] 2013-03-19 18:51:28,305 ERROR [main] org.apache.hadoop.io.nativeio.NativeIO: Unable to initialize NativeIO libraries [19:24:04] java.lang.NoSuchMethodError: [19:24:17] the thing is that oozie has copies of the hadoop & pig jars as well [19:24:37] ok then those need to be replaced on hdfs with the new ones [19:24:48] probably related to the upgrade to 4.2 [19:25:13] our custom jars don't include hadoop framework shizzle [19:25:35] pretty sure the problem is this http://mail-archives.apache.org/mod_mbox/incubator-oozie-users/201205.mbox/%3CCAJs-t7MLUvcFyjBBDKY-9HOQ7%2BswvD4FqcA9WdNZtENwbgLzLw%40mail.gmail.com%3E [19:25:41] yeah, the 4.2 upgrade [19:25:45] they do, actually. [19:25:58] kraken-generic does an assembly with deps [19:26:09] and the transitive deps definitely depend on hadoop, i believe. [19:26:26] but either way, i'll shade the jar so we don't have this problem again [19:26:31] i agree it's the change to 4.2 [19:26:55] i also wouldn't be surprised if our cloudera deps are out of date in the pom [19:27:12] that would mean the new jars are built against the wrong version, which would also cause problems [19:27:15] jar hell! [19:28:17] yes the poms should be updated as well [19:29:13] brb, gonna get some food, etc [19:29:47] yep. [19:30:17] /usr/lib/pig/pig.jar -> pig-0.10.0-cdh4.2.0.jar [19:31:01] the dependency in kraken-pig is org.apache.pig:pig:0.10.0-cdh4.1.2 [19:31:05] so! [19:31:20] easy enough. will fix after lunch. [19:31:28] well remember the upgrade was unplanned :D [19:34:43] yeah, i've been thinking about that [19:34:47] i have suspicions. [19:34:54] when ottomata1 is around, we should bounce them about [19:35:09] k [19:35:10] because that was a pretty big deal, that CDH magically upgraded beneath us [19:35:17] that's, like, a Major Outage [19:35:20] yup [19:36:02] i'm around, one sec, i'm deploying gadolinium stuff [19:45:41] !log frontend caches now sending webrequest udp2log stream to gadolinium [19:46:15] !log frontend caches now sending webrequest udp2log stream to gadolinium [19:46:31] average_drifter, how goes the precise udp-filter package?
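An aside on the jar hell dissected above (kraken-pig pinned to 0.10.0-cdh4.1.2 while the cluster moved to cdh4.2.0): one way to audit what the assembly actually drags in. A sketch; the module name is from the conversation and the flags are stock Maven:

    # show where hadoop and pig artifacts enter the dependency graph,
    # omitted/conflicting versions included
    cd kraken-pig
    mvn dependency:tree -Dverbose -Dincludes=org.apache.hadoop,org.apache.pig

    # and a blunt check of which CDH versions the poms actually pin
    grep -rn 'cdh4' --include=pom.xml .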
[19:47:16] dschoon, i'm around [19:47:17] what's up? [19:47:23] ottomata: I'm wrapping it up, I'll ping you in a few [19:47:26] danke [19:47:46] wanted to chat with you & drdee about CDH magically updating at some point [19:47:51] not urgent [19:48:29] k [19:48:39] i mean, i'm waiting for udp-filter for gadolinium right now [19:48:47] so i got some time [19:51:42] aiight [19:51:45] drdee: are you about? [19:51:49] yes [19:51:51] coolio [19:52:18] so, i was thinking about how we got into a state where some machines were CDH 4.2 and others were still 4.1 [19:52:44] ottomata, did you figure that out? otherwise i have a theory. [19:53:25] no [19:53:51] are our machines covered by the CDH puppet module? [19:54:00] does it specify a version, or does it use "latest"? [19:54:23] [travis-ci] master/9eabd37 (#98 by Diederik van Liere): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5636266 [19:56:18] ^^ ottomata [19:56:31] * YuviPanda waves at milimetric [19:56:34] i wrote the cdh4 puppet module, i doubt i'd use latest, but let me check [19:56:52] also, they weren't running the cdh4 puppet module at the time of 4.2.0 release, i'm pretty sure [19:57:40] just checked, puppet was not ensuring latest on those packages, even if it was using the cdh4 module at the time of 4.2.0 release [19:59:47] ah [19:59:49] hm. [19:59:51] okay, welp. [20:00:27] my thought was that a mistaken submodule update resulted in some updates on our cluster [20:00:31] no idea otherwise. [20:01:14] i'm looking at https://github.com/wikimedia/puppet-cdh4/blob/master/examples/ and i'm not seeing how to tell it which version of the package to use [20:01:21] dschoon: massive simplification of https://mingle.corp.wikimedia.org/projects/analytics/cards/61 [20:01:42] (obv unrelated to our troubles if we're not using it on the cluster) [20:01:42] old 61 has now become https://mingle.corp.wikimedia.org/projects/analytics/cards/380 [20:02:24] drdee: update frequency is hourly? does the new file replace the old one? [20:03:00] it doesn't tell it, it would install the latest available [20:03:05] but [20:03:09] it won't upgrade it if the package is installed [20:03:26] so, a reinstall would cause a diff version to be installed if a new one is avail [20:04:12] hey Yuvi [20:04:21] didn't see you there YuviPanda with your waving [20:04:35] heh, I tend to get lost :) [20:04:39] okay, so I've a script [20:04:42] awesome [20:04:44] milimetric: should the script do the copying? [20:04:51] to /a/? [20:04:57] or should the copying be done separately? [20:05:02] um, will the copying ever need to be separate? [20:05:08] not that I can think of [20:05:12] so I'll add the copying too [20:05:21] milimetric: so we'll want to serve just the datafiles or everything? [20:05:24] k, then yeah, just copy the rsync line from the puppetized cron [20:05:34] well, will the datasources change? [20:05:41] well, they currently do change [20:05:45] for the date ranges [20:06:22] so do the datasources select a SUBset of the data available in the datafiles? [20:06:57] milimetric: well, no [20:07:04] milimetric: this is limnpy's doing :P [20:07:11] I know [20:07:13] just wondering how the datafiles were created [20:07:15] ok, good [20:07:36] so in that case we can just manually edit the datasources, take out the date ranges and host them with the limn instance [20:07:47] then we can host just the datafiles remotely and everything should take care of itself [20:08:01] ottomata: hm. was there a reinstall?
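Whatever let 4.2 slip in, holding the packages would prevent a repeat. A sketch using dpkg holds; the package names are illustrative, the real CDH4 list is longer:

    # what is installed vs. what apt would install next
    dpkg -l | grep -E 'hadoop|pig|oozie'
    apt-cache policy hadoop-yarn

    # freeze the installed version so a stray reinstall/upgrade can't drift
    echo 'hadoop-yarn hold' | sudo dpkg --set-selections
    dpkg --get-selections | grep hold   # verify the hold took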
[20:08:04] so yeah, your rsync can just copy the datafiles, and the target path doesn't matter [20:08:20] don't think so [20:08:35] def not since analytics puppetmaster was turned off [20:08:48] i'm pretty sure that was before 4.2.0 came out [20:08:52] milimetric: okay [20:09:12] * YuviPanda goes to find the gerrit changeset [20:10:29] YuviPanda: https://gerrit.wikimedia.org/r/#/c/54116/4/manifests/misc/statistics.pp [20:11:21] dschoon: good question, i am waiting for reply from tomasz [20:11:36] ottomata: yeah, i think you're right [20:11:37] hm [20:11:50] ottomata: you remember which machines it was? [20:11:59] no but, hmm, one sec [20:15:27] ack, no [20:15:32] i don't know, was hoping to find it in chat logs [20:15:40] i think we were talking in hangout when we noticed and when I fixed [20:15:42] it was 6 of them [20:15:46] an11 and an13 were some [20:22:25] ottomata: [20:22:27] milimetric: well, so executing '/home/yuvipanda/mobile-uploads/limn-mobile-data/update-mobile.bash /home/yuvipanda/mobile-uploads/limn-mobile-data/mobile/' on stat1 should generate files and copy them to /a/limn-public-data/mobile [20:22:27] ottomata: http://garage-coding.com/releases/udp-filters/ [20:22:30] ottomata: 0.3.22 [20:22:32] for precise [20:22:36] milimetric: however, rsync has to run from stat1001 [20:22:59] oh it does? [20:23:08] YuviPanda: why? [20:23:39] milimetric: hmm, https://gerrit.wikimedia.org/r/#/c/54116/4/manifests/misc/statistics.pp is for stat1001 no? [20:23:40] or is it for stat1? [20:24:03] well where it goes is up to a different file [20:24:11] that's just defining "what" the cron looks like [20:24:15] * YuviPanda is a little confused. [20:24:25] sites.pp defines "where" the different definitions are applied [20:24:35] ah, okay [20:24:36] * milimetric knows VERY little puppet :) [20:24:41] but basically, here's sites.pp: [20:24:45] hehe, i know even less little puppet :) [20:26:35] YuviPanda: https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=manifests/site.pp;h=436f6c7ca6c661b7d3d25eb02645b009c81a1d9a;hb=69eca0ac218019688f3cc98b282453b36d9a0759 [20:26:41] (sorry, gerrit's hard to navigate) [20:27:06] misc/statistics.pp isn't organized that well, but the classes are approximately node independent [20:27:11] it depends on what gets included on what nodes [20:27:18] check out roles/statistics.pp [20:27:35] milimetric: heh :) [20:27:58] role::statistics::www is included on stat1001, and role::statistics::cruncher is included on stat1 [20:28:13] milimetric: thanks for that link with the VM [20:28:22] no prob average_drifter [20:28:22] average_drifter, danke, on it [20:28:23] I will definitely have to go through all their lessons with puppet [20:28:43] i went through just the first page of a tutorial and it was very helpful [20:28:47] the VM is very well done [20:28:50] guys, i'd be happy to give a little puppet tutorial to you sometime [20:29:09] it be cool if you both had a small distinct task to get done too [20:29:09] I think you're too high level :) [20:29:09] and we could all work through it together [20:29:10] haha [20:29:14] maybe [20:29:27] have you checked out that VM? [20:29:36] it's pretty cool, and has awesome VIM extensions for puppet [20:30:20] but ottomata I'm about to have to submit a change to that changeset by Ori [20:30:36] bwerrrr, which one? [20:30:37] so YuviPanda, the script you made can run anywhere right? [20:30:39] oh the limn-public-data one?
[20:30:42] yes [20:30:51] instead of an rsync it's just going to be a script [20:31:02] milimetric: it runs only on stat1, because virtualenv. [20:31:08] so i gotta make an erb template and ensure it exists, then change the cron to call it right? [20:31:32] ok, cool YuviPanda, will you send me the file so I can start puppetizing it? [20:32:02] committing to repo [20:32:03] one moment [20:32:08] and now, i wait while maven downloads the internet. [20:32:08] (committing is convoluted) [20:32:56] YuviPanda: is virtualenv installed on stat1? [20:33:02] and if so, where's it puppetized? [20:33:41] milimetric: it's not puppetized on stat1 [20:33:52] virtualenv is not puppetized anywhere, and I doubt it will ever be [20:34:00] IIRC stat1 is sortof exceptionist [20:34:01] virtualenv is on stat1001 [20:34:03] ? [20:34:05] no [20:34:11] y u need virtualenv? [20:34:13] limnpy? [20:34:17] well don't you need it for what you're doing? [20:34:20] limnpy + deps [20:34:24] aye [20:34:38] i might build a limnpy .deb package soon [20:34:38] milimetric: err, didn't we agree to run our stuff on stat1 and rsync it to stat1001? [20:34:39] practicing on one for Ori right now [20:34:50] ottomata: hmm, pandas as a dep is gonna be fun :) [20:34:54] OH pandas [20:34:55] Grrr [20:34:59] oh but there is a .deb already for it [20:35:04] sorry YuviPanda, I think we haven't hooked up since Friday [20:35:06] in apt somewhere [20:35:07] sure, but it is old and limnpy doesn't work [20:35:12] and Ops changed the playing field [20:35:13] either in debian/ubuntu or wmf [20:35:21] we need 100% puppetization, everywhere [20:35:30] oh so we're in trouble with this requirement then [20:35:46] really? I thought this is what we were talking about when you accidentally ignored ori-l [20:35:54] and that's why it was changed to rsync [20:35:55] that wasn't me :) [20:35:56] rather than whatever else [20:35:57] that was erosen [20:35:59] ow [20:36:00] dammit [20:36:03] sorry [20:36:03] :) [20:36:11] hmm, average_drifter, I think I need x86_64 [20:36:12] and it was at like 5am and I only know 'cause i read the logs [20:36:22] sorry about that. [20:36:22] root@gadolinium:~# uname -a [20:36:22] Linux gadolinium 3.2.0-38-generic #61-Ubuntu SMP Tue Feb 19 12:18:21 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux [20:36:23] ok, let's do this. where's that script of yours? [20:36:29] no no no, not your fault [20:36:32] things are just shifting under us [20:36:34] hey YuviPanda , milimetric can i help? [20:36:38] what's going on exactly?
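For the architecture back-and-forth: the kernel and dpkg simply name the same thing differently, which is all the x86_64 vs. amd64 confusion amounts to. A quick check, runnable on any of these hosts:

    uname -m                    # kernel's name for it: x86_64
    dpkg --print-architecture   # dpkg/apt's name for the same thing: amd64
    # so an amd64 .deb is exactly what an x86_64 gadolinium wants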
* erosen reading backlog [20:36:45] sorry, no, just a misunderstanding [20:36:51] Yuvi thought I was you [20:37:02] fact which I take substantial pride in btw [20:37:04] erosen: we were talking about running the cron on stat1 to generate files, and then rsync to stat1001 [20:37:06] haha :) [20:37:14] milimetric: well, the script is going to be useless on stat1001 [20:37:21] hehe [20:37:22] virtualenv isn't available, and pandas isn't either [20:37:29] it's ok, I just want to see it to try to think of the easiest way forward [20:37:31] and the .deb version of pandas is old enough that limnpy doesn't work with it [20:37:40] rsync *is* the easiest way forward [20:37:40] maybe we can just get a basic CSV straight out of SQL [20:37:48] no to generate the datafiles I mean [20:37:53] since we don't need datasources [20:37:53] YuviPanda: i thought the solution was to have the scripts run on stat1 and then rsynced to stat1001 [20:38:00] erosen: I thought so too [20:38:07] yeah, but 100% puppetization is required everywhere now erosen [20:38:11] that's the curveball [20:38:13] even stat1? [20:38:16] woooooahh [20:38:20] **everywhere** [20:38:21] no user crons? [20:38:28] or you shall face the wrath and blood bathing rage [20:38:28] yeah, in that case stat1 should be nuked from orbit now, no? [20:38:39] no nothing not in puppet [20:38:44] wow, guess I'll take my ball and try to play on labs... [20:38:49] which has no dbs.... [20:38:56] if you want to cd /home/dandreescu, you have to puppetize it [20:39:00] :) [20:39:03] milimetric: in that case I doubt we can get anything productive done with stat1001 for a while. [20:39:03] hehe [20:39:14] well, failure is unacceptable YuviPanda [20:39:29] at least my script is useless :( [20:39:31] I'd like to introduce you to the duct tape hacker known as me. Me, Yuvi, Yuvi, me [20:39:40] :D [20:39:44] ok, we can make this work [20:39:51] so guys, there was a bit of discussion about this in the ops meeting yesterday [20:39:52] but I need to see the script [20:40:10] i tried to think of use cases where you guys would need your own crons writing data to your own homedirs [20:40:15] i only had them in theory [20:40:31] if you have some, write them up and i'll use them as arguments for not needing to puppetize all crons [20:40:35] puppetizing crons isn't the problem [20:40:41] it's the fact that we need limnpy [20:40:52] and that if there's a simpler way to generate the datafiles directly, I'd rather do that [20:40:53] milimetric: https://github.com/wikimedia/limn-mobile-data/blob/master/update-mobile.bash [20:40:56] even if it's a one-off [20:40:58] it is completely utterly useless, I think :P [20:41:00] thx Yuvi [20:41:04] also no docs [20:41:13] if limnpy doesn't work with .deb pandas, then that is a problem fo sho [20:41:27] erosen, you thought about removing the pandas deb, right? [20:41:29] you should invoke it as [20:41:29] dep* [20:41:32] so what's generate.py?
[20:41:32] yeah [20:41:33] err [20:41:34] /home/yuvipanda/mobile-uploads/limn-mobile-data/update-mobile.bash /home/yuvipanda/mobile-uploads/limn-mobile-data/mobile/ [20:41:39] milimetric: generate.py is in the git repo [20:41:47] milimetric: mobile/ is also in the git repo [20:41:51] yep, got it [20:42:03] ottomata: i think for the dashboard jobs, we can definitely do without pandas [20:42:31] milimetric: also there are going to be problems with doing straight up raw SQL, since we'll *definitely* want to use MongoDB in the *near* future [20:42:33] ottomata: I'm more alarmed by the idea of slowing down analytics development to the speed of production development [20:43:38] so erosen, this line generates the datasources and datafiles, right: https://github.com/wikimedia/limn-mobile-data/blob/master/generate.py#L50 ? [20:43:49] yup [20:44:17] as we've talked about, limnpy is probably overkill for the datafiles [20:44:36] ok, well I agree with your larger point above [20:44:43] hmm, we probably don't even need unicodecsv [20:44:47] can just use normal csv module for now [20:44:49] but I've had this one card to babysit for a week and there's no way I'm failing to deliver by tomorrow [20:45:37] yeah, erosen, if you would, document that stuff as well as you can [20:45:39] looking at an example sql file now, SlightlySadPanda are these all pretty much the same? [20:45:42] maybe on https://www.mediawiki.org/wiki/Analytics somewhere [20:45:43] ottomata: which stuff exactly? [20:45:59] quick notes about these kind of ops requirements that slow you down [20:46:06] especially relevant to things that are quick development stuff [20:46:17] like, if you had to run a cron using virtualenv packages for like a week or something [20:46:21] just to dig into an issue to answer a question [20:46:23] things like that [20:46:29] milimetric: yeah, pretty much. [20:46:48] ottomata: just having virtualenv installed would fix a lot of things, IMO [20:46:55] ottomata: sure, one question: does the proposed change mean that virtualenvs / local user installs are no longer allowed? [20:46:55] but virtualenv sortof 'bypasses' puppet, I guess [20:46:57] (in a way) [20:47:27] i'm not sure, kind of? [20:47:35] k [20:48:00] ottomata: have you looked into a packaged python scientific computing set up? [20:48:05] ottomata: like enthought or anaconda? [20:48:28] apt-cache search enthought: … python-enthoughtbase … [20:48:38] https://wikitech.wikimedia.org/wiki/Server_access_responsibilities [20:48:43] no [20:48:57] Anything that changes the state of a server outside of your home directory should be done by puppet.[https://en.wikipedia.org/wiki/Wikipedia:Please_clarify] This allows it to be peer reviewed. [20:48:59] so [20:49:03] i guess if you are doing stuff in your homedir, you are fine [20:49:36] Home directories are good places for storing your personalized configuration files (like .bashrc ) or temporary reasonably small nonsensitive files that you are working on, for commands such as grep NAME temp1 | sort | tee temp2 . Any files or data needed for the functioning of a program or the site should be in puppet. [20:49:47] where does cron come in? [20:49:54] also, in that case, we *should* have python-virtualenv added [20:50:04] average_drifter, still around? [20:50:42] to get python-virtualenv added, can you guys write up something for the reasons why? like, pandas + limnpy to run random stuff, or whatever [20:50:52] and send it to the ops list, or maybe make a wiki page and link to it [20:50:53] ?
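Where cron comes in, concretely: a puppetized job ends up as a root-owned file like the sketch below rather than a user crontab. Everything here is an assumption (schedule, user, log paths, rsync module name) except the script invocation, which is the one quoted above:

    # what puppet might render to /etc/cron.d/limn-mobile-data (hypothetical)
    sudo tee /etc/cron.d/limn-mobile-data >/dev/null <<'EOF'
    # m h dom mon dow user command
    0 * * * * stats /home/yuvipanda/mobile-uploads/limn-mobile-data/update-mobile.bash /home/yuvipanda/mobile-uploads/limn-mobile-data/mobile/ >> /var/log/limn-mobile.log 2>&1
    30 * * * * stats rsync -rt /a/limn-public-data/ stat1001.wikimedia.org::limn-public-data/ >> /var/log/limn-rsync.log 2>&1
    EOF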
[20:51:07] ottomata: do you think we could get it done tonight? [20:51:12] approved, puppetized, deployed? [20:51:16] ottomata: I'm here [20:51:42] ottomata: i can definitely help with writing use cases. what is the time frame [20:51:48] because if not, I'd rather just take these sql queries and run them directly to generate csvs [20:51:58] i'm still not entirely sure what you are doing, but sure [20:52:01] i'm working for at least another hour today [20:52:03] it is possible that ops will have other objections? I'm unsure if pip/easy_install do ssl by default or even signature verification [20:52:24] erosen, time frame on that writeup is up to you, within a week or two would be good so you can catch the ops discussion while it is happening [20:52:38] ok, here's what we're doing, erosen, ottomata, SlightlySadPanda, we should all be on the same page here [20:52:41] k [20:52:43] ottomata: when you 'import foo' in python, python looks for 'foo' in several system-wide paths by default. virtualenv allows you to create a python environment that is more or less self-contained in a single directory, usually mounted under ~/. [20:52:44] ottomata: cool, i was just wanting to make sure it wasn't something for tonight [20:53:09] 1. execute SQL statements [20:53:10] 2. create CSV datafiles [20:53:10] 3. host publicly [20:53:19] :D [20:53:22] right now, we have a solution that uses Limnpy to do 1 and 2 [20:53:26] ottomata ottomata [20:53:32] finite ottomata [20:53:34] humbert humbert! [20:53:36] milimetric: i'm moving it back to csv. give me 10 mins [20:53:39] yes, maybe 0.5. do aggregation / custom logic [20:53:55] SlightlySadPanda: it's ok, I can do it too [20:54:06] milimetric: I think it is more accurate to say that limnpy does 2 + creates datasources / graphs [20:54:34] right, I agree, but in this case we don't need that [20:54:41] hmm, there's also jinja2 [20:54:58] oh, right, that's how you're templating your sql [20:55:04] milimetric: okay, I'll leave you to it :) [20:55:12] milimetric: yes, but that's removable if you want. [20:55:16] * milimetric likes jinja2 [20:55:17] I used jinja2 a bit, it was cool [20:55:18] or we could just add python-jinja2 [20:55:23] milimetric: true, we are only writing datafiles, [20:55:24] to the puppet thingy [20:55:30] hehe, average_drifter, i think I need x86_64 build of udp-filter [20:55:39] root@gadolinium:~# uname -a [20:55:40] Linux gadolinium 3.2.0-38-generic #61-Ubuntu SMP Tue Feb 19 12:18:21 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux [20:55:42] well, right now jinja2 is puppetized so I'm ok with it [20:55:49] milimetric: but we are not doing the sql (step 1) with limnpy [20:55:55] oh, hm, maybe not? [20:56:09] i'm not seeing where the sql runs then [20:56:13] does amd64 report as x86_64? [20:56:13] milimetric, do we still need to bump the RT related to #68? https://rt.wikimedia.org/Ticket/Display.html?id=4730 [20:56:18] ottomata: yes [20:56:22] ottomata: AFAIK [20:56:36] milimetric: https://github.com/wikimedia/limn-mobile-data/blob/master/generate.py#L43 [20:56:41] no kraigparkinson, things changed a lot on that issue [20:56:44] ok sorry about that, thanks [20:56:44] milimetric: it uses MySQLdb [20:56:55] ok, milimetric, SlightlySadPanda [20:57:05] mysql can generate .csv files like you want pretty easily methinks [20:57:18] milimetric, does that mean we can close it and strike out the reference to it in the card? [20:57:24] gotcha erosen, so then dumping the rows into a csv is easy after that line?
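A sketch of the mysql-generates-CSV idea that works from a remote client, with no server-side file access needed: run the query in batch mode and reshape the tab-separated output. The hostname suffix, database, and query are assumptions, and the naive tab-to-comma swap breaks on fields that themselves contain tabs or commas:

    # batch mode prints tab-separated rows with a header line; nothing is
    # written on the database host, so remote slaves are fine
    mysql --defaults-extra-file="$HOME/.my.cnf.research" \
          -h s1-analytics-slave.eqiad.wmnet enwiki --batch \
          -e 'SELECT DATE(rc_timestamp) AS day, COUNT(*) AS edits
              FROM recentchanges GROUP BY day' \
      | tr '\t' ',' > /a/limn-public-data/mobile/datafiles/edits.csv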
[20:57:31] yup [20:57:38] http://dev.mysql.com/doc/refman/5.1/en/select-into.html [20:57:38] kraigparkinson I'll update you in a sec, we're still figuring out what's going on [20:57:46] thanks. :) [20:57:51] Here is an example that produces a file in the comma-separated values (CSV) format used by many programs: [20:57:51] SELECT a,b,a+b INTO OUTFILE '/tmp/result.txt' [20:57:51] FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' [20:57:51] LINES TERMINATED BY '\n' [20:57:51] FROM test_table; [20:58:13] right, ok, so that's an option too [20:58:16] ottomata: i don't think you can do that with the cluster [20:58:22] ottomata: because they are remote machines [20:58:30] why not? it's just a select, ohhh right, outfile is local to that machine [20:58:35] most users don't have write access on the s* [20:58:38] pssh, ok you gotta script it then [20:58:43] yeah [20:58:43] gotcha [20:58:46] it's ok, we have the script [20:58:56] milimetric: i think you're on the right track [20:59:01] we just need to output the rows ourselves instead of having limnpy do it [20:59:07] milimetric: if you have the graphs/sources generated then there is no reason to use limnpy [20:59:10] just use the csv module [20:59:13] basically delete starting with line 44 and replace [20:59:22] mysqldump -u root -p --fields-terminated-by="," --fields-enclosed-by="" --fields-escaped-by="" --no-create-db --no-create-info --tab="." information_schema CHARACTER_SETS [20:59:39] mysqldump can generate CSV [20:59:40] oo, nice [20:59:43] :D [20:59:45] does mysqldump work across servers? [20:59:53] milimetric: with ssh it does [21:00:10] as in, will it dump the file to my local machine if i execute against a remote cluster? [21:00:11] ssh user@host /usr/bin/mysqldump ... [21:00:25] milimetric: where you want [21:00:41] erm wait [21:00:41] hmm, i talked to asher about this, and he seemed skeptical of an SQL dump approach, but that doesn't mean it isn't possible [21:00:52] ok, too many cooks [21:00:55] hehe [21:00:58] :) [21:01:04] thanks guys, but I think I got this under control [21:01:08] I'll take the fail on it if not [21:01:54] SlightlySadPanda: where can generate.py be tested? Any computer? Labs? [21:01:57] milimetric, it will dump to your local [21:02:05] your local [21:02:10] milimetric: it needs access to s1-analytics-slave [21:02:16] milimetric: and an appropriate virtualenv [21:02:35] ok, but i'm trying to get rid of the virtualenv requirement so I'm ok without that [21:02:42] ok [21:02:45] so the s1-analytics-slave, ottomata, how do I get access to that? [21:02:56] you need to be 'in cluster', so no labs [21:02:59] there is a research user and pw, do you have it? [21:03:02] both stat1001 and stat1 have it [21:03:04] right [21:03:06] can't do it from labs [21:03:15] yeah, and put your research user and pw in ~/.my.cnf.research [21:03:28] gotcha, I can borrow Diederik's [21:04:08] ok, gotta run for a bit, but I'll be back. Ultimately I'll make this work and push a git review with the puppet changes needed to run it via cron [21:07:01] * SlightlySadPanda sighs [21:08:50] ottomata: do you have a recommended VM tool? [21:08:57] ? [21:09:04] for your mac? [21:09:07] yeah [21:09:16] not really, I use VMware but don't really like it [21:09:22] sorry, if that isn't the right name [21:09:24] and it isn't free [21:09:35] yeah, i used to have parallels [21:09:36] vagrant is supposed to be super cool, right?
[21:09:41] there is also virtualbox [21:10:19] yeah those are the two I was hoping you would advise between [21:11:33] vbox is awesome imho [21:11:43] average_drifter: cool [21:11:51] if I were to start from scratch right now [21:11:54] i would try vagrant [21:11:58] cause I want to do it all via the cli anyway [21:12:10] yeah that is my goal [21:12:17] k, I'll let you know how it goes [21:17:47] actually vagrant is awesome, but I got tangled up in multiple machine configs [21:17:53] I mean that was ok also [21:18:06] but when it came to file transfer to-fro vm on vagrant/host [21:18:10] that was a bit of a hassle [21:18:19] * SlightlySadPanda doesn't think he can run vagrant easily, being on a macbook air and all [21:18:19] I had to do vagrant ssh_config [21:18:47] SlightlySadPanda: erm, vagrant is just ruby, bundled with some puppet and some other juice [21:19:01] SlightlySadPanda: and vbox ofc [21:19:16] yeah, vms on my machine are super slow [21:19:20] plus I've like 4G of free space [21:19:28] android emulators take up a *lot* of space [21:42:09] looks like milimetric disappeared. ottomata, I'll ask you the same q I asked milimetric: do we still need to bump the RT related to #68? https://rt.wikimedia.org/Ticket/Display.html?id=4730 [21:42:22] hi kraigparkinson [21:42:22] I'm here [21:42:32] ooh, ktp! new shorter and improved [21:42:40] sorry, had to run out, just got back [21:42:40] :) trying to make it easier on you. [21:42:50] IRC nick golf! [21:42:52] we have tab completion, so it's ok either way [21:42:59] yay for tab completion :) [21:43:07] so, the RT on that is no longer valid [21:43:20] do we need to close it? it's still open. [21:43:23] TAB COMPLETION!!! [21:43:23] we are using a work-around. otherwise it won't get done by tomorrow [21:43:24] I DID NOT KNOW [21:43:26] :) [21:43:32] yuvi taught me [21:43:47] we can close it, it's no longer valid [21:44:24] it'd be nice to track this fumbling around though [21:44:47] which status should I give it: stalled, resolved, deleted? [21:45:05] i'm not sure how it happened but I counted six people who got involved with this issue [21:45:12] :) [21:45:14] I would say stalled [21:45:30] ottomata? does that make sense? [21:45:57] uhhh, i don't entirely understand, you're still rsyncing to stat1001, right? [21:46:06] as in would we still do this if we could? [21:46:07] that to me says "we basically tried to do this thing, but it wasn't done for a while so we had to do something else". It might be that it was an invalid request to begin with [21:46:21] but we didn't get that communicated directly to us [21:46:27] what? I didn't teach anything! I'm utterly useless! I swear! :) [21:46:30] no more rsyncing [21:46:43] we will run sql from stat1001 to the clusters [21:46:52] and generate the files locally on stat1001 [21:47:16] does this workaround represent technical debt? [21:47:31] mmm... gray rea [21:47:32] *area [21:48:05] SlightlySadPanda: speaking of teaching me, what's the format of this .my.cnf file? Never worked with mysql [21:48:16] ottomata, does that make sense to mark it as stalled? [21:48:19] milimetric: INI [21:48:20] milimetric: http://blog.bigsmoke.us/2010/08/11/setting-password-for-mysql-user-in.my.cnf [21:48:32] cool, makes sense, thx [21:48:42] y u no rsync? [21:48:46] I'd assume that stalled implies that it's still expecting to be implemented at some point.
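Returning to the vagrant thread above, the bootstrap really is all CLI. A minimal sketch using the box name and URL as they were distributed at the time:

    gem install vagrant    # or the standalone installer of the day
    vagrant box add precise64 http://files.vagrantup.com/precise64.box
    mkdir puppet-sandbox && cd puppet-sandbox
    vagrant init precise64
    vagrant up             # boots the VirtualBox VM
    vagrant ssh            # shell in; the project dir is shared at /vagrant
    vagrant ssh-config     # host/port/key details, handy for scp and friends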
[21:49:26] running on stat1 and rsyncing *might* make ops happier [21:49:32] i don't have a big preference though [21:50:55] ottomata, does that make sense to mark it as stalled? [21:51:39] running on stat1 and rsync will also make our lives simpler :P [21:52:05] ktp, sorry, trying to understand [21:52:21] np. [21:52:30] now you got me confused ottomata [21:52:42] i think it is not stalled, since they will need puppetization to do the rsync, if they are doing it, actually, any of it needs to be puppetized, and this RT is tracking a particular gerrit changeset (or maybe more than one…?) [21:52:48] woo, jar hell is resolved [21:52:50] so, it is not closed or stalled, but being worked on, I believe [21:52:52] yay! [21:52:56] milimetric, what's up? [21:53:05] why would i need to rsync? [21:53:10] i think if you want to run the script from stat1001, that is ok, but it would be better to put that on stat1 [21:53:18] the initial reason was because we thought we could set up limnpy on stat1 [21:53:19] congrats dschoon, full speed ahead for #244 & #68? :) [21:53:20] since stat1 is for general numbah crunching and general use packages [21:53:22] and you have an account there [21:53:26] but now we can't set up limnpy anywhere without a ton of work [21:53:30] whereas stat1001 is more of a production webserver [21:53:37] nice nick, ktp [21:53:39] that's fine [21:53:51] danke, dschoon. [21:53:52] but kicking off the sql on a remote server is not number crunching [21:53:59] yep [21:54:12] it actually limits the amount of work being done overall and keeps it the same on stat1001 [21:54:16] yeah, i know, but it is a process on the production webserver that will run, one which it might be easier for you (and others?) if you had access to, to check up on [21:54:19] dschoon, what was your solution? [21:55:27] updated the deps in the poms, moved the jars to a new directory, and then specified the GeoIP databases via the distributed cache rather than the libpath [21:55:39] (we should generally be doing that with resource files) [21:56:49] that's cool ottomata, I'll set it up on stat1 and rsync to stat1001 [21:56:59] but btw, i have an account on stat1001 too, so if you want to kill that, that's fine by me [21:57:21] nono, i personally don't care, and if you have one, keep it [21:57:32] but maybe yuvipanda will want to check on this? or erosen? [21:57:39] someone mention me? [21:58:03] if you really want to put it on stat1001, i think it's fine, it just seems more proper to put it in stat1 [21:58:29] i have no preference at all, so what you say goes then [21:58:40] yuvik: i'm not understanding how to use this .my.cnf file... [21:58:47] what goes in the [this thing]? [21:58:54] milimetric: step 1: create ~/.my.cnf.researcher [21:58:55] [log]? [server url]? [21:59:07] it can go anywhere, but convention puts it in your home folder [21:59:21] nono, the square brackets defining the section where you put user and password [21:59:26] what goes in there? [21:59:30] ah [21:59:31] wait [21:59:37] so i've got this: [21:59:43] dschoon, could you update this card and mark it as resolved? :) https://mingle.corp.wikimedia.org/projects/analytics/cards/377/edit?coming_from_version=6 [21:59:44] [something] [21:59:45] user=research [21:59:47] password=blah [21:59:52] what's the something [21:59:52] https://mingle.corp.wikimedia.org/projects/analytics/cards/377 [21:59:54] something = 'client' [22:00:02] ktp: yep. i'll do a pass at the end of the day [22:00:04] lol, what?!
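The file being pieced together above, in full: [client] is just the section that every MySQL client program reads (mysql, mysqldump, Python's MySQLdb). A sketch with the values from the conversation; the password is obviously a placeholder, and the chmod matters given the world-readable worry that comes up later:

    cat > ~/.my.cnf.research <<'EOF'
    [client]
    user=research
    password=blah
    EOF
    chmod 600 ~/.my.cnf.research   # keep the password out of world-readable land

    # any client can then pick it up explicitly:
    mysql --defaults-extra-file="$HOME/.my.cnf.research" -h s1-analytics-slave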
[22:00:06] gracias [22:00:07] [travis-ci] master/90c3957 (#99 by dsc): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5640279 [22:00:09] [client] [22:00:09] what in the world is client? [22:00:12] user=research [22:00:15] password=blah [22:00:22] mysql client? [22:01:08] haha, that makes no sense [22:01:14] * milimetric thinks mysql is strange [22:01:18] :P [22:01:27] well, it is common shared format for all database clients [22:01:46] milimetric: we should move to the python script soon. We (app team?) want to use mongodb. [22:02:19] naw, it makes sense milimetric [22:02:29] mysql can have defaults for all its binaries [22:02:36] and they can be in the same config file [22:02:53] /etc/my.cnf, or whatever [22:03:10] often my.cnf will have a buncha config sections like that, with specific configs that apply to certain binaries [22:03:14] i think it has global settings too [22:03:46] oh i'm sure it makes sense, it's just not intuitive to me. I'd expect maybe like my.client.cnf or something [22:04:13] and then the [sections] to mean something else [22:04:31] yuvik: what do you mean by mongodb? [22:04:32] milimetric, sorry I dropped out for a sec and I'm not sure if I missed it, do we have any tech debt associated with our solution for #68? [22:04:49] milimetric: not in the near future, but within the next 4 weeks [22:04:51] no kraigparkinson, we're not generating any new technical debt [22:04:53] milimetric: use https://www.mediawiki.org/wiki/Extension:EventLogging/MongoDB [22:05:11] * yuvik sets up a technical cliff, seeds stories about it  [22:05:18] thanks [22:05:45] i'm not sure what you mean by that yuvik, why is that related to the python script? [22:05:48] sorry if i'm being slow [22:05:57] milimetric: so the python script currently connects to mysql [22:06:05] *at some point in the future*, it will connect to mongodb instead [22:06:08] no mysql [22:06:26] but let's forget about it for now [22:06:32] gotcha [22:06:48] (another reason I would prefer this to be in stat1, much lower overhead for changes) [22:06:49] no i follow, but what do you mean "move to the python script"? [22:07:07] milimetric: are you *sure* that we can not just rsync from stat1 without risking loss of life / limb / cluster-access? [22:07:08] no, everything has to be puppetized, whether it's in stat1 or stat1001 [22:07:33] ottomata was saying that stat1 is 'number crunching server', etc... [22:07:47] right, that doesn't mean stuff can be on there without puppet manifests [22:08:01] right [22:08:15] if this is something that is going to be there regularly, it's gotta be in puppet, especially if it is going to rsync something to a prod web server [22:08:17] or, I should say things this way: I am deathly afraid of ever doing anything on any production cluster without puppetizing it [22:08:32] if you are just doing random things that only modify files in your homedir, you can probably get away without puppetizing [22:08:48] right, which is what i'm doing now to test out the approach [22:08:53] but when i got it solid, it's going into gerrit [22:09:02] :) ok [22:09:07] is that ok btw ottomata? [22:09:11] testing on stat1? [22:09:32] because i don't know what environment i'll have there so i want to make sure i get everything i need for puppet [22:09:56] also, can I have puppet clone a repository if a directory does not exist? [22:10:17] or is that insecure?
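On that closing question: what puppet typically wraps is exactly this idempotent clone-or-update, via an exec with a creates guard or a vcsrepo resource. A bash sketch; the target directory and the anonymous clone URL are assumptions:

    REPO_DIR=/a/limn-mobile-data    # hypothetical deploy target
    REPO_URL=https://gerrit.wikimedia.org/r/p/analytics/limn-mobile-data.git

    if [ ! -d "$REPO_DIR/.git" ]; then
        git clone "$REPO_URL" "$REPO_DIR"          # first run: directory absent
    else
        ( cd "$REPO_DIR" && git pull --ff-only )   # later runs: just update
    fi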
[22:11:43] milimetric: we should perhaps also move limn-mobile-data to gerrit [22:12:46] it's your repo yuvik, you can do as you like [22:12:58] milimetric: no, IIRC anything 'deployed' on cluster should be on gerrit [22:12:59] drdee: yo [22:13:01] rather than github [22:13:07] I'm unsure where I read that, however [22:13:15] but gerrit and I don't get along because gerrit chokes when you use git-flow or any decent branching methodology for that matter [22:13:58] if puppet is going to have git urls in it, you probably won't get it approved unless they are to gerrit [22:14:12] yeah, I think so [22:14:19] milimetric: yup. gerrit sucks. [22:14:31] it doesn't suck. It's just not as awesome as git [22:14:35] drdee: i redid https://mingle.corp.wikimedia.org/projects/analytics/cards/92 and made it even simpler. i'll write a separate geo card. [22:15:01] and I'm fanatic about my lovely git: https://github.com/JusticeParty/founding-documents [22:15:14] (I'm the acting chair of that political party, we host all our stuff on github) [22:15:23] milimetric: wow! :) [22:15:36] milimetric: there's no reverse mirroring, though (no GH -> Gerrit) [22:16:28] yeah, i mean, i think these repositories should be outside of the review process imo. It would slow things down for no reason. Contributions are more important than review in collaborative analytics [22:16:51] mistakes aren't really a concept we need to worry about, because it'd just generate wrong data and we'd fix it [22:17:18] yup, but as ottomata says I am pretty sure anything in puppet that clones a github url will be red-flagged [22:22:43] but! [22:22:52] hm [22:22:58] milimetric, you can commit to analytics gerrit repos without review [22:23:04] just not ops ones [22:23:08] but yeah, it sucks [22:23:29] you can do a 'deploy' without puppet though [22:23:34] not sure what that means [22:23:42] At this point I'd say Gerrit tries to take git and pretend to be svn, but meh. too late [22:23:42] i guess, use puppet to create an /a/whatever directory [22:23:46] and then deploy your repo to it? [22:34:38] heh, ottomata [22:34:46] is scowl a renewable resource? [22:34:56] is it easily stored for later use? [22:36:51] would be nice [22:36:56] perpetual scowl machine [22:37:00] * YuviPanda hoards some scowl [22:37:14] that's cool ottomata, i'll make the scripts templates in puppet and we'll have to manage it that way instead of from the repo [22:37:18] is that ok YuviPanda? [22:37:38] which scripts? [22:37:39] the SQL? [22:37:40] that's cool [22:37:44] to make the csvs, yea [22:37:54] generate.py, the config.yaml, and the sqls [22:38:11] i'll make a new folder under templates/misc in the puppet repo [22:38:44] * YuviPanda is a little sad [22:42:05] !log deployed udp-filter 0.3.22 on gadolinium (thanks average_drifter) [22:42:25] oh no, why'd I make you sad YuviPanda? [22:42:50] ha! i love it! [22:42:56] unrelated scripts in repos make me sad [22:42:58] about 5% packet loss on gadolinium since it was deployed [22:43:00] interesting! [22:43:04] ok, i'm signing off [22:43:05] also kittens :P [22:43:07] (just kidding) [22:43:09] ottomata: what was the packet loss before? [22:43:22] it's that on locke and oxygen too, this probably means that my diagnosis was wrong [22:43:33] but at least I have a machine I can play with now (locke is still running) [22:43:50] ok, i'm out [22:43:51] laters all!
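A footnote on the commit-without-review point from earlier: in gerrit the difference is only which ref you push to. A sketch; the ssh remote is gerrit's usual form and the account name is a placeholder:

    git remote add gerrit ssh://USERNAME@gerrit.wikimedia.org:29418/analytics/limn-mobile-data.git

    git push gerrit HEAD:refs/for/master     # creates a change in code review
    git push gerrit HEAD:refs/heads/master   # lands directly, where the repo's ACL allows it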
[22:46:52] i agree with you YuviPanda, that's kind of crazy we have to do this [22:47:01] kraigparkinson: I changed my mind - there is now technical debt [22:47:05] i'd rather just make a repo in gerrit [22:47:13] but, 'for now' this will do, I guess [22:47:16] * YuviPanda sighs again [22:47:21] no it's ok, we can do that [22:47:31] we'd have to poke ^demon, I guess? [22:47:35] he might not be still around [22:47:36] i'll push my change when it's sort of working and you can move it to gerrit? [22:47:43] oh really? we can't just create a repo? [22:47:48] :/ [22:48:40] don't think so, no [22:48:53] that was the case when I last checked, and I don't think that has changed [22:49:21] milimetric: i'm asking him on -dev [22:49:30] cool, thx [22:50:30] milimetric: asking for it to be analytics/limn-mobile-data [22:50:39] cool, works for me [22:54:14] milimetric: done [22:54:29] thanks YuviPanda, pushing now [22:54:40] ssh is insanely slow when you :w from vim [22:54:49] like it takes 30 seconds to refresh the screen [22:55:04] milimetric: heh [23:00:11] uh, YuviPanda, I pushed and wanted to migrate this to gerrit [23:00:15] you in the middle of anything? [23:00:24] just testing some stuff on the android app [23:00:26] https://github.com/wikimedia/limn-mobile-data/blob/master/generate.py [23:00:27] anything I can do to help? [23:01:05] no it's all good, i'll just migrate, update the deployment stage to pull from gerrit, and delete the github repo [23:01:12] ok [23:01:34] hmm [23:01:34] https://gerrit.wikimedia.org/r/gitweb?p=analytics/limn-mobile-data tells me nothing [23:02:00] yea, hopefully i can find the right url to push to [23:04:47] https://gerrit.wikimedia.org/r/gitweb?p=analytics/limn-mobile-data.git [23:04:54] YuviPanda: there you go [23:08:45] milimetric: sorry, shitty internet [23:08:48] milimetric: did I miss anything? [23:08:53] https://gerrit.wikimedia.org/r/gitweb?p=analytics/limn-mobile-data.git [23:09:51] ok, so YuviPanda, is there a folder on stat1 where we can put the .my.cnf.research thing? [23:10:11] milimetric: good question. I'm unsure where to put that for puppet, since it has password info [23:10:31] our puppet can fetch that password from a private repo, so don't worry about that [23:10:48] but i'm not familiar with linux file structure or where the file should end up, location wise [23:10:57] usually it is in the home folder [23:11:40] I'd prefer an absolute path so I don't screw up which user owns the file / runs the file / has it in their home [23:11:52] and so it can be used by the other cron job i'm running [23:12:25] oh could it be in /a [23:12:26] ? [23:16:53] shitty connections suck [23:16:55] milimetric: sorry again [23:17:03] milimetric: generate.py looks at ~/.my.cnf.research. can be anywhere [23:17:08] no prob, just wondering if it can be in /a/.my.cnf.research [23:17:23] that'd be cool 'cause i could reuse it in the other cron I made [23:18:48] milimetric: no reason it shouldn't be, but if it is world readable, be careful [23:18:56] i'm unsure what the security implications of putting a password there are [23:19:20] yeah, i'm bad at user permissions, but i will ask for that explicitly when I do the gerrit review [23:19:31] ok [23:22:57] milimetric: i'll be going to sleep in a little bit [23:23:07] cool, no prob Yuvi [23:23:13] I can finish up the puppet stuff on my own [23:23:21] the scripts run great, nice job on that btw [23:23:30] ttyl [23:24:01] milimetric: thanks [23:31:49] milimetric, you were saying we do have some tech debt to handle?
[23:31:52] do tell. [23:32:47] average_drifter: I have a feeling you would like this: https://github.com/holman/spark [23:32:54] looking [23:32:58] this is the craziest environment ever, sorry kraigparkinson [23:33:05] no more tech debt once again [23:33:15] because the hack I was going to do to create the debt is not allowed [23:33:24] hehe [23:33:32] i'm not sure how this should be tracked though [23:33:51] it's technically a bunch of work that we had to do because the ops policy was not well understood / defined when we estimated the task [23:34:01] and to some extent, still is not well understood [23:34:25] hehe, yeah that's pretty good, graphs in the shell [23:34:38] no worries, milimetric [23:35:10] k, gtg, dinner, be back in a bit to finish this puppetization up [23:35:39] How about I capture it as a problem card? Title is "ops policy with respect to [insert policy here] was not well understood / defined when we estimated the task". [23:36:02] just tell me what to fill in as the blank, and I'll get it captured. [23:56:48] I guess [insert policy here] could be "deployment to production cluster" [23:56:53] kraigparkinson: ^^ [23:57:44] deployment of anything in particular? [23:58:12] any kind of artifact? a certain type of artifact?
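And for the holman/spark link that went by above: it turns any list of numbers into a one-line sparkline, which is the whole joke about graphs in the shell. Usage per its README; exact glyphs vary:

    spark 0 30 55 80 33 150
    # prints something like: ▁▂▃▅▂▇

    # reads stdin too, so it composes with the usual pipeline
    echo 9 13 5 17 1 | spark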