[13:15:24] gooood morning!
[13:15:32] I'm going to head out early-ish today because of my flight
[13:16:03] so if you need the milimetric, get your fix :)
[13:16:28] charles-salvia: do you go by csalvia now?
[14:22:20] (PS1) Milimetric: Makes filtering better [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/108913
[14:22:32] (CR) Milimetric: [C: 2 V: 2] Makes filtering better [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/108913 (owner: Milimetric)
[14:27:43] :) I'm going through templating engines for the RFC discussion and it occurred to me that I was being silly ^
[14:47:07] yes... I should really kill my alter-ego
[14:51:41] hmm, milimetric, is there an easy way I can undo what scripts/install did?
[15:23:16] (PS1) Ottomata: Adding #!/bin/bash to top of install script [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/108919
[15:23:34] (PS2) Ottomata: Adding #!/bin/bash to top of install script [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/108919
[15:23:39] (CR) Ottomata: [C: 2 V: 2] Adding #!/bin/bash to top of install script [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/108919 (owner: Ottomata)
[15:52:20] ok milimetric, csalvia, nuria, you can make wikimetrics vagrant work for you right now
[15:52:33] there are a few patchsets that need to be merged by ori
[15:52:41] but if you want it to work right now so you can use it, i can show you how
[16:05:16] Ottomata: Can we do this in say 15 mins, once Dan and I have a little time to talk about handlebars?
[16:12:06] yup
[16:12:13] actually, gimme another 10
[17:21:43] ottomata: what if we threw hadoop error logs into Hive? would that be weird?
[17:21:49] just sayin..
[17:24:05] then we could say things like SELECT error_full_line FROM error_table WHERE error LIKE "DclassUDF" AND error_date >= 2014... AND error_date <= 2014...;
[17:24:18] LIKE "%DclassUDF%"
[17:24:21] if I'm not mistaken
[17:24:37] but it's just an idea, grep would be nice too
[17:25:42] actually maybe error logs are located on hdfs?
[17:27:16] ha, they are!
[17:27:18] in fact!
[17:27:52] average
[17:28:00] ottomata: so they are in hdfs
[17:28:00] hdfs dfs -ls /var/log/hadoop-yarn/apps/spetrea/logs/
[17:28:09] ah, that's interesting
[17:28:34] hdfs dfs -cat /var/log/hadoop-yarn/apps/spetrea/logs/application_1387838787660_0844/*
[17:28:46] if you want to map a hive external table in your own database on top of them, go right ahead! :p
[17:28:58] but, actually, i think we'd like to one day throw the logs into logstash
[17:29:09] once it is more ready in production, and we have more people trying to do this
[17:29:59] actually, i'd really like to make a tool to make this easier
[17:29:59] hm
[17:45:24] ottomata: can I ask something else?
[17:45:50] ottomata: how does this know the start/finish time of jobs? http://analytics1010.eqiad.wmnet:8088/cluster
[17:46:05] ottomata: does it parse them in advance and store this somewhere?
[17:47:18] oh wait, maybe it just takes
[17:47:19] head -1
[17:47:20] and
[17:47:21] tail -1
[17:47:26] and figures it out that way
[17:52:46] i think yarn keeps track of it, no?
[17:52:52] there is a process
[17:52:55] called resourcemanager
[17:52:57] if I restart it
[17:53:12] all the info about those jobs in that interface will be lost
[17:53:57] ah ok
[17:55:11] average, what is the application id you are currently troubleshooting?
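
A hedged sketch of the external-table idea floated above ("map a hive external table ... on top of them"): everything here is illustrative, not the team's actual setup. The table name, columns, and HDFS path are hypothetical, and the sketch assumes the error logs were first flattened into tab-separated plain text, since the aggregated YARN logs under /var/log/hadoop-yarn/apps are not stored as plain text.

    # Hypothetical: error_table, its columns, and the LOCATION path are
    # illustrative; assumes logs were pre-flattened to TSV text on HDFS.
    hive -e "
    CREATE EXTERNAL TABLE IF NOT EXISTS error_table (
      error_date      STRING,
      error           STRING,
      error_full_line STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/user/spetrea/error_logs_text';

    SELECT error_full_line
    FROM error_table
    WHERE error LIKE '%DclassUDF%'
      AND error_date >= '2014-01-16'
      AND error_date <= '2014-01-17';
    "
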
[17:57:07] ottomata: 1387838787660_0837
[17:57:18] ottomata: I tried this
[17:57:35] spetrea@analytics1026:~$ hdfs dfs -lsr /var/log/hadoop-yarn/apps/spetrea/logs/ | grep 1387838787660_0837
[17:57:38] lsr: DEPRECATED: Please use 'ls -R' instead.
[17:57:41] drwxrwx--- - spetrea hadoop 0 2014-01-21 12:03 /var/log/hadoop-yarn/apps/spetrea/logs/application_1387838787660_0837
[17:57:44] -rw-r----- 3 spetrea hadoop 328 2014-01-21 12:03 /var/log/hadoop-yarn/apps/spetrea/logs/application_1387838787660_0837/analytics1012_8041
[17:58:05] I looked in that, but it was really small, wasn't able to find much
[17:58:14] that was a job that ran for like 1h and then failed at the very end
[17:58:57] whoa that is weird
[17:59:50] I looked at that file, no idea what I should use to read it
[18:00:25] -cat
[18:00:28] but yeah, there isn't much there
[18:02:21] ok, let me try another one, I had a bunch of these jobs
[18:02:23] i have not seen this before...
[18:03:57] hm, they're all the same
[18:04:02] these jobs
[18:04:53] same error log file, with 328 bytes in it
[18:05:48] they all look like they ran for about 10 minutes
[18:06:52] 1387838787660_0827 took 3 hours and failed
[18:07:34] http://analytics1010.eqiad.wmnet:19888/jobhistory/logs/analytics1017:8041/container_1387838787660_0844_01_000001/job_1387838787660_0844/spetrea
[18:08:10] http://analytics1010.eqiad.wmnet:19888/jobhistory/logs/analytics1018:8041/container_1387838787660_0827_01_000002/job_1387838787660_0827/spetrea
[18:08:15] same error
[18:08:15] hm
[18:10:10] https://issues.apache.org/jira/browse/MAPREDUCE-4794
[18:10:11] the last link you gave me has a lot of stuff in it
[18:10:22] the one for 1387838787660_0827
[18:10:54] hmm, think that is irrelevant (the jira link)
[18:11:28] ah here we go
[18:11:29] http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_0827/m/FAILED
[18:12:02] unrelated-to-this-conversation question: how long does it take for changes to the analytics puppet manifests to, well, manifest? ;p
[18:12:58] well, they need to be approved and merged, if you are thinking of the one you submitted :)
[18:13:50] ottomata: I just looked at that, hmm, java.io.FileNotFoundException: File does not exist: hdfs://kraken/tmp/hive-spetrea/[...]
[18:14:20] hmn; I thought it was +2'd...
[18:14:29] no, I'm just crazy
[18:14:38] this is why I shouldn't trust things I learn at 8am ;p
[18:15:01] ottomata: maybe one of the nodes didn't have the data on it?
[18:15:14] hmmm
[18:15:21] maybe i dunno, but hmm, no this is an hdfs url
[18:15:23] doesn't matter
[18:15:26] i'm looking in that dir now
[18:15:40] ok
[18:18:48] yeah dunno
[18:18:50] ahh, will come back to this
[18:18:53] this is mysterious
[18:18:55] lunchtime
[18:35:21] average
[18:35:23] hm
[18:35:31] what is the query you are running?
[18:35:46] i wonder if it is trying to select against data that might be in the process of being written
[18:39:03] ottomata: pm-ed the query
[18:39:51] but it had a WHERE clause limiting it to 16-January-2014
[18:41:02] yea
[18:41:06] looks fine
[18:49:43] http://www.youtube.com/watch?v=P40akGWJ_gY (part 1 is also nice)
[19:04:49] you want to meet?
[19:14:12] average, so this query works in your local install?
[19:14:16] what about on a test table
[19:14:17] like
[19:14:20] maybe the one in my database
[19:14:24] otto.webrequest_mobile;
[19:14:28] i only have one partition in there
[19:14:33] does the query run against that partition?
[19:15:57] I can run it on that partition
[19:16:07] so it works with a small dataset?
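
A hedged aside on the log hunt above: instead of piecing together per-container jobhistory URLs, the aggregated logs for a finished application can usually be fetched in one go with the yarn CLI (assuming the cluster's Hadoop version ships the logs subcommand; the grep pattern is only an example).

    # Fetch all aggregated container logs for the failed application and
    # search them; -C 5 prints five lines of context around each match.
    yarn logs -applicationId application_1387838787660_0827 \
      | grep -i -C 5 'exception'
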
[19:16:17] on my local hive it works ok
[19:16:24] ok, what about my table?
[19:16:30] trying now
[19:16:30] in production?
[19:16:32] k
[19:16:33] do
[19:16:40] show partitions otto.webrequest_mobile
[19:16:45] to show the partition that is there
[19:16:47] and change your where to use it
[19:19:55] ottomata: it's running
[19:20:46] can I control explicitly the number of mappers? (apparently for reducers this is possible through mapred.reduce.tasks?)
[19:23:19] i think so, but i know as much as you
[20:56:40] average
[20:56:45] your query worked 100% on my table
[20:56:51] maybe it is a permission problem somehow?
[20:56:52] hmm
[20:58:34] it might be the case yes
[21:00:40] hm
[21:00:42] shouldn't be
[21:00:50] our users have the same rights
[21:01:35] oh hm
[21:01:37] but for my table
[21:01:37] hm
[21:05:19] trying as your user
[21:05:26] ok
[21:13:44] hmmm, average it looks like the job isn't really starting
[21:13:45] yargh
[21:13:46] hm
[21:13:52] it does when I have a limit
[21:13:53] hmm
[21:14:02] oh, no
[21:14:04] it didn't
[21:14:05] weiiirrd
[21:15:53] so we know it's something that has to do with the user
[21:15:58] it doesn't like me :(
[21:16:44] hm
[21:26:07] well, at least running on my table does…but then again
[21:26:20] it doesn't seem like the job ever gets scheduled
[21:26:20] at least
[21:26:20] the query on my table
[21:26:20] hmmmm
[21:27:25] stefan, i'm going to try some things out as your user
[21:27:32] i'm going to put this test table in your db in your homedir and see what happens
[21:27:40] cool :)
[21:54:47] hmm, ok i selected from your table just fine
[21:54:49] sigh.
[21:54:50] hm
[21:54:57] gonna try with limit on the real tables as your user now
[21:59:17] ottomata: i guess Diederik van Liere is not around at this hour
[21:59:50] he lives in Toronto, but is in SF for the architecture summit
[22:08:56] ok, if you run into him, please ask him / let him know i'm looking to remove pipe 10 /usr/bin/udp-filter -F '\t' -p _NARA_ -g -b country >> <%= log_directory %>/glam_nara.tsv.log
[22:09:06] aka 6143
[22:10:40] hmm, i think someone else might know more than him about whether they still need that
[22:10:45] he probably put that in place for someone
[22:10:45] hmmm
[22:10:55] but he would know who
[22:12:46] hmm, average, yeah
[22:12:51] seems stalled
[22:12:51] http://analytics1010.eqiad.wmnet:8088/proxy/application_1387838787660_0901/mapreduce/task/task_1387838787660_0901_m_000000
[22:17:24] matanya: looking for me?
[22:21:06] drdee: Look for "[22:08:56]" on http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20140122.txt
[22:21:25] thanks qchris
[22:22:52] matanya was wondering if we needed to keep the glam nara udp2log filter around any longer
[22:22:54] and who to ask
[22:23:25] got it, i think so
[22:23:34] the glam folks like the data
[22:23:54] we never got to do the sign off with giving maarten actual access to that data
[22:25:27] ok, so we need it, drdee?
[22:26:09] ah average!
[22:26:14] i got the job to finish as your user with a limit
[22:26:14] hmmm
[22:26:16] on the real data
[22:26:17] hmm
[22:26:27] sigh, i guess i'll try to run it just like you were…and check on it tomorrow?
[22:26:28] dunno.
[22:28:40] matanya, the data should be available on stat1
[22:29:03] i think?
[22:29:04] oh no
[22:29:04] sorry
[22:29:05] on stat1002
[22:29:33] hmm i think?
[22:30:45] ahh nope
[22:30:47] it isn't synced
[22:30:49] it is all on emery
[22:31:33] can i move it from there to somewhere?
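
Picking up average's unanswered question from 19:20:46: reducers can be set directly, but classic MapReduce derives the mapper count from the input splits, so it can only be steered indirectly by bounding the split size. A minimal sketch; the byte values and the COUNT query are illustrative only.

    hive -e "
    -- reducers: settable directly, as noted in the channel
    SET mapred.reduce.tasks=16;
    -- mappers: no direct knob; bound the split size (in bytes) instead.
    -- 128 MB min / 256 MB max are example values, not recommendations.
    SET mapred.min.split.size=134217728;
    SET mapred.max.split.size=268435456;
    SELECT COUNT(*) FROM otto.webrequest_mobile;
    "
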
[22:31:42] matanya: yes
[22:32:03] um, what is 'it' in that statement matanya?
[22:32:16] the logger
[22:32:19] and the data
[22:37:03] we can move that to another machine if we don't also have to move a lot of other things to other machines
[22:37:17] we have had problems in the past with data loss if we run too many filters on a single box
[22:38:01] any free box you can think of? any stat* would be ok, i think
[22:38:28] no, we need dedicated hosts for this
[22:38:29] i mean
[22:38:35] the data can be synced to stat1002
[22:38:36] that is fine
[22:38:40] but, if we want to keep the process running
[22:38:50] we need a dedicated logging box with capacity for it
[22:38:58] matanya, if we can get rid of all the other filters
[22:39:06] then we can probably move this to erbium or to oxygen
[22:39:23] ok
[22:39:40] i'll work on the other two. know who i can ask?
[22:39:59] naw don't really know
[22:41:59] ottomata: hey
[22:42:22] yoyo
[22:42:50] i brought up the front-end-perf-stats-pixel-endpoint-backed-by-kafka thing during the platform quarterly review, which mark and faidon attended, and they were both cool with it
[22:42:58] should i just go ahead and submit a patch?
[22:43:10] we can add them as reviewers so everyone is on the same page
[22:45:05] ottomata: ^
[22:45:39] yeah!
[22:45:40] let's do it
[22:45:59] add the varnishkafka instance to wherever you need it…and then we create the topic manually
[22:46:00] merge it
[22:46:02] and kablam
[22:46:03] :)
[22:46:10] bits varnishes
[22:46:13] it's not already on them, right?
[22:46:18] it's still on mobile only, right?
[22:47:12] ah yeah i see it
[22:47:37] yeah just on mobile now
[22:48:04] but if i add class { 'role::cache::varnish::kafka': }, you won't be able to add a class with a different topic for web reqs
[22:48:27] lemme look
[22:48:38] let's do this now and then refactor to a resource (instead of class) later?
[22:49:12] hmmmmm
[22:49:24] to a define
[22:49:24] yesshhhhh
[22:50:02] hmm, also, ori, the role there hardcodes the format
[22:50:06] you want a different format probably, right?
[22:50:08] or is that one ok?
[22:50:17] it's ok, it just has a ton that i don't need
[22:50:29] there's also no way to specify a URL pattern to match
[22:50:33] i don't need all bits requests
[22:50:36] hm
[22:50:57] true, but i guess we can just set this up as generic bits webrequest and you can filter?
[22:51:07] we'd only need one vk instance then?
[22:51:22] i'm fine with that, but it ups the ante a little in terms of the load it can be expected to generate on the kafka brokers
[22:51:30] if you're cool with that then sure
[22:51:44] should be fine on the brokers, i think; the filtering would be done on your consumer side
[22:51:49] but yeah, you'd be consuming more
[22:51:52] but it should be fine
[22:51:59] if we can't handle that then poo on us
[22:53:01] so the patch should be just class { 'role::cache::varnish::kafka': topic => 'webrequest_bits', }
[22:53:18] yup, think so, in the bits role class, ja
[22:53:24] basically exactly the same as mobile, except for bits
[22:53:28] could you submit that? i think if faidon and mark have questions etc you'd be the person to answer
[22:53:30] i could!
[22:53:40] actually, i might get gage to, trying to get him into more things
[22:53:42] moving to #ops
[22:53:54] cool
[23:06:21] i'm out, laters all!
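
A closing footnote on the "we create the topic manually" step above: with 0.8-era Kafka this is typically done via the bundled admin script. A hedged sketch; the ZooKeeper address, partition count, and replication factor are placeholders, and kafka-topics.sh is the 0.8.1+ name of the tool, so the exact script may differ on this cluster.

    # Hypothetical manual topic creation; host and sizing are placeholders.
    kafka-topics.sh --create \
      --zookeeper zookeeper.example.org:2181 \
      --topic webrequest_bits \
      --partitions 10 \
      --replication-factor 2
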