[14:34:31] goooooooood mmooooorning
[14:39:25] mooorning ottomata
[14:39:52] we still have packet loss issues on emery :( even after enabling sampling on two filters yesterday, would restarting udp2log help?
[14:40:11] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=emery.wikimedia.org&v=0.0&m=packet_loss_average&r=hour&z=default&jr=&js=&st=1359324678&vl=%25&z=large
[14:42:01] did something change there?
[14:42:13] did you add new filters?
[14:42:16] not that i know
[14:42:17] no
[14:42:22] i only enabled sampling
[14:42:27] to reduce workload
[14:42:48] maybe one filter is stuck?
[14:43:44] was emery having problems? why did you enable sampling?
[14:44:57] (if there are emails about this, I will probably read them in 5 or 10 minutes :) )
[14:45:37] look at the ganglia chart
[14:45:44] yesterday we had packet loss for almost 3 hours
[14:45:48] right now again
[14:48:02] but did that start because you changed something? or was it already happening?
[14:48:10] just trying to make sure I know what is going on
[14:48:31] so you enabled sampling because emery had problems without any changes happening there?
[14:49:31] no i did not change anything
[14:49:36] hm ok
[14:49:42] it just started happening yesterday afternoon
[14:49:47] yes
[14:50:28] the Thank_You_Main filter and a country filter had 100% CPU usage
[14:50:29] ok
[14:50:36] and were both unsampled
[14:50:41] i see a lot of processes there,
[14:50:49] so i put them to sample 1:10
[14:50:49] i doubt this is related
[14:50:49] but just fyi
[14:50:56] one thing i've noticed with deploying changes to udp2log filters
[14:51:02] is that you can't rely on puppet to properly restart udp2log
[14:51:08] ohhhh
[14:51:21] why not?
[14:51:26] dunno
[14:51:34] i always run puppet and get the config changes
[14:51:37] and then restart udp2log myself
[14:51:49] we didn't do that yesterday
[14:51:57] probably good to do that now
[14:52:04] i just did ja
[14:52:07] ok
[14:52:10] thx
[14:52:16] we'll see if it helps
[14:53:24] what is mysql doing on emery?
[14:53:49] and apache?
[14:54:23] dunno!
[14:54:26] not much if anything
[14:54:27] heh
[15:14:20] morning average_drifter, milimetric
[15:14:40] morning sir
[15:18:34] limn prioritization
[15:18:36] >
[15:18:37] ?
[15:29:46] milimetric ^^
[15:30:04] i'll PM you
[15:30:09] k
[16:15:42] drdee why jan 31 rather than feb 4?
[16:15:46] for tab seps?
[16:15:47] deploy
[16:16:03] because we need to get it done
[16:16:04] :)
[16:21:42] but more importantly it makes erik z's life easier if the transition happens at the end/start of a month
[16:23:16] do udp-filters need to be debianized? because I saw you guys discussing it a lot, so I wasn't sure if the code's ready to be packaged yet
[16:23:24] btw, good evening :)
[16:23:52] I also saw the tickets on Asana related to Limn's debianization
[16:24:16] ok, drdee, cool, that's a good reason
[16:24:48] average_drifter: yes you should debianize the field_delim_param branch
[16:25:05] and we will ask you to help milimetric with debianizing limn in the coming two weeks
[16:25:44] ottomata, is upgrading emery to precise a lot of work?
[16:25:52] probably not
[16:25:59] but i don't know for sure
[16:26:03] ok
[16:26:40] ottomata: https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc
[16:26:52] so we can talk a bit about all those filter things ;)
[16:27:31] what filter things?
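A minimal sketch of the 14:50–14:52 exchange above (the 1:10 sampling change plus the puppet-then-manual-restart workflow). The filter lines and paths are hypothetical, not the real emery config; the assumption here is that in udp2log's filter syntax the number after "pipe" is the 1:N sampling divisor:

```
# hypothetical filter lines -- changing 1 -> 10 keeps roughly every 10th packet:
#   before:  pipe 1  /usr/bin/udp-filter ... >> /a/squid/Thank_You_Main.log
#   after:   pipe 10 /usr/bin/udp-filter ... >> /a/squid/Thank_You_Main.log

# deploy workflow as described: puppet pulls the config change,
# but the daemon restart is done by hand (service name assumed)
sudo puppetd --test
sudo service udp2log restart
```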
[16:28:04] emery / udp-filter / kraken
[16:33:12] ottomata, we also need to talk about udp-filter and kraken
[16:33:18] i need to look at that
[16:33:24] let me reproduce what I saw on friday
[16:33:25] k
[16:33:43] haha, this morning has been CERAAaazzy, let's talk in an hour before standup
[16:33:46] then there is an ops meeting at 2
[16:33:48] aaahHHHh
[16:34:54] k
[17:23:54] hey average_drifter
[17:23:58] milimetric:
[17:24:05] howdy
[17:24:14] i'm in a hangout that I emailed you
[17:24:19] ah sweet, just made it to the office
[17:24:34] drdee, i apologize for jumping the gun blaming udp-filter, I just realized that my check there was flawed
[17:24:40] np
[17:24:46] cool, join when you're ready DarTar
[17:24:56] it could still be udp-filter, but then it's because it's not fast enough
[17:25:01] could be
[17:25:07] milimetric: not allowed to join
[17:25:13] which email address did you use?
[17:25:23] your wikimedia one
[17:25:34] can you add dario.taraborelli@gmail.com ?
[17:26:03] yes
[17:26:06] thx
[17:56:45] which scrum URL are we using?
[17:56:46] https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc
[17:56:48] or
[17:56:55] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[17:57:06] the first one is from the invite
[17:57:07] the second is what i have bookmarked.
[17:57:20] drdee milimetric
[17:57:49] i prefer the first one
[17:57:59] k
[17:58:21] either way, i am alone :P
[17:58:56] it's 12:58
[17:58:57] :D
[17:59:38] https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc
[18:19:26] yarr
[18:19:36] harr harr
[18:33:08] ok, making some food, be back in a bit
[18:44:05] hey drdee
[18:44:09] it now works. Thanks :)
[18:44:30] awesome!
[18:44:43] i'll be pushing some now to see how it goes
[18:46:37] don't forget to update your settings.xml
[18:59:21] from the ZeroMQ book
[18:59:25] How It Began
[18:59:25] We took a normal TCP socket, injected it with a mix of radioactive isotopes stolen
[18:59:27] from a secret Soviet atomic research project, bombarded it with 1950-era cosmic rays,
[18:59:28] and put it into the hands of a drug-addled comic book author with a badly-disguised
[18:59:29] fetish for bulging muscles clad in spandex (see figure 1). Yes, ØMQ sockets are the
[18:59:30] world-saving superheroes of the networking world.
[18:59:46] in case you didn't know :D
[19:18:46] heh
[19:18:54] i've read quite a bit of his stuff
[19:20:44] the best is http://www.250bpm.com/blog:2 -- basically, "why i wrote ZMQ"
[19:20:55] and http://www.250bpm.com/concepts
[19:21:14] more interesting stuff here http://www.250bpm.com/research
[19:58:48] ahhh, drdee, what needs to happen to kraken repo?
[19:58:54] its cloned on lots of analytics machines and pupet is not happey
[19:58:55] happy*
[19:59:07] you need to reclone it
[19:59:15] yeah, you can't pull.
[19:59:16] :(
[19:59:16] oh ok
[19:59:20] hm :/ ok
[20:08:12] where is the source code for limnpy?
[20:09:30] https://github.com/embr/limnpy
[20:10:07] thx
[20:10:08] !log Modified webrequest-mobile udp2log filters to produce based on frontend server hostname (^cp104[1-4]) rather than m.wikipedia.org domain.
[20:10:11] Logged the message, Master
[20:10:20] WOOOT WOOOOT
[20:10:43] GUYS
[20:10:45] GUYS GUYS GUYS
[20:10:46] https://github.com/linkedin/camus
[20:10:47] i'll wait till we get our first consumption, and then will check seqs
[20:10:51] ottomata, drdee!!
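A sketch of what the !log filter change above plausibly looks like. This is not the deployed filter: the file paths, the use of grep, and the assumption that the frontend cache hostname is the first field of each udp2log line are all hypothetical.

```
# old: literal domain match anywhere in the line -- over-broad, as the
#      later am./om./tum. discussion in this log shows
#   pipe 1 /bin/grep 'm.wikipedia.org' >> /a/squid/webrequest-mobile.log
#
# new: anchor on the frontend cache hostname (assumed to be field 1)
#   pipe 1 /bin/grep '^cp104[1-4]' >> /a/squid/webrequest-mobile.log
```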
[20:10:57] oooo
[20:10:59] reading
[20:10:59] (h/t to preilly)
[20:11:23] of course we will check seqs :D
[20:11:43] this could be SO GOOD
[20:12:03] this is like, perfect
[20:12:08] (provided it works)
[20:12:20] drdee: Camus is LinkedIn's Kafka->HDFS pipeline. It is a mapreduce job that does distributed data loads out of Kafka.
[20:12:26] only committed to github 1 month ago
[20:12:32] cooool!
[20:12:32] 11 days, i thought :)
[20:12:32] i am reading right now
[20:12:36] nice find preilly
[20:12:42] the only problem is that it wants to write avro records.
[20:12:48] which is great for where we want to go
[20:12:56] not so great for the interim, where we're working with CSVs
[20:13:05] Topic offsets stored in HDFS. Camus maintains its own status by storing offset for each topic in HDFS. This data is persistent.
[20:13:06] interesting
[20:13:12] wonder why they do that instead of in zookeeper
[20:13:25] oh i guess if all zks restart this info would be gone
[20:13:29] hm
[20:13:32] cool!
[20:13:37] ahhhh
[20:13:41] shit, i totally hadn't thought of that.
[20:14:36] seems pretty great though
[20:14:52] certainly better than the stoopid kafkahadoopconsumer + cron jobs
[20:14:58] exactly.
[20:15:38] shouldn't be too hard to add a plain text writer instead of the avro writer, right?
[20:16:14] haha, i was just about to ask that same q
[20:16:35] or i mean, we can use our avro schema we've got defined, if we want to start doing that
[20:16:38] not sure if we do yet
[20:16:42] i would wait
[20:16:47] yeah, i think so too
[20:16:50] we'd have to change pig scripts etc.
[20:16:54] no not even
[20:16:54] would just slow us down right now
[20:17:01] exactly.
[20:17:09] but it would make it much harder to inspect data
[20:17:10] i think plaintext is the right answer for now.
[20:17:13] me too
[20:18:24] We hope to eventually create a more out of the box solution, but until we get there you will need to create a custom decoder for handling Kafka messages. You can do this by implementing the abstract class com.linkedin.batch.etl.kafka.coders.KafkaMessageDecoder. Internally, we use a schema registry that enables obtaining an Avro schema using an identifier included in the Kafka byte payload. For more information on other options, you can email
[20:18:25] camus_etl@googlegroups.com.
[20:18:30] ja
[20:18:35] from https://github.com/linkedin/camus#first-create-a-custom-kafka-message-to-avro-record-decoder
[20:18:50] we could make a single field avro schema for the whole thing, but it would probably be better to write straight text, rather than avro at all for now
[20:19:08] i doubt that would be hard to implement though
[20:19:14] cool!
[20:21:06] you could also just use reflection
[20:22:32] assuming that org.apache.avro.reflect.ReflectData;… works as I think it might
[20:23:35] and ReflectDatumReader as well as ReflectDatumWriter
[20:24:32] hmm, cool
[20:29:38] drdee, I'm still listening to ops meeting, so I can't voice chat
[20:29:39] but!
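Given the Camus README excerpt quoted above, the "write straight text" idea might look something like the sketch below. The log quotes only the abstract class name, not its methods, so the decode() hook, its return type, and the package name are all assumptions, not working Camus code.

```java
package org.wikimedia.analytics.kraken.etl; // hypothetical package

import com.linkedin.batch.etl.kafka.coders.KafkaMessageDecoder;

import java.io.UnsupportedEncodingException;

/**
 * Sketch: pass udp2log lines through as UTF-8 text instead of
 * decoding them into Avro records. The decode() signature is an
 * assumption; the real abstract class may look different.
 */
public class PlainTextMessageDecoder extends KafkaMessageDecoder {
    @Override // assumed hook for turning a Kafka payload into a record
    public String decode(byte[] payload) {
        try {
            // udp2log messages are already plain text lines
            return new String(payload, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }
}
```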
[20:29:47] i want to focus on tab now
[20:36:36] ergh, just got a failed kafka hadoop consumer email again
[20:39:18] ok
[20:43:50] don't really know why though
[20:43:52] ergh
[20:51:36] so anyway
[20:51:38] ja, drdee
[20:51:39] tabs
[20:51:56] btw, running large jobs on kraken does not work; close to the end of the map phase this happens:
[20:51:57] derrrrrrr, we need:
[20:51:57] deploy new version of udp-filter
[20:51:57] deploy new version of packet-loss
[20:52:14] O [IPC Server handler 0 on 58033] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Kill Job received from client job_1355947335321_11056
[20:52:14] 2013-01-28 20:28:08,847 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1000
[20:52:16] 2013-01-28 20:28:08,848 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 2000
[20:52:17] 2013-01-28 20:28:08,848 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 3000
[20:52:19] 2013-01-28 20:28:08,849 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 4000
[20:52:20] 2013-01-28 20:28:08,849 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 5000
[20:52:22] 2013-01-28 20:28:08,849 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 6000
[20:52:23] 2013-01-28 20:28:08,850 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 7000
[20:52:25] 2013-01-28 20:28:08,850 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 8000
[20:52:26] 2013-01-28 20:28:08,850 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 9000
[20:52:27] 2013-01-28 20:28:08,850 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1355947335321_11056Job Transitioned from RUNNING to KILL_WAIT
[20:52:35] average_drifter is working on the udp-filter package, will have it ready soon
[20:52:45] the first line of the log stuff that i pasted is the important one
[20:52:51] Kill Job received from client job_1355947335321_11056
[20:52:54] not clear at all why
[20:58:38] (back)
[21:00:00] drdee: job error? check the job log?
[21:00:34] my suspicion is OOM / poorly configured mapred-site.xml
[21:00:50] definitely digging through the logs
[21:00:56] but there are many :)
[21:01:13] yeah
[21:04:05] cool! so, during a full 15 minute period of producing logs via server hostname matching, rather than domain name
[21:04:11] 0 missing seq numbers
[21:04:13] for cp1044
[21:04:55] but, also interesting
[21:05:08] matching on these 4 hosts produces MUCH less data than matching for m.wikipedia.org
[21:05:42] 200 MB / min before, now only around 50 MB / min
[21:06:58] because we have at least 4 dedicated varnish mobile servers
[21:07:06] cp104*
[21:09:47] nice!
[21:09:54] interesting.
[21:10:44] asking RobH to confirm we have the right hosts...
[21:11:23] we have, just checked ganglia
[21:11:25] ottomata: we have the right hosts, but it seems 2 are offline (!!)
[21:11:30] mmmmm
[21:12:03] LOL
[21:12:05] why would matching on m.wikipedia.org generate so much more data though?
[21:12:15] people browsing m.wikipedia on their desktops?
[21:12:30] how do people get directed to those cache servers?
I would think it would be DNS based
[21:12:52] er
[21:13:12] do we match on 'm.wikipedia.org' or '\bm\.wikipedia\.org'?
[21:13:24] new tag for new udp-filters release ?
[21:15:11] \.
[21:15:25] ottomata: because am. pam. nrm. udm. om. tum.
[21:15:25] it's literal unless you pass a flag to do regex
[21:15:27] are all wikis
[21:15:33] ....
[21:15:40] uh huhhhh!
[21:15:41] new version will be 0.3.20
[21:15:43] well that would do it
[21:15:43] so if we're not inserting the \b or the \.
[21:15:47] we would be matching all those
[21:15:48] average_drifter: that's fine
[21:16:42] average_drifter, if you can, please build precise and lucid .debs
[21:17:46] example:
[21:17:46] http://apt.wikimedia.org/wikimedia/pool/main/u/udp-filter/
[21:19:26] drdee, ottomata, milimetric: if you add the hosts riemann. and graphite. to your hosts line for the analytics domains, http://riemann.analytics.wikimedia.org/ looks pretty good
[21:19:44] cool
[21:19:58] nice dschoon!
[21:20:38] i haven't piped those into graphite yet
[21:20:43] i'm going to get jmx working first
[21:22:26] ok, crisis rescinded
[21:22:34] all four mobile cache servers are up, ottomata, drdee
[21:22:34] oh! packet-loss.cpp is not debianized at all
[21:22:36] interesting.
[21:22:40] they were just misconfigured.
[21:22:43] (in ganglia)
[21:22:52] k
[22:11:16] dschoon, does riemann also do memory consumption monitoring ?
[22:11:29] that would help me :)
[22:11:34] that's what the "memory" stat is
[22:11:53] does it have more granular metrics?
[22:12:15] and can we add an11-an20
[22:12:21] what is riemann ?
[22:12:29] I know Bernhard Riemann
[22:12:32] but not this new one
[22:12:35] :)
[22:12:45] realtime cluster monitoring
[22:12:51] oh, ok
[22:17:51] ottomata, are an10-an20 sending stats to ganglia?
[22:18:08] i mean hadoop-specific settings
[22:19:31] yes
[22:19:55] should be!
[22:19:58] but uhhh don't see any
[22:19:59] hm
[22:20:15] oh maybe an10 is
[22:20:18] but others aren't
[22:20:22] drdee, they can very easily though
[22:20:29] any jmx metric you want to get at
[22:20:39] or wait, no, this is a hadoop config...
[22:20:40] hang on
[22:20:41] i think only an10 sends hadoop specific metrics to ganglia
[22:21:53] hmmm, the others should I think
[22:21:55] they are configured with
[22:21:58] datanode.sink.ganglia.servers=239.192.1.32:8649
[22:22:02] and
[22:22:02] nodemanager.sink.ganglia.servers=239.192.1.32:8649
[22:22:56] there are a buuuunch more stats though
[22:23:01] that you could have them send
[22:23:05] wanna learn how?! :)
[22:23:17] SURE!
[22:23:26] but i am looking at an13 right now
[22:23:34] at hadoop-metrics2.properties
[22:23:37] and that looks configured
[23:05:20] drdee: So I am working on RT4402 for Stefan's access to shell
[23:05:29] aight
[23:05:34] what you want is basically the same access you have right?
[23:05:43] your access is based off the analytics intern group access
[23:05:47] yes that's fine
[23:05:49] so if he is identical, then i just toss him there
[23:05:55] yup
[23:07:12] cool, also updated roles/analytics to include him on accounts, i don't think i'm missing anything else
[23:07:26] will roll it live shortly, if it is missing anything else you guys can feel free to ping me about it
[23:08:08] ty!
[23:09:35] working in software rather than racking servers... still odd.
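The over-matching discussed at 21:13–21:15 above is easy to demonstrate: 'm.wikipedia.org' is a literal substring of domains like om.wikipedia.org, so an unanchored match counts those wikis too. A standalone illustration in plain Java string operations (not udp-filter itself):

```java
public class MatchDemo {
    public static void main(String[] args) {
        String host = "om.wikipedia.org"; // the Oromo wiki, not mobile

        // literal substring match, as udp-filter does without the regex flag:
        System.out.println(host.contains("m.wikipedia.org"));  // true -- over-match

        // anchored match that only accepts the mobile domain itself
        // (or a subdomain of it, e.g. en.m.wikipedia.org):
        System.out.println(host.matches("(^|.*\\.)m\\.wikipedia\\.org")); // false
    }
}
```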
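For reference, the two sink lines quoted at 22:21–22:22 above live in hadoop-metrics2.properties. A fuller version (illustrative, not the actual Kraken config) would also name the sink class and cover the master daemons:

```properties
# ganglia 3.1.x sink implementation and reporting period (seconds)
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10

# the two lines quoted in the conversation (worker daemons):
datanode.sink.ganglia.servers=239.192.1.32:8649
nodemanager.sink.ganglia.servers=239.192.1.32:8649

# master daemons would need their own entries, e.g.:
namenode.sink.ganglia.servers=239.192.1.32:8649
resourcemanager.sink.ganglia.servers=239.192.1.32:8649
```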
[23:11:01] RobH: thank you
[23:12:16] welcome =]
[23:12:26] ok, change is live, it'll take a couple hours for all the analytics machines to get the update
[23:12:44] RobH: my pubkey is in your pm
[23:12:57] RobH: I just tried to connect and got
[23:12:59] the pubkey was already in admins.pp =]
[23:13:07] user@user-Inspiron-3520:~$ ssh spetrea@analytics1001.wikimedia.org
[23:13:07] Permission denied (publickey).
[23:13:09] average_drifter: it'll take a bit for puppet to make its calls
[23:13:15] lemme try forcing it on 1001 so you can test
[23:13:17] ok
[23:13:21] alright
[23:13:37] hrmm
[23:13:47] well, that's not good
[23:13:50] my changeset broke it
[23:13:53] * RobH goes to fix
[23:14:09] horrible typo, WTF did i do.
[23:16:21] HAHAH
[23:16:21] acciybts::spetreaal
[23:16:23] hahaah
[23:17:18] :)
[23:18:53] and it's still saying that
[23:18:55] even though i fixed it
[23:18:57] wtf puppet
[23:19:31] spetreaal
[23:19:34] is not right
[23:19:39] you only fixed 'accounts'
[23:19:50] yea
[23:19:54] i saw it, already committed, heh
[23:19:59] aye
[23:20:00] commit message 'i am not smart'
[23:20:31] yeehaw, danke!
[23:27:09] I am running into another user based puppet issue, so until I fix it, this won't work, still hackin at it.