[14:18:54] hey ottomata
[14:18:57] happy new year :)
[14:19:01] hey qchris
[14:19:44] Hi average.
[14:19:58] Happy new year.
[14:20:59] hey drdee
[14:21:21] happy new year!
[15:01:42] ottomata: Want to join us for daily scrum?
[15:01:55] ooo ok!
[15:47:13] going to get some f00d
[16:09:28] happy new year guys!
[16:14:52] happy new year drdee!
[20:53:34] hey ottomata, Snaps
[20:53:58] hey average
[20:54:32] yoyo
[20:54:33] I have a small idea for a validity check. I can write a JSON schema for the JSON data that is produced, and have some sort of periodic scheduled job that checks the data against the schema and records entries that don't conform to it
[20:55:18] average, it would maybe be cooler if we could keep invalid records from being written in the first place
[20:55:20] maybe in camus
[20:55:23] somehow
[20:55:25] hmmmm
[20:55:29] hmm
[20:55:31] actually
[20:55:52] so the camus code i'm using
[20:55:59] uses the Gson parser to parse the record before writing to HDFS
[20:56:02] which means that Gson parses it just fine
[20:56:08] whatever parser the serde uses can't parse it
[20:56:10] hmmm
[20:57:03] ottomata: is "time_firstbyte" allowed to be nan?
[20:57:21] I mean, is that a value that you would allow in certain situations?
[20:57:25] when kafkatee is deployed you can run the json validator from there, could be as simple as a py script reading from stdin.
[20:58:00] Snaps: can I ask what kafkatee is? is it like tee but for kafka?
[20:58:10] looks like a multiplexer
[20:58:17] average: udp2log replacement using kafka as input instead of the udp thingie
[20:58:29] I see
[20:58:51] but yeah, it's generic, so it is a kafka demuxer / tee
[20:59:52] the time_firstbyte value is a floating point number from varnish:
[20:59:56] 3294 ReqEnd c 1503439622 1376564892.173064470 1376564892.173686981 0.274475574 0.000562668 0.000059843
[21:00:06] column nr 5 following the 'c'
[21:00:39] 1376564892.173064470 ?
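The validity check average describes — a scheduled job, or a "py script reading from stdin" behind kafkatee — could be sketched roughly like this. This is a minimal stdlib-only sketch; the field names and allowed types in `SCHEMA` are assumptions based on the fields mentioned in this log, not the real webrequest schema.

```python
import json
import sys

# Hypothetical subset of the webrequest schema: field name -> allowed types.
# None is allowed for time_firstbyte to accommodate a null value.
SCHEMA = {
    "dt": (str,),
    "time_firstbyte": (float, int, type(None)),
}

def validate(line):
    """Return None if the line is valid JSON matching SCHEMA, else an error string."""
    try:
        record = json.loads(line)
    except ValueError as e:
        return "unparseable: %s" % e
    for field, types in SCHEMA.items():
        if field not in record:
            return "missing field: %s" % field
        if not isinstance(record[field], types):
            return "bad type for %s: %r" % (field, record[field])
    return None

if __name__ == "__main__":
    # Read one JSON record per line from stdin and report the invalid ones.
    for lineno, line in enumerate(sys.stdin, 1):
        error = validate(line)
        if error:
            print("line %d: %s" % (lineno, error))
```

A record containing a bare `nan` fails at the `json.loads` step already, since `nan` is not valid JSON.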
[21:00:39] so I guess it could be output as nan by varnish if the request failed internally for some reason and thus never produced a "first byte"
[21:00:52] 0.000562668
[21:01:08] oh, #5 relative to the column containing "c"
[21:01:09] ok
[21:01:15] ah, yeah, sorry
[21:02:02] ok, so it's allowed to be NaN too
[21:02:19] maybe, I don't know :) but theoretically yeah
[21:03:52] hmmmmmmm, Snaps, i've got data from the kafka consumer with nan
[21:04:03] but my vk output and ncsa output greps didn't catch it
[21:04:06] this has happened twice now
[21:04:09] thought I was doing something wrong
[21:04:22] i've got this
[21:04:22] varnishkafka -S ./varnishkafka.out.conf | grep ':nan' >> varnishkafka.nan.out
[21:04:24] sure it is from the same frontend?
[21:04:28] OH
[21:04:30] duh
[21:04:30] nope
[21:04:31] it isn't
[21:04:35] oh wait
[21:04:36] yeah it is
[21:04:40] i'm using the same varnishkafka.conf file
[21:04:43] i just copied it and then changed the output
[21:05:06] yeah
[21:05:06] # varnish instance name
[21:05:06] varnish.arg.n = frontend
[21:05:43] and the host reported in the json is the same one you are troubleshooting on, yeah?
[21:08:29] Snaps, same host
[21:08:39] ah, gotta run out for a bit to pick someone up, i'll be back on for a little bit
[21:09:15] ok, running out Snaps, back on for a bit in 30 mins-ish
[21:54:39] ok back
[21:56:09] so Snaps, iiiiinteresting
[21:56:13] if varnish has an xid
[21:56:16] do we need seq?
[21:56:30] what's an xid?
[21:56:39] https://gerrit.wikimedia.org/r/#/c/105093/1/varnishkafka.conf.example
[21:58:05] ottomata: we shouldn't be needing seq, no, but in mark's varnish log sample I see a few xid duplicates, so I'm not sure it can be fully trusted (as seq can)
[21:58:27] hm
[21:58:33] what's mark's varnishlog sample?
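The column arithmetic Snaps describes (time_firstbyte is column 5 following the 'c' marker in a ReqEnd line) can be sketched like this. The function name is made up for illustration; mapping NaN to None anticipates the null convention discussed later in the log.

```python
# Extract time_firstbyte from a varnish ReqEnd log line: it is the 5th
# field following the 'c' (client) marker, per the discussion above.
def parse_time_firstbyte(reqend_line):
    fields = reqend_line.split()
    c_index = fields.index("c")               # position of the 'c' marker
    raw = fields[c_index + 5]                 # column nr 5 following the 'c'
    value = float(raw)                        # note: float("nan") parses fine
    return None if value != value else value  # NaN is the only float != itself

sample = ("3294 ReqEnd c 1503439622 1376564892.173064470 "
          "1376564892.173686981 0.274475574 0.000562668 0.000059843")
```

Note that Python's `float()` happily accepts the string "nan", which is exactly why a downstream formatter that prints the value verbatim can emit invalid JSON.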
[21:59:17] a varnish log file with some 15k requests off a live frontend, that I use for reference and testing
[21:59:44] the xid dups in varnish may be explained by this comment in varnishd: sp->xid = ++xids; /* XXX not locked */
[22:01:12] ah hm
[22:01:23] hm, ok well, dups would be a problem
[22:01:26] so I guess we'll keep seq
[22:01:32] but it won't hurt to have xid in the logs too
[22:01:40] that commit looks fine, should I merge?
[22:01:45] we'd have to rebuild and change puppet
[22:01:50] would it help to have that now?
[22:01:59] or can we wait till the next time we want to update vk for something?
[22:02:15] ok weird
[22:02:19] Snaps: i def just tried again
[22:02:24] saw a request come through kafka with nan
[22:02:33] did not see it in vk stdout or varnishncsa
[22:02:39] we don't need the xid stuff now
[22:02:48] oh i betcha i am not using the right frontend for ncsa
[22:02:51] but still, it should be there for vk
[22:03:05] huhm, ok
[22:03:52] but, that's super weird, right?
[22:03:57] yeah
[22:04:01] why would it come through in kafka but not in vk stdout?
[22:04:17] unless the message you receive from kafka is old, but it should be, right?
[22:04:21] shouldn't
[22:04:24] no, it's live
[22:04:41] ,"dt":"2014-01-02T22:02:05"
[22:05:07] i'm running vk stdout and ncsa side by side on the same host
[22:05:12] and also consuming the kafka topic
[22:05:32] waiting to see nan lines coming from that host
[22:05:33] cp1047
[22:05:35] in kafka
[22:05:45] and then checking my output from vk stdout and ncsa
[22:05:47] getting nothing
[22:05:55] grep ':nan'
[22:06:01] should do it
[22:06:01] right?
[22:06:03] "time_firstbyte":nan,
[22:06:04] yeah totally
[22:06:04] hm
[22:06:35] and if you remove the grep you get output?
[22:07:16] ottomata, Snaps, can I help you in any way related to this NaN?
[22:07:28] yeah, tons of output snaps
[22:07:46] average, if you want, you could figure out how to sanitize the data we have?
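The worry about xid as a sequence key — the unlocked increment `sp->xid = ++xids; /* XXX not locked */` means two worker threads can hand out the same xid — is easy to quantify against a log sample. A quick sketch (assuming the xids have already been extracted, one per record):

```python
from collections import Counter

def duplicate_xids(xids):
    """Return the xids that appear more than once in a sample, with counts.

    A non-empty result means xid cannot be trusted as a unique
    per-request sequence the way vk's own seq can.
    """
    return {xid: n for xid, n in Counter(xids).items() if n > 1}
```

Running something like this over mark's 15k-request sample is presumably how the "few xid duplicates" were spotted.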
[22:08:01] ok, I'll give that a try
[22:08:02] super weird
[22:08:02] some kind of hadoop
[22:08:10] s/"time_firstbyte":nan/"time_firstbyte":0.0/g
[22:08:11] would suffice
[22:08:31] you'd probably have to rewrite each file to a temp file, and then delete the original and rename the temp
[22:08:58] which files would those be?
[22:09:35] files in hdfs under /wmf/data/external/webrequest_mobile?
[22:10:35] yes
[22:11:17] hmmm, wonder why there isn't newer data there
[22:11:18] hmm
[22:11:33] average, I'm pretty sure there is at least one of those lines in /wmf/data/external/webrequest_mobile/hourly/2014/01/01/14
[22:11:39] so you can take that as a smaller sample for an example
[22:11:55] average, also, i think that actually doing this might be a lot of work,
[22:12:04] if it takes too long, don't worry about it
[22:12:11] we don't need to sanitize this data (yet)
[22:12:15] it would just be nice to have
[22:12:26] if we can fix the vk bug then we can just scrap this data and start new
[22:13:05] (since "0.0" and "nan" are of equal length you can do an in-place replacement and don't need to use an intermediary file)
[22:13:06] is the vk bug replicable? or does it require a production environment to replicate?
[22:13:12] I mean, can I replicate the bug on my box?
[22:13:25] Snaps, I don't think hdfs will let him replace
[22:13:29] we haven't replicated it yet, no
[22:13:33] ah :)
[22:13:34] ok
[22:14:29] ottomata: what's the cpu usage like while running two vks and one vn?
[22:15:17] load average: 1.42, 1.56, 1.46
[22:16:38] huhm, maybe they're running too slowly under this load and missing logs from varnish. Seems unlikely but you never know.
[22:17:13] hmm
[22:17:22] that doesn't seem like much load though, no?
[22:17:31] 16 processors on that box
[22:18:44] you are right
[22:23:33] are you still seeing new nans on the consumer?
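The substitution suggested above can be written as a streaming stdin-to-stdout filter, which fits the rewrite-to-temp-file workflow ottomata describes for HDFS. A sketch (script name and pipeline are illustrative, not an actual job that was run):

```python
import sys

# Replace the invalid bare nan with an equal-length float literal, per
# Snaps's observation that "0.0" and "nan" are the same width, so the
# record length is unchanged.
def sanitize(line):
    return line.replace('"time_firstbyte":nan', '"time_firstbyte":0.0')

if __name__ == "__main__":
    # One possible pipeline for a single HDFS file:
    #   hadoop fs -cat <file> | python sanitize.py | hadoop fs -put - <tempfile>
    # then delete the original and rename the temp, as discussed above.
    for line in sys.stdin:
        sys.stdout.write(sanitize(line))
```

(The log later settles on null rather than 0.0 for the vk fix itself; 0.0 is used here only because it preserves record length for this one-off cleanup.)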
[22:24:30] haven't seen any in the last 5 or 10 mins
[22:33:45] yeah ok Snaps, just saw a couple more
[22:33:49] one on the host i'm looking at
[22:33:52] still nothing in vk or ncsa
[22:34:19] which is really strange, right?
[22:34:33] is it possible that the nan is coming from rdkafka somehow?
[22:34:38] no, right?
[22:34:45] it's just a string message when it gets to rdkafka
[22:34:46] rdkafka just copies the data
[22:34:49] yeah
[22:34:58] hm hm hm hm
[22:35:11] and you are super sure the messages you are seeing are really from the host you are testing on?
[22:35:17] more likely i must be doing something wrong with vk?
[22:36:31] Snaps: https://gist.github.com/ottomata/8228436
[22:38:10] HMMMMM
[22:38:14] Snaps, I do see a
[22:38:18] Jan 2 22:35:01 cp1047 varnishkafka[17650]: TERM: Received signal 2: terminating
[22:38:20] in varnishkafka.log
[22:38:27] which is about the same time that came through
[22:39:29] what killed it?
[22:39:45] not sure
[22:39:47] investigating...
[22:39:58] oh PSSH
[22:40:13] that's probably my stdout process
[22:40:17] i ctrl-c-ed it
[22:40:42] but hmm
[22:40:48] not right after that log came in
[22:40:49] huh?
[22:40:49] hm
[22:41:25] hm, yeah that is certainly unrelated
[22:41:31] that's just me ctrl-c-ing my foreground vk process
[22:41:40] but it's 1s from the nan
[22:41:44] must be related
[22:41:57] i dunno,
[22:42:24] icinga-wm is whining about too many processes with the varnishkafka name. Does it take any action?
[22:42:29] oop
[22:42:32] no it doesn't
[22:42:36] i'll acknowledge
[22:44:01] Snaps, I think it must have been a coincidence, there are about 6 or 7 of those messages in vk.log, and they are certainly just from me recently running and ctrl-c-ing a foreground vk
[22:44:09] there are no messages in the vk logs from the past few days
[22:44:14] okay
[22:44:23] and there are certainly nans in the logs from then
[22:44:39] maybe the 'tee' is buffering stuff
[22:44:44] try running it with just the grep
[22:45:34] i was doing this before without tee, just >>
[22:45:37] but i will try
[22:46:06] can we try to reproduce this somehow? maybe by forming a similar request to the one we're seeing?
[22:47:44] we don't have much to go on though, just the url
[22:48:22] we could try to run varnishlog directly and grep for ReqEnd.*nan
[22:55:07] yeah hm
[22:55:17] !!!
[22:55:20] wait i just got some
[22:55:27] without tee
[22:55:30] ? weird
[22:55:34] ok phew
[22:55:36] and in ncsa
[22:55:42] show it!
[22:57:03] https://gist.github.com/ottomata/8228726
[22:57:31] so yeah, from varnish then
[22:57:41] hm
[23:00:00] we could fix that in vk, and either send it as null (which seems to be correct according to some SO post), or default it
[23:00:29] if the '!num' modified is specified and the value is "nan".
[23:00:34] modifier
[23:01:26] do you have any clue what's up with that request?
[23:02:08] no
[23:02:23] it looks like 95% of the time it is from that IP
[23:02:38] usually to pt.m.wikimedia.org
[23:02:48] but there are others
[23:04:07] hm, yeah i guess we can do that
[23:04:16] null is ok in json?
[23:04:35] yeah, guess so
[23:04:47] if it parses in json it should be fine in hive
[23:04:52] so yeah, null is better than nan
[23:06:16] should we do null or default?
[23:07:01] i think null is good
[23:07:12] i just did a very limited check
[23:07:17] null parses in jq fine
[23:07:20] whereas nan didn't
[23:07:38] jsonlint.com says {"field": null} is valid
[23:07:43] cool
[23:07:43] yeah
[23:07:45] should be then
[23:07:47] i think null is good
[23:07:51] which makes sense, since null is a keyword in javascript?
[23:07:54] kinda nice to know that this is some weird request
[23:08:03] hmmmmm, but hm
[23:08:08] i wonder if hive will crap out
[23:08:12] because null isn't a float
[23:08:14] null or "null"?
[23:08:28] hive craps out if it has to parse a bigint vs an int (long vs int) :/
[23:08:38] if we did null, then just null
[23:08:40] not "null"
[23:08:41] that would be a string
[23:08:42] hm
[23:08:54] hmm, maybe default is better
[23:10:32] http://stackoverflow.com/questions/1423081/json-left-out-infinity-and-nan-json-status-in-ecmascript
[23:13:22] hmm aye
[23:13:28] ok, i gotta run soon
[23:14:03] I will push a review with just null
[23:14:07] ok?
[23:14:35] yeah i think so, i'm worried that hive will crap out, but at least the json will parse
[23:14:42] and i think null is the right thing to do
[23:14:46] or should I prefer default, and if no default, do null
[23:14:52] hm
[23:15:02] if you wait until tomorrow, i can check if hive craps out
[23:15:06] i think that null is the right thing to do
[23:15:08] okay sure
[23:15:21] and i kiiinda think that varnishkafka shouldn't be beholden to hive's problems
[23:15:23] buuut, dunno
[23:15:34] vk should produce valid json, which it doesn't now
[23:15:37] yeah
[23:15:42] and null is valid
[23:16:14] ok gotta run
[23:16:17] laataaaas
[23:16:32] see ya, good bug hunting!
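The "very limited check" described above (null parses, a bare nan does not, and quoted "null" would be a string) can be reproduced with a stock JSON parser, e.g. Python's stdlib one:

```python
import json

# The shape vk currently emits vs. the proposed fix.
invalid = '{"time_firstbyte":nan,"dt":"2014-01-02T22:02:05"}'
valid   = '{"time_firstbyte":null,"dt":"2014-01-02T22:02:05"}'

def parses(text):
    """True if text is accepted as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

# Bare lowercase nan is rejected; null parses and maps to Python's None,
# while "null" (quoted) would parse but come back as the string "null".
```

This is the JSON-validity side only; whether Hive's serde then tolerates a null where it expects a float is the separate question ottomata planned to check.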
[23:19:22] Snaps: bson seems to allow serialization of these edge-case values
[23:19:26] ofc, bson is not json
[23:19:35] and I don't know of a bson serde for hive
[23:20:04] I think faidon wrote bson support for vk, but yeah, there's the other side as well :)
[23:20:05] http://bsonspec.org/#/specification
[23:20:35] faidon wrote bson for vk? hm, that's nice
[23:20:42] but he never pushed it
[23:21:16] if he would, would it solve the problem?
[23:22:52] I don't see nan specified for bson
[23:23:30] but this isn't much of a problem to solve; it's three lines of code in vk to get it to output correct json. :) The problem was verifying where the nan came from in the first place
[23:24:36] bson has null and undefined (but apparently this one is deprecated)
[23:24:40] yes, it doesn't have nan, you're right
[23:26:05] I need to go to bed. See you average!
[23:26:12] gnight