[09:31:26] New patchset: Stefan.petrea; "Fixed segfault" [analytics/webstatscollector] (master) - https://gerrit.wikimedia.org/r/56902
[09:33:03] oh nice
[09:33:07] the bot is working
[09:34:36] !log notify ottomata that webstatscollector segfault was fixed in https://gerrit.wikimedia.org/r/56902
[09:34:42] !log test
[13:24:55] ottomata: !
[13:24:58] ottomata: good morning :)
[13:25:02] ottomata: how was your weekend ?
[13:25:16] morning!
[13:25:17] good weekend!
[13:25:24] nice :)
[13:25:28] how was yours?
[13:25:38] ottomata: it was ok, thanks :)
[13:25:41] ottomata: I solved the segfault
[13:25:46] ottomata: https://gerrit.wikimedia.org/r/56902
[13:27:43] awesome!
[13:27:46] I will try it out in just a bit then!
[13:28:02] ok
[13:33:48] Change merged: Ottomata; [analytics/webstatscollector] (master) - https://gerrit.wikimedia.org/r/56902
[13:53:55] moooorning
[13:54:16] morning drdee
[13:55:53] morning!
[13:56:02] average_drifter, no segfaults so far!
[13:56:04] yay!
[13:56:15] morning ottomata!
[13:56:15] how long does it take for the dumps directory to show up?
[13:56:20] morning average_drifter!
[13:56:36] i belief every hour ottomat
[13:56:38] a
[13:56:53] and maybe at the hour
[13:56:59] so that would be in a couple of minutes
[13:57:39] k
[13:59:51] ottomata: :)
[14:00:04] drdee: hi
[14:00:13] yeahhh I got dumps!
[14:00:21] oh they are old dumps :
[14:00:31] ottomata: does that mean it segfaulted ?
[14:00:35] no new dumps yet
[14:00:35] nono
[14:00:38] just hasn't dumped
[14:00:48] no segfaults yet at all
[14:00:53] before it segfaulted like right away
[14:00:55] now it runs
[14:00:59] so I expect that it works
[14:01:01] drdee: I have news
[14:02:29] shoot
[14:03:19] http://stat1.wikimedia.org/spetrea/new_pageview_mobile_reports/r46-updated-logic/out_sp/EN/TablesPageViewsMonthlyOriginalMobile.htm
[14:03:27] yes i saw that
[14:03:53] numbers are at least 50% too low :(
[14:04:08] maybe we should do a hangout, invite milimetric as well
[14:04:12] and you talk us through the code
[14:05:16] yes, moment, I have to finish up some docs for it, then I'll send a link for the hangout
[14:05:35] ok
[14:05:38] oh! average_drifter, new rule:
[14:05:43] when anyone says "hangout?"
[14:05:47] it means we all go into the standup
[14:05:49] so bookmark that
[14:05:54] good point milimetric
[14:06:08] ok, so the hangout link is the one for the standup
[14:06:09] #efficiency #solvingfirstworldproblems
[14:51:43] average_drifter: any update?
[15:22:40] morning kraigparkinson
[15:23:25] still waking up… :p
[15:23:28] gm drdee
[15:24:31] I'm ready
[15:26:18] drdee , milimetric I'm in the hangout :)
[15:26:38] get started without me while i am running to get coffee :)
[15:26:58] ok
[15:27:35] gr, weird, no pinging going on with my irc anymomre
[15:27:37] brt average_drifter
[15:28:44] brt ?
[15:28:54] be right there
[15:28:55] ok
[15:43:52] drdee, why would we discard a pageview if it comes from mobile?
[15:44:09] ??
[16:02:35] drdee , milimetric thanks for giving your oppinion on this :)
[16:02:45] np
[16:08:26] so scrum is in 1h
[16:17:33] good morning, kind sirs
[16:18:29] milimetric: i didn't get to running that job on friday
[16:18:32] due to lack of brain
[16:18:39] but!
[16:18:43] i can tell you my idea
[16:19:19] i was going to incrementally broaden my search using kraken
[16:19:30] start with all UAs that contain "wikimedia" at all
[16:20:13] (user_agent MATCHES ".*(?i:Wikimedia).*")
[16:20:21] and count that
[16:20:31] if that differs significantly
[16:20:45] then i know my problem is somewhere in my filtering
[16:20:52] if it looks similar to the original results
[16:20:59] we are probably missing the data
[16:24:40] hey dschoon, just grabbed lunch
[16:24:44] word
[16:25:07] diederik suggested the sensible thing: i'm installing the android app and hitting the site
[16:25:13] heh
[16:25:15] then i'll grep for my ip address
[16:25:16] sure
[16:25:29] ...you'll need to hit the site a few thousand times to be sure
[16:25:40] oh, the mobile site
[16:25:42] ah
[16:25:44] yes, sure
[16:25:55] it should be in Kraken within 15 minutes right?
[16:26:05] yes
[16:26:14] k, gonna eat lunch and get to it
[17:01:00] ottomata scrum
[17:01:07] average_drifter scrum
[17:01:13] doh doh
[17:03:27] ok
[17:12:57] ottomata: http://localhost:8888/oozie/list_oozie_workflow/0001814-130321155131954-oozie-oozi-W/
[17:12:57] click on the log tab
[17:14:31] org.apache.openjpa.persistence.PersistenceException: Data truncation: Data too long for column 'conf' at row 1
[17:16:08] is the term "unsampling" correct ?
[17:16:16] what do you call it like from a statistics PoV ?
[17:17:52] like for example if you sampled 1:100 , but then you need to multiply back by 100
[17:18:04] extrapolating? :)
[17:18:17] kraigparkinson: extrapolating ! that's it, thanks
[17:33:50] i think it's interpolate, actually
[17:33:53] when you go up, you interpolate
[17:34:09] (up := fill in missing data by inference)
[17:34:12] thanks, dschoon
[17:34:30] when you go out (:= infer new dimensions) you extrapolate
[17:34:43] usually you extrapolate time
[17:38:18] (such is my understanding)
[17:41:23] milimetric: we need to fix the datasources page, 4 realz
[17:41:42] i saw, matt walker and many others need it
[17:41:45] yes!
[17:42:18] so the problem was that saving the datasource now fails because of a cyclical reference during JSON serialization
[17:42:21] you said you had a hunch
[17:42:25] yeah.
[17:42:27] but also
[17:42:40] the column headings now show the toString of functions
[17:42:43] so it's more than just that
[17:44:57] that i think i fixed
[17:45:17] i'll make sure everything's pushed everywhere
[17:45:50] dev still shows it
[17:45:53] test does not
[17:45:58] but test doesn't have the vis options
[17:48:08] yea, I didn't deploy it
[17:48:12] it's fixed though
[17:48:51] lmk when it's up on dev
[17:49:08] dschoon, ottomata, drdee, I sent you all a response to my previous investigation email. What do you think?
[17:49:15] looks like we're not getting the traffic into kraken
[17:49:25] any reason for skepticism of my method?
[17:49:26] will check it out
[17:49:37] reading
[17:50:12] [travis-ci] develop/87b285e (#115 by milimetric): The build passed. http://travis-ci.org/wikimedia/limn/builds/5963267
[17:51:20] maybe run the query against the entire day? just to be sure?
[17:52:19] ok dschoon, deployed: http://dev-reportcard.wmflabs.org/datasources
[17:52:48] running against the whole day would mean my data traveled back in time but I suppose anything could happen right drdee?
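
dschoon's plan above — first count every user agent that mentions "wikimedia" at all, then narrow — can be dry-run outside Kraken against a sampled udp2log file, which also illustrates the "multiply back by the sampling factor" extrapolation question that comes up just after. A rough bash sketch; the log path is one that appears later in this log, and the 1:1000 factor is inferred from its name:

#!/usr/bin/env bash
# Count UA matches in a 1:1000-sampled log and scale the count back up.
# Mirrors the spirit of the Pig filter user_agent MATCHES '.*(?i:Wikimedia).*'.
SAMPLED_LOG=/a/squid/archive/sampled/sampled-1000.tab.log-20130208.gz
SAMPLE_FACTOR=1000

# grep -ci: count matching lines, case-insensitively.
matches=$(zcat "$SAMPLED_LOG" | grep -ci 'wikimedia')

echo "sampled matches: $matches"
echo "estimated total: $(( matches * SAMPLE_FACTOR ))"
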
[17:52:49] :)
[17:53:59] timezones, milimetric timezones :D
[17:54:11] i also said just to be sure :)
[17:54:34] obviously it shouldn't matter but just ruling out alternative expanations
[17:54:39] right, true
[17:54:54] so timezone wise though, it'd be good to know for sure that kraken's on GMT
[17:54:57] that's true right?
[17:55:03] kraken us UTC
[17:55:04] *is
[17:56:54] this line is correct too right?
[17:56:54] grunt> dan_log_fields = FILTER log_fields BY remote_addr MATCHES '.*96\\.227\\.53\\.90.*';
[17:57:21] as in, there's no way my IP address would be in there but not get caught by this
[17:59:36] ok drdee, of 142 million records, 0 match that
[17:59:43] (looking at the whole day)
[18:00:27] gonna do the monthly reportcard stuff now, but if nobody finds fault with this, I'll talk to ops to pick their brain
[18:00:44] milimetric: yes, that looks right
[18:00:57] hm
[18:00:59] actually
[18:01:12] that first match is greedy
[18:01:13] try this:
[18:01:38] dan_log_fields = FILTER log_fields BY remote_addr MATCHES '\\s*96\\.227\\.53\\.90\\s*';
[18:01:48] because i think the first .* was greedily eating the rest of the string
[18:02:02] heh, no
[18:02:10] then .*anything would never match anything
[18:06:35] right
[18:06:37] it might not!
[18:06:44] dschoon, you merged the metric defs stuff into master reportcard branch?
[18:06:49] i forget.
[18:06:50] probably?
[18:06:52] yeah
[18:06:55] hm... predicament
[18:07:04] we should not do that until we launch a feature officially
[18:07:30] because now we might have to deploy latest to prod
[18:08:18] mk, so re: the regex, I'll try it because hey it's crazy day here at WMF. But if that's the case I'm never using regex in pig ever again. Or pig for that matter :)
[18:08:56] milimetric, iiinteresting! I just read your email
[18:09:03] yea
[18:09:13] whatever's happening, its probably not ops' fault, it would be ours
[18:09:17] so don't run off to them yet :p
[18:09:22] oh definitely not their fault
[18:09:32] my approach was gonna be: we're stuck, can you help?
[18:10:15] it's obviously on our side so I'll happily exhaust other options you can think of
[18:10:26] well, let's see if we can figure it out first,
[18:10:32] I suppose my IP could be wrong.. lemme vet that
[18:10:33] first, i'd try just matching exactly with = instead of regex
[18:10:37] if you are looking for a single IP
[18:10:39] k
[18:10:40] yeah
[18:10:44] i was about to say that
[18:10:55] grunt> dan_log_fields = FILTER log_fields BY remote_addr = '96.227.53.90';
[18:10:57] also, the import is funky, there's no guaruntees about order or time buckets
[18:11:03] milimetric: http://tire.less.ly/asking/wtf-is-my-ip/but-in-plaintext-pls/
[18:11:04] so, you don't need to grab the whole day
[18:11:14] or http://tire.less.ly/asking/wtf-is-my-ip/ :)
[18:11:18] but, grab maybe an hours worth of time around when your request was made, just to be safe
[18:11:42] no that's what i was doing first but drdee said to do the whole day
[18:11:53] 'cause he believes in time travel and has little faith in timezones :P
[18:11:59] http://dev-reportcard.wmflabs.org/graphs/donationdata-vs-day
[18:12:39] and all the imports had happened for that hourish time period you looked in?
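
ottomata's suggestion — match the IP exactly rather than with a regex — written out as a self-contained batch job instead of a grunt session. A sketch only: the HDFS path and column layout are invented, not the real Kraken import schema, and note that equality in a Pig FILTER is spelled ==, not the bare = typed above:

# Hypothetical batch version of the grunt session; adjust path and schema.
pig <<'EOF'
log_fields = LOAD '/wmf/raw/webrequest/2013-04-01/*'
    USING PigStorage('\t')
    AS (hostname:chararray, seq:long, dt:chararray,
        remote_addr:chararray, uri:chararray);

-- Exact match: no greedy-regex surprises, and cheaper than MATCHES.
dan_hits = FILTER log_fields BY remote_addr == '96.227.53.90';

hit_count = FOREACH (GROUP dan_hits ALL) GENERATE COUNT(dan_hits);
DUMP hit_count;
EOF
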
[18:12:51] works for me dschoon
[18:12:55] also, we need to be sure your requests were going to one of the 4 varnish hosts we are importing from
[18:12:59] milimetric, let's try this real quick
[18:13:08] not sure how i know the imports happened or not
[18:13:15] i'm going to turn on unsampled udp2log on an09 for a, can you make some requests?
[18:13:18] is fs -ls enough?
[18:13:33] yeah, if the files you are looking for are there
[18:13:34] the imports happened
[18:13:47] ok, say when and i'll hammer it at like 1 request / second :)
[18:13:47] also, we could look in the kafka buffer too
[18:13:47] k one sec
[18:13:55] k, then they're there because it said there were like 9 million records it looked through
[18:14:16] right but, you have a wildcard for a few imports
[18:14:17] like
[18:14:26] maybe the latest 15 minute interval that you needed ahdn't been imported yet
[18:14:34] I had 16:45* and 17*
[18:14:48] and i did requests from 16:50 onward
[18:15:23] actually
[18:15:44] you had all of 17:00-17:45? or just 17:00?
[18:15:51] is there a way for you to know based on what you get back what varnish host you were serviced by?
[18:15:56] yes
[18:15:58] look at header
[18:16:06] it has X-Cache host or somethign
[18:16:27] X-Cache: cp1043 frontend miss (0)
[18:16:36] milimetric, real quick before I turn this on
[18:16:36] do:
[18:16:39] milimetric: can you look at that?
[18:16:47] curl --head http://
[18:16:54] and look at value of X-Cache
[18:16:58] from my phone?
[18:17:23] 17* to your previous question, meaning all of 17
[18:17:25] yeah
[18:17:33] ok, hang on, gotta find a terminal emulator
[18:17:35] heh
[18:17:46] oh ha
[18:17:57] or uh hm
[18:19:16] heh, balls, this emulator has no curl!!
[18:19:16] :)
[18:19:16] milimetric: what URL are you using?
[18:19:16] sec
[18:19:31] i browsed randomly, star trek, international space station, etc.
[18:19:37] using the WMF app, right?
[18:19:39] yes
[18:19:44] version 1.3.4
[18:20:05] milimetric, if finding a way to get headers on your phone is going to be annoying, let's do udp2log thing first
[18:20:08] lemme know if you are ready to browse
[18:20:23] rdy
[18:20:28] k go
[18:20:40] do as much as possible, this gets huge fast!
[18:20:44] gonna shut it of in a few seconds
[18:20:49] 1G already
[18:20:54] i'll shut it off at 2G
[18:21:06] you got some requests in there?
[18:21:14] milimetric? ^\
[18:21:21] oh yea
[18:21:26] a bunch
[18:21:29] ok cool
[18:21:31] and what is your IP/
[18:21:31] ?
[18:21:38] i'll double check it, one sec
[18:22:37] 96.227.53.90
[18:23:21] so if they were cached, would the request still go through to kraken?
[18:23:36] when i was looking through the logs, I remember a TON of miss/200 statuses
[18:23:49] and very few hits, and those were very weird requests
[18:24:00] it shouldn't matter if it's cached
[18:24:09] it still proxies through the varnish boxes
[18:24:36] yeah, that's my first suggestion though - see if we can find any examples of cache hit 200s in kraken
[18:24:56] from mobile that is, so UA MATCHES 'WikipediaMobile/.*Android'
[18:25:26] i suggest you always use .*?
[18:25:28] not .*
[18:25:36] the greedy version is rarely what you want :P
[18:26:13] i know all about .*? and it isn't the witch you are hunting
[18:26:19] yes
[18:26:46] hm, I don't see that IP in the logs I captured milimetric
[18:26:57] ah wait
[18:26:58] yes I do
[18:26:59] hm
[18:27:02] hang on
[18:27:12] #edgeofseat
[18:27:16] udp-filter just diidn't catch it!
[18:27:17] grep does
[18:27:27] 25 requests
[18:27:35] yeah, that sounds about right
[18:27:43] but
[18:27:44] this last time i made only four or five
[18:27:49] welp
[18:27:51] before i made maybe 4 groups of 5
[18:27:52] screwed, i think.
[18:27:53] https://github.com/wikimedia/WikipediaMobile/blob/master/assets/www/js/app.js#L124
[18:28:06] it hits $lang . $project .org
[18:28:12] which means en.wikipedia.org
[18:28:16] rigiht
[18:28:17] not en.m.wikipedia.org
[18:28:21] none of these are on cp1041-44
[18:28:24] correct.
[18:28:27] grep 96.227.53.90 webrequest.log | awk '{print $1}' | sort | uniq -c
[18:28:27] 2 cp1001.eqiad.wmnet
[18:28:27] 3 cp1002.eqiad.wmnet
[18:28:27] 1 cp1006.eqiad.wmnet
[18:28:27] 1 cp1007.eqiad.wmnet
[18:28:27] 3 cp1008.eqiad.wmnet
[18:28:27] 1 cp1009.eqiad.wmnet
[18:28:28] so we don't always get the data.
[18:28:28] 1 cp1011.eqiad.wmnet
[18:28:28] 1 cp1013.eqiad.wmnet
[18:28:29] 1 cp1014.eqiad.wmnet
[18:28:29] 1 cp1016.eqiad.wmnet
[18:28:30] 1 cp1017.eqiad.wmnet
[18:28:38] so there's the answer.
[18:28:42] ugh
[18:28:46] there are hundreds of machines it might be on.
[18:28:57] so, what's the prob? this is the api?
[18:28:58] https://github.com/wikimedia/WikipediaMobile/blob/master/assets/www/js/app.js#L124
[18:29:04] yep
[18:29:07] the app doesn't hit .m.
[18:29:12] well, that's ok
[18:29:19] that's ok?
[18:29:19] it's not, really
[18:29:20] well, wait
[18:29:28] kraken only imports hits to cp1041-44
[18:29:32] it's not just the .m. part, these lines aren't in kraken ANYWHERE
[18:29:36] those are the machines that handle .m.
[18:29:41] right
[18:29:41] of course not.
[18:29:43] because i looked for purely the IP
[18:29:46] right
[18:29:51] kraken only gets cp1041-44
[18:29:55] cp1041-44 are .m.
[18:30:00] oh!
[18:30:03] i didn't know
[18:30:04] we are importing all logs from only 4 varnish hosts, because we were told all mobile traffic goes through those
[18:30:04] yes.
[18:30:13] k, so we don't have the data
[18:30:13] why not?
[18:30:17] we have all mobile site data
[18:30:19] not api
[18:30:22] we were probably told "all mobile *site* traffic" does
[18:30:24] exactly.
[18:30:27] right
[18:30:34] api traffic is evenly spread across the cluster
[18:30:35] but why are we not importing everything else?
[18:30:37] not allowed on MVC?
[18:30:37] well, probably not, actually
[18:30:39] kraigparkinson: drdee: need status updates for https://www.mediawiki.org/wiki/Wikimedia_engineering_report/2013/March#Analytics
[18:30:51] api traffic is a specific subset
[18:31:06] proof of concept, keeping traffic imports low to make sure we don't mess stuff up, also, kafak hadoop consumer is hacky, so we wanted to reduce potential problems
[18:31:08] we used to take everything, but we turned it off for ephemeral reasons
[18:31:09] it would be so nice if a limited number of servers were dedicated to api traffic
[18:31:16] robla: working on it here: http://etherpad.wmflabs.org/pad/p/AnalyticsMonthlyEngineeringReport
[18:31:33] robla, by when do we need to have it buttoned up?
[18:31:54] milimetric, do we need api traffic for our upcoming deliverables?
[18:32:01] ok, so drdee, kraigparkinson - do we abandon card 92 right now or do we launch an initiative to turn on the firehose?
[18:32:11] yes ottomata, card 92
[18:32:25] i'll change it to blocked.
[18:32:33] we can't turn on the firehose on mvc
[18:32:50] kraigparkinson: today was the main deadline. Check in with Sumana now for exact timing
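
The X-Cache check that settled this is easy to script: pull the serving host out of the response headers and compare it against the four hosts whose logs Kraken imports. The cp1041-44 list comes straight from the exchange above; the URL and everything else is illustrative:

#!/usr/bin/env bash
# Which varnish frontend served this request, and does Kraken import it?
URL="${1:-http://en.m.wikipedia.org/wiki/Star_Trek}"

# The header looks like: "X-Cache: cp1043 frontend miss (0)"
host=$(curl -sI "$URL" | awk 'tolower($1) == "x-cache:" {print $2}')
echo "served by: $host"

case "$host" in
    cp1041|cp1042|cp1043|cp1044) echo "imported into Kraken" ;;
    *) echo "NOT imported into Kraken" ;;
esac
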
[18:32:51] ok, then it's abandoned unless there are any objections
[18:32:53] i also added a note about .m. and the API
[18:33:00] thanks robla
[18:33:01] milimetric: please update drdee and kraigparkinson
[18:33:05] will do
[18:33:10] drdee, kraigparkinson, hangout?
[18:33:14] :) they'll decide, but yeah, it's blocked for us
[18:33:41] milimetric, sure
[18:33:46] got one?
[18:33:53] remember, that's our codeword
[18:33:56] it means standup link
[18:34:07] https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc
[18:34:18] everyone should bookmark that ^^
[18:35:18] drdee: writing an RT ticket to change the UMAPI access privs, do you have a list of private DBs handy? Otherwise we can reuse the toolserver censorship table
[18:36:19] scrap that, I'll use the latter
[18:38:21] DarTar: can't you just use dblists?
[18:39:27] jeremyb_: I think the list that I have for the TS is the most conservative
[18:39:58] but we can replace it later as needed
[18:39:58] DarTar: i don't follow
[18:41:40] ottomata: i really hate this error. http://localhost:8888/oozie/list_oozie_coordinator/0001820-130321155131954-oozie-oozi-C
[18:41:47] i'm telling you this because it's extremely misleading
[18:42:15] what it's saying is that when oozie when to materialize the output dir, it checked the DoneFlag and determined the dirs all existed
[18:42:21] which caused it to omit the property
[18:42:32] then the workflow bitched because it didn't get the property
[18:42:34] which is all FINE
[18:42:47] but it still shows up as a failed job
[18:43:31] say again?
[18:44:09] E0738: The following 1 parameters are required but were not defined and no default values are available: dataOutput
[18:44:11] that is a lie
[18:44:15] well
[18:44:17] it's true
[18:44:19] but it's misleading
[18:44:30] it means the coordinator calculated no materialized output path
[18:44:38] oh because they DoneFlag existed?
[18:44:49] so it thinks there's nothing to do?
[18:45:30] exactly.
[18:45:35] but it still shows up as a failed job
[18:45:43] because it tries to validate the workflow parameters
[18:45:45] and fails
[18:45:55] like, yes, i know that dir exist
[18:45:58] *exists
[18:46:06] so how about you just skip it and do the next one, okay oozie?
[18:46:07] sigh
[18:46:24] ah yeah
[18:47:20] well shit
[18:47:26] PersistenceException: Data truncation: Data too long for column 'conf' at row 1
[18:47:48] ah poo, even with the vars?
[18:48:01] yes.
[18:48:10] http://localhost:8888/filebrowser/view//libs/kraken/oozie/mobile/device/props/workflow.xml
[18:48:14] scroll to the bottom
[18:48:23] i even resubmitted the coordinator to make sure it reloaded everything
[18:48:29] gragh
[18:48:31] that's awful.
[18:48:47] i think that means i really have to chain coordinators together or something awful like that
[18:48:58] can we look at the db or something
[18:49:03] how big is that stupid column?
[18:49:55] hmmm
[18:49:57] we can look at the db yeah
[18:49:58] one se
[18:49:59] c
[18:50:01] its on an27
[18:50:44] i'm going to attempt to read the source
[18:50:44] https://gist.github.com/ottomata/5286849
[18:50:48] but... there's a LOT of it
[18:50:55] too long for TEXT column???
[18:51:01] ...text?!
[18:51:07] that is absurd.
[18:51:15] is there a logfile on an27?
[18:51:23] there has to be something
[18:52:23] ja i'm sure, i'm reading stuff too
[18:52:38] oozie-jpa.log looks propmising...
[18:52:42] *promising
[18:53:10] (i'm also in ops meeting)
[18:53:13] it is not.
[18:53:16] (okay)
[18:57:02] i betcha $dataInput is just too long
[18:57:08] and when you use it twcice it goes over the size
[18:57:15] but...
[18:57:16] text!
[18:57:17] also
[18:57:22] 2^16 is max size, unless max_allowed_packet is lower, that could cause problem too
[18:57:23] dataInput shouldn't be passed to the sub-job
[18:57:37] i turned OFF propagate conf
[18:59:40] the conf is very long, i'm looking at it
[18:59:42] in db
[19:00:13] hm, lots of duplicate things
[19:00:13] hm
[19:00:48] aaah dschoon
[19:00:54] yes
[19:00:56] that labels: in the datasources?
[19:01:00] that was loadbearing :)
[19:01:05] the DB schema is ... not obvious
[19:01:07] monthly import is broken
[19:01:09] load-bearing?
[19:01:17] proto_action_conf
[19:01:18] ?
[19:01:25] fixing, carry on
[19:01:33] drdee: monthly import just got fun
[19:01:46] of course it did
[19:02:01] milimetric: bookmarked...
[19:02:04] :)
[19:02:35] cool, so the phrase "hangout?" aims to induce a pavlovian response to jump into that link
[19:02:53] ottomata: ?
[19:03:00] to that end, we would like management to create scripts which deliver us yummy chocolates every time someone says that phrase
[19:03:26] yes, milimetric. I wish I could setup a custom parser for Colloquy to auto-translate that into a link. :)
[19:03:49] * drdee wonders if he should open a mingle card for that :D
[19:04:43] you can bookmark it in chrome
[19:05:11] I also wish I had something that would link Mingle cards using the #312 format.
[19:05:25] in irc, that is
[19:05:39] we could add that to the analytics-logbot actually
[19:05:44] ottomata: switching to gtalk, as it's too hard to follow what's going on in here
[19:08:27] drdee, thanks for working on the monthly report, please ping me when you'
[19:08:35] re done. (and robla)
[19:09:20] milimetric: what's the problem regarding reportcard?
[19:09:58] the yaml used to be "types: [], labels: []"
[19:10:15] now it's "columns: [{label: '', type: ''}, ...]
[19:10:33] so there is at least one spot where i have to make it backwards compatible
[19:10:54] I'd say "nothing crazy" but I'd have learned nothing!
[19:10:56] :)
[19:15:50] average_drifter, ottomata: has the filter segfault bug been resolved?
[19:16:04] i think not
[19:16:14] in ops meeting, helping dschoon, going to get average_drifter more data soon
[19:16:29] ok
[19:16:32] (ottomata is very helpful, as always)
[19:20:12] kraigparkinson: do you want to mention the analytics intensive session in the report?
[19:20:30] sure
[19:31:31] kraigparkinson: ookokokok
[19:31:52] :) hangout?
[19:31:57] looking...
[19:32:01] the usual?
[19:32:05] yep
[19:32:22] k am there
[19:39:00] hooray, thanks to otto, concat worked.
[19:39:08] now i need to run an ad-hoc job to backfill.
[19:45:09] hm. where is erosen?
[19:50:26] drdee, you can link https://www.mediawiki.org/wiki/Analytics/Roadmap/PlanningMeetings/AnalyticsReboot for that part of the report.
[19:50:48] yayyyy
[19:50:50] yup
[19:51:02] funny that I googled the problem and found a note from my past self on the internet
[19:51:09] dschoon :()
[19:51:11] :)
[19:51:17] haha
[19:55:56] kraigparkinson: is it possible to add watchers to a mingle ticket ?
[19:58:24] hm, average_drifter, finding the offending webstats input this time is gonna be difficult, it does segfault that often
[19:58:27] maybe once an hours ish, maybe more
[19:59:02] average_drifter; yes you can subscribe to a card
[19:59:41] mk. brb lunch
[19:59:43] ottomata: ok, not a problem, please ping me when it stumbles
[20:00:22] drdee, be right there.
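
The "how big is that stupid column?" question from the exchange above can be answered straight from information_schema on the an27 MySQL instance. A sketch only — the host, database name, and credentials are assumptions; the column names conf and proto_action_conf come from the conversation:

# Inspect the Oozie DB column sizes; connection details are assumptions.
mysql -h analytics1027 -u oozie -p <<'EOF'
-- A TEXT column tops out at 2^16 - 1 = 65535 bytes, which would explain
-- the "Data too long for column 'conf'" truncation error.
SELECT table_name, column_name, data_type, character_maximum_length
FROM information_schema.columns
WHERE table_schema = 'oozie'
  AND column_name IN ('conf', 'proto_action_conf');
EOF
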
[20:01:05] drdee: ok, wanted to add ottomata as watcher to #500
[20:01:18] ok, dunno if that's possible
[20:01:24] he has to do that himself i think
[20:02:06] ok
[20:02:58] waaat i do?
[20:03:10] oh got it
[20:06:25] hmmmmmmm, average_drifter, i'm not sure how to figure out what is causing segfaults
[20:07:04] ok, then I'll run it against the latest /a/squid/archive/sampled
[20:08:50] yeah
[20:08:51] i just ran it against a 2G sampled file
[20:08:54] no segfault
[20:09:28] i'd need to capture unsampled data until it finally segfaults, to be sure I capture whatever is causing it
[20:10:22] so, it last restarted at 19:31
[20:10:24] 40 mins ago
[20:10:30] we'll see if it dies in the next 20 mins or so
[20:10:42] collector is the hourly dumper, right? not filter?
[20:10:52] if this thing lives for the next 20 mins or so
[20:10:55] we should se enew dumps, right?
[20:11:15] yes
[20:11:29] k
[20:16:39] just ran it on all tab-separated logs
[20:16:47] we'll see if it breaks real soon
[20:24:06] back
[20:27:48] dschoon, just curious, have you solved the limnify step with your coelesce stuff?
[20:27:59] what does it do, precisely?
[20:28:13] output data in format that limn can graph
[20:28:22] which would then get synced to kraken-public
[20:28:27] and graphs on reportcard could point at urls there
[20:28:52] coalesce does the same thing as concat_sort.pig
[20:28:57] except without OOM
[20:28:57] heh
[20:29:06] concat sort pit output a single file though
[20:29:10] you've got a file per month, right/
[20:29:12] ?
[20:29:14] this does also.
[20:29:23] it updates the daily file every time it runs
[20:29:25] where's the file?
[20:29:31] for which, zero?
[20:29:39] zero has been disabled since i started on it
[20:29:40] naw, for whatever, just wanna see
[20:29:41] device props
[20:29:45] i was testing with device props
[20:29:51] you have the link...
[20:29:58] http://stats.wikimedia.org/kraken-public/webrequest/mobile/device/props/2013/04/
[20:30:05] two files though
[20:30:05] http://stats.wikimedia.org/kraken-public/webrequest/mobile/device/props/2013/
[20:30:07] 03, 04
[20:30:09] csv: http://stats.wikimedia.org/kraken-public/webrequest/mobile/device/props/2013/04/mobile_device_props-2013-04-01.tsv
[20:30:14] right.
[20:30:18] ottomata: got it
[20:30:20] /a/squid/archive/sampled/sampled-1000.tab.log-20130208.gz
[20:30:20] one per month
[20:30:23] segfaults on this one ^^
[20:30:24] er, one per day
[20:30:30] i haven't done per-month aggregates yet
[20:30:33] concat sort made a single file total
[20:30:38] i know.
[20:30:40] this does also.
[20:30:44] it runs at the end of each hourly job
[20:30:53] with the output target aggregating the day
[20:31:34] so each hour it reconcats everything from the previous day
[20:31:39] er, from that day
[20:32:27] i can add a monthly one as well.
[20:32:35] hm, ok, but don't you want a a single file? i'm not talking about aggregates
[20:32:37] but a single file wiith all hourly data
[20:32:38] (now that we're sure it actually works)
[20:32:42] can limn do multiple files?
[20:32:45] not all ever
[20:32:53] that file would be huge.
[20:33:12] but anyway
[20:33:13] ok, help me answer a more general question
[20:33:14] for zero
[20:33:18] this is kidna part of the monitoring stuff
[20:33:21] right
[20:33:29] i want to graph this
[20:33:29] http://stats.wikimedia.org/kraken-public/webrequest/loss/
[20:33:44] ganglia or limn, don't care, i thikn graphing that with ganglia is going to be hacky
[20:34:09] oh, i pivot on hostnames and keep average, which is hte last field
[20:34:15] yes.
[20:34:17] (or, I want to)
[20:34:18] i was going to say
[20:34:38] the big thing is that doing a pivot is a huge pain with any of these tools.
[20:35:06] i think we're going to need to put actual support for pivoting in limn
[20:35:32] but, that could get messy too, unless you've got some caching support built in too
[20:35:39] lots of stuff to do for every reload
[20:35:44] indeed.
[20:35:52] sounds naastaaayyyyy
[20:35:59] not too bad so long as the data isn't large.
[20:36:13] for data the size of most of our CSVs, it's 10s of ms
[20:36:36] we walk the whole dataset in order to apply date-based filters
[20:36:53] (we always apply timeseries:{ start, end }
[20:36:55] )
[20:37:17] average_drifter, what was it?
[20:37:41] brb a moment
[20:39:18] ottomata: I found a .gz it's segfaulting on, now I'm doing binary search on it to find the row
[20:39:34] aye cool
[20:39:34] heheh
[20:39:40] that's what I did to find your row too!
[20:39:52] i did it manually w sed, how are you doing it?
[20:40:17] zcat .gz | wc -l to find out how many rows
[20:40:23] then I take that number slice it in two
[20:40:26] then I do stuff like
[20:40:35] zcat .gz | head -HALF | ./filter
[20:40:39] zcat .gz | tail -HALF | ./filter
[20:40:47] aye ja
[20:40:52] i I used tail, head too
[20:40:53] hehe
[20:45:18] I could write something to do it automatically... although considering I have to do at most log(7266786) ~ 22 tries ...
[20:45:26] it will be ok to do it manually ..
[20:46:35] ottomata: did you write something to do it for you ?
[20:46:59] naw
[20:47:05] did it manually
[20:49:56] ottomata: there is one more speedup which you can do
[20:49:58] man -k "binary search"
[20:50:05] ottomata: have the two halfes start at the same time
[20:50:13] ottomata: and if one segfaults before the other, it will notify the other
[20:50:22] ottomata: and the other one stops
[20:50:29] lots of man -S 3
[20:50:35] sadly, no -S1
[20:51:13] other than git-bisect :)
[20:51:27] ahha, it doesn't take that long though
[20:51:37] :D yeah, just sayin :)
[20:53:21] so, ottomata
[20:53:28] what else did you need from me re: concat?
[20:54:36] oh
[20:54:53] i'm really not sure what to do
[20:55:00] i'd really like a graph of the loss percentages
[20:55:04] even if it is averaged across all hosts
[20:55:18] it'd be reallly nice if that was in ganglia, i'm cahtting with ori-l righ tnow about how/if that is possible or a good idea
[20:55:20] its kinda weird
[20:55:20] let me walk you through coalesce-wf.xml
[20:55:23] and maybe that will help
[20:55:26] hmm, ok
[20:55:34] --> gtalk
[21:14:37] drdee, do you remember the out of bounds exception but in maxmind geoip?
[21:14:43] http://forum.maxmind.com/viewtopic.php?f=13&t=9353
[21:14:54] yes
[21:15:12] we compiled 1.2.9-snapshot-2 to fix that
[21:15:16] dschoon is hitting it, or something very similar
[21:15:16] http://localhost:19888/jobhistory/logs/analytics1011:8041/container_1364239892421_3913_01_000012/attempt_1364239892421_3913_m_000010_0/stats
[21:15:17] not sure
[21:15:37] let me look at what jars are loaded.
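
One quick way to act on "let me look at what jars are loaded" before rerunning anything: list which MaxMind classes a given jar actually bundles, and hunt for other geoip jars that could shadow it. A sketch — the jar name is the one from the REGISTER line quoted below; the search paths are illustrative:

#!/usr/bin/env bash
# Which MaxMind classes does this jar bundle?
JAR=geoip-1.2.9-patch-2-SNAPSHOT.jar

# LookupService is the class throwing at line 905 in the stack trace.
unzip -l "$JAR" | grep -i 'com/maxmind/geoip'

# Look for any other geoip jars lying around that might shadow it.
find "$HOME" /usr/lib/hadoop* -name '*geoip*.jar' 2>/dev/null
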
[21:15:41] 2013-04-01 21:01:45,696 ERROR [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:stats (auth:SIMPLE) cause:org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error from UDF: org.wikimedia.analytics.kraken.pig.GeoIpLookupEvalFunc, Out of bounds access [254]
[21:15:42] that jar is for sure in my home folder on an10
[21:15:48] Caused by: java.lang.ArrayIndexOutOfBoundsException: 254
[21:15:48] at com.maxmind.geoip.LookupService.getLocation(LookupService.java:905)
[21:16:40] drdee: very first line is REGISTER 'geoip-1.2.9-patch-2-SNAPSHOT.jar'
[21:16:46] mmmmm
[21:16:52] that is the right jar
[21:17:34] no other geoip jars even show up in the classpath
[21:17:41] http://localhost:19888/jobhistory/logs/analytics1020:8041/container_1364239892421_3912_01_000002/attempt_1364239892421_3912_m_000000_0/stats/stdout/?start=0
[21:18:07] i switched to symlinking the databases though
[21:18:12] i will try without
[21:18:15] but that seems... silly
[21:19:46] i have to rerun the coordinator anyway for those times, so i'll reup it now.
[21:32:51] it seems that https://github.com/kohsuke/geoip has been updated to 1.2.9 so maybe upgrade the pom.xml from 1.2.5 to 1.2.9 and try again?
[21:33:39] hm.
[21:33:46] oh.
[21:33:54] because generic builds a jar with all its deps.
[21:33:56] ugh.
[21:33:57] fine.
[21:35:29] 1.2.9-patch1
[21:35:36] http://mvnrepository.com/artifact/org.kohsuke/geoip
[21:52:05] i asked a friend of mine who builds hadoop/hbase apps for a living what they recommend for scheduling
[21:52:13] he said he's never met anyone who uses oozie.
[21:52:25] :)
[21:52:34] that's comforting :/
[21:52:44] it feels about right.
[21:52:52] the pain points are just too obvious
[21:53:15] he said people usually write some python scripts and run them via cron.
[21:57:30] btw, drdee, i updated the jars in kraken-0.0.2
[22:10:00] how do you guys find that problem board?
[22:13:05] what problem board?
[22:13:16] what's the status of https://gerrit.wikimedia.org/r/#/c/50452/?
[22:13:41] you mean https://mingle.corp.wikimedia.org/projects/analytics/cards?favorite_id=756&view=%3EWIP+-+Problems ?
[22:13:48] yeah, found it, thank you, but how do you find it?
[22:13:50] there are no links
[22:13:53] there is
[22:14:02] it's called >WIP - Problems
[22:14:08] status of that drdee, is that mark said it looked great and wanted to do a final review
[22:14:11] that was early last week sometime
[22:14:18] WIP?
[22:14:24] See the little >> on the top right of the window, ottomata?
[22:14:27] work in progress
[22:14:36] I guess it's a << if its closed...
[22:14:51] Click the <<, then look at the list of Team favorites...
[22:14:52] brb
[22:14:56] ah favorites!
[22:14:59] k
[22:15:13] I can promote it to a tab if you think that'll be easier.
[22:15:24] (and if nobody else objects. :))
[22:17:01] cafe is closing
[22:17:02] i'm outty
[22:17:05] i updated a couple of cards
[22:17:06] lataas
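
average_drifter's manual head/tail hunt for the segfaulting row (the 20:39–20:51 exchange above) automates naturally, since log2(7266786) bounds the search at roughly 23 runs. A sketch, assuming ./filter reads log lines on stdin and the crash is deterministic and caused by a single bad line:

#!/usr/bin/env bash
# Binary-search a gzipped log for the line that makes ./filter segfault.
set -u
LOG="${1:?usage: bisect.sh file.gz}"
lo=1
hi=$(zcat "$LOG" | wc -l)

crashes() {
    # Does ./filter die on lines $1..$2? 128 + SIGSEGV(11) = exit status 139.
    zcat "$LOG" | sed -n "$1,$2p" | ./filter >/dev/null 2>&1
    [ $? -ge 128 ]
}

while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi) / 2 ))
    # If the first half crashes, the bad line is in it; otherwise it's in
    # the second half.
    if crashes "$lo" "$mid"; then hi=$mid; else lo=$(( mid + 1 )); fi
done

echo "offending line $lo:"
zcat "$LOG" | sed -n "${lo}p"
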