[09:31:26] New patchset: Stefan.petrea; "Fixed segfault" [analytics/webstatscollector] (master) - https://gerrit.wikimedia.org/r/56902
[09:33:03] oh nice
[09:33:07] the bot is working
[09:34:36] !log notify ottomata that webstatscollector segfault was fixed in https://gerrit.wikimedia.org/r/56902
[09:34:42] !log test
[13:24:55] ottomata: !
[13:24:58] ottomata: good morning :)
[13:25:02] ottomata: how was your weekend ?
[13:25:16] morning!
[13:25:17] good weekend!
[13:25:24] nice :)
[13:25:28] how was yours?
[13:25:38] ottomata: it was ok, thanks :)
[13:25:41] ottomata: I solved the segfault
[13:25:46] ottomata: https://gerrit.wikimedia.org/r/56902
[13:27:43] awesome!
[13:27:46] I will try it out in just a bit then!
[13:28:02] ok
[13:33:48] Change merged: Ottomata; [analytics/webstatscollector] (master) - https://gerrit.wikimedia.org/r/56902
[13:53:55] moooorning
[13:54:16] morning drdee
[13:55:53] morning!
[13:56:02] average_drifter, no segfaults so far!
[13:56:04] yay!
[13:56:15] morning ottomata!
[13:56:15] how long does it take for the dumps directory to show up?
[13:56:20] morning average_drifter!
[13:56:36] i belief every hour ottomat
[13:56:38] a
[13:56:53] and maybe at the hour
[13:56:59] so that would be in a couple of minutes
[13:57:39] k
[13:59:51] ottomata: :)
[14:00:04] drdee: hi
[14:00:13] yeahhh I got dumps!
[14:00:21] oh they are old dumps :
[14:00:31] ottomata: does that mean it segfaulted ?
[14:00:35] no new dumps yet
[14:00:35] nono
[14:00:38] just hasn't dumped
[14:00:48] no segfaults yet at all
[14:00:53] before it segfaulted like right away
[14:00:55] now it runs
[14:00:59] so I expect that it works
[14:01:01] drdee: I have news
[14:02:29] shoot
[14:03:19] http://stat1.wikimedia.org/spetrea/new_pageview_mobile_reports/r46-updated-logic/out_sp/EN/TablesPageViewsMonthlyOriginalMobile.htm
[14:03:27] yes i saw that
[14:03:53] numbers are at least 50% too low :(
[14:04:08] maybe we should do a hangout, invite milimetric as well
[14:04:12] and you talk us through the code
[14:05:16] yes, moment, I have to finish up some docs for it, then I'll send a link for the hangout
[14:05:35] ok
[14:05:38] oh! average_drifter, new rule:
[14:05:43] when anyone says "hangout?"
[14:05:47] it means we all go into the standup
[14:05:49] so bookmark that
[14:05:54] good point milimetric
[14:06:08] ok, so the hangout link is the one for the standup
[14:06:09] #efficiency #solvingfirstworldproblems
[14:51:43] average_drifter: any update?
[15:22:40] morning kraigparkinson
[15:23:25] still waking up… :p
[15:23:28] gm drdee
[15:24:31] I'm ready
[15:26:18] drdee , milimetric I'm in the hangout :)
[15:26:38] get started without me while i am running to get coffee :)
[15:26:58] ok
[15:27:35] gr, weird, no pinging going on with my irc anymomre
[15:27:37] brt average_drifter
[15:28:44] brt ?
[15:28:54] be right there
[15:28:55] ok
[15:43:52] drdee, why would we discard a pageview if it comes from mobile?
[15:44:09] ??
[16:02:35] drdee , milimetric thanks for giving your oppinion on this :)
[16:02:45] np
[16:08:26] so scrum is in 1h
[16:17:33] good morning, kind sirs
[16:18:29] milimetric: i didn't get to running that job on friday
[16:18:32] due to lack of brain
[16:18:39] but!
[16:18:43] i can tell you my idea
[16:19:19] i was going to incrementally broaden my search using kraken
[16:19:30] start with all UAs that contain "wikimedia" at all
[16:20:13] (user_agent MATCHES ".*(?i:Wikimedia).*")
[16:20:21] and count that
[16:20:31] if that differs significantly
[16:20:45] then i know my problem is somewhere in my filtering
[16:20:52] if it looks similar to the original results
[16:20:59] we are probably missing the data
[16:24:40] hey dschoon, just grabbed lunch
[16:24:44] word
[16:25:07] diederik suggested the sensible thing: i'm installing the android app and hitting the site
[16:25:13] heh
[16:25:15] then i'll grep for my ip address
[16:25:16] sure
[16:25:29] ...you'll need to hit the site a few thousand times to be sure
[16:25:40] oh, the mobile site
[16:25:42] ah
[16:25:44] yes, sure
[16:25:55] it should be in Kraken within 15 minutes right?
[16:26:05] yes
[16:26:14] k, gonna eat lunch and get to it
[17:01:00] ottomata scrum
[17:01:07] average_drifter scrum
[17:01:13] doh doh
[17:03:27] ok
[17:12:57] ottomata: http://localhost:8888/oozie/list_oozie_workflow/0001814-130321155131954-oozie-oozi-W/
[17:12:57] click on the log tab
[17:14:31] org.apache.openjpa.persistence.PersistenceException: Data truncation: Data too long for column 'conf' at row 1
[17:16:08] is the term "unsampling" correct ?
[17:16:16] what do you call it like from a statistics PoV ?
[17:17:52] like for example if you sampled 1:100 , but then you need to multiply back by 100
[17:18:04] extrapolating? :)
[17:18:17] kraigparkinson: extrapolating ! that's it, thanks
[17:33:50] i think it's interpolate, actually
[17:33:53] when you go up, you interpolate
[17:34:09] (up := fill in missing data by inference)
[17:34:12] thanks, dschoon
[17:34:30] when you go out (:= infer new dimensions) you extrapolate
[17:34:43] usually you extrapolate time
[17:38:18] (such is my understanding)
[17:41:23] milimetric: we need to fix the datasources page, 4 realz
[17:41:42] i saw, matt walker and many others need it
[17:41:45] yes!
[17:42:18] so the problem was that saving the datasource now fails because of a cyclical reference during JSON serialization
[17:42:21] you said you had a hunch
[17:42:25] yeah.
[17:42:27] but also
[17:42:40] the column headings now show the toString of functions
[17:42:43] so it's more than just that
[17:44:57] that i think i fixed
[17:45:17] i'll make sure everything's pushed everywhere
[17:45:50] dev still shows it
[17:45:53] test does not
[17:45:58] but test doesn't have the vis options
[17:48:08] yea, I didn't deploy it
[17:48:12] it's fixed though
[17:48:51] lmk when it's up on dev
[17:49:08] dschoon, ottomata, drdee, I sent you all a response to my previous investigation email. What do you think?
[17:49:15] looks like we're not getting the traffic into kraken
[17:49:25] any reason for skepticism of my method?
[17:49:26] will check it out
[17:49:37] reading
[17:50:12] [travis-ci] develop/87b285e (#115 by milimetric): The build passed. http://travis-ci.org/wikimedia/limn/builds/5963267
[17:51:20] maybe run the query against the entire day? just to be sure?
[17:52:19] ok dschoon, deployed: http://dev-reportcard.wmflabs.org/datasources
[17:52:48] running against the whole day would mean my data traveled back in time but I suppose anything could happen right drdee?
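
dschoon's plan above — first count every user agent that mentions "wikimedia" at all, then narrow — can be dry-run outside Kraken against a sampled udp2log file, which also illustrates the "multiply back by the sampling factor" extrapolation question that comes up just after. A rough bash sketch; the log path is one that appears later in this log, and the 1:1000 factor is inferred from its name:

#!/usr/bin/env bash
# Count UA matches in a 1:1000-sampled log and scale the count back up.
# Mirrors the spirit of the Pig filter user_agent MATCHES '.*(?i:Wikimedia).*'.
SAMPLED_LOG=/a/squid/archive/sampled/sampled-1000.tab.log-20130208.gz
SAMPLE_FACTOR=1000

# grep -ci: count matching lines, case-insensitively.
matches=$(zcat "$SAMPLED_LOG" | grep -ci 'wikimedia')

echo "sampled matches: $matches"
echo "estimated total: $(( matches * SAMPLE_FACTOR ))"
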
[17:52:49] :)
[17:53:59] timezones, milimetric timezones :D
[17:54:11] i also said just to be sure :)
[17:54:34] obviously it shouldn't matter but just ruling out alternative expanations
[17:54:39] right, true
[17:54:54] so timezone wise though, it'd be good to know for sure that kraken's on GMT
[17:54:57] that's true right?
[17:55:03] kraken us UTC
[17:55:04] *is
[17:56:54] this line is correct too right?
[17:56:54] grunt> dan_log_fields = FILTER log_fields BY remote_addr MATCHES '.*96\\.227\\.53\\.90.*';
[17:57:21] as in, there's no way my IP address would be in there but not get caught by this
[17:59:36] ok drdee, of 142 million records, 0 match that
[17:59:43] (looking at the whole day)
[18:00:27] gonna do the monthly reportcard stuff now, but if nobody finds fault with this, I'll talk to ops to pick their brain
[18:00:44] milimetric: yes, that looks right
[18:00:57] hm
[18:00:59] actually
[18:01:12] that first match is greedy
[18:01:13] try this:
[18:01:38] dan_log_fields = FILTER log_fields BY remote_addr MATCHES '\\s*96\\.227\\.53\\.90\\s*';
[18:01:48] because i think the first .* was greedily eating the rest of the string
[18:02:02] heh, no
[18:02:10] then .*anything would never match anything
[18:06:35] right
[18:06:37] it might not!
[18:06:44] dschoon, you merged the metric defs stuff into master reportcard branch?
[18:06:49] i forget.
[18:06:50] probably?
[18:06:52] yeah
[18:06:55] hm... predicament
[18:07:04] we should not do that until we launch a feature officially
[18:07:30] because now we might have to deploy latest to prod
[18:08:18] mk, so re: the regex, I'll try it because hey it's crazy day here at WMF. But if that's the case I'm never using regex in pig ever again. Or pig for that matter :)
[18:08:56] milimetric, iiinteresting! I just read your email
[18:09:03] yea
[18:09:13] whatever's happening, its probably not ops' fault, it would be ours
[18:09:17] so don't run off to them yet :p
[18:09:22] oh definitely not their fault
[18:09:32] my approach was gonna be: we're stuck, can you help?
[18:10:15] it's obviously on our side so I'll happily exhaust other options you can think of
[18:10:26] well, let's see if we can figure it out first,
[18:10:32] I suppose my IP could be wrong.. lemme vet that
[18:10:33] first, i'd try just matching exactly with = instead of regex
[18:10:37] if you are looking for a single IP
[18:10:39] k
[18:10:40] yeah
[18:10:44] i was about to say that
[18:10:55] grunt> dan_log_fields = FILTER log_fields BY remote_addr = '96.227.53.90';
[18:10:57] also, the import is funky, there's no guaruntees about order or time buckets
[18:11:03] milimetric: http://tire.less.ly/asking/wtf-is-my-ip/but-in-plaintext-pls/
[18:11:04] so, you don't need to grab the whole day
[18:11:14] or http://tire.less.ly/asking/wtf-is-my-ip/ :)
[18:11:18] but, grab maybe an hours worth of time around when your request was made, just to be safe
[18:11:42] no that's what i was doing first but drdee said to do the whole day
[18:11:53] 'cause he believes in time travel and has little faith in timezones :P
[18:11:59] http://dev-reportcard.wmflabs.org/graphs/donationdata-vs-day
[18:12:39] and all the imports had happened for that hourish time period you looked in?
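
ottomata's suggestion — match the IP exactly rather than with a regex — written out as a self-contained batch job instead of a grunt session. A sketch only: the HDFS path and column layout are invented, not the real Kraken import schema, and note that equality in a Pig FILTER is spelled ==, not the bare = typed above:

# Hypothetical batch version of the grunt session; adjust path and schema.
pig <<'EOF'
log_fields = LOAD '/wmf/raw/webrequest/2013-04-01/*'
    USING PigStorage('\t')
    AS (hostname:chararray, seq:long, dt:chararray,
        remote_addr:chararray, uri:chararray);

-- Exact match: no greedy-regex surprises, and cheaper than MATCHES.
dan_hits = FILTER log_fields BY remote_addr == '96.227.53.90';

hit_count = FOREACH (GROUP dan_hits ALL) GENERATE COUNT(dan_hits);
DUMP hit_count;
EOF
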
[18:12:51] works for me dschoon
[18:12:55] also, we need to be sure your requests were going to one of the 4 varnish hosts we are importing from
[18:12:59] milimetric, let's try this real quick
[18:13:08] not sure how i know the imports happened or not
[18:13:15] i'm going to turn on unsampled udp2log on an09 for a, can you make some requests?
[18:13:18] is fs -ls enough?
[18:13:33] yeah, if the files you are looking for are there
[18:13:34] the imports happened
[18:13:47] ok, say when and i'll hammer it at like 1 request / second :)
[18:13:47] also, we could look in the kafka buffer too
[18:13:47] k one sec
[18:13:55] k, then they're there because it said there were like 9 million records it looked through
[18:14:16] right but, you have a wildcard for a few imports
[18:14:17] like
[18:14:26] maybe the latest 15 minute interval that you needed ahdn't been imported yet
[18:14:34] I had 16:45* and 17*
[18:14:48] and i did requests from 16:50 onward
[18:15:23] actually
[18:15:44] you had all of 17:00-17:45? or just 17:00?
[18:15:51] is there a way for you to know based on what you get back what varnish host you were serviced by?
[18:15:56] yes
[18:15:58] look at header
[18:16:06] it has X-Cache host or somethign
[18:16:27] X-Cache: cp1043 frontend miss (0)
[18:16:36] milimetric, real quick before I turn this on
[18:16:36] do:
[18:16:39] milimetric: can you look at that?
[18:16:47] curl --head http://
[18:16:54] and look at value of X-Cache
[18:16:58] from my phone?
[18:17:23] 17* to your previous question, meaning all of 17
[18:17:25] yeah
[18:17:33] ok, hang on, gotta find a terminal emulator
[18:17:35] heh
[18:17:46] oh ha
[18:17:57] or uh hm
[18:19:16] heh, balls, this emulator has no curl!!
[18:19:16] :)
[18:19:16] milimetric: what URL are you using?
[18:19:16] sec
[18:19:31] i browsed randomly, star trek, international space station, etc.
[18:19:37] using the WMF app, right?
[18:19:39] yes
[18:19:44] version 1.3.4
[18:20:05] milimetric, if finding a way to get headers on your phone is going to be annoying, let's do udp2log thing first
[18:20:08] lemme know if you are ready to browse
[18:20:23] rdy
[18:20:28] k go
[18:20:40] do as much as possible, this gets huge fast!
[18:20:44] gonna shut it of in a few seconds
[18:20:49] 1G already
[18:20:54] i'll shut it off at 2G
[18:21:06] you got some requests in there?
[18:21:14] milimetric? ^\
[18:21:21] oh yea
[18:21:26] a bunch
[18:21:29] ok cool
[18:21:31] and what is your IP/
[18:21:31] ?
[18:21:38] i'll double check it, one sec
[18:22:37] 96.227.53.90
[18:23:21] so if they were cached, would the request still go through to kraken?
[18:23:36] when i was looking through the logs, I remember a TON of miss/200 statuses
[18:23:49] and very few hits, and those were very weird requests
[18:24:00] it shouldn't matter if it's cached
[18:24:09] it still proxies through the varnish boxes
[18:24:36] yeah, that's my first suggestion though - see if we can find any examples of cache hit 200s in kraken
[18:24:56] from mobile that is, so UA MATCHES 'WikipediaMobile/.*Android'
[18:25:26] i suggest you always use .*?
[18:25:28] not .*
[18:25:36] the greedy version is rarely what you want :P
[18:26:13] i know all about .*? and it isn't the witch you are hunting
[18:26:19] yes
[18:26:46] hm, I don't see that IP in the logs I captured milimetric
[18:26:57] ah wait
[18:26:58] yes I do
[18:26:59] hm
[18:27:02] hang on
[18:27:12] #edgeofseat
[18:27:16] udp-filter just diidn't catch it!
[18:27:17] grep does
[18:27:27] 25 requests
[18:27:35] yeah, that sounds about right
[18:27:43] but
[18:27:44] this last time i made only four or five
[18:27:49] welp
[18:27:51] before i made maybe 4 groups of 5
[18:27:52] screwed, i think.
[18:27:53] https://github.com/wikimedia/WikipediaMobile/blob/master/assets/www/js/app.js#L124
[18:28:06] it hits $lang . $project .org
[18:28:12] which means en.wikipedia.org
[18:28:16] rigiht
[18:28:17] not en.m.wikipedia.org
[18:28:21] none of these are on cp1041-44
[18:28:24] correct.
[18:28:27] grep 96.227.53.90 webrequest.log | awk '{print $1}' | sort | uniq -c
[18:28:27] 2 cp1001.eqiad.wmnet
[18:28:27] 3 cp1002.eqiad.wmnet
[18:28:27] 1 cp1006.eqiad.wmnet
[18:28:27] 1 cp1007.eqiad.wmnet
[18:28:27] 3 cp1008.eqiad.wmnet
[18:28:27] 1 cp1009.eqiad.wmnet
[18:28:28] so we don't always get the data.
[18:28:28] 1 cp1011.eqiad.wmnet
[18:28:28] 1 cp1013.eqiad.wmnet
[18:28:29] 1 cp1014.eqiad.wmnet
[18:28:29] 1 cp1016.eqiad.wmnet
[18:28:30] 1 cp1017.eqiad.wmnet
[18:28:38] so there's the answer.
[18:28:42] ugh
[18:28:46] there are hundreds of machines it might be on.
[18:28:57] so, what's the prob? this is the api?
[18:28:58] https://github.com/wikimedia/WikipediaMobile/blob/master/assets/www/js/app.js#L124
[18:29:04] yep
[18:29:07] the app doesn't hit .m.
[18:29:12] well, that's ok
[18:29:19] that's ok?
[18:29:19] it's not, really
[18:29:20] well, wait
[18:29:28] kraken only imports hits to cp1041-44
[18:29:32] it's not just the .m. part, these lines aren't in kraken ANYWHERE
[18:29:36] those are the machines that handle .m.
[18:29:41] right
[18:29:41] of course not.
[18:29:43] because i looked for purely the IP
[18:29:46] right
[18:29:51] kraken only gets cp1041-44
[18:29:55] cp1041-44 are .m.
[18:30:00] oh!
[18:30:03] i didn't know
[18:30:04] we are importing all logs from only 4 varnish hosts, because we were told all mobile traffic goes through those
[18:30:04] yes.
[18:30:13] k, so we don't have the data
[18:30:13] why not?
[18:30:17] we have all mobile site data
[18:30:19] not api
[18:30:22] we were probably told "all mobile *site* traffic" does
[18:30:24] exactly.
[18:30:27] right
[18:30:34] api traffic is evenly spread across the cluster
[18:30:35] but why are we not importing everything else?
[18:30:37] not allowed on MVC?
[18:30:37] well, probably not, actually
[18:30:39] kraigparkinson: drdee: need status updates for https://www.mediawiki.org/wiki/Wikimedia_engineering_report/2013/March#Analytics
[18:30:51] api traffic is a specific subset
[18:31:06] proof of concept, keeping traffic imports low to make sure we don't mess stuff up, also, kafak hadoop consumer is hacky, so we wanted to reduce potential problems
[18:31:08] we used to take everything, but we turned it off for ephemeral reasons
[18:31:09] it would be so nice if a limited number of servers were dedicated to api traffic
[18:31:16] robla: working on it here: http://etherpad.wmflabs.org/pad/p/AnalyticsMonthlyEngineeringReport
[18:31:33] robla, by when do we need to have it buttoned up?
[18:31:54] milimetric, do we need api traffic for our upcoming deliverables?
[18:32:01] ok, so drdee, kraigparkinson - do we abandon card 92 right now or do we launch an initiative to turn on the firehose?
[18:32:11] yes ottomata, card 92
[18:32:25] i'll change it to blocked.
[18:32:33] we can't turn on the firehose on mvc
[18:32:50] kraigparkinson: today was the main deadline. Check in with Sumana now for exact timing
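
The X-Cache check that settled this is easy to script: pull the serving host out of the response headers and compare it against the four hosts whose logs Kraken imports. The cp1041-44 list comes straight from the exchange above; the URL and everything else is illustrative:

#!/usr/bin/env bash
# Which varnish frontend served this request, and does Kraken import it?
URL="${1:-http://en.m.wikipedia.org/wiki/Star_Trek}"

# The header looks like: "X-Cache: cp1043 frontend miss (0)"
host=$(curl -sI "$URL" | awk 'tolower($1) == "x-cache:" {print $2}')
echo "served by: $host"

case "$host" in
    cp1041|cp1042|cp1043|cp1044) echo "imported into Kraken" ;;
    *) echo "NOT imported into Kraken" ;;
esac
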
[18:32:51] ok, then it's abandoned unless there are any objections
[18:32:53] i also added a note about .m. and the API
[18:33:00] thanks robla
[18:33:01] milimetric: please update drdee and kraigparkinson
[18:33:05] will do
[18:33:10] drdee, kraigparkinson, hangout?
[18:33:14] :) they'll decide, but yeah, it's blocked for us
[18:33:41] milimetric, sure
[18:33:46] got one?
[18:33:53] remember, that's our codeword
[18:33:56] it means standup link
[18:34:07] https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc
[18:34:18] everyone should bookmark that ^^
[18:35:18] drdee: writing an RT ticket to change the UMAPI access privs, do you have a list of private DBs handy? Otherwise we can reuse the toolserver censorship table
[18:36:19] scrap that, I'll use the latter
[18:38:21] DarTar: can't you just use dblists?
[18:39:27] jeremyb_: I think the list that I have for the TS is the most conservative
[18:39:58] but we can replace it later as needed
[18:39:58] DarTar: i don't follow
[18:41:40] ottomata: i really hate this error. http://localhost:8888/oozie/list_oozie_coordinator/0001820-130321155131954-oozie-oozi-C
[18:41:47] i'm telling you this because it's extremely misleading
[18:42:15] what it's saying is that when oozie when to materialize the output dir, it checked the DoneFlag and determined the dirs all existed
[18:42:21] which caused it to omit the property
[18:42:32] then the workflow bitched because it didn't get the property
[18:42:34] which is all FINE
[18:42:47] but it still shows up as a failed job
[18:43:31] say again?
[18:44:09] E0738: The following 1 parameters are required but were not defined and no default values are available: dataOutput
[18:44:11] that is a lie
[18:44:15] well
[18:44:17] it's true
[18:44:19] but it's misleading
[18:44:30] it means the coordinator calculated no materialized output path
[18:44:38] oh because they DoneFlag existed?
[18:44:49] so it thinks there's nothing to do?
[18:45:30] exactly.
[18:45:35] but it still shows up as a failed job
[18:45:43] because it tries to validate the workflow parameters
[18:45:45] and fails
[18:45:55] like, yes, i know that dir exist
[18:45:58] *exists
[18:46:06] so how about you just skip it and do the next one, okay oozie?
[18:46:07] sigh
[18:46:24] ah yeah
[18:47:20] well shit
[18:47:26] PersistenceException: Data truncation: Data too long for column 'conf' at row 1
[18:47:48] ah poo, even with the vars?
[18:48:01] yes.
[18:48:10] http://localhost:8888/filebrowser/view//libs/kraken/oozie/mobile/device/props/workflow.xml
[18:48:14] scroll to the bottom
[18:48:23] i even resubmitted the coordinator to make sure it reloaded everything
[18:48:29] gragh
[18:48:31] that's awful.
[18:48:47] i think that means i really have to chain coordinators together or something awful like that
[18:48:58] can we look at the db or something
[18:49:03] how big is that stupid column?
[18:49:55] hmmm
[18:49:57] we can look at the db yeah
[18:49:58] one se
[18:49:59] c
[18:50:01] its on an27
[18:50:44] i'm going to attempt to read the source
[18:50:44] https://gist.github.com/ottomata/5286849
[18:50:48] but... there's a LOT of it
[18:50:55] too long for TEXT column???
[18:51:01] ...text?!
[18:51:07] that is absurd.
[18:51:15] is there a logfile on an27?
[18:51:23] there has to be something
[18:52:23] ja i'm sure, i'm reading stuff too
[18:52:38] oozie-jpa.log looks propmising...
[18:52:42] *promising
[18:53:10] (i'm also in ops meeting)
[18:53:13] it is not.
[18:53:16] (okay)
[18:57:02] i betcha $dataInput is just too long
[18:57:08] and when you use it twcice it goes over the size
[18:57:15] but...
[18:57:16] text!
[18:57:17] also
[18:57:22] 2^16 is max size, unless max_allowed_packet is lower, that could cause problem too
[18:57:23] dataInput shouldn't be passed to the sub-job
[18:57:37] i turned OFF propagate conf
[18:59:40] the conf is very long, i'm looking at it
[18:59:42] in db
[19:00:13] hm, lots of duplicate things
[19:00:13] hm
[19:00:48] aaah dschoon
[19:00:54] yes
[19:00:56] that labels: in the datasources?
[19:01:00] that was loadbearing :)
[19:01:05] the DB schema is ... not obvious
[19:01:07] monthly import is broken
[19:01:09] load-bearing?
[19:01:17] proto_action_conf
[19:01:18] ?
[19:01:25] fixing, carry on
[19:01:33] drdee: monthly import just got fun
[19:01:46] of course it did
[19:02:01] milimetric: bookmarked...
[19:02:04] :)
[19:02:35] cool, so the phrase "hangout?" aims to induce a pavlovian response to jump into that link
[19:02:53] ottomata: ?
[19:03:00] to that end, we would like management to create scripts which deliver us yummy chocolates every time someone says that phrase
[19:03:26] yes, milimetric. I wish I could setup a custom parser for Colloquy to auto-translate that into a link. :)
[19:03:49] * drdee wonders if he should open a mingle card for that :D
[19:04:43] you can bookmark it in chrome
[19:05:11] I also wish I had something that would link Mingle cards using the #312 format.
[19:05:25] in irc, that is
[19:05:39] we could add that to the analytics-logbot actually
[19:05:44] ottomata: switching to gtalk, as it's too hard to follow what's going on in here
[19:08:27] drdee, thanks for working on the monthly report, please ping me when you'
[19:08:35] re done. (and robla)
[19:09:20] milimetric: what's the problem regarding reportcard?
[19:09:58] the yaml used to be "types: [], labels: []"
[19:10:15] now it's "columns: [{label: '', type: ''}, ...]
[19:10:33] so there is at least one spot where i have to make it backwards compatible
[19:10:54] I'd say "nothing crazy" but I'd have learned nothing!
[19:10:56] :)
[19:15:50] average_drifter, ottomata: has the filter segfault bug been resolved?
[19:16:04] i think not
[19:16:14] in ops meeting, helping dschoon, going to get average_drifter more data soon
[19:16:29] ok
[19:16:32] (ottomata is very helpful, as always)
[19:20:12] kraigparkinson: do you want to mention the analytics intensive session in the report?
[19:20:30] sure
[19:31:31] kraigparkinson: ookokokok
[19:31:52] :) hangout?
[19:31:57] looking...
[19:32:01] the usual?
[19:32:05] yep
[19:32:22] k am there
[19:39:00] hooray, thanks to otto, concat worked.
[19:39:08] now i need to run an ad-hoc job to backfill.
[19:45:09] hm. where is erosen?
[19:50:26] drdee, you can link https://www.mediawiki.org/wiki/Analytics/Roadmap/PlanningMeetings/AnalyticsReboot for that part of the report.
[19:50:48] yayyyy
[19:50:50] yup
[19:51:02] funny that I googled the problem and found a note from my past self on the internet
[19:51:09] dschoon :()
[19:51:11] :)
[19:51:17] haha
[19:55:56] kraigparkinson: is it possible to add watchers to a mingle ticket ?
[19:58:24] hm, average_drifter, finding the offending webstats input this time is gonna be difficult, it does segfault that often
[19:58:27] maybe once an hours ish, maybe more
[19:59:02] average_drifter; yes you can subscribe to a card
[19:59:41] mk. brb lunch
[19:59:43] ottomata: ok, not a problem, please ping me when it stumbles
[20:00:22] drdee, be right there.
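
The "how big is that stupid column?" question from the exchange above can be answered straight from information_schema on the an27 MySQL instance. A sketch only — the host, database name, and credentials are assumptions; the column names conf and proto_action_conf come from the conversation:

# Inspect the Oozie DB column sizes; connection details are assumptions.
mysql -h analytics1027 -u oozie -p <<'EOF'
-- A TEXT column tops out at 2^16 - 1 = 65535 bytes, which would explain
-- the "Data too long for column 'conf'" truncation error.
SELECT table_name, column_name, data_type, character_maximum_length
FROM information_schema.columns
WHERE table_schema = 'oozie'
  AND column_name IN ('conf', 'proto_action_conf');
EOF
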
[20:01:05] drdee: ok, wanted to add ottomata as watcher to #500
[20:01:18] ok, dunno if that's possible
[20:01:24] he has to do that himself i think
[20:02:06] ok
[20:02:58] waaat i do?
[20:03:10] oh got it
[20:06:25] hmmmmmmm, average_drifter, i'm not sure how to figure out what is causing segfaults
[20:07:04] ok, then I'll run it against the latest /a/squid/archive/sampled
[20:08:50] yeah
[20:08:51] i just ran it against a 2G sampled file
[20:08:54] no segfault
[20:09:28] i'd need to capture unsampled data until it finally segfaults, to be sure I capture whatever is causing it
[20:10:22] so, it last restarted at 19:31
[20:10:24] 40 mins ago
[20:10:30] we'll see if it dies in the next 20 mins or so
[20:10:42] collector is the hourly dumper, right? not filter?
[20:10:52] if this thing lives for the next 20 mins or so
[20:10:55] we should se enew dumps, right?
[20:11:15] yes
[20:11:29] k
[20:16:39] just ran it on all tab-separated logs
[20:16:47] we'll see if it breaks real soon
[20:24:06] back
[20:27:48] dschoon, just curious, have you solved the limnify step with your coelesce stuff?
[20:27:59] what does it do, precisely?
[20:28:13] output data in format that limn can graph
[20:28:22] which would then get synced to kraken-public
[20:28:27] and graphs on reportcard could point at urls there
[20:28:52] coalesce does the same thing as concat_sort.pig
[20:28:57] except without OOM
[20:28:57] heh
[20:29:06] concat sort pit output a single file though
[20:29:10] you've got a file per month, right/
[20:29:12] ?
[20:29:14] this does also.
[20:29:23] it updates the daily file every time it runs
[20:29:25] where's the file?
[20:29:31] for which, zero?
[20:29:39] zero has been disabled since i started on it
[20:29:40] naw, for whatever, just wanna see
[20:29:41] device props
[20:29:45] i was testing with device props
[20:29:51] you have the link...
[20:29:58] http://stats.wikimedia.org/kraken-public/webrequest/mobile/device/props/2013/04/
[20:30:05] two files though
[20:30:05] http://stats.wikimedia.org/kraken-public/webrequest/mobile/device/props/2013/
[20:30:07] 03, 04
[20:30:09] csv: http://stats.wikimedia.org/kraken-public/webrequest/mobile/device/props/2013/04/mobile_device_props-2013-04-01.tsv
[20:30:14] right.
[20:30:18] ottomata: got it
[20:30:20] /a/squid/archive/sampled/sampled-1000.tab.log-20130208.gz
[20:30:20] one per month
[20:30:23] segfaults on this one ^^
[20:30:24] er, one per day
[20:30:30] i haven't done per-month aggregates yet
[20:30:33] concat sort made a single file total
[20:30:38] i know.
[20:30:40] this does also.
[20:30:44] it runs at the end of each hourly job
[20:30:53] with the output target aggregating the day
[20:31:34] so each hour it reconcats everything from the previous day
[20:31:39] er, from that day
[20:32:27] i can add a monthly one as well.
[20:32:35] hm, ok, but don't you want a a single file? i'm not talking about aggregates
[20:32:37] but a single file wiith all hourly data
[20:32:38] (now that we're sure it actually works)
[20:32:42] can limn do multiple files?
[20:32:45] not all ever
[20:32:53] that file would be huge.
[20:33:12] but anyway
[20:33:13] ok, help me answer a more general question
[20:33:14] for zero
[20:33:18] this is kidna part of the monitoring stuff
[20:33:21] right
[20:33:29] i want to graph this
[20:33:29] http://stats.wikimedia.org/kraken-public/webrequest/loss/
[20:33:44] ganglia or limn, don't care, i thikn graphing that with ganglia is going to be hacky
[20:34:09] oh, i pivot on hostnames and keep average, which is hte last field
[20:34:15] yes.
[20:34:17] (or, I want to)
[20:34:18] i was going to say
[20:34:38] the big thing is that doing a pivot is a huge pain with any of these tools.
[20:35:06] i think we're going to need to put actual support for pivoting in limn
[20:35:32] but, that could get messy too, unless you've got some caching support built in too
[20:35:39] lots of stuff to do for every reload
[20:35:44] indeed.
[20:35:52] sounds naastaaayyyyy
[20:35:59] not too bad so long as the data isn't large.
[20:36:13] for data the size of most of our CSVs, it's 10s of ms
[20:36:36] we walk the whole dataset in order to apply date-based filters
[20:36:53] (we always apply timeseries:{ start, end }
[20:36:55] )
[20:37:17] average_drifter, what was it?
[20:37:41] brb a moment
[20:39:18] ottomata: I found a .gz it's segfaulting on, now I'm doing binary search on it to find the row
[20:39:34] aye cool
[20:39:34] heheh
[20:39:40] that's what I did to find your row too!
[20:39:52] i did it manually w sed, how are you doing it?
[20:40:17] zcat .gz | wc -l to find out how many rows
[20:40:23] then I take that number slice it in two
[20:40:26] then I do stuff like
[20:40:35] zcat .gz | head -HALF | ./filter
[20:40:39] zcat .gz | tail -HALF | ./filter
[20:40:47] aye ja
[20:40:52] i I used tail, head too
[20:40:53] hehe
[20:45:18] I could write something to do it automatically... although considering I have to do at most log(7266786) ~ 22 tries ...
[20:45:26] it will be ok to do it manually ..
[20:46:35] ottomata: did you write something to do it for you ?
[20:46:59] naw
[20:47:05] did it manually
[20:49:56] ottomata: there is one more speedup which you can do
[20:49:58] man -k "binary search"
[20:50:05] ottomata: have the two halfes start at the same time
[20:50:13] ottomata: and if one segfaults before the other, it will notify the other
[20:50:22] ottomata: and the other one stops
[20:50:29] lots of man -S 3
[20:50:35] sadly, no -S1
[20:51:13] other than git-bisect :)
[20:51:27] ahha, it doesn't take that long though
[20:51:37] :D yeah, just sayin :)
[20:53:21] so, ottomata
[20:53:28] what else did you need from me re: concat?
[20:54:36] oh
[20:54:53] i'm really not sure what to do
[20:55:00] i'd really like a graph of the loss percentages
[20:55:04] even if it is averaged across all hosts
[20:55:18] it'd be reallly nice if that was in ganglia, i'm cahtting with ori-l righ tnow about how/if that is possible or a good idea
[20:55:20] its kinda weird
[20:55:20] let me walk you through coalesce-wf.xml
[20:55:23] and maybe that will help
[20:55:26] hmm, ok
[20:55:34] --> gtalk
[21:14:37] drdee, do you remember the out of bounds exception but in maxmind geoip?
[21:14:43] http://forum.maxmind.com/viewtopic.php?f=13&t=9353
[21:14:54] yes
[21:15:12] we compiled 1.2.9-snapshot-2 to fix that
[21:15:16] dschoon is hitting it, or something very similar
[21:15:16] http://localhost:19888/jobhistory/logs/analytics1011:8041/container_1364239892421_3913_01_000012/attempt_1364239892421_3913_m_000010_0/stats
[21:15:17] not sure
[21:15:37] let me look at what jars are loaded.
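
One quick way to act on "let me look at what jars are loaded" before rerunning anything: list which MaxMind classes a given jar actually bundles, and hunt for other geoip jars that could shadow it. A sketch — the jar name is the one from the REGISTER line quoted below; the search paths are illustrative:

#!/usr/bin/env bash
# Which MaxMind classes does this jar bundle?
JAR=geoip-1.2.9-patch-2-SNAPSHOT.jar

# LookupService is the class throwing at line 905 in the stack trace.
unzip -l "$JAR" | grep -i 'com/maxmind/geoip'

# Look for any other geoip jars lying around that might shadow it.
find "$HOME" /usr/lib/hadoop* -name '*geoip*.jar' 2>/dev/null
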
[21:15:41] 2013-04-01 21:01:45,696 ERROR [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:stats (auth:SIMPLE) cause:org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error from UDF: org.wikimedia.analytics.kraken.pig.GeoIpLookupEvalFunc, Out of bounds access [254]
[21:15:42] that jar is for sure in my home folder on an10
[21:15:48] Caused by: java.lang.ArrayIndexOutOfBoundsException: 254
[21:15:48] at com.maxmind.geoip.LookupService.getLocation(LookupService.java:905)
[21:16:40] drdee: very first line is REGISTER 'geoip-1.2.9-patch-2-SNAPSHOT.jar'
[21:16:46] mmmmm
[21:16:52] that is the right jar
[21:17:34] no other geoip jars even show up in the classpath
[21:17:41] http://localhost:19888/jobhistory/logs/analytics1020:8041/container_1364239892421_3912_01_000002/attempt_1364239892421_3912_m_000000_0/stats/stdout/?start=0
[21:18:07] i switched to symlinking the databases though
[21:18:12] i will try without
[21:18:15] but that seems... silly
[21:19:46] i have to rerun the coordinator anyway for those times, so i'll reup it now.
[21:32:51] it seems that https://github.com/kohsuke/geoip has been updated to 1.2.9 so maybe upgrade the pom.xml from 1.2.5 to 1.2.9 and try again?
[21:33:39] hm.
[21:33:46] oh.
[21:33:54] because generic builds a jar with all its deps.
[21:33:56] ugh.
[21:33:57] fine.
[21:35:29] 1.2.9-patch1
[21:35:36] http://mvnrepository.com/artifact/org.kohsuke/geoip
[21:52:05] i asked a friend of mine who builds hadoop/hbase apps for a living what they recommend for scheduling
[21:52:13] he said he's never met anyone who uses oozie.
[21:52:25] :)
[21:52:34] that's comforting :/
[21:52:44] it feels about right.
[21:52:52] the pain points are just too obvious
[21:53:15] he said people usually write some python scripts and run them via cron.
[21:57:30] btw, drdee, i updated the jars in kraken-0.0.2
[22:10:00] how do you guys find that problem board?
[22:13:05] what problem board?
[22:13:16] what's the status of https://gerrit.wikimedia.org/r/#/c/50452/?
[22:13:41] you mean https://mingle.corp.wikimedia.org/projects/analytics/cards?favorite_id=756&view=%3EWIP+-+Problems ?
[22:13:48] yeah, found it, thank you, but how do you find it?
[22:13:50] there are no links
[22:13:53] there is
[22:14:02] it's called >WIP - Problems
[22:14:08] status of that drdee, is that mark said it looked great and wanted to do a final review
[22:14:11] that was early last week sometime
[22:14:18] WIP?
[22:14:24] See the little >> on the top right of the window, ottomata?
[22:14:27] work in progress
[22:14:36] I guess it's a << if its closed...
[22:14:51] Click the <<, then look at the list of Team favorites...
[22:14:52] brb
[22:14:56] ah favorites!
[22:14:59] k
[22:15:13] I can promote it to a tab if you think that'll be easier.
[22:15:24] (and if nobody else objects. :))
[22:17:01] cafe is closing
[22:17:02] i'm outty
[22:17:05] i updated a couple of cards
[22:17:06] lataas
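
average_drifter's manual head/tail hunt for the segfaulting row (the 20:39–20:51 exchange above) automates naturally, since log2(7266786) bounds the search at roughly 23 runs. A sketch, assuming ./filter reads log lines on stdin and the crash is deterministic and caused by a single bad line:

#!/usr/bin/env bash
# Binary-search a gzipped log for the line that makes ./filter segfault.
set -u
LOG="${1:?usage: bisect.sh file.gz}"
lo=1
hi=$(zcat "$LOG" | wc -l)

crashes() {
    # Does ./filter die on lines $1..$2? 128 + SIGSEGV(11) = exit status 139.
    zcat "$LOG" | sed -n "$1,$2p" | ./filter >/dev/null 2>&1
    [ $? -ge 128 ]
}

while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi) / 2 ))
    # If the first half crashes, the bad line is in it; otherwise it's in
    # the second half.
    if crashes "$lo" "$mid"; then hi=$mid; else lo=$(( mid + 1 )); fi
done

echo "offending line $lo:"
zcat "$LOG" | sed -n "${lo}p"
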