[00:20:31] yo dschoon, so are x-cs and x-carrier going to be running side-by-side?
[00:20:47] what do you mean, drdee?
[00:21:08] from arthur's email:
[00:21:10] "It introduces the X-Analytics header (ultimately a replacement for X-CS) while preserving the X-CS header until we're sure X-Analytics is doing what we need it to and we confirm X-CS is safe to completely kill."
[00:21:18] regarding https://gerrit.wikimedia.org/r/#/c/52606/
[00:21:28] yep.
[00:21:33] as in, both headers are sent.
[00:21:38] some are used by mobile frontend
[00:21:45] and the zero extension
[00:21:52] but X-Analytics is logged
[00:21:52] but does that result in two separate fields?
[00:21:56] find the varnishncsa file
[00:22:12] iirc, /files/varnish/varnishncsa.default.erb
[00:22:23] you'll see in his patch there's only X-Analytics
[00:22:28] k
[01:40:59] to anyone who cares, bad-xcs-lines should run for about 45m
[01:41:22] so i'm going to head home
[01:41:31] and we'll see if the kafka consumers get their fair share :)
[01:41:33] bbl
[03:17:56] I almost got it working
[03:18:24] hopefully I'll have something to show for the showcase meeting
[03:22:55] nope, didn't work
[05:02:33] milimetric: good morning
[05:02:45] lol, forgot it's midnight over there
[12:38:57] hey average_drifter, morning
[12:39:02] now it's a reasonable time here :)
[12:39:53] personally I think we should install some mirrors to make all the timezones the same
[12:39:55] :)
[13:03:00] hashar: crazy bug in puppet in superm401's patchset
[13:08:58] moooooooooooooning
[13:09:56] milimetric: hello :)
[13:10:07] milimetric: yeah, puppet autoloading has some surprises from time to time :(
[13:10:33] i didn't know, i'm just learning puppet too and had no answer to matt's problem
[13:10:41] average_drifter: around?
[13:37:59] morning ottomata
[13:38:08] morning!
[13:39:07] average_drifter: around?
[13:55:57] milimetric: can you help erik zachte with a git merge conflict?
[13:56:13] sure
[13:57:11] oh, signing on to skype
[13:57:36] thx!
[14:09:05] ottomata: how can erik push to gerrit from stat1:/a/wikistat_git ?
[14:09:37] git commit
[14:09:38] git push
[14:09:38] ?
[14:09:52] git review
[14:09:52] ?
[14:12:19] no
[14:12:22] that doesn't work
[14:12:32] what's up?
[14:12:33] if it was that easy i wouldn't ask
[14:12:36] which part?
[14:12:43] i think that should work
[14:12:45] it's the stuff with the username
[14:12:58] the only reason it wouldn't is if he couldn't write to the .git directory
[14:13:17] he can commit
[14:13:20] he can't push
[14:13:20] everything there looks ok, group writeable, group-owned by wikidev
[14:13:24] what does it say?
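For context, a rough sketch of how a header like X-Analytics ends up in the log stream. The real format string lives in the varnishncsa puppet template referenced above, so the flags and the field order here are assumptions, not the production config:

```bash
# Hypothetical varnishncsa invocation; not the production format string.
# %{X-Analytics}o reads the response header that Varnish sets, so each
# logged request line carries the new field alongside the usual NCSA ones.
varnishncsa -F '%h %l %u %t "%r" %s %b "%{Referer}i" "%{User-agent}i" %{X-Analytics}o'
```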
[14:13:48] right now he has a merge conflict
[14:14:19] even i have to specify a specific remote origin
[14:15:44] ahhhhhhh
[14:15:48] right, because it's not set, ok ok
[14:15:49] yeah
[14:16:03] ok yeah, i see specific remotes for you and stefan
[14:16:06] so yeah
[14:16:07] ok
[14:16:14] dan: erik is waiting for your help :)
[14:16:17] have him or you (I could do it too)
[14:16:19] do this
[14:16:38] git remote add ezachte ssh://ezachte@gerrit.wikimedia.org:29418/analytics/wikistats.git
[14:16:43] (if ezachte is his gerrit username)
[14:16:52] then, to push
[14:17:08] git push ezachte
[14:17:12] or, if that complains
[14:17:15] i'm not sure if it will
[14:17:22] git push ezachte master
[14:17:33] here, ezachte is the name of the remote
[14:17:41] so it's saying to push to that remote
[14:17:55] you could skip the remote add step, and just paste in the full url instead of the named 'ezachte' remote
[14:19:06] drdee, is anyone using the mobile udp2log file from locke?
[14:19:11] i noticed that it isn't synced to stat1 at all
[14:19:19] and it's still filtering using the m.wikimedia.org domain
[14:19:22] rather than the varnish hostnames
[14:19:49] mmm
[14:20:11] probably safe to start syncing this as well
[14:21:47] yea, but from what we know of our use of mobile logs in kraken
[14:21:52] this domain filter is not accurate
[14:21:53] right?
[14:23:19] right, the filter should be
[14:23:40] .m|zero.wiki*.org
[14:23:47] that is pseudo regex :)
[14:24:06] hm
[14:24:21] in kraken we are collecting based on varnish machine hostname
[14:24:27] cp1041-1044
[14:24:28] right?
[14:24:30] yes
[14:24:32] shouldn't we do the same here?
[14:24:35] sure
[14:24:46] should be the same :)
[14:25:15] i'm just doing this
[14:25:17] /bin/grep -P '^cp104[1-4]'
[14:25:50] mmk, i'll start doing this on gadolinium
[14:26:28] and start syncing to stat1
[14:26:35] it looks like we used to have some kind of mobile sync
[14:26:45] there are mobile logs on stat1, but they end at feb 2012
[14:27:06] mmm
[14:29:58] milimetric around?
[14:30:04] yeah, i went on skype
[14:30:42] i added you to the conversation
[14:31:09] ... so weird
[14:31:14] sorry, my skype is silly
[14:31:35] k
[15:13:58] mornin
[15:14:14] i have made many discoveries this night past
[15:14:31] most annoyingly, this: http://localhost:8888/filebrowser/view/user/dsc/zero/bad-xcs/bad_xcs-2013-03.tsv
[15:14:42] 1. our logs are not encoded in UTF-8
[15:14:52] 2. our UA strings sometimes have tabs in them
[15:14:55] joyous.
[15:15:24] ahha
[15:15:28] number 2!
[15:15:31] ori predicted it
[15:15:32] !
[15:29:01] it is the user-submitted content
[15:29:11] so of course it contains that which we despise
[15:29:34] UA is user submitted? oh well, from their browser
[15:29:41] or from ... curl
[15:29:43] i guess they could set it manually, but for the most part it's automatic
[15:30:01] the more subtle problem is that without the right encoding, you can have a two-byte character that ends in the code for tab
[15:30:29] ergh
[15:30:29] and because the logs are all encoded as ascii, the value isn't properly escaped
[15:30:35] yes.
[15:30:36] exactly.
[15:30:49] i don't think we're going to change the encoding from the cache servers though, right?
[15:31:13] we basically have to
[15:31:23] i'm sure there's a nasty attack in there somewhere
[15:31:27] ja, i guess that's the only option
[15:31:36] brb
[15:31:47] can we just do what dan suggested and drop any logs with the wrong number of fields?
[15:31:50] dschoon: about UTF-8 encoded logs, isn't that a Hue issue?
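Consolidating the gerrit steps above into one runnable sequence; this assumes 'ezachte' really is the gerrit username and that the working copy is stat1:/a/wikistat_git, both taken from the conversation:

```bash
# Work from the shared checkout on stat1.
cd /a/wikistat_git

# Add a per-user remote pointing at gerrit over ssh
# (each committer gets their own, as ottomata describes above).
git remote add ezachte ssh://ezachte@gerrit.wikimedia.org:29418/analytics/wikistats.git

# After resolving the merge conflict: commit, then push to the named remote.
git commit -a
git push ezachte master
```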
[15:32:04] in other news, packet loss is looking way more sane, eh?
[15:32:04] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=Packet+Loss+Average&vl=%25&x=&n=&hreg%5B%5D=(locke%7Cemery%7Coxygen%7Cgadolinium)&mreg%5B%5D=packet_loss_average&gtype=line&glegend=show&aggregate=1
[15:32:13] if we do that, we should at least count the number of lines we discard
[15:36:20] drdee: what do we do with the pig job I created?
[15:36:28] wanna oozify it even though it's not right?
[15:36:34] and we can fix as we go?
[15:36:39] sure
[15:36:50] waiting for reply from brion and yuvi
[15:37:06] ok, ottomata: when you have a sec, I am not sure if copy/paste is a good idea with the plethora of oozie files
[15:37:24] ?
[15:37:25] david changed the job I was modeling after
[15:37:32] oh
[15:37:55] yeah actually, i was thinking, most of the coordinator and workflow that I was using is even more parameterizable
[15:37:58] so basically, I'm gonna run into problems and you're probably gonna have to look at it.
[15:38:02] including the path to the pig script
[15:38:13] so I want to make your job of cleaning up after me as easy as possible
[15:38:17] haha, ok
[15:38:18] yeah!
[15:38:19] let's do that
[15:38:22] can I help with anything right now?
[15:38:37] should I go ahead and copy paste from an existing job?
[15:38:42] or start from scratch?
[15:38:50] what would be most helpful, least awful for you to debug
[15:39:31] i'll start from scratch and look at your on-wiki tutorial?
[15:41:12] sigh
[15:41:13] ottomata:
[15:41:14] http://localhost:8888/oozie/list_oozie_coordinator/0000496-130321155131954-oozie-oozi-C
[15:41:22] i have no idea why it succeeded *twice*
[15:41:23] early on
[15:41:27] and then failed repeatedly
[15:41:32] i pray it's something obvious
[15:41:55] heading into the office
[15:41:56] back soon
[15:43:02] dschoon
[15:43:02] http://localhost:19888/jobhistory/logs/analytics1014:8041/container_1364239892421_0373_01_000002/attempt_1364239892421_0373_m_000000_0/stats
[15:43:50] ERROR 1200: mismatched input ',' expecting RIGHT_PAREN
[15:52:38] ottomata, dschoon, about that stupid tab in UA; how about just putting the user agent string as the final field in the web request log line? at least it wouldn't mess up other fields and is an easy fix
[15:53:00] fine with me, but that is a very non-backwards-compatible change
[15:53:42] true, but also easily patchable
[16:13:50] yeahhhhhhhh, but i mean, if we run pig scripts over the full set of data, we'd have to have different scripts for different time periods
[16:13:58] or special logic to handle it
[16:16:54] but then we really need to escape tab very fast
[16:17:10] so no standup today guys?
[16:17:17] drdee? ^
[16:18:19] 1 sec
[16:18:44] but very fast, just making sure that all cards are in the right spot
[16:21:34] ottomata: I'd like you to review a change. https://github.com/wikimedia/kraken/commit/b916283744288da61aa92692130f1641ddba78a3
[16:21:34] :)
[16:21:37] i'm gonna go grab some food
[16:21:42] brb 10
[16:32:02] kraigparkinson: https://plus.google.com/hangouts/_/14679bfb0b7641a63af4066ee3413679464576cb
[16:32:49] drdee: http://stats.wikimedia.org/EN/draft/TablesPageViewsMonthlyOriginalMobilePetrea.htm
[16:33:27] ottomata: editing #367, eh?
[16:35:34] average_drifter: why are the numbers so low?
[16:38:47] uhm, I don't know
[16:39:57] drdee: can we showcase this? should we wait for when it's ready?
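A minimal sketch of the "drop lines with the wrong number of fields, and count the discards" idea from above. It assumes tab-separated log lines; the expected field count of 14 is a placeholder, not the real webrequest format:

```bash
# Placeholder: substitute the real webrequest field count here.
EXPECTED_FIELDS=14

# Keep only lines with exactly the expected number of tab-separated fields.
awk -F'\t' -v n="$EXPECTED_FIELDS" 'NF == n' webrequest.log > webrequest.clean.log

# Count how many lines were discarded, as suggested above.
awk -F'\t' -v n="$EXPECTED_FIELDS" 'NF != n' webrequest.log | wc -l
```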
[16:40:07] don't showcase
[16:40:12] ok
[16:45:59] ottomata: any chance you figured out what is wrong with the device job?
[16:46:56] oh, also, i think i figured out how to do concat without overrunning memory
[16:48:00] you use hadoop streaming
[16:48:04] and the lowly /bin/cat
[16:49:58] device job?
[16:53:14] http://localhost:8888/oozie/list_oozie_coordinator/0000496-130321155131954-oozie-oozi-C
[16:53:16] drdee: Erik says to not multiply them by 1000
[16:53:28] drdee: but in my report I multiply them and I get this http://stat1.wikimedia.org/spetrea/new_pageview_mobile_reports/r41-refactored-complete-run/pageviews.html
[16:53:53] ok milimetric
[16:54:08] which is close to his http://stats.wikimedia.org/EN/TablesPageViewsMonthlyMobile.htm
[16:54:09] howdy
[16:54:30] oh that
[16:54:30] yeah, dschoon
[16:54:33] http://localhost:19888/jobhistory/logs/analytics1014:8041/container_1364239892421_0373_01_000002/attempt_1364239892421_0373_m_000000_0/stats
[16:54:36] ERROR 1200: mismatched input ',' expecting RIGHT_PAREN
[16:54:46] milimetric
[16:54:46] so
[16:54:47] but it worked the first two times!
[16:54:50] bizarre.
[16:54:54] you are trying to run this daily, right?
[16:54:55] anyway, i am shamed and i will look
[16:55:00] me?
[16:55:04] no, milimetric
[16:55:08] :)
[16:55:26] dschoon, did the pig script change in hdfs /libs/kraken/pig?
[16:55:32] so, milimetric:
[16:55:44] you define input-events and output-events
[16:55:56] but, the way I defined it was for hourly data
[16:56:01] right
[16:56:02] mobile is imported every 15 minutes
[16:56:04] so the coord -4
[16:56:11] right
[16:56:12] that didn't make sense, i knew
[16:56:15] so if it is 8pm now
[16:56:17] -4 is 7pm
[16:56:21] and for input
[16:56:21] right
[16:56:24] but
[16:56:29] so how do I get it to always be "that day"
[16:56:43] i guess the inputs and outputs are mismatched
[16:56:52] does it need to concat and aggregate?
[16:56:54] well, since the mobile data is imported every 15 minutes
[16:57:07] and you want to have the coordinator operate over a day's worth of data
[16:57:23] you have to define the input dataset so that it picks a range of 15 minute intervals for your day
[16:57:24] so in your case
[16:58:06] current(0) would be like 2013-03-10_00:00, and you need to start at the previous day
[16:58:13] which is however many 15 minute intervals are in a day
[16:58:37] 96?
[16:58:52] but just to make sure you get the overlap
[16:58:55] make it 97
[16:59:05] so start instance should be current(-97)
[16:59:25] and, for output-events
[16:59:41] you defined the frequency of the "webrequest-wikipedia-mobile-platform-daily" dataset to be daily
[16:59:48] so you can just do current(-1)
[17:00:05] so when the 2013-03-10_00:00 appears
[17:00:08] oh cool
[17:00:16] oozie knows that it is ready to start crunching data for the previous day
[17:00:23] so since your output dataset is daily
[17:00:29] current(-1) == 2013-03-09_00:00
[17:00:33] right
[17:00:42] great!
[17:00:51] now I understand that a little better
[17:00:58] that was the only known unknown I had
[17:00:58] DAY_REGEX I think is good, i haven't looked at the pig script but I think that makes sense
[17:01:09] yeah, I changed that consistently I think
[17:01:12] anything else I missed?
[17:01:34] you ran bin/hdfs_sync.sh ?
[17:01:40] to get your pig and oozie files into hdfs?
[17:01:51] nope
[17:01:57] you'll have to do that before you submit the job
[17:02:00] but is that after I make sure everything's peachy?
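A sketch of the coordinator shape ottomata describes above, written to a file so the fragment stays runnable shell. The current(-97) start and the daily current(-1) output instance come from the conversation; the dataset names and the input end-instance are assumptions:

```bash
# Not the real coordinator.xml; only the instance arithmetic is from the chat.
cat > coordinator-fragment.xml <<'EOF'
<input-events>
  <data-in name="MOBILE_REQUESTS" dataset="webrequest-wikipedia-mobile">
    <!-- 96 fifteen-minute imports per day, plus one to be sure of overlap -->
    <start-instance>${coord:current(-97)}</start-instance>
    <end-instance>${coord:current(-1)}</end-instance>
  </data-in>
</input-events>
<output-events>
  <data-out name="DAILY_PLATFORM" dataset="webrequest-wikipedia-mobile-platform-daily">
    <!-- daily dataset: current(-1) is the day that just completed -->
    <instance>${coord:current(-1)}</instance>
  </data-out>
</output-events>
EOF
```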
[17:02:00] coordinator*
[17:02:02] yeah
[17:02:09] you'll need to do that to run with oozie
[17:02:12] but yeah
[17:02:19] so, if you haven't run this in oozie at all yet
[17:02:20] so is everything else peachy? :)
[17:02:24] it looks good yeah
[17:02:25] yep, not at all yet
[17:02:33] i'd recommend trying a single workflow first
[17:02:36] before you do the coordinator
[17:02:39] k
[17:03:51] you should be able to use the same workflow.xml you have
[17:04:04] but, create a job.properties file that fills in all of the variables
[17:04:09] that coordinator normally would
[17:04:17] average_drifter: scrum milimetric
[17:04:22] like, get a big list of inputs that you want to run on, set DAY_REGEX manually
[17:04:23] etc.
[17:04:31] the oozie tutorial goes over how to do that
[17:36:44] cool. streaming works for concat.
[17:36:56] will still need some stupid fs commands to rename the file though
[17:36:59] but that's no big
[17:38:10] make a script in kraken/bin?
[17:38:20] it'll be a workflow
[17:38:32] will do after muffin acquisition
[17:38:41] (muffins are srs bsns)
[17:38:46] (like war crimes or livejournal)
[17:38:48] brbrb
[17:43:24] can you run the fs commands as a workflow action?
[17:43:27] oh I bet you can
[17:43:28] oozie is smart
[18:05:00] drdee, kraigparkinson, ottomata: i'm creating cards for the issues i uncovered in the X-CS investigation
[18:05:01] https://mingle.corp.wikimedia.org/projects/analytics/cards/471
[18:05:10] ty!
[18:06:11] dschoon, I take it that's an immediate impact on the quality of service we're able to offer with respect to analytics in production. yes?
[18:06:18] yes
[18:06:29] it does not appear to be frequent, however
[18:06:43] about 1:10,000.
[18:06:55] but some of the requests we get are looking at events on that granularity
[18:07:02] so it would be a considerable stddev for them
[18:07:08] (provided there's an overlap)
[18:08:03] drdee/dschoon, during investigation, can we assess how many/which jobs are affected by that?
[18:08:36] all mobile jobs will be affected
[18:08:41] I'd like to make sure that the stakeholders concerned are made aware, and we can negotiate a resolution time frame.
[18:08:53] but it happens very rarely
[18:09:03] i don't think it affects any current jobs
[18:09:14] with the possible exception of the mobile app job
[18:09:21] as the numbers for that are in the same neighborhood
[18:10:27] it was ~900 records for 30 days
[18:10:37] which is like, vastly beyond IEEE float rounding error
[18:10:38] one short-term solution is to ignore log lines if they contain more fields than we expect
[18:11:09] we had similar solutions in the past when working with wikistats
[18:11:45] and it does not seem to materially affect the numbers we produce; if that were the case then it would be a whole different ball game
[18:11:47] can we drop this to a Standard class of service?
[18:11:52] i think so
[18:12:28] yes.
[18:12:30] assuredly
[18:12:52] INPUT RECORDS: 7,780,811,485
[18:12:59] OUTPUT RECORDS: 927
[18:13:00] so.
[18:13:15] that's a rate of about 1.191e-07
[18:13:17] ahem.
[18:13:28] i think that gets to make its lovely way to "Immaterial"
[18:13:45] http://localhost:19888/jobhistory/jobcounters/job_1364239892421_0177
[18:13:54] Intangible? :)
[18:14:06] :D
[18:14:22] nice.
[18:17:47] updated: https://mingle.corp.wikimedia.org/projects/analytics/cards/471
[18:18:39] thanks :)
[18:27:30] so there's two meetings, kraigparkinson and drdee, which hangout are we using?
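A hedged sketch of the "single workflow first" advice above. The oozie CLI itself is real; the property names, hostnames, and HDFS paths are placeholders, and the real variable list comes from the workflow.xml and the on-wiki oozie tutorial:

```bash
# Placeholder properties; substitute the variables your workflow.xml expects.
cat > job.properties <<'EOF'
nameNode=hdfs://namenode.example:8020
jobTracker=jobtracker.example:8032
oozie.wf.application.path=${nameNode}/libs/kraken/oozie/mobile-platform
DAY_REGEX=2013-03-09
input=/wmf/raw/webrequest/webrequest-wikipedia-mobile/2013-03-09*
EOF

# Submit and start the single workflow run (no coordinator involved).
oozie job -oozie http://localhost:11000/oozie -config job.properties -run
```

And the streaming-plus-/bin/cat concat dschoon confirms above would look roughly like this; the streaming jar location and the HDFS paths are assumptions, while the single reducer is what forces everything into one output file:

```bash
# Identity map and identity reduce; one reducer yields one concatenated
# part file without ever holding the data in a single process's memory.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -D mapred.reduce.tasks=1 \
  -input  /wmf/data/zero/bad-xcs/part-* \
  -output /tmp/bad-xcs-concat \
  -mapper /bin/cat \
  -reducer /bin/cat

# The "stupid fs commands to rename the file" step:
hadoop fs -mv /tmp/bad-xcs-concat/part-00000 /wmf/data/zero/bad-xcs/bad_xcs-2013-03.tsv
```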
[18:27:55] https://plus.google.com/hangouts/_/5b70172d0f7418695ff6d98f3cb53bbb7097e020
[18:28:07] use the hangout from the 30 minute showcase as per instructions from last week. I intentionally removed the hangout from the 1.5 hour one, but it got readded by someone.
[18:28:32] oh gotcha
[18:32:00] oh
[18:32:02] i'm in a diff one
[18:33:54] kraigparkinson, drdee, ottomata: https://mingle.corp.wikimedia.org/projects/analytics/cards/473
[18:33:59] that one is mad fun
[18:34:07] i politely masked out the binary :)
[18:40:35] wow, scientific paper
[18:41:03] didn't know about that, but I did see one that predicts movie box-office profits
[19:51:28] drdee: can I add the bug (segfault) that Andrew reported to me as a card on mingle?
[19:52:12] add it as a task to card https://mingle.corp.wikimedia.org/projects/analytics/cards/460
[20:11:18] hey guys, I'm starting a hangout with Ryan. If anyone would like to join and quietly observe, you're very welcome to: https://plus.google.com/hangouts/_/649e039ee7e02d789aee145b5caa35deae9e45a5
[20:16:04] I wanted to join to quietly observe. G+ told me I'm not allowed
[20:16:07] i gotta run
[20:16:12] back after errands in a bit
[20:17:54] milimetric: can you add stefan?
[20:18:09] ryan's having lunch
[20:18:16] k
[20:18:17] i'll add you though average_drifter, and after lunch we'll talk
[20:18:24] awesome
[20:20:01] ella is still asking me if these were real people, but she says "she liked the man who made the funny faces"
[20:20:16] :) thanks ella
[20:20:47] I ask myself whether or not I'm real all the time, so that's a very good question too
[20:20:57] :D
[20:21:00] drdee: https://plus.google.com/hangouts/_/649e039ee7e02d789aee145b5caa35deae9e45a5
[21:01:22] ottomata, can you help me and milimetric in figuring out to which boxes ssl mobile traffic is sent?
[21:02:31] is it even possible for anything but cp1041-1044 to host mobile sites?
[21:17:42] drdee, ottomata, are you guys talking about this?
[21:17:51] I'm curious too if you wanna bring it into the main channel
[21:18:00] i'll ask
[21:22:03] drdee, i'm asking, but a quick dirty way to check would be to look for X-Forwarded-For fields in the webrequest-wikimedia-mobile data
[21:22:18] if the field is set, then most likely it was set by nginx and proxied to varnish
[21:37:11] i think no one who's listening right now knows
[21:42:01] drdee:
[21:42:01] [~]$ curl --head https://m.wikipedia.org/
[21:42:02] HTTP/1.1 302 http://en.m.wikipedia.org/
[21:42:02] Server: nginx/1.1.19
[21:42:02] Date: Wed, 27 Mar 2013 21:41:49 GMT
[21:42:02] Connection: keep-alive
[21:42:02] Location: http://en.m.wikipedia.org/
[21:42:02] Accept-Ranges: bytes
[21:42:03] X-Varnish: 3031461865
[21:42:03] Age: 0
[21:42:04] Via: 1.1 varnish
[21:42:04] X-Cache: cp1044 frontend miss (0)
[21:42:05] Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
[21:42:22] OHHHHHH
[21:42:23] drdee!
[21:42:26] nginx caches too!
[21:42:32] i think?
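The quick-and-dirty X-Forwarded-For check ottomata suggests above might look like this; the HDFS path, the field separator, and the field position ($12) are all placeholders, and in practice you would look up the real column in the import's format definition first:

```bash
# Count lines where the hypothetical X-Forwarded-For column is set (not "-"),
# i.e. requests that were most likely proxied to varnish by nginx.
hadoop fs -text /wmf/raw/webrequest/webrequest-wikimedia-mobile/2013-03-27_21.00.00/* \
  | awk -F'\t' '$12 != "-" { n++ } END { print n, "lines with X-Forwarded-For set" }'
```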
[21:42:34] yo
[21:42:39] lemme read some more
[21:43:21] hmm, maybe
[21:43:35] naw naw naw naw
[21:43:37] i take that back
[21:44:15] the confs i'm reading are about shared session cache,
[21:44:15] nm
[21:44:16] ok ok ok
[21:44:17] yeah so
[21:44:26] according to my single request at least
[21:44:40] nginx proxies to varnish
[21:45:45] milimetric ^
[21:46:24] i'm hanging out with Ryan, fellas
[21:46:30] I'll be looking at this right after
[21:46:48] no probs, quick summary: I'm 98% sure nginx does what we think it does
[21:46:56] so that would not be the source of whatever problems you are having
[21:47:26] i return
[21:47:52] no way nginx caches!
[21:47:58] i save to disbelieve!
[21:48:29] no it doesn't!
[21:48:31] i was wrong
[21:48:32] it doesn't cache
[21:48:46] i would have disbelieved that too
[21:49:06] we have a high Will score
[21:49:13] so we're pretty safe from crazy illusions like that
[21:49:17] brb a moment, i need washing
[21:53:56] i think i might need to wash my laptop
[21:54:03] which is a vaguely terrifying thought
[21:56:00] drdee: do you have some time later today to chat about infrastructure/capacity stuff?
[21:58:30] yeah! talk about it! schedule in some sprinty time for repuppetization! :D
[22:08:23] dschoon i sure do
[22:08:39] word
[22:08:47] pick a time that's not 4p :)
[22:08:54] i imagine soon, as it's pretty late there
[22:16:13] laters alllll
[22:47:02] brb
[22:59:33] kk