[14:30:42] heeeeehhihaaaaa
[14:46:34] mooorning ottomata
[14:47:18] morning!
[14:49:13] quick sync?
[14:49:16] https://plus.google.com/hangouts/_/4901decbfb1b665489d46b2aabe8aa2ef330f9e0
[14:51:11] chat for a min? I'd have to go upstairs
[14:51:26] k
[14:52:19] whaaassuuup?
[14:53:35] not much; just talk bout varnish
[14:53:36] ?
[14:53:41] aye
[14:54:06] how about today I write an email to ops asking if partitioning the udp2log stream is cool
[14:55:39] ok, sounds good; and can you start prepping separate config changes to the logging format?
[14:55:58] yeah, lemme add that
[14:56:09] i have one todo that has a due date:
[14:56:13] tab as separator
[14:56:17] jan 31 it says :p
[14:56:35] but the others: alpha/beta cookies? that one is interesting
[14:56:38] i have to say I don't like it very much
[14:56:44] what was the third?
[14:57:25] good morning everyone
[15:01:29] good morning milimetric
[15:01:36] ottomata, what don't you like?
[15:01:57] well, it seems so one-off
[15:02:13] we're like "Oh, somebody's application sets some cookies! We should log those in the webrequest logs"
[15:02:28] it's fine, we can do it
[15:02:43] well, that's the only way we can distinguish beta visitors from regular visitors
[15:03:00] would it be better to add an arbitrary field to the line that anyone can use?
[15:03:03] some
[15:03:16] X-Meta header or something?
[15:03:24] then their app can just set it?
[15:04:28] the current way is already operational, but can't we just put all the cookie key/value pairs in a single field and delimit by semicolon?
[15:04:55] hmmmmmmmmmmmmmmmmmmm
[15:05:02] i'd be for that, but it might be a lot, no?
[15:05:17] there can be a lot of cookies, right?
[15:05:24] we set very few
[15:05:27] almost none
[15:06:13] hmmmmmm
[15:45:09] drdee, i'm trying to figure out how to test this cookie thing
[15:45:24] and you are right! mediawiki does not set cookies!
[15:45:24] heh
[15:46:46] :)
[15:48:57] were you planning on sending that mobile email today?
[15:49:35] drdee: I'm missing some specs
[15:50:02] ottomata, yes, once we all agree
[15:50:06] average_drifter: what specs
[15:50:07] ?
[15:50:08] drdee: we'll have to talk about them today
[15:50:09] drdee: you know when we discussed the new mobile pageviews reports?
[15:50:37] drdee: you told me, I think, about matching on url not only on *.m.wikipedia.org but more like *.m.wikipedia.org/wiki/
[15:50:48] and *.m.wikipedia.org/w/api.php
[15:50:51] not sure
[15:52:15] waiting for december to finish fixing from the "11.22.33.44|XX" problem
[15:52:30] after that's done, I'm going to run the reports for december as well (country reports)
[15:52:43] busy right now
[15:52:54] ok
[15:57:41] average_drifter: i am ready
[15:57:56] I am ready too
[15:58:36] https://plus.google.com/hangouts/_/4901decbfb1b665489d46b2aabe8aa2ef330f9e0
[15:59:14] tl;dr => you told me that instead of matching *.m.wikipedia.org we should match on *.m.wikipedia.org/wiki/ and *.m.wikipedia.org/w/api.php for new mobile pageviews reports
[15:59:52] drdee: it says I don't have permission for that hangout
[16:00:30] prolly it's not for me
[16:00:31] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[16:01:07] I'm there
[16:11:37] drdee, just so I have it straight:
[16:11:41] log format changes are
[16:11:54] - space -> tab
[16:11:55] - fix x-carrier
[16:11:55] - cookies
[16:11:55] right?
[16:13:37] yes
[16:13:41] and i would say x-carrier
[16:13:51] first, then cookies, then tab
[16:14:11] tab after cookies!
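
A minimal sketch of what the tab and cookie changes discussed above could look like from a consumer's side. This is hypothetical, not the actual udp2log format: the field order, the cookie field position, and the "mf_useformat=beta" cookie name are all made up for illustration.

    # Hypothetical sketch: a tab-separated webrequest line whose last field
    # carries all cookie key/value pairs delimited by semicolons.
    def parse_line(line):
        fields = line.rstrip("\n").split("\t")   # tab as separator instead of space
        cookie_field = fields[-1]                 # assumed: cookies live in the last field
        cookies = dict(
            pair.strip().split("=", 1)
            for pair in cookie_field.split(";")
            if "=" in pair
        )
        return fields, cookies

    # Illustrative only; the cookie name is a stand-in for whatever
    # alpha/beta cookie the mobile app actually sets.
    _, cookies = parse_line("example.m.wikipedia.org\t/wiki/Foo\tmf_useformat=beta; other=1")
    is_beta = cookies.get("mf_useformat") == "beta"
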
[16:14:13] you crazy
[16:14:14] but ok
[16:16:46] would it be bad to deploy these all at once?
[16:16:52] i would submit them as separate patchsets
[16:17:02] i need ops help to deploy these
[16:17:10] and the fewer times I need to bother them the better, right?
[16:17:19] I guess I can ask in the RT I make for this
[16:19:23] uhhmmm, no no
[16:19:29] happy to listen to your advice
[16:19:37] what do you think is the best sequence?
[16:21:01] ottomata ^^
[16:21:36] combine cookie and x-carrier in one
[16:21:40] and tab separate
[16:21:40] cookie seems like it could be more controversial, and since I have to submit these patchsets sequentially
[16:21:52] i would almost have it in order of least resistance
[16:21:57] x-carrier, tab, cookie
[16:22:01] ok
[16:22:02] BUUUUT
[16:22:05] the first two have been talked about for a while
[16:22:07] cookie is new
[16:22:14] tab requires changes to udp-filter and webstatscollector as well
[16:22:22] and needs to be announced
[16:22:34] hm, ok…….
[16:22:42] i can make the filter change as part of the same patchset
[16:22:54] you
[16:23:00] you've already announced once that this is coming, right?
[16:23:08] not widely enough
[16:23:21] i want to send out an email to the list with a scheduled date
[16:46:02] ok, run for december started
[16:46:14] for country reports
[16:46:20] it should be ~20h or so
[17:00:43] drdee: diffdb?
[17:01:00] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[17:02:20] ohhhhh ok
[17:35:15] moornin
[17:36:38] morning
[17:40:21] moooooorning
[17:53:54] hi folks....there's currently a "Firehose checkin" scheduled for noon PST today....I'm assuming that's not actually happening, right?
[17:54:37] drdee: ^ ?
[17:54:44] yeah we assume not
[17:54:55] due to the eqiad migration, right?
[17:54:57] no we are not d
[18:34:17] oh - very cool git tip Matt showed me
[18:34:20] git checkout -
[18:34:23] works just like cd -
[18:34:35] (goes back to the last branch you were on)
[18:35:18] I DID NOT KNOW cd -
[18:35:21] !!!!!!!!!!!!!!!
[18:35:30] MY LIFE IS FOREVER CHANGED
[18:35:45] i know pushd and popd, but those are annoying to use
[18:37:53] lol
[18:38:00] <3
[19:07:16] drdee: it worked! http://analytics1027.eqiad.wmnet:8888/filebrowser/view/user/diederik/wikihadoop/part-00000?file_filter=any
[19:07:32] no it did not
[19:07:35] hehe
[19:07:37] really?
[19:07:38] the command was wrong
[19:07:43] rerunning it right now
[19:07:49] i see ....
[19:07:59] http://analytics1010.eqiad.wmnet:8088/proxy/application_1355947335321_9549/
[19:08:10] what happened?
[19:08:24] it used the regular input codec
[19:08:26] is this just the original dump?
[19:08:28] not the wikihadoop one
[19:08:29] yes
[19:08:34] gotcha
[19:08:44] so i forgot to add
[19:08:44] http://analytics1010.eqiad.wmnet:8088/proxy/application_1355947335321_9549/
[19:08:50] -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat
[19:09:10] the current output looks real good
[19:09:34] join me https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[19:36:36] drdee,
[19:36:44] yo
[19:36:59] i just checked an hour of blog data in kraken vs the same sequence numbers in the files
[19:37:04] i've got 100% of data in both
[19:37:13] ugh
[19:37:23] third possibility was the only one we didn't want!
[19:37:49] are you going to run it for a longer period of time?
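
The completeness check being discussed here boils down to comparing the sequence numbers that arrived against the range of sequence numbers that were sent, per host. A rough sketch of that kind of check follows; it is not the actual script used, the field positions are assumptions, and as noted below it breaks when a varnishncsa restart resets the counter mid-window.

    # Rough sketch of a per-host sequence-number completeness check.
    from collections import defaultdict

    def completeness(lines, host_field=0, seq_field=1):
        seen = defaultdict(set)
        for line in lines:
            fields = line.split("\t")
            seen[fields[host_field]].add(int(fields[seq_field]))
        report = {}
        for host, seqs in seen.items():
            expected = max(seqs) - min(seqs) + 1   # assumes no counter reset in the window
            report[host] = (len(seqs), expected, 100.0 * len(seqs) / expected)
        return report

Running the same style of check over the same time window on both sides of the pipeline (udp2log to file versus udp2log through kafka into hdfs) gives percentages of the kind quoted below.
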
[19:38:26] well, i mean, that is a good thing, 100% of data in both
[19:38:47] that means that kafka importing works, and udp of a small partition works
[19:38:55] yeah, i'll check a day's worth
[19:39:11] oh i don't have a full day yet
[19:39:12] uummm
[19:39:14] i'll check what I have
[19:39:24] but you saw 5% missing packets with the same method yesterday right? So it's not that it's working, it's just working _right now_
[19:39:33] drdee, did, I didn't
[19:39:36] same == kafka
[20:05:07] ah poo, there are duplicate seqs in the data I'm examining, because the varnishncsa instance was restarted a few times
[20:05:07] hm.
[20:18:58] ok, got it
[20:19:01] over a larger timespan
[20:19:02] https://gist.github.com/4598026
[20:19:05] drdee
[20:19:07] so
[20:19:11] udp2log > file
[20:19:13] got 100% of sampled data
[20:19:20] udp2log | kafka | hdfs
[20:19:21] is missing some
[20:20:50] but it has most of it
[20:20:53] 99.0894137% of it
[20:21:26] this is over an almost 24-hour period
[20:21:30] about 23 hours
[20:21:36] I will email results
[20:29:43] that's good enough for now
[20:32:14] ottomata, could you install http://pypi.python.org/pypi/diff-match-patch/ on all an* nodes?
[20:32:38] or is there an easy way for me / erosen to do this?
[20:33:37] you guys have sudo, but it'd be annoying for you to do it on all nodes
[20:33:42] yes ;)
[20:33:51] why do you need all nodes? streaming job?
[20:33:58] yeah
[20:34:01] but streaming hadoop
[20:34:03] not streaming pig
[20:35:14] cool
[20:35:17] urrrrm
[20:35:37] is there a .deb?
[20:35:43] probs not
[20:35:59] I can pip install?
[20:36:01] i think i might be able to find a way to ship it, though
[20:36:03] yeah
[20:36:07] just tested it myself
[20:36:43] shipping it would be better if you can, will it be real hard?
[20:36:59] no
[20:37:04] drdee didn't ask me yet
[20:37:05] erosen, how?
[20:37:19] the -files does not work
[20:37:19] we can do the tar trick
[20:37:28] go for it!
[20:37:33] why won't the files work?
[20:37:35] if you can figure out the tar trick, that will be very helpful for me too
[20:37:59] we tried that :)
[20:38:07] hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -libjars wikihadoop-0.2-CDH4.jar -files revision_differ.py,diff_match_patch.py -input /user/diederik/arwiki-20130120-pages-meta-history.xml.bz2 -output /user/diederik/wikihadoop4 -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat -mapper revision_differ.py
[20:38:14] -files revision_differ.py,diff_match_patch.py
[20:38:22] it is sending the diff_match_patch.py file
[20:38:31] but python cannot find it
[20:38:58] interesting
[20:39:15] i'm confused
[20:39:28] i thought we were just trying to "ship" the mapper/reducer files before
[20:39:51] i was shipping both
[20:39:53] I didn't realize we had tried the libraries
[20:40:01] the revision_differ and its dependency
[20:40:51] in the worst case couldn't we shell out and install it with user rights from within the script?
[20:40:55] install w/ pip, I mean
[20:45:31] i've finally got a dev env working, so I'll hack on it for a while
[20:58:01] guys, is this a good place to start reporting our progress on mobile analytics:
[20:58:02] http://www.mediawiki.org/wiki/Analytics/Mobile
[20:58:10] (page does not exist yet)
[21:43:16] ottomata, thanks for your email!
[21:43:53] how would udp2log / flume work?
[21:47:42] i'm not running flume right now, so I don't have the data for that time period
[21:47:53] are you asking if it would work better?
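
One plausible workaround for the -files problem discussed at 20:38 (diff_match_patch.py ships with the job but the mapper cannot import it) is to put the streaming task's working directory, where -files symlinks land, on sys.path before the import. This is an assumption about a fix, not what was actually deployed; the mapper skeleton below is a sketch, not the real revision_differ.py.

    #!/usr/bin/env python
    # Assumed workaround: files shipped with -files (or unpacked from an
    # -archives tarball, i.e. the "tar trick") end up in the task's working
    # directory, so adding that directory to sys.path should make the
    # shipped diff_match_patch.py importable.
    import os
    import sys

    sys.path.insert(0, os.getcwd())
    sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

    import diff_match_patch  # the dependency shipped alongside the mapper

    def main():
        for line in sys.stdin:
            # real revision-diffing logic omitted; this skeleton just echoes input
            sys.stdout.write(line)

    if __name__ == "__main__":
        main()
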
[21:52:55] yes
[21:55:59] not sure
[21:56:12] i would say no, because yesterday when I was trying it, it wasn't looking so good
[21:56:18] i think it *should* work better though!
[21:56:19] but it wasn't
[22:08:59] ottomata, drdee, how does flume write to HDFS?
[22:10:07] you guys wanna have a general chat about this stuff?
[22:10:13] i think I need to talk to someone, i'm feeling a wee bit discouraged
[22:10:33] let's talk
[22:10:56] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[22:22:44] dude! ottomata!
[22:22:45] SO GOOD