[01:11:53] it'd be easier to just filter out all traffic from russia and china :)
[01:12:11] strangely, i bet things would look awfully reasonable then
[01:13:24] man. 8k URLs? seriously?
[01:13:39] IE truncates anything longer than 2048
[01:13:46] (IE6-7)
[01:14:34] dschoon: functional javascript is crazy
[01:14:35] :)
[01:14:50] String.prototype.toFunction() for lyf!
[01:49:43] i don't know WHO thought it was okay to run more than one appliance at the same time
[01:49:44] but geez
[14:26:36] Morning guys
[14:27:19] Gonna be online a bit later
[14:28:03] morning
[14:33:38] Moooooooooooooooorning!
[14:33:57] Is it a stormy day today?
[14:34:26] not so bad, no!
[14:34:30] looks nice
[14:34:51] :D
[14:35:42] I was referring to 'storm' ;)
[14:37:10] ahhhh
[14:37:11] haha
[14:37:15] maybe! heh
[15:31:54] morning ottomata, drdee
[15:35:07] ottomata: what is the feasibility of moving all of the sampled logs over to hadoop?
[15:35:14] looks like 184 G
[15:35:24] yeah we can do that, i've done it before…wait, have I already...?
[15:35:43] http://hue.analytics.wikimedia.org/filebrowser/view/user/otto/logs/sampled?file_filter=any
[15:36:06] nice
[15:36:22] is that all of them?
[15:36:32] that recent enough, though?
[15:36:42] great
[15:36:49] i think I'll be doing a bit of pig counting today
[15:38:43] also ottomata, with the new cluster / proxy config can I still ssh into the machine running hue? (an10)
[15:39:12] yup, should be the same
[15:39:18] hue is actually on an27
[15:39:23] but, hadoop namenode is an10
[15:39:39] any machine with a hadoop client on it will work for your purposes though (assuming you want to do cli hadoop stuff)
[15:40:29] ah
[15:40:46] so the next question is what do i need to do to actually ssh
[15:40:53] should I be using the hue username?
[15:41:09] same creds you use for stat1
[15:41:11] basically for some reason what I used to do doesn't seem to work
[15:41:12] hmm
[15:41:23] hm no worky?
[15:41:29] can you ssh to analytics1001.wikimedia.org?
[15:41:38] ssh erosen@analytics1010.wikimedia.org
[15:41:46] aah
[15:41:46] yes
[15:41:47] can
[15:41:56] ahhh
[15:41:56] yeah
[15:41:59] no wikimedia.org
[15:42:04] wikimedia.org is only for analytics1001
[15:42:05] 1001 works though
[15:42:07] yeah
[15:42:08] cool
[15:42:11] that's fine, and you can do stuff from there
[15:42:15] but, for the others, you need to use a bastion
[15:42:21] either an01 or fenari or whatever
[15:42:28] gotcha
[15:42:29] just like you would for other internal hosts
[15:42:29] but
[15:42:34] ssh analytics1010.eqiad.wmnet
[15:42:40] are all of the machines set up with the same hadoop config
[15:42:50] so they know where to look for the name node and hdfs stuff?
[15:43:43] they should be, ja
[15:43:46] cool
[15:43:54] just checking if I understood correctly
[15:43:59] kewl
[15:44:11] ja, and both an01 and an10 have a large /a partition
[15:44:16] so if you need to work with some big files locally
[15:44:20] either of those are good machines to do so
[15:45:01] cool
[15:45:02] thanks
[15:48:56] Nemo_bis: did you say the channel was dead?
[16:00:10] jeremyb: no, just that I wasn't paying attention
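For reference, the bastion setup described above can be captured once in ~/.ssh/config so that a plain `ssh analytics1010.eqiad.wmnet` works from a laptop. A minimal sketch, assuming analytics1001.wikimedia.org as the bastion and the stat1 username `erosen` (both taken from the conversation); `ssh -W` requires OpenSSH 5.4 or newer:

    # ~/.ssh/config: reach internal analytics hosts through the public bastion
    Host *.eqiad.wmnet
        User erosen
        ProxyCommand ssh -W %h:%p erosen@analytics1001.wikimedia.org

Per the conversation, fenari or an01 would work equally well as the ProxyCommand host.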
[16:02:53] erosen, i'm going to run a benchmark on the cluster, you using it atm?
[16:03:04] nope
[16:03:05] go for it
[16:03:21] how long do you need? I would probably start messing around in the next hour if possible
[16:05:59] i dunno, not that long, these previous benches took about 40 mins
[16:06:08] buuuuut, i'm having trouble finding the jars I used before now that we're on cdh4
[16:06:10] so it might take me a bit
[16:06:11] anyway
[16:06:15] just lemme know when you need it
[16:06:20] and i'll make sure we don't conflict
[16:06:36] cool
[16:07:01] maybe let me know when you're done
[16:07:17] it's sort of the next thing on my list, but i can find other stuff to do in the meantime
[16:20:29] erosen, if you got stuff, do it, it's not looking good for this bench
[16:20:32] i don't think I can run it with YARN
[16:20:42] k
[16:21:05] i won't be hitting it hard immediately (as I'll just be testing out the script for a bit)
[16:43:05] hmmm, erosen, all of a sudden it is working?
[16:43:13] i'm running TeraGen right now, which is just generating the data
[16:43:19] go for it
[16:43:20] this isn't the benchmark yet, so it's ok to run things alongside of it
[16:43:23] i'm not doing anything yet
[18:01:20] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[18:02:33] apparently hangout urls are stable now!
[18:02:38] dan noticed
[18:03:51] ahj!
[18:33:36] ottomata: any ideas on this pig error: ERROR 2997: Encountered IOException. Call From analytics1001.wikimedia.org/208.80.154.154 to analytics1001.wikimedia.org:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
[18:33:46] looks like it might be a cluster networking issue
[18:35:47] hey folks
[18:36:11] erosen: is the scrum still running?
[18:36:20] i'm just hanging out
[18:36:29] no one else came
[18:36:40] so you haven't missed anything
[18:36:44] k - I was running late
[18:36:47] no worries
[18:37:03] thx for the follow up on pig btw
[18:37:18] no worries
[18:37:19] I ended up parsing the data in a shell script as it was faster
[18:37:28] but I definitely want to play with it
[18:37:29] i just got back into it today
[18:37:33] ya
[18:37:45] and speaking of which - one question for ottomata
[18:38:05] is the plan to keep importing the project count dumps on a regular basis?
[18:38:14] not sure
[18:38:23] my assumption is: not yet
[18:38:47] but you could ping ottomata and he might be able to easily set it up
[18:38:49] daily pv per project is really a basic and very fundamental dataset we should serve/visualize
[18:39:04] yeah definitely
[18:39:15] i'm actually working on something like that right now
[18:39:23] once I get it up and running, I'll let you know
[18:39:27] sweet
[18:39:40] (ottomata - consider yourself pinged)
[18:40:58] a note that I just sent to E3: the launch of the fundraiser caused the 2nd daily max in new account regs on enwiki in 2012 (the 2012 daily max being post SOPA)
[18:41:14] http://toolserver.org/~dartar/reg2/
[18:41:39] 5923 new accounts on 11/27
[18:42:41] very cool
[18:46:54] ottomata: could you update your stuff on this page? http://www.mediawiki.org/wiki/Analytics/Roadmap
[19:03:56] robla, done!
[19:04:06] excellent, thanks!
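The benchmark being set up above is the standard TeraGen/TeraSort pair that ships with Hadoop's examples jar, the same jar ottomata was hunting for after the cdh4 upgrade. A hedged sketch of how it is typically invoked; the jar path is the usual CDH4 location and the row count is an arbitrary example, neither taken from the log:

    # step 1: TeraGen only generates input data (safe to run alongside other jobs)
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
        teragen 1000000000 /benchmarks/teragen
    # step 2: TeraSort is the actual benchmark; it sorts what TeraGen produced
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
        terasort /benchmarks/teragen /benchmarks/terasort

This matches the comment at 16:43:20: only the sort phase measures anything, so other jobs can run alongside the generation step.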
[19:04:57] DarTar PONGED
[19:05:05] uhhhrrr no plans to regularly import at the moment
[19:05:12] howdy
[19:05:23] right now we're trying to just serve one-off requests, as we're not quite stable yet
[19:05:27] but hmmmm
[19:05:30] ic
[19:06:11] there are a lot of regular core jobs we need to start implementing soon
[19:06:18] i'm still mainly working on infrastructure stuf
[19:06:19] stuff
[19:06:52] dschoon (and I) will be working on best practices for regular stuff like this when he has more time away from Limn and things start to settle
[19:07:06] dec 10!
[19:07:10] ok, this one definitely has high priority for me
[19:07:35] is there a preview instance of the new limn we can play with?
[19:09:06] did you end up pigging the daily per project thing?
[19:09:19] would having that regularly in hadoop actually help with your prioritization?
[19:10:48] no, I just did it via a shell script
[19:11:24] I needed that data quickly and that was a way faster solution, but I am definitely planning to play with pig
[19:12:15] aye cool, hm, ok, i think we want to have stats like these available in hadoop
[19:12:16] erosen got me started the other day and I'm carving out some more time later this week for this
[19:12:25] buuuuut, hopefully computed by hadoop, instead of via webstats
[19:12:47] so this is kind of a down-the-road-for-sure thing, and it will be nice to be able to verify our numbers vs. webstatscollector (/domas)
[19:13:07] but! if you need more data in hadoop to crunch, let me know and I will import
[19:14:03] totally, I am not sure what ez does with the raw data to generate project counts (I understand there's quite a lot of interpolation going on for the holes in the data), but it would be good to reproduce the counts in hadoop
[19:20:20] yup
[19:20:58] i think for reportcard, ez might do some mangling, but for the most part, the hourly project counts come from Domas' webstatscollector stuff
[19:26:02] another squid entry that is crashing udp-filter now, working on fixing it https://gist.github.com/9f2b039ae9216dde7b47
[19:28:13] when you read in each field, can you just read in a fixed number of bytes?
[19:28:34] and truncate the rest?
[19:30:49] ottomata: yes, that's what I was thinking also
[19:30:56] I should definitely do that
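Circling back to the daily page views per project job discussed above (18:38 onward), here is a rough Pig sketch of what "computed by hadoop, instead of via webstats" could look like. Only a sketch: the field layout of the sampled squid logs and the output path are assumptions for illustration, not taken from the log:

    -- assumed (hypothetical) layout of the space-delimited sampled squid log fields
    LOGS = LOAD '/user/otto/logs/sampled' USING PigStorage(' ') AS (
        host:chararray, seq:long, ts:chararray, reqtime:chararray,
        ip:chararray, status:chararray, size:long, method:chararray, url:chararray);
    -- day = first 10 chars of the ISO timestamp; project = hostname part of the URL
    PROJ = FOREACH LOGS GENERATE
        SUBSTRING(ts, 0, 10) AS day,
        REGEX_EXTRACT(url, 'https?://([^/]+)/', 1) AS project;
    GROUPED = GROUP PROJ BY (day, project);
    COUNTS = FOREACH GROUPED GENERATE
        FLATTEN(group) AS (day, project), COUNT(PROJ) AS views;
    STORE COUNTS INTO '/user/erosen/daily_project_counts';

Output like this would be one way to verify the hadoop numbers against webstatscollector's hourly project counts, as discussed at 19:12.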
[19:32:39] ottomata: any ideas if something is up with pig?
[19:32:55] i'm just doing a test command:
[19:32:55] grunt> LOG_FIELDS = LOAD '/home/otto/logs/sampled' USING PigStorageWithInputPath()
[19:33:21] i get: pig script failed to validate: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://analytics1001.wikimedia.org/user/erosen
[19:33:22] Details at logfile: /home/erosen/pig_1353110462950.log
[19:33:31] hmm
[19:33:40] you running from hue or cli?
[19:33:54] oh that is totally the wrong namenode name
[19:33:54] hmmm
[19:33:58] yeah, where are you running that?
[19:34:04] directly on an10
[19:34:06] hm
[19:34:07] or an01
[19:34:13] an01?
[19:34:22] yeah
[19:34:25] i think i tried both
[19:34:50] erosen@analytics1001:~$
[19:35:02] was the shell from which i invoked grunt
[19:35:18] is the error exactly the same from an10?
[19:35:45] oo actually
[19:35:55] I take it back--I didn't try it on an10
[19:35:58] cause I couldn't log in
[19:36:11] and you just opened up pig by typing 'pig'?
[19:36:14] yeah
[19:36:17] do this:
[19:36:21] open pig/grunt:
[19:36:23] pig
[19:36:23] then
[19:36:24] ls
[19:36:31] it should show you the contents of your hadoop user dir
[19:36:42] works
[19:36:48] hm
[19:36:53] it seems to be the use of PigStorage
[19:36:56] that creates the issue
[19:37:06] it is looking for my home dir on the old name node
[19:37:13] oh
[19:37:25] well one thing
[19:37:27] your path is wrong
[19:37:28] you want
[19:37:32] /user/otto/logs/sampled
[19:37:34] not /home...
[19:37:39] aah
[19:37:57] but, it's still weird that it thinks the namenode is analytics1001, hmm maybe that is irrelevant
[19:38:00] same prob though
[19:38:03] maybe it just reports the machine you issued from?
[19:38:04] oh poop
[19:38:27] ooo
[19:38:28] nooo
[19:38:31] my fault now
[19:38:31] weird
[19:38:51] eh?
[19:39:01] it wasn't the same error
[19:39:13] it was an imports problem
[19:39:21] so it was just that my path was wrong
[19:39:28] and then I got all excited about the namenode
[19:39:34] sorry for the bother
[19:40:05] hm, ok…ok the PigStorage...
[19:40:06] ok cool
[19:40:43] yeah it was a special PigStorage subclass I had made to keep the file names
[20:38:28] ottomata: another issue which might be a cluster issue
[20:38:38] the job tracker urls are on analytics1010
[20:38:53] but I can't access them
[20:38:58] any ideas?
[20:39:25] there is no jobtracker :p
[20:39:27] yarn doesn't have one
[20:39:36] your best bet is
[20:39:43] jobs.analytics.wikimedia.org
[20:39:50] or jobhistory.analytics.wikimedia.org
[20:40:43] so that is where I started
[20:40:58] but the job-specific ApplicationMaster is on an10
[20:41:37] what url?
[20:41:55] http://analytics1010.eqiad.wmnet:8088/proxy/application_1353342609923_0879/
[20:42:22] i tried switching the eqiad to wikimedia, though I figured that wouldn't help cause I thought only an01 has a public ip
[20:43:41] hmm
[20:43:41] http://jobs.analytics.wikimedia.org/proxy/application_1353342609923_0879/
[20:43:46] that is not ideal :p
[20:44:45] cool
[20:44:46] works for me
[20:45:16] is that the basic pattern for the current proxy solution?
[20:46:01] yeah, hrm
[20:46:03] hadn't thought of this
[20:46:15] the .eqiad urls work for me cause i'm on vpn
[20:46:16] hm
[20:46:20] ya
[20:46:50] the other proxy worked because it sent ALL of your browser requests through the proxy, rather than just the ones that you've defined in your hosts file
[20:47:12] hm poo hmm
[20:47:23] would it be possible to just add to the hosts file?
[20:47:27] no, because of the port
[20:47:29] otherwise yes
[20:47:34] but since :8088 is on there
[20:47:40] aah
[20:47:41] it won't hit haproxy on an01 (which is listening only on port 80)
[20:47:53] poop.
[20:47:58] ok, adding a todo to think about that
[20:48:05] cool
[20:48:14] thanks for the quick workaround
[20:48:43] ja
[20:48:55] if you have another url like that, check the index page
[20:49:09] it has links to the internal urls too, and you can figure out which internal url maps to which host alias
[20:49:30] which index page?
[20:49:44] analytics.wikimedia.org
[20:49:57] ja
[20:50:16] (internal) is a diff url than the main one linked there
[20:50:20] but that is the mapping
[20:50:27] gotcha
[20:50:28] cool
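The workaround above boils down to a fixed rewrite of the internal YARN web UI address onto the public alias. A small shell helper capturing that mapping; the hostname pair is taken verbatim from the log, while the function name is made up here for illustration:

    # hypothetical helper: map an internal YARN proxy url to the public alias
    yarn_url() {
        echo "$1" | sed 's|http://analytics1010\.eqiad\.wmnet:8088|http://jobs.analytics.wikimedia.org|'
    }

    yarn_url 'http://analytics1010.eqiad.wmnet:8088/proxy/application_1353342609923_0879/'
    # prints: http://jobs.analytics.wikimedia.org/proxy/application_1353342609923_0879/

As noted at 20:47, an /etc/hosts entry cannot do this, since hosts entries only remap names, not the :8088 port, so the request would never reach haproxy listening on port 80.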