[00:43:19] ping average_drifter [01:32:15] here [01:32:36] drdee: have you tried running the build script / [01:32:36] ? [01:43:07] no [01:43:13] but i trust that it works [01:43:57] send an email to ottomata and me with instructions for ottomata where to find the debian packages so that he can install them tomorrow [01:44:09] what's the progress with the editor version of wikistats? [01:52:43] finishing them up [01:55:56] couldn't, perhaps, that script be called "Makefile"? [01:55:58] just curious. [02:03:26] dschoon: which ? [02:03:43] "build" [02:05:04] there is already a Makefile [02:05:31] dschoon: https://gist.github.com/511891d05a0da84bef5b [02:05:33] dschoon: this is the script [02:05:38] dschoon: it doesn't do what a regular Makefile does [02:05:58] dschoon: it builds packages for different distributions of Ubuntu [02:07:15] ah [02:07:17] my bad :) [02:07:34] why not have a "make packages" target then? [02:08:09] dschoon: it can be done with a Makefile also, yes [02:08:29] dschoon: but it is not specific to one single project, it's a higher-level script [02:10:29] i don't think we should spend much more time on this, it's time to deploy it and move on to the next project [02:10:50] yes [02:15:15] yeah. no big. [02:15:17] ship it. [08:04:08] average_drifter (or others): do we have numbers for active editors per country yet? [08:04:34] i seem to recall this was planned as a feature for (the interactive part of) http://reportcard.wmflabs.org/graphs/active_editors [11:41:31] HaeB: hello, I'm working on it, want to get it ready asap [11:43:43] cool, looking forward to it! (even if i don't need it any more right now - a journalist had asked about the number for one country, but i just gave him some other interesting numbers) [14:24:15] good morning [15:00:56] morning [15:04:43] average_drifter, ottomata, milimetric [15:06:04] hey drdee [15:12:24] hey [15:12:38] wanna start with working on the editors version of wikistats [15:12:40] ? [15:14:43] drdee: I'm working on them in /home/spetrea on stat1 [15:14:53] drdee: I have configured part of it, still have some problems with paths [15:15:13] drdee: I want to get it done asap, I talked with HaeB, he told me he needs it as well [15:15:47] k [15:15:53] hey ottomata [15:16:59] drdee: sending an e-mail to Erik about detecting prod/test environment (which is currently being done by checking -e /home/ezachte and such) and proposing to replace that with checking /etc/wikimedia-realm [15:17:38] morning [15:17:42] drdee: I went to #wikimedia-labs and asked what is a reliable way to get the hostname of a machine on labs and they told me /etc/wmflabs-instancename [15:17:42] well, i need another thing much more urgently ;) - but yes, this is the stuff that the press wants to know often [15:18:05] drdee: if there was such a thing for stat1 and all machines wikistats is running on, then I can use that [15:18:10] drdee: are webstats collected already for blog.wikimedia.org and wikimediafoundation.org? :) [15:18:11] drdee: is there a #wikimedia-prod ? [15:18:30] HaeB: those are part of our release of udp-filters which is ongoing [15:18:46] HaeB: packages are ready but they need to be deployed [15:18:53] ah cool [15:18:59] HaeB: yes we are deploying today [15:19:13] ottomata, can we start with rolling out the new version of webstatscollector?
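A minimal sketch of the realm check average_drifter proposes above. Only the path /etc/wikimedia-realm comes from the chat; the assumption that it holds a single word such as "production" or "labs" is the editor's, not confirmed by the log:

    #!/bin/bash
    # Hedged sketch: pick prod vs. test mode from /etc/wikimedia-realm
    # instead of testing for -e /home/ezachte. File format is assumed.
    realm=$(cat /etc/wikimedia-realm 2>/dev/null || echo unknown)
    if [ "$realm" = "production" ]; then
        mode=production
    else
        mode=test
    fi
    echo "wikistats running in $mode mode"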
[15:19:44] also, fix all the path issues in wikistats [15:19:54] 1) no paths should be hardcoded in wikistats [15:19:55] drdee: equivalent of #wikimedia-labs but for production environments [15:20:15] oh probably #wikimedia-ops [15:20:44] 2) production vs debug mode should be a command line parameter, not determined in the code by the hostname [15:21:51] (re previous topic: once stats like editors/country have been available for a month or so, that should make for a really nice blog post to highlight the results of your work) [15:22:01] totally [15:22:27] hah, drdee [15:22:30] yes! [15:24:22] i have to do some errants for anna, be back in an hour but if you and average_drifter can get this running that would be really awesome [15:24:42] stat1 is the machine where it will run, right? [15:25:16] once we have confidence that the new udp-filter is working fine we should also upgrade udp-filter on locke, oxygen and emery as that would fix the X-Forwarded-For issue [15:26:04] and finally update the server log config: tab separator, accept-language header and append 'new_format' to the log file names so we can easily recognize the old and new format [15:26:30] if we can get these 3 things done today then a whole bunch of people will be happy and then we can go back to kraken tomorrow [15:26:52] ok [15:26:58] sounds good to me! [15:27:01] average_drifter [15:27:07] uhhhh, what can I do? [15:27:07] yes yes [15:27:11] reading [15:28:01] ottomata: can we go for a quick test with the static binaries? [15:28:08] The .deb packages were built with the script /home/diederik/build. You can find .deb packages for both lucid and precise in [15:28:18] /home/diederik/lucid [15:28:19] /home/diederik/precise [15:28:31] average_drifter: which VM? build1 or build2 [15:28:54] sure [15:29:36] drdee: they share the /home so it's the same [15:29:57] aarrgghhhhhh i keep forgetting that [15:32:23] ottomata: on build1 /home/diederik/wikistats/webstatscollector/collector-static [15:32:25] ok so, i see them [15:32:32] oh [15:32:33] static? [15:32:36] what about [15:32:41] ottomata: on build1 /home/diederik/wikistats/udp-filters/udp-filter-static [15:32:49] /home/diederik/lucid/ [15:33:10] ottomata: you can use packages if you want [15:33:26] oh i see [15:33:28] that's the binary [15:33:29] ok [15:33:31] drdee [15:33:35] the packages need to be named like this: [15:33:38] http://apt.wikimedia.org/wikimedia/pool/main/u/udp-filter/ [15:34:27] average_drifter ^^ [15:39:56] ottomata: oh ok, I can add the trailing ~lucid or ~precise [15:41:06] ok, yeah, and I think that needs to be in the changelog version name too [15:41:08] so [15:41:33] for example [15:41:37] for libcidr, I did this [15:41:38] https://github.com/wmf-analytics/libcidr/blob/debian/1.2.0-1precise-wikimedia/debian/changelog [15:42:06] i would recommend modifying changelog with regular changes in master [15:42:06] and then [15:42:13] when you are ready to build packages for different distributions [15:42:20] create a branch with the version and the dist name [15:42:27] and modify changelog there with the dist name [15:42:29] and build the package [15:42:43] oh, changelogs are under control, I can change that quickly with a cmdline param [15:42:44] then, create a new branch from master for the other dist, and do the same, [15:42:46] ok [15:42:50] then you got it [15:45:23] ottomata, quick question: what are these kafka jobs on hadoop, like: http://analytics1001.wikimedia.org:8088/cluster/app/application_1352758471717_0001 [15:45:52] man, drdee, y u no like vpn?
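A sketch of the branch-per-distribution changelog flow ottomata describes above, using standard Debian tooling (dch and debuild). The version strings and branch names are illustrative, modeled on the libcidr branch he links; they are not the real udp-filter versions:

    # Hedged sketch of ottomata's suggested flow. Versions and branch
    # names are illustrative; dch and debuild are standard Debian tools.
    for dist in lucid precise; do
        git checkout -b "debian/0.1-1${dist}-wikimedia" master
        dch --force-distribution --distribution "${dist}-wikimedia" \
            --newversion "0.1-1${dist}" "Build for ${dist}"
        debuild -us -uc          # build the .deb for this distribution
        git checkout master
    done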
[15:45:56] i have to change all of the urls you send me :p [15:46:01] those are the hourly pixel.php imports [15:46:15] i use the new proxy server :D [15:46:21] pshhhhh [15:46:28] but sorry [15:46:37] hehe, s'opk [15:46:37] s'ok [15:46:59] is that data stored? [15:47:15] yeah, if anyone actually sends any [15:47:19] D: [15:47:21] i mean :D [15:47:38] right now it's kinda dumb, i think it's creating files every hour even if there is no data [15:48:24] /user/otto/pixel/logs [15:48:54] aight [15:49:09] we should also talk about the file structure on hdfs [15:49:15] like: [15:49:27] /traffic/ [15:49:45] /db// [15:49:52] and for traffic' [15:50:00] nawwwww why in /? [15:50:13] /traffic// [15:50:17] etc [15:50:21] just a suggestion [15:50:21] ok [15:50:41] it should not be in a user directory i think [15:50:41] i'm all for that, except keeping it / [15:50:42] we can google and see what others do, or just mimick unix [15:50:46] mimic [15:50:46] sounds good [15:51:52] ok, so help me understand one more time what needs to be done to deploy this new stuff [15:51:57] you want to test this stuff on locke first? [15:52:02] or stat1? [15:52:07] stat1 [15:52:09] alongside of the current scripts [15:52:16] current versions [15:52:17] right? [15:52:19] i would like to keep the current webstatscollector running [15:52:21] ok [15:52:21] on locke [15:52:25] and have the new version on stat1 [15:52:29] the changes are quite big [15:52:29] can't we just verify [15:52:33] by taking a sampled log file [15:52:41] and piping that through the old and the new [15:52:44] and compare the results? [15:52:48] do we need to run it live? [15:52:53] good ide [15:52:54] a [15:53:18] you would have to do that on locke with the old version [15:53:22] and on stat1 with the new version [15:53:33] ok cool, so we should be able to do that on stat1, or build1/2, without using the .debs yet [15:53:42] the new version supports new domains [15:53:50] ok [15:53:51] so there will be differences between locke and stat1 version [15:54:11] ok, but you know what they should be, right? [15:54:14] but for enwiki for example there should be no differences [15:54:15] ok [15:54:24] well for the blog we have no comparison data [15:54:44] so whatever it says we have to assume it's correct unless it's so low / high that we know it doesn't make sense [15:55:02] so, remind me real quick. udp2log -> filter -> collector -> file [15:55:03] ? [15:55:21] so I can do [15:55:32] sampled-file | filter -> collector >> file [15:55:33] ? [15:55:58] udp2log -> filter -> log2udp -> collector -> file [15:56:25] so cat -> filter -> log2udp -> collector -> file [15:56:39] you would have to start the new collector in 'debug' mode (ask stefan) [15:56:48] so it will write the file every n minutes [15:56:58] the old version of collector only writes every 60 minutes [15:57:11] so it's a bit complicated to test [15:57:24] so you need the new version of collector because that has a new debug mode [15:57:30] again stefan knows all of this [15:57:36] average_drifter ^^ [15:57:50] so cat -> udp-filter -> log2udp -> collector -> file [15:57:54] ok [15:58:00] so for the old one (which I am doing now [15:58:01] ) [15:58:06] i have to wait 60 minutes?
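A consolidated sketch of the offline replay test being described. The -o switch and port 3815 appear later in the log; the new collector's exact debug flag is an assumption ("ask stefan"), and file names are illustrative:

    # Hedged sketch of comparing old and new webstatscollector offline.
    # Old pipeline (locke / build1): collector dumps every 60 minutes.
    bin/collector &
    cat sampled-1000.log | filter | log2udp -h 127.0.0.1 -p 3815
    # New pipeline (stat1): collector in debug mode dumps every n minutes
    # (flag name assumed); udp-filter -o emits filter-compatible output.
    collector --debug &
    cat sampled-1000.log | udp-filter -o | log2udp -h 127.0.0.1 -p 3815
    # Then diff the dump files, expecting differences only for the newly
    # supported domains (blog, planet, etc.), none for e.g. enwiki.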
[15:58:11] yes [15:58:26] and I have no idea how collector is being run on locke :p [15:58:26] ha [15:58:29] like what is supervising it [15:58:33] and i am not sure if the output location is configurable [15:58:37] afaik someone started it in a screen [15:58:42] average_drifter ^^ [15:58:59] okay i have to do some errants really now [15:59:04] back in 90 [15:59:35] k [15:59:38] errands [15:59:43] you can do errants too if you want [15:59:43] hah [16:00:38] drdee, real quick, before you go. who worked on setting up webstatscollector before the analytics team existed? [16:01:06] probably domas [16:01:22] hm [16:01:31] we gotta bite this bullet once :D [16:01:31] that woulda been a long time ago then, right? [16:01:34] yup [16:01:37] at least 3 years [16:01:52] that's why i want to do stuff first on stat1 [16:01:53] b [16:01:54] e [16:01:55] c [16:01:57] because it's so brittle [16:02:00] really going now [16:02:03] okbye [16:03:10] ottomata: what's that field in changelog [16:03:16] ottomata: when you do lucid-wikimedia [16:03:20] ottomata: what's that called? [16:03:20] ah it daemonizes itself! [16:03:57] i think 'distribution-name' [16:03:59] but i'm not sure [16:21:38] ok, collector is running, waiting to dump some data on build1 [16:22:03] once we have that output, we should be able to do the same thing that I just did with the same file but with the new filter (udp-filter?) and collector [16:22:08] and compare the results [17:17:30] growl, sorry, my irc quit! it does that sometimes [17:17:35] average_drifter, let me know if you need anything [17:30:58] ottomata: --distribution parameter added to git2deblogs [17:31:25] now updating the "build" script so it knows to give a distribution name depending on the machine it's running on [17:32:26] ottomata: using lsb_release to get the distro name [17:32:55] cool [17:42:03] ottomata: what do you think about using git2deblogs on libanon and libcidr? [17:42:18] ottomata: it would allow us to automatically set all the stuff [17:49:42] yo [17:50:03] ottomata: ah it daemonizes itself! [17:50:04] yes [17:54:42] ls [17:55:04] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 [17:56:28] milimetric i am in [18:00:04] average_drifter, re: git2deblogs, don't know what that is, but I guess? [18:00:19] do we need to build new versions of those? [18:02:34] ottomata https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 [18:04:58] ottomata: it's just something that converts your git logs to debian/changelog [18:05:08] ottomata: we're using it for udp-filters and webstatscollector [18:05:51] ottomata: do you manually update your debian/changelog for libcidr and libanon? [18:07:17] yes, but i mean, these are 3rd party libs [18:07:19] we didn't write them [18:07:26] i just created debian packaging for them [18:13:35] ottomata, let me know when I can compare the old and new webstatscollector output [18:13:45] i'll take care of that [18:14:22] cool [18:14:24] yeah they are there [18:14:27] on build1 [18:14:31] /home/otto/webstats/dumps [18:14:39] aight [18:14:45] Reminder: Update Analytics Roadmap! [18:14:58] That's the annoying thing I forgot to mention :) [18:15:01] did that! [18:15:05] sweet!
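What the build script change mentioned above might look like. lsb_release -sc is a standard command that prints the distribution codename; git2deblogs is the team's own tool, so only its --distribution parameter is confirmed by the chat and the invocation syntax here is assumed:

    # Hedged sketch: derive the distro codename on the build VM and pass
    # it to git2deblogs (flag per the chat; exact CLI is an assumption).
    dist=$(lsb_release -sc)     # e.g. "lucid" on build1, "precise" on build2
    ./git2deblogs --distribution "$dist"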
[18:15:07] ha, doesn't look like it worked correctly [18:15:15] drdee, i gotta change locations [18:15:16] * average_drifter worries [18:15:22] i'll be back on in 30 mins [18:15:23] k [18:15:40] i swear to god i leave for work at the same time every day [18:15:56] i have no idea how muni sometimes gets me here 15m early, and sometimes 10m late. [18:17:05] ottomata, quick question [18:17:26] /home/otto/webstats/dumps is the old version? [18:17:40] drdee: whitelist appears to be working for the office now. [18:17:51] i didn't get a password prompt for http://analytics1001.wikimedia.org:8088/cluster [18:18:05] yes, old version [18:18:07] haven't done new [18:18:13] but the file in /home/otto/ is the file I used [18:18:16] to do that [18:18:26] i'll kill my version of collector [18:18:28] and then you can run the new versions [18:19:37] * average_drifter is talking to Erik about wiki edits [18:20:20] dschoon: cool! [18:21:16] ottomata, what was the command line you used for the first test/ [18:21:16] ? [18:24:32] i did [18:24:46] cat file | filter > file.filter [18:24:49] bin/collector [18:25:22] cat file.filter | log2udp -h 127.0.0.1 -p 3815 [18:25:27] then wait an hour [18:25:47] ok, be back in 30 [18:27:41] average_drifter: can you help me? [18:29:36] drdee: yes [18:29:41] drdee: talking with Erik on skype [18:29:46] oh ok [18:29:48] drdee: trying to invite you [18:29:54] ty [18:29:57] drdee: we are discussing some of the geoip [18:30:04] loop me in :) [18:30:15] drdee: we're trying to but I don't have the option on my ipad [18:30:20] drdee: Erik's trying to [18:30:24] ok [18:30:25] maybe I'm dumb but setting the proxy and entering the password doesn't work for me in FF or Chrome [18:30:54] mmmm [18:46:19] anyone know where we're running the support software for the menagerie of hadoop-related systems? (Hive's MySQL, whatever crap Hue requires, etc) [18:49:12] dschoon: an1001 [18:49:44] Yeah, that's what I figured. Gotta move that. [18:53:44] ottomata and i discussed that yesterday, we could use stat1001 or stat1 (the machine that is in eqiad) [18:53:59] hm. [18:54:00] interesting. [18:54:20] i was thinking an R720, since we also need a secondary NN [18:54:30] and they could coexist [18:55:12] does Oozie require a MySQL? [18:55:26] I forget which services have external deps. [18:59:01] oozie, yes it uses the hive metastore [18:59:05] and that's mysql [18:59:13] R720 is fine with em as well [19:05:17] ok back [19:06:21] ottomata, i think you did something wrong with running filter [19:06:36] maybe parameters? [19:06:40] when i just run filter the output is okay [19:06:50] no this is using the old version [19:07:04] (using filter not udp-filter) [19:07:17] hm [19:07:25] is the filter output wrong? or the collector output? [19:07:29] ottomata, average_drifter: check /home/otto/webstats/filtered [19:07:34] that is the correct output of filter [19:07:37] i saved the filter output in a file, so we could check that [19:07:42] ottomata, I sent a reply on the hardware thread [19:07:46] lmk what you think [19:08:00] your output (sampled-1000.log-20121112-10000lines.filtered) is incorrect [19:08:05] hmmm [19:08:13] all I did was cat | file | filter [19:08:20] cat file | filter > file.filtered [19:08:24] dschoon, ok cool [19:08:27] will read in a sec, thanks [19:08:32] i did the same :) [19:08:33] we left out a few allocations [19:08:51] didn't you accidentally run udp-filter? [19:09:57] me?
no [19:10:05] ohhhh, but i didn't do bin/filter [19:10:08] i did filter, hmmmmm [19:10:52] anyway, ok, do you want me to run it again or have you already done it? [19:12:10] i just ran it and i am waiting for collector to output the file ;) [19:12:26] * average_drifter is trying to think what happens if the webstatscollector is broadcasting udp filters and because it's UDP it might deliver some messages to filter and some to udp-filter but maybe not the same to both because UDP is not guaranteed to 100% deliver [19:12:32] ok cool [19:12:44] heh [19:12:51] yeah, but this is just localhost tests right now [19:12:54] so the tests should be valid [19:12:59] oh alright [19:13:01] * dschoon is trying *not* to think about webstatscollector. [19:13:01] i don't think we're going to lose packets on lo [19:13:24] average_drifter: where are the binaries of the new webstatscollector? [19:14:02] drdee: /home/diederik/wikistats/webstatscollector [19:14:07] thx [19:14:26] http://stackoverflow.com/a/2662405/827519 [19:14:52] "If you are losing packets over the loopback interface after sending only 6 or 7 packets, then it sounds like maybe your receive buffer is too small. You can increase the size with setsockopt using the SO_RCVBUF option. However, if you are sending 1500 bytes, then if this is indeed the issue, it means the receive buffer is only about 9K (or more likely 8K, however that seems a rather small default). I believe on Win [19:16:20] man it takes forveeeever to build a raid array on 10 2TB disks! [19:23:50] brb [19:38:21] um, drdee, i don't think we have to wait an hour [19:38:25] we should be able to send a SIGALRM [19:39:43] working with average_drifter right now on comparing files [19:39:47] ok cool [19:39:48] so you can play with kraken [19:40:01] ok, welp, i'm waiting for long running things to finish! [19:40:28] change the server logs? [19:42:06] ? [19:45:44] replace space with tab, add accept-language header? [19:47:05] ahhh, hm, i can do that, sure, wikistats ready for that? [19:53:38] so, drdee, to make those changes, I need an ops person willing to babysit and make it happen with me [19:53:51] might be tough to convince them to do this now that fundraising is about to start being really active [19:53:54] what do you think? [19:54:22] better to do it before it's in full swing [19:54:30] it'll just become a harder sell as time goes on [19:54:54] agree and we definitely need the tab separator before kafka and storm are switched on [19:59:00] why do we def need it before kafka and storm are switched on? [19:59:10] i mean, it'll be nicer, but won't really make that much of a difference, right? [19:59:43] it will be a huge difference because else we have to do a lot of processing of the fields and as it's hard realtime that sucks [19:59:47] but [20:00:01] fundraiser is doing a large trial this week [20:00:06] so better not disturb them [20:00:19] i just remembered that [20:00:49] ottomata ^^ [20:02:06] yeah, that's why I was asking, really, Jeff was talking about that in the ops room just now [20:02:17] can't we just make the scripts split on space or tab? [20:02:22] \s [20:02:23] ? [20:02:23] no [20:02:34] because we don't know how many spaces there are in a field [20:02:45] the user agent string fucks us over [20:02:48] for now we just do what we have been doing [20:02:50] how often though? [20:02:51] that's why we need tab [20:02:52] not that often [20:02:55] right?
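An illustration of the space-vs-tab problem under discussion here. The log line and field layout below are fabricated for demonstration, not the real squid log format:

    # Hedged illustration: with space-delimited logs, a user agent that
    # contains spaces changes the field count, so split(' ') misparses.
    line='cp1001 1 2012-11-12 10.0.0.1 GET /wiki/Foo Mozilla/5.0 (X11; Linux)'
    echo "$line" | awk '{print NF}'            # 9 fields: UA spilled into 3
    # With a tab separator the user agent stays a single field:
    tabline=$(printf 'cp1001\t1\t2012-11-12\t10.0.0.1\tGET\t/wiki/Foo\tMozilla/5.0 (X11; Linux)')
    echo "$tabline" | awk -F'\t' '{print NF}'  # 7 fields, as intended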
[20:02:58] often, often enough [20:03:05] yeah but, it won't stop us from working [20:03:09] i just mean, it's not a blocker [20:03:10] it's a problem [20:03:11] and we should first upgrade udp-filter [20:03:11] but not a blocker [20:03:27] i dunno, it makes the business logic in storm much more complex and slower [20:03:43] anyways this is not the right time (this week) [20:04:19] it does? [20:04:21] how so? [20:04:30] isn't it just one line of code? [20:04:48] split(' ') vs split("\t"), oorrrrr even split("/\s/") [20:05:16] how do you know when the user agent string fields end? [20:05:23] it is not just the user agent string [20:05:25] those lines will cause you problems, sure [20:05:27] but most will not [20:05:35] it is also the mime type field that has this issue [20:05:38] we will have the same problem we have right now [20:05:58] so you have to do a lot of if branching to figure out what fields mean what [20:06:23] it doesn't matter that it doesn't happen often (in fact it does happen often) [20:06:32] you still need code to handle the exceptions [20:06:58] are we handling them right now? [20:07:40] yes, that's why wikistats is a gazillion lines [20:17:01] well well, an02 is back online! [20:17:03] how about that! [20:20:00] WOOT WOOT [20:22:38] an07 is still weird [20:22:42] checking it [20:37:30] i love kernel panics [20:46:07] drdee, why did we want x-wap-profile in event log but not in web log? [20:46:54] i think we want in particular more device information for feature development / experimentation [20:47:06] but we can put it in both as well [20:48:14] well, we had it in web, but then we took it out in favor of x-carrier [20:49:23] those two things have nothing in common [20:50:02] riiighhhhhhht, one of my git commit messages is [20:50:02] Not using x-wap-profile, now using X-Carrier [20:50:05] Date: Wed Jun 20 11:58:02 2012 -0400 [20:50:37] they are still unrelated :) [21:08:56] drdee: in udp2log does the downsampling happen per filter, or per instance? [21:09:09] per filter [21:09:41] drdee: excellent :) [21:10:05] aight [21:13:16] drdee: actually, just so I'm totally clear, the argument -p A,B would maintain separate sample counts for A and B? [21:15:39] not exactly, udp-filter would only pass through urls that match either A or B; the actual counting needs to happen in your scripts [21:16:10] ah; so how does downsampling work then? [21:16:51] I thought it was an argument one passed to udp-filter [21:18:11] the sampling happens using the 'pipe' command [21:18:19] pipe 1 is unsampled [21:18:26] pipe 100 means 1 in 100 [21:18:27] etc [21:18:31] ahah; I see [21:18:42] have a look at the filters in puppet [21:41:36] pipe and sampling have nothing to do with udp-filter [21:41:38] udp2log does that [21:42:01] drdee, i thought we had already done this, but we need the ability to change the whitespace separator via cli flag in udp-filter [21:42:20] isn't that in the new version of udp-filter? [21:42:28] average_drifter ^^ [21:42:53] looks hardcoded to me [21:43:09] didn't I already do this though? [21:43:46] ah naw, I guess not, I had just manually changed it when I was testing [21:44:09] can I change it and commit? [21:44:11] or [21:44:12] i mean [21:44:14] can I add that flag?
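A sketch of udp2log filter stanzas as just explained: "pipe N" hands every Nth line to a command, so sampling lives in udp2log, not udp-filter. The flags shown (-d, -p) are ones mentioned in this log; the stanza layout is inferred from the discussion and the paths are illustrative (the real filters live in puppet):

    # Hedged sketch of udp2log filter stanzas; see puppet for real ones.
    pipe 1 /usr/bin/udp-filter -d 'blog.wikimedia.org' >> /a/log/blog.log
    pipe 100 /usr/bin/udp-filter -p A,B >> /a/log/ab-sampled.log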
[21:44:26] you can yes [21:44:32] I mean from my point of view [21:44:38] I'm debugging right now [21:44:51] ok [21:45:34] totally go ahead, make sure you have the latest version of the code D: [21:45:35] :D [21:45:38] :D :D :D [21:45:39] i do [21:47:26] ottomata: test with -t for the moment, without it there are some problems which I'm fixing now [21:48:31] what does this mean? [21:48:32] Activate internal traffic rules. [21:48:49] internal traffic rules means filter out lines that match internal ips? [21:53:58] ottomata: more than that [21:54:14] ottomata: see collector-output.c, but basically [21:54:32] accepting fixed new domains *.planet.wikimedia.org, wikimediafoundation.org, blog.wikimedia.org [21:54:37] accepting those domains [21:55:19] accepting those domains? were they filtered out before? [21:56:34] and that was kernel panic 2 today [21:56:38] both caused by firefox [21:56:47] ottomata: [21:56:51] ottomata: they're new [21:57:11] erosen: back [21:57:15] milimetric: back [21:57:23] sorry if I am coming into something wayyy too late in the game, are we now using udp-filter to modify the output? [21:57:24] hm? [21:57:26] i mean [21:57:31] do summing like collector does? [21:57:50] oh [21:57:56] kernels and their panicking [21:58:07] someone should make a hitchhiker's guide module for linux [21:58:08] * HaeB looks at the list of these three domains with wide round eyes like a kid before the christmas tree [21:59:30] i'm really confused though, why are those domains in udp-filter code [21:59:56] ottomata, because that functionality used to be in 'filter' [22:00:04] and filter did 50% of what udp-filter was doing [22:00:24] so instead of having two separate filter pieces [22:00:30] sooooooooooooooo do not listen to me because I am too late to make comments on this [22:00:35] i decided to integrate them in a single package [22:00:38] but i think you are dangerously close to feature bloat [22:00:44] no? [22:00:56] we just copy / pasted the code from filter to udp-filter [22:01:02] tools should do one thing and do it simple [22:01:05] and gave it the -o switch to activate that [22:01:09] but doesn't filter do something completely different than udp-filter? [22:01:25] ottomata: with -o udp-filter does the same as filter [22:01:31] now you are saying "ok, udp-filter can be used to geocode and anonymize and filter out lines ORRRR you can do some fancy hardcoded url aggregation stuff" [22:01:39] no, it filters urls, the only thing it does is that it sends it in another output format [22:01:47] no aggregation [22:01:53] aggregation happens in collector [22:01:56] sounds like this should be [22:02:08] udp-filter -d 'domain1,domain2' | output-transformer [22:02:11] not [22:02:18] udp-filter --output-transformer [22:02:22] i mean, there are hardcoded URLs in here now [22:02:30] udp-filter is no longer content agnostic [22:02:56] why not? [22:02:56] ottomata: hardcoded urls are only employed if you do -t [22:03:43] because now the code knows about the content [22:03:43] filter was a completely unmaintained piece of code [22:03:58] only if you run it in webstatscollector mode [22:04:02] (so again, i'm not suggesting we change things this late, just giving opinions now that I know about it) [22:04:12] right, but the whole concept of 'a special mode' sounds bad to me [22:04:22] but this is just for quick fixes anyways [22:04:35] to make people like HaeB very happy [22:04:37] do we want to add a new mode for every different output type people want?
[22:04:43] udp-filters is on its way out [22:04:55] no we are not gonna do that [22:05:21] we just simplified the C legacy stuff to one "filter" package and one "collector" package [22:05:38] there is a filter package now? wait, i thought you just said you were getting rid of filter [22:05:51] i used "" [22:06:01] so the "filter" package is udp-filter [22:06:09] ah [22:06:17] yeah why not just modify the old filter package and have a deb for it? [22:06:21] rather than bloating udp-filter? [22:06:36] because we would duplicate functionality [22:06:44] filter needs bot filtering [22:06:50] but that could be useful in udp-filter as well [22:07:21] i see your point that it is not the most elegant solution [22:07:43] but at least we take control of webstatscollector, which was long overdue [22:07:57] and we fix some long outstanding wishes [22:11:00] :) [22:12:02] ottomata, kraken related question, erosen and I are trying to run a pig job and get the following error: [22:12:03] java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "analytics1017":8041; [22:12:16] are the dell machines already fully part of the cluster? [22:12:32] should be, but I didn't check aside from seeing that they joined the namenode [22:12:46] trying to access the log i get: [22:12:47] Proxy Error [22:12:48] The proxy server received an invalid response from an upstream server. [22:12:52] Reason: DNS lookup failure for: analytics1017 [22:13:09] hm [22:13:42] hmm, i have the names hardcoded in /etc/hosts on an01 [22:13:42] hmmmm [22:13:46] i'll add those for the others too [22:13:51] dunno why we need that though, but for now I'll just do that [22:13:52] thx [22:14:49] k try now [22:15:05] oops one sec [22:15:10] no still proxy error [22:16:40] wait, probably because the link on http://analytics1001.wikimedia.org:8088/cluster/app/application_1352758471717_0037 is incorrect [22:16:53] oooook now [22:16:54] do it [22:17:12] Failed redirect for container_1352758471717_0037_01_000001 [22:17:16] but no proxy error [22:17:53] i think whenever it says failed redirect for container [22:17:59] that means your job has been moved to jobhistory [22:18:55] hmm [22:20:01] okay [22:20:01] no it is not in jobhistory either [22:20:04] partial success [22:20:09] i tried running a new job [22:20:12] and it is working [22:20:17] http://analytics1001.wikimedia.org:8088/cluster/app/application_1352758471717_0040 [22:20:31] <== this guy needs to get some advil (second wisdom tooth growing) [22:20:32] drdee: ^ [22:20:35] * average_drifter runs to pharmacy [22:20:42] brb 20m [22:20:43] oy [22:20:46] good luck [22:20:51] totally good luck!
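The stopgap ottomata applies above, sketched; the IP below is a placeholder, not the real address of analytics1017:

    # Hedged sketch: pin worker hostnames in /etc/hosts on each node so
    # the ResourceManager and proxy links resolve. Run as root.
    echo '10.0.0.17  analytics1017' >> /etc/hosts   # placeholder IP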
[22:20:58] average_drifter ^^ [22:21:06] erosen: got it, following job [22:21:20] however, the logs link still doesn't work for me: [22:21:21] http://analytics1017:8042/node/containerlogs/container_1352758471717_0040_01_000001/erosen [22:21:50] it actually might be running [22:21:51] that is a really weird link [22:22:15] so the bad news is that I commented out all of the interesting stuff in the script [22:22:22] so it is just loading and filtering and writing [22:22:28] but it is working [22:22:34] the job seems to be running fine [22:24:13] yeah [22:28:16] ottomata, drdee: fyi the log link works now: [22:28:16] http://analytics1001.wikimedia.org:19888/jobhistory/logs/analytics1017:8041/container_1352758471717_0040_01_000001/job_1352758471717_0040/erosen [22:28:26] not sure what changed [22:28:38] cool [22:28:39] as in why the link is different [23:15:05] ottomata, still around? [23:16:13] ja, halfway, but ja [23:16:53] i wanna make one quick change to hadoop, is puppet off? [23:16:58] should be! [23:17:00] lets see [23:17:13] NO! [23:17:14] it is not off [23:17:15] grrrr [23:17:26] it's like a zombie that won't die [23:18:05] well, it's confusing because of the secondary puppetmaster [23:18:10] I had resolved this weirdness once, but now I dunno [23:18:11] back on advil, this thing should kick in within the next 20m [23:18:16] :D [23:18:20] ok it's off now [23:18:20] I'm back on the code [23:18:22] we'll see if it comes back on [23:18:27] thx [23:19:35] average_drifter: we should also add support for the wikivoyage and wikidata domains to webstatscollector [23:19:45] drdee: will do [23:20:21] drdee: question for collector output (-o) [23:20:24] drdee: last column is title [23:20:29] drdee: what should I do if there is no title? [23:20:37] output a "no_title" ? [23:21:21] which domains are susceptible to that? [23:21:27] wikis are not AFAIK [23:21:46] so only the blog and planet domains, right? [23:23:24] planet domains have title set to "main" [23:23:35] blog.wikimedia has a title, and it is correctly extracted [23:23:42] example for exceptional cases http://upload.wikimedia.org/math/2/b/6/2b6cd1cc064daedda6a821242d9ea512.png [23:23:59] what title should we give to this one ? [23:24:32] let me look [23:24:38] ok [23:24:43] that's a title? [23:25:10] no, just an url [23:25:15] from the logs [23:25:29] but we should not count pageviews for upload [23:25:32] or bits [23:25:38] for that matter [23:26:22] upload.wikimedia.org / bits.wikimedia.org should be ignored by webstatscollector [23:26:37] alright, will add these 2 new rules (and the ones with wikivoyage and wikidata) [23:26:50] btw, wikivoyage and wikidata are on the whitelist/blacklist ? [23:26:58] whitelist I suppose [23:27:00] mmmmmm maybe it's better to only whitelist domains [23:27:04] yes those go on the whitelist [23:27:18] ok [23:31:54] ottomata, should i update khadoop to work with the dell machines as well? [23:33:38] hm, yeah totally [23:34:06] anything i should be aware of? [23:34:53] can i just add an11-an20? [23:45:34] yeah just those [23:48:02] aight
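The whitelist rules agreed on above, restated as a sketch. The real logic lives in C (collector-output.c); this shell restatement is the editor's, and the project-domain list is illustrative and partial:

    # Hedged restatement of the webstatscollector counting rules discussed
    # above; the actual implementation is C, not shell.
    should_count() {
        case "$1" in
            upload.wikimedia.org|bits.wikimedia.org)
                return 1 ;;  # never count media/asset servers
            *.planet.wikimedia.org|wikimediafoundation.org|blog.wikimedia.org)
                return 0 ;;  # new domains accepted via -t
            *.wikipedia.org|*.wikivoyage.org|*.wikidata.org)
                return 0 ;;  # project wikis (list here is illustrative)
            *)  return 1 ;;  # whitelist-only, per the discussion
        esac
    }
    should_count "blog.wikimedia.org" && echo "counted"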