[13:15:59] morning average_drifter
[13:16:37] hey drdee
[13:17:41] yo
[13:19:20] fixed the issues with git-dch?
[13:20:54] drdee: not yet
[13:21:03] drdee: but I'm reading docs on how to solve them
[13:21:30] k
[13:21:34] good morning!
[13:22:51] gooooooood morning milimetric
[13:22:56] how was your weekend?
[13:23:27] great. we had a big party for our anniversary and both our birthdays which are all around now
[13:23:42] i wrote a windows 7 phone app
[13:24:01] and i even got a little progress on limn
[13:24:23] you guys were all working yesterday! that's awesome
[13:24:34] me, ottomata and average_drifter
[13:24:43] when is your birthday?
[13:24:54] thursday
[13:24:58] my wife's was last week
[13:25:05] and our one year anniversary just passed too
[13:25:17] how was your weekend
[13:25:20] that's amazing!
[13:27:01] prepping for the arrival of the kiddo :)
[13:27:28] and improving kraken performance by approx. 30%
[13:27:30] aww, that's awesome
[13:27:37] woa
[13:27:55] how'd the performance go up?
[13:28:24] there are many variables in hadoop that you can tweak
[13:28:56] but mainly using larger buffers, and giving more memory, and increasing the number of tasks that run in parallel
[13:29:08] so do you guys have some idea of how much data we can churn through in how much time?
[13:29:14] yup
[13:29:19] is it consistent or does it depend a lot on the analysis?
[13:29:29] it totally depends on the mapper function
[13:29:42] but i used a built-in hadoop benchmark called TestDFSIO
[13:30:02] and that shows the throughput
[13:30:18] we should enable more monitoring to assist in further optimizations
[13:31:04] morning louisdang
[13:31:10] :)
[13:31:29] so how fast is it? How many gigatons per second are we talking at?
[13:31:29] what app did you write?
[13:31:43] the throughput is about 70Mb/s
[13:31:48] oh cool
[13:32:12] but this is still only 7 nodes
[13:32:21] i hope ottomata can get the other nodes to run as well
[13:32:22] it's a logistics app for my dad, it manages people's workflow as they're loading / unloading cargo with just a phone (so taking a picture is the equivalent of filling out a form)
[13:32:43] not exactly fun stuff but my dad pulled favors :)
[13:33:08] meh, but if we need any mobile development, I can now do Android / Windows Phone / PhoneGap at least
[13:33:11] is it C# based?
[13:33:21] yep
[13:33:43] made a REST service in the cloud and C# on the phone
[13:33:50] and limn?
[13:34:52] oh i watched the queen of versailles last night, it's a documentary
[13:34:58] utterly bizarre
[13:36:19] i just looked that up - very
[13:36:43] limn: I split up the d3 thing I did in a way that should be absolutely perfect for limn
[13:37:25] each component just has to know what data it's rendering, and has to have draw/redraw functions
[13:39:02] if you want to see the difference, it's this: https://github.com/milimetric/limn/commit/48310d6dd801bb8a22426b19b981f21b2396bda6
[13:39:08] thx
[13:40:11] the documentary is interesting; i also just felt sad for these folks, not because they lost their money but because of how they live
[13:47:44] mmmmmoorning ottomata!
[13:47:51] morning!
[13:47:57] whats goin down?
[13:49:09] not much not much
[13:49:12] you doing more hadoopy things?
[13:49:23] yes! wanna talk about it with you first though
[13:50:02] i got a list of things that i think we need to do, some more urgent than others, i can turn them into asana tasks or we can look at the list first
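The TestDFSIO numbers quoted above (roughly 70 MB/s aggregate on 7 nodes) invite the "how much data in how much time" question milimetric asks. A minimal back-of-the-envelope sketch follows; the 70 figure is treated as MB/s, the 500 GB/day input volume and the 10-node target are made-up illustrative numbers, not real Kraken figures, and TestDFSIO measures raw HDFS I/O rather than real job throughput, which depends on the mapper.

```python
# Rough scan-time estimate from the TestDFSIO figure quoted in the chat.
# DAILY_INPUT_GB and NODES_PLANNED are hypothetical, for illustration only.

AGGREGATE_THROUGHPUT_MB_S = 70.0   # per the conversation, treated here as MB/s
NODES_NOW = 7
NODES_PLANNED = 10                 # hypothetical target cluster size
DAILY_INPUT_GB = 500.0             # hypothetical input volume

def scan_hours(input_gb, throughput_mb_s):
    """Hours needed to read input_gb once at the given aggregate throughput."""
    return input_gb * 1024 / throughput_mb_s / 3600

now = scan_hours(DAILY_INPUT_GB, AGGREGATE_THROUGHPUT_MB_S)
# Assume throughput scales roughly linearly with node count -- a crude but
# common first approximation for DFS I/O.
later = scan_hours(DAILY_INPUT_GB, AGGREGATE_THROUGHPUT_MB_S * NODES_PLANNED / NODES_NOW)
print(f"~{now:.1f} h on {NODES_NOW} nodes, ~{later:.1f} h on {NODES_PLANNED} nodes")
```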
[13:50:06] but first…..
[13:50:24] there seems to be a Kafka C++ client https://github.com/kafka-dev/kafka/blob/master/clients/cpp/README.md
[13:50:30] or was this already known?
[13:52:17] yes
[13:52:24] but it does not have zookeeper support
[13:52:28] which kinda defeats the purpose
[14:02:07] whatcha wanna talk about?
[14:02:09] list of things you think?
[14:12:53] hangout in 10 min?
[14:13:01] hmmm
[14:13:05] (was sky ping with ez)
[14:13:13] ok
[14:26:01] hey drdee
[14:26:06] so i'm about to run a new pig job
[14:26:07] yo
[14:26:08] taking this into account
[14:26:09] https://gerrit.wikimedia.org/r/gitweb?p=analytics/wikistats.git;a=blob;f=squids/SquidCountArchiveProcessLogRecord.pm;h=5b0d03d6473ce5d63afc6f9495af8651bd90f74b;hb=HEAD#l18
[14:26:12] basically
[14:26:22] if mime == text/html or mime == '-'
[14:26:27] does that make sense?
[14:26:41] oh hmmmm
[14:26:51] no i would only count '-' if the url matches those
[14:26:53] too
[14:26:53] i don't think you want mime == '-'
[14:26:53] hmm
[14:27:07] if mime == '-' AND ($url =~ /\.m\..*?\/wiki\//) || ($url =~ /\.m\..*?\/w\/index.php/)
[14:27:15] that doesn't seem like a valid mime type response
[14:27:16] to me
[14:27:39] read the comment
[14:27:42] https://gerrit.wikimedia.org/r/gitweb?p=analytics/wikistats.git;a=blob;f=squids/SquidCountArchiveProcessLogRecord.pm;h=5b0d03d6473ce5d63afc6f9495af8651bd90f74b;hb=HEAD#l20   # no mime type on log records from varnish, assume 'page request' on most, until that stream had been fixed
[14:32:22] ughh, you are right :) that's actually a bug we should fix in the logging config
[14:35:43] okay, i created an asana task https://app.asana.com/0/1056305935923/2108205677948
[14:43:53] running a big pig job again
[14:43:59] there is more for it to do now (more regexes to match)
[14:44:04] also fewer maps
[14:44:08] probably due to block size changes?
[14:46:59] yes
[14:47:10] fewer maps are due to larger block size
[14:47:12] did we manually distcp the sampled logs though?
[14:47:16] not yet
[14:47:21] so they are probably still the same block size
[14:48:22] maybe, don't know that for 100%
[14:48:28] skypie?
[14:54:55] yeah block size on those is 64MB
[14:55:03] gotta poo, then let's skype...
[15:05:46] drdee eeeee
[15:09:28] drdee: getting close
[15:20:00] hm, rats drdee
[15:20:10] those changes didn't affect my september number very much
[15:20:18] 6950454000 vs
[15:20:18] 9165370101
[15:20:30] before I included mobile varnishes with '-' as content type
[15:20:31] i get
[15:20:32] i got
[15:20:33] 6944272000
[15:20:58] odd
[15:21:12] a difference of 6M requests
[15:21:16] so I guess that is a lot
[15:21:34] but not enough to make up for the 2 billion I am missing
[15:49:35] drdee
[15:49:35] https://gist.github.com/3859662
[15:49:41] aight
[15:53:35] ottomata, drop the index.php from your regex
[15:53:45] drdee: https://gist.github.com/00cb20d791e8b8fb4101
[15:53:51] drdee: not ready yet, underway
[15:54:05] why? ez has it
[15:54:05] drdee: we're gonna be rollin very soon
[15:54:49] drdee, ez has:
[15:54:49] # no mime type on log records from varnish, assume 'page request' on most, until that stream had been fixed
[15:54:49] if (($url =~ /\.m\..*?\/wiki\//) || ($url =~ /\.m\..*?\/w\/index.php/))
[15:57:45] time for lunchy, heading home for leftovers, be back in a bit
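The page-request condition being debated above (count text/html, and count '-' only for mobile-varnish URLs, per the quoted lines from SquidCountArchiveProcessLogRecord.pm) can be summarized in a small sketch. This is an illustrative Python approximation, not the actual Pig or Perl code that was run; only the two URL regexes are taken from the conversation.

```python
import re

# Mobile-varnish URL patterns quoted from SquidCountArchiveProcessLogRecord.pm above.
# The helper and its argument names are an illustrative sketch only.
MOBILE_WIKI_RE = re.compile(r'\.m\..*?/wiki/')
MOBILE_INDEX_RE = re.compile(r'\.m\..*?/w/index\.php')

def looks_like_page_request(mime_type, url):
    """Approximate the wikistats condition: text/html counts; a '-' mime type
    (varnish records with no content type) counts only for mobile page URLs."""
    if mime_type.startswith('text/html'):
        return True
    if mime_type == '-':
        return bool(MOBILE_WIKI_RE.search(url) or MOBILE_INDEX_RE.search(url))
    return False

print(looks_like_page_request('-', 'http://en.m.wikipedia.org/wiki/Main_Page'))  # True
print(looks_like_page_request('-', 'http://en.wikipedia.org/wiki/Main_Page'))    # False
```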
[15:58:07] ottomata, what happens to lines with != 14 fields?
[15:58:33] good q…not sure what PigStorage does
[15:58:38] will look into it
[15:58:55] either it discards them or fields get wrongly assigned
[16:27:48] ottomata, i am counting the lines with more than 14 fields in a bash script for september
[16:27:54] this could take a while :(
[16:28:00] maybe run a pig script that does that :D
[16:28:12] i read about PigStorage
[16:28:30] it will just parse the line (that's my impression) and then misalign a field
[16:28:39] but on second thought, it should actually not matter
[16:28:50] because the field that causes issues is the mime type field
[16:28:57] and all the relevant fields are before that field
[16:29:00] so......
[16:32:39] hmm
[16:37:17] ok, just tested it out
[16:37:32] for lines with more than 14 fields it just truncates the last fields and doesn't include them
[16:37:41] for lines with fewer than 14 fields, it just leaves null values for the missing fields
[16:37:46] ok
[16:38:00] so you would have lines that are probably missing the user agent string
[16:38:11] of course depending on how many spaces it finds
[16:38:19] yeah, but that's ok though, since user agent is last
[16:38:24] (this also completely illustrates why we need to move to the tab separator
[16:38:30] and i'm not using it anyway
[16:38:31] for this
[16:38:50] right, so that's why i said i don't think it matters in this case
[16:38:57] aye ja, just confirming
[16:40:01] this is pretty strange though
[16:40:05] if I was off by 2 billion more
[16:40:14] then maybe I would think there is something else I need to filter out
[16:40:28] but I think i'm being pretty generous
[16:40:39] hmm
[16:40:44] maybe the 200|302 is causing a problem?
[16:40:53] i was just thinking the same
[16:41:00] just run it with code 200
[16:42:50] actually yeah
[16:42:53] so, i'm pretty sure
[16:42:57] hmm actually
[16:42:58] no
[16:43:13] can you help me find the variable in SquidCountArchiveProcessLogRecord.pm that == a page view?
[16:43:16] i guess this
[16:43:17] $html_pages_found += $count_event ;
[16:43:17] right?
[16:44:34] ok, it looks to me like he is only checking a few things
[16:44:42] 1. mime is text/html (or '-'
[16:44:45] and matches that regex)
[16:44:57] 2. return if $line =~ /Banner(?:Cont|List|Load|beheer)/io ;
[16:45:02] so he doesn't count lines that match that regex
[16:45:05] aside from that
[16:45:06] that's it
[16:45:09] no filtering otherwise
[16:45:15] so really just text/html and not a banner page
[16:45:29] not even 404s?
[16:45:40] well maybe just mimic that exactly
[16:45:42] and see what happens
[16:46:01] hilarious that we count 40x and 50x as pageviews
[16:46:20] i knew we did it in webstatscollector but i didn't realize we also did it in wikistats
[16:50:15] brb coffee TIME!
[16:50:27] i am on a coffee hiatus
[16:59:41] hangout?
[17:00:02] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90
[17:01:04] yo
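The PigStorage behavior ottomata reports testing above (extra fields truncated, missing fields padded with nulls when a space-delimited schema expects exactly 14 columns) can be modeled with a tiny sketch. This is a toy Python illustration of the described behavior, not Pig itself or the schema actually in use.

```python
# Toy model of the observed PigStorage behavior for a 14-field schema:
# surplus fields are dropped, short lines are padded with nulls (None here).

EXPECTED_FIELDS = 14

def split_like_pigstorage(line, n=EXPECTED_FIELDS, delimiter=' '):
    parts = line.split(delimiter)
    if len(parts) >= n:
        return parts[:n]                        # a spacey user agent just gets truncated
    return parts + [None] * (n - len(parts))    # short lines come back with nulls

short = split_like_pigstorage('a b c')
long = split_like_pigstorage(' '.join(str(i) for i in range(20)))
print(len(short), short[-1])   # 14 None
print(len(long), long[-1])     # 14 '13'
```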
[17:15:59] average_drifter: ready to kick those debian packages?
[17:16:51] drdee: got the thing to generate nice changelogs
[17:17:14] nice
[17:19:15] drdee: I have to polish it a bit, still causes some small problems
[17:19:16] drdee: https://gist.github.com/00cb20d791e8b8fb4101
[17:19:18] drdee: please try it
[17:19:40] drdee: you need JSON::XS installed
[17:19:52] drdee: sudo cpan JSON::XS
[17:20:18] drdee: or easier, sudo aptitude install libjson-xs-perl
[17:23:11] drdee: it will back up your debian/changelog (you can git-checkout it later to revert any changes) in debian_changelog_
[17:23:26] drdee: you can cat the debian/changelog after it's finished running
[17:23:28] what issues are left?
[17:27:36] my macbook starts beeping
[17:27:43] ???
[18:13:37] drdee: ?
[18:13:44] yo
[18:13:46] drdee: hey
[18:29:11] hey
[18:29:20] can't get your script to run on osx
[18:29:24] but that's okay
[18:29:31] let's move to labs if you are ready
[18:36:17] average_drifter ^^
[18:36:29] drdee: I still need to fix one error and add a DEBMAINTFULL env variable so we can set the maintainer
[18:36:47] drdee: do we want the date to be the one on which the tag was made?
[18:37:09] drdee: what I'm referring to are the lines like
[18:37:10] -- s Tue, 09 Oct 2012 20:51:25 +0300
[18:37:25] these are put by dch inside the changelog whenever a new version is made (aka release)
[18:37:41] (aka we make a new tag and debianize a new version of the package)
[18:38:17] another example of such lines
[18:38:19] -- Diederik van Liere (Wikimedia Foundation) Mon, 21 May 2012 17:42:30 -0500
[18:38:19] i would grab the date from the commit belonging to the tag, not from the tag
[18:38:29] yes, alright :)
[19:35:25] ok drdee
[19:35:26] 2012-09 de 117320000
[19:35:26] 2012-09 en 626649000
[19:35:26] 2012-09 es 120097000
[19:35:26] 2012-09 fr 111205000
[19:35:26] 2012-09 ja 110618000
[19:35:27] 2012-09 ru 111562000
[19:35:34] those are 404 counts
[19:36:14] quite significant
[19:37:24] that's 2.3 Billion 404s across those 6 languages in 2012-09
[19:38:19] if we subtracted 404s
[19:38:59] that would lower the all projects line in the pageview graph by that much, from 19.1B to 16.8B in 2012-09
[19:59:54] yes so this is a discussion i am having with erik m and howie, i think we need to bite the bullet soon
[20:00:13] but it impacts our larger goals so timing is crucial
[20:00:29] average_drifter, to the lab?
[20:08:54] ottomata, doesn't that add up to less than 2.3 billion? just roughly adding it in my head it looks more like 1.3 billion
[20:09:39] and isn't that the 1/1000 sample? Do we have a good idea of the +/- accuracy on that technique?
[20:10:40] i just multiplied by 1000, so i dunno
[20:10:56] um, lemme double check…, i summed with awk
[20:11:38] OH yes
[20:11:42] sorry, I summed some extra numbers
[20:11:48] good catch, thanks dan
[20:11:49] oh, that aren't here?
[20:12:04] np
[20:12:08] 1197451000
[20:12:38] yeah, I included log files in my input data from the months surrounding 09, just in case there were log lines in the 08-31 or 10-01 days
[20:12:52] but when I piped through awk I didn't filter those months out
[20:12:53] so
[20:12:55] yeah
[20:13:09] 1.2 billion 404s in those 6 languages in 2012-09
[20:13:34] so do you think Erik just didn't filter out 404s?
[20:15:34] that's most likely the case
[20:16:38] as far as I can tell, he doesn't filter on http status at all
[20:16:42] only content type
[20:16:50] gotcha, ok
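The correction milimetric makes above checks out when the six per-language totals quoted in the chat (already the 1:1000 sample counts multiplied by 1000) are summed: the result is the 1,197,451,000 figure ottomata settles on, i.e. roughly 1.2 billion rather than 2.3 billion. A short sketch of that arithmetic:

```python
# Re-checking the 2012-09 404 totals quoted above (already scaled from the
# 1:1000 sample, per the conversation).
counts_2012_09 = {
    'de': 117_320_000,
    'en': 626_649_000,
    'es': 120_097_000,
    'fr': 111_205_000,
    'ja': 110_618_000,
    'ru': 111_562_000,
}

total = sum(counts_2012_09.values())
print(total)                                                # 1197451000
print(round(total / 1e9, 1), 'billion 404s across those 6 languages')  # 1.2
```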
[20:36:05] I'm slightly confused
[20:36:32] what's up?
[20:36:38] bug.. :|
[20:36:51] perl bug or shell bug?
[20:37:11] related to iterating through commits
[20:37:24] so what I do is:
[20:37:37] 1) I just get the commits for each tag that was made
[20:38:17] 2) then I assume the first commit on master is the *start_commit* of the first tag and the commit the first tag was made on is the *end_commit*
[20:39:02] 3) so then I go one further and I say, ok this is the start of the second tag
[20:39:31] 4) and then I get the commit which the 2nd tag was made on and I use that to be the *end_commit* of the 2nd tag
[20:39:40] and I do that for all tags
[20:40:23] so basically you have
[20:41:25] commit_0 commit_1 commit_2(tag 0.1) commit_3 commit_4 commit_5(tag 0.2) commit_6 commit_7 commit_8(tag 0.3) commit_9 commit_10 commit_11(tag 0.4) commit_12 commit_13
[20:43:09] so what I do is I say
[20:43:20] ok, tag 0.1 starts from commit_0 and ends at commit_2
[20:43:33] tag 0.2 starts at commit_3 and ends at commit_5
[20:43:51] tag 0.3 starts at commit_6 and ends at commit_8
[20:44:09] tag 0.4 starts at commit_9 and ends at commit_11
[20:44:58] hold on
[20:46:27] commit_12 and commit_13 don't really matter at this point because there wasn't any tag made that includes them, so.. that means we didn't have a release yet. they will have to wait for the next tag that will be made and then the automated script will sync them also from git-log ===> debian/changelog
[20:49:11] * drdee is reading
[20:49:56] that all makes sense
[20:50:07] so what's the issue?
[20:51:06] the issue is sometimes start_commit <====> end_commit is...
[20:51:08] actually just one commit
[20:51:43] what happens is I call git log to get commit messages
[20:51:47] and I call it like this
[20:52:07] git log sha_start_commit..sha_end_commit
[20:52:24] these 2 commits must be different
[20:52:32] otherwise git-log will give an empty output
[20:52:54] so this happens if you tag two commits right after each other?
[20:52:58] now looking for some switches to git-log that will help to just print the git-log output of one single commit
[20:53:17] if I tag two commits right after each other ?
[20:53:25] the best case is if each tag has at least one commit
[20:53:53] so
[20:54:00] right but the problem happens if you tag two consecutive commits?
[20:54:15] commit_0(tag 0.1) commit_1(tag 0.2)
[20:54:30] so currently tag 0.1 consists of just commit_0 and tag 0.2 of just commit_1
[20:54:34] if that's the case, just give an error and say that there should be at least one commit between two tags
[20:54:51] that's fine as well, i wouldn't worry about this too much
[20:56:04] drdee
[20:56:05] https://gist.github.com/3861365
[20:56:53] reading
[20:57:25] milimetric: i would love love love love love love support in limn for external datasources (like gist :D :D : D)
[20:57:38] oh me too
[20:58:06] I believe stuff like that will be pretty darn quick once all this setup work is done
[21:00:33] do you have to contact an admin to open ports for Hadoop? I can't get my nodes to contact each other.
[21:00:45] ottomata, so the exact valid http codes are 200, 302, 304, possibly 301 as well
[21:01:01] louisdang: probably easiest to run hadoop in local mode
[21:01:17] or ask in #wikimedia-labs
[21:01:22] ok
[21:55:44] drdee: is whym still using the instances I have?
[22:12:41] louisdang, yes
[22:19:03] ottomata: useful for setting up ganglia monitoring for hadoop: http://files.cloudera.com/pdf/Hadoop-Operations_sampler_2012-08.pdf
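The tag-walking logic average_drifter describes above (each tag's changelog entry covers the commits since the previous tag, and a range that collapses to a single commit would make `git log start..end` print nothing) can be sketched roughly as follows. This is a minimal Python illustration around standard git commands, not the actual changelog script in the gist; the repository layout and tag names are assumptions.

```python
# Minimal sketch of grouping commit messages per tag for a changelog, with a
# fallback for the single-commit case discussed above.
import subprocess

def git(*args):
    return subprocess.check_output(['git', *args], text=True).strip()

def messages_for_tag(prev_tag, tag):
    """Commit subjects belonging to `tag`: everything after prev_tag, up to tag."""
    rev_range = tag if prev_tag is None else f'{prev_tag}..{tag}'
    out = git('log', '--format=%s', rev_range)
    if not out and prev_tag is not None:
        # Two tags on (nearly) the same commit: show at least the tagged commit
        # instead of an empty start..end range.
        out = git('log', '-1', '--format=%s', tag)
    return out.splitlines()

# Tags in creation order; --sort=creatordate needs a reasonably recent git.
tags = git('tag', '--sort=creatordate').splitlines()
for prev, cur in zip([None] + tags, tags):
    print(cur, messages_for_tag(prev, cur))
```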
[23:47:38] drdee: question for you/analytics team; when calculating page views/readership -- how do you account for bots?
[23:48:16] mwalker: they are not excluded
[23:48:56] we are working on a separate collection of datasets that exclude bot traffic
[23:51:36] why are you asking?
[23:52:52] this is related to the fundraising effort to determine why our banner impressions are lower this year than last
[23:53:13] has the issue with the BannerController.js file been resolved?
[23:53:26] it's down to 0.25% of all requests
[23:53:32] from 0.50%
[23:53:56] we're clearing them as we see them; and the cache epoch has been updated (but that takes a month to completely clear things)
[23:54:12] we have a list with URLs that generate the 404s
[23:54:19] you know that, right?
[23:54:39] https://gist.github.com/3829470
[23:54:40] possibly; Jeff_Green set me up a 1:1000 log with that
[23:55:05] check the gist, that shows for each BannerController url how often it results in a 404
[23:55:18] so you should clear those
[23:56:00] ah... but the problem is not with the page not existing... it doesn't exist because it's deprecated and annoying to backport the code
[23:56:05] it's the referrers we're clearing
[23:57:12] we can help there as well :)
[23:57:20] what are the invalid referrers?
[23:57:20] oh... do tell!?
[23:57:44] anything that attempts to load Special:BannerController
[23:58:17] ok, so how about we generate a list like the gist i showed you for referer urls, would that be helpful?
[23:58:24] absolutely
[23:58:29] you got it!
[23:58:46] maybe later tonight, else first thing tomorrow morning
[23:58:56] nifty
[23:59:04] if you have more fundraising-specific requests, just ping me
[23:59:13] we want to help you guys out
[23:59:19] yep yep! before you run away though; one more thing related to pageviews
[23:59:25] shoot
[23:59:35] do you have any stats on user-agents between this year and last year?
[23:59:49] what exactly?
[23:59:50] (to see if there were more bot hits this year than last)
[23:59:56] you could check wikistats for that as well
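The referer breakdown drdee offers to generate above, like the URL gist but keyed on the referring page, could be pulled from the 1:1000 sampled logs along these lines. This is only a sketch: the field positions are hypothetical placeholders, not the real squid/varnish log layout, and this is not the script drdee actually ran.

```python
# Rough sketch: count referers of sampled requests that 404 on
# Special:BannerController, then scale the sample back up.
# STATUS_COL / URL_COL / REFERER_COL are hypothetical positions; check the
# real sampled-log format before using anything like this.
import sys
from collections import Counter

STATUS_COL, URL_COL, REFERER_COL = 5, 8, 11   # hypothetical 0-based field indices

referers = Counter()
for line in sys.stdin:
    fields = line.split(' ')
    if len(fields) <= REFERER_COL:
        continue
    if 'Special:BannerController' in fields[URL_COL] and '404' in fields[STATUS_COL]:
        referers[fields[REFERER_COL]] += 1

for ref, n in referers.most_common(25):
    print(n * 1000, ref)    # scale the 1:1000 sample back up
```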