[00:01:28] drdee: please review [00:03:17] merged [00:03:59] drdee: yay ! :) [00:04:09] looks all good! [00:04:16] so now we can test tomorrow [00:04:28] ttyl [12:51:31] morning milimetric [12:51:39] morning average_drifter [12:51:51] morning drdee [12:51:58] morning drdee [12:52:10] morning guys [12:52:12] did you guys watch the debates or is that not internationally cool? [12:52:21] i certainly did [12:52:40] what did you think? [12:52:42] i saw this cool online version where third party candidates answered the same questions: www.democracynow.org [12:53:33] well, my bias is that I'm the chair of the Justice party here in Pennsylvania and I've been supporting Rocky Anderson. I thought Romney did a better job of looking "presidential" whatever people mean when they say that [12:54:43] i wished obama was more aggressive and less on the defense, IMHO [12:55:08] but none of them covered the drone attacks that killed less than 10% militants and innocent civilians as the rest, the free pass to drill the arctic that was just given to Shell, the National Defense Authorization Act, etc. [12:55:32] well the drone stuff is here to stay,regardless of who wins [12:55:41] all those things are, yep :( [12:56:01] i assume that next debate will focus on foreign policy and social issues [12:56:13] that's why they both know not to talk about it. Which is silly. [12:56:14] The Australian Liberal oriented paper declared Romoney the winner. [12:56:40] well the biggest problem with NDAA is domestic - indefinite holding of American Citizens! [12:57:01] welcome to the post 9/11 world :[ [12:57:21] yea, welcome to V for Vendetta :) [12:57:26] :D [12:57:46] so about the report card report.... [12:57:50] yep [12:58:27] can you add a caption to the target charts saying that we are not updating them as we are refining our prediction model or something like that [12:59:08] hm, let's see where it could go [12:59:38] oh ok, like "Aug 2012: Editors in Commons are now included in the overall total, and users who share a name across multiple projects are counted as the same user" in the one before it? [12:59:44] yes [13:00:04] btw, dsc told me why July 2012 was hard coded - it's not, we just haven't pushed new code in a long time [13:00:17] unfortunately we can't push for a while because we're mid transition/big change [13:00:29] k, I'm looking around to see where that caption's set [13:00:41] don't we have a master branch with prod code and a develop branch :D :D ? [13:00:46] average_drifter: let's build some debian packages! [13:00:57] drdee: yes [13:00:59] drdee: but first [13:01:46] drdee: https://gerrit.wikimedia.org/r/#/c/26567/ [13:01:49] drdee: please review [13:02:36] done [13:03:27] drdee: ok, let me just look at the few items left like moving some stuff to utils.c [13:03:32] drdee: and also separating matching [13:03:39] k [13:05:36] drdee, yes, but at this point they're so separate that I wouldn't know what commits to cherry pick onto prod [13:05:49] k, found the stuff looks like it's just set in json so I'm changing it [13:06:54] "We are refining our prediction model therefore we are not currently updating this data"? [13:06:56] k [13:11:13] 1 sec [13:19:10] drdee, I got all the files up, let me know the final copy :) [13:19:19] it's okay [13:28:35] drdee done http://reportcard.wmflabs.org/ [13:29:52] looks good, can you also add it to the mobile target chart? [13:30:34] same message? [13:30:56] yes [13:34:31] k drdee, done. 
This is funny, it's in like 20 places [13:34:35] well, 8 [13:35:00] these kinds of metadata things need to have a single canonical place [13:38:12] I think these are editable from the graphs themselves but I'm not sure how that'd work with commits [13:47:56] morning ottomata [13:48:39] morning [13:49:05] it seems that the hadoop job you started is having some issues [13:50:34] wah wahhh [13:53:27] too many mappers are launched [13:53:33] about 7000 in total [13:53:42] while a node should launch between 10-100 mappers [13:53:52] one way of reducing the number of mappers [13:54:21] is to increase the HDFS block size, right now it is 64mb but i think we should increase it to 256mb [13:54:30] not sure if you can do that on a live HDFS partition [13:54:36] or whether you have to create a new one [13:54:45] the job has been running for 15 hours now [13:54:51] and seems to be stuck [13:54:56] so i am digging through the logs [13:55:39] and could you give me read access to /user/otto ? [13:55:53] it is per file [13:55:56] the block size [13:56:16] and can you enable the hadoop metrics server on port 8042 [13:56:16] oh yeah sure, you also should have sudo on the analytics machines [13:56:20] so you can go sudo -u otto ... [13:56:24] ughh [13:56:30] block size is not per file [13:56:33] no? [13:56:51] it's an HDFS property [13:57:21] http://hadoop.apache.org/docs/r0.18.3/hdfs_design.html#Data+Replication [14:00:32] 7000 maps? [14:00:36] i see total maps: 4720 [14:00:37] so it's both a global and local property [14:00:43] and i've run jobs with that many maps before [14:00:53] aye, the global is the default [14:00:55] but generally it is set as a global property [14:01:03] right but it's too many mappers [14:01:09] it's bad for performance [14:01:12] there are 0 maps running right now [14:01:19] you are probably right [14:01:20] but [14:01:25] as each mapper does not get enough workload [14:01:34] i think the problem is that I did not manually set the parallelism as I did for the other jobs [14:01:38] so there are way too many reducers [14:01:39] each mapper runs for 5 seconds on average [14:01:49] especially for this, since there are only 3 different keys [14:02:00] but the number of reducers is independent of the number of mappers [14:02:03] yes [14:02:11] certainly, but the mappers are not stuck, it's the reducers [14:02:14] http://analytics1001.wikimedia.org:8088/proxy/application_1349195921521_0030/mapreduce/job/job_1349195921521_0030 [14:02:23] right, i think there are multiple issues [14:02:28] one of them is too many mappers [14:02:29] there are 54 currently running, probably waiting for more data from mappers or something [14:02:33] but no mappers running [14:02:41] gonna kill the job [14:02:47] it should take no more than 50 minutes total [14:04:39] can you wait before killing it [14:04:45] so i can grab the logs? [14:05:37] oh, I did already [14:05:43] logs disappear if I kill it? [14:06:11] in the web console i think it does [14:06:13] not locally [14:06:20] could be wrong though [14:07:00] you started a new job? [14:07:01] drdee: I've used extern for stuff used by both match.c and udp-filter.c [14:07:17] k [14:07:24] drdee, re hadoop metrics server [14:07:31] yes [14:07:51] what is that? is that a JSON interface thing? [14:08:00] REST interface? 
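A note on the block-size exchange above: both sides are right — the block size is recorded per file when the file is written, and the value in hdfs-site.xml only sets the default for new writes, so a live HDFS does not have to be recreated, but files already in HDFS keep their 64 MB blocks until they are rewritten. A minimal sketch (the log name and target path are made up for illustration; newer Hadoop releases spell the property dfs.blocksize):

    # cluster-wide default for new files, in hdfs-site.xml:
    #   <property><name>dfs.block.size</name><value>268435456</value></property>   (256 MB)
    # per-file override at write time, e.g. while loading a day of logs:
    hadoop fs -D dfs.block.size=268435456 -put bannerImpressions-20121003.log /user/otto/banner/
    # files already in HDFS keep their old block size until copied/rewritten (e.g. with distcp)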
[14:08:06] it's a link in the hadoop cluster [14:08:22] and if i click it it doesn't work, so i guess it's just another webconsole [14:08:56] so the job consists of 4719 maps (instead of 7000, i was exaggerating) [14:09:25] and we have 7 nodes, so that's about 675 mappers per node [14:11:08] it looks much better! [14:12:00] yeah, i'm not sure if this is right, but I did 3 reduces because there are 3 keys [14:12:18] and, i think I read somewhere that each node can only run 2 reducers at a time anyway [14:12:35] so you might as well not set more than 2* number of nodes for reducers [14:13:46] drdee: please review https://gerrit.wikimedia.org/r/26622 [14:13:51] according to cloudera "The number of reducers is best set to be the number of reduce slots in the cluster " [14:14:05] ahahh what are 'reduce slots'? [14:18:26] drdee, here's what I got: [14:18:27] 200 655438604 [14:18:27] 404 137 [14:18:39] mmmmmm [14:18:49] right [14:18:50] that is 2 days of 404 logs [14:18:54] sorry no [14:18:56] of ummmmmm [14:19:01] banner logs? [14:19:03] banner logs [14:19:15] yeah, from the 1:1 logs [14:19:20] 20120930-20121001 [14:19:21] yeah, so the issue is with BannerController [14:19:37] lemme gist the script and results for reference [14:21:18] maybe I soudln't filter [14:21:21] and just sum all of the statuses? [14:22:08] https://gist.github.com/3833806 [14:23:01] average_drifter: review is complete: https://gerrit.wikimedia.org/r/#/c/26622/ [14:23:06] drdee: reading [14:23:52] ottomata; let's have Jeff_Green a look at this [14:25:43] morning Jeff_Green [14:25:57] morning! [14:26:01] ottomata, just finished running a script counting 404's on the 1: 1 banner logs [14:26:08] https://gist.github.com/3833806 [14:26:13] basically there are no 404 errors [14:26:53] how can that be? [14:27:32] wait a minute . . . [14:27:33] well, bannerImpressions doesn't have the BannerController stuff in it, right? [14:27:50] ok back up a sec [14:28:18] we have 404's collected for a ~10(?) hour period yesterday [14:28:33] and we ran tests in italy yesterday which overlapped somewhat with that period [14:28:48] we have 1:1 banner logs for the test [14:29:13] so the best we can do afaik is to count banners from one log, and 404's from the other, for the period of overlap [14:29:27] well, yesterday [14:29:52] you guys had me count the URIs that matched 'BannerController' from the 404.log that you created [14:30:04] ya [14:30:11] then, diederik had me load in the 1:1 bannerImpressions logs for 09-30 and 10-01 [14:30:19] which do not match for 'BannerController', afaik [14:30:34] and then check it for 404s [14:30:42] but, since bannerImpressions.log [14:30:42] k [14:30:51] does not filter for 'BannerController' in the uri [14:30:57] they are different data sets, no? [14:31:06] so now we need to load the banner impressions for yesterday so we can do the overlap [14:31:12] ja [14:31:16] ? [14:31:41] step 1: load yesterday's bannerImpressions-1:1 log [14:31:44] you ran the banner impression for 09-30 and 10-01 [14:31:48] we can do that, with the caveat that I turned off the 1:1 bannerImpressions log for an hour or two [14:32:04] step 2: figure out the period of overlap in the 404 vs bannerImpressions logs [14:32:06] right, has something changed since those dates? [14:32:12] okay, let's start fresh [14:32:17] how about a new filter [14:32:18] yeah, i'm pretty confused [14:32:23] what is BannerController? [14:32:30] that's good, but we have the data! 
[14:32:41] that captures both BannerController and the B12 campaign [14:32:49] run that filter today for a couple of hours [14:32:51] load data in kraken [14:32:54] run pig script [14:32:56] BannerController was a chunk of javascript plus a geo-specific list of possible banners [14:33:04] BannerController --> 404.log [14:33:19] B12 Campaign --> 1:1 banner impressions.log [14:33:34] yeah. they deprecated BannerController in August or something [14:33:44] drdee exactly [14:34:05] ottomata: so BannerController was causing the 404's [14:34:35] yeah. specifically the problem is that we have cached wikipedia pages that still link to the deprecated BannerController URL [14:34:38] but we have to figure out the extent to which BannerController decreases the number of banner pageviews [14:34:52] exactly [14:35:12] so the problem now is that the data is in different files with little time overlap [14:35:20] so that's why i said let's create a new filter [14:35:25] and capture both aspects [14:35:27] even an hour or two of overlap should give us a pretty good idea [14:35:27] in the same file [14:35:33] but I agree going forward [14:35:41] if we get that filter running now [14:35:55] then in 2 or 3 hours from now we can load the data in kraken [14:36:05] and we don't have to do complicated dataset merging stuff [14:36:09] for example, try the window between 10AM-12PM EDT yesterday [14:36:50] I don't understand why it's complicated? It's just two separate hit counts for a 2 hour period [14:37:16] well because we also turned off bannerimpressions.log for 2 hours yesterday [14:37:24] right when the 404 filter was running [14:37:36] but that was toward the end of the day (EDT) [14:38:06] okay: let's just check in both files if we have observations for the time period that Jeff mentions [14:38:14] if that's the case then work with that data [14:38:20] if not then have a new filter [14:38:25] sounds good? [14:38:25] I'm not sure when the test began/ended but we should have clean log overlap from when I turned on 404 logging (~5AM EDT) to when you first turned off 1:1 banner logging [14:38:39] if we just do a new filter, then I don't have to write the pig logic to filter on timestamp [14:38:40] :) [14:38:42] just sayin [14:38:52] yeah, that's also a good point [14:39:01] ottomata: hahahah. I'm gonna mock you about that now. [14:39:05] and we can be sure that the data is from the same time [14:39:06] hahaha [14:39:08] aw man! [14:39:18] i should learn how to do it, but it is a bit tedious, heh [14:39:39] you better haxor your haxoring skills b/c I can assure you fundraising is going to make you work harder than this! :-P [14:39:40] even though Jeff is technically right, i think the easy way is running a new filter for 2 hours [14:39:52] naw man! [14:39:56] FR people gotta learn how to do this! [14:40:05] +1 [14:40:06] i'm not the pig scripter [14:40:12] drdee: I'm not sure if we're running a test now [14:40:19] i'm just using this as an excuse to learn what is needed :) [14:40:29] we are running the 1:1 banner impression stuff [14:40:32] right now [14:40:37] * Jeff_Green seeks carrot and stick [14:40:59] so let's update the filter stuff on oxygen, [14:41:12] people have promised me a lot of beers to get stuff done. i wonder if I can transfer the IOUs. 
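For reference, the "just grep the files on disk" route Jeff describes can be sketched roughly like this — the file names, the timestamp layout, and the 14:00-16:00 UTC window (10AM-12PM EDT) are illustrative assumptions, not the actual paths on oxygen:

    # 404s for the deprecated BannerController URL during the overlap window
    grep '2012-10-03T1[4-5]:' 404.log | grep -c 'BannerController'
    # 1:1 banner impressions for the B12 campaign in the same window
    grep '2012-10-03T1[4-5]:' bannerImpressions.log | grep -c 'banner=B12'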
[14:41:12] maybe I'll learn what I can, and when we get Kraken to a state of wanting to let more people into it, I will run a little workshop with FR on how to do this stuff [14:41:12] just add BannerController to the path of the 1:1 banner impression filter [14:41:26] drdee: please review [14:41:32] ok [14:41:36] zat ok with you Jeff? [14:41:53] a german accent ? [14:42:02] drdee: I'm not sure we're even running a banner test at the moment, and that gets us our answers in a few hours [14:42:19] whereas I could grep the files on disk in 10 minutes at most even without kraken [14:42:24] ok okok [14:42:45] then, can you figure out the times that I shoudl filter for? [14:42:47] from the logs? [14:42:50] ooh carrot: I'm gonna use O2 to grep and induce packet loss :-P [14:42:57] that is a stick! [14:43:03] oh you're right [14:43:12] and i'll figure out how to write pig timestamp filter [14:43:30] 8AM EDT to 12PM EDT [14:43:50] on 10-03? [14:43:56] yup [14:43:59] ok [14:44:01] average_drifter: merged [14:44:31] drdee: :) [14:44:38] drdee: now we can jump on the debianization [14:45:10] let's use the debianization script from webstatscollector [14:45:14] as a starting point [14:45:36] there are some warnings left when compiling: [14:45:38] src/match.c: In function ‘match_domain’: [14:45:39] src/match.c:166: warning: implicit declaration of function ‘extract_domain’ [14:45:40] src/match.c:166: warning: assignment makes pointer from integer without a cast [14:45:41] src/match.c: In function ‘match_http_status’: [14:45:42] src/match.c:217: warning: implicit declaration of function ‘extract_status’ [14:45:43] src/match.c:217: warning: assignment makes pointer from integer without a cast [14:46:03] I will have a look at these now [14:49:04] question about udp-filter . . . preferred way to match two possible --path substrings, along the lines of /(BannerLoade&banner=B12|SomethingElse)/ ? [14:52:30] just: [14:52:41] BannerLoader&banner=B12,SomethingElse [14:54:50] ok [14:55:09] what if, in theory, I needed a comma (I don't)? escape it? [14:55:48] there is also a -r option [14:55:48] no you can't match a comma right now [14:55:52] that turns your arg into a regexp [14:56:06] that's also true [14:56:30] but AFAIK, comma's in urls are url encoded anyways [14:56:50] cool. non-issue--I was just curious [14:57:17] drdee: ok sorry for the small commit, this fixes those warnings https://gerrit.wikimedia.org/r/26629 [14:58:00] merged [14:58:35] average_drifter: move to labs? [14:58:36] drdee: shall we use the same VERSION generating mechanism we have in debianize.sh ? [14:58:39] drdee: moving to labs [14:58:41] yes [14:58:49] wait [14:58:55] udp-filter is slight more sane [14:59:11] there is one canonical place for the VERSION and that's in udp-filter.c [14:59:26] but that should be updated automatically [14:59:47] drdee: yes, I will modify that so it can be picked up and modified by the Perl one-liners in debianize.sh [15:00:15] k [15:00:55] drdee: user@garage:~/wikistats/udp-filters$ git describe | awk -F'-g[0-9a-fA-F]+' '{print $1}' | sed -e 's/\-/./g' [15:00:59] fatal: No names found, cannot describe anything. [15:01:11] drdee: dunno why git describe says "No names found, cannot describe anything" [15:01:27] uhmmmm because a tag is missing [15:01:45] drdee: oh alright, should I make a tag , what should the tag name be ? 
[15:02:00] so we should retroactively add tags to udp-filter based on version numbers :) [15:02:18] use the same tag convention as in webstatscollector [15:02:34] drdee: ok, so I should go back to the last commit just before I started working on it [15:02:35] i think it's just V0.1 or maybe even 0.1 [15:02:38] drdee: and do a tag there right ? [15:03:03] slightly different [15:03:20] find all the commits where i changed the version number [15:03:29] ok [15:03:32] and tag those commits with the version number [15:03:40] so we can easily navigate through history [15:03:44] alright [15:04:22] i have a quenching thirst for a coffee [15:04:51] and one more wanting : [15:04:56] i mean warning :D [15:04:57] src/udp-filter.c: In function ‘parse’: [15:04:57] src/udp-filter.c:761: warning: implicit declaration of function ‘determine_num_obs’ [15:18:44] brb [15:25:41] milimetric: do you have ideas for using git flow and tags? [15:27:54] tags? [15:28:15] drdee git flow tags for you wherever it makes sense (like finishing releases) [15:28:24] oh cool [15:28:40] perfect, i was exactly wondering about that [15:29:10] yeah, without gerrit it eliminates basically all git-fu and limits you to start / finish / publish, commit, and merge resolution [15:29:24] you can even do stuff like git flow feature rebase blah [15:39:06] so what's your verdict? do gerrit and git flow coexist or do they hate each other [15:39:44] it depends on what the gerrit policy is [15:40:08] so git flow tries to keep stuff out of master until a release is ready - and that's sensible [15:40:55] that clashes with the current goal of reviewing every commit as we go (and therefore merging it into master from develop / feature / etc.) [15:42:16] but my verdict is we can still do it - our local repositories will be sensible and I see no problem with git flow / git review every commit [15:42:35] if we want to git review only at the end of the release branch, then we have to git rebase and squash it all into one commit. [15:43:30] the squashing is bad imho, loss of history and too much to review [15:48:22] definitely [15:49:04] i think git-review as you go is a better option. Then when you start your release branch everything's reviewed and the review could check the "production-readiness" of it all [15:49:54] the only other option is to get gerrit to commit to the develop branch - that would be a more natural fit too [15:51:28] you mean that the develop branch is reviewed and that you then can pull once in a while directly from develop into master? [15:53:04] ahhh doh [15:53:11] the 10-03 logs are in the 10-04 archive file [15:58:56] drdee, no i mean we change gerrit to merge into develop if the changes are accepted [15:59:06] and then continue to use git flow the way it's meant to be used [15:59:40] i think that's possible, you just have to update the .gitreview file and indicate a different remote branch [15:59:48] yep [15:59:55] i read about people doing that [16:00:20] it seems a better fit if we want to actually review every commit [16:00:35] because every commit isn't necessarily supposed to go into master [16:11:53] brb lunch [16:12:00] hey is the metrics meeting today? [16:17:41] ottomata: yes [16:18:07] mk [16:23:48] drdee, check this out! [16:23:48] http://docopt.org/ [16:24:14] generates a command line interface from a help message string [16:24:16] you write the help string [16:24:22] and the parser is generated from that [16:24:24] so cooooool! 
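Picking up the tagging thread from above: git describe was failing simply because the repository had no annotated tags yet, so backfilling tags on the commits that bumped the version is enough to make the debianize.sh one-liner work. A sketch — the tag names follow the webstatscollector-style convention mentioned above, and the commit placeholders and remote name are hypothetical:

    # tag each version-bump commit (annotated tags, since plain "git describe" ignores lightweight ones)
    git tag -a 0.1 -m "udp-filter 0.1" <commit-that-set-0.1>
    git tag -a 0.2 -m "udp-filter 0.2" <commit-that-set-0.2>
    git push --tags gerrit        # assuming the Gerrit remote is named "gerrit"
    git describe                  # now yields e.g. 0.2-5-gabc1234 instead of "No names found"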
[16:24:45] a great rudimentary tool for debugging C code [16:25:05] fprintf(stderr," at line %d\n",__LINE__); [16:25:11] :) [16:25:31] otto: that rules [16:26:14] that is very coooool indeed [16:26:20] the C port is coming :( [16:34:03] drdee: let's add BannerController to the banner impression filter [16:34:10] ok [16:35:28] i can just do it I guess [16:40:25] yup [16:50:01] goood morning [16:50:08] i just talked with herr moeller. all is well for the meeting. [16:51:05] he particularly complimented milimetric on his fine work updating the reportcard [16:51:12] Jeff_Green + drdee: [16:51:13] https://gist.github.com/3834876 [16:51:36] no way [16:51:42] that can't be!? [16:51:45] waay [16:51:46] why? [16:51:50] this is bannerImpressions.log [16:51:51] right? [16:51:55] not BannerController [16:52:00] oh! ok. [16:52:23] so I should do the same for 404.log and filter for BannerController? [16:52:27] or no [16:52:28] that's pretty slick, ottomata [16:52:28] hmmm [16:52:41] ottomata: ya [16:52:48] well, hmm [16:52:56] but we know that 404.log is all 404s, right? [16:53:15] you just want to know the # of 404s for BannerController in that time period? [16:53:15] it's all 404s, not just BannerController 404s [16:53:19] thanks dschoon. I'm glad my vim -o `grep -ril "July 2012" *` fu was appreciated [16:53:20] yes, but for all 404's [16:53:20] righto [16:53:48] reminder: metrics meeting is at 10:30 [16:53:51] for my next trick, I hope to actually update the website as opposed to the generated code, lol [16:53:55] if we're all here, shall we start scrum early? [16:53:59] sure [16:53:59] sure [16:54:10] milimetric: i mean, let's not get overly ambitious ;) [16:54:27] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 [16:54:37] drdee in case you didn't see ^^ [17:05:41] i love these gists you guys. Is it weird that I'm attracted to large numbers? [17:06:14] no, that's a requirement to be part of the analytics team :) [17:06:58] drdee, Jeff_Green: [17:06:58] https://gist.github.com/3834876 [17:07:32] wow, 10% [17:07:47] i'm still not clear on what these numbers mean or what you are trying to show [17:08:00] OH [17:08:01] wait [17:08:02] sorry [17:08:04] do not look at that [17:08:08] I did not filter for BannerController doh [17:08:10] sorry [17:08:13] oh [17:12:51] ok, here we go: [17:12:53] https://gist.github.com/3834876 [17:12:55] Jeff_Green ^ [17:13:25] 1% [17:13:28] that's not very much... [17:15:04] yeah, so what are you comparing? i'm just curious [17:15:13] you ran a test during your 404 collection that did what? [17:15:19] oh hm [17:15:20] so [17:15:22] ottomata, can you run a pig script against the 404 log file and count the number of times you find B12 in the URL? [17:15:40] you think the BannerController 404s are from cached urls [17:15:47] that should be hitting your B12 stuff instead? [17:16:13] i checked with binasher yesterday, and those BannerController url's are not cached in squid [17:16:14] essentially, yeah [17:16:15] so you were hoping that your dip in B12 requests would be accounted for by the cached URLs pointing at BannerController hits? [17:16:29] B12 [17:16:29] yes [17:16:30] in 404 [17:16:33] ok [17:16:34] but surely we have logfiles from nonsquids [17:16:38] from the apaches? [17:16:43] not from apaches [17:16:52] its more than squids [17:17:09] we call it squids but it is varnish / squid / nginx [17:17:12] squids, varnich, nginx, etc. 
[17:17:14] but not apach [17:17:15] e [17:17:22] since that woudl duplicate stuff anyway [17:17:36] the request woudl be log from frontend cache or proxy, and that is good enough [17:17:51] no requests go directly to apaches [17:17:57] so logging from apaches too would duplicate [17:18:58] drdee, you want that number in the same time span? [17:19:11] yes let's do that [17:19:36] ottomata: did you know that you can put pictures too in gists ? :) [17:19:57] no! [17:20:01] ottomata: I mentioned that because I saw your gist and was wondering how the charts might look like after running the commands you did [17:20:20] ottomata: there is a clone url above inside the gist, if you clone it it's just a regular git repo, and you can add files to it like images for example [17:20:25] ottomata: and they will appear in the gist :) [17:20:35] limn needs support to load external data urls :D [17:23:04] ottomata in addition,, is it possible to count the number of banner impressions from ip6 addresses using pig? [17:24:41] ah, the proxy [17:24:42] true [17:24:57] hm, should be possible [17:25:01] limn does! [17:25:03] or should. [17:25:06] but soon [17:26:58] drdee [17:27:03] 32 [17:27:15] not 42? [17:27:17] 32 banner=B12 404s in that time period [17:27:17] well well well [17:27:36] now that we've got hits+misses in the same log I hacked a realtime reporter [17:27:55] ...drums please… [17:27:56] just tailing and tallying counts for 1 minute intervals [17:28:00] 2885:20 [17:28:22] 3573:20 [17:28:29] banners:404s of course [17:29:01] :20? [17:29:57] that's a coincidence [17:30:03] 3473:12 [17:30:28] so i don't know exactly what BannerController does, does it inject/load the B12 banner? [17:30:49] because if that's the case then obviously you would see few B12 404's [17:31:37] Jefff_Green so what does BannerController do? [17:31:50] no, i mean, i'm not sure what your numevers are [17:31:52] numbers* [17:31:54] i'm told it was a chunk of javascript, a list of possible banners URLs, and the browser used it to select and fetch a banner [17:31:54] ?:? [17:32:10] banners served : 404s served [17:32:13] ah, k [17:32:22] in each minute [17:32:24] yeah [17:32:27] aye [17:32:34] validates your numbers imo [17:32:57] especially since mwalker went cowboy last night and tried to purge 1.7m stale docs [17:33:14] those nubmers include BannerController requests? [17:33:21] B12 + BannerController? [17:34:07] /=B12/ : 404 and /BannerController/ [17:35:39] ah i see [17:35:40] ok yeah [17:35:46] man webex [17:35:46] grrrr [17:37:12] can't get in to meeting [17:37:13] oh well [17:39:36] me neither [17:39:39] what's going on? [17:39:45] the plugin doesn't load [17:43:48] drdee: please review https://gerrit.wikimedia.org/r/26644 [17:43:57] drdee: this fixes the referer test [17:44:06] thanks [17:45:15] done [17:51:12] if it helps, this works on windows pretty well. So drdee if you still have your VM trial, you can try chrome in that [17:51:18] (the webex) [18:03:03] RACING CAMEL [18:03:06] I WANT A RACING CAMEL [18:03:09] WTF [18:04:00] especially with a spider monkey jockey [18:04:05] or did i misunderstand that [18:04:18] maybe it was his friend? it was definitely a spider monkey [18:27:07] does ./udp-filter -i 0.0.0.0/0 match all ips ? [18:27:14] it looks like something that would match any ip [18:27:20] is that accurate ? [18:27:53] i think it should match all ip's, ottomata can you shine some light on cidr ranges? [18:28:13] or maybe just 0.0.0.0 [18:28:20] i think that's the case [18:28:21] i think /0? 
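As an aside, Jeff's tail-and-tally reporter above can be approximated with a one-liner; the combined log name, and the assumption that BannerController lines are the 404s, come from the discussion rather than the real setup (gawk is needed for systime(), and the tally only ticks when a new line arrives):

    tail -f bannerAndController.log | gawk '
      BEGIN { last = systime() }
      /banner=B12/         { hits++ }     # banners served
      /BannerController/   { misses++ }   # requests to the deprecated URL (the 404s)
      systime() - last >= 60 { printf "%d:%d\n", hits, misses; hits = misses = 0; last = systime() }'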
[18:28:31] ottomata: what does /0 mean ? [18:28:43] sorry, I'm not an expert in networking and stuff [18:28:47] so that's why I'm asking [18:28:47] 1 address IIRC [18:31:24] hey do you guys know how to interpret what udp2log records as squid's action code? is TCP_MISS a straight TCP_MISS or is there something else to that since we have the dual-squid layer config? [18:31:36] naw, the /x refers to the number of bits in the netmask [18:31:38] so [18:31:56] how many bits are part of the network, vs how many are avaialbe for assignment [18:32:34] 32 - x == number of bits avaiable for the IP range [18:32:35] so [18:32:36] 0 [18:32:38] would mean all bits [18:32:53] however, i'm not sure if 0.0.0.0/0 is a valid cidr range [18:32:59] semantically it is, but not sure [18:33:01] trying to check [18:33:12] I think it's valid [18:33:14] ottomata: it appears in the tests, that's why I ask [18:33:33] http://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing#CIDR_blocks [18:33:51] see the IPv4 CIDR blocks chart for a quick visual [18:34:36] ottomata: so the second test checks how many of these ips 208.80.152.111 , 208.80.152.222 , 216.38.130.161 , 2002:6981:27e2::6981:27e2 , 2002:6981:27e2::6981:15e2 , 127.0.0.1 are in the range 0.0.0.0/0 [18:34:53] ottomata: and the test actually wants that 4 of these ips to be in the range 0.0.0.0/0 [18:35:29] all but the ipv6 addresses are in that range [18:35:45] Jeff_Green: I fully agree [18:35:52] what test? [18:36:30] tests that are part of upd-filter [18:36:34] yes [18:36:43] oh you are adding tests! [18:36:44] nice [18:37:27] those tests always have been part of upd-filter :D [18:37:39] there are in the run.sh script [18:38:06] you had cidr range tests? [18:38:25] ottomata: the cidr range test I described above was written before I came on the project [18:38:34] ottomata: right now I am trying to understand how the test works [18:38:46] ottomata: and why I am getting 2 instead of 4 ips in 0.0.0.0/0 range [18:39:08] wondering who wrote that test? i don't htink I did, and I just added cidr addressing to udp-filter like a month ago [18:39:22] yeah, good q, i'm suspicious of 0.0.0.0/0 [18:39:31] i betcha for some reason somethign doesn't like it [18:39:56] btw [18:39:59] check out cidrcalc [18:40:05] it is installed as part of libcidr [18:40:10] ottomata: so 0.0.0.0/0 matches just the first and last of the list of ips I mentioned above [18:40:17] cli tool for printing details of cidr ranges [18:40:22] otto@stat1:~$ cidrcalc 0.0.0.0/0 [18:40:22] Address: 0.0.0.0 [18:40:22] Netmask: 0.0.0.0 (/0) [18:40:22] Wildcard: 255.255.255.255 [18:40:22] Network: 0.0.0.0/0 [18:40:22] Broadcast: 255.255.255.255 [18:40:23] Hosts: 0.0.0.1 - 255.255.255.254 [18:40:24] NumHosts: 4,294,967,294 [18:40:28] I will write the list again just for verbosity :) [18:40:28] looks good [18:40:28] 208.80.152.111 , 208.80.152.222 , 216.38.130.161 , 2002:6981:27e2::6981:27e2 , 2002:6981:27e2::6981:15e2 , 127.0.0.1 [18:40:40] it doesn't match .222 and .161 [18:40:41] ? [18:40:53] ottomata: no, it does not, not sure why [18:42:22] hmm, i guess I wrote these tests? 
[18:42:36] nope, was diederik [18:43:43] user@garage:~/wikistats/udp-filters$ git blame run.sh -L 6,6 [18:43:43] cae3acdc (Diederik van Liere 2012-10-03 11:35:51 -0400 6) ip_filter2=$(cat example.log | ./udp-filter -i 0.0.0.0/0 | wc -l) [18:43:49] yes [18:43:53] ottomata: but the test is good [18:44:14] ottomata: I mean we agree that 4 should be the result of that [18:45:19] ja [18:48:18] I just compiiled and ran that [18:48:19] I get 4 lines [18:48:24] ottomata: you get 4 ? [18:48:26] cat example.log | udp-filter ./udp-filter -i 0.0.0.0/0 | wc -l [18:48:26] 4 [18:48:30] ottomata: please tell me your branch and commit hash [18:48:47] (master)[6839b1b] [18:48:57] average_drifter: then you introduced a bug with the refactorign :) [18:49:08] i just pulled [18:49:17] , recompiled [18:49:18] and ran that [18:49:45] ottomata: wait, you pulled the latest udp-filter [18:49:53] ottomata: and I have the same, but you get 4 and I get 2 [18:50:01] ottomata: what is the architecture of your machine ? just curious [18:50:36] x86 , amd64 ? [18:50:50] amd64 ubuntu precise [18:51:17] Hi, do you need any volunteers? [18:53:17] ottomata: I have an x86_64 machine available, let me just see what I get on that, I'm just curious.. [18:54:46] average_drifter: i get 4 on labs as well [18:59:10] hi louisdang [18:59:16] what are you interested in volunteering for? [18:59:26] my experience has been in analytics [18:59:35] can I help on Kraken? [18:59:56] I'm not sure but the people to talk to are dschoon and ottomata [19:00:10] they're the Kraken wranglers right now [19:00:25] louisdang: can you tell us more about yourself? [19:00:39] I'm a graduating student at the University of Washington [19:01:07] I worked at Research in Motion for a year on search analytics [19:01:14] on Blackberry devices [19:01:14] can you give us some url's of stuff that you have done? [19:02:09] My experience has been on mobile devices, I've never done open source so all I have is a resume [19:02:41] but I'm unemployed now with a lot of free time so I've been looking for things to do [19:03:11] what languages are you comfortable with [19:03:35] Java and C++. I've used PHP in a web programming course before. [19:05:00] I worked with SQL Server and was familiarizing my self with Hadoop before I left RIM [19:05:16] from what study are you graduating? [19:05:49] Mathematics, but I took several CS courses [19:06:06] so what would you like to do? [19:06:40] since my background is in analytics I thought I check here first. I'm up for anything analytics related [19:06:51] I specifically worked in text analytics [19:07:04] including query classification and spelling correction [19:07:17] using Java and JDBC [19:07:45] though I would like to develop skills in using noSQL databases [19:08:03] are you familiar with lucene? [19:08:04] in developing for noSQL databases [19:08:07] cat ~/vagrantbox1_ssh [19:08:09] damn [19:08:10] sorry [19:08:10] yes I have used it before [19:08:22] but only for stemming queries [19:09:55] do you know solr? [19:10:59] test [19:11:31] sorry, I got disconnected for a second. I'm on a spotty connection. [19:11:49] do you know solr? [19:12:22] no [19:13:07] I'm looking to learn more by volunteering [19:16:09] So the triggy part is that most of our projects are quite big so I am trying to think of something where you can make progress by volunteering only a few hours [19:16:31] I actually have a lot of free time. I'm only taking one class this quarter before I graduate. 
[19:17:24] So I can commit 20-40 hours per week [19:18:44] while I look for work [19:20:16] so how many hours can you volunteer in total? [19:22:38] louisdang ^^ [19:22:41] I would like to start with 20 hours per week and I should have time to volunteer up to the end of December [19:22:58] ok, we could give the following a shot: [19:23:16] are you familiar with Pig? [19:23:32] no [19:23:59] Pig is a high-level scripting language which you can use to write map/reduce jobs for hadoop [19:24:06] Ok [19:24:16] sounds like something I can learn [19:24:21] it would be totally awesome if you would help us write these pig scripts [19:24:24] this is kraken related [19:24:32] ok [19:24:36] agreed! [19:24:38] and it would be a huuuuuge help of us [19:25:12] so, the 1st thing to do is to go to labsconsole.wikimedia.org and get an account [19:25:26] I already requested one [19:25:34] and then we will help you setup a 1 node hadoop-local mode instance [19:25:39] ok [19:25:46] can I contact you by email? [19:25:52] just come to IRC [19:25:56] ok [19:25:58] yup, and you can also join the analytics mailing list, if you want to [19:26:03] or email me davnliere @ wikimedia dot org [19:26:15] davnliere => dvanliere [19:26:23] https://lists.wikimedia.org/mailman/listinfo/analytics [19:26:30] sure thing [19:26:31] also have a look at reportcard.wmflabs.org [19:26:43] that's the kind of metrics we are looking for [19:26:59] obviously we will explain much more later on [19:27:14] first, get your account and get your hadoop / pig instance running on labs [19:27:28] ok [19:28:23] let us know when you get a labs account, we'll see if we can add you to our project group and you can create an instance there [19:28:26] so are the scripts for automated reporting? [19:29:10] ja so, basically, right now that data is generated from a long and convoluted pipeline of perl scripts and excel spreadsheets by a guy that has been doing it for a long time [19:29:19] (well, the data that comes from web access logs) [19:29:28] I see [19:29:42] and it takes a very long time for him to process that data [19:29:52] we'd like to use kraken to do that much faster [19:29:53] yes. that is similar to how I did it at RIM [19:30:01] since we didn't have Hadoop yet [19:30:10] louisdang: can you send me your resume? [19:30:19] sure [19:31:49] sent [19:32:42] I made a dashboard using Excel and VBA [19:32:57] and used Java to process the data [19:33:32] unfortunately there's a law in Canada that only allowed RIM to keep me for a year before I returned to school [19:33:48] So I only started defining requirements for the Hadoop data collection/reporting [19:34:46] since we were only beginning to implement Hadoop [19:39:06] my email is dangl@uw.edu by the way, in case I get disconnected [19:47:00] so first step is to get labs access, once you've got that up and running: ping us! [19:47:44] alright I just have to wait for someone to pick up my request. I'll read up on Pig and Kraken in the mean while [19:48:01] yep. https://www.mediawiki.org/wiki/Analytics/Kraken [19:48:22] anything else I should learn? 
[19:48:43] i think that's a decent chunk to start with [19:48:44] :) [19:49:27] alright thanks for this opportunity and everything [19:50:42] yeah, read up on pig and hadoop for sure [19:50:52] you can probably spawn a VM of your own on the meantime and try to install hadoop locally [19:51:24] alright [19:52:14] ottomata: quick question: does the speed with which an ssh connection can send back the output of some commands which are run in serial, affect the speed overall of these commands ? [19:52:59] commands are run in serial, which means yes, the command sequence is affected by latency, but everything is sent at once [19:53:02] if i follow you [19:53:13] dschoon: you do [19:53:24] dschoon: so basically everything is capped by the speed with which it can send back the output [19:53:37] I don't know where build1 is but I'm pretty sure it's far far away from me [19:53:59] I'm in 46° 00' N and 25° 00' E [19:54:02] so if you ran `ssh server -- foo | bar | baz` it would only send one command (the stuff after the dashes) and only recieve one reply [19:54:16] dschoon: how about if I run commands with >/dev/null [19:54:20] dschoon: would that be faster ? [19:54:22] no. [19:54:30] but it will use less bandwidth. [19:54:30] 2>&1 > /dev/null ? [19:54:45] depends on what you mean by "faster" [19:54:50] it will return control to your shell sooner [19:55:10] dschoon: faster as in not letting each command in a bash script wait until the output gets across the ssh connection so it can continue with the next command inside the bash script [19:55:25] it doesn't wait for the output to get to you, no [19:55:32] it just pushes it into a buffer. [19:55:58] ok, then .. is build1 a VM ? [19:56:04] or physical machine ? [19:56:10] no idea :) [19:56:14] a VM, if it's labs [20:00:46] VM [20:01:05] yup [20:06:18] http://highscalability.com/blog/2012/10/4/linkedin-moved-from-rails-to-node-27-servers-cut-and-up-to-2.html [20:06:29] 20x faster, cut 27 servers [20:06:50] wow [20:07:55] For our inevitable rearchitecting and rewrite, we want to cache content aggressively, store templates client-side (with the ability to invalidate and update them from the server) and keep all state purely client side. [20:08:07] That sounds exactly like arguments I have made many times about app architecture. [20:08:17] Pure Client solutions are totally the future. [20:08:24] :) +1 [20:08:24] it's the BACK END that will become dumb! [20:09:24] dan and i are of one mind. [20:10:19] dude i didn't grok the full power of d3 scales [20:10:27] isn't it awesome? [20:10:34] it's a coordinate conversion [20:10:37] I'm like computing all this zoom stuff on my own - I could've just changed the domain on the scale!! [20:10:38] domain -> range [20:10:41] yep! [20:10:46] i think i might have said that :P [20:10:50] i didn't realize I can do that dynamically, I thought I'd have to redraw the graph [20:10:54] which is what I was doing before [20:11:05] it's also in my example stuff in the chart/type/d3 directory [20:11:07] for the lens [20:12:14] well, it's not perfect because I still have the unscaling problem I think, so I need some hybrid approach [20:12:41] do you? [20:12:42] why? 
[20:13:11] because you get the nice "transition" effect only if you manually scale the containers of the lines and circles [20:13:28] which makes you need to unscale the contents of those containers [20:13:37] paths have that built in property but circles have to be done manually [20:13:38] d3.transition, yo [20:13:49] yeh, i'm using that [20:13:56] that's what I'm saying [20:13:58] hm. [20:14:01] well [20:14:12] i am still trying to figure out where i left my headphones [20:14:19] and then you can walk me through the code [20:14:22] np [20:37:44] drdee: please review https://gerrit.wikimedia.org/r/26701 [20:39:38] done [20:41:16] ok 6/16 tests are passing [20:41:20] 10 more to go [20:42:50] probably they are all off due to the internal traffic filtering [20:48:48] http://testanything.org/wiki/index.php/Tap-functions [20:48:51] http://svn.solucorp.qc.ca/repos/solucorp/JTap/trunk/tap-functions [20:49:03] if we have continous integration for udp-filters [20:49:14] we can use the bash tap functions to run the tests [20:49:26] probably we get the output in TAP and we can feed that to some hudson or something like that [20:49:33] not sure if we do CI [20:49:39] do we ? [20:51:25] not worth it ATM [20:51:29] alright [21:53:39] drdee: cat example.log | ./udp-filter -d "([wiki])" -r | wc -l [21:53:43] drdee: is that a regex ? [21:53:53] drdee: if it's a regex, then the [wiki] is a character class right ? [21:54:02] /agree [21:54:12] drdee: I've fiddled with PCRE, I don't know much about regex.h [21:54:24] i think you are thinking too complicated [21:54:28] that used to work [21:54:32] and we didn't touch it [21:54:48] what is the problem? [21:54:48] i agree with average_drifter [21:54:53] that doesn't make sense as a regex [21:55:00] you'd need to escape the braces [21:55:08] "(\[wiki\])" [21:55:27] yeah I mean what dschoon is saying [21:55:35] provided you actually wanted to match the literal string "[wiki]" [21:55:35] drdee: so that's my question, isn't that supposed to be escaped ? [21:55:54] there was something odd with that but it's a while back [21:56:14] it has to be. [21:56:26] average_drifter: you can just change the regex in run.sh [21:56:26] otherwise it's trying to match one of any of the letters "w", "i", or "k" [21:56:33] and the double-i doesn't make ANY sense [21:56:37] OR it's not a regex. [21:57:03] drdee: if I leave it like it is, I get 2 matches. if I escape it , then I get 0 matches [21:57:16] and the 2 matches make sense or not? [21:58:41] drdee: if I do cat example.log | ./udp-filter -d "([wabcdefghijklmzxcqweqpo901471njkr2905jkhqwe])" -r [21:58:45] drdee: I get the same 2 matches [21:58:50] drdee: that's just because w is in them [21:58:51] okay so it's broken [21:58:57] no, the test is broken. [21:59:13] dschoon: yes because I don't know the syntax regex.h accepts [21:59:14] yes the test is broken [21:59:22] ah [21:59:43] i think regex.h has PCRE syntax compatibility IIRC [21:59:50] if I drop the square brackets it makes sense to me [22:00:05] dschoon , drdee what do you think if I drop the square brackets ? [22:00:23] then it makes sense because there are two en.wikipedia.org entries and those can be matched by /(wiki)/ [22:00:29] the idea is to match an arbitrary string in the domain [22:00:36] so just adjust the ragex accordingly [22:00:40] alright [22:06:36] sometimes using PCRE definitely makes me ragex. 
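To make the character-class point above concrete (using grep -E as a stand-in here, since the exact regex.h flavour udp-filter compiles with isn't shown in the log):

    echo en.wikipedia.org | grep -cE '([wiki])'   # 1 - matches: any single "w", "i" or "k" is enough
    echo www.example.org  | grep -cE '([wiki])'   # 1 - also matches, so the old test proved nothing
    echo www.example.org  | grep -cE '(wiki)'     # 0 - the literal substring "wiki" is required
    echo en.wikipedia.org | grep -cE '(wiki)'     # 1 - what the run.sh test actually wants to check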
[22:06:39] RAGE-EX [22:06:46] THE PATTERN MATCHER SMASHER [22:20:16] I have joe rogan podcast all the time on the background [22:20:20] that guy is funny as hell [22:35:49] question about http statuses [22:35:52] is -s 50 [22:35:58] supposed to match 50 and 504 as well ? [22:38:24] ok 11/11 [22:38:25] all green [22:38:32] echo bio tests :) [22:40:47] drdee: https://gerrit.wikimedia.org/r/26722 [22:40:49] drdee: please review [23:00:42] done [23:29:05] drdee: I just found the revisions in which the VERSION changed so I can know what to make tags out of [23:29:12] drdee: I just had to run this perl oneliner [23:29:14] drdee: perl -e 'sub diff_consec {return `git diff HEAD~$_[0] HEAD~$_[1] src/udp-filter.c 2>&1 | perl -ne "/^[+-][^+-]|fatal/&&print"` }; sub rel2sha1{return `git rev-parse HEAD~$_[0]`}; $i=0; while(1){$i++; $diff=diff_consec($i,$i+1); last if $diff =~ /fatal/; print rel2sha1($i) if $diff =~ /VERSION/ }; ' [23:29:30] wrote it in like 7m [23:42:16] is there any documentation for the planned system architecture?
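A closing aside on that version-hunting one-liner: git's pickaxe options can do the same search natively — a sketch, assuming the version string lives in src/udp-filter.c as discussed (adjust the pattern to however the string is actually declared):

    # commits whose diff adds or removes the string VERSION in udp-filter.c
    git log --oneline -S'VERSION' -- src/udp-filter.c
    # or match the changed lines against a regex instead of a fixed string
    git log --oneline -G'VERSION' -- src/udp-filter.c
    # then tag each of those commits with the version it introduced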