[00:01:28] drdee: please review [00:03:17] merged [00:03:59] drdee: yay ! :) [00:04:09] looks all good! [00:04:16] so now we can test tomorrow [00:04:28] ttyl [12:51:31] morning milimetric [12:51:39] morning average_drifter [12:51:51] morning drdee [12:51:58] morning drdee [12:52:10] morning guys [12:52:12] did you guys watch the debates or is that not internationally cool? [12:52:21] i certainly did [12:52:40] what did you think? [12:52:42] i saw this cool online version where third party candidates answered the same questions: www.democracynow.org [12:53:33] well, my bias is that I'm the chair of the Justice party here in Pennsylvania and I've been supporting Rocky Anderson. I thought Romney did a better job of looking "presidential" whatever people mean when they say that [12:54:43] i wished obama was more aggressive and less on the defense, IMHO [12:55:08] but none of them covered the drone attacks that killed less than 10% militants and innocent civilians as the rest, the free pass to drill the arctic that was just given to Shell, the National Defense Authorization Act, etc. [12:55:32] well the drone stuff is here to stay,regardless of who wins [12:55:41] all those things are, yep :( [12:56:01] i assume that next debate will focus on foreign policy and social issues [12:56:13] that's why they both know not to talk about it. Which is silly. [12:56:14] The Australian Liberal oriented paper declared Romoney the winner. [12:56:40] well the biggest problem with NDAA is domestic - indefinite holding of American Citizens! [12:57:01] welcome to the post 9/11 world :[ [12:57:21] yea, welcome to V for Vendetta :) [12:57:26] :D [12:57:46] so about the report card report.... [12:57:50] yep [12:58:27] can you add a caption to the target charts saying that we are not updating them as we are refining our prediction model or something like that [12:59:08] hm, let's see where it could go [12:59:38] oh ok, like "Aug 2012: Editors in Commons are now included in the overall total, and users who share a name across multiple projects are counted as the same user" in the one before it? [12:59:44] yes [13:00:04] btw, dsc told me why July 2012 was hard coded - it's not, we just haven't pushed new code in a long time [13:00:17] unfortunately we can't push for a while because we're mid transition/big change [13:00:29] k, I'm looking around to see where that caption's set [13:00:41] don't we have a master branch with prod code and a develop branch :D :D ? [13:00:46] average_drifter: let's build some debian packages! [13:00:57] drdee: yes [13:00:59] drdee: but first [13:01:46] drdee: https://gerrit.wikimedia.org/r/#/c/26567/ [13:01:49] drdee: please review [13:02:36] done [13:03:27] drdee: ok, let me just look at the few items left like moving some stuff to utils.c [13:03:32] drdee: and also separating matching [13:03:39] k [13:05:36] drdee, yes, but at this point they're so separate that I wouldn't know what commits to cherry pick onto prod [13:05:49] k, found the stuff looks like it's just set in json so I'm changing it [13:06:54] "We are refining our prediction model therefore we are not currently updating this data"? [13:06:56] k [13:11:13] 1 sec [13:19:10] drdee, I got all the files up, let me know the final copy :) [13:19:19] it's okay [13:28:35] drdee done http://reportcard.wmflabs.org/ [13:29:52] looks good, can you also add it to the mobile target chart? [13:30:34] same message? [13:30:56] yes [13:34:31] k drdee, done. 
This is funny, it's in like 20 places [13:34:35] well, 8 [13:35:00] these kinds of metadata things need to have a single canonical place [13:38:12] I think these are editable from the graphs themselves but I'm not sure how that'd work with commits [13:47:56] morning ottomata [13:48:39] morning [13:49:05] it seems that the hadoop job you started is having some issues [13:50:34] wah wahhh [13:53:27] too many mappers are launched [13:53:33] about 7000 in total [13:53:42] while a node should launch between 10-100 mappers [13:53:52] one way of reducing the number of mappers [13:54:21] is to increase the HDFS block size, right now it is 64mb but i think we should increase it to 256mb [13:54:30] not sure if you can do that on a live HDFS partition [13:54:36] or whether you have to create a new one [13:54:45] the job has been running for 15 hours now [13:54:51] and seems to be stuck [13:54:56] so i am digging through the logs [13:55:39] and could you give me read access to /user/otto ? [13:55:53] it is per file [13:55:56] the block size [13:56:16] and can you enable the hadoop metrics server on port 8042 [13:56:16] oh yeah sure, you also should have sudo on the analytics machines [13:56:20] so you can go sudo -u otto ... [13:56:24] ughh [13:56:30] block size is not per file [13:56:33] no? [13:56:51] it's an HDFS property [13:57:21] http://hadoop.apache.org/docs/r0.18.3/hdfs_design.html#Data+Replication [14:00:32] 7000 maps? [14:00:36] i see total maps: 4720 [14:00:37] so it's both a global and local property [14:00:43] and i've run jobs with that many maps before [14:00:53] aye, the global is the default [14:00:55] but generally it is set as a global property [14:01:03] right but it's too many mappers [14:01:09] it's bad for performance [14:01:12] there are 0 maps running right now [14:01:19] you are probably right [14:01:20] but [14:01:25] as each mapper does not get enough workload [14:01:34] i think the problem is that I did not manually set the parallelism as I did for the other jobs [14:01:38] so there are way too many reducers [14:01:39] each mapper runs for 5 seconds on average [14:01:49] especially for this, since there are only 3 different keys [14:02:00] but the number of reducers is independent of the number of mappers [14:02:03] yes [14:02:11] certainly, but the mappers are not stuck, it's the reducers [14:02:14] http://analytics1001.wikimedia.org:8088/proxy/application_1349195921521_0030/mapreduce/job/job_1349195921521_0030 [14:02:23] right, i think there are multiple issues [14:02:28] one of them is too many mappers [14:02:29] there are 54 currently running, probably waiting for more data from mappers or something [14:02:33] but no mappers running [14:02:41] gonna kill the job [14:02:47] it should take no more than 50 minutes total [14:04:39] can you wait before killing it [14:04:45] so i can grab the logs? [14:05:37] oh, I did already [14:05:43] logs disappear if I kill it? [14:06:11] in the web console i think it does [14:06:13] not locally [14:06:20] could be wrong though [14:07:00] you started a new job? [14:07:01] drdee: I've used extern for stuff used by both match.c and udp-filter.c [14:07:17] k [14:07:24] drdee, re hadoop metrics server [14:07:31] yes [14:07:51] what is that? is that a JSON interface thing? [14:08:00] REST interface? 
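A note on the block-size exchange above: both sides are right — the block size is recorded per file when the file is written, and the value in hdfs-site.xml only sets the default for new writes, so a live HDFS does not have to be recreated, but files already in HDFS keep their 64 MB blocks until they are rewritten. A minimal sketch (the log name and target path are made up for illustration; newer Hadoop releases spell the property dfs.blocksize):

    # cluster-wide default for new files, in hdfs-site.xml:
    #   <property><name>dfs.block.size</name><value>268435456</value></property>   (256 MB)
    # per-file override at write time, e.g. while loading a day of logs:
    hadoop fs -D dfs.block.size=268435456 -put bannerImpressions-20121003.log /user/otto/banner/
    # files already in HDFS keep their old block size until copied/rewritten (e.g. with distcp)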
[14:08:06] it's a link in the hadoop cluster [14:08:22] and if i click it it doesn't work, so i guess it's just another webconsole [14:08:56] so the job consists of 4719 maps (instead of 7000, i was exaggerating) [14:09:25] and we have 7 nodes, so that's about 675 mappers per node [14:11:08] it looks much better! [14:12:00] yeah, i'm not sure if this is right, but I did 3 reduces because there are 3 keys [14:12:18] and, i think I read somewhere that each node can only run 2 reducers at a time anyway [14:12:35] so you might as well not set more than 2* number of nodes for reducers [14:13:46] drdee: please review https://gerrit.wikimedia.org/r/26622 [14:13:51] according to cloudera "The number of reducers is best set to be the number of reduce slots in the cluster " [14:14:05] ahahh what are 'reduce slots'? [14:18:26] drdee, here's what I got: [14:18:27] 200 655438604 [14:18:27] 404 137 [14:18:39] mmmmmm [14:18:49] right [14:18:50] that is 2 days of 404 logs [14:18:54] sorry no [14:18:56] of ummmmmm [14:19:01] banner logs? [14:19:03] banner logs [14:19:15] yeah, from the 1:1 logs [14:19:20] 20120930-20121001 [14:19:21] yeah, so the issue is with BannerController [14:19:37] lemme gist the script and results for reference [14:21:18] maybe I soudln't filter [14:21:21] and just sum all of the statuses? [14:22:08] https://gist.github.com/3833806 [14:23:01] average_drifter: review is complete: https://gerrit.wikimedia.org/r/#/c/26622/ [14:23:06] drdee: reading [14:23:52] ottomata; let's have Jeff_Green a look at this [14:25:43] morning Jeff_Green [14:25:57] morning! [14:26:01] ottomata, just finished running a script counting 404's on the 1: 1 banner logs [14:26:08] https://gist.github.com/3833806 [14:26:13] basically there are no 404 errors [14:26:53] how can that be? [14:27:32] wait a minute . . . [14:27:33] well, bannerImpressions doesn't have the BannerController stuff in it, right? [14:27:50] ok back up a sec [14:28:18] we have 404's collected for a ~10(?) hour period yesterday [14:28:33] and we ran tests in italy yesterday which overlapped somewhat with that period [14:28:48] we have 1:1 banner logs for the test [14:29:13] so the best we can do afaik is to count banners from one log, and 404's from the other, for the period of overlap [14:29:27] well, yesterday [14:29:52] you guys had me count the URIs that matched 'BannerController' from the 404.log that you created [14:30:04] ya [14:30:11] then, diederik had me load in the 1:1 bannerImpressions logs for 09-30 and 10-01 [14:30:19] which do not match for 'BannerController', afaik [14:30:34] and then check it for 404s [14:30:42] but, since bannerImpressions.log [14:30:42] k [14:30:51] does not filter for 'BannerController' in the uri [14:30:57] they are different data sets, no? [14:31:06] so now we need to load the banner impressions for yesterday so we can do the overlap [14:31:12] ja [14:31:16] ? [14:31:41] step 1: load yesterday's bannerImpressions-1:1 log [14:31:44] you ran the banner impression for 09-30 and 10-01 [14:31:48] we can do that, with the caveat that I turned off the 1:1 bannerImpressions log for an hour or two [14:32:04] step 2: figure out the period of overlap in the 404 vs bannerImpressions logs [14:32:06] right, has something changed since those dates? [14:32:12] okay, let's start fresh [14:32:17] how about a new filter [14:32:18] yeah, i'm pretty confused [14:32:23] what is BannerController? [14:32:30] that's good, but we have the data! 
[14:32:41] that captures both BannerController and the B12 campaign [14:32:49] run that filter today for a couple of hours [14:32:51] load data in kraken [14:32:54] run pig script [14:32:56] BannerController was a chunk of javascript plus a geo-specific list of possible banners [14:33:04] BannerController --> 404.log [14:33:19] B12 Campaign --> 1:1 banner impressions.log [14:33:34] yeah. they deprecated BannerController in August or something [14:33:44] drdee exactly [14:34:05] ottomata: so BannerController was causing the 404's [14:34:35] yeah. specifically the problem is that we have cached wikipedia pages that still link to the deprecated BannerController URL [14:34:38] but we have to figure out the extent to which BannerController decreases the number of banner pageviews [14:34:52] exactly [14:35:12] so the problem now is that the data is in different files with little time overlap [14:35:20] so that's why i said let's create a new filter [14:35:25] and capture both aspects [14:35:27] even an hour or two of overlap should give us a pretty good idea [14:35:27] in the same file [14:35:33] but I agree going forward [14:35:41] if we get that filter running now [14:35:55] then in 2 or 3 hours from now we can load the data in kraken [14:36:05] and we don't have to do complicated dataset merging stuff [14:36:09] for example, try the window between 10AM-12PM EDT yesterday [14:36:50] I don't understand why it's complicated? It's just two separate hit counts for a 2 hour period [14:37:16] well because we also turned off bannerimpressions.log for 2 hours yesterday [14:37:24] right when the 404 filter was running [14:37:36] but that was toward the end of the day (EDT) [14:38:06] okay: let's just check in both files if we have observations for the time period that Jeff mentions [14:38:14] if that's the case then work with that data [14:38:20] if not then have a new filter [14:38:25] sounds good? [14:38:25] I'm not sure when the test began/ended but we should have clean log overlap from when I turned on 404 logging (~5AM EDT) to when you first turned off 1:1 banner logging [14:38:39] if we just do a new filter, then I don't have to write the pig logic to filter on timestamp [14:38:40] :) [14:38:42] just sayin [14:38:52] yeah, that's also a good point [14:39:01] ottomata: hahahah. I'm gonna mock you about that now. [14:39:05] and we can be sure that the data is from the same time [14:39:06] hahaha [14:39:08] aw man! [14:39:18] i should learn how to do it, but it is a bit tedious, heh [14:39:39] you better haxor your haxoring skills b/c I can assure you fundraising is going to make you work harder than this! :-P [14:39:40] even though Jeff is technically right, i think the easy way is running a new filter for 2 hours [14:39:52] naw man! [14:39:56] FR people gotta learn how to do this! [14:40:05] +1 [14:40:06] i'm not the pig scripter [14:40:12] drdee: I'm not sure if we're running a test now [14:40:19] i'm just using this as an excuse to learn what is needed :) [14:40:29] we are running the 1:1 banner impression stuff [14:40:32] right now [14:40:37] * Jeff_Green seeks carrot and stick [14:40:59] so let's update the filter stuff on oxygen, [14:41:12] people have promised me a lot of beers to get stuff done. i wonder if I can transfer the IOUs. 
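For reference, the "just grep the files on disk" route Jeff describes can be sketched roughly like this — the file names, the timestamp layout, and the 14:00-16:00 UTC window (10AM-12PM EDT) are illustrative assumptions, not the actual paths on oxygen:

    # 404s for the deprecated BannerController URL during the overlap window
    grep '2012-10-03T1[4-5]:' 404.log | grep -c 'BannerController'
    # 1:1 banner impressions for the B12 campaign in the same window
    grep '2012-10-03T1[4-5]:' bannerImpressions.log | grep -c 'banner=B12'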
[14:41:12] maybe I'll learn what I can, and when we get Kraken to a state of wanting to let more people into it, I will run a little workshop with FR on how to do this stuff [14:41:12] just add BannerController to the path of the 1:1 banner impression filter [14:41:26] drdee: please review [14:41:32] ok [14:41:36] zat ok with you Jeff? [14:41:53] a german accent ? [14:42:02] drdee: I'm not sure we're even running a banner test at the moment, and that gets us our answers in a few hours [14:42:19] whereas I could grep the files on disk in 10 minutes at most even without kraken [14:42:24] ok okok [14:42:45] then, can you figure out the times that I shoudl filter for? [14:42:47] from the logs? [14:42:50] ooh carrot: I'm gonna use O2 to grep and induce packet loss :-P [14:42:57] that is a stick! [14:43:03] oh you're right [14:43:12] and i'll figure out how to write pig timestamp filter [14:43:30] 8AM EDT to 12PM EDT [14:43:50] on 10-03? [14:43:56] yup [14:43:59] ok [14:44:01] average_drifter: merged [14:44:31] drdee: :) [14:44:38] drdee: now we can jump on the debianization [14:45:10] let's use the debianization script from webstatscollector [14:45:14] as a starting point [14:45:36] there are some warnings left when compiling: [14:45:38] src/match.c: In function ‘match_domain’: [14:45:39] src/match.c:166: warning: implicit declaration of function ‘extract_domain’ [14:45:40] src/match.c:166: warning: assignment makes pointer from integer without a cast [14:45:41] src/match.c: In function ‘match_http_status’: [14:45:42] src/match.c:217: warning: implicit declaration of function ‘extract_status’ [14:45:43] src/match.c:217: warning: assignment makes pointer from integer without a cast [14:46:03] I will have a look at these now [14:49:04] question about udp-filter . . . preferred way to match two possible --path substrings, along the lines of /(BannerLoade&banner=B12|SomethingElse)/ ? [14:52:30] just: [14:52:41] BannerLoader&banner=B12,SomethingElse [14:54:50] ok [14:55:09] what if, in theory, I needed a comma (I don't)? escape it? [14:55:48] there is also a -r option [14:55:48] no you can't match a comma right now [14:55:52] that turns your arg into a regexp [14:56:06] that's also true [14:56:30] but AFAIK, comma's in urls are url encoded anyways [14:56:50] cool. non-issue--I was just curious [14:57:17] drdee: ok sorry for the small commit, this fixes those warnings https://gerrit.wikimedia.org/r/26629 [14:58:00] merged [14:58:35] average_drifter: move to labs? [14:58:36] drdee: shall we use the same VERSION generating mechanism we have in debianize.sh ? [14:58:39] drdee: moving to labs [14:58:41] yes [14:58:49] wait [14:58:55] udp-filter is slight more sane [14:59:11] there is one canonical place for the VERSION and that's in udp-filter.c [14:59:26] but that should be updated automatically [14:59:47] drdee: yes, I will modify that so it can be picked up and modified by the Perl one-liners in debianize.sh [15:00:15] k [15:00:55] drdee: user@garage:~/wikistats/udp-filters$ git describe | awk -F'-g[0-9a-fA-F]+' '{print $1}' | sed -e 's/\-/./g' [15:00:59] fatal: No names found, cannot describe anything. [15:01:11] drdee: dunno why git describe says "No names found, cannot describe anything" [15:01:27] uhmmmm because a tag is missing [15:01:45] drdee: oh alright, should I make a tag , what should the tag name be ? 
[15:02:00] so we should retroactively add tags to udp-filter based on version numbers :) [15:02:18] use the same tag convention as in webstatscollector [15:02:34] drdee: ok, so I should go back to the last commit just before I started working on it [15:02:35] i think it's just V0.1 or maybe even 0.1 [15:02:38] drdee: and do a tag there right ? [15:03:03] slightly different [15:03:20] find all the commits where i changed the version number [15:03:29] ok [15:03:32] and tag those commits with the version number [15:03:40] so we can easily navigate through history [15:03:44] alright [15:04:22] i have a quenching thirst for a coffee [15:04:51] and one more wanting : [15:04:56] i mean warning :D [15:04:57] src/udp-filter.c: In function ‘parse’: [15:04:57] src/udp-filter.c:761: warning: implicit declaration of function ‘determine_num_obs’ [15:18:44] brb [15:25:41] milimetric: do you have ideas for using git flow and tags? [15:27:54] tags? [15:28:15] drdee git flow tags for you wherever it makes sense (like finishing releases) [15:28:24] oh cool [15:28:40] perfect, i was exactly wondering about that [15:29:10] yeah, without gerrit it eliminates basically all git-fu and limits you to start / finish / publish, commit, and merge resolution [15:29:24] you can even do stuff like git flow feature rebase blah [15:39:06] so what's your verdict? do gerrit and git flow coexist or do they hate each other [15:39:44] it depends on what the gerrit policy is [15:40:08] so git flow tries to keep stuff out of master until a release is ready - and that's sensible [15:40:55] that clashes with the current goal of reviewing every commit as we go (and therefore merging it into master from develop / feature / etc.) [15:42:16] but my verdict is we can still do it - our local repositories will be sensible and I see no problem with git flow / git review every commit [15:42:35] if we want to git review only at the end of the release branch, then we have to git rebase and squash it all into one commit. [15:43:30] the squashing is bad imho, loss of history and too much to review [15:48:22] definitely [15:49:04] i think git-review as you go is a better option. Then when you start your release branch everything's reviewed and the review could check the "production-readiness" of it all [15:49:54] the only other option is to get gerrit to commit to the develop branch - that would be a more natural fit too [15:51:28] you mean that the develop branch is reviewed and that you then can pull once in a while directly from develop into master? [15:53:04] ahhh doh [15:53:11] the 10-03 logs are in the 10-04 archive file [15:58:56] drdee, no i mean we change gerrit to merge into develop if the changes are accepted [15:59:06] and then continue to use git flow the way it's meant to be used [15:59:40] i think that's possible, you just have to update the .gitreview file and indicate a different remote branch [15:59:48] yep [15:59:55] i read about people doing that [16:00:20] it seems a better fit if we want to actually review every commit [16:00:35] because every commit isn't necessarily supposed to go into master [16:11:53] brb lunch [16:12:00] hey is the metrics meeting today? [16:17:41] ottomata: yes [16:18:07] mk [16:23:48] drdee, check this out! [16:23:48] http://docopt.org/ [16:24:14] generates a command line interface from a help message string [16:24:16] you write the help string [16:24:22] and the parser is generated from that [16:24:24] so cooooool! 
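Picking up the tagging thread from above: git describe was failing simply because the repository had no annotated tags yet, so backfilling tags on the commits that bumped the version is enough to make the debianize.sh one-liner work. A sketch — the tag names follow the webstatscollector-style convention mentioned above, and the commit placeholders and remote name are hypothetical:

    # tag each version-bump commit (annotated tags, since plain "git describe" ignores lightweight ones)
    git tag -a 0.1 -m "udp-filter 0.1" <commit-that-set-0.1>
    git tag -a 0.2 -m "udp-filter 0.2" <commit-that-set-0.2>
    git push --tags gerrit        # assuming the Gerrit remote is named "gerrit"
    git describe                  # now yields e.g. 0.2-5-gabc1234 instead of "No names found"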
[16:24:45] a great rudimentary tool for debugging C code [16:25:05] fprintf(stderr," at line %d\n",__LINE__); [16:25:11] :) [16:25:31] otto: that rules [16:26:14] that is very coooool indeed [16:26:20] the C port is coming :( [16:34:03] drdee: let's add BannerController to the banner impression filter [16:34:10] ok [16:35:28] i can just do it I guess [16:40:25] yup [16:50:01] goood morning [16:50:08] i just talked with herr moeller. all is well for the meeting. [16:51:05] he particularly complimented milimetric on his fine work updating the reportcard [16:51:12] Jeff_Green + drdee: [16:51:13] https://gist.github.com/3834876 [16:51:36] no way [16:51:42] that can't be!? [16:51:45] waay [16:51:46] why? [16:51:50] this is bannerImpressions.log [16:51:51] right? [16:51:55] not BannerController [16:52:00] oh! ok. [16:52:23] so I should do the same for 404.log and filter for BannerController? [16:52:27] or no [16:52:28] that's pretty slick, ottomata [16:52:28] hmmm [16:52:41] ottomata: ya [16:52:48] well, hmm [16:52:56] but we know that 404.log is all 404s, right? [16:53:15] you just want to know the # of 404s for BannerController in that time period? [16:53:15] it's all 404s, not just BannerController 404s [16:53:19] thanks dschoon. I'm glad my vim -o `grep -ril "July 2012" *` fu was appreciated [16:53:20] yes, but for all 404's [16:53:20] righto [16:53:48] reminder: metrics meeting is at 10:30 [16:53:51] for my next trick, I hope to actually update the website as opposed to the generated code, lol [16:53:55] if we're all here, shall we start scrum early? [16:53:59] sure [16:53:59] sure [16:54:10] milimetric: i mean, let's not get overly ambitious ;) [16:54:27] https://plus.google.com/hangouts/_/2e8127ccf7baae1df74153f25553c443bd351e90 [16:54:37] drdee in case you didn't see ^^ [17:05:41] i love these gists you guys. Is it weird that I'm attracted to large numbers? [17:06:14] no, that's a requirement to be part of the analytics team :) [17:06:58] drdee, Jeff_Green: [17:06:58] https://gist.github.com/3834876 [17:07:32] wow, 10% [17:07:47] i'm still not clear on what these numbers mean or what you are trying to show [17:08:00] OH [17:08:01] wait [17:08:02] sorry [17:08:04] do not look at that [17:08:08] I did not filter for BannerController doh [17:08:10] sorry [17:08:13] oh [17:12:51] ok, here we go: [17:12:53] https://gist.github.com/3834876 [17:12:55] Jeff_Green ^ [17:13:25] 1% [17:13:28] that's not very much... [17:15:04] yeah, so what are you comparing? i'm just curious [17:15:13] you ran a test during your 404 collection that did what? [17:15:19] oh hm [17:15:20] so [17:15:22] ottomata, can you run a pig script against the 404 log file and count the number of times you find B12 in the URL? [17:15:40] you think the BannerController 404s are from cached urls [17:15:47] that should be hitting your B12 stuff instead? [17:16:13] i checked with binasher yesterday, and those BannerController url's are not cached in squid [17:16:14] essentially, yeah [17:16:15] so you were hoping that your dip in B12 requests would be accounted for by the cached URLs pointing at BannerController hits? [17:16:29] B12 [17:16:29] yes [17:16:30] in 404 [17:16:33] ok [17:16:34] but surely we have logfiles from nonsquids [17:16:38] from the apaches? [17:16:43] not from apaches [17:16:52] its more than squids [17:17:09] we call it squids but it is varnish / squid / nginx [17:17:12] squids, varnich, nginx, etc. 
[17:17:14] but not apach [17:17:15] e [17:17:22] since that woudl duplicate stuff anyway [17:17:36] the request woudl be log from frontend cache or proxy, and that is good enough [17:17:51] no requests go directly to apaches [17:17:57] so logging from apaches too would duplicate [17:18:58] drdee, you want that number in the same time span? [17:19:11] yes let's do that [17:19:36] ottomata: did you know that you can put pictures too in gists ? :) [17:19:57] no! [17:20:01] ottomata: I mentioned that because I saw your gist and was wondering how the charts might look like after running the commands you did [17:20:20] ottomata: there is a clone url above inside the gist, if you clone it it's just a regular git repo, and you can add files to it like images for example [17:20:25] ottomata: and they will appear in the gist :) [17:20:35] limn needs support to load external data urls :D [17:23:04] ottomata in addition,, is it possible to count the number of banner impressions from ip6 addresses using pig? [17:24:41] ah, the proxy [17:24:42] true [17:24:57] hm, should be possible [17:25:01] limn does! [17:25:03] or should. [17:25:06] but soon [17:26:58] drdee [17:27:03] 32 [17:27:15] not 42? [17:27:17] 32 banner=B12 404s in that time period [17:27:17] well well well [17:27:36] now that we've got hits+misses in the same log I hacked a realtime reporter [17:27:55] ...drums please… [17:27:56] just tailing and tallying counts for 1 minute intervals [17:28:00] 2885:20 [17:28:22] 3573:20 [17:28:29] banners:404s of course [17:29:01] :20? [17:29:57] that's a coincidence [17:30:03] 3473:12 [17:30:28] so i don't know exactly what BannerController does, does it inject/load the B12 banner? [17:30:49] because if that's the case then obviously you would see few B12 404's [17:31:37] Jefff_Green so what does BannerController do? [17:31:50] no, i mean, i'm not sure what your numevers are [17:31:52] numbers* [17:31:54] i'm told it was a chunk of javascript, a list of possible banners URLs, and the browser used it to select and fetch a banner [17:31:54] ?:? [17:32:10] banners served : 404s served [17:32:13] ah, k [17:32:22] in each minute [17:32:24] yeah [17:32:27] aye [17:32:34] validates your numbers imo [17:32:57] especially since mwalker went cowboy last night and tried to purge 1.7m stale docs [17:33:14] those nubmers include BannerController requests? [17:33:21] B12 + BannerController? [17:34:07] /=B12/ : 404 and /BannerController/ [17:35:39] ah i see [17:35:40] ok yeah [17:35:46] man webex [17:35:46] grrrr [17:37:12] can't get in to meeting [17:37:13] oh well [17:39:36] me neither [17:39:39] what's going on? [17:39:45] the plugin doesn't load [17:43:48] drdee: please review https://gerrit.wikimedia.org/r/26644 [17:43:57] drdee: this fixes the referer test [17:44:06] thanks [17:45:15] done [17:51:12] if it helps, this works on windows pretty well. So drdee if you still have your VM trial, you can try chrome in that [17:51:18] (the webex) [18:03:03] RACING CAMEL [18:03:06] I WANT A RACING CAMEL [18:03:09] WTF [18:04:00] especially with a spider monkey jockey [18:04:05] or did i misunderstand that [18:04:18] maybe it was his friend? it was definitely a spider monkey [18:27:07] does ./udp-filter -i 0.0.0.0/0 match all ips ? [18:27:14] it looks like something that would match any ip [18:27:20] is that accurate ? [18:27:53] i think it should match all ip's, ottomata can you shine some light on cidr ranges? [18:28:13] or maybe just 0.0.0.0 [18:28:20] i think that's the case [18:28:21] i think /0? 
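As an aside, Jeff's tail-and-tally reporter above can be approximated with a one-liner; the combined log name, and the assumption that BannerController lines are the 404s, come from the discussion rather than the real setup (gawk is needed for systime(), and the tally only ticks when a new line arrives):

    tail -f bannerAndController.log | gawk '
      BEGIN { last = systime() }
      /banner=B12/         { hits++ }     # banners served
      /BannerController/   { misses++ }   # requests to the deprecated URL (the 404s)
      systime() - last >= 60 { printf "%d:%d\n", hits, misses; hits = misses = 0; last = systime() }'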
[18:28:31] ottomata: what does /0 mean ? [18:28:43] sorry, I'm not an expert in networking and stuff [18:28:47] so that's why I'm asking [18:28:47] 1 address IIRC [18:31:24] hey do you guys know how to interpret what udp2log records as squid's action code? is TCP_MISS a straight TCP_MISS or is there something else to that since we have the dual-squid layer config? [18:31:36] naw, the /x refers to the number of bits in the netmask [18:31:38] so [18:31:56] how many bits are part of the network, vs how many are avaialbe for assignment [18:32:34] 32 - x == number of bits avaiable for the IP range [18:32:35] so [18:32:36] 0 [18:32:38] would mean all bits [18:32:53] however, i'm not sure if 0.0.0.0/0 is a valid cidr range [18:32:59] semantically it is, but not sure [18:33:01] trying to check [18:33:12] I think it's valid [18:33:14] ottomata: it appears in the tests, that's why I ask [18:33:33] http://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing#CIDR_blocks [18:33:51] see the IPv4 CIDR blocks chart for a quick visual [18:34:36] ottomata: so the second test checks how many of these ips 208.80.152.111 , 208.80.152.222 , 216.38.130.161 , 2002:6981:27e2::6981:27e2 , 2002:6981:27e2::6981:15e2 , 127.0.0.1 are in the range 0.0.0.0/0 [18:34:53] ottomata: and the test actually wants that 4 of these ips to be in the range 0.0.0.0/0 [18:35:29] all but the ipv6 addresses are in that range [18:35:45] Jeff_Green: I fully agree [18:35:52] what test? [18:36:30] tests that are part of upd-filter [18:36:34] yes [18:36:43] oh you are adding tests! [18:36:44] nice [18:37:27] those tests always have been part of upd-filter :D [18:37:39] there are in the run.sh script [18:38:06] you had cidr range tests? [18:38:25] ottomata: the cidr range test I described above was written before I came on the project [18:38:34] ottomata: right now I am trying to understand how the test works [18:38:46] ottomata: and why I am getting 2 instead of 4 ips in 0.0.0.0/0 range [18:39:08] wondering who wrote that test? i don't htink I did, and I just added cidr addressing to udp-filter like a month ago [18:39:22] yeah, good q, i'm suspicious of 0.0.0.0/0 [18:39:31] i betcha for some reason somethign doesn't like it [18:39:56] btw [18:39:59] check out cidrcalc [18:40:05] it is installed as part of libcidr [18:40:10] ottomata: so 0.0.0.0/0 matches just the first and last of the list of ips I mentioned above [18:40:17] cli tool for printing details of cidr ranges [18:40:22] otto@stat1:~$ cidrcalc 0.0.0.0/0 [18:40:22] Address: 0.0.0.0 [18:40:22] Netmask: 0.0.0.0 (/0) [18:40:22] Wildcard: 255.255.255.255 [18:40:22] Network: 0.0.0.0/0 [18:40:22] Broadcast: 255.255.255.255 [18:40:23] Hosts: 0.0.0.1 - 255.255.255.254 [18:40:24] NumHosts: 4,294,967,294 [18:40:28] I will write the list again just for verbosity :) [18:40:28] looks good [18:40:28] 208.80.152.111 , 208.80.152.222 , 216.38.130.161 , 2002:6981:27e2::6981:27e2 , 2002:6981:27e2::6981:15e2 , 127.0.0.1 [18:40:40] it doesn't match .222 and .161 [18:40:41] ? [18:40:53] ottomata: no, it does not, not sure why [18:42:22] hmm, i guess I wrote these tests? 
[18:42:36] nope, was diederik [18:43:43] user@garage:~/wikistats/udp-filters$ git blame run.sh -L 6,6 [18:43:43] cae3acdc (Diederik van Liere 2012-10-03 11:35:51 -0400 6) ip_filter2=$(cat example.log | ./udp-filter -i 0.0.0.0/0 | wc -l) [18:43:49] yes [18:43:53] ottomata: but the test is good [18:44:14] ottomata: I mean we agree that 4 should be the result of that [18:45:19] ja [18:48:18] I just compiiled and ran that [18:48:19] I get 4 lines [18:48:24] ottomata: you get 4 ? [18:48:26] cat example.log | udp-filter ./udp-filter -i 0.0.0.0/0 | wc -l [18:48:26] 4 [18:48:30] ottomata: please tell me your branch and commit hash [18:48:47] (master)[6839b1b] [18:48:57] average_drifter: then you introduced a bug with the refactorign :) [18:49:08] i just pulled [18:49:17] , recompiled [18:49:18] and ran that [18:49:45] ottomata: wait, you pulled the latest udp-filter [18:49:53] ottomata: and I have the same, but you get 4 and I get 2 [18:50:01] ottomata: what is the architecture of your machine ? just curious [18:50:36] x86 , amd64 ? [18:50:50] amd64 ubuntu precise [18:51:17] Hi, do you need any volunteers? [18:53:17] ottomata: I have an x86_64 machine available, let me just see what I get on that, I'm just curious.. [18:54:46] average_drifter: i get 4 on labs as well [18:59:10] hi louisdang [18:59:16] what are you interested in volunteering for? [18:59:26] my experience has been in analytics [18:59:35] can I help on Kraken? [18:59:56] I'm not sure but the people to talk to are dschoon and ottomata [19:00:10] they're the Kraken wranglers right now [19:00:25] louisdang: can you tell us more about yourself? [19:00:39] I'm a graduating student at the University of Washington [19:01:07] I worked at Research in Motion for a year on search analytics [19:01:14] on Blackberry devices [19:01:14] can you give us some url's of stuff that you have done? [19:02:09] My experience has been on mobile devices, I've never done open source so all I have is a resume [19:02:41] but I'm unemployed now with a lot of free time so I've been looking for things to do [19:03:11] what languages are you comfortable with [19:03:35] Java and C++. I've used PHP in a web programming course before. [19:05:00] I worked with SQL Server and was familiarizing my self with Hadoop before I left RIM [19:05:16] from what study are you graduating? [19:05:49] Mathematics, but I took several CS courses [19:06:06] so what would you like to do? [19:06:40] since my background is in analytics I thought I check here first. I'm up for anything analytics related [19:06:51] I specifically worked in text analytics [19:07:04] including query classification and spelling correction [19:07:17] using Java and JDBC [19:07:45] though I would like to develop skills in using noSQL databases [19:08:03] are you familiar with lucene? [19:08:04] in developing for noSQL databases [19:08:07] cat ~/vagrantbox1_ssh [19:08:09] damn [19:08:10] sorry [19:08:10] yes I have used it before [19:08:22] but only for stemming queries [19:09:55] do you know solr? [19:10:59] test [19:11:31] sorry, I got disconnected for a second. I'm on a spotty connection. [19:11:49] do you know solr? [19:12:22] no [19:13:07] I'm looking to learn more by volunteering [19:16:09] So the triggy part is that most of our projects are quite big so I am trying to think of something where you can make progress by volunteering only a few hours [19:16:31] I actually have a lot of free time. I'm only taking one class this quarter before I graduate. 
[19:17:24] So I can commit 20-40 hours per week [19:18:44] while I look for work [19:20:16] so how many hours can you volunteer in total? [19:22:38] louisdang ^^ [19:22:41] I would like to start with 20 hours per week and I should have time to volunteer up to the end of December [19:22:58] ok, we could give the following a shot: [19:23:16] are you familiar with Pig? [19:23:32] no [19:23:59] Pig is a high-level scripting language which you can use to write map/reduce jobs for hadoop [19:24:06] Ok [19:24:16] sounds like something I can learn [19:24:21] it would be totally awesome if you would help us write these pig scripts [19:24:24] this is kraken related [19:24:32] ok [19:24:36] agreed! [19:24:38] and it would be a huuuuuge help of us [19:25:12] so, the 1st thing to do is to go to labsconsole.wikimedia.org and get an account [19:25:26] I already requested one [19:25:34] and then we will help you setup a 1 node hadoop-local mode instance [19:25:39] ok [19:25:46] can I contact you by email? [19:25:52] just come to IRC [19:25:56] ok [19:25:58] yup, and you can also join the analytics mailing list, if you want to [19:26:03] or email me davnliere @ wikimedia dot org [19:26:15] davnliere => dvanliere [19:26:23] https://lists.wikimedia.org/mailman/listinfo/analytics [19:26:30] sure thing [19:26:31] also have a look at reportcard.wmflabs.org [19:26:43] that's the kind of metrics we are looking for [19:26:59] obviously we will explain much more later on [19:27:14] first, get your account and get your hadoop / pig instance running on labs [19:27:28] ok [19:28:23] let us know when you get a labs account, we'll see if we can add you to our project group and you can create an instance there [19:28:26] so are the scripts for automated reporting? [19:29:10] ja so, basically, right now that data is generated from a long and convoluted pipeline of perl scripts and excel spreadsheets by a guy that has been doing it for a long time [19:29:19] (well, the data that comes from web access logs) [19:29:28] I see [19:29:42] and it takes a very long time for him to process that data [19:29:52] we'd like to use kraken to do that much faster [19:29:53] yes. that is similar to how I did it at RIM [19:30:01] since we didn't have Hadoop yet [19:30:10] louisdang: can you send me your resume? [19:30:19] sure [19:31:49] sent [19:32:42] I made a dashboard using Excel and VBA [19:32:57] and used Java to process the data [19:33:32] unfortunately there's a law in Canada that only allowed RIM to keep me for a year before I returned to school [19:33:48] So I only started defining requirements for the Hadoop data collection/reporting [19:34:46] since we were only beginning to implement Hadoop [19:39:06] my email is dangl@uw.edu by the way, in case I get disconnected [19:47:00] so first step is to get labs access, once you've got that up and running: ping us! [19:47:44] alright I just have to wait for someone to pick up my request. I'll read up on Pig and Kraken in the mean while [19:48:01] yep. https://www.mediawiki.org/wiki/Analytics/Kraken [19:48:22] anything else I should learn? 
[19:48:43] i think that's a decent chunk to start with [19:48:44] :) [19:49:27] alright thanks for this opportunity and everything [19:50:42] yeah, read up on pig and hadoop for sure [19:50:52] you can probably spawn a VM of your own on the meantime and try to install hadoop locally [19:51:24] alright [19:52:14] ottomata: quick question: does the speed with which an ssh connection can send back the output of some commands which are run in serial, affect the speed overall of these commands ? [19:52:59] commands are run in serial, which means yes, the command sequence is affected by latency, but everything is sent at once [19:53:02] if i follow you [19:53:13] dschoon: you do [19:53:24] dschoon: so basically everything is capped by the speed with which it can send back the output [19:53:37] I don't know where build1 is but I'm pretty sure it's far far away from me [19:53:59] I'm in 46° 00' N and 25° 00' E [19:54:02] so if you ran `ssh server -- foo | bar | baz` it would only send one command (the stuff after the dashes) and only recieve one reply [19:54:16] dschoon: how about if I run commands with >/dev/null [19:54:20] dschoon: would that be faster ? [19:54:22] no. [19:54:30] but it will use less bandwidth. [19:54:30] 2>&1 > /dev/null ? [19:54:45] depends on what you mean by "faster" [19:54:50] it will return control to your shell sooner [19:55:10] dschoon: faster as in not letting each command in a bash script wait until the output gets across the ssh connection so it can continue with the next command inside the bash script [19:55:25] it doesn't wait for the output to get to you, no [19:55:32] it just pushes it into a buffer. [19:55:58] ok, then .. is build1 a VM ? [19:56:04] or physical machine ? [19:56:10] no idea :) [19:56:14] a VM, if it's labs [20:00:46] VM [20:01:05] yup [20:06:18] http://highscalability.com/blog/2012/10/4/linkedin-moved-from-rails-to-node-27-servers-cut-and-up-to-2.html [20:06:29] 20x faster, cut 27 servers [20:06:50] wow [20:07:55] For our inevitable rearchitecting and rewrite, we want to cache content aggressively, store templates client-side (with the ability to invalidate and update them from the server) and keep all state purely client side. [20:08:07] That sounds exactly like arguments I have made many times about app architecture. [20:08:17] Pure Client solutions are totally the future. [20:08:24] :) +1 [20:08:24] it's the BACK END that will become dumb! [20:09:24] dan and i are of one mind. [20:10:19] dude i didn't grok the full power of d3 scales [20:10:27] isn't it awesome? [20:10:34] it's a coordinate conversion [20:10:37] I'm like computing all this zoom stuff on my own - I could've just changed the domain on the scale!! [20:10:38] domain -> range [20:10:41] yep! [20:10:46] i think i might have said that :P [20:10:50] i didn't realize I can do that dynamically, I thought I'd have to redraw the graph [20:10:54] which is what I was doing before [20:11:05] it's also in my example stuff in the chart/type/d3 directory [20:11:07] for the lens [20:12:14] well, it's not perfect because I still have the unscaling problem I think, so I need some hybrid approach [20:12:41] do you? [20:12:42] why? 
[20:13:11] because you get the nice "transition" effect only if you manually scale the containers of the lines and circles [20:13:28] which makes you need to unscale the contents of those containers [20:13:37] paths have that built in property but circles have to be done manually [20:13:38] d3.transition, yo [20:13:49] yeh, i'm using that [20:13:56] that's what I'm saying [20:13:58] hm. [20:14:01] well [20:14:12] i am still trying to figure out where i left my headphones [20:14:19] and then you can walk me through the code [20:14:22] np [20:37:44] drdee: please review https://gerrit.wikimedia.org/r/26701 [20:39:38] done [20:41:16] ok 6/16 tests are passing [20:41:20] 10 more to go [20:42:50] probably they are all off due to the internal traffic filtering [20:48:48] http://testanything.org/wiki/index.php/Tap-functions [20:48:51] http://svn.solucorp.qc.ca/repos/solucorp/JTap/trunk/tap-functions [20:49:03] if we have continous integration for udp-filters [20:49:14] we can use the bash tap functions to run the tests [20:49:26] probably we get the output in TAP and we can feed that to some hudson or something like that [20:49:33] not sure if we do CI [20:49:39] do we ? [20:51:25] not worth it ATM [20:51:29] alright [21:53:39] drdee: cat example.log | ./udp-filter -d "([wiki])" -r | wc -l [21:53:43] drdee: is that a regex ? [21:53:53] drdee: if it's a regex, then the [wiki] is a character class right ? [21:54:02] /agree [21:54:12] drdee: I've fiddled with PCRE, I don't know much about regex.h [21:54:24] i think you are thinking too complicated [21:54:28] that used to work [21:54:32] and we didn't touch it [21:54:48] what is the problem? [21:54:48] i agree with average_drifter [21:54:53] that doesn't make sense as a regex [21:55:00] you'd need to escape the braces [21:55:08] "(\[wiki\])" [21:55:27] yeah I mean what dschoon is saying [21:55:35] provided you actually wanted to match the literal string "[wiki]" [21:55:35] drdee: so that's my question, isn't that supposed to be escaped ? [21:55:54] there was something odd with that but it's a while back [21:56:14] it has to be. [21:56:26] average_drifter: you can just change the regex in run.sh [21:56:26] otherwise it's trying to match one of any of the letters "w", "i", or "k" [21:56:33] and the double-i doesn't make ANY sense [21:56:37] OR it's not a regex. [21:57:03] drdee: if I leave it like it is, I get 2 matches. if I escape it , then I get 0 matches [21:57:16] and the 2 matches make sense or not? [21:58:41] drdee: if I do cat example.log | ./udp-filter -d "([wabcdefghijklmzxcqweqpo901471njkr2905jkhqwe])" -r [21:58:45] drdee: I get the same 2 matches [21:58:50] drdee: that's just because w is in them [21:58:51] okay so it's broken [21:58:57] no, the test is broken. [21:59:13] dschoon: yes because I don't know the syntax regex.h accepts [21:59:14] yes the test is broken [21:59:22] ah [21:59:43] i think regex.h has PCRE syntax compatibility IIRC [21:59:50] if I drop the square brackets it makes sense to me [22:00:05] dschoon , drdee what do you think if I drop the square brackets ? [22:00:23] then it makes sense because there are two en.wikipedia.org entries and those can be matched by /(wiki)/ [22:00:29] the idea is to match an arbitrary string in the domain [22:00:36] so just adjust the ragex accordingly [22:00:40] alright [22:06:36] sometimes using PCRE definitely makes me ragex. 
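To make the character-class point above concrete (using grep -E as a stand-in here, since the exact regex.h flavour udp-filter compiles with isn't shown in the log):

    echo en.wikipedia.org | grep -cE '([wiki])'   # 1 - matches: any single "w", "i" or "k" is enough
    echo www.example.org  | grep -cE '([wiki])'   # 1 - also matches, so the old test proved nothing
    echo www.example.org  | grep -cE '(wiki)'     # 0 - the literal substring "wiki" is required
    echo en.wikipedia.org | grep -cE '(wiki)'     # 1 - what the run.sh test actually wants to check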
[22:06:39] RAGE-EX [22:06:46] THE PATTERN MATCHER SMASHER [22:20:16] I have joe rogan podcast all the time on the background [22:20:20] that guy is funny as hell [22:35:49] question about http statuses [22:35:52] is -s 50 [22:35:58] supposed to match 50 and 504 as well ? [22:38:24] ok 11/11 [22:38:25] all green [22:38:32] echo bio tests :) [22:40:47] drdee: https://gerrit.wikimedia.org/r/26722 [22:40:49] drdee: please review [23:00:42] done [23:29:05] drdee: I just found the revisions in which the VERSION changed so I can know what to make tags out of [23:29:12] drdee: I just had to run this perl oneliner [23:29:14] drdee: perl -e 'sub diff_consec {return `git diff HEAD~$_[0] HEAD~$_[1] src/udp-filter.c 2>&1 | perl -ne "/^[+-][^+-]|fatal/&&print"` }; sub rel2sha1{return `git rev-parse HEAD~$_[0]`}; $i=0; while(1){$i++; $diff=diff_consec($i,$i+1); last if $diff =~ /fatal/; print rel2sha1($i) if $diff =~ /VERSION/ }; ' [23:29:30] wrote it in like 7m [23:42:16] is there any documentation for the planned system architecture?
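A closing aside on that version-hunting one-liner: git's pickaxe options can do the same search natively — a sketch, assuming the version string lives in src/udp-filter.c as discussed (adjust the pattern to however the string is actually declared):

    # commits whose diff adds or removes the string VERSION in udp-filter.c
    git log --oneline -S'VERSION' -- src/udp-filter.c
    # or match the changed lines against a regex instead of a fixed string
    git log --oneline -G'VERSION' -- src/udp-filter.c
    # then tag each of those commits with the version it introduced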