[00:02:24] awight: you have a moment to look at something with me?
[00:02:30] mwalker: si
[00:02:43] ok; I just shared a document with you called "Banner Count Analysis"
[00:02:44] the growing stack of PCR in your queue ? :p
[00:02:56] hehe
[00:02:57] not yet
[00:03:09] oh ho
[00:03:18] you should also have a document shared with you by megan called "July 2013 Results"
[00:03:29] someone is fired ;)
[00:03:47] in megan's document in the "2013 FY Daily" tab -- note the HUGE jump in impressions on the 26th
[00:04:15] (which is a global result)
[00:04:38] "july banner analysis"?
[00:04:59] ah, i see in 2013 results/
[00:05:07] no https://docs.google.com/a/wikimedia.org/spreadsheet/ccc?key=0AphVIE-Yv6AndDRlMlc4TTlOaVhfMWxFX2ZpYlBmZHc#gid=10
[00:05:18] that's not how I'd interpret it.
[00:05:41] Megan shows 3 or 4 discontinuities around that date
[00:06:38] maintenance day...the rise on the 22nd is even harder to begin to explain
[00:06:49] it
[00:06:50] 's
[00:06:53] a persistent rise
[00:07:14] and I don't see it on the raw log/site data which is my document
[00:07:23] so the only thing I've got is that peter's script is in error
[00:07:29] goddam timestamp...
[00:07:34] I know
[00:07:51] where is the steady rise? i see no such thing
[00:08:01] megan's #s?
[00:08:04] ya
[00:08:10] I shoved a plot in on that tab
[00:08:11] where
[00:08:25] "July daily total"?
[00:08:36] no no; 2013 FY Daily
[00:08:37] oh sorry, backscroll..
[00:09:02] you are saying, 22-26 is a steady climb?
[00:09:35] no; that's just log bullshit -- it's the jump from 70M imps/day to 210M imps/day that confuses me
[00:09:52] yes
[00:09:54] no kidding
[00:11:21] and since we don't see that same type of magnitude of jump in the raw data -- I think peter's script goofed
[00:11:32] (if anything we should have seen our counts go down)
[00:12:26] ugh. 1374832800
[00:13:16] have you audited any time ranges?
[00:13:45] I would just do a pre- and a post- hour by hand
[00:15:14] nice work, btw. pretty soon -creative will need VR gloves to go to work ;)
[00:16:02] what's at that timestamp; I'm not seeing anything unusual
[00:16:16] mwalker: nothing, I agree
[00:16:32] can you explain the upper-right graph on your Global tab?
[00:16:51] wp log #s versus BI raw logs?
[00:16:59] ya
[00:17:10] why does the ratio go crazy if neither line diverges?
[00:17:31] because it's a log graph
[00:17:44] we're seeing small absolute changes; but large relative changes
[00:18:10] maybe, normalizing the red and blue lines would help.
[00:18:18] also a good question -- why are WP sitewide #'s so much greater than ours
[00:18:18] and plot them linearly
[00:18:28] that's a darn good one.
[00:18:35] u win 450M$ !
[00:18:41] I wish :)
[00:18:51] or... cos the upstream analytics are effed
[00:19:34] well; partially it's because the upstream is all page hits (including special pages)
[00:19:56] so I'm sort of assuming we're just seeing the site overhead in the plots
[00:20:03] but also spam/bot activity on the smaller wikis
[00:20:09] because the enwiki plot is not nearly so crappy
[00:20:14] I don't think that 99.5% of traffic is bot and Special
[00:21:47] Am I reading correctly, that the mobile ratio goes > 100% ?
[00:21:52] That would be a bug...
[00:23:38] interesting, in that graph the enwiki traffic changes waveform after 7-26, but with roughly the same mean
[00:23:38] yep
[00:24:00] ya -- so we know that on the 26th we moved fundraising log collection from the distrobution point
[00:24:07] which meant less packet loss on that server
[00:24:31] *distribution server to another server
[00:24:59] oh god, so peaks were getting truncated.
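The remark about log graphs above can be checked numerically. The following is a minimal sketch with made-up daily counts (the numbers and function names are hypothetical and come from neither spreadsheet): a series can move only a fraction of a decade on a log10 axis while its ratio to a flat series changes by half.

```python
import math


def log_axis_shift(before, after):
    """Distance, in decades, that a value moves on a log10 axis."""
    return math.log10(after / before)


def ratio_change(num_before, num_after, den_before, den_after):
    """Relative change of the ratio num/den between two samples."""
    r_before = num_before / den_before
    r_after = num_after / den_after
    return (r_after - r_before) / r_before


# Hypothetical daily counts: banner impressions rise by half while
# sitewide page hits stay flat.
banner_before, banner_after = 2_000_000, 3_000_000
site_before, site_after = 400_000_000, 400_000_000

if __name__ == "__main__":
    shift = log_axis_shift(banner_before, banner_after)
    change = ratio_change(banner_before, banner_after, site_before, site_after)
    print(f"banner line moves {shift:.3f} decades on the log axis")
    print(f"banner/site ratio changes by {change:.0%}")
```

Plotting the ratio on a linear axis, as suggested in the chat, is one way to make a change of this kind visible.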
[00:25:07] yep yep
[00:25:31] it's a fairly persistent problem; and I'm not entirely sure we solved it by moving ourselves to a different server
[00:25:44] but I can't tell because gadolinium is not reporting those stats to ganglia
[00:25:57] (gadolinium being the server we're now on)
[00:25:59] Yeah, first step is to get good estimates of packet loss
[00:28:20] oh, your day/day graph is the normalization i was looking for.
[00:30:32] the ratio is about the same on either side of 7-26, but the behavior changes, they do not follow each other as closely.
[00:31:11] right; which could totally be a fluke; or it could be something persistent
[00:31:18] and my money is on something persistent
[00:31:24] but I don't have this month's site numbers
[00:33:17] So far, it looks like nothing happened... that we were getting clipping cos of packet loss, and now the graphs have more features...
[00:33:39] maybe there is a networking reason that RecordImpression counts were depressed more than other loglines
[00:36:10] why does the ratio go down around 1374000000?
[00:36:31] if that is banner/wmo, shouldn't it go up?
[00:39:10] no; you have more wmo and the same banner
[00:39:28] banner impressions in this spreadsheet are done independently from Megan's BIs. got it.
[00:39:37] yes
[00:39:44] my spreadsheet is raw log analysis
[00:39:48] megan's numbers are from the db
[00:40:01] So the only question remaining is, what happened with PG's script, as u said an hour ago
[00:40:19] very fun charts, tho ;)
[00:40:34] well; we know that it was having trouble with file names
[00:40:35] Where is your log analysis code?
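The raw numbers quoted in the chat (1374832800, 1374000000) read like Unix epoch seconds. A small helper for turning them into dates, assuming they are indeed epoch seconds and that UTC is the wanted zone (the function name is hypothetical):

```python
from datetime import datetime, timezone


def epoch_to_utc(ts):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


if __name__ == "__main__":
    # The two timestamps mentioned in the conversation.
    for ts in (1374000000, 1374832800):
        print(ts, epoch_to_utc(ts).isoformat())
```

Under that assumption, 1374832800 falls on 26 July 2013 (10:00 UTC), the day of the jump, and 1374000000 on 16 July 2013.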
[00:40:38] but somehow still generating numbers
[00:40:45] so I suspect that
[00:40:49] I think we figured out that problem
[00:40:56] it shouldn't have caused data loss
[00:41:03] for f in bannerImpressions-sampled100.tsv-201307*; do c=$(zcat $f | egrep "/200.*meta.m.wikimedia.org" | wc -l); echo $f $c >> ~/all.m.banner.count; echo $f; done
[00:41:05] it's stuff like that
[00:41:55] yes pls check in somewhere
[00:42:37] https://github.com/pgehres/wmf_fundraiser_django/commits/master
[00:42:45] July 24 was eventful...
[00:43:51] btw, what is the "A:B" column? total banner impressions?
[00:43:59] right; and 'Updating file format for impression data' was the change that 'fixed' mobile
[00:44:07] ya; A:B is total for that day
[00:44:38] untuitive :p
[00:44:51] no kidding
[00:46:14] Yeah we have to audit two days manually. let's try 7/20 and 7/28
[00:48:21] I get 296 M banner impressions on 7/20.
[00:48:35] C472:C495
[00:49:21] ugh, using a random time of day for the endpoints
[00:55:00] 296M BI for 7/28
[00:55:06] suspicious ;)
[00:56:48] neither one nearly matches the pg script
[00:56:58] full teardown...
[01:32:02] awight: sorry; got distracted by gwicke; but the numbers are not actually supposed to match pg
[01:32:15] pg does not count empties
[01:32:19] whereas the raw does
[05:06:35] mwalker|away: k, let's try to audit this...
[17:04:03] (CR) Mwalker: [C: 2 V: 2] Utility to dump campaign logs [wikimedia/fundraising/tools] - https://gerrit.wikimedia.org/r/62094 (owner: Adamw)
[17:13:30] (CR) Mwalker: [C: 2 V: 2] "(4 comments)" [wikimedia/fundraising/tools] - https://gerrit.wikimedia.org/r/37361 (owner: Adamw)
[17:54:40] #1063: (AW) Tech Task #1063 API: getcampaignlogs cannot output XML: O:AW|TS:B|P:NtH|T:TT Description changed -- https://mingle.corp.wikimedia.org/projects/fundraiser_2012/cards/1063
[17:55:31] mwalker: shall I audit those bogus BI numbers?
[17:55:59] what do you mean by audit?
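The one-line shell loop pasted at 00:41:03 could be checked in as a small script. Below is a rough Python sketch of the same count, not the script that was actually used: the function names are hypothetical, the files are assumed to be gzip-compressed (the loop reads them with zcat), and the dots in the pattern are escaped so they match literal dots only.

```python
import glob
import gzip
import re

# Same intent as the egrep pattern in the chat: a 200 response whose URL
# is on the mobile meta host.
MOBILE_META_200 = re.compile(r"/200.*meta\.m\.wikimedia\.org")


def count_matching(lines, pattern=MOBILE_META_200):
    """Count the lines that match the pattern anywhere."""
    return sum(1 for line in lines if pattern.search(line))


def count_in_gzip(path, pattern=MOBILE_META_200):
    """Count matching lines in one gzip-compressed log file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return count_matching(fh, pattern)


if __name__ == "__main__":
    # File name prefix taken from the loop in the chat.
    for path in sorted(glob.glob("bannerImpressions-sampled100.tsv-201307*")):
        print(path, count_in_gzip(path))
```

Keeping the line matcher separate from the file reader makes it possible to test the pattern against a handful of sample lines before running it over a month of logs.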
[17:56:34] though -- I think I've figured out what happened
[17:56:38] K
[17:57:17] peter's script is counting 302 requests -- so we're double counting the number of mobile impressions
[17:57:24] the audit i was suggesting yesterday: do a manual breakdown of one day, and compare against wmf-fundraising script results
[17:57:30] yes ok
[17:57:36] redirect? how is it happening?
[17:57:48] it happens at the varnish layer
[17:57:54] buh
[17:58:01] it says; oh hey, mobile device trying to talk to meta; let's redirect to meta.m.wm.o
[17:58:32] doesn't that happen before the mobile device is connected to real meta?
[17:58:44] + we are getting logs from squid?
[17:59:10] we get logs from the front end squids which are what do the redirects
[17:59:17] and then we get logs from the varnishes
[17:59:27] whoa i did not realize.
[17:59:35] ya
[17:59:39] it's messy
[17:59:47] I thought the mobile useragent grepping was exclusively in varnish
[17:59:51] we also have to drop the nginx SSL logs
[18:00:12] is there a doc describing this architecture?
[18:00:15] :(
[18:00:18] probably?
[18:00:35] jump over into #wikimedia-mobile and ping jon robson
[18:00:38] jdlrobson
[18:01:00] https://wikitech.wikimedia.org/wiki/Squid_logging
[18:01:48] hehe, varnish is not mentioned in the wikitech index
[18:02:06] https://wikitech.wikimedia.org/wiki/Varnish
[18:02:20] is it true that we are still only using varnish on bits?
[18:04:28] Anyway, I'm available to munge through numbers if there are unexplained issues once you factor out the 2x mobile BI
[18:05:23] ok
[18:05:24] https://wikitech.wikimedia.org/wiki/MobileFrontend
[18:06:55] I don't see how RecordImpression is getting called during the request to meta.wmo
[18:07:12] That should simply be frontend squid matching UA and returning a 302 Location
[18:08:44] Oh. I get it, the mobile site js is making a request to http://meta.wmo/wiki/RecordImpression
[18:09:12] that's lame. Yeah, we should be able to fix that in js.
[18:10:40] or... if .m. are separate servers, fix in the config files
[18:12:19] if $wmgMobileFrontend { $wgCentralBannerRecorder = http://meta.m...RecordImpression
[18:13:26] nvm. bad docs...
[19:07:00] mwalker: whatchu think about a wikimedia/tools/lib for shared code?
[19:07:19] I'm ready to assume we only use python...
[19:07:38] as in a gitrepo?
[19:07:47] nah just a subdir
[19:07:58] submodule is an interesting idea, though
[19:08:30] I'm confused; where is that path relative to; what is this question in context to?
[19:08:56] wikimedia/fundraising/tools
[19:09:07] we have a growing amount of shared code
[19:09:13] how about ./lib ;)
[19:10:37] why would you not just put it in the root?
[19:11:06] it is the root of a repo
[19:12:17] the only questions are, does the python include path become byzantine when you go up a dir? I think you have to make all project subdirs into a __init__.py module at that point.
[19:12:35] and 2) u think we can just ignore PHP needs altogether?
[19:13:34] I think you have to make all project subdirs into a __init__.py module <-- yes
[19:13:50] 2) yes; php scripts should be maintenance scripts inside the php project
[19:14:18] so should python scripts for that matter -- but tools does not have anything that strictly belongs to anything
[19:15:01] yah i'm ok w/o ever having php shared libraries in tools/lib
[19:15:12] thanks
[19:15:23] but; once again; why would you need a lib folder at all?
[19:16:09] just to differentiate what is a script; and what is a library?
[19:16:29] ah, good point.
[19:16:34] yes ;)
[19:16:48] but now I will reconsider
[19:17:08] yeah, I think that's a useful distinction
[19:19:43] hmm, if only to save me from coming up with distinctive project names
[19:19:59] live_analysis/fr, for example, would be refactored to tools/fundraising...
[19:20:24] which would then confuse the shit out of everyone
[19:20:45] whereas, tools/lib/fundraising would be more clear
[19:20:47] sigh.
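The shared-library layout discussed above can be tried out in a throwaway directory. This is a minimal sketch, not the actual layout of wikimedia/fundraising/tools: the package and module names are hypothetical. It shows that a subdirectory with an `__init__.py` is importable as a package once the repo root is on the import path. Python 2, current at the time, requires the `__init__.py`; Python 3.3 and later also accept namespace packages without one.

```python
import importlib
import os
import sys
import tempfile


def build_demo_repo(root, package="lib"):
    """Create <root>/<package>/ with an __init__.py and one demo module."""
    pkg_dir = os.path.join(root, package)
    os.makedirs(pkg_dir)
    # An empty __init__.py marks the directory as a regular package.
    open(os.path.join(pkg_dir, "__init__.py"), "w").close()
    # Hypothetical shared module standing in for real library code.
    with open(os.path.join(pkg_dir, "log_counts.py"), "w") as fh:
        fh.write("def double(n):\n    return 2 * n\n")
    return pkg_dir


def import_from_repo(root, dotted_name):
    """Put the repo root on sys.path and import a module by dotted name."""
    if root not in sys.path:
        sys.path.insert(0, root)
    importlib.invalidate_caches()
    return importlib.import_module(dotted_name)


if __name__ == "__main__":
    repo_root = tempfile.mkdtemp()
    build_demo_repo(repo_root, package="demo_fr_lib")
    module = import_from_repo(repo_root, "demo_fr_lib.log_counts")
    print(module.double(21))
```

A script living in a sibling directory would still need the repo root on its path, for example through PYTHONPATH or by installing the repo as a package, which is the "byzantine include path" concern raised in the chat.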
[19:26:43] ya; you're going to actually have to call them something
[19:26:46] that's not fundraising :p
[19:26:58] cruel and unusual.
[19:32:10] ... but now we have to deal with this camelcase holy war ;)
[19:32:19] PEP-8?
[19:32:33] ya
[19:32:39] I've been meaning to enable it on this repo
[19:32:52] and jslint
[19:33:24] we have a bunch of modules named using camelcase
[19:33:30] PEP-8 suggests snake
[19:33:34] wtf ever
[20:08:40] #1014: (AW) Description changed -- https://mingle.corp.wikimedia.org/projects/fundraiser_2012/cards/1014
[20:28:54] marktraceur: would you kindly take a look at https://gerrit.wikimedia.org/r/#/c/81378/
[20:28:56] :D
[20:34:08] mwalker: Later, if that's OK
[20:34:16] sure
[20:34:18] Ping me around 15:00
[20:34:22] gotcha