[07:39:21] (PS11) Nuria: Adding coding guidelines to README.md file [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/120542 [07:47:37] (PS12) Nuria: Adding coding guidelines to README.md file [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/120542 [07:50:30] (CR) Nuria: [C: 2] Adding coding guidelines to README.md file [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/120542 (owner: Nuria) [07:56:58] nuria: :-((( [07:57:24] but wait, isn't that what you asked me to do in your last e-mail? (see thread) [07:57:30] To self-merge? [07:57:40] No. [07:58:21] And I do not see why you have to resolve merge conflicts. Just [07:58:21] >checkout (in the sense of “git checkout") the README.md from your [07:58:21] >latest patch set onto master, and your done." [07:58:43] That does not say anything to by-pass review. Does it? [08:00:01] no, certainly not, it says to merge on top of master, but feel free to revert if merging was not what you meant. [08:00:36] No it does not say to "merge on top of master". Or at least I cannot find the word "merge" in the quote. Where do you see it? [08:01:30] ok, let me unmerge the changes. [08:01:38] No. It's not worth it. [08:02:05] But I really dislike "misreading" emails and then self-merging, only because we cannot bring a commit through code-review. [08:03:15] (CR) QChris: "Self merging controversial things is bad :-(" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/120542 (owner: Nuria) [08:04:00] Heck the Patch Set even comes with trailing spaces again :-D [08:05:11] (PS1) Nuria: Reverting last commit to master of README.md [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/127599 [08:05:36] I shall self-merge the revert and things will be back as they were. [08:05:41] NO. [08:05:47] s/NO/No/ [08:05:56] No need to overreact. [08:06:07] The self-merge was bad from my point of view. [08:06:20] I'd CR-1 the change you self-merged. 
[08:06:23] Ok, i do not understand, don't you want to CR the changes again before they can be merged? [08:06:30] But it's not worth the energy to roll back. [08:06:38] it's 1 click [08:06:47] no energy needed really [08:06:53] But the commit history would get more complicated. [08:07:13] And it's harder to understand why we committed, rolled-back, and shortly afterwards committed again. [08:07:40] I've made my point that I didn't like the self merge. But I am fine to leave things as is. [08:08:27] (CR) QChris: "Not necessary to revert from my point of view." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/127599 (owner: Nuria) [08:09:10] I actually prefer that we do, i had no intention to merge if that was not what was intended. [08:09:34] from your comment above. So let's get things back as they were and we can proceed from there. [08:10:14] Like force pushing previous HEAD? [08:10:32] It's not worth it. [08:14:22] I think it is more important that we feel good about the code on master than about the commit history. If you truly feel the code is not up to the standards to be on master it should be reverted. [08:14:58] Let's please do that to be consistent. [08:31:14] (CR) QChris: [C: 2] "Taking a closer look, I agree to the revert." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/127599 (owner: Nuria) [08:33:00] * YuviPanda reads interesting backscroll [08:33:25] we like to keep things exciting around here ...... [08:33:35] :D [08:34:29] Personally I like having these in the history too. Git to me has a good mix of malleability of history locally and non-malleability once you have pushed to master (force push has issues) [08:34:56] YuviPanda: Agreed. [08:35:05] I do not like force pushing. [08:35:06] :D [08:35:14] We should turn it off across all of our repos? [08:35:23] it already is by default, IIRC [08:35:24] s/?/!/ [08:35:28] on gerrit, at least [08:35:32] I think? 
[08:35:46] it is also why using --amend with gerrit makes me twitch [08:35:52] a little [08:35:53] at least [08:36:32] Not for the analytics team. Our repos allow force push to us :-( [08:36:40] to master? [08:36:47] nuria: to refs/* [08:36:54] https://gerrit.wikimedia.org/r/#/admin/projects/analytics,access [08:37:05] ah ok. [08:37:06] oh! [08:37:14] yeah, might make sense to turn it off [08:37:45] Probably I should bring that up for discussion on the mailing lists. It has bothered me for ages. [08:38:02] you should, qchris. [08:38:07] are force pushes common? [08:38:18] Luckily enough not. [08:38:25] But we should not even have push permission. [08:38:30] much agree that that should not be enabled [08:38:57] indeed. if I were up for it I'd just disable it and make an announcement :P [08:39:14] * qchris is too timid ... [08:41:17] heh [08:41:30] is 'move eventlogging data into hadoop' on the radar *at all*? [08:41:38] yes it is [08:41:41] so we don't have to write humongous SQL statements with the smallish tools mysql offers? [08:41:45] nuria: aah! how far? [08:41:55] like, in broad terms. Months? Years? [08:41:56] that qchris might know better [08:42:05] Pheee. Good that I need not answer that question. [08:42:07] What? [08:42:11] but "raw" EL data is already going into kafka [08:42:13] No I do not know :-) [08:42:24] from varnish (as a test) [08:42:36] If I were to answer YuviPanda's question, I'd say "Currently not so much on our radar" [08:43:15] Right. [08:43:25] * YuviPanda is building instrumentation into the wiki app that'll probably be a PITA to analyze with just mysql [08:43:41] might as well write a python script that does batch processing. [08:43:46] funneling likely right? 
[08:43:56] oh yeah, *very* accurate funneling [08:44:03] each action as part of a funnel has a UUID [08:44:13] and some funnels are 'referred to' from other funnels and referenced by UUID [08:45:09] would also lend itself to some very nice visualizations, I think [08:45:18] ya, like the ones GA has [08:45:39] yea. [08:45:46] I am naive ... but "very accurate funneling" makes me nervously think about 1984 ... [08:45:57] but in this case ( and again qchris might know better) i do not see how hadoop will help you either, it is very "custom" data extraction [08:46:31] nuria: more with the fact that I can run custom scripts on hadoop and not fight with mysql's 'SQL' [08:47:12] but those scripts will closely resemble the python you are going to write to interact and interpret your data in SQL [08:47:12] * qchris needs breakfast ... catch up with you later :-) [08:47:29] nuria: true, but I bet they'll run faster on a hadoop cluster? [08:47:56] I also am not sure if I can hack mysql into being useful for this, but that just sounds like another major fragile thing [08:48:15] i do not know, i have never used hadoop at the scale that would require [08:48:52] ah, right [08:49:09] should be interesting either way. Plus I'll get to not write Java for a while :) [09:19:50] qchris, since you agree on reveryting can you please revert the README change ? [09:19:57] sorry, *reverting [09:20:28] The revert at https://gerrit.wikimedia.org/r/#/c/127599/ has already been merged. [09:20:37] Or are you referring to a different commit? [09:22:42] ah ok, very well, sorry i had not seen it [10:04:37] * ori will revert the revert. [10:05:20] just kidding; i'm actually sleeping. [10:41:59] zz_yuvipanda: mmm, funnels… [10:51:57] nuria: ping [10:53:44] hola prtksxna [10:53:50] o/ [10:54:19] nuria: Mind discussing - https://gerrit.wikimedia.org/r/#/c/116260/ - here? Might be quicker :) [10:54:54] Sure, you guys tried the "store event and poll" approach right? 
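The funnel instrumentation YuviPanda describes above (a UUID per action, and funnels "referred to" from other funnels by UUID) might look roughly like this. The field names are illustrative only, not the actual EventLogging schema:

```python
import uuid

# Illustrative sketch of per-action UUIDs with cross-funnel references.
# Field names are hypothetical, not the real schema.
def make_event(funnel_id, action, referrer_funnel=None):
    return {
        "event_id": str(uuid.uuid4()),        # unique per action
        "funnel_id": funnel_id,
        "action": action,
        "referrer_funnel": referrer_funnel,   # UUID of the funnel we came from
    }

search_funnel = str(uuid.uuid4())
reading_funnel = str(uuid.uuid4())

events = [
    make_event(search_funnel, "search-open"),
    make_event(search_funnel, "result-click"),
    # the reading funnel is 'referred to' from the search funnel
    make_event(reading_funnel, "page-view", referrer_funnel=search_funnel),
]
```

Joining events on `referrer_funnel` is what would make the GA-style funnel visualizations mentioned above possible.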
[10:55:11] it seems that you did so from the wiki i read [10:55:28] nuria: There were issues with that [10:55:38] nuria: localStorage not being available everywhere etc. [10:56:14] right, it will only work for IE8 and up if I am not mistaken [10:56:48] but that seems preferable to a slower UX (it is of course the call of the PM ultimately) [10:56:48] nuria: right, http://caniuse.com/#search=localstorage [10:57:00] nuria: I am open to trying other approaches [10:57:23] nuria: See, at some places people are making the UX even slower, this mitigates those issues at least [10:59:03] I will defer to the performance team on that as i do not know the wikimedia codebase and cannot speak to changes there [10:59:17] my main concern with this change is that it is not "measurable" [10:59:38] Example: we just reported recently page timings for all pages [11:00:03] the "delay" introduced by this logging is not in any way present on those numbers [11:00:27] So i would be presenting data that says, the 50th percentile in the US is 400ms [11:00:34] for time to glass [11:00:58] but that would not be what the user saw if we introduce arbitrary client side delays. [11:01:21] I know, I am trying to think of an alternative [11:01:34] yes, not trivial [11:01:42] the solutions i see [11:02:00] 1) use local storage +polling thus reporting data for ie8 and up [11:02:00] What I was trying to say was - I saw some places where the timeout wasn't there - and isn't that even worse than this [11:02:10] So giving a solution like this is actually better? [11:02:37] right, that is worse for sure [11:02:58] Now, this code does not solve the issue, just makes it less pronounced, right? 
[11:03:09] You could say that, yes [11:03:24] It's not a *solution*, it's just better than the problem… [11:04:53] and solution 2) will be logging server side completely based on referers of http request [11:05:03] 1 > 2 [11:05:05] on referer hedaers [11:05:10] *headers [11:05:41] also not optimal, i know, cause server side referers might be incomplete, not there ... [11:06:08] nuria: I think it might help if you document all your concerns here too - https://bugzilla.wikimedia.org/52287 - probably CC someone from performance? [11:06:15] there is no good solution, imo [11:06:40] hola! [11:06:42] ori: I "undrafted" the patch… [11:06:56] the beacon proposal from the w3c's web perf group proposes adding Navigator.sendBeacon for exactly this use-case [11:06:58] nuria: hi! [11:07:01] see https://dvcs.w3.org/hg/webperf/raw-file/tip/specs/Beacon/Overview.html [11:07:16] ya, i saw that [11:07:22] that is the solution [11:07:34] "There are other techniques used to ensure that data is submitted. One such technique is to delay the unload in order to submit data by creating an Image element and setting its src attribute within the unload handler. As most user agents will delay the unload to complete the pending image load, data can be submitted during the unload. [...] [11:07:41] Not only do these techniques represent poor coding patterns, some of them are unreliable and also result in the perception of poor page load performance for the next navigation." [11:08:24] not only poor page performance but also "poor page performance you cannot easily detect" [11:08:48] i can make comments to this effect on the bug [11:09:07] but agree with ori, there is no good solution [11:09:09] that'd be cool [11:09:16] nuria: thanks [11:09:27] hi prtksxna, btw [11:09:42] ori: o/ [11:13:05] thank you prtksxna for your fast response to comments. 
[11:13:24] nuria: yw :) [13:35:21] (CR) QChris: "The reverted change is I25c47b0512ea2e80521edbe8c38e70dfd8ceec8e" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/127599 (owner: Nuria) [13:35:39] (CR) QChris: "Reverted in I08bcc2c651a8b5d9bdd8b8be5fe0e122dfc364d6" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/120542 (owner: Nuria) [14:31:24] qchris_meeting: [14:31:25] ah meeting [14:31:30] ping me when you wanna talk about zero logs [14:31:43] ottomata: I am mostly listening there anyways... [14:31:47] So... zero :-) [14:31:56] ok, so i just looked at mobile logs too [14:32:00] and those are not consistently different [14:32:16] the sizes are different, but sometimes kafkatee logs are bigger, sometimes udp2log are bigger [14:32:29] I've just looked at filesizes so far [14:32:46] I looked only at the number of lines. [14:32:55] ok, should be the same either way [14:33:11] kafkatee and udp2log should be the same. Right. [14:33:50] hm, yeah, especially because zero is not sampled, right? [14:33:55] For the zero files ... kafkatee is consistently 10% smaller. [14:34:02] yes. Both are unsampled [14:34:04] i was about to say 'statistically the same' but, zero is unsampled [14:34:13] so they should on the whole have the exact same lines [14:34:19] Yes. Udp2log zero files are unsampled as far as I know. [14:34:25] No, not exact same lines. [14:34:38] Sequence numbers are not really the same. [14:34:38] not exactly in the same files [14:34:42] oh [14:34:44] right, i mean [14:34:47] Neither are the timestamps. [14:34:50] exact same requests. [14:34:57] Exact same requests. Yes. [14:35:08] the timestamps should be the same, no? those are the request timestamps from the varnish logs [14:35:12] (Modulo different start/end dates) [14:35:15] yeah [14:35:31] No. The timestamps of requests do not agree. [14:35:37] I was puzzled about that myself. [14:35:50] But I found counterexamples when diffing the streams. 
[14:37:10] as in [14:37:14] you found what you are sure is the same request [14:37:18] but they have different timestamps? [14:37:31] I thought so ... but it was early in the morning. [14:37:36] Let me double-check [14:37:41] k i will try to find some too [14:39:22] Got something. [14:39:28] On stat1002 in my home directory, run [14:39:33] diff ~/cp-kafka-2014-04-19.sorted-w-time ~/cp-udp2log-2014-04-19.sorted-w-time | grep 002025604 | head -n 2 [14:39:37] ottomata: ^ [14:40:21] The files are from zero tsvs with 20140419 in the filename. [14:40:38] (They have sequence number removed, so they diff better) [14:41:29] hm. [14:42:00] weird, perhaps I am wrong about how timestamp is constructed [14:42:08] i thought for sure it would be stored in the shared memory logs [14:42:23] I am clueless about varnish :-) [14:42:43] Here you go with sequence numbers: [14:42:47] zgrep '^cp3013.*2014-04-18T19:03.*002025604' /a/{squid/archive,log/webrequest}/zero/zero.tsv.log-20140419.gz [14:43:09] Note that the sequence number is decreasing, while the timestamp is increasing. [14:43:17] But it seems to be the same request to me. [14:44:42] seq decreasing timestamp increasing?!?! [14:44:56] Yes :-) [14:45:09] I said it did look strange to me as well, didn't I? [14:45:19] Probably some races somewhere. [14:45:49] Shared memory, delayed writing, ... no clue. [14:46:16] wait, those seqs are from the two different files, right? [14:46:25] the seqs are generated by the loggers, not varnish [14:46:36] Ah. Ok. [14:46:45] i think this is just really weird [14:46:53] because the seqs happen to be 1 off [14:46:54] for the same request [14:46:55] Can they be that close if they are different loggers? [14:46:56] in the different files [14:47:00] they could be, sure [14:47:08] like, if the loggers both started at the same time [14:47:13] Ok. [14:47:16] they all just start counting at zero when they start [14:47:35] cp3013 is one of the new mobiles, right? [14:47:39] Yes. 
[14:47:51] yeah, so maybe puppet ran and just started them up at about the same time [14:48:13] ok, so let's not worry about that, I am a little annoyed that the timestamps are not the same...have you ever seen one where the timestamp is more than a second off between the two? [14:49:02] I didn't check, as it was just by accident that I noticed that they are different at all. [14:49:10] you know, Snaps might know! :) [14:49:11] We didn't vet kafkatee data :-) [14:49:19] Yes. [14:49:19] Snaps: , yt? [14:49:27] hi theres [14:49:34] the varnishncsa docs say %t is Time when the request was received [14:49:38] hey [14:49:38] so [14:49:40] we are just curious [14:49:46] we are comparing udp2log and kafkatee log lines [14:49:52] we've found a few lines [14:49:56] that we are sure are the exact same request [14:50:02] but have different timestamps [14:50:05] request timestamps [14:50:11] only 1 second different [14:50:12] but still [14:50:26] i had thought that the request timestamp was being stored in the varnish shared memory logs [14:50:37] along with the rest of the request data [14:50:59] so all of the request content would be the same independent of what was reading the shared memory logs [14:52:00] which one is earlier? [14:52:04] if consistent [14:54:41] udp2log is earlier [14:54:46] but that is only one request we are looking at [14:54:55] vk will read the date from RxHeader tag first, and if that one isn't found, ReqEnd. [14:55:06] problem is it only matches RxHeader from BACKEND, but that's a CLIENT tag. [14:55:08] So it's a bug. [14:55:38] And as ReqEnd tag is the last tag of a connection it will have a later timestamp, sometimes rolling over into the next second. [14:56:37] (I checked half a dozen corresponding requests with different timestamps, and udp2log was consistently earlier) [14:56:55] https://github.com/wikimedia/varnishkafka/blob/master/varnishkafka.c#L924 <-- that should be VSL_S_CLIENT instead of VSL_S_BACKEND. [14:58:11] oh, ok, so a bug in vk? 
:) [14:58:15] yepyep! [15:00:03] huhm, but varnishncsa does the same thing. huhm huhm [15:03:14] hm [15:03:56] ah, varnishncsa overwrites previously set values, while vk doesn't. So varnishncsa will always use the timestamp from the ReqEnd tag, while vk will use the "Date: " header. [15:04:00] that's my take [15:05:11] ah hm [15:05:18] where does Date header come from? [15:05:19] client/ [15:05:19] ? [15:05:22] originating client? [15:07:58] probably not since that value is really of no interest (it might be bogus) [15:08:50] maybe the date: header is rewritten by the cache frontend [15:08:54] need to check the sauce [15:11:35] hm [15:11:58] ok, well, if you think varnishkafka is doing the right thing, and we can explain the differences, then I'm not too worried about it [15:12:06] the second difference isn't a problem really [15:13:31] even though the Date: header is probably supplied by varnish I would trust the ReqEnd tag more (partly because it doesn't require parsing a date) [15:14:30] but since there doesn't seem to be any larger time lapses between vk and vncsa, I don't think it matters. [15:17:45] I don't know if standard browsers send the Date: header in requests [15:17:47] yeah, i mean, we haven't seen any, we just kinda noticed this one [15:17:49] ok [15:17:52] i think it's fine then [15:18:21] so, we're looking at this, because since a little less than a month ago, the zero logs that are generated by kafkatee contain fewer requests than the ones from udp2log...not sure why yet, need to investigate more [15:18:22] but it is strange [15:18:30] seems to be a specific date when that started too [15:20:21] huhm, ok. fewer by how much? [15:22:20] 10% [15:22:42] ouch [15:23:30] is this verified by another consumer than kafkatee? 
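Snaps' diagnosis above (RxHeader is matched against the backend tags even though it arrives as a client tag, so vk falls through to the later ReqEnd timestamp) can be modelled with a toy tag reader. This is an illustration of the described behaviour, not varnishkafka's actual code, and the tag values are made up:

```python
# Toy model of the described bug: prefer the RxHeader "Date" value, fall
# back to ReqEnd. Matching RxHeader against the wrong source means the
# fallback (which fires last, one second later here) always wins.
CLIENT, BACKEND = "client", "backend"

shm_log = [
    (CLIENT, "RxHeader", "Date: 2014-04-18T19:03:01"),  # what should match
    (CLIENT, "ReqEnd",   "2014-04-18T19:03:02"),        # last tag, later time
]

def pick_timestamp(log, rxheader_source):
    ts = None
    for source, tag, value in log:
        if tag == "RxHeader" and source == rxheader_source and ts is None:
            ts = value.split(": ", 1)[1]
        if tag == "ReqEnd" and ts is None:
            ts = value
    return ts

buggy = pick_timestamp(shm_log, BACKEND)  # misses RxHeader, uses ReqEnd
fixed = pick_timestamp(shm_log, CLIENT)   # uses the Date header
```

This reproduces the one-second skew seen in the zgrep output: the ReqEnd-based timestamp can roll over into the next second.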
[15:50:57] no, all we have seen so far is differences in line counts in the archived files that qchris was going to use [15:51:01] i'm checking into it now [16:05:54] ok, partly good news Snaps, qchris_away [16:06:02] i just checked webrequest_mobile imports in hdfs [16:06:09] for a line that was in both udp2log and in kafkatee [16:06:11] it was there [16:06:19] i also checked for a line that was only in udp2log, but not in kafkatee [16:06:22] and it was in hdfs [16:06:27] so, that means that the log was actually in kafka [16:17:54] qchris_away: i see that the kafkatee has only been running since march 26th [16:18:00] the kafkatee process [17:04:29] ottomata: If kafkatee has only been running since march 26th ... where are all the files in /a/log/webrequest/mobile/ (on stat1002) from before march 26th coming from? [17:05:16] oh, i mean [17:05:21] it was restarted then [17:05:26] (Glad to read that the messages made it into hdfs, and are just missing from the tsvs) [17:05:27] that's the last time it was started [17:05:31] Oh. [17:05:32] Ok. [17:06:32] yeah not sure [17:06:40] kafkatee version hasn't changed since feb 20 [17:07:02] Mhmm. Interesting. [17:19:24] lunchtime, back in a bit [18:08:37] hi Ironholds. [18:08:48] hey YuviPanda :) [18:08:57] Ironholds: I've implemented the session tracking schema. Some data should start flowing in when we push out the alpha today. [18:09:02] there's already some data from my testing. [18:09:09] cool! [18:09:24] Ironholds: I'll check to make sure that our EL infra can handle that, though [18:09:29] * YuviPanda pokes ori [18:09:58] ori: that's an Event for every link clicked, for mobile apps (which are a few % of traffic). Now that I write that out that sounds a bit scary in terms of network overhead and also our servers. [18:11:38] 20/(20/100) [18:11:59] how much is that? 4%? [18:12:07] yeah. 
[18:12:18] that or my math is so wrong [18:12:22] either can be true [18:12:33] I suck at math [18:12:39] what do you think I am, a statistically-oriented programmer?! [18:12:49] 20% of 20% is 4%? [18:13:02] 10% of 20% is 2% so 20% of 20% is 4%? [18:13:04] I want to say 'yes' [18:13:09] or am I in a post dinner haze [18:13:10] but, you know, we could just get a calculator [18:13:20] or a REPL for a statistical language, like the one I have open [18:13:26] yes! [18:13:32] I typed into google 0.2 * 0.2 [18:13:35] and it gave me 0.04 [18:13:48] yes, 4 percent [18:13:50] * YuviPanda likes decimals [18:13:55] oh my god we're terrible. [18:14:03] people PAY US to DO STUFF? [18:14:17] yeah, but 20 / ( 20 / 100 ) is like 20 / (0.2) = 100 [18:15:11] Ironholds: either way, I'll build in a kill switch. I realized I could stagger scroll events by 10s or more, but then I'm unsure how much the servers can take [18:15:21] eh. Scroll events are a bonus. [18:15:24] Ironholds: I can also build in a ramp up mechanism, where we start with 10% of users and ramp up [18:18:21] YuviPanda: what's the volume, expressed as average events / sec? [18:18:41] Ironholds: ^ can you give a rough answer based on WikipediaMobile UAs? [18:19:56] Ah...no. [18:20:05] I mean, I can tell you an estimated volume for just clicks [18:20:11] that's about it ;). [18:20:11] Ironholds: yeah, that should be fine [18:20:22] sure, vun moment while I pull up zee logs. [18:20:40] Ironholds: we could even just take our article page view rate and multiply by 0.04 [18:20:43] oh, bugger, no I can't, I only stored percentages. [18:20:55] okay, I'm pretty sure I have a serialised file around here that has the raw data. Wait one. [18:21:04] can you estimate an order of magnitude? 1/sec, 10/sec, 100/sec, 1000/sec? [18:21:17] "less than the last one, more than the first one" [18:21:24] I'll have a precise estimate in about 10 minutes, if that works? [18:21:26] It's a really big file. 
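As the back-and-forth above settles, "20% of 20%" is a product of fractions; the `20/(20/100)` expression typed earlier is a division and comes out to 100, which is why it looked off:

```python
# 20% of 20% of traffic is a product of fractions, not a division.
share = 0.20 * 0.20          # 20% of 20%
print(round(share, 2))       # 0.04, i.e. 4% of traffic

wrong = 20 / (20 / 100)      # the mistyped version from the chat
print(round(wrong))          # 100 -- not a percentage at all
```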
[18:21:43] sure yeah [18:21:50] Snaps: you still around? [18:24:40] qchris: for a second I thought maybe we were losing some logs during logrotate [18:24:50] but the losses look fairly evenly spread throughout the hour I am looking at [18:25:03] ottomata: 10% due to logrotate :-D [18:25:11] yeah, too much, i know [18:25:21] but i saw some messages about X number of messages in queue during a logrotate [18:25:22] in syslog [18:25:44] It didn't look like we're only losing around 6:30. [18:25:45] but, i don't think it should matter, kafkatee should be pretty resilient, and yeah, that's not the pattern of the losses anyway [18:25:50] I had some hits for 19:30. [18:25:51] yeah [18:26:13] But logrotate would be a nice explanation. [18:26:17] Yes. [18:26:22] yeah, i'm looking at 2014-04-19 07:00 - 08:00 for cp1046 [18:26:31] and I see losses throughout the hour [18:26:46] (files in /a/otto/kafkatee-compare, btw) [18:27:28] i'm thinking about restarting kafkatee just to see if this goes away, even though I doubt it will...i'm not really sure where else to check [18:27:30] I had a look at cp301[34] (files in /home/qchris/cp*) [18:27:52] Restarting is certainly worth a try. [18:28:03] It came with a restart ... maybe it will go with a restart as well. [18:28:21] haha, i'll sprinkle some magic dust on it before I restart it [18:28:26] restarted. [18:28:33] * qchris keeps fingers crossed. [18:29:08] What strikes me, is that it's really so close to 10%. [18:29:36] Did we turn on some (maybe unrelated) sampling before the previous restart? [18:30:31] not that I could tell, no [18:30:34] but possible [18:30:52] the files in /etc/kafkatee.d have not changed since Feb 27th at the latest [18:31:48] guess I should get some kafkatee monitoring up soon eh? [18:31:54] :-) [18:32:04] there is a json stats file similar to what varnishkafka has [18:32:24] Where does this file live? [18:33:35] analytics1003:/var/cache/kafkatee/kafkatee.stats.json [18:34:09] Meh. Does not let me in. 
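The comparison qchris and ottomata are doing above (strip the per-logger sequence-number field so the udp2log and kafkatee streams diff on content only, then measure what is missing) can be sketched like this. The tab-separated field layout is an assumption for illustration, not the real zero tsv format:

```python
# Sketch: drop the sequence-number field (loggers count independently,
# as noted above), then measure the fraction of udp2log lines absent
# from the kafkatee output. Field positions are assumptions.
def strip_seq(line, seq_field=1):
    fields = line.rstrip("\n").split("\t")
    del fields[seq_field]            # sequence numbers differ per logger
    return "\t".join(fields)

def loss_rate(udp2log_lines, kafkatee_lines):
    udp = {strip_seq(l) for l in udp2log_lines}
    kt = {strip_seq(l) for l in kafkatee_lines}
    return len(udp - kt) / len(udp)

# toy streams: same requests, different sequence numbers, one line lost
udp2log = [f"cp3013\t{100 + i}\tGET /page{i}" for i in range(10)]
kafkatee = [f"cp3013\t{200 + i}\tGET /page{i}" for i in range(10) if i != 3]
print(loss_rate(udp2log, kafkatee))   # 0.1, i.e. the 10% gap being discussed
```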
[18:34:28] Well. Not so important. If the file does not explain the problem to you, [18:34:37] it certainly does not explain it to me either. [18:36:46] qchris: i copied it to stat1002 in /a/otto/kafkatee-compare/kafkatee.stats.json [18:36:53] Thanks. [18:39:06] there are 10 total partitions [18:39:18] maybe one of them is not consuming properly? or maybe one wasn't? [18:39:26] And 10 worker nodes IIRC. [18:39:29] and then kafkatee decided not to consume from it anymore? [18:39:32] hm, no [18:39:35] there are 10 datanodes [18:39:47] but those are not connected to kafkatee at all [18:39:52] hadoop is totally separate from this flow [18:40:02] Yes, you're right. [18:40:23] If kafkatee would ignore a partition, that would explain it. [18:41:00] whoa [18:41:07] kafkatee isn't writing offsets!~ [18:41:15] hmmm [18:41:51] -rw-r--r-- 1 kafkatee kafkatee 11 Feb 26 19:53 /var/cache/kafkatee/offsets/webrequest_mobile-0.offset [18:41:54] all Feb 26! [18:42:07] No clue :-D [18:47:05] Snaps: come back to us! :) [18:50:48] okay, test done [18:51:09] er. crap. [18:51:18] Probably a million events a day, just from the ones we're tracking server-side [18:51:24] (assuming the 1:1000 random sampling is truly random) [18:51:33] YuviPanda, y'might want to go for a subset ;p [18:51:46] heh [18:51:55] like, 5 percent of people, tops. [18:52:31] Ironholds: right. I could send the token as a header but I guess that has other issues. [18:54:42] to just allow requestlogs tracking? [18:55:21] Ironholds: yeah? [18:55:48] Ironholds: so I could send an extra header with the same token we are using here, but then we'll have to have some serverside code to capture that [18:55:57] Ironholds: and if we do that we can decide about sampling on the server itself [18:57:28] I would recommend against that. 
[18:57:48] because one, serverside code, and two, the sampling is per-request not per-token [18:57:50] 1m / day is ok [18:57:54] so you wouldn't actually be able to eke out any session data [18:58:00] ori, yeah, tyop. I meant an hour. [18:58:09] heh [18:58:30] how is that possible? [18:58:42] mean(monthly_sample_by_day)/24 turns out at around 11,000 requests from apps [18:58:52] and I'm assuming that the 1:1000 sampling is, well, at least pseudo-random [18:59:10] (it's probably not, but I have no way of checking) [19:00:41] Ironholds: yeah, that's consistent with what yuvi was saying earlier [19:03:46] Ironholds: wait, is that with or without 1:1000 sampling? [19:04:36] YuviPanda: poke [19:04:44] * YuviPanda peeks head out [19:05:10] I think the 1m/day is without sampling? [19:05:20] were you intending to sample? [19:05:28] 1m/day has to be without [19:05:31] I wasn't, but it's trivial to do. [19:05:37] yeah, has to be without. [19:05:52] what's the schema? [19:06:59] YuviPanda: ^ [19:07:10] ori: looking [19:07:20] ori: https://meta.wikimedia.org/wiki/Schema:MobileWikiAppReadingAction [19:07:39] I would feel slightly better just setting it as a header, but yeah, hard to do 'per token' sampling server side [19:08:10] heh, the CSS for the code sample is screwy [19:08:23] hah! just noticed it [19:12:56] ori: what do you think? I am going to build in a way for us to remotely tweak the sampling rate (which will be 'per token') [19:13:40] YuviPanda: it's a bit odd in that you're basically tacking mobile app requests via a surrogate request [19:13:54] tracking, even [19:13:55] ori: indeed, which is why I just want to send in an extra header. [19:15:29] yeah, that would be best. but what question(s) are you trying to answer w/that data? [19:16:58] ori: So The Question The Product Manager Wants Answered Is "What is the average session time for interaction with the app"? [19:20:06] YuviPanda: Activity.onPause handler? 
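For ori's events-per-second question above, the conversion is mechanical once a count, a window, and a sampling ratio are assumed. The figures below are the rough ones from the chat, not measurements, and the two estimates disagree with each other, which is exactly the inconsistency ori flags with "how is that possible?":

```python
def events_per_second(count, hours=1, sampling_ratio=1):
    """Scale a sampled event count back up and normalise to events/second."""
    return count * sampling_ratio / (hours * 3600)

# "a million events an hour", taken at face value:
per_sec = events_per_second(1_000_000)                    # ~278/sec
# the 11,000 requests/hour figure, if it were a 1:1000 sample:
scaled = events_per_second(11_000, sampling_ratio=1000)   # ~3056/sec
```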
[19:21:01] ori: yeah, so that was the original plan (not onPause since that is triggered way too much, but similar) [19:21:15] but? [19:21:26] ori: we'd basically have to do a 'if the user has not been active in X minutes, session has ended' [19:21:43] ori: since onPause can be called in so many different ways (tap 'Random', onPause gets called, for example) and hence is useless. [19:22:14] ori: and he suggested just logging them serverside and determining 'X' empirically [19:26:09] YuviPanda: why don't you have an onPause handler that just updates a lastAction timestamp [19:26:24] YuviPanda: and then on application *start*, send the record of the *previous* session [19:26:36] ori: right, so 'application start' is also an ambiguous concept in Android. [19:26:43] ori: processes linger for a fair amount of time sometimes. [19:26:56] ori: so I've to have a timeout. [19:27:12] ori: the problem is when to reset lastAction and send out an event. [19:28:18] YuviPanda: onWindowFocusChanged? [19:29:20] ori: all of the activity methods are kinda useless or contorted, since the app has multiple activities. There can be multiple instances of each activity as well. [19:29:22] YuviPanda: i mean, i don't see how logging all actions evades this problem [19:29:31] Ironholds: ^ [19:30:04] ori: it's so we can, serverside, just say 'x% sessions stopped sending events after 10m of last activity, y% after 2m, z% after 30s' etc? [19:30:23] ori: http://blog.ironholds.org/the-myth-of-the-30-minute-session-timeout/ is what Ironholds pointed me at [19:30:34] how do you determine when a session stopped sending events? [19:31:09] ori: so currently I send events on every link clicked. Can easily make that every time someone finishes scrolling, or somesuch. The Java code that logs this is literally called logSomethingHappened(); [19:31:28] so client keeps sending events when things happen, and then we analyse later serverside. 
[19:31:49] yeah, but suppose i generate a bunch of events, then i go do something else, and then two hours later i use the app again [19:32:24] how do you segment the data server-side into discrete sessions? [19:32:40] you need to specify that any gap longer than N minutes indicates the end of a session [19:32:45] but that's the very thing you are setting out to determine [19:32:55] Ironholds: ^ [19:33:17] I assumed there's going to be some sort of clustering algorithm that lets me derive it, but tbh I hadn't fully thought about that. [19:33:40] I think Ironholds can answer that better than me at this point. [19:34:06] maybe you look at the distribution of the deltas between events [19:34:37] indeed, that's the naive thing that was in my head [19:35:02] 'A% has X min gaps, B% has Y min gaps, C% has Z min gaps between clusters' [19:35:30] ori: from what I remember Ironholds is also working on detecting session times from our regular pageview data using similar algorithms [19:36:27] i'm not persuaded it's a problem, really. so Ironholds's point that the 30m figure is basically an outdated convention seems valid, but [19:37:01] remember that other things changed since 1994, namely what sort of things count as activity [19:38:09] back then it was basically page requests from the server; you didn't know when the user was scrolling or anything like that [19:38:25] right. [19:38:37] theoretically in the app I could detect scroll stops and other things as well [19:38:47] so if you have a 10,000 word essay it's plausible that the user will just sit there and read the whole thing and then continue on to something else, all in the span of one session [19:39:05] right. [19:39:17] and tracking these by just links will miss those. 
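The server-side segmentation ori describes above (any gap longer than N minutes starts a new session) is simple once N is fixed; a minimal sketch, with the threshold as a parameter since N is the very thing under debate:

```python
def sessionize(timestamps, threshold):
    """Group one user's event timestamps (seconds) into sessions:
    a gap longer than `threshold` starts a new session."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= threshold:
            sessions[-1].append(t)    # continues the current session
        else:
            sessions.append([t])      # gap too large: new session
    return sessions

# three quick events, then a two-hour gap, then two more
print(sessionize([0, 30, 70, 7300, 7320], threshold=1800))
# [[0, 30, 70], [7300, 7320]]
```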
we need scroll for that [19:39:22] but with mobile it's really hard to have a session that is active and generates no activity [19:39:49] like, you have to do something, or something else will steal focus or the phone / tablet will suspend [19:40:02] right [19:40:44] so if it comes to it I can even do raw touches (with a timeout so I don't send 5000 in a sec) [19:40:50] and with sampling too [19:40:56] and I think that'll be a very accurate measurement [19:41:05] but I don't know at what point people will get creeped out [19:41:25] but i'm saying, you don't actually need to log every one if you determine the number [19:41:57] once you determine that anything longer than N minutes of inactivity indicates a new session [19:43:06] ori: right. but then if I am only tracking 'page navigation', then someone reading a large article would count as inactive. [19:43:07] then you can just have: onAppEvent() { if (currentTime - lastEventTime > sessionThreshold) { reportSessionTime(); currentSessionStart = currentTime; } } [19:43:14] right [19:43:20] ori: that's what I meant by 'with a timeout' [19:43:28] ori: should've said 'batched' [19:44:36] so let's separate two questions here [19:44:41] ori: ah, hmm. so if I set that to say 1m, then clicking new links will also be sent at most once per minute. [19:44:46] ok! [19:44:49] the primary question is "how long do sessions last" [19:45:09] to answer that, we need "how do you know a session has ended" [19:45:21] indeed. [19:45:43] we already know that the answer to the second question will have the form "no activity has been generated in N minutes" [19:46:31] right [19:46:53] but we need to determine what N is, either by reasoning about mobile devices to our satisfaction or by looking at the distribution of currentEventTimeStamp - previousEventTimestamp [19:47:14] I'm back. 
ottomata [19:47:22] even if we do the latter (collect data to determine N), i think it's a logically separate data collection job [19:47:33] and you'd only need to run it for like, a day [19:47:43] aah, hmm [19:47:44] right [19:48:05] so we can kinda sample it waay low (0.05%?) and run it for a very short period of time, and then use that to tweak the N? [19:48:16] and then you'll just know N and you could use the timeout logic to compute session duration on the client and report that [19:48:35] yeah. is it easy to push updates? [19:49:39] ori: I've a 'RemoteConfig' switch that lets me push a JSON file to the app very easily. I use it for config switches that change over time and need to be changed without the hassle of pushing out an entire update [19:50:03] ori: even otherwise, pushing an android update is pretty easy. iOS is a PITA tho [19:50:28] cool, yeah, so you don't even need to sample based on token / device id, it can be totally anonymous [19:50:57] just do: onUserActivity() { if (randomChance()) { reportTimeDeltaFromLastEvent(); } } [19:51:58] ah. [19:52:00] * YuviPanda thinks [19:52:28] so we can see a distribution, and then figure out N from that [19:52:42] scrolling up [19:53:06] ori, it's not about when sessions stopped sending events, it's about inter-time periods. [19:53:25] So, I have the inter-time periods for, say, 30,000 users; the time elapsed between each pair of events for each user. [19:53:49] if I aggregate those, I expect to see a dip - a point where if users haven't taken any actions by N number of seconds, they're not going to again for an extended period of time [19:53:53] and that's the cutoff [19:54:03] "expectation maximisation" is halfak's name for it (I dunno who else uses it) [19:54:40] wait, you want to compute this per-user? [19:55:17] oh god no, that'd be horrifyingly intensive [19:55:36] so how is that different from what i'm suggesting?
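[Editor's note] The server-side half of this proposal — aggregate the sampled inter-event deltas and look for the dip Ironholds describes — might look like the sketch below. Minute-wide buckets are an arbitrary choice for illustration, and the names are invented here:

```java
import java.util.List;
import java.util.TreeMap;

// Sketch: histogram sampled inter-event deltas by whole minutes of
// inactivity, so the "dip" in the distribution can be read off and
// used to pick the session cutoff N.
class DeltaHistogram {
    static TreeMap<Long, Integer> byMinute(List<Long> deltasMs) {
        TreeMap<Long, Integer> counts = new TreeMap<>();
        for (long d : deltasMs) {
            long bucket = d / 60000;            // minutes of inactivity
            counts.merge(bucket, 1, Integer::sum);
        }
        return counts;
    }
}
```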
[19:55:38] there is an excellent paper I was reading that recommended that, but halfak and I talked about it and agreed the justification was not necessary for our datasets [19:55:46] okay, I should read up more in more detail ;p [19:56:12] Bah. Got pinged. Reading scrollback. [19:56:47] ori, so, if I'm reading right, you're advocating that once we've determined 'session length', or average session length, we only need the aggregate count of actions in a session? [19:57:20] I agree; that's actually what the original EL schema would've done ("Count the number of events before N seconds of inactivity"). This is just working out N. [19:57:25] no, i'm saying, once you've determined session length, you can determine on the device itself [19:57:48] when a new session has started, by comparing the current time to the time a previous event has occurred (where 'event' here means something only stored locally) [19:58:15] milimetric: thing I didn't want to ask in the meeting: Can I go ahead and use simplemediawiki (+ kitchen) to interface with the MediaWiki API, or are there other plans? [19:58:22] so you can log session duration by basically reporting the previous session duration every time the delta between the current and last event is greater than the configured session delta [19:58:49] where the configured session delta is..~? [20:00:51] I saw that in requirements.txt awight, and I have no problem with it [20:00:55] Ironholds: by running an experiment for a day or so and figuring out the delta from the data collected from there? [20:01:54] the only regret I have, awight, is not submitting pull requests to kitchen :) [20:04:29] milimetric: lol is it that bad? We're pretty much insulated by simplemediawiki, fortunately. 
[20:04:54] YuviPanda, yeah, that's what I'm suggesting with this schema [20:04:56] awight: eh, not horrible, but utils.py is getting big-ish [20:07:12] Ironholds, YuviPanda: here's what i have in mind, in pseudo-java: https://dpaste.de/tM30 [20:07:53] * YuviPanda reads [20:07:58] um, setPreviousEventTime( currentEventTime ) should actually be moved from line 33 to line 5 [20:08:06] but otherwise that's the idea [20:08:28] YuviPanda, Ironholds: updated paste: https://dpaste.de/jcig [20:08:42] yeah, that sounds very doable. [20:08:59] and much much less hard on the EL infrastructure. [20:10:16] so, I'm reading that as "compute a breakpoint and, once that's been computed, use that as a cutoff and just provide aggregate data"? [20:10:39] I think so. [20:10:41] yes, pretty much. [20:10:48] that's pretty much what was proposed, but has the advantage of not requiring separate analysis or multiple schemas over time. [20:10:50] I like it :D [20:10:53] milimetric: oic, yes wikimetrics/utils.py should wither under the intense sunshine :D [20:11:00] :) [20:11:12] and you guys can determine when the breakpoint has been established to your satisfaction [20:11:14] indeed [20:11:19] at which point YuviPanda would specify it in a config update [20:11:45] and the logging would switch to the else branch of the if(sessionTimeout==null) [20:12:43] right. I can also make it slightly more configurable, so we collect data for a couple of days or so, and then spend some time analyzing them to figure out the breakpoint and then set it in config.
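[Editor's note] The dpaste links above are no longer available, but the two-phase behavior described — calibration while sessionTimeout is unset, switching to duration reporting once the breakpoint is pushed via config — can be reconstructed as a sketch. Every identifier here is a guess at the paste's structure, not the actual code:

```java
// Sketch: calibration mode reports raw inter-event deltas; once a session
// timeout arrives via remote config, the else branch takes over and whole
// session durations are reported instead. Reports are returned as strings
// here purely to make the sketch testable.
class SessionLogger {
    private Long sessionTimeout;        // null until pushed via remote config
    private long previousEventTime = -1;
    private long sessionStart = -1;

    void setSessionTimeout(Long timeoutMs) { this.sessionTimeout = timeoutMs; }

    String onEvent(long now) {
        String report = null;
        if (previousEventTime >= 0) {
            long delta = now - previousEventTime;
            if (sessionTimeout == null) {
                report = "delta:" + delta;  // calibration mode: raw deltas
            } else if (delta > sessionTimeout) {
                report = "session:" + (previousEventTime - sessionStart);
                sessionStart = now;         // new session begins
            }
        } else {
            sessionStart = now;             // very first event opens a session
        }
        previousEventTime = now;            // per ori's fix: update every event
        return report;
    }
}
```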
[20:12:53] this way we can occasionally re-calibrate and verify our breakpoint too [20:12:56] but yeah, pretty much the same [20:12:56] halfak is doing some thinking, so no implementing ;) [20:13:00] one small problem that i didn't consider is that using simple random sampling of all user actions will be biased in favor of highly active users [20:13:31] we should perhaps do sampling per-user rather than per event [20:13:35] yeah [20:13:39] you do the randomChance() once and then save it and just stick to it [20:13:45] right [20:15:12] Ironholds: ori the current master already pings EL every page view (i.e. current schema). An alpha might go out tomorrow, but that's just about 20-30 people so it's ok I guess. [20:16:18] yeah, provided the mobile alpha isn't featured on the front page of the new york times or something [20:17:58] yeah, that's the beta in about another week. Obama is going to mention it at some point on twitter, I think. [20:18:49] (kidding) [20:19:28] obama uses a blackberry ;) [20:20:22] hahah! :D [20:20:30] ori: we had(have?) a blackberry app in some store somewhere... [20:22:03] okay, we return [20:22:30] so, conclusion; ori's solution looks absolutely perfect if we can be sure of capturing the endpoint (presumably when the app dies, or when it's restarted if it doesn't die? Me no implementation person) [20:22:38] Ironholds: tho see point above about sampling per-user [20:22:51] *nods* [20:23:03] I read it as per-user sampling to compute per-user breakpoints, though [20:23:05] did I read wrong? [20:23:16] (not to compute global breakpoints) [20:24:01] I guess that would mean the unique client side ID is still required [20:24:07] and sampling has to be per user [20:24:20] it's to compute the global breakpoint [20:24:28] aha [20:24:32] then yeah, that'd be a problem :( [20:24:43] what's the problem? [20:25:36] just the biasing-towards-highly-active. But I don't think it's a serious one.
Unfortunately my brain can't generate any good reasons as to why it is/is not a problem, it's just a hunch, and hunches are useless. [20:26:03] but it won't be biased, since we'll determine whether or not to log on a per-user basis rather than per-event [20:26:17] ahh, I got my wires crossed, I think. [20:26:27] ignore me, then. That sounds awesome; thank you for coming up with the solution :) [20:27:07] * YuviPanda cuts Ironholds's red wire [20:27:19] dude, blue. you cut the blue wire. [20:27:24] cool cool [20:27:25] ori: Ironholds: think either of you can document this somewhere so the iOS guys can implement it as well? [20:27:32] i'll amend the pseudojava [20:28:12] :) [20:28:14] actually can't, have to run [20:28:24] ok! [20:28:28] ori: thank you! :) [20:28:34] i was going to change if (randomChance()) to if (userInBucket()) [20:28:47] bye [22:25:24] hey, anybody around with kafka experience? [22:25:56] I'm wondering how it has worked for you so far [22:37:41] gwicke: I guess ottomata would be the only one with real experience. [22:38:15] Plain analytics-devs haven't touched kafka much. [22:38:16] qchris, thanks! [22:39:03] will ping him when he's around
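[Editor's note] The userInBucket() change ori mentions — deciding once per user whether to sample, rather than flipping a coin per event — could be sketched like this, assuming a stable, locally stored token. The token name and the rate encoding are illustrative assumptions:

```java
// Sketch: deterministic per-user sampling. Hashing a stable per-install
// token against a sampling rate gives the same in/out decision on every
// event, avoiding the bias toward highly active users that per-event
// random sampling introduces.
class SamplingBucket {
    static boolean userInBucket(String deviceToken, double rate) {
        // hashCode() can be negative; mask off the sign bit rather than
        // use Math.abs (which fails for Integer.MIN_VALUE).
        int h = deviceToken.hashCode() & 0x7fffffff;
        return (h % 10000) < (int) (rate * 10000); // e.g. rate=0.0005 for 0.05%
    }
}
```

A real implementation might prefer a better-distributed hash and a random token generated at install time, so the decision is stable but carries no identifying information off-device.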