[07:39:21] (PS11) Nuria: Adding coding guidelines to README.md file [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/120542 [07:47:37] (PS12) Nuria: Adding coding guidelines to README.md file [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/120542 [07:50:30] (CR) Nuria: [C: 2] Adding coding guidelines to README.md file [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/120542 (owner: Nuria) [07:56:58] nuria: :-((( [07:57:24] but wait, isn't that what you asked me to do in your last e-mail? (see thread) [07:57:30] To self-merge? [07:57:40] No. [07:58:21] And I do not see why you have to resolve merge conflicts. Just [07:58:21] >checkout (in the sense of “git checkout") the README.md from your [07:58:21] >latest patch set onto master, and your done." [07:58:43] That does not say anything to by-pass review. Does it? [08:00:01] no, certainly not, it says to merge on top of master, but feel free to revert if merging was not what you meant. [08:00:36] No it does not say to "merge on top of master". Or at least I cannot find the word "merge" in the quote. Where do you see it? [08:01:30] ok, let me unmerge the changes. [08:01:38] No. It's not worth it. [08:02:05] But I really dislike "misreading" emails and then self-merging, only because we cannot bring a commit through code-review. [08:03:15] (CR) QChris: "Self merging controversial things is bad :-(" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/120542 (owner: Nuria) [08:04:00] Heck the Patch Set even comes with trailing spaces again :-D [08:05:11] (PS1) Nuria: Reverting last commit to master of README.md [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/127599 [08:05:36] I shall self-merge the revert and things will be back as they were. [08:05:41] NO. [08:05:47] s/NO/No/ [08:05:56] No need to overreact. [08:06:07] The self-merge was bad from my point of view. [08:06:20] I'd CR-1 the change you self-merged. 
[08:06:23] Ok, i do not understand, don't you want to CR the changes again before they can be merged? [08:06:30] But it's not worth the energy to roll back. [08:06:38] it's 1 click [08:06:47] no energy needed really [08:06:53] But the commit history would get more complicated. [08:07:13] And it's harder to understand why we committed, rolled-back, and shortly afterwards committed again. [08:07:40] I've made my point that I didn't like the self merge. But I am fine to leave things as is. [08:08:27] (CR) QChris: "Not necessary to revert from my point of view." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/127599 (owner: Nuria) [08:09:10] I actually prefer that we do, i had no intention to merge if that was not what was intended. [08:09:34] from your comment above. So let's get things back as they were and we can proceed from there. [08:10:14] Like force pushing previous HEAD? [08:10:32] It's not worth it. [08:14:22] I think it is more important that we feel good about the code on master than about the commit history. If you truly feel the code is not up to the standards to be on master it should be reverted. [08:14:58] Let's please do that to be consistent. [08:31:14] (CR) QChris: [C: 2] "Taking a closer look, I agree to the revert." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/127599 (owner: Nuria) [08:33:00] * YuviPanda reads interesting backscroll [08:33:25] we like to keep things exciting around here ...... [08:33:35] :D [08:34:29] Personally I like having these in the history too. Git to me has a good mix of malleability of history locally and non-malleability once you have pushed to master (force push has issues) [08:34:56] YuviPanda: Agreed. [08:35:05] I do not like force pushing. [08:35:06] :D [08:35:14] We should turn it off across all of our repos? [08:35:23] it already is by default, IIRC [08:35:24] s/?/!/ [08:35:28] on gerrit, at least [08:35:32] I think? 
[08:35:46] it is also why using --amend with gerrit makes me twitch [08:35:52] a little [08:35:53] at least [08:36:32] Not for the analytics team. Our repos allow force push to us :-( [08:36:40] to master? [08:36:47] nuria: to refs/* [08:36:54] https://gerrit.wikimedia.org/r/#/admin/projects/analytics,access [08:37:05] ah ok. [08:37:06] oh! [08:37:14] yeah, might make sense to turn it off [08:37:45] Probably I should bring that up for discussion on the mailing lists. It has bothered me for ages. [08:38:02] you should, qchris. [08:38:07] are force pushes common? [08:38:18] Luckily enough not. [08:38:25] But we should not even have push permission. [08:38:30] much agree that that should not be enabled [08:38:57] indeed. if I were up for it I'd just disable it and make an announcement :P [08:39:14] * qchris is too timid ... [08:41:17] heh [08:41:30] is 'move eventlogging data into hadoop' on the radar *at all*? [08:41:38] yes it is [08:41:41] so we don't have to write humongous SQL statements with the smallish tools mysql offers? [08:41:45] nuria: aah! how far? [08:41:55] like, in broad terms. Months? Years? [08:41:56] that qchris might know better [08:42:05] Pheee. Good that I need not answer that question. [08:42:07] What? [08:42:11] but "raw" EL data is already going into kafka [08:42:13] No I do not know :-) [08:42:24] from varnish (as a test) [08:42:36] If I were to answer YuviPanda's question, I'd say "Currently not so much on our radar" [08:43:15] Right. [08:43:25] * YuviPanda is building instrumentation into the wiki app that'll probably be a PITA to analyze with just mysql [08:43:41] might as well write a python script that does batch processing. [08:43:46] funneling likely right? 
[08:43:56] oh yeah, *very* accurate funneling [08:44:03] each action as part of a funnel has a UUID [08:44:13] and some funnels are 'referred to' from other funnels and referenced by UUID [08:45:09] would also lend itself to some very nice visualizations, I think [08:45:18] ya, like the ones GA has [08:45:39] yea. [08:45:46] I am naive ... but "very accurate funneling" makes me nervously think about 1984 ... [08:45:57] but in this case ( and again qchris might know better) i do not see how hadoop will help you either, it is very "custom" data extraction [08:46:31] nuria: more with the fact that I can run custom scripts on hadoop and not fight with mysql's 'SQL' [08:47:12] but those scripts will closely resemble the python you are going to write to interact and interpret your data in SQL [08:47:12] * qchris needs breakfast ... catch up with you later :-) [08:47:29] nuria: true, but I bet they'll run faster on a hadoop cluster? [08:47:56] I also am not sure if I can hack mysql into being useful for this, but that just sounds like another major fragile thing [08:48:15] i do not know, i have never used hadoop at the scale that would require [08:48:52] ah, right [08:49:09] should be interesting either way. Plus I'll get to not write Java for a while :) [09:19:50] qchris, since you agree on reveryting can you please revert the README change ? [09:19:57] sorry, *reverting [09:20:28] The revert at https://gerrit.wikimedia.org/r/#/c/127599/ has already been merged. [09:20:37] Or are you referring to a different commit? [09:22:42] ah ok, very well, sorry i had not seen it [10:04:37] * ori will revert the revert. [10:05:20] just kidding; i'm actually sleeping. [10:41:59] zz_yuvipanda: mmm, funnels… [10:51:57] nuria: ping [10:53:44] hola prtksxna [10:53:50] o/ [10:54:19] nuria: Mind discussing - https://gerrit.wikimedia.org/r/#/c/116260/ - here? Might be quicker :) [10:54:54] Sure, you guys tried the "store event and poll" approach right? 
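The funnel instrumentation YuviPanda describes above (a UUID per action, and funnels "referred to" from other funnels by UUID) might look roughly like this. The field names are illustrative only, not the actual EventLogging schema:

```python
import uuid

# Illustrative sketch of per-action UUIDs with cross-funnel references.
# Field names are hypothetical, not the real schema.
def make_event(funnel_id, action, referrer_funnel=None):
    return {
        "event_id": str(uuid.uuid4()),        # unique per action
        "funnel_id": funnel_id,
        "action": action,
        "referrer_funnel": referrer_funnel,   # UUID of the funnel we came from
    }

search_funnel = str(uuid.uuid4())
reading_funnel = str(uuid.uuid4())

events = [
    make_event(search_funnel, "search-open"),
    make_event(search_funnel, "result-click"),
    # the reading funnel is 'referred to' from the search funnel
    make_event(reading_funnel, "page-view", referrer_funnel=search_funnel),
]
```

Joining events on `referrer_funnel` is what would make the GA-style funnel visualizations mentioned above possible.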
[10:55:11] it seems that you did so from the wiki i read [10:55:28] nuria: There were issues with that [10:55:38] nuria: localStorage not being available everywhere etc. [10:56:14] right, it will only work for IE8 and up if I am not mistaken [10:56:48] but that seems preferable to a slower UX (it is of course the call of the PM ultimately) [10:56:48] nuria: right, http://caniuse.com/#search=localstorage [10:57:00] nuria: I am open to trying other approaches [10:57:23] nuria: See, at some places people are making the UX even slower, this mitigates those issues at least [10:59:03] I will defer to the performance team on that as i do not know the wikimedia codebase and cannot speak to changes there [10:59:17] my main concern with this change is that it is not "measurable" [10:59:38] Example: we just reported recently page timings for all pages [11:00:03] the "delay" introduced by this logging is not in any way present on those numbers [11:00:27] So i would be presenting data that says, the 50th percentile in the US is 400ms [11:00:34] for time to glass [11:00:58] but that would not be what the user saw if we introduce arbitrary client side delays. [11:01:21] I know, I am trying to think of an alternative [11:01:34] yes, not trivial [11:01:42] the solutions i see [11:02:00] 1) use local storage +polling thus reporting data for ie8 and up [11:02:00] What I was trying to say was - I saw some places where the timeout wasn't there - and isn't that even worse than this [11:02:10] So giving a solution like this is actually better? [11:02:37] right, that is worse for sure [11:02:58] Now, this code does not solve the issue, just makes it less pronounced, right? 
[11:03:09] You could say that, yes [11:03:24] It's not a *solution*, it's just better than the problem… [11:04:53] and solution 2) will be logging server side completely based on referers of http request [11:05:03] 1 > 2 [11:05:05] on referer hedaers [11:05:10] *headers [11:05:41] also not optimal, i know, cause server side referers might be incomplete, not there ... [11:06:08] nuria: I think it might help if you document all your concerns here too - https://bugzilla.wikimedia.org/52287 - probably CC someone from performance? [11:06:15] there is no good solution, imo [11:06:40] hola! [11:06:42] ori: I "undrafted" the patch… [11:06:56] the beacon proposal from the w3c's web perf group proposes adding Navigator.sendBeacon for exactly this use-case [11:06:58] nuria: hi! [11:07:01] see https://dvcs.w3.org/hg/webperf/raw-file/tip/specs/Beacon/Overview.html [11:07:16] ya, i saw that [11:07:22] that is the solution [11:07:34] "There are other techniques used to ensure that data is submitted. One such technique is to delay the unload in order to submit data by creating an Image element and setting its src attribute within the unload handler. As most user agents will delay the unload to complete the pending image load, data can be submitted during the unload. [...] [11:07:41] Not only do these techniques represent poor coding patterns, some of them are unreliable and also result in the perception of poor page load performance for the next navigation." [11:08:24] not only poor page performance but also "poor page performance you cannot easily detect" [11:08:48] i can make comments to this effect on the bug [11:09:07] but agree with ori, there is no good solution [11:09:09] that'd be cool [11:09:16] nuria: thanks [11:09:27] hi prtksxna, btw [11:09:42] ori: o/ [11:13:05] thank you prtksxna for your fast response to comments. 
[11:13:24] nuria: yw :) [13:35:21] (CR) QChris: "The reverted change is I25c47b0512ea2e80521edbe8c38e70dfd8ceec8e" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/127599 (owner: Nuria) [13:35:39] (CR) QChris: "Reverted in I08bcc2c651a8b5d9bdd8b8be5fe0e122dfc364d6" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/120542 (owner: Nuria) [14:31:24] qchris_meeting: [14:31:25] ah meeting [14:31:30] ping me when you wanna talk about zero logs [14:31:43] ottomata: I am mostly listening there anyways... [14:31:47] So... zero :-) [14:31:56] ok, so i just looked at mobile logs too [14:32:00] and those are not consistently different [14:32:16] the sizes are different, but sometimes kafkatee logs are bigger, sometimes udp2log are bigger [14:32:29] I've just looked at filesizes so far [14:32:46] I looked only at the number of lines. [14:32:55] ok, should be the same either way [14:33:11] kafkatee and udp2log should be the same. Right. [14:33:50] hm, yeah, especially because zero is not sampled, right? [14:33:55] For the zero files ... kafkatee is consistently 10% smaller. [14:34:02] yes. Both are unsampled [14:34:04] i was about to say 'statistically the same' but, zero is unsampled [14:34:13] so they should on the whole have the exact same lines [14:34:19] Yes. Udp2log zero files are unsampled as far as I know. [14:34:25] No, not exact same lines. [14:34:38] Sequence numbers are not really the same. [14:34:38] not exactly in the same files [14:34:42] oh [14:34:44] right, i mean [14:34:47] Neither are the timestamps. [14:34:50] exact same requests. [14:34:57] Exact same requests. Yes. [14:35:08] the timestamps should be the same, no? those are the request timestamps from the varnish logs [14:35:12] (Modulo different start/end dates) [14:35:15] yeah [14:35:31] No. The timestamps of requests do not agree. [14:35:37] I was puzzled about that myself. [14:35:50] But I found counterexamples when diffing the streams. 
[14:37:10] as in [14:37:14] you found what you are sure is the same request [14:37:18] but they have different timestamps? [14:37:31] I thought so ... but it was early in the morning. [14:37:36] Let me double-check [14:37:41] k i will try to find some too [14:39:22] Got something. [14:39:28] On stat1002 in my home directory, run [14:39:33] diff ~/cp-kafka-2014-04-19.sorted-w-time ~/cp-udp2log-2014-04-19.sorted-w-time | grep 002025604 | head -n 2 [14:39:37] ottomata: ^ [14:40:21] The files are from zero tsvs with 20140419 in the filename. [14:40:38] (They have sequence number removed, so they diff better) [14:41:29] hm. [14:42:00] weird, perhaps I am wrong about how timestamp is constructed [14:42:08] i thought for sure it would be stored in the shared memory logs [14:42:23] I am clueless about varnish :-) [14:42:43] Here you go with sequence numbers: [14:42:47] zgrep '^cp3013.*2014-04-18T19:03.*002025604' /a/{squid/archive,log/webrequest}/zero/zero.tsv.log-20140419.gz [14:43:09] Note that the sequence number is decreasing, while the timestamp is increasing. [14:43:17] But it seems to be the same request to me. [14:44:42] seq decreasing timestamp increasing?!?! [14:44:56] Yes :-) [14:45:09] I said it did look strange to me as well, didn't I? [14:45:19] Probably some races somewhere. [14:45:49] Shared memory, delayed writing, ... no clue. [14:46:16] wait, those seqs are from the two different files, right? [14:46:25] the seqs are generated by the loggers, not varnish [14:46:36] Ah. Ok. [14:46:45] i think this is just really weird [14:46:53] because the seqs happen to be 1 off [14:46:54] for the same request [14:46:55] Can they be that close if they are different loggers? [14:46:56] in the different files [14:47:00] they could be, sure [14:47:08] like, if the loggers both started at the same time [14:47:13] Ok. [14:47:16] they all just start counting at zero when they start [14:47:35] cp3013 is one of the new mobiles, right? [14:47:39] Yes. 
[14:47:51] yeah, so maybe puppet ran and just started them up at about the same time [14:48:13] ok, so let's not worry about that, I am a little annoyed that the timestamps are not the same...have you ever seen one where the timestamp is more than a second off between the two? [14:49:02] I didn't check, as it was just by accident that I noticed that they are different at all. [14:49:10] you know, Snaps might know! :) [14:49:11] We didn't vet kafkatee data :-) [14:49:19] Yes. [14:49:19] Snaps: , yt? [14:49:27] hi theres [14:49:34] the varnishncsa docs say %t is Time when the request was received [14:49:38] hey [14:49:38] so [14:49:40] we are just curious [14:49:46] we are comparing udp2log and kafkatee log lines [14:49:52] we've found a few lines [14:49:56] that we are sure are the exact same request [14:50:02] but have different timestamps [14:50:05] request timestamps [14:50:11] only 1 second different [14:50:12] but still [14:50:26] i had thought that the request timestamp was being stored in the varnish shared memory logs [14:50:37] along with the rest of the request data [14:50:59] so all of the request content would be the same independent of what was reading the shared memory logs [14:52:00] which one is earlier? [14:52:04] if consistent [14:54:41] udp2log is earlier [14:54:46] but that is only one request we are looking at [14:54:55] vk will read the date from RxHeader tag first, and if that one isn't found, ReqEnd. [14:55:06] problem is it only matches RxHeader from BACKEND, but that's a CLIENT tag. [14:55:08] So it's a bug. [14:55:38] And as ReqEnd tag is the last tag of a connection it will have a later timestamp, sometimes rolling over into the next second. [14:56:37] (I checked half a dozen corresponding requests with different timestamps, and udp2log was consistently earlier) [14:56:55] https://github.com/wikimedia/varnishkafka/blob/master/varnishkafka.c#L924 <-- that should be VSL_S_CLIENT instead of VSL_S_BACKEND. [14:58:11] oh, ok, so a bug in vk? 
:) [14:58:15] yepyep! [15:00:03] huhm, but varnishncsa does the same thing. huhm huhm [15:03:14] hm [15:03:56] ah, varnishncsa overwrites previously set values, while vk doesn't. So varnishncsa will always use the timestamp from the ReqEnd tag, while vk will use the "Date: " header. [15:04:00] that's my take [15:05:11] ah hm [15:05:18] where does Date header come from? [15:05:19] client/ [15:05:19] ? [15:05:22] originating client? [15:07:58] probably not since that value is really of no interest (it might be bogus) [15:08:50] maybe the date: header is rewritten by the cache frontend [15:08:54] need to check the sauce [15:11:35] hm [15:11:58] ok, well, if you think varnishkafka is doing the right thing, and we can explain the differences, then I'm not too worried about it [15:12:06] the second difference isn't a problem really [15:13:31] even though the Date: header is probably supplied by varnish I would trust the ReqEnd tag more (partly because it doesn't require parsing a date) [15:14:30] but since there doesn't seem to be any larger time lapses between vk and vncsa, I don't think it matters. [15:17:45] I don't know if standard browsers send the Date: header in requests [15:17:47] yeah, i mean, we haven't seen any, we just kinda noticed this one [15:17:49] ok [15:17:52] i think it's fine then [15:18:21] so, we're looking at this, because since a little less than a month ago, the zero logs that are generated by kafkatee contain fewer requests than the ones from udp2log...not sure why yet, need to investigate more [15:18:22] but it is strange [15:18:30] seems to be a specific date when that started too [15:20:21] huhm, ok. fewer by how much? [15:22:20] 10% [15:22:42] ouch [15:23:30] is this verified by another consumer than kafkatee? 
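Snaps' diagnosis above (RxHeader is matched against the backend tags even though it arrives as a client tag, so vk falls through to the later ReqEnd timestamp) can be modelled with a toy tag reader. This is an illustration of the described behaviour, not varnishkafka's actual code, and the tag values are made up:

```python
# Toy model of the described bug: prefer the RxHeader "Date" value, fall
# back to ReqEnd. Matching RxHeader against the wrong source means the
# fallback (which fires last, one second later here) always wins.
CLIENT, BACKEND = "client", "backend"

shm_log = [
    (CLIENT, "RxHeader", "Date: 2014-04-18T19:03:01"),  # what should match
    (CLIENT, "ReqEnd",   "2014-04-18T19:03:02"),        # last tag, later time
]

def pick_timestamp(log, rxheader_source):
    ts = None
    for source, tag, value in log:
        if tag == "RxHeader" and source == rxheader_source and ts is None:
            ts = value.split(": ", 1)[1]
        if tag == "ReqEnd" and ts is None:
            ts = value
    return ts

buggy = pick_timestamp(shm_log, BACKEND)  # misses RxHeader, uses ReqEnd
fixed = pick_timestamp(shm_log, CLIENT)   # uses the Date header
```

This reproduces the one-second skew seen in the zgrep output: the ReqEnd-based timestamp can roll over into the next second.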
[15:50:57] no, all we have seen so far is differences in line counts in the archived files that qchris was going to use [15:51:01] i'm checking into it now [16:05:54] ok, partly good news Snaps, qchris_away [16:06:02] i just checked webrequest_mobile imports in hdfs [16:06:09] for a line that was in both udp2log and in kafkatee [16:06:11] it was there [16:06:19] i also checked for a line that was only in udp2log, but not in kafkatee [16:06:22] and it was in hdfs [16:06:27] so, that means that the log was actually in kafka [16:17:54] qchris_away: i see that the kafkatee has only been running since march 26th [16:18:00] the kafkatee process [17:04:29] ottomata: If kafkatee has only been running since march 26th ... where are all the files in /a/log/webrequest/mobile/ (on stat1002) from before march 26th coming from? [17:05:16] oh, i mean [17:05:21] it was restarted then [17:05:26] (Glad to read that the messages made it into hdfs, and are just missing from the tsvs) [17:05:27] that's the last time it was started [17:05:31] Oh. [17:05:32] Ok. [17:06:32] yeah not sure [17:06:40] kafkatee version hasn't changed since feb 20 [17:07:02] Mhmm. Interesting. [17:19:24] lunchtime, back in a bit [18:08:37] hi Ironholds. [18:08:48] hey YuviPanda :) [18:08:57] Ironholds: I've implemented the session tracking schema. Some data should start flowing in when we push out the alpha today. [18:09:02] there's already some data from my testing. [18:09:09] cool! [18:09:24] Ironholds: I'll check to make sure that our EL infra can handle that, though [18:09:29] * YuviPanda pokes ori [18:09:58] ori: that's an Event for every link clicked, for mobile apps (which are a few % of traffic). Now that I write that out that sounds a bit scary in terms of network overhead and also our servers. [18:11:38] 20/(20/100) [18:11:59] how much is that? 4%? [18:12:07] yeah. 
[18:12:18] that or my math is so wrong [18:12:22] either can be true [18:12:33] I suck at math [18:12:39] what do you think I am, a statistically-oriented programmer?! [18:12:49] 20% of 20% is 4%? [18:13:02] 10% of 20% is 2% so 20% of 20% is 4%? [18:13:04] I want to say 'yes' [18:13:09] or am I in a post dinner haze [18:13:10] but, you know, we could just get a calculator [18:13:20] or a REPL for a statistical language, like the one I have open [18:13:26] yes! [18:13:32] I typed into google 0.2 * 0.2 [18:13:35] and it gave me 0.04 [18:13:48] yes, 4 percent [18:13:50] * YuviPanda likes decimals [18:13:55] oh my god we're terrible. [18:14:03] people PAY US to DO STUFF? [18:14:17] yeah, but 20 / ( 20 / 100 ) is like 20 / (0.2) = 100 [18:15:11] Ironholds: either way, I'll build in a kill switch. I realized I could stagger scroll events by 10s or more, but then I'm unsure how much the servers can take [18:15:21] eh. Scroll events are a bonus. [18:15:24] Ironholds: I can also build in a ramp up mechanism, where we start with 10% of users and ramp up [18:18:21] YuviPanda: what's the volume, expressed as average events / sec? [18:18:41] Ironholds: ^ can you give a rough answer based on WikipediaMobile UAs? [18:19:56] Ah...no. [18:20:05] I mean, I can tell you an estimated volume for just clicks [18:20:11] that's about it ;). [18:20:11] Ironholds: yeah, that should be fine [18:20:22] sure, vun moment while I pull up zee logs. [18:20:40] Ironholds: we could even just take our article page view rate and multiply by 0.04 [18:20:43] oh, bugger, no I can't, I only stored percentages. [18:20:55] okay, I'm pretty sure I have a serialised file around here that has the raw data. Wait one. [18:21:04] can you estimate an order of magnitude? 1/sec, 10/sec, 100/sec, 1000/sec? [18:21:17] "less than the last one, more than the first one" [18:21:24] I'll have a precise estimate in about 10 minutes, if that works? [18:21:26] It's a really big file. 
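As the back-and-forth above settles, "20% of 20%" is a product of fractions; the `20/(20/100)` expression typed earlier is a division and comes out to 100, which is why it looked off:

```python
# 20% of 20% of traffic is a product of fractions, not a division.
share = 0.20 * 0.20          # 20% of 20%
print(round(share, 2))       # 0.04, i.e. 4% of traffic

wrong = 20 / (20 / 100)      # the mistyped version from the chat
print(round(wrong))          # 100 -- not a percentage at all
```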
[18:21:43] sure yeah [18:21:50] Snaps: you still around? [18:24:40] qchris: for a second I thought maybe we were losing some logs during logrotate [18:24:50] but the losses look fairly evenly spread throughout the hour I am looking at [18:25:03] ottomata: 10% due to logrotate :-D [18:25:11] yeah, too much, i know [18:25:21] but i saw some messages about X number of messages in queue during a logrotate [18:25:22] in syslog [18:25:44] It didn't look like we're only losing around 6:30. [18:25:45] but, i don't think it should matter, kafkatee should be pretty resilient, and yeah, that's not the pattern of the losses anyway [18:25:50] I had some hits for 19:30. [18:25:51] yeah [18:26:13] But logrotate would be a nice explanation. [18:26:17] Yes. [18:26:22] yeah, i'm looking at 2014-04-19 07:00 - 08:00 for cp1046 [18:26:31] and I see losses throughout the hour [18:26:46] (files in /a/otto/kafkatee-compare, btw) [18:27:28] i'm thinking about restarting kafkatee just to see if this goes away, even though I doubt it will...i'm not really sure where else to check [18:27:30] I had a look at cp301[34] (files in /home/qchris/cp*) [18:27:52] Restarting is certainly worth a try. [18:28:03] It came with a restart ... maybe it will go with a restart as well. [18:28:21] haha, i'll sprinkle some magic dust on it before I restart it [18:28:26] restarted. [18:28:33] * qchris keeps fingers crossed. [18:29:08] What strikes me, is that it's really so close to 10%. [18:29:36] Did we turn on some (maybe unrelated) sampling before the previous restart? [18:30:31] not that I could tell, no [18:30:34] but possible [18:30:52] the files in /etc/kafkatee.d have not changed since Feb 27th at the latest [18:31:48] guess I should get some kafkatee monitoring up soon eh? [18:31:54] :-) [18:32:04] there is a json stats file similar to what varnishkafka has [18:32:24] Where does this file live? [18:33:35] analytics1003:/var/cache/kafkatee/kafkatee.stats.json [18:34:09] Meh. Does not let me in. 
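The comparison qchris and ottomata are doing above (strip the per-logger sequence-number field so the udp2log and kafkatee streams diff on content only, then measure what is missing) can be sketched like this. The tab-separated field layout is an assumption for illustration, not the real zero tsv format:

```python
# Sketch: drop the sequence-number field (loggers count independently,
# as noted above), then measure the fraction of udp2log lines absent
# from the kafkatee output. Field positions are assumptions.
def strip_seq(line, seq_field=1):
    fields = line.rstrip("\n").split("\t")
    del fields[seq_field]            # sequence numbers differ per logger
    return "\t".join(fields)

def loss_rate(udp2log_lines, kafkatee_lines):
    udp = {strip_seq(l) for l in udp2log_lines}
    kt = {strip_seq(l) for l in kafkatee_lines}
    return len(udp - kt) / len(udp)

# toy streams: same requests, different sequence numbers, one line lost
udp2log = [f"cp3013\t{100 + i}\tGET /page{i}" for i in range(10)]
kafkatee = [f"cp3013\t{200 + i}\tGET /page{i}" for i in range(10) if i != 3]
print(loss_rate(udp2log, kafkatee))   # 0.1, i.e. the 10% gap being discussed
```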
[18:34:28] Well. Not so important. If the file does not explain the problem to you, [18:34:37] it certainly does not explain it to me either. [18:36:46] qchris: i copied it to stat1002 in /a/otto/kafkatee-compare/kafkatee.stats.json [18:36:53] Thanks. [18:39:06] there are 10 total partitions [18:39:18] maybe one of them is not consuming properly? or maybe one wasn't? [18:39:26] And 10 worker nodes IIRC. [18:39:29] and then kafkatee decided not to consume from it anymore? [18:39:32] hm, no [18:39:35] there are 10 datanodes [18:39:47] but those are not connected to kafkatee at all [18:39:52] hadoop is totally separate from this flow [18:40:02] Yes, you're right. [18:40:23] If kafkatee would ignore a partition, that would explain it. [18:41:00] whoa [18:41:07] kafkatee isn't writing offsets!~ [18:41:15] hmmm [18:41:51] -rw-r--r-- 1 kafkatee kafkatee 11 Feb 26 19:53 /var/cache/kafkatee/offsets/webrequest_mobile-0.offset [18:41:54] all Feb 26! [18:42:07] No clue :-D [18:47:05] Snaps: come back to us! :) [18:50:48] okay, test done [18:51:09] er. crap. [18:51:18] Probably a million events a day, just from the ones we're tracking server-side [18:51:24] (assuming the 1:1000 random sampling is truly random) [18:51:33] YuviPanda, y'might want to go for a subset ;p [18:51:46] heh [18:51:55] like, 5 percent of people, tops. [18:52:31] Ironholds: right. I could send the token as a header but I guess that has other issues. [18:54:42] to just allow requestlogs tracking? [18:55:21] Ironholds: yeah? [18:55:48] Ironholds: so I could send an extra header with the same token we are using here, but then we'll have to have some serverside code to capture that [18:55:57] Ironholds: and if we do that we can decide about sampling on the server itself [18:57:28] I would recommend against that. 
[18:57:48] because one, serverside code, and two, the sampling is per-request not per-token [18:57:50] 1m / day is ok [18:57:54] so you wouldn't actually be able to eke out any session data [18:58:00] ori, yeah, tyop. I meant an hour. [18:58:09] heh [18:58:30] how is that possible? [18:58:42] mean(monthly_sample_by_day)/24 turns out at around 11,000 requests from apps [18:58:52] and I'm assuming that the 1:1000 sampling is, well, at least pseudo-random [18:59:10] (it's probably not, but I have no way of checking) [19:00:41] Ironholds: yeah, that's consistent with what yuvi was saying earlier [19:03:46] Ironholds: wait, is that with or without 1:1000 sampling? [19:04:36] YuviPanda: poke [19:04:44] * YuviPanda peeks head out [19:05:10] I think the 1m/day is without sampling? [19:05:20] were you intending to sample? [19:05:28] 1m/day has to be without [19:05:31] I wasn't, but it's trivial to do. [19:05:37] yeah, has to be without. [19:05:52] what's the schema? [19:06:59] YuviPanda: ^ [19:07:10] ori: looking [19:07:20] ori: https://meta.wikimedia.org/wiki/Schema:MobileWikiAppReadingAction [19:07:39] I would feel slightly better just setting it as a header, but yeah, hard to do 'per token' sampling server side [19:08:10] heh, the CSS for the code sample is screwy [19:08:23] hah! just noticed it [19:12:56] ori: what do you think? I am going to build in a way for us to remotely tweak the sampling rate (which will be 'per token') [19:13:40] YuviPanda: it's a bit odd in that you're basically tacking mobile app requests via a surrogate request [19:13:54] tracking, even [19:13:55] ori: indeed, which is why I just want to send in an extra header. [19:15:29] yeah, that would be best. but what question(s) are you trying to answer w/that data? [19:16:58] ori: So The Question The Product Manager Wants Answered Is "What is the average session time for interaction with the app"? [19:20:06] YuviPanda: Activity.onPause handler? 
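For ori's events-per-second question above, the conversion is mechanical once a count, a window, and a sampling ratio are assumed. The figures below are the rough ones from the chat, not measurements, and the two estimates disagree with each other, which is exactly the inconsistency ori flags with "how is that possible?":

```python
def events_per_second(count, hours=1, sampling_ratio=1):
    """Scale a sampled event count back up and normalise to events/second."""
    return count * sampling_ratio / (hours * 3600)

# "a million events an hour", taken at face value:
per_sec = events_per_second(1_000_000)                    # ~278/sec
# the 11,000 requests/hour figure, if it were a 1:1000 sample:
scaled = events_per_second(11_000, sampling_ratio=1000)   # ~3056/sec
```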
[19:21:01] ori: yeah, so that was the original plan (not onPause since that is triggered way too much, but similar) [19:21:15] but? [19:21:26] ori: we'd basically have to do a 'if the user has not been active in X minutes, session has ended' [19:21:43] ori: since onPause can be called in so many different ways (tap 'Random', onPause gets called, for example) and hence is useless. [19:22:14] ori: and he suggested just logging them serverside and determining 'X' empirically [19:26:09] YuviPanda: why don't you have an onPause handler that just updates a lastAction timestamp [19:26:24] YuviPanda: and then on application *start*, send the record of the *previous* session [19:26:36] ori: right, so 'application start' is also an ambiguous concept in Android. [19:26:43] ori: processes linger for a fair amount of time sometimes. [19:26:56] ori: so I've to have a timeout. [19:27:12] ori: the problem is when to reset lastAction and send out an event. [19:28:18] YuviPanda: onWindowFocusChanged? [19:29:20] ori: all of the activity methods are kinda useless or contorted, since the app has multiple activities. There can be multiple instances of each activity as well. [19:29:22] YuviPanda: i mean, i don't see how logging all actions evades this problem [19:29:31] Ironholds: ^ [19:30:04] ori: it's so we can, serverside, just say 'x% sessions stopped sending events after 10m of last activity, y% after 2m, z% after 30s' etc? [19:30:23] ori: http://blog.ironholds.org/the-myth-of-the-30-minute-session-timeout/ is what Ironholds pointed me at [19:30:34] how do you determine when a session stopped sending events? [19:31:09] ori: so currently I send events on every link clicked. Can easily make that every time someone finishes scrolling, or somesuch. The Java code that logs this is literally called logSomethingHappened(); [19:31:28] so client keeps sending events when things happen, and then we analyse later serverside. 
[19:31:49] yeah, but suppose i generate a bunch of events, then i go do something else, and then two hours later i use the app again [19:32:24] how do you segment the data server-side into discrete sessions? [19:32:40] you need to specify that any gap longer than N minutes indicates the end of a session [19:32:45] but that's the very thing you are setting out to determine [19:32:55] Ironholds: ^ [19:33:17] I assumed there's going to be some sort of clustering algorithm that lets me derive it, but tbh I hadn't fully thought about that. [19:33:40] I think Ironholds can answer that better than me at this point. [19:34:06] maybe you look at the distribution of the deltas between events [19:34:37] indeed, that's the naive thing that was in my head [19:35:02] 'A% has X min gaps, B% has Y min gaps, C% has Z min gaps between clusters' [19:35:30] ori: from what I remember Ironholds is also working on detecting session times from our regular pageview data using similar algorithms [19:36:27] i'm not persuaded it's a problem, really. so Ironholds's point that the 30m figure is basically an outdated convention seems valid, but [19:37:01] remember that other things changed since 1994, namely what sort of things count as activity [19:38:09] back then it was basically page requests from the server; you didn't know when the user was scrolling or anything like that [19:38:25] right. [19:38:37] theoretically in the app I could detect scroll stops and other things as well [19:38:47] so if you have a 10,000 word essay it's plausible that the user will just sit there and read the whole thing and then continue on to something else, all in the span of one session [19:39:05] right. [19:39:17] and tracking these by just links will miss those. 
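The server-side segmentation ori describes above (any gap longer than N minutes starts a new session) is simple once N is fixed; a minimal sketch, with the threshold as a parameter since N is the very thing under debate:

```python
def sessionize(timestamps, threshold):
    """Group one user's event timestamps (seconds) into sessions:
    a gap longer than `threshold` starts a new session."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= threshold:
            sessions[-1].append(t)    # continues the current session
        else:
            sessions.append([t])      # gap too large: new session
    return sessions

# three quick events, then a two-hour gap, then two more
print(sessionize([0, 30, 70, 7300, 7320], threshold=1800))
# [[0, 30, 70], [7300, 7320]]
```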
we need scroll for that [19:39:22] but with mobile it's really hard to have a session that is active and generates no activity [19:39:49] like, you have to do something, or something else will steal focus or the phone / tablet will suspend [19:40:02] right [19:40:44] so if it comes to it I can even do raw touches (with a timeout so I don't send 5000 in a sec) [19:40:50] and with sampling too [19:40:56] and I think that'll be a very accurate measurement [19:41:05] but I don't know at what point people will get creeped out [19:41:25] but i'm saying, you don't actually need to log every one if you determine the number [19:41:57] once you determine that anything longer than N minutes of inactivity indicates a new session [19:43:06] ori: right. but then if I am only tracking 'page navigation', then someone reading a large article would count as inactive. [19:43:07] then you can just have: onAppEvent() { if (currentTime - lastEventTime > sessionThreshold) { reportSessionTime(); currentSessionStart = currentTime; } } [19:43:14] right [19:43:20] ori: that's what I meant by 'with a timeout' [19:43:28] ori: should've said 'batched' [19:44:36] so let's separate two questions here [19:44:41] ori: ah, hmm. so if I set that to say 1m, then clicking new links will also be sent at most once per minute. [19:44:46] ok! [19:44:49] the primary question is "how long do sessions last" [19:45:09] to answer that, we need "how do you know a session has ended" [19:45:21] indeed. [19:45:43] we already know that the answer to the second question will have the form "no activity has been generated in N minutes" [19:46:31] right [19:46:53] but we need to determine what N is, either by reasoning about mobile devices to our satisfaction or by looking at the distribution of currentEventTimeStamp - previousEventTimestamp [19:47:14] I'm back. 
ottomata [19:47:22] even if we do the latter (collect data to determine N), i think it's a logically separate data collection job [19:47:33] and you'd only need to run it for like, a day [19:47:43] aah, hmm [19:47:44] right [19:48:05] so we can kinda sample it waay low (0.05%?) and run it for a very short period of time, and then use that to tweak the N? [19:48:16] and then you'll just know N and you could use the timeout logic to compute session duration on the client and report that [19:48:35] yeah. is it easy to push updates? [19:49:39] ori: I've a 'RemoteConfig' switch that lets me push a JSON file to the app very easily. I use it for config switches that change over time and need to be changed without the hassle of pushing out an entire update [19:50:03] ori: even otherwise, pushing an android update is pretty easy. iOS is a PITA tho [19:50:28] cool, yeah, so you don't even need to sample based on token / device id, it can be totally anonymous [19:50:57] just do: onUserActivity() { if (randomChance()) { reportTimeDeltaFromLastEvent(); } } [19:51:58] ah. [19:52:00] * YuviPanda thinks [19:52:28] so we can see a distribution, and then figure out N from that [19:52:42] scrolling up [19:53:06] ori, it's not about when sessions stopped sending events, it's about inter-time periods. [19:53:25] So, I have the inter-time periods for, say, 30,000 users; the time elapsed between each pair of events for each user. [19:53:49] if I aggregate those, I expect to see a dip - a point where if users haven't taken any actions by N number of seconds, they're not going to again for an extended period of time [19:53:53] and that's the cutoff [19:54:03] "expectation maximisation" is halfak's name for it (I dunno who else uses it) [19:54:40] wait, you want to compute this per-user? [19:55:17] oh god no, that'd be horrifyingly intensive [19:55:36] so how is that different from what i'm suggesting?
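[Editor's note] The server-side half of this proposal — aggregate the sampled inter-event deltas and look for the dip Ironholds describes — might look like the sketch below. Minute-wide buckets are an arbitrary choice for illustration, and the names are invented here:

```java
import java.util.List;
import java.util.TreeMap;

// Sketch: histogram sampled inter-event deltas by whole minutes of
// inactivity, so the "dip" in the distribution can be read off and
// used to pick the session cutoff N.
class DeltaHistogram {
    static TreeMap<Long, Integer> byMinute(List<Long> deltasMs) {
        TreeMap<Long, Integer> counts = new TreeMap<>();
        for (long d : deltasMs) {
            long bucket = d / 60000;            // minutes of inactivity
            counts.merge(bucket, 1, Integer::sum);
        }
        return counts;
    }
}
```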
[19:55:38] there is an excellent paper I was reading that recommended that, but halfak and I talked about it and agreed the justification was not necessary for our datasets [19:55:46] okay, I should read up more in more detail ;p [19:56:12] Bah. Got pinged. Reading scrollback. [19:56:47] ori, so, if I'm reading right, you're advocating that once we've determined 'session length', or average session length, we only need the aggregate count of actions in a session? [19:57:20] I agree; that's actually what the original EL schema would've done ("Count the number of events before N seconds of inactivity"). This is just working out N. [19:57:25] no, i'm saying, once you've determined session length, you can determine on the device itself [19:57:48] when a new session has started, by comparing the current time to the time a previous event has occurred (where 'event' here means something only stored locally) [19:58:15] milimetric: thing I didn't want to ask in the meeting: Can I go ahead and use simplemediawiki (+ kitchen) to interface with the MediaWiki API, or are there other plans? [19:58:22] so you can log session duration by basically reporting the previous session duration every time the delta between the current and last event is greater than the configured session delta [19:58:49] where the configured session delta is..~? [20:00:51] I saw that in requirements.txt awight, and I have no problem with it [20:00:55] Ironholds: by running an experiment for a day or so and figuring out the delta from the data collected from there? [20:01:54] the only regret I have, awight, is not submitting pull requests to kitchen :) [20:04:29] milimetric: lol is it that bad? We're pretty much insulated by simplemediawiki, fortunately. 
[20:04:54] YuviPanda, yeah, that's what I'm suggesting with this schema [20:04:56] awight: eh, not horrible, but utils.py is getting big-ish [20:07:12] Ironholds, YuviPanda: here's what i have in mind, in pseudo-java: https://dpaste.de/tM30 [20:07:53] * YuviPanda reads [20:07:58] um, setPreviousEventTime( currentEventTime ) should actually be moved from line 33 to line 5 [20:08:06] but otherwise that's the idea [20:08:28] YuviPanda, Ironholds: updated paste: https://dpaste.de/jcig [20:08:42] yeah, that sounds very doable. [20:08:59] and much much less hard on the EL infrastructure. [20:10:16] so, I'm reading that as "compute a breakpoint and, once that's been computed, use that as a cutoff and just provide aggregate data"? [20:10:39] I think so. [20:10:41] yes, pretty much. [20:10:48] that's pretty much what was proposed, but has the advantage of not requiring separate analysis or multiple schemas over time. [20:10:50] I like it :D [20:10:53] milimetric: oic, yes wikimetrics/utils.py should wither under the intense sunshine :D [20:11:00] :) [20:11:12] and you guys can determine when the breakpoint has been established to your satisfaction [20:11:14] indeed [20:11:19] at which point YuviPanda would specify it in a config update [20:11:45] and the logging would switch to the else branch of the if(sessionTimeout==null) [20:12:43] right. I can also make it slightly more configurable, so we collect data for a couple of days or so, and then spend some time analyzing them to figure out the breakpoint and then set it in config.
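[Editor's note] The dpaste links above are no longer available, but the two-phase behavior described — calibration while sessionTimeout is unset, switching to duration reporting once the breakpoint is pushed via config — can be reconstructed as a sketch. Every identifier here is a guess at the paste's structure, not the actual code:

```java
// Sketch: calibration mode reports raw inter-event deltas; once a session
// timeout arrives via remote config, the else branch takes over and whole
// session durations are reported instead. Reports are returned as strings
// here purely to make the sketch testable.
class SessionLogger {
    private Long sessionTimeout;        // null until pushed via remote config
    private long previousEventTime = -1;
    private long sessionStart = -1;

    void setSessionTimeout(Long timeoutMs) { this.sessionTimeout = timeoutMs; }

    String onEvent(long now) {
        String report = null;
        if (previousEventTime >= 0) {
            long delta = now - previousEventTime;
            if (sessionTimeout == null) {
                report = "delta:" + delta;  // calibration mode: raw deltas
            } else if (delta > sessionTimeout) {
                report = "session:" + (previousEventTime - sessionStart);
                sessionStart = now;         // new session begins
            }
        } else {
            sessionStart = now;             // very first event opens a session
        }
        previousEventTime = now;            // per ori's fix: update every event
        return report;
    }
}
```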
[20:12:53] this way we can occasionally re-calibrate and verify our breakpoint too [20:12:56] but yeah, pretty much the same [20:12:56] halfak is doing some thinking, so no implementing ;) [20:13:00] one small problem that i didn't consider is that using simple random sampling of all user actions will be biased in favor of highly active users [20:13:31] we should perhaps do sampling per-user rather than per event [20:13:35] yeah [20:13:39] you do the randomChance() once and then save it and just stick to it [20:13:45] right [20:15:12] Ironholds: ori the current master already pings EL every page view (i.e. current schema). An alpha might go out tomorrow, but that's just about 20-30 people so it's ok I guess. [20:16:18] yeah, provided the mobile alpha isn't featured on the front page of the new york times or something [20:17:58] yeah, that's the beta in about another week. Obama is going to mention it at some point on twitter, I think. [20:18:49] (kidding) [20:19:28] obama uses a blackberry ;) [20:20:22] hahah! :D [20:20:30] ori: we had(have?) a blackberry app in some store somewhere... [20:22:03] okay, we return [20:22:30] so, conclusion; ori's solution looks absolutely perfect if we can be sure of capturing the endpoint (presumably when the app dies, or when it's restarted if it doesn't die? Me no implementation person) [20:22:38] Ironholds: tho see point above about sampling per-user [20:22:51] *nods* [20:23:03] I read it as per-user sampling to compute per-user breakpoints, though [20:23:05] did I read wrong? [20:23:16] (not to compute global breakpoints) [20:24:01] I guess that would mean the unique client side ID is still required [20:24:07] and sampling has to be per user [20:24:20] it's to compute the global breakpoint [20:24:28] aha [20:24:32] then yeah, that'd be a problem :( [20:24:43] what's the problem? [20:25:36] just the biasing-towards-highly-active. But I don't think it's a serious one.
Unfortunately my brain can't generate any good reasons as to why it is/is not a problem, it's just a hunch, and hunches are useless. [20:26:03] but it won't be biased, since we'll determine whether or not to log on a per-user basis rather than per-event [20:26:17] ahh, I got my wires crossed, I think. [20:26:27] ignore me, then. That sounds awesome; thank you for coming up with the solution :) [20:27:07] * YuviPanda cuts Ironholds's red wire [20:27:19] dude, blue. you cut the blue wire. [20:27:24] cool cool [20:27:25] ori: Ironholds: think either of you can document this somewhere so the iOS guys can implement it as well? [20:27:32] i'll amend the pseudojava [20:28:12] :) [20:28:14] actually can't, have to run [20:28:24] ok! [20:28:28] ori: thank you! :) [20:28:34] i was going to change if (randomChance()) to if (userInBucket()) [20:28:47] bye [22:25:24] hey, anybody around with kafka experience? [22:25:56] I'm wondering how it has worked for you so far [22:37:41] gwicke: I guess ottomata would be the only one with real experience. [22:38:15] Plain analytics-devs haven't touched kafka much. [22:38:16] qchris, thanks! [22:39:03] will ping him when he's around
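[Editor's note] The userInBucket() change ori mentions — deciding once per user whether to sample, rather than flipping a coin per event — could be sketched like this, assuming a stable, locally stored token. The token name and the rate encoding are illustrative assumptions:

```java
// Sketch: deterministic per-user sampling. Hashing a stable per-install
// token against a sampling rate gives the same in/out decision on every
// event, avoiding the bias toward highly active users that per-event
// random sampling introduces.
class SamplingBucket {
    static boolean userInBucket(String deviceToken, double rate) {
        // hashCode() can be negative; mask off the sign bit rather than
        // use Math.abs (which fails for Integer.MIN_VALUE).
        int h = deviceToken.hashCode() & 0x7fffffff;
        return (h % 10000) < (int) (rate * 10000); // e.g. rate=0.0005 for 0.05%
    }
}
```

A real implementation might prefer a better-distributed hash and a random token generated at install time, so the decision is stable but carries no identifying information off-device.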