[00:05:44] hey YuviPanda, fancy helping me with a naming problem?
[00:05:54] A library for session reconstruction. What do I call it.
[00:49:42] Ironholds: Resession
[00:49:48] I'll show myself out
[00:51:41] Emufarmers, that's awesome!
[00:51:47] I already went for reconstructr, though :(
[02:02:47] halfak, here's a good question for you.
[02:03:17] Work out what you'd need to do to take a df of uuid-timestamp and automatically identify a good threshold for splitting, and how good said threshold is ;p
[02:10:08] I can do all but the last bit.
[02:10:15] I might be able to do the last bit.
[02:10:36] Actually, the auc is a pretty good stat for the last bit.
[02:13:45] cool!
[02:13:55] I'm thinking if we want to really make this easy, we should try to automate as much of the process as possible.
[02:14:31] in the middle of writing a vignette on session identification methodologies, cannibalising our paper
[02:28:34] vignette?
[02:28:37] Ironholds, ^
[02:29:30] halfak, LaTeX or markdown documentation that R processes into Rmd/PDF on package build
[02:29:45] it lets you do things like http://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html
[02:29:56] It's kinda cool. It's also refusing to compile. Baaah.
[02:32:03] Gotcha.
[02:32:07] * Ironholds sighs, throws at StackOverflow
[02:32:17] it's for the use case of "you want to provide documentation that's more extensive than ?function"
[02:32:27] worst-case, I can throw it in ?reconstructr but I'd rather make it independently checkable.
[02:37:53] Gotcha.
[02:38:25] http://stackoverflow.com/questions/27338970/vignette-creation-on-package-build-fails-with-the-error-failed-to-locate-the-w *grumbles*
[02:42:53] that is, I think, my brain's hint to halt work on this for the time being
[02:43:19] but I've extensively hammered on all the C++ and fixed a few bugs (including one really weird one. REALLY weird.
Apparently the number 8 is significant in C++), everything has unit tests, and every function has documentation.
[02:43:40] The only enhancement I can think of other than semi-automated threshold calculation is including time_on_event()
[02:45:00] oh, and semi-automated detection of an appropriate padding value
[02:45:28] also, the library has exclude_single_event and exclude_last_event flags in session_length, so we can adhere to Product's needs when we need to while also blowing raspberries at them ;p
[02:48:20] Exclude last event?
[02:51:12] so, google analytics excludes the final event in a session from consideration when computing session time, because...because I guess they find calculating an appropriate padding value to be hard.
[02:51:33] Now, I think this is bollocks, but if we want something analysts can pick up and run with, whatever the policies of the company they work for, we should include it as a flag.
[02:51:36] even if we don't use it ourselves
[02:52:28] (and then at some point write a paper on metrics extractable from user sessions, exploring whether it's possible to accurately predict the value for the final event based on the overall intertimes in the dataset. Which is trivial.)
[02:52:47] and then hopefully google accepts we are right and gives me a load of money to...scold them for using a 30 minute timeout. Or something.
[02:57:32] What. They couldn't be dropping the last event.
[02:57:42] Are they just imagining the padding applying to the other side?
[02:57:50] e.g. page loads up and then user does stuff?
[02:58:06] Where in edit sessions, we were looking at a save event
[02:58:15] e.g. editor does stuff and then clicks save.
[02:58:25] The "do stuff" is on the other side of the timestamp.
[03:00:10] yup
[03:00:26] so, we have event1, event2, event3 in a session. You wanna calculate the session length.
[03:00:46] So that's the same as just taking last_timestamp - first_timestamp = session_duration
[03:01:02] we would say "time between 1 and 2, time between 2 and 3, and then padding equal to the [type] mean applicable to this distribution"
[03:01:18] Whereas, we want (last_timestamp - first_timestamp) + padding = session_duration
[03:01:18] they would say "time between 1 and 2, time between 2 and 3, and nothing else, ignore the time spent on the last page entirely, that shit is hard to work out"
[03:01:37] No prob.
[03:01:40] yup
[03:01:45] we can do it. It's easy!
[03:01:45] It's the same without padding either way.
[03:01:53] what do you mean?
[03:02:11] If you just allow the function to apply its own padding, you're set.
[03:02:30] agreed!
[03:02:38] but nope. Google definition says no padding.
[03:02:43] Yeah.
[03:02:53] It had signal.
[03:02:55] *has
[03:03:03] It just lacks real world applicability.
[03:03:16] more == more
[03:03:19] amount != amount
[03:04:06] heh
[03:04:15] see https://support.google.com/analytics/answer/1006253?hl=en - wait, fuck.
[03:04:29] the reason they discount time after last event is because "event" for them is a JS-recognised interaction with the page
[03:04:43] i.e., they /are/ counting time on the last page. They just use JS voodoo we don't have access to.
[03:04:48] so we should be including this.
[03:05:09] Gotcha. We could use JS to check this.
[03:05:20] There are focus() and unfocus() events on page.
[03:05:29] We can see when the user is doin stuff.
[03:05:36] And just record the last timestamp.
[03:05:38] yeah, we could. Let's wait for that until after the UUIDs, though.
[03:05:43] but we should totally pitch it.
[03:05:51] last([focus, keypress, scroll, click])
[03:05:53] and now we know we can use "hey, everyone else in the industry is doing it" as an argument
[03:05:59] +1 for waiting.
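The two session-length definitions argued over above can be sketched as follows. This is a minimal Python illustration, not the reconstructr API: the function names, the `threshold` parameter, and the choice of the arithmetic mean inter-event time as the padding value are all assumptions made for the example (the chat only says "the [type] mean applicable to this distribution").

```python
from statistics import mean


def split_sessions(timestamps, threshold):
    """Split one user's event timestamps (in seconds) into sessions.

    A new session starts whenever the gap to the previous event exceeds
    `threshold`. The threshold is a caller-supplied assumption.
    """
    sessions = []
    current = []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > threshold:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions


def mean_intertime(sessions):
    """Arithmetic mean of within-session gaps (one assumed padding choice)."""
    gaps = [b - a for s in sessions for a, b in zip(s, s[1:])]
    return mean(gaps) if gaps else 0.0


def session_length(session, padding=0.0, exclude_last_event=False):
    """(last - first) + padding, or just (last - first) when the time
    spent after the final event is to be ignored."""
    span = session[-1] - session[0]
    return span if exclude_last_event else span + padding


# Hypothetical data: five events, split with an arbitrary 30-minute threshold.
events = [0, 60, 180, 5000, 5030]
sessions = split_sessions(events, threshold=1800)
padding = mean_intertime(sessions)
padded = session_length(sessions[0], padding=padding)
unpadded = session_length(sessions[0], exclude_last_event=True)
```

With padding disabled, both definitions reduce to last_timestamp - first_timestamp, which is the point made at 03:01:45.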
[03:07:01] in the meantime, we have the functionality to include or discount
[03:07:09] I'm happy to discount for apps if it will make the product folk happy.
[03:07:25] that way I get to kick this out of the door and not deal with it again
[03:08:31] +1 for Oliver getting back to the cool stuff.
[03:08:36] I mean Ironholds
[03:08:41] :)
[03:08:47] heh
[03:08:53] I have so many ideas
[03:09:09] the cache test, the probability distribution for edit and donate attempts in sessions
[03:09:22] ooh, halfak, one idea I had last night - I'm not sure what its applicability would be except that it would generate shit-hot graphs.
[03:09:51] session length, dependent on the localised hour-in-day the user began the session i
[03:09:52] *n
[03:10:10] I mean, one application could be simply looking at the viability of expecting to increase engagement on platforms that tend to peak at certain points in the day.
[03:12:38] Write it up!
[03:12:55] I will, when I have time
[03:13:01] You know when I was most productive? Thanksgiving
[03:13:04] I added this earlier today: https://trello.com/c/laFdvt8e/200-newcomer-experience-dashboard
[03:13:10] I'm seriously thinking of taking a week-long holiday for one purpose and one purpose only
[03:13:20] to release a metric ton of FOSS libraries around all the things I do in my day-to-day.
[03:13:31] and tie off every research project I have so I have free time.
[03:13:42] Like, it won't be a holiday. It'll be "I get to work and y'all can't ask me for new stuff"
[03:14:24] I can relate, but you've got to be careful to strike a balance.
[03:14:43] Also, hackathons.
[03:14:55] Next hackathon, this stuff is my topic of pursuit.
[03:15:00] session metrics?
[03:15:17] More like -- all of the libraries I should build out so that I can work faster.
[03:15:17] and yeah, I do need an actual holiday at some point
[03:19:19] ahh
[03:19:31] well, none of this is that, really.
I have all these libraries already
[03:19:39] they're just in one big smushed-together thing in WMUtils
[03:19:55] I'm splitting out the session stuff, the cryptography stuff will be split out as soon as Jeroen and I get openssl released to CRAN
[03:20:09] the geoip stuff, I'll need to either hack at the legacy library or talk to people about why we can't use the modern one
[03:20:17] but after that, building something generic, sticking it on CRAN.
[03:39:35] alright, heading to the pub
[03:39:37] later, halfak!
[03:39:45] Or rather, hopefully not. You should go to bed at a reasonable time ;p
[03:46:28] o/
[16:11:40] morning halfak_
[16:11:51] G'morning Ironholds :)
[16:12:01] NO UNDERSCORES
[16:13:39] haha
[16:18:51] I have a 64GB, 3-column (int, int, str) TSV file.
[16:18:57] How many rows do you think it has?
[16:27:18] hmmn
[16:27:22] what's the str?
[20:27:39] Ironholds, sorry to drop off. The str is a page title.
[20:28:05] halfak, that's okay! I'm hacking with protonk
[20:28:10] hmn. 64GB...
[20:28:40] 512m?
[20:29:41] I was going to guess the same, because that would make sense.
[20:29:49] I can't get a wc on it to finish.
[20:30:02] But the size seems even too big for 100s of millions.
[20:30:11] wow
[20:30:13] what is it?
[20:30:40] It's a page link dataset I'm creating for enwiki.
[20:30:49] But I'm suspecting some funny business.
[20:30:59] hhuh
[20:31:06] what's the query?
[20:31:18] It's an XML dump processing scriot
[20:31:20] pt
[20:40:26] huh
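For context on the row-count guess: 512 million rows in a 64GB file works out to roughly 125 to 135 bytes per row (depending on whether GB is read as 10^9 or 2^30 bytes), which is a lot for two integers and a page title. When a full `wc -l` will not finish, one way to get a rough count is to sample the start of the file and extrapolate from the average line length. The sketch below is only an illustration; the function name `estimate_rows` and the `sample_lines` default are assumptions made for the example, and nothing like it appears in the conversation.

```python
import os


def estimate_rows(path, sample_lines=100_000):
    """Estimate the line count of a large file without reading all of it.

    Reads up to `sample_lines` lines from the start, measures their
    average size in bytes, and scales by the total file size.
    """
    total_bytes = os.path.getsize(path)
    sampled_bytes = 0
    sampled = 0
    with open(path, "rb") as f:  # binary mode so lengths are in bytes
        for line in f:
            sampled_bytes += len(line)
            sampled += 1
            if sampled >= sample_lines:
                break
    if sampled == 0:
        return 0  # empty file
    return round(total_bytes * sampled / sampled_bytes)
```

The estimate is only as good as the sample: if the file is sorted so that early rows have systematically shorter or longer titles than later ones, the head of the file will not be representative.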