[00:05:51] okay, signing off for the evening. Still kinda ill. Someone text me if something breaks.
[00:07:30] halfak: just dropped in. Happy to discuss any questions you might have
[00:07:40] Hey natematias!
[00:07:50] ping YuviPanda (if you're still around)
[00:07:55] I am!
[00:07:58] Woot.
[00:07:59] Hi YuviPanda!
[00:08:03] hi natematias!
[00:08:43] :) natematias, YuviPanda is the author of Quarry (http://quarry.wmflabs.org/) and all-around an ally of good SCIENCE infrastructure.
[00:08:55] oh awesome!
[00:09:12] :D
[00:09:24] Thanks for raising questions about our request
[00:09:35] So, I think the biggest question we have is: did you know about Tool Labs -- a project in WMF Labs?
[00:09:47] And would working within that project serve your needs?
[00:10:26] I think it might. At the time, I think I might have mistaken one for the other :-)
[00:10:37] Heh. It's very confusing.
[00:10:43] can you explain the difference?
[00:10:51] Tool Labs is a WMF Labs project. You need to join WMF Labs in order to join Tool Labs.
[00:11:02] :)
[00:11:03] Tool Labs is where you can get direct access to Wikipedia's databases.
[00:11:19] ahh, ok. In that case, I think I thought I was signing up for Tool Labs
[00:11:22] They also manage some other infrastructure, like web servers and stuff.
[00:11:43] However, you can shun Tool Labs and start your own project and manage your own VMs.
[00:11:47] I do that for Snuggle.
[00:11:59] That means I get zero support, but I also get a few VMs to myself.
[00:12:35] ahh, ok. For this project, I think that Tool Labs would be more useful.
[00:12:47] Great. YuviPanda, what do?
[00:12:51] cool!
[00:12:51] http://tools.wmflabs.org/
[00:12:59] 'useful links'
[00:13:03] natematias, see https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request
[00:13:19] yeah, just make a request there, and I'll just add you guys right now
[00:13:29] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help has documentation
[00:13:53] Thanks! So that's different from this: https://wikitech.wikimedia.org/wiki/User_talk:Rubberpaw#Your_shell_access_was_granted
[00:14:25] natematias: indeed!
[00:14:29] ok, cool
[00:15:26] natematias, it's very confusing. I wrote a guide to help people who participate in our research hackathons: https://meta.wikimedia.org/wiki/Research:Labs2/Getting_started_with_Tool_Labs
[00:15:29] It might be helpful.
[00:15:36] But you're right at the last step :)
[00:15:56] oh cool, thanks halfak!
[00:16:26] godspeed sir :)
[00:16:39] I will go ahead and add my ssh key...
[00:23:14] Thanks YuviPanda!
[00:23:26] natematias, FYI persistence analysis is 95% done. I have stats on every token added in the revisions you linked.
[00:23:29] natematias: yw! You should be all set now after you set up your ssh key :)
[00:23:53] I plan to take a second pass to generate per-revision statistics (e.g. words_added, words_persisting, etc.)
[00:23:55] Off to do some more data analysis so I can get halfak a second batch of revids
[00:24:01] Woot!
[00:24:18] Also, it looks like I can process a set the size you gave me in about 8 hours.
[00:24:21] Oh, thanks halfak (just got the latest messages)
[00:25:43] halfak: that's encouraging. The second set is likely to be smaller. I am hoping that the comparison group revids will be in your inbox when you arrive tomorrow
[00:25:47] ttys
[00:26:03] Sounds good
[00:26:04] o/
[16:10:48] Hey hey science people.
[16:10:51] o/ Ironholds
[16:11:12] * YuviPanda is pseudoscience
[16:11:17] o/ YuviPanda
[16:11:18] :)
[16:11:27] sudo-science
[16:11:31] hehe
[16:11:32] haha
[16:11:33] :P
[16:11:43] hey halfak!
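A minimal sketch of the per-revision roll-up halfak describes above: given per-token persistence records (which revision added each token, and how many later revisions it survived), aggregate words_added and words_persisting per revision. The field names, the `(rev_id, revisions_survived)` record shape, and the persistence threshold of 5 are illustrative assumptions, not halfak's actual schema or pipeline.

```python
from collections import defaultdict

# Hypothetical threshold: a token "persists" if it survives this
# many subsequent revisions. The real analysis may define it differently.
PERSISTENCE_THRESHOLD = 5

def per_revision_stats(tokens):
    """Aggregate per-token records into per-revision statistics.

    tokens: iterable of (rev_id, revisions_survived) pairs,
    one pair per token added to the article.
    """
    stats = defaultdict(lambda: {"words_added": 0, "words_persisting": 0})
    for rev_id, survived in tokens:
        stats[rev_id]["words_added"] += 1
        if survived >= PERSISTENCE_THRESHOLD:
            stats[rev_id]["words_persisting"] += 1
    return dict(stats)

# Made-up tokens from two revisions:
tokens = [(101, 7), (101, 2), (101, 9), (102, 5)]
print(per_revision_stats(tokens))
# {101: {'words_added': 3, 'words_persisting': 2},
#  102: {'words_added': 1, 'words_persisting': 1}}
```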
[16:11:52] If you're looking for Christmas gift ideas, btw, I found the BEST THING
[16:11:59] Woot!
[16:12:02] Is it in your listy list?
[16:12:02] (not for me, but for your science-minded friends and family)
[16:12:10] Oh! tell me more
[16:12:19] https://www.etsy.com/listing/153629289/chisquareatops-t-shirt?ref=shop_home_feat_2
[16:12:26] the entire store is genius, but...goddamn. That shirt.
[16:12:38] alternately, for the more in-joke-appreciating people, https://www.etsy.com/listing/183486650/gosset-original-t-shirt?ref=related-3
[16:12:49] * halfak adds to cart
[16:12:51] (if you don't get the reference there, bad statistician! Bad! :D)
[16:13:05] actually it's a fairly obfuscated reference
[16:13:27] My best friend here is a professional scientist and even she had no idea.
[16:14:36] I didn't either. Required googling.
[16:14:46] I figured his real name wasn't Student
[16:14:49] heh
[16:14:56] Then again, I know a Dr. Doctor.
[16:15:17] the best thing, actually, is https://www.etsy.com/listing/71739287/collection-of-10-distribution-plushies?ref=shop_home_active_19
[16:15:32] I feel like we need those in R&D
[16:16:32] I have a couple of those :)
[16:16:46] I think I have the inverse chi-squared.
[16:17:03] Uniform is boring
[16:17:39] what's your favourite distribution?
[16:17:57] actually, this should genuinely be a screener question for researchers
[16:18:01] what's your favourite distribution and why?
[16:19:48] beta
[16:20:20] Because it lets me model proportions effectively at low observations. It was the first distribution outside of normal that I really grokked.
[16:20:37] I guess the t distribution might really be the first one outside of normal, but t is less interesting to me.
[16:20:45] Also, I wrote a love letter to beta.
[16:20:54] I read it!
[16:21:00] http://www-users.cs.umn.edu/~halfak/etc/I_heart_Beta/
[16:21:06] For others.
[16:21:12] Mine would be log-normal
[16:21:22] Yeah, that's a good one too :)
[16:21:29] What's your reasoning?
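A small illustration of the point halfak makes above about the beta distribution "modeling proportions effectively at low observations": with a Beta(α, β) prior on a proportion, the posterior mean after observing Bernoulli outcomes shrinks the raw proportion toward the prior mean. The counts below are invented for illustration; this is a sketch of the general technique, not halfak's actual analysis.

```python
def beta_posterior_mean(successes, trials, alpha=1.0, beta=1.0):
    """Posterior mean of a proportion under a Beta(alpha, beta) prior
    after observing `successes` out of `trials` Bernoulli outcomes.
    With alpha = beta = 1 (a uniform prior) this is Laplace smoothing."""
    return (successes + alpha) / (trials + alpha + beta)

# With only 2 "successes" out of 3 observations, the raw proportion
# (0.667) overstates our certainty; the posterior mean is pulled
# toward the prior mean of 0.5.
raw = 2 / 3
smoothed = beta_posterior_mean(2, 3)
print(raw, smoothed)  # 0.666..., 0.6
```

With zero observations the estimate falls back to the prior mean (0.5), which is exactly the low-observation robustness being praised.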
[16:21:48] all the fun stuff I study follows it: sessions, general performance information, and cache hits.
[16:22:13] it's really convenient to test for, because you just apply normal tests to log(value)
[16:22:32] and it appears everywhere interesting!
[16:23:00] see also Gibrat's law, which you introduced me to
[16:23:22] (indirectly. You were talking about animal growth.)
[16:26:20] and that is why log-normal is my favourite.
[16:26:31] ragesoss, what's your favourite probability distribution?
[16:27:17] NOOO The Wikimedia shop is closed?
[16:27:29] How do I buy wiki swag for my folks?
[17:06:17] whee, I think I worked out how to read in a gzipped sampled log file entirely through C.
[17:06:24] mo C++, fewer system() calls, mo money.
[18:07:53] Ironholds: my favorite is probably the Poisson distribution.
[18:09:40] ragesoss, why so?
[18:10:08] Years ago, I made a shirt on Zazzle or whatever and got it for my mathematician brother. The front was a bell curve, and it said "normal". The back had a squiggly line along the bottom of the curve, and it said "paranormal".
[18:11:18] but the Poisson distribution, I like just because it's a nice way to think about lots of stochastic systems.
[18:11:36] :) Random events in time.
[18:11:41] Exponential time between events.
[18:11:41] +1
[18:11:44] Poisson counts
[18:12:04] I dunno
[18:12:04] always smelt a bit fishy to me
[18:13:30] you're thinking of the piscesson distribution.
[18:14:10] no, I was making a French joke, not a Greek joke :D
[18:14:15] but same pun! /me high-fives
[18:16:08] * ragesoss high-fives back, as late to that as to intended pun.
[18:54:52] halfak, http://isocpp.org/blog/2014/12/c-has-become-more-pythonic
[18:56:53] Oooh
[20:17:24] DarTar, I've worked out how to calculate the average in-session time for [sessions > 1 page] without sacrificing efficiency.
[20:18:02] so what I'd suggest is we use that as padding and note it in the file so that we can subtract it if necessary.
[21:01:25] argh.
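A quick demonstration of the log-normal testing trick Ironholds mentions above ("apply normal tests to log(value)"): if values are log-normal, their logs are normal, so any normality check works on the log scale. As a sketch, the synthetic data below (a hypothetical stand-in for session lengths; the parameters are made up) is heavily right-skewed, with mean well above median, while on the log scale mean and median roughly agree, as expected for a symmetric distribution.

```python
import math
import random
import statistics

# Hypothetical log-normal data: exp of draws from N(mu=1.5, sigma=0.75).
random.seed(42)
data = [math.exp(random.gauss(1.5, 0.75)) for _ in range(10_000)]

logged = [math.log(x) for x in data]

# Raw scale: mean is pulled far above the median by the long right tail.
print(statistics.mean(data), statistics.median(data))

# Log scale: mean and median both sit near mu = 1.5, consistent with
# normality; a formal test (e.g. Shapiro-Wilk on the logs) would be
# the rigorous version of this check.
print(statistics.mean(logged), statistics.median(logged))
```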
this is going to be such a pain.
[21:01:31] I have to completely rebuild how we calculate sessions.
[21:04:23] Bummer. What happened?
[21:05:13] we don't want to attempt to estimate for one-event sessions/one-event users, which is fine.
[21:05:45] But we do want to be able to provide a useful estimator of the average intertime within a session, so we can append it to the end of sessions for calculating session length
[21:06:09] i.e., for the dataset overall
[21:06:17] and that requires...pretty much a complete rebuild, because prior to this I'd very deliberately only looked at things in terms of actions by users
[21:06:21] not sessions in datasets.
[21:07:03] My proposal: take a pass of the data noting the number of actions in a session and the timestamps of the first and last events.
[21:07:28] yeah, that's basically what I'm doing
[21:07:33] but it involves dividing things into sessions to do it.
[21:07:36] Take a second pass over the dataset looking for sessions with more than one action -- increment actions and time between start and end events.
[21:08:02] Learn the global mean (which I expect to be the mean of the left-most distribution).
[21:08:07] Really, we could just use that.
[21:08:11] I have the params in the paper.
[21:08:11] And running into edge cases of "okay, if intertime[9] is > threshold we should split, but if intertime[10] is > threshold we should also append a new vector consisting of the value we've picked for a null here"
[21:08:36] because the utopian output contains "and this is how many sessions we're missing"
[21:08:45] "missing"?
[21:09:12] one-event sessions we can't usefully estimate the counts around.
[21:09:54] Hmm.. I guess I'm missing the problem with that bit. Want to get on a call quick and brain together?
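The two-pass approach halfak proposes above can be sketched as follows: split a user's sorted event timestamps into sessions at a fixed inactivity threshold (pass one), then summarize action counts and start-to-end spans for sessions with more than one action (pass two). The timestamps and the 3600-second threshold are purely illustrative; the real analysis uses its own data and cutoff.

```python
# Hypothetical inactivity threshold (seconds) that ends a session.
SESSION_BREAK = 3600

def sessionize(timestamps, threshold=SESSION_BREAK):
    """Group sorted, non-empty event timestamps into sessions,
    starting a new session whenever the gap exceeds `threshold`."""
    sessions, current = [], [timestamps[0]]
    for prev, ts in zip(timestamps, timestamps[1:]):
        if ts - prev > threshold:
            sessions.append(current)
            current = []
        current.append(ts)
    sessions.append(current)
    return sessions

# Made-up event times for one user, in seconds.
events = [0, 30, 90, 5000, 5020, 9000]
sessions = sessionize(events)

# Second pass: (action count, start-to-end span) for multi-event
# sessions only; one-event sessions are counted but not estimated.
summary = [(len(s), s[-1] - s[0]) for s in sessions if len(s) > 1]
one_event = sum(1 for s in sessions if len(s) == 1)
print(sessions)   # [[0, 30, 90], [5000, 5020], [9000]]
print(summary)    # [(3, 90), (2, 20)]
print(one_event)  # 1 -- "this is how many sessions we're missing"
```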
[21:09:58] we tried approximating session length for people who only visited 1 page, from people who visited multiple pages, and the resulting number gives me the willies
[21:10:09] I would, but my brain is currently toast :(
[21:10:11] Oh. meh. It's an approximation.
[21:10:25] I'm not sure I'd be any use, or capable of ingesting much of your usefulness
[21:10:48] And a good one. It assumes that people who view just one page in a session are doing similar things on that page as people who visit multiple pages.
[21:10:59] * halfak feels conflicted about being ingested.
[21:11:02] :P
[21:11:04] haha
[21:11:48] I'm actually going to call it done after the research group meeting. Braining is not my strong point today.
[21:32:33] halfak, Ironholds, few minutes late. connection issues
[21:32:43] kk
[21:32:51] plotting against you guys
[22:01:48] halfak, can I check I'm not a crazy person?
[22:02:02] so we want to calculate the average intertime within a session, to use as padding.
[22:02:20] What would happen if I just calculated intertimes and then avg(intertimes[intertimes <= session_break_threshold])?
[22:04:10] Sounds good to me.
[22:04:21] In fact, I think that might be how I did it. :S
[22:07:23] * Ironholds nods
[22:07:29] that...is a lot easier. Okay. Cool.
[22:07:52] then I don't have to write much new code at all. Wheee.
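The shortcut Ironholds proposes at 22:02 can be written as a few lines: compute every inter-event gap across the sorted log, then average only the gaps at or under the session-break threshold. No explicit sessionization is needed, which is why it requires so little new code. The threshold and event times below are illustrative assumptions, not the project's real values.

```python
# Hypothetical session-break threshold, in seconds.
SESSION_BREAK = 3600

def mean_within_session_gap(timestamps, threshold=SESSION_BREAK):
    """Average inter-event time over gaps that fall within a session,
    i.e. the Python equivalent of avg(intertimes[intertimes <= threshold])."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    within = [g for g in gaps if g <= threshold]
    return sum(within) / len(within)

# Made-up event times: gaps are 30, 60, 4910, 20, 3980 seconds;
# only 30, 60, and 20 fall within the threshold.
events = [0, 30, 90, 5000, 5020, 9000]
print(mean_within_session_gap(events))  # (30 + 60 + 20) / 3 = 36.66...
```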