[00:01:20] halfak: sweet
[00:01:35] I pass this dataset along to a bunch of people at the workshop
[00:01:52] http://datasets.wikimedia.org/public-datasets/enwiki/etc/pmids.pmcs.articles.20141008.tsv
[00:01:56] including the person who runs Europe PMC (the EU-based mirror of PubMed Central)
[00:01:59] Sweet
[00:02:10] they are interested in adding backlinks to Wikipedia
[00:02:19] That's a brilliant idea.
[00:02:23] which made my day
[00:02:50] I can cc you on the thread if you’re interested
[00:08:09] heh. I just thought
[00:08:18] one of my annual review goals was "learn Python and get better at programming"
[00:08:25] do y'think they'll accept C++ as a valid substitute? :P
[00:31:03] argh. Stupid C++.
[00:31:20] find works based on indices. Does substring work based on indices? ONLY SORT OF.
[02:05:44] why does Carson Sievert sound familiar as a name
[15:02:29] hey YuviPanda. The time has come. I'm working on a celery tutorial.
[15:02:57] halfak: woah.
[15:03:07] you’re making a celery tutorial or reading one?
[15:03:45] reading one.
[15:03:59] I'm going to have to think in distributed queues. Celery seems like a good place to start.
[16:19:00] halfak: ah, interesting.
[16:19:03] halfak: what for, btw?
[16:19:28] https://meta.wikimedia.org/wiki/Research:WikiCredit
[16:19:30] :)
[16:19:49] halfak: aaaah, nicee :)
[16:20:07] halfak: perhaps it was me being ‘spoilt’ by java, but celery is surprisingly nice :)
[16:20:26] although it could be better from an ops perspective (monitoring broken)
[16:20:32] but otherwise quite nice
[16:20:50] halfak: I’m puppetizing Magnus’ WDQ now, we should do that for this too once you’re somewhere close to deployment
[16:20:54] assuming this even runs on labs
[16:22:12] It will :)
[16:25:21] yay :)
[16:25:28] halfak: although right now we’re constrainted in labs on CPU power
[16:25:40] Understood. That's a problem I'm working on.
[16:26:04] halfak: but, we’ve 3 new machines adding about 1.2TB of *RAM* plus another 72 cores, so should be fine
[16:26:24] halfak: no, I meant, we have too many VMs for the underlying hardware to handle
[16:26:54] the machines should be online in a few days / next week
[16:27:32] Gotcha.
[16:29:43] bleeh
[17:15:58] Eugh. Sleep is not my friend.
[18:26:27] sweet!
[18:26:35] halfak, got the C++ to output a list of lists of vectors
[18:26:39] IOW we can do per-user parsing
[18:27:13] Does that mean that you have user --> sessions --> revisions?
[18:27:47] user sets --> user --> sessions
[18:27:55] user sets?
[18:28:13] I throw in 300 distinct users' events, as a list of vectors, each one consisting of userN's events
[18:28:21] at the moment, I get back a list of vectors, each containing a session
[18:28:39] Why not just have the function handle a single user's events and let me look?
[18:28:39] now we get back a list of lists, each representing one user, containing vectors representing each session
[18:28:41] *loop
[18:28:50] because looping in R is slow as shit
[18:28:58] although that wouldn't be much of a pain to set up. Yeah, I could do that.
[18:29:23] Wait, am I responsible for grouping a user's events before I hand off the data?
[18:30:09] yes?
[18:30:19] split(data$timestamp,data$userid), in R.
[18:33:22] I'll see how fast it is with lapply()
[18:33:26] that'd be a good way of doing this
[18:35:35] I'm curious about that lapply.
[18:36:02] But I can see how the other way can make sense too. Does split() end up duplicating the data in memory?
[19:00:21] halfak, I mean, sorta? Depends if you overwrite the initial object
[19:00:24] and lapply is crazy-fast
[19:00:29] so we're probably good.
[19:25:41] Ironholds, is the general filter description at R:Page_view up to date?
[19:26:28] I mean, sort of? It's deliberately veeery vague, because toby wanted an easy-read summary that didn't list every regex
[19:28:15] I think that's fine. :)
[19:28:19] I just wanted to check.
[19:28:26] * halfak hacks up email to analytics.
[19:31:17] https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
[19:31:36] Ironholds, ^ looks like it filters internal requests. I thought we weren't going to do that.
[19:33:34] Ironholds: in case you’re wondering, check-in with mobile is canceled since Maryana is away
[19:33:45] I told Dan to reach out if they have any question
[19:34:13] DarTar, oh.
[19:34:18] I've been sat here for the last 10 minutes
[19:34:26] if a crucial attendee is not coming, could we just /cancel the meeting/?
[19:34:30] halfak, I'm trying to join the call and it's telling me I can't because I'm not on G+.
[19:34:34] because the email notification does not tell you who's there or not
[19:34:50] Ironholds: sure, sorry about that
[19:35:04] I was actually not sure the meeting was canceled until a minute ago
[19:36:35] thanks
[23:13:05] LOL, feels familiar? http://meta.serverfault.com/questions/6701/server-fault-needs-professional-quality-questions-not-just-questions-from-profe?cb=1
[23:37:30] huh, that's interesting
[23:55:20] Ironholds, what is the state of this work? https://meta.wikimedia.org/wiki/Research:Session_analysis
[23:55:45] reason asking is that if there are newer findings, we may want to use it in session analysis Bob is doing
[23:56:49] http://arxiv.org/abs/1411.2878 is the most recent thing
[23:57:03] the current status is that we need to get pageviews done so Aaron and I have cycles to work on the standardised documentation
[23:57:15] I am distinctly, in my spare time, building a standardised toolkit you are of course welcome to use :D
[23:58:44] hokay! :D so, the tl;dr is that we should use 1 hour according to your research, for readers?
[23:59:03] it depends!
[23:59:18] for the mobile web and the desktop, yes, we found about an hour
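
Editor's sketch, not part of the log: the per-user grouping and sessionizing discussed at 18:26-19:00, combined with the roughly one-hour cutoff mentioned at 23:58-23:59, might look something like the Python below. This is not the C++/R code referred to in the chat. The function names, the (user_id, timestamp) input shape, timestamps in seconds, and the choice to treat a gap of exactly one hour as the same session are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumption: "about an hour" of inactivity ends a session (value taken from the chat).
SESSION_GAP_SECONDS = 3600


def split_by_user(events):
    """Group (user_id, timestamp) pairs by user, similar in spirit to R's split()."""
    by_user = defaultdict(list)
    for user_id, timestamp in events:
        by_user[user_id].append(timestamp)
    return dict(by_user)


def sessionize(timestamps, gap=SESSION_GAP_SECONDS):
    """Cut one user's timestamps into sessions wherever the gap exceeds `gap`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # first event, or gap too large: new session
    return sessions


def sessions_per_user(events, gap=SESSION_GAP_SECONDS):
    """Return {user_id: [session, ...]}, each session a list of timestamps."""
    return {user: sessionize(ts, gap) for user, ts in split_by_user(events).items()}


if __name__ == "__main__":
    # Hypothetical events: user "a" has two bursts of activity, user "b" has one event.
    demo = [("a", 0), ("a", 100), ("a", 5000), ("b", 10)]
    print(sessions_per_user(demo))
```

Handling one user at a time (the lapply() approach raised at 18:33) keeps the sessionizing function simple; whether that or the batched "list of lists" interface is faster would need measuring on real data.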