[00:29:53] Ironholds, halfak, I have received a calendar invite for WMF research group meeting for Thursdays, exact same time as our other research meeting. Do you know what's going on? [00:30:04] Nope. [00:30:15] It's bob! [00:30:24] :D [00:34:36] halfak: you're the second person I know to say "It's bob". what gives/ [00:34:36] ? [00:34:56] we know someone called bob? [00:35:05] also, C++, /why do you have so many unspoken rules/. [00:35:09] Oh. Leila asked about a duplicate calendar event set up by Bob West. [00:35:17] I am supremely glad that I compile with "treat warnings as errors" set to true. [00:35:34] in related news, halfak, getting the framework together tonight :) [00:36:08] "the framework"? [00:36:17] I just checked with him halfak. apparently things got confuzzled while copying the event to local calendar. he's gonna remove the extra event. [00:36:28] halfak, today I set up the class system and all the objects for the C++ utility [00:36:33] and make sure they don't immediately explode [00:36:35] leila, no worries. I figured. [00:36:36] tomorrow I write the logic in. [00:36:43] wooo! [00:36:51] and what logic it is [00:36:54] * Ironholds grumbles at WSC [00:37:01] If we have that running by the weekend, we'll have no problem getting the difference analysis done by Friday :) [00:37:20] well, why do we want it done by Friday? The query won't go live until at least then. [00:38:00] Because that's the only substantial concern I have seen raised. [00:38:19] that we won't have a tool for interpreting the results? [00:38:57] No -- just that we don't have empirical observations of the differences between WSC and the new def. [00:40:06] ahhh [00:40:30] Okay. I mean, I don't think that's must-be-done-by-the-12th urgent; the two filters will be running in parallel [00:40:37] but it's definitely something we should do, yep [00:44:21] I don't see this as blocking us. [00:44:34] I've been careful not to suggest that it might. [00:45:01] Say... 
speaking of potential blockers. We (you) are basically done with the definition, right? [00:45:08] pretty much [00:45:11] I mean: I know it's broken. [00:45:16] I even have some suspicions as to where. [00:45:27] But I want to do this whole "empiricism" thing I've been hearing about [00:45:38] evidently some dude called Bacon came up with it? I dunno. Bit new-fangled for me. [00:45:49] Bacon!? Must be good. [00:49:38] haha [00:49:50] halfak, did I ever tell you that the web engagement lead at the Royal Society is genuinely called Francis Bacon? [00:49:52] It's sort of incredible. [00:51:06] What does the Royal Society do? [01:11:22] halfak, they are the pre-eminent scientific institution of the United Kingdom! [01:11:33] https://en.wikipedia.org/wiki/Royal_Society I wrote their article [01:11:43] What does a pre-eminent scientific institution do? [01:11:54] Are they a big government research lab? [01:32:53] halfak, sorta...a lot. [01:33:01] so, lots of grants for scientists, they run several prominent journals [01:33:08] their members act as formal scientific advisors to the government [01:33:27] and they have a 400 year archive of everything from Newton's personal possessions to Hooke's secret journals. [01:33:29] * Ironholds slavers [01:34:18] also they invented peer review [01:39:18] * Ironholds blinks [01:39:33] rdaiccherlb, where did the ideas for the all our ideas survey come from? [02:08:25] Hi Ironholds, the ideas came from the pre-existing gadget and tools lists. We entered in some from lists that users have made that were requests for enhancements (I am fighting a cold & about to get offline but could pull them later) [02:08:37] rdaiccherlb, okay, cool. [02:08:44] So basically, all pre-existing gadgets [02:08:47] * rdaiccherlb sneezes [02:08:48] and, next question: do the responsible departments know these are on the list? 
[02:09:07] because right now it reads as "if people want us to build out stats.grok.se that'll be on whoever's plate, probably analytics". [02:09:13] and that worries me. [02:09:41] Ah - I will take a look on that - the reference to stats.grok.se was an example. I will look [02:09:52] thanks [02:09:58] Apologies! (I know analytics has a huge workload!) [02:10:03] it's okay! [02:31:55] rdaiccherlb, if you need help with pre-population AOI survey, send me a message. [12:29:30] halfak: I heard something today that if it happens will probably make you very, very happy. [13:44:35] Oh YuviPanda? What's up? [13:46:07] halfak: So... We could replicate our mysql dB's to postgres for labs [13:47:24] Interesting. Is it actually better to have two systems? E.g. is postgres easier to manage for ops? [13:48:22] halfak: well we already have postgres and we are considering moving to a replication system called tungsten which also does mysql to postgres replication [13:48:32] So if we do move to tungsten then very little hassle [13:48:37] Our mysql stuff won't go away [13:50:54] Gotcha. That would be cool! It's been a while since I did my wiki work in postgres. [13:51:26] back in the day, I to process the DB dumps and load that into a postgres DB. [13:52:31] *used to [13:52:42] * halfak still hasn't had his coffee yet. [14:11:31] I hate C++. I hate C++. I hate C++. [14:11:41] Netbeans cannot work out where the boost libraries live. [14:14:24] Just remember how painful this was once you figure things out and make sure you post some stuff on the internet to help out the next programmer. [14:15:35] Speaking of posting things online, Ironholds, your site is awesome. :) [14:17:14] it is? [14:17:21] and I will. and my post will be titled: [14:17:25] R IS SMART, BOOST IS STUPID. [14:17:38] A lot of my C++ problems come from writing in R. [14:18:06] Like, of course there's NSE. Of course it'll construct a makefile for me. Of course all I need to do is specify PKG_LIBS to include a library. 
[14:19:42] and then you spend 40 minutes trying to find the libboost executable and cry. [14:20:50] Hey Ironholds, I just read http://writing.jan.io/2014/11/21/hateful-people.html as linked from one of your blogs. [14:21:00] And I'm a bit frustrated that the example used is a bad one. [14:21:24] howso? [14:23:15] * halfak moves to PMs so as to not be quoted out of context [14:23:38] /usr/share/? AW MOTHER- [14:23:51] all technology is built on barely-held-together crap. [16:24:34] halfak, where do you document your work on edit quality? [16:25:21] leila, which work on edit quality are you talking about? [16:32:58] leila, ping [16:37:05] sorry halfak. you are working on defining a metric for quality of contribution, right? [16:37:54] More working with one that I first published about in 2009 ;). See https://meta.wikimedia.org/wiki/Research:Content_persistence for some documentation on how it works. [16:38:04] thanks! [16:38:43] See http://www.opensym.org/os2014-files/proceedings/p609.pdf for an evaluation of how many revisions a token should persist before being considered "good". [18:09:48] halfak, sessioniser done :D [18:10:01] Noo Stop calling it that :P [18:10:17] std::list < std::vector < int > > sessionise(std::vector < int > timestamps, int local_minimum = 3600) [18:10:21] function definitions don't lie. [18:10:29] Unless they're defined outside of the class. Then they sometimes lie. [18:11:42] hmn. wait [18:11:47] halfak, I'm an idiot and you're a genius. [18:12:12] eehee [18:12:29] \o/ [18:12:51] if I have a dedicated, vectorised sessioniser, calculating page length and everything else becomes SO MUCH MORE TRIVIAL [18:13:07] wanna work out how many pages in each session? lapply(sessioniser(timestamps),length). [18:13:51] :) [18:14:28] discounting single-page sessions? if(length(x) == 1){return(-1)}. [18:14:39] ..I wanna just rewrite this entire thing now. So I probably will. [18:15:10] Refactor Yak [18:16:13] hmn? 
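For reference, the splitting behaviour implied by the sessionise signature quoted above can be sketched roughly as follows. This is a Python sketch rather than the C++ utility itself, and it rests on assumptions the log does not confirm: timestamps are integer seconds, the input may arrive unsorted, and a gap strictly greater than local_minimum starts a new session.

```python
from typing import List


def sessionise(timestamps: List[int], local_minimum: int = 3600) -> List[List[int]]:
    """Split timestamps (in seconds) into sessions at gaps larger than local_minimum."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        # Open a new session for the first event, or when the gap since the
        # previous event exceeds the threshold (the strict ">" is an assumption).
        if not sessions or ts - sessions[-1][-1] > local_minimum:
            sessions.append([ts])
        else:
            sessions[-1].append(ts)
    return sessions


# Pages per session, the equivalent of lapply(sessioniser(timestamps), length).
sessions = sessionise([0, 100, 200, 5000, 5100, 20000])
pages_per_session = [len(s) for s in sessions]
```

Once sessions come back as a plain list of lists, per-session metrics such as page counts reduce to a single pass over the result, which is the simplification described in the log.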
[18:18:58] It's a specific subtype of Yak Shaving [18:21:05] ahh [18:21:09] next question, halfak [18:21:14] let's say you had a list of vectors of timestamps. [18:21:23] each vector being timestamps for [user] [18:21:55] would you find it more inconvenient to get back a list of vectors of timestamps, split by session, or a list of (lists of vectors of timestamps, split by session), split by user? [18:22:09] i.e., when you fire a set of timestamps into this, do you care who the sessions belong to? [18:22:26] at the end? [18:22:30] meeting :\ back soon [18:23:47] kk! [19:21:31] Ironholds, forgot to tell you I was back. [19:21:36] * halfak reads [19:56:46] halfak, no thoughts? [19:57:13] Oh yeah.. Forgot to reply. So... I don't think that grouping by user is that valuable to me. [19:57:25] I can do that myself (using your utility) if I want to. [19:58:09] As for list vs. vector, I think that list makes the most sense since you'll be growing it as you compute. [19:58:19] awesome! [19:58:25] that's what I built. [19:58:29] I'd like to talk to you about "generators" to see if there's something equivalent in C++ [19:58:39] you throw in a list of users' timestamps, it outputs a list of sessions, not caring about who they were associated with. [19:58:53] I mean: it cares when sessionising, but not when throwing the sessions into the output hopper. [19:59:12] and if you really, really care, you can lapply(list_of_users,single_sessioniser) and get it that way. [19:59:31] C++ may have em, I just probably don't know they exist :D [19:59:49] Ahh. So I was going to suggest that building an object in memory might not be that useful. [20:00:00] Or rather, building the collection in memory. [20:00:26] e.g. the cluster() function I have in python doesn't build a list. [20:00:30] But you can if you want to. [20:00:51] http://pythonhosted.org/mediawiki-utilities/lib/sessions.html#mw-lib-sessions [20:01:08] It actually just spits out a session as you ask for them. 
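The lazy approach described here, where sessions are produced one at a time instead of being collected in memory, can be sketched with a Python generator. This is not the mediawiki-utilities implementation: the name iter_sessions and the cutoff parameter are hypothetical, and the sketch assumes the timestamps arrive in ascending order.

```python
from typing import Iterable, Iterator, List


def iter_sessions(timestamps: Iterable[int], cutoff: int = 3600) -> Iterator[List[int]]:
    """Yield one session at a time from timestamps sorted in ascending order."""
    current: List[int] = []
    for ts in timestamps:
        # A gap larger than the cutoff closes the current session.
        if current and ts - current[-1] > cutoff:
            yield current
            current = []
        current.append(ts)
    # Emit the trailing session, if any events were seen at all.
    if current:
        yield current


# Count multi-page sessions without ever holding the full list of sessions.
events = iter([0, 100, 200, 5000, 5100, 20000])
multi_page = sum(1 for session in iter_sessions(events) if len(session) > 1)
```

Only the session currently being built is held in memory, so a caller that discards a session pays nothing for it afterwards.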
[20:01:30] If you decide you don't care about this particular session, you just ignore it and no memory is used. [20:02:17] I can see how this would be a pain anywhere but python. [20:02:31] Python seems to have invented the generator pattern that I'm discussing. [20:02:38] oh, it'd be hella-convenient in C++. Piping is alive and well. [20:02:47] It'd just not translate to R too well (streaming is not a thing :() [20:02:55] Oh yeah. I forgot about that. [20:03:00] I watched a great Hadley Wickham interview last night talking about R that covered this. [20:03:16] He's a fascinating dude. With a weird accent. [20:03:31] (Looks like Icon might have invented generators) [20:39:40] Ref-yak-toring :) [20:40:29] When you start cleaning up one of your libraries rather than using it to do the thing you were trying to do. [20:40:51] halfak: productive procrastination is a go? [20:41:09] :) o/ fhocutt_ [20:41:21] hey halfak [20:41:21] I heard that you are doing some work on HostBot [20:41:39] I am! Working on making the Co-op space go [20:42:34] J-mo tells me that there's a bunch of groups that want to use HostBot to contact newcomers. We had a brief chat about a 2.0 rewrite. [20:42:54] neat! [20:42:58] I've got a CS grad student at northwestern with some time to devote to it too. [20:43:10] Would you be interested in doing some planning after the holidays? [20:43:22] quite possibly! I'm not sure what I'll be working on at that point. [20:43:34] Gotcha. Is this contractor work right now? [20:43:41] it is. [20:44:28] cool. I guess we'll see :) [20:44:54] yeah, sounds good :) [20:50:34] halfak, do email me when planning starts though! [20:50:48] Totally will do. :) [20:56:10] Ironholds, https://twitter.com/halfak/status/540972862145069056 [20:56:12] Approve? [20:56:19] Ironholds, I may be missing an earlier email thread about this: can you remind me how we handle https requests in the logs? [20:56:23] halfak, nice :D [20:56:34] leila, define "how we handle"? 
[20:56:55] will we have logs for https requests in webrequest? [20:56:58] they're liable to have their referers stripped, either entirely ("-") or mostly ("http://google.co.uk") depending on whether it's SSL to SSL or non-SSL to SSL. [20:57:08] so yes, we'll have them, but the referer chaining becomes much more of a pain [20:57:14] and sometimes not possible [20:57:55] so, is there a clean way to say if a request was https? like for example, through http_status? [21:00:07] oh, no. [21:00:09] that would be too fun [21:00:17] for some (some) mobile requests, there is a https=1 flag in x_analytics. [21:00:42] for all other requests, your only options are to look at the IP address it came from. If it is a request from an SSL terminator, with an x_forwarded_for field filled in, then it is HTTPS [21:00:54] the problem: we don't /actually have a list/ of what the SSL terminators' IP addresses are. [21:01:00] Welcome to Pageviews. This is why I'm going gray. [21:03:18] Who could we ask about these SSL terminator IPs? [21:03:42] someone in Ops? Last time I asked them they gave me an ever-shifting puppet manifest, and a headache. [21:04:44] * halfak tries again anyway. Advil on hand. [21:14:48] OK. Leila, I have a methodology for obtaining IPs [21:15:51] Presumably we only really care about requests that come in via "text" right? [21:16:06] and possibly mobile? [21:16:16] also, halfak, mind writing the methodology up somewhere for future use? [21:16:31] (not right now, obviously ;p) [21:16:45] Good call. I'll drop some notes together immediately. [21:17:41] http://etherpad.wikimedia.org/p/ssl_terminators [21:18:03] Now to figure out how to parse puppet [21:18:21] Ha! https://github.com/pradels/puppet-parser [21:33:17] Leila, I forgot to say that I have code for you that will generate content persistence measures using the Wikipedia API. [21:33:31] It's slow (because it needs to work against the API), but it can do about 50k per day. 
[21:33:44] So you can get a large sample in a somewhat reasonable amount of time. [21:34:06] I'm hoping to have persistence data for whole wikis soon, but that's a couple weeks out at least. [21:37:52] https://en.wiktionary.org/wiki/forget,_when_up_to_one%27s_neck_in_alligators,_that_the_mission_is_to_drain_the_swamp [22:06:25] Ironholds, sorry. I went to a meeting. reading your comments. [22:08:29] halfak, reading your response (gradually going through IRC. ;-) ) [22:09:09] I just jumped into a meeting too. Should be done in ~50 min [22:34:08] leila, no problem:) [22:37:39] Ironholds, you stole the task? [22:37:39] :D [22:37:55] you're great! thanks. [22:37:56] I'm gonna write me a parser. [22:38:04] It's gonna be in C++ and it's gonna be /great/. [22:38:11] well, mostly in C++. The tokenizer will be. [22:38:13] aaaaawesome. [22:42:52] J-Mo, you fail! [22:42:59] you gave AnnaK redshirts. Beat you to it :P [22:43:09] I'm not sure when recommending Scalzi to people became a competition. [22:43:11] yeah, saw that. great minds and all that, I guess [22:43:34] well, he is kind of hipster sci-fi. "I read him back when he was real obscure" [22:49:21] I didn't! [22:49:26] I got into him through his blog being tweeted about [22:56:51] Ironholds: have you read Iain M. Banks? [22:57:10] I only just got into him a few months ago [22:57:23] I read The Hydrogen Sonata, which was GREAT, and since then have been indulging my comic nerd traits. [22:57:39] oh, you're lucky. You have so much to look forward to. [22:58:31] I can recommend some comics to even the load, if that'd help? :p [22:59:56] by all means [23:01:47] Ironholds, do you know what cache_status = hit or miss mean? (you probably do) [23:02:07] yes! [23:02:21] that's just whether the cache had a decent copy available, or if it had to go retrieve one from the PHP backend [23:02:30] not worth paying attention to for pageviews purposes, afaik [23:02:38] got it. thanks! [23:04:40] urgh. 
This C++ file is getting out of hand [23:04:45] I guess I could split it up. I /guess/. [23:08:10] halfak, you know what I think we need? [23:08:16] I think we need a CRAN library called sessioniser :D [23:08:26] * halfak growls and stomps feet [23:08:42] Just got done with meeting. What'd I miss? [23:09:02] Leila, still interested in edit quality measures? [23:12:25] If you give me a file of rev_ids, I can generate persistence measures for you. Or I can give you code. [23:13:11] hurm [23:13:19] halfak, just thought of a situation where we do care about who owns sessions [23:13:21] sessions-per-user. [23:14:01] I'll keep track of my own index. [23:14:08] I'd use a hash map [23:14:32] And I'd increment some ints [23:14:43] fair! [23:15:47] halfak, also, that XML dump parser? [23:15:49] SUPER USEFUL. [23:15:55] I may have to learn python just to use this. [23:16:32] :) That's the hook. [23:16:35] halfak, thanks! I'm reading through your methodology. At this point, the knowledge of what you have done is all I need. :-) I'll ping you about the code further down the line. thanks for offering to help. [23:16:48] np leila [23:17:14] Ironholds, that xml parser has pulled people to python -- and even more importantly, it's pulled many of them from python 2 to python 3. [23:17:22] Guido would be proud :) [23:25:22] I challenge any of you, by the way, to listen to Hadfield's "Space Oddity" cover and not feel like the universe is an awesome place. [23:30:14] J-Mo, just emailed Scalzi [23:30:24] lol [23:30:29] gonna get an updated photo for wikipedia and hope to gods he doesn't send me the Infamous Buttercream Shot [23:37:42] * halfak wonders what he would do if he had a wiki article that needed a recent photo [23:37:54] I think a silly hat would be in order. [23:38:20] halfak, oh, you haven't seen the buttercream shot? [23:38:45] nope [23:38:51] http://farm9.staticflickr.com/8438/8004421093_82e4c4bd9d_c.jpg [23:39:02] TL;DR Neil Gaiman and a bunch of Derby Girls were bored. 
[23:39:18] result; John Scalzi, covered in buttercream. [23:39:36] lol what [23:39:38] That's awesome [23:40:29] ...WTF. [23:40:30] WTF R. [23:40:35] DID YOU JUST DO SOMETHING STUPID. [23:41:48] halfak, can you test something for me? [23:41:50] I appreciate the period at the end of that question [23:41:51] as.integer(20140107000001) [23:41:51] Sure [23:42:04] I would like to check I have not, in fact, had a stroke. [23:42:16] NA [23:42:36] Warning message: NAs introduced by coercion [23:42:54] what. the fuck. [23:43:07] Too big for a 32 bit int [23:43:21] ...aaaaugh [23:43:22] 2^33 = 4294967296 [23:43:27] woops [23:43:34] That should have read 2^32 [23:43:45] hmn. So, doubles it is. [23:43:48] I think we might only get 31 for signed ints though [23:43:50] can I throw doubles into C++? [23:43:52] Does R not have longs? [23:43:53] only one way to find out.. [23:43:58] oh. oh oh oh. [23:43:59] :D [23:44:06] so, you know how there's a maximum limit to vector length? [23:44:16] you ever noticed how it's 2^31? Weeeeell.... :D [23:44:22] Negative vector lengths? [23:44:29] Why would you need that? [23:44:59] you wouldn't! In fact, that's the error you get (or used to get) when you tried to create a vector bigger than that. [23:45:08] R has longs. As of, ooh, the most recent release version. They only exist in some functions. [23:45:21] NO NEGATIVE VECTOR LENGTHS, STUPID USER [23:45:32] and so the vector length problem is still there, even though it now uses longs internally, because if you try to pass a long-indiced vector into half of baseR's functions, it'll break. [23:47:12] and now we find out if Rcpp chokes on doubles [23:47:15] * Ironholds purses fingers. [23:49:44] wait, this is an artificial problem [23:49:56] I need the numeric values of the timestamps, not the numeric representation. We're good! [23:53:37] :) [23:53:48] OK. I'm off for the evening. Have a good one folks. [23:53:52] o/ [23:55:16] take care! 
[23:55:19] oh, halfak, before you go: [23:55:27] did you know underscores are used for assignment in S-Plus? :D [23:55:42] and that's why R sometimes uses foo(param.name) and sometimes foo(param_name). Dating.
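The as.integer() result discussed earlier (around 23:41) comes down to range arithmetic, which can be checked with a short Python sketch. It assumes the value 20140107000001 is a YYYYMMDDHHMMSS timestamp in UTC, which the log suggests but does not state; the variable names are illustrative only.

```python
import struct
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1  # largest signed 32-bit integer

raw = 20140107000001  # timestamp written out as digits

# The digit form is far outside the signed 32-bit range.
fits_int32 = -INT32_MAX - 1 <= raw <= INT32_MAX

# Packing it as a 4-byte signed integer fails for the same reason.
try:
    struct.pack("<i", raw)
    packs_as_int32 = True
except struct.error:
    packs_as_int32 = False

# A double stores it exactly: every integer up to 2**53 is representable.
exact_as_double = int(float(raw)) == raw

# The underlying value (seconds since the epoch) is small enough for int32.
parsed = datetime.strptime(str(raw), "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
epoch_seconds = int(parsed.timestamp())
epoch_fits_int32 = epoch_seconds <= INT32_MAX
```

This matches the conclusion reached in the log: the overflow comes from treating the digit string as a number, while the timestamp's actual value in seconds fits comfortably in a signed 32-bit integer.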