[09:09:32] o/ Ironholds
[10:25:00] hey halfak :)
[10:25:28] I think you should know, my mom thinks you are cool.
[10:25:43] your mom?
[10:25:52] well, your mom is lovely, so I will accept that very nice compliment.
[10:26:12] thank you, wholeaker!
[10:26:13] I have been talking to her about working with my collaborators in the UK and she asked, "Your collaborator isn't Oliver, is it? If you see him be sure to greet him well for me! He's darling! I could listen to him all day long!"
[10:26:20] This is like a reverse your-mom joke.
[10:26:21] (I mean, it's half material each, so for you to be a halfaker..)
[10:26:30] bahaha. Wow.
[10:26:39] I had no idea I'd made such a good impression. Or the accent maybe?
[10:26:40] She was impressed by your wikipedianness at my wedding. :)
[10:26:43] ahhh
[10:26:47] okay, that makes sense.
[10:26:54] And yeah, we suck at being mean
[10:27:15] "I'm going to compliment your mother!" "Oh yeah? Well I'm going to pass compliments on to you FROM my mother!" "FINE"
[10:27:34] So yeah... my mom says hi...
[10:27:36] lol
[10:27:43] you just made me quite literally lol.
[10:34:50] halfak, got a quick moment for a sanitisation think-through?
[10:36:04] bah, okay, got to head out for 5. Back in.. well, 5.
[10:40:24] k
[10:40:34] Will be around for 20-40
[10:45:50] back!
[10:45:59] OK. What's up?
[10:46:07] so: I have a dataset of page - country - user - count
[10:46:17] I want to turn it into page - country - percentage(page/country)
[10:46:23] in a way that doesn't compromise privacy.
[10:46:42] So, the sort of mental rules I've set thus far: any page with <2 users cannot be sanitised and should be removed.
[10:46:56] Any page/country combination with <2 users, roll it into the next greatest one and mark "Other"
[10:47:25] in /theory/ that should solve for it. But you know the theory much better than I do :D
[10:49:26] If I know how many views happened in a certain timespan, then I can use the rate captured in percentage to get the raw number.
[10:49:48] Worse: it's edits.
[10:50:16] and you could also, e.g., say that you know [editor1] edited these N pages, which [editor2] has in common with them
[10:50:46] and look for the countries made available, versus marked in "Other", and use that to identify [editor1] or [editor2]'s country
[10:50:52] these are attacks I'm not sure how to protect against
[10:51:06] I mean, it's for Heather and Brent, so we're safe in practice. But I'd like to be safe in theory too.
[10:51:18] (and I think we probably want to reopen the how-do-we-get-people-NDAs conversation at some point)
[10:52:23] the alternative approach would be: they want to boil it down to how many edits associated with a page in a country come from that country.
[10:52:40] So I could always exclude pages with only 1 country represented, and then divide into [country of page]/other
[10:53:38] which should protect anonymity at the cost of me having to calculate which bloody country each lat/lon pair is in :/
[10:53:54] I'm confused about your [editor1], [editor2] example above
[10:54:22] oh. Yeah, it may not make sense.
[10:54:37] So, editor1 edits pageA. It has country codes US, GB, Other
[10:54:55] editor1 edits pageB. The only difference there is that editor2, from the same country, also edits that page.
[10:55:13] In doing so they bump the number of people from their country up above the threshold that triggers merging into "Other"
[10:55:19] US, GB, IN, Other
[10:55:25] and now I know where editor1 lives.
[10:55:45] this may be entirely my paranoia, mind ;p
[10:55:45] Why?
[10:55:56] ..oooh, you're right.
[10:55:59] I don't see how you learned where editor1 lives.
[10:56:09] no, you're right, I'm being silly.
[10:56:13] Do we know where editor2 lives before the data?
[10:56:16] editor2 could've merged with anyone in the editing population.
[10:56:27] we don't.
[10:56:45] Although presumably if the only change that adds NEW people is the inclusion of editor2 we now know where they live.
[10:56:53] It's whatever country code falls above the threshold that didn't previously.
[10:57:01] * halfak imagines triangulation attacks for editor location.
[10:57:03] But you'd need two pages with otherwise-identical editing populations.
[10:57:21] or, where one is, but for editor2, a subset of the other editing population.
[10:58:16] hmmn
[10:58:19] * Ironholds headscratches
[10:59:30] okay, how does this sound: I'll sanitise by dividing into "in-country" and "other", with percentages, for every article with >2 editor, and provide them that. And then we can start a conversation about NDAs, because in lieu of Reid's solution and a lot of APIwerk from AnalyticsEng, we need an interim solution, and it may as well be paperwork-heavy.
[10:59:39] *>2 editos
[10:59:42] ..bah. you know what I mean.
[11:00:14] +1 for exploring NDAs in the meantime.
[11:00:33] I think we should be looking towards a more substantial commitment from those we support with NDAs
[11:00:40] what do you mean?
[11:00:42] e.g. not just their own analysis
[11:00:49] Like in the case of Reid.
[11:00:57] He is helping us solve a bigger problem.
[11:01:02] Like a free contractor.
[11:01:13] gotcha
[11:01:16] * Ironholds nods
[11:01:45] so, "you get our datasets if you can convince us that the project you're working on is, by happenstance, useful to the both of us, or if you're willing to apply your knowledge and expertise to a problem we both agree is interesting"
[11:01:53] that sort of thing?
[11:02:00] Yes.
[11:02:10] * Ironholds nods. Makes a lot of sense.
[11:02:28] This would make it much easier to justify the risk and it bears the scrutiny of non-researchy/academicy review.
[11:02:46] We'll need a process for such requests that is like the IEG process.
[11:02:57] e.g. Idealabs research projects.
[11:03:07] yeah. I mean, it seems a bit... weird. In the sense that I'm used to thinking of us as being very free and liberal with where we spend our time.
[11:03:21] "I want to understand consumption patterns across the globe and how they were affected by the release of WP Zero" OK. Cool. We need you.
[11:03:24] But I've spent all week dealing with 3 different research requests, one of which is from a guy I kind of want to slap now.
[11:03:31] being free and liberal is REALLY TIRING.
[11:03:33] "I want all the dataz for my masters thesis." NO
[11:03:48] actually I got the masters thesis guy his data. Without violating privacy!
[11:03:51] Agreed that being open is super tiring.
[11:03:52] but it took so much energy it was ridiculous.
[11:04:03] It's like being a PhD supervisor for a day.
[11:04:06] Part of the reason we are trying this whole openness thing is to get *more* work done.
[11:04:12] only with added context switching funtimes.
[11:04:46] yeah, good point. And it can't be at the cost of us becoming, I guess, nothing more than switches.
[11:05:05] If we get 40 people doing work for us for free and I have to spend all my day juggling between them, we might be getting more work done but I will have succeeded in making myself miserable.
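(Editor's aside: the "in-country"/"other" split Ironholds settles on above reduces to a few lines of code. This is a minimal sketch in Python, assuming input rows of (page, page_country, editor_country, editor_count); the row layout, function name, and threshold parameter are hypothetical, not the actual pipeline.)

```python
from collections import defaultdict

# Hypothetical rows: (page, page_country, editor_country, editor_count).
rows = [
    ("pageA", "GB", "GB", 5),
    ("pageA", "GB", "US", 3),
    ("pageB", "IN", "IN", 1),  # only 1 editor: dropped by the threshold
]

def sanitise(rows, min_editors=3):
    """Collapse per-country editor counts into in-country vs. other
    percentages, dropping pages below the editor threshold."""
    per_page = defaultdict(lambda: {"in": 0, "out": 0})
    for page, page_country, editor_country, count in rows:
        bucket = "in" if editor_country == page_country else "out"
        per_page[page][bucket] += count

    out = {}
    for page, counts in per_page.items():
        total = counts["in"] + counts["out"]
        if total < min_editors:  # the ">2 editors" rule: drop the page
            continue
        out[page] = {
            "in_country": 100.0 * counts["in"] / total,
            "other": 100.0 * counts["out"] / total,
        }
    return out

print(sanitise(rows))
# {'pageA': {'in_country': 62.5, 'other': 37.5}}
```

Because only two buckets survive and small pages are dropped entirely, no single editor's country can be recovered by differencing two pages' country lists, which is the attack worried about above.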
[11:05:15] I don't LIKE people. Why do you think I write code? :P
[11:05:30] (okay, I like some people. But not lots of new people at once. Anyway.)
[11:05:41] Agreed. This is good pushback. We need to be careful about being too open.
[11:05:52] yeah. I think WM was a sort of perfect storm for it
[11:06:01] So, can we get other external researchers to help us deal with these requests?
[11:06:03] lots of awesome awesome researchers overbalancing my workload just enough for me to get annoyed about it ;p
[11:06:20] ooh, that's a good question. Like, people who are already NDA'd and know their way around?
[11:06:24] or..?
[11:07:07] E.g. maybe we can have a tiered process where researchers submit proposals, and we have a collection of people helping us review them (Wikipedians, external researchers, NDA'd researchers and us)
[11:07:33] When a proposal looks like it's meeting all the req's, then we engage in a formal review.
[11:07:43] Lots of process, but it would buffer us a bit.
[11:08:06] and also it sounds like it'd add a useful layer
[11:08:21] In the meantime, we could start by asking current requesters to post their proposals in public and reserve the right to desk reject.
[11:08:27] we can't read every paper and don't know what we should be caring about (or at least, don't know all of what we should be caring about). Having a hivemind that has checked, and is checking, for stupid is good.
[11:08:33] yeah, makes sense.
[11:08:53] do we have a page that is appropriate? Or are we just sort of throwing ideas around here and should start a thread with Leila and Dario and so on?
[11:09:02] I think by making people type up a public doc, we can probably improve the quality of the proposal before it arrives on our desk.
[11:09:18] +1 for starting a thread. I think we were supposed to do that on r-internal
[11:09:25] * Ironholds nods
[11:09:43] okay: I'll kick it off with a "so here is the problem" and you can come in with "and here is a solution"
[11:09:50] you can explain your idea far better than I can.
[11:09:52] Once we have a loose idea of what we like, I'd like to move to meta and pull in people who might get mad at us to comment early.
[11:10:03] Sounds good. Thanks Ironholds
[11:10:09] I'll be able to comment later today.
[11:10:14] * halfak runs back to other work.
[11:10:15] cool!
[12:44:47] * YuviPanda|brb waves in the general direction of halfak and then lets him get back to work
[12:46:02] o/ YuviPanda|brb
[12:46:38] ohai J-Mo :)
[12:47:12] I'm not really here, Yuvi. Just accidentally autojoined :) see you in a few hours
[12:47:12] hey-mo!
[12:47:28] hey YuviPanda :)
[12:47:34] how's glessgae?
[12:47:36] heh, hi Ironholds
[12:47:39] aww
[12:47:43] Ironholds: it's cold
[12:47:48] I scared J-Mo away
[12:47:50] yeah, it does that.
[12:48:24] You've got to offer it food first.
[12:48:50] heh
[12:48:57] will remember for next time!
[12:49:00] hah
[12:49:03] I meant glasgow, but.
[12:49:19] Ironholds: this house also doesn't have heating, so doubly cold
[12:49:25] yeeek!
[12:49:35] social housing
[12:49:43] Hey Ironholds, what's with the conference planning event today
[12:49:49] thewha?
[12:49:51] 4:30 BST
[12:49:54] you guys are planning a conference?
[12:49:59] * Ironholds looks at calendar
[12:50:03] -.o.-
[12:50:08] I don't see it
[12:50:22] You're on the invite list.
[12:50:35] today? :/
[12:50:37] 8:30 PDT
[12:50:44] 4:30 BST
[12:50:56] ohhh
[12:51:04] yeah, forgot to switch my calendar over. oops.
[12:51:14] I dunno :/
[12:53:01] hmn.
I wonder if lapply + data.table > ddply + data.frame
[12:54:06] Ironholds, overlaps with another *recurring* meeting.
[12:54:19] Does no one actually look at gcal when they schedule things >:(
[13:05:04] apparently, considering the number of schedule conflicts I see in -research and -analytics :D
[13:10:25] Sometimes I wonder if it's because I'm not important enough to notify when my schedule is getting f---'d.
[13:10:28] * halfak pouts
[13:15:50] * YuviPanda pats halfak
[13:15:52] halfak, oh, don't worry.
[13:16:01] halfak: if you want I can schedule a 3h meeting with you every day *and* notify you :)
[13:16:05] the table I was at at WM saw James's calendar, and a reaction was "wow, you're worse than me!"
[13:16:09] "I only get TRIPLE-booked!"
[13:16:42] heh
[13:18:08] aw man, I have to go outside
[13:18:11] it's raining hard even for the UK
[13:18:35] Sunny here
[13:18:47] Saw some clouds come in earlier and hurried back from lunch
[13:18:52] But then it got sunny
[13:21:23] sunny here too
[13:24:37] screw the both of you
[13:24:41] * Ironholds hopes it clears up by monday.
[13:27:38] * halfak imagines a rain cloud above Ironholds.
[13:28:23] MariaDB 10 *finally* brings online alter table to mysql
[13:28:25] * YuviPanda stabs mysql
[13:33:00] lool. staeiou would rejoice.
[13:33:10] he did a lot of work to add columns to tables for WSOR'11
[13:33:13] Took forever
[13:33:16] yeah
[13:33:23] query table had only like, 300 rows
[13:33:25] took 10m
[13:33:33] I just killed the mysql server and upgraded to mariadb :D
[14:37:31] halfak: \o/ now it's way more obvious that you can title queries (and you can add a description)
[14:37:39] now I'm out of easy things to do
[14:37:46] YuviPanda, TSV. TSV.
[14:37:59] Ironholds: yeah, I did say 'easy'
[14:38:14] TSV is hard?
[14:38:34] halfak: well, reworking the results backend is slightly more involved :)
[14:38:40] TSV would be part of that
[14:38:47] why? Store it as JSON by default but convert to TSV on button-push for downloading
[14:39:14] Ironholds: right, I can do that now, but what to do when the JSON is too big? will kill the server process for loadind...
[14:39:16] *loading
[14:39:27] Stream it?
[14:39:40] You can't write out to a stream inside of flask?
[14:39:50] halfak: you can't stream JSON *in*
[14:39:56] reading is the problem
[14:40:07] Oh! Store the files as one row per line. Read it a line at a time
[14:40:16] this kills the NFS :P
[14:40:27] this is standard practice for json log files, mongo, etc.
[14:40:28] halfak: Ironholds the solution me and phuedx came up with was to write each resultset as a .sqlite file
[14:40:44] halfak: oh, wait, I thought you meant one file per row :)
[14:40:49] :P
[14:41:01] halfak: with sqlite, we could also do pagination, etc
[14:41:07] This should be *way* easier than everything else.
[14:41:10] I'll give on pagination
[14:41:15] (i really like the sqlite solution)
[14:41:28] However, I already proposed that the offset can be used when reading a file.
[14:41:28] * phuedx is proud of YuviPanda's and his solution
[14:41:33] So you can get pagination that way.
[14:42:05] Not line offset, but byte offset for the line to start reading.
[14:42:09] That's efficient
[14:42:24] right, but at that point you're kinda re-implementing sqlite
[14:42:25] ish
[14:42:33] if I've to keep an index of offsets for lines
[14:42:34] Nope. Files were meant to do this.
[14:42:57] just return the offset for the start of the next "page"
[14:43:03] hmm, byte offset
[14:43:18] Anyway, I don't think pagination is a real use-case anyway.
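(Editor's aside: halfak's byte-offset pagination amounts to something like the following. A minimal sketch assuming newline-delimited JSON result files; the function name and signature are hypothetical, not Quarry code.)

```python
import json

def read_page(path, offset=0, limit=50):
    """Read up to `limit` newline-delimited JSON rows starting at `offset`.

    Returns the rows plus the byte offset of the next page, or None
    if the file is exhausted. (Hypothetical helper for illustration.)
    """
    rows = []
    with open(path, "rb") as f:
        f.seek(offset)  # jump straight to the page start; O(1), no index needed
        for _ in range(limit):
            line = f.readline()
            if not line:
                return rows, None  # end of file: no next page
            rows.append(json.loads(line))
        next_offset = f.tell()  # hand this back to the client as its "page" token
    return rows, next_offset
```

The point being argued above is exactly this: the next-page token is just a byte offset, `seek()` reaches it in constant time, and no per-line offset index ever has to be maintained.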
[14:43:30] If the result is greater than a certain size, download it.
[14:43:30] true, true
[14:44:09] BUT, SQLite would let us query query-results... sort of.
[14:44:21] indeed. and you can trivially offer sqlite downloads
[14:44:27] Could be useful for sorting and stuff -- which I'm not convinced is a real use-case.
[14:44:35] and tools can hit a URL and get .sqlite files, and do things to them
[14:44:41] I'd never download SQLite.
[14:44:56] But I'm not exactly the target user.
[14:45:04] true
[14:45:08] I kind of doubt that the target user will want to use anything other than Excel.
[14:45:19] other fun aspect of this would be that we'll use sqlite, mysql (and potentially postgres) in the future
[14:45:47] but yeah, sqlite does seem a *little* bit over the top
[14:50:25] halfak: hmm, I don't see any particular disadvantages of the sqlite solution over JSON, and it'll probably only take me a few hours to do the sqlite one anyway. let me go do that now :)
[14:53:52] Fair enough. Seems overkill, but if it is easy then it's only regular kill. :)
[15:02:18] halfak: :D
[15:02:36] halfak: it's one of those ideas that are so clever I *must* use it, and probably pay for it later
[15:02:43] uhoh
[15:02:45] meh
[15:07:58] phuedx: actually, after starting to write code for the sqlite stuff, I think halfak's solution is probably just as good and less complicated. I'm going to switch to that
[15:08:30] * phuedx hasn't looked at the scrollback
[15:11:20] would downloading as .sql be a use case?
[15:11:26] * phuedx remembers this being mentioned somewhere
[15:11:38] byte offsets in a file for pagination is sound
[15:11:58] phuedx: yeah, but we can generate sql anyway, same way as we generate TSV / full JSON
[15:12:11] yeah, pick yer use case
[15:12:36] YuviPanda: implement whichever you feel is the simplest
[15:12:45] and i'll try and implement t'other
[15:12:52] and then we'll fight to the death
[15:14:01] cool :)
[15:14:09] \o/
[15:14:15] I'll take the bets
[15:14:26] I'm looking at msgpack instead of JSON too
[15:32:23] augh meeting
[15:45:13] halfak: phuedx after some back and forth, the sqlite solution actually ended up being easier to do :) done with it now
[15:45:31] cool
[15:45:44] YuviPanda: easier how?
[15:46:05] phuedx: mostly because I didn't have to write a file format that lets me read the number of rows and number of resultsets without having to scan an entire file
[17:40:26] guys
[17:40:28] guys
[17:40:32] you guys
[17:40:39] did you meet the most wonderful girl?
[17:40:40] We have like a billion people in this room
[17:40:47] because I hate to break it to you but she married you, like, 2 years ago.
[17:40:51] -research is STRONG
[17:40:54] and you should really pay better attention to this stuff.
[17:40:57] or that too
[17:40:58] wut
[17:41:22] The trope of "guys, I met the most WONDERFUL [foo] *swoons*"? Or is that just the old-timey films I watch
[17:41:36] * halfak must not watch enough oldtimey films
[17:45:13] * halfak runs off to dinner
[17:45:15] o
[17:45:46] * Ironholds waves
[17:47:48] * Ironholds beats head repeatedly into desk
[17:51:15] * YuviPanda pats Ironholds
[17:51:32] Ironholds: remember, you could've been CL for MediaViewer instead
[17:52:57] no I couldn't.
[17:53:14] On account of I told Howie that if my contract in late 2013 didn't have "Analyst" in the title I wasn't signing it.
[17:53:27] there is no universe in which I am CL. Although there are a lot of universes in which I'm unemployed.
[17:53:33] right
[17:53:33] or that
[18:01:57] Ironholds: I'm in da hangout, chat?
[18:02:52] oh! yes :D
[18:04:02] uh-oh
[19:49:23] phuedx: btw, the sqlite results backend is done. I've an SQLiteOutputReader and SQLiteOutputWriter \o/
[19:49:27] now to just hook it up to things
[19:49:34] cool!
[19:49:47] i've been fiddling with rcstream for most of the day
[19:49:57] YuviPanda: are there any small-ish things you want doing?
[19:50:12] i'll have some time over the weekend
[19:50:58] phuedx: https://bugzilla.wikimedia.org/show_bug.cgi?id=69544 is small :)
[19:51:22] phuedx: and none of my work would affect https://bugzilla.wikimedia.org/show_bug.cgi?id=69037 either
[19:51:40] phuedx: also other things - like a lot of text in the interface being wildly out of alignment because I suck at CSS
[19:51:48] haha!
[19:51:55] ok, i'll take a look at 'em
[19:52:00] \o/
[19:52:08] phuedx: writing a README on how to set up a local env would also be great
[19:52:22] YuviPanda: how'd you fare with vagrantization?
[19:52:53] phuedx: it's a bit complex because of the ssh key forwarding and stuff, and the need for a virtual host, so I'm not very far.
[19:53:04] of course!
[19:53:18] phuedx: but I talked to bd808 and have solutions to most of my problems
[19:53:53] phuedx: I'll also need to work at some point on adding a varnish box so results can be cached properly
[19:54:04] that shouldn't be super hard tho
[20:11:53] Ironholds, you mentioned stat2 in your email about upgrading stat3. If you think it's reasonable to request that too, let's do it and be done with it for some time
[20:15:36] leila, naw, do stat3 first!
[20:15:40] (1) less work, (2) more people use it
[20:15:51] (3) I wanna see if it breaks without interrupting all my screen sessions *looks shifty*
[20:17:22] makes sense
[20:24:32] soo, Ironholds, shall we chat briefly or something?
[20:43:21] leila: I also didn't know there was a process, sorry if we all zerg'd otto
[20:43:52] np. I mean, it started from a semi-casual conversation.
[20:43:54] Ironholds: so I'm adding TSV now. I'll add CSV later, and make it less prominent.
[20:44:04] it makes sense to have a process, I just didn't know one existed.
[20:44:10] heh, me neither
[20:44:33] I wonder why milimetric didn't kick me out and send me to kevinator.
[20:44:41] he's just so polite and helpful
[20:44:51] :D
[20:44:53] he is, he is
[20:44:57] exceedingly
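(Editor's aside: for reference, a sketch of what a writer/reader pair like the SQLiteOutputWriter and SQLiteOutputReader mentioned above could look like. The appeal YuviPanda describes is that SQLite replaces a hand-rolled file format: row and resultset counts live in a small metadata table instead of requiring a scan of the whole file. Only the class names come from the log; the table layout and method names are invented for illustration, not Quarry's actual code.)

```python
import sqlite3

class SQLiteOutputWriter:
    """Write query resultsets into a single .sqlite file (sketch)."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS meta (resultset INTEGER, rowcount INTEGER)")

    def write_resultset(self, index, headers, rows):
        # One table per resultset; index is an int, so the f-string is safe here.
        cols = ", ".join(f'"{h}"' for h in headers)
        placeholders = ", ".join("?" for _ in headers)
        self.db.execute(f"CREATE TABLE resultset_{index} ({cols})")
        self.db.executemany(
            f"INSERT INTO resultset_{index} VALUES ({placeholders})", rows)
        # Record the row count so readers never have to scan the table.
        self.db.execute("INSERT INTO meta VALUES (?, ?)", (index, len(rows)))
        self.db.commit()

class SQLiteOutputReader:
    """Read counts and pages back without scanning the whole file (sketch)."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)

    def counts(self):
        # {resultset index: row count}, straight from the metadata table.
        return dict(self.db.execute("SELECT resultset, rowcount FROM meta"))

    def page(self, index, offset=0, limit=50):
        return self.db.execute(
            f"SELECT * FROM resultset_{index} LIMIT ? OFFSET ?",
            (limit, offset)).fetchall()
```

This also shows why pagination and per-format export (TSV, SQL, even the .sqlite file itself) come nearly for free once the results live in SQLite, which is the trade-off debated earlier in the log.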