[00:13:52] halfak, poke :)
[15:34:15] * Ironholds yawns, stretches
[16:05:23] Hey Ironholds
[16:06:14] halfak, you can has datasets!
[16:06:56] Saw that. Will look at 'em right away.
[16:08:45] yay!
[16:14:52] Holy crap dude. You're making my life easy :)
[16:25:01] whyso?
[16:25:38] you mean the consistent formatting and the fact that you can, should you choose, just data <- do.call("rbind",lapply(list.files("Output",full.names=TRUE),fread))
[16:25:56] and end up with one big-ass data.table you can generate intertimes from by {uuid,type}?
[16:27:12] Everything but the loading into R to do intertimes.
[16:27:15] :P
[16:27:22] Psht.
[16:27:31] actually I'd be really interested to see who'd win in benchmarking, R versus Python
[16:27:37] I write pretty efficient R.
[16:27:49] I'll use less memory :P
[16:27:57] that's fair!
[16:28:11] Python is the Tomahawk of data analysis, R is the daisycutter
[16:28:21] Also, rather than loading it into memory, I'll process the data as fast as I can decompress it. :P
[16:28:24] they'll both get the job done, but R is less pleasant for anyone within 500m
[16:28:36] Actually, that's not true, I can't keep up with gzip
[16:28:58] if you could keep up with gzip I'd be worried
[16:29:36] "so, I wrote something in Python that's as fast or faster than really precisely written C authored by Mark Adler"
[16:29:50] I think we'd have to declare you the second coming
[16:30:06] Well, decompression is somewhat non-trivial computationally, whereas "compare every two lines" is pretty simple.
[16:30:22] Though really, decompression is far faster than compression.
[16:30:30] yerp
[16:30:40] In both cases, we're pushing the disc a bit.
[16:31:21] * halfak accidentally cats a file and watches his terminal go to hell.
[16:31:31] :(
[16:31:41] ^C ^C ^C!
[16:31:45] that's one of my most-appreciated minor data.table features, oddly
[16:31:52] +1
[16:31:55] Auto head/tail
[16:32:03] "oh fuck I-wait, oh it was a data.table, I can still read"
[16:32:29] as opposed to data.frame printing 20000 rows however much you ^C and then throwing a warning that it hit max.length like that's YOUR fault
[16:32:43] I've come to simply set max.print to 200. It saves on accidental stupid.
[16:33:09] Seems like a good idea.
[16:33:35] Say, will any of the IDs cross datasets?
[16:34:30] * Ironholds thinks
[16:34:33] they shouldn't
[16:34:41] if only because the random seed is different each time
[16:35:02] OK. Just checking :)
[16:35:07] I mean, I guess theoretically if two datasets both picked the same random value
[16:35:13] but that really shouldn't happen after six runs
[16:35:22] if it does...I'll be switching to Python faster than I thought ;p
[16:36:47] :) Sounds solid to me.
[16:37:06] Everything is pre-sampled, right?
[16:37:14] Seems like the files are too small otherwise.
[16:40:46] Ironholds, ^
[16:43:10] halfak, yep!
[16:43:14] and..hangon
[16:43:24] yeah, pre-sampled, 100k distinct IPs each, as I recall
[16:43:31] Cool.
[16:43:45] so, I'm talking through dataset releases with Dario on Monday (you want in?) and as part of it I want to stick a big table in the repo readme.md
[16:44:02] event type - user type - sampling basis - sample count - hashing method - rows
[16:44:14] this will hopefully help (although the raw queries are around if people want to double-check)
[16:44:36] Sounds good. Yes, please pull me in for that meeting.
[16:44:53] I was just about to ask you to do a quick writeup about the datasets that we can pull into the paper.
[16:47:04] cool!
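
[A minimal sketch of the load-and-intertime pipeline Ironholds pastes at 16:25:38, folding in the max.print cap from 16:32:43. It assumes the files in Output/ share a schema with uuid, type, and a numeric timestamp column; the column name `timestamp` is an assumption, not something confirmed in the log.]

```r
# Sketch of the one-liner above, made slightly more idiomatic:
# rbindlist() is data.table's faster equivalent of do.call("rbind", ...).
# Assumes each file in Output/ has (at least) uuid, type, and a numeric
# `timestamp` column -- the timestamp column name is hypothetical.
library(data.table)

options(max.print = 200)  # the accidental-print cap mentioned at 16:32:43

files <- list.files("Output", full.names = TRUE)
data  <- rbindlist(lapply(files, fread))

# Inter-event times per {uuid, type}: sort by time within each group,
# then difference consecutive timestamps.
setkey(data, uuid, type, timestamp)
intertimes <- data[, .(intertime = diff(timestamp)), by = .(uuid, type)]
```
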
[16:47:07] that's a twofer, then
[16:47:12] * Ironholds will start up the repo now as an excuse
[16:47:54] halfak, this seems like the sort of thing that'd make a good blog post, actually
[16:48:04] legal-approved dataset release with a DOI and transparent code around its generation?
[16:48:05] +1 I'm down.
[16:48:06] hell to the yes.
[17:00:15] Hey Ironholds, should I expect these datafiles to be sorted?
[17:00:34] * Ironholds thinks
[17:00:43] oh damn, I was meant to write that functionality in, wasn't I.
[17:00:50] No worries. I can do it :)
[17:00:52] I don't think they're sorted. Womp womp :(
[17:13:14] halfak, this is probably a really stupid question but
[17:13:24] why haven't we just dumped the categorylinks table to file and released it?
[17:13:37] Oh. We have.
[17:13:41] cool!
[17:13:42] It's an SQL dump.
[17:13:58] eeexcellent
[17:16:20] whoops
[17:16:30] * Ironholds just made the fatal mistake of trying to add POSIX timestamps to a data.table
[17:30:28] * halfak loads processed, sampled data into R.
[17:34:47] I think that search events have some ajaxy things going on.
[17:36:02] Same deal with desktop pageviews. I'm not getting the clean fit I have gotten in the past.
[17:36:23] This is weird.
[17:36:30] I will have plots shortly.
[17:45:55] Desktop views look more reasonable.
[17:46:38] Wait.. no they don't.
[17:46:43] Neither do mobile views.
[18:04:50] Edits look really weird too.
[18:04:57] Plots uploading.
[18:11:32] Say Ironholds, what was your general methodology for gathering edits?
[18:15:32] See https://commons.wikimedia.org/w/index.php?title=Special:ListFiles/EpochFail&ilshowall=1 I'll gather them in the document after lunch.
[20:35:55] halfak, huh
[20:36:04] edits should be fine. Pageviews...*thinks*
[20:36:22] let me run some exploratory queries
[20:37:10] oohhh, wait. I know.
[20:37:24] halfak, this may be one of fundraising's "features"; I'm gonna run some checks
[20:38:36] yip. /wiki/Special:RecordImpression
[20:38:51] wait, no, that shouldn't be making a difference, that's text/javascript.
[20:50:23] halfak, poke me when ye return?
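
[A guess at the "fatal mistake" from 17:16:30, plus the missing sort from 17:00:43: data.table rejects list-based POSIXlt columns, while atomic POSIXct columns work fine. This is a hedged sketch, not the actual fix from the log; the file and column names are illustrative assumptions.]

```r
# POSIXlt columns are list-based and data.table rejects them;
# convert to atomic POSIXct before storing. File and column
# names below are hypothetical.
library(data.table)

dt <- fread("Output/desktop_views.tsv")  # illustrative file name

# Epoch seconds -> POSIXct (not POSIXlt):
dt[, timestamp := as.POSIXct(timestamp, origin = "1970-01-01", tz = "UTC")]

# The sort functionality that was never written in, done in place:
setorder(dt, uuid, timestamp)
```

[setorder() sorts by reference, so even a large table isn't copied.]
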