[00:37:45] I'm off. Have a good weekend!
[00:37:46] o/
[07:48:05] halfak: harej quarry kill limit increased to 20mins per query btw
[13:28:56] yuvipanda, \o/
[14:47:27] Yuvipanda, not sure I ever had queries last longer than a minute lol
[14:47:47] At least the ones that are successful. (Note to self: never use the revision table)
[14:48:55] harej, yes. "revision_userindex"
[14:49:04] Note that there's also "logging_userindex"
[14:50:27] I don't think revision_userindex would have helped
[14:50:28] That decision to have the indexed table named weird has caused *so* many face-palms.
[14:50:28] Oh?
[14:50:28] Since I wasn't touching anything involving usernames
[14:50:28] * halfak wonders what Harej is querying
[14:50:28] Well I just ended up using the RC table instead which was dramatically faster
[14:51:32] harej, interesting. RC is not historical. Could it be that limiting by timestamp would have helped?
[14:51:44] Either way, I'm glad you got it worked out :)
[14:52:01] No, it wouldn't have. I only needed a week's worth of data.
[14:52:06] But I have a question for you.
[14:52:22] Oh?
[14:52:50] Can one use the revert module you built to just figure out, yes or no, if a given revision ID has been reverted, based on a query to labs?
[15:01:29] morning all
[15:02:19] harej, yes, sort of.
[15:02:38] The db module struggles with labs due to the weird table naming and lack of useful indexes.
[15:02:46] BUT the API model works and is very fast for the same thing.
[15:02:52] Hi Ironholds
[15:02:52] :)
[15:05:14] Are the table names not the same?
[15:05:15] morning halfak :)
[15:05:18] how goes?
[15:05:30] harej, "revision_userindex"
[15:05:49] Ironholds, not too bad. Just got the "Wiki labels" gadget to MVP yesterday night. \o/
[15:05:59] nice :)
[15:06:00] That was 1.5 weeks of 14 hour days :)
[15:06:11] I released urltools 1.1.0 and am working on iptools 0.5.0
[15:06:17] Nice :)
[15:06:31] (iptools is going REALLY well. I can turn a million IPv4 IPs from dotted-decimal to numeric in 180ms)
[15:06:51] (this is only useful if you're me or Hadley, but!)
[15:07:37] 192168001001
[15:07:56] Ironholds, I think you're really going to like this new hand-coding system. It's like the AFT one I built, but admins no longer need to assign worksets. You can just show up and request one. :)
[15:08:03] halfak, niiiice!
[15:08:21] harej, 190.26.253.233
[15:08:28] * halfak is going to trust the crowd and build basic moderation tools :)
[15:08:41] halfak: everything I am working on involves recent revisions. Could I just replace revision_userindex with recentchanges?
[15:08:49] Ironholds: alas, you use a different methodology than i was expecting
[15:09:03] harej, what's the IP?
[15:09:08] harej, wouldn't work out of the box, but theoretically yes. I recommend the API regardless.
[15:09:10] 192.168.1.1 :P
[15:09:22] It'll work if you're pulling revisions from the DB and checking them via the API :)
[15:09:29] harej, that's 3232235777
[15:09:46] you take each individual sub-component and multiply it by [number of possible bits in this sub-component], basically
[15:09:50] See I thought you were just putting zeroes in front of the one- and two-digit octets and then removing the decimal points
[15:09:54] ahhh
[15:10:20] ironholds: 192*256^3 + 168*256^2 + 1*256^1 + 1*256^0?
[15:10:34] halfak, lemme check ;)
[15:11:13] 192*16777216 + 168*65536 + 1*256 + 1
[15:11:14] yep
[15:11:34] of course, it doesn't yet work for IPv6
[15:11:40] why? Because longs are beyond R's API.
[15:11:43] * Ironholds grumbles
[15:12:01] I can fit all possible IPv4s in an unsigned int, but IPv6? fuggettabahtit
[15:13:23] although one of my friends just published a paper on storing a massively wider range of numbers in unsigned int space
[15:13:27] which looks p.interesting
[15:15:15] (and wholly disconcerting)
[15:15:22] (like, an unsigned int is an unsigned int)
[15:16:41] * guillom waves at halfak, harej & Ironholds.
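The dotted-decimal arithmetic worked out above (192*256^3 + 168*256^2 + 1*256^1 + 1*256^0) can be sketched in Python. This is only an illustration of the arithmetic, not how iptools does it; `ipv4_to_int` is a hypothetical helper name. The standard-library `ipaddress` module is used as a cross-check, and also shows that Python's arbitrary-precision integers can hold a 128-bit IPv6 value.

```python
import ipaddress


def ipv4_to_int(address: str) -> int:
    """Convert a dotted-decimal IPv4 string to its numeric value.

    Hypothetical helper for illustration only.
    """
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= octet <= 255 for octet in octets):
        raise ValueError(f"not a valid IPv4 address: {address!r}")
    value = 0
    for octet in octets:
        # Each step multiplies the running total by 256 (one octet = 8 bits)
        # and adds the next octet, which expands to a*256^3 + b*256^2 + c*256 + d.
        value = value * 256 + octet
    return value


numeric = ipv4_to_int("192.168.1.1")
print(numeric)

# Cross-check against the standard library's own conversion.
print(numeric == int(ipaddress.IPv4Address("192.168.1.1")))

# An IPv6 address is 128 bits wide; a Python int holds it without overflow.
print(int(ipaddress.IPv6Address("2001:db8::1")).bit_length() <= 128)
```

Zero-padding the octets and dropping the dots (192168001001) gives a different number entirely, which is why the positional base-256 form is the one that matters here.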
[15:17:16] hey guillom :)
[15:17:19] Ironholds, what about hex?
[15:17:29] Could you change the base and produce a string?
[15:17:39] G'morning guillom
[15:17:42] in the context of IPv6 or overloading ints?
[15:17:56] Ironholds, IPv6
[15:18:02] * halfak is not sure if hex math is fun in R
[15:18:04] hmmnm
[15:18:07] * Ironholds thinks
[15:18:10] yes, I could output as a string
[15:18:17] but the user would be unable to manipulate it as a number
[15:18:28] Fun story, javascript is actually really good at doing non-base 10 things.
[15:18:38] Like, long support is barely existent in base R
[15:18:40] Ironholds, in some langs, you can have non-base 10 numbers.
[15:18:53] But I suspect that it will be represented as an int internally anyway.
[15:18:56] we just (with, I think, 3.2.0) got >2^31 in *some* contexts
[15:18:58] So that might not solve any problems.
[15:18:59] yeah, that's the problem
[15:19:05] like, you can store things in say, scientific notation
[15:19:09] it can represent numbers in various forms
[15:19:11] Ironholds, you know who has longs?
[15:19:14] but the base storage mechanism is always ints
[15:19:17] ... python ;)
[15:19:17] python has longs :(
[15:19:20] :D
[15:19:28] yeah, but python also has really slow value lookup so *thbbbt*
[15:19:38] And in Python 3, you don't have to notice what is a long and what is an int!
[15:19:44] yeah, I know!
[15:19:49] I got bored and dug into how python objects are stored
[15:19:51] Meh. Just as fast as R if not faster.
[15:19:58] it's really fascinating; y'all went for the non-contiguous route.
[15:20:11] Oh! That
[15:20:16] We have __slots__ for that :)
[15:20:16] which makes total sense given "pythonic" programming (it's why shit like append() is an anti-pattern in R but the Way Of Working in Python)
[15:20:40] If you want contiguous use of memory, put it in __slots__ rather than __dict__
[15:20:54] or lists ;p
[15:20:57] So all of my memory intensive classes use __slots__
[15:21:04] *nods*
[15:21:07] Na. __slots__ is a tuple, so it uses memory like an array.
[15:21:12] ...huh.
[15:21:19] aren't arrays non-contiguous? :/
[15:21:23] or are we not talking C arrays?
[15:21:24] Nope
[15:21:37] Aren't C arrays contiguous?
[15:22:02] hmn; I could swear they were NC. Checking
[15:22:48] Ironholds, I'm thinking about doing pointer math to look up things in an array
[15:22:52] looks like the answer is "it depends" for arrays generally, but contiguous in C, yeah
[15:22:57] Maybe there's an abstraction I'm not familiar with.
[15:23:02] Gotcha.
[15:23:02] hence O(1) for lookup
[15:23:03] Yeah.
[15:23:06] yerright!
[15:23:16] tuples are C arrays -- basically
[15:23:28] You can't append to a tuple.
[15:23:31] You have to make a new one.
[15:23:34] ahhh
[15:23:36] neat!
[15:23:38] :)
[15:23:51] here in R land myself and wrathematics have been talking about writing some non-contiguous data structures
[15:23:58] So, great for when you want to say, "Use this much memory all together: [....]"
[15:24:02] Character vectors are basically Pythonic lists, but almost all other types are contiguous
[15:24:09] which makes total sense for say, matrix mathematics
[15:24:21] +1. Not for a live system.
[15:24:26] You want to be able to append all the time.
[15:24:28] but no sense for non(complex mathematical) computations
[15:24:34] totally
[15:24:43] :)
[15:24:48] * halfak high fives Ironholds
[15:24:55] * Ironholds high-fives
[15:24:59] You're kind of a 1337 programmer these days
[15:25:04] how? ;p
[15:25:07] digging into memory management issues and all.
[15:25:16] I mean, I like seeing how my language is built
[15:25:21] the answer appears to be "it was built?"
[15:25:59] related; my useR conference talk got accepted!
[15:26:00] \o/
[15:26:08] so I'm doing 30 minutes on software engineering standards in the R community
[15:26:45] Ironholds, somewhere between the CRAPL and the optimally reproducible study via set of R files?
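The `__slots__` and tuple remarks above can be illustrated with a small sketch. The class names `PointDict` and `PointSlots` are made up for this example. It shows only the behaviour visible from Python (no per-instance `__dict__`, a fixed set of attributes, and tuples that cannot be appended to), and makes no claim about the exact memory layout, which is an interpreter implementation detail.

```python
class PointDict:
    """Ordinary class: attributes live in a per-instance __dict__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class PointSlots:
    """Same fields, declared up front so that no per-instance __dict__ is created."""

    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


plain = PointDict(1, 2)
slotted = PointSlots(1, 2)

print(hasattr(plain, "__dict__"))
print(hasattr(slotted, "__dict__"))

# A slotted instance has a fixed attribute set; adding a new one fails.
try:
    slotted.z = 3
except AttributeError as err:
    print("cannot add attribute:", err)

# Tuples cannot be appended to; "appending" builds a brand new tuple.
pair = (1, 2)
longer = pair + (3,)
print(pair, longer, pair is longer)
```

The fixed attribute set is the trade-off: lower per-instance overhead in exchange for giving up ad hoc attributes.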
[15:27:10] hahah
[15:27:32] amusingly, the study itself is perfectly reproducible using R files, right down to writing the journal article
[15:27:35] (I love RPubs)
[15:27:45] but this is specifically CRAN packages, not, like, general code
[15:28:04] the answer is "oh dear fucking god what"
[15:28:23] like, we have biiiiiig packages, in usage terms, that totally lack things like "unit tests" or "accessible upstream repos"
[15:29:00] Ironholds, gotcha.
[15:29:19] So more like: Somewhere between CRAPL and CRAPL + some documentation.
[15:29:57] There needs to be a ThesisPL and MastersThesisPL -- because those are vastly different quality.
[15:30:22] MastersThesis == I was there for two years and I realized 3 months from the end that the project needed to be done ASAP.
[15:30:50] PhDThesis == I was there for a decade, so I started forgetting how old code worked and I developed better behaviors out of necessity.
[15:32:02] hahaha
[15:33:11] okay, I'm gonna go game for a bit. Then: iptools 0.2.0 by EOD! *looks determined*
[15:33:24] o/ Ironholds
[15:39:35] halfak: how long would the API take to check one revision for being reverted? Keep in mind that I will be querying potentially thousands of such revisions.
[15:40:03] I can do 20k in about an hour.
[15:40:21] :)
[15:40:21] Hmm, that's not so bad.
[15:41:09] Here's my use case: I am generating reports of talk page discussions going on on different pages in a WikiProject's scope. It uses the RC table.
[15:41:24] For women scientists: http://pastebin.com/raw.php?i=UzKwuKt7
[15:41:41] Note: | Rachel_Carson | /* Your Mom */ new section
[15:42:01] That talk page section is no longer actually on the page; it was promptly reverted. But it wasn't *deleted*.
[15:42:08] It's still a revision in the revision table.
[15:45:32] harej, reverts are hard to detect on talk pages because of archiving activities.
[15:45:58] I recommend setting the revert_window = 48 hours and revert_radius = 3
[15:46:23] This will constrain the revert detection and make it run substantially faster as well.
[15:46:25] Sure, though vandalism tends to be reverted pretty promptly
[15:46:30] Indeed.
[15:47:03] harej, Nearly all reverts happen within 48 hours http://www-users.cs.umn.edu/~halfak/publications/When_the_Levee_Breaks/geiger13levee-preprint.pdf
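To make the window and radius settings concrete, here is a simplified, self-contained sketch of checksum-based revert detection: a revision counts as a revert when its content is identical to one of the last few revisions of the same page. This is not the revert module discussed above and does not reproduce its API. The function `detect_reverts`, its `radius` and `window` parameters, the exact meaning given to them here, and the sample data are all assumptions made for illustration.

```python
import hashlib
from collections import deque


def detect_reverts(revisions, radius=3, window=48 * 60 * 60):
    """Yield (reverting_id, reverted_ids, restored_id) for one page.

    `revisions` is an iterable of (rev_id, unix_timestamp, text) tuples in
    chronological order. In this sketch, `radius` is how many earlier
    revisions are compared against, and `window` is the maximum number of
    seconds between the oldest reverted revision and the reverting one.
    """
    recent = deque(maxlen=radius)  # holds (rev_id, timestamp, checksum)
    for rev_id, timestamp, text in revisions:
        checksum = hashlib.sha1(text.encode("utf-8")).hexdigest()
        history = list(recent)
        # Walk backwards, skipping the immediately preceding revision:
        # matching that one means nothing changed, so nothing was undone.
        for idx in range(len(history) - 2, -1, -1):
            if history[idx][2] != checksum:
                continue
            reverted = history[idx + 1:]
            if timestamp - reverted[0][1] <= window:
                yield rev_id, [r[0] for r in reverted], history[idx][0]
            break
        recent.append((rev_id, timestamp, checksum))


# Made-up talk page history: a section is added, then promptly removed.
sample = [
    (101, 0, "== Old thread ==\nHello"),
    (102, 600, "== Old thread ==\nHello\n== New section ==\nspam"),
    (103, 900, "== Old thread ==\nHello"),
]

for reverting, reverted, restored in detect_reverts(sample):
    print(f"{reverting} reverted {reverted} back to {restored}")
```

Keeping only the last `radius` checksums bounds the work per revision, which is one reason a small radius runs faster. It also limits false positives on talk pages, where archiving can make much older content reappear.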