[04:04:08] Ironholds: do you know of a nice way to get data in mediawiki tables into R dataframes, and/or vice versa?
[05:09:29] jeremyb_, thanks for the email comments!
[05:09:56] mako, y'mean a mediawiki database table, or a literal wikitext table?
[05:10:28] if the former, RMySQL is always the answer. If the latter, I actually spent my weekend working on an R version of halfak's Python utilities that will also incorporate a parser. So, not right now, but soon!
[05:11:04] I hadn't thought of a table-to-df parser. Will add it to the to-do (in fact, I will probably work on that first: it is a very self-contained bit of work).
[05:12:19] https://github.com/Ironholds/MWUtils/issues/1 boop
[05:12:46] so far it's just a timestamp-to-POSIX and POSIX-to-timestamp function and a revert detector, but it'll rise, oh yes.
[13:32:21] morning Ironholds.
[15:57:47] leila, not sure if it's of any use to you, but
[15:57:54] I got bored and wrote a really, really fast revert detector for R
[15:58:03] it can identify 36m edits as reverts/not reverts in 20 seconds
[16:00:44] Ironholds, this can be useful. Did you use any approximation?
[16:00:52] define approximation?
[16:01:48] heuristic, really. I remember that figuring out whether an edit is a revert required the use of some sort of heuristics.
[16:02:33] for example, Aaron's code would look at edits in a +/- hours time interval to identify reverts
[16:02:35] oh, gotcha. Just Aaron's definition; whether the edit is between two edits with matching SHA1s within a day
[16:02:37] exactly
[16:02:46] It uses hash tables for speed and win
[16:02:47] aa! got it! cool!
[16:02:55] now to build this wikitable parser. Grrr.
[16:03:32] (I'll go eat breakfast. Will be back.)
[16:24:04] leila, is anyone working this week except us?
[16:24:54] haven't checked the calendars, Ironholds. I'm working today, but won't work for the rest of the week.
[16:25:10] All calendars should be updated, though
[16:25:35] okie-dokes; thanks :)
[16:26:51] np, Ironholds.
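[The revert detector described above applies Aaron's definition (an edit is reverted if it sits between two revisions with matching SHA1 checksums within a day) using hash tables for speed. The actual implementation in the log is R/C++; as a purely illustrative sketch of the idea, in Python, with the function name and tuple layout being my assumptions:]

```python
def detect_reverts(revisions, window_secs=24 * 60 * 60):
    """Flag edits sitting between two revisions with matching SHA1s.

    revisions: chronologically ordered list of (rev_id, timestamp, sha1)
    tuples, timestamp in seconds since the epoch. Returns the set of
    rev_ids considered reverted. Sketch only; not the log's actual code.
    """
    last_seen = {}   # sha1 -> index of most recent revision with that hash
    reverted = set()
    for i, (rev_id, ts, sha1) in enumerate(revisions):
        if sha1 in last_seen:
            j = last_seen[sha1]
            # Same content restored within the window, with at least one
            # intervening edit: everything in between counts as reverted.
            if i - j > 1 and ts - revisions[j][1] <= window_secs:
                reverted.update(r[0] for r in revisions[j + 1:i])
        last_seen[sha1] = i
    return reverted
```

[The hash table is what makes this a single linear pass: each SHA1 lookup is O(1), so tens of millions of edits can be scanned without any pairwise comparison.]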
[17:55:20] Ironholds: yes, I meant a wikitext table parser. I wrote an extremely fragile version once
[17:55:26] Ironholds: like, it worked on a single table :)
[17:55:58] Ironholds: maybe it was a little better? I can dig it up and share if you'd like
[17:56:41] please do!
[17:56:54] Ironholds: the context was that I was using MW to collaboratively edit a table that I was then parsing in R. But I've found myself wanting this in other situations as well
[17:56:57] I'm falling back on C++, because Rcpp is great
[17:57:00] but the one disadvantage is it constructs data.frames column-by-column, not row-by-row
[17:57:03] mako: Ironholds mwparserfromhell might already parse tables.
[17:57:05] yeah, it's a generally useful thing to have
[17:57:07] (for code-stealing)
[17:57:11] YuviPanda, it does.
[17:57:15] It does so in Python-oriented C.
[17:57:32] So, wrong mental model
[17:57:43] (I'm stealing from it, don't get me wrong, but I don't expect to be able to grab code and find-and-replace)
[17:57:58] allow me to take this as the Nth opportunity to curse everyone who wrote MediaWiki for using multi-character delimiters
[17:58:07] thus ruining my opportunity to use stringstreams for their obvious purpose
[17:58:22] yeah, the other problem I had was that I don't think I actually understand mediawiki tables
[17:58:33] nobody does
[17:58:33] I mean, I can make them :)
[17:58:40] oh, nobody even makes them any more
[17:58:47] there is, in Wikipedia, one table.
[17:58:58] and for the last 6 years, everyone has, when they needed to write a table, done the same thing:
[17:59:08] they've gone to a page that they know contains a table and stolen the syntax
[17:59:16] it's almost beautiful, in an infinitely recursing kind of way.
[17:59:41] (for me it's https://en.wikipedia.org/wiki/Chief_Justice_of_the_Common_Pleas )
[18:00:10] for me it's https://en.wikipedia.org/wiki/Comparison_of_e-book_readers
[18:01:23] y'see?!
[18:06:53] Ironholds: http://mako.cc/outgoing/read.mw.tables.R
[18:07:00] Ironholds: I'll be surprised if you find anything useful in there :)
[18:07:47] * Ironholds flinches
[18:07:58] much gsub, many wow
[18:08:09] I'll see what I can do. It looks pretty sensible, mind
[18:08:37] like I said, this was quick and dirty, to parse one (or maybe a couple of) tables
[18:08:57] but help yourself
[18:09:29] I'll take a look!
[18:09:30] I have found myself wanting to do this in a bunch of other situations, though
[18:09:40] yerp
[18:09:52] hmn. I wonder if we could just try my read.delim trick
[18:09:59] I worked out how to trick read.delim into thinking an object is a connection.
[18:10:27] and sort of thought that somebody (me?) should rewrite this code as "real software" or as a patch to your code
[18:10:58] I wrote this a couple of years ago
[18:11:05] Ironholds: yeah, that's sort of the Markdown approach to parsing :)
[18:11:25] regex substitution as parser :)
[18:12:13] yup!
[19:24:18] * Ironholds is very, very slowly automatically crawling Google
[19:24:22] the shit we do for our science
[23:22:51] Looking at tomorrow's calendar, it would appear that only Ellery, Morten and Oliver have accepted the Research Standup... should we still have it, or should we cancel it?
[23:28:58] we should totally have it!
[23:29:07] We can fill in aaron and leila and dartar's entries, as usual
[23:30:34] k... thanks!
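[The "read.delim trick" mentioned at 18:09:52 is, in R, wrapping an in-memory string in textConnection() so that read.delim() treats it as a file. The same pattern exists in most languages: once regex substitution has reduced the wikitext to plain delimited text, hand a tabular reader an in-memory buffer instead of a path. A minimal Python analogue (function name is mine):]

```python
import csv
import io

def rows_from_text(tsv_text):
    # Same idea as R's read.delim(textConnection(x)): wrap an in-memory
    # string in a file-like object so the tabular reader accepts it
    # as though it were a real connection/file.
    return list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
```

[This is exactly the "regex substitution as parser" approach joked about above: the heavy lifting is done by substitutions, and the reader only sees clean delimited text at the end.]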