[10:42:26] Hi, I am doing a research project on Wikipedia users, analyzing the sentiment on User_talk pages -- I need to get data from the user talk pages in a diff format, i.e. changes (additions and deletions) in every revision. What is the best way to get these diffs, preferably using Python?
[10:42:34] Thanks for the help in advance!
[10:44:30] (I have already used Special:Export to get revisions -- but the data becomes really huge and getting diffs on a local machine is very time-prohibitive. I notice that there is a web interface which shows diffs, e.g. https://en.wikipedia.org/w/index.php?title=User_talk%3AGrandia01&diff=632959597&oldid=632942526. I need to know the scripts/queries to get this)
[15:03:24] heh
[15:03:32] halfak, so my library appears to have snowballed.
[15:04:01] first I wrote it. Then Jeroen, who wrote opencpu and jsonlite, wants me to integrate with his openssl library.
[15:05:10] then I get an email 5 minutes ago from /Dirk Eddelbuettel/, who wrote Rcpp and digest, suggesting we replace digest's internals with this.
[18:10:37] Cool! Good work Ironholds!
[18:40:44] Ironholds, https://commons.wikimedia.org/wiki/File:Streaming_vs._In-memory.svg
[18:40:51] * halfak is working on a blog post.
[19:06:35] halfak, hey!
[19:06:44] I've happily handled 32m-row datasets with R
[19:07:05] I'll have you know that vectors have 2^31 as a maximum length
[19:07:11] Yeah. but memory
[19:07:58] You probably *shouldn't* have handled a 32m row dataset with R.
[19:08:09] It's like turning a screw with the claw on the back of a hammer.
[19:08:16] You can do it if you really want to.
[19:11:38] I'm uploading the anon'd session datasets while I write about it :)
[19:18:37] don't PHP: a fractal of bad design me, mister!
[19:18:50] (let's just take a moment to appreciate that we both referenced the same blog post)
[19:19:07] * halfak doesn't know what blog post you are talking about.
[19:19:22] Oh wait. I saw the PHP hammer
[19:20:48] Heh. I suppose a difference between this and the two-clawed hammer is that R does what it is designed to do very well.
[19:21:08] It just doesn't do what it's not designed to do so well, but that hasn't stopped people from demanding more memory in the server.
[19:21:42] totally
[19:22:00] I'm all for more memory in the server, but when an entire dataset is loaded into memory just so that a simple operation can be performed a row-at-a-time, then I shed a single tear.
[19:23:01] particularly since the counterargument is
[19:23:11] "well, R operations are vectorised! they work faster on big datasets than row at a time!"
[19:23:19] and my reply to that is to slap the talker upside the head
[19:23:35] the /reason/ they do that is because somewhere in core is a pile of ugly C that consists of function() in a for-loop, and C for-loops are cheap.
[19:23:39] If they can beat unix sort & cut, I'll hand in my axe.
[19:23:59] (unix sort and cut *are* my axes)
[19:24:03] I mean, you could try and load 200GB of stuff into memory, sure
[19:24:15] ...or you could just write it in C, at which point the "but VECTORISATION" argument goes away
[19:24:56] My priorities are thus: (1) things need to get done, (2) I'd like to not take down the server, (3) I'd like them to be fast.
[19:25:04] that reminds me, I should probably start building that damn program already
[19:25:11] the "existing definition versus new definition" one.
[19:25:31] I imagine it'll have an R interface but be almost entirely C++. I'll just take my sampled log reading code and append two fields to the end. Done.
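For the diff question at the top of the log, one option is MediaWiki's action=compare API, which returns the rendered diff between two revisions directly, so nothing needs to be diffed on the local machine. A minimal Python sketch, assuming the requests library; the revision IDs are copied from the example URL above, and the exact key holding the diff HTML ("*" vs. "body") depends on the response format version:

    import requests

    # Fetch the diff between two revisions of a User_talk page via the
    # MediaWiki action=compare API. Revision IDs are taken from the
    # example URL above; swap in the pair you want to compare.
    API_URL = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "compare",
        "fromrev": 632942526,
        "torev": 632959597,
        "format": "json",
    }
    resp = requests.get(API_URL, params=params,
                        headers={"User-Agent": "user-talk-diff-research/0.1"})
    resp.raise_for_status()
    data = resp.json()

    # The diff comes back as an HTML table fragment; with the default
    # JSON format it sits under compare["*"].
    diff_html = data["compare"]["*"]
    print(diff_html[:500])

To cover every revision of a page, one way is to list the revision IDs first with action=query&prop=revisions&rvprop=ids and then compare consecutive pairs.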
[19:25:37] booleans! God bless booleans.
[19:25:46] Why write it in R though?
[19:25:53] If you are going to be processing large piles of data?
[19:26:03] How about a C utility?
[19:26:05] :D
[19:26:12] ah, because!
[19:26:38] I write a C utility. It runs over a day of sampled logs and produces sanitised, smaller versions with match_old, match_new columns, containing booleans, yes?
[19:27:03] and then I want to count that and extract particularly common substrings that are or aren't matching both, or that are matching one but not the other, and so I have to wait for read.delim to bloody read the thing.
[19:27:32] alternately, I write a C utility that creates an object of class:DataFrame at the end, call it from within R, and the read time for the processing is the only read time that happens.
[19:27:37] Less sitting around.
[19:28:17] I demand 2 hand-axes be available at allhands, so that we can get a mad warrior pose of halfak, holding them outstretched, leaning back and yelling his war cry.
[19:28:44] one with "awk" written on it and the other with "sed".
[19:29:00] the problem there, is that Dario might steal them.
[19:29:15] More photos, more goodness!
[19:29:28] (why, does he secretly like collecting axes?)
[19:29:42] no, he really likes UNIX utils.
[19:29:51] Ironholds, but if you write the C utility, then *I* can work with the files.
[19:29:57] I want to work with pageviews Ironholds
[19:30:10] I'm not willing to load giant things into memory all of the time.
[19:30:11] you write R better than I write C++ ;p
[19:30:14] but okay, fine, fine.
[19:30:15] :P
[19:30:23] Or has he read https://www.goodreads.com/book/show/23207986-keeping-warm-with-an-ax too many times, and now Winter Has Come?
[19:30:23] what do you need it to do? Just a command line thing?
[19:30:31] (it's a good book :)
[19:30:44] quiddity, +1
[19:30:53] I'd bring my own axes, but I don't think that TSA'd like that.
[19:31:01] :>
[19:31:08] urgh, except then I have to work out how to read GZIPd files in C.
[19:31:12] * quiddity goes back to looking for coffee..
[19:31:16] Ironholds, yeah. A C utility would be great.
[19:31:22] Ironholds, don't read gzipped files.
[19:31:31] Just pipe in from zcat :)
[19:31:34] Expect text
[19:31:38] It's the UNIX way!
[19:31:48] yeah, then I need to tie it all together with a shell...thing. eeeeh.
[19:31:52] fine
[19:32:18] zcat request_logs.gz | ironholds_view_extractor | bzip2 -c > page_views.bz2
[19:32:24] (because bz2 is faster and better)
[19:39:08] or I could just get boost's gzip library which does this automatically ;p
[19:39:11] * Ironholds sends an email to Andrew
[19:39:26] Nooo! What if I send you bzip2'd data?
[19:39:40] Or what if I have plain text already?
[19:39:50] Then I'd have to gzip it before sending it to your utility.
[19:40:34] cat request_logs.txt | gzip -c | ironholds_view_extractor | bzip2 -c > page_views.bz2
[19:40:36] Then my question would be "why are you custom-reading sampled log files"?
[19:40:53] Because I want to do custom things with them.
[19:41:00] like?
[19:41:50] I want to work out the difference between generating session information based on all requests vs a sample, so I randomly sample some IP/UA fingerprints to work with.
[19:42:19] * Ironholds headscratches
[19:42:33] so why would you be processing it /before/ identifying what a pageview is?
[19:42:49] Because I want to compare only looking at pageviews vs. looking at all requests.
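A rough sketch of the text-in, text-out filter being discussed, in Python rather than the C/C++ actually proposed: it reads tab-delimited request lines from stdin, appends two 0/1 fields along the lines of the match_old / match_new columns mentioned above, and writes each line back out so it can sit in a pipeline like the zcat ... | ... | bzip2 example. The URL column index and the two matching rules are placeholders, not the real pageview definitions:

    import re
    import sys

    # Stand-in patterns for the "old" and "new" pageview definitions;
    # the real definitions are not part of this log.
    OLD_DEF = re.compile(r"/wiki/")
    NEW_DEF = re.compile(r"action=view")

    URL_FIELD = 8  # assumed position of the requested URL in the log line

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        url = fields[URL_FIELD] if len(fields) > URL_FIELD else ""
        match_old = "1" if OLD_DEF.search(url) else "0"
        match_new = "1" if NEW_DEF.search(url) else "0"
        # Emit the original line with the two boolean fields appended.
        sys.stdout.write("\t".join(fields + [match_old, match_new]) + "\n")

Dropped into the pipeline above, it would run as something like: zcat request_logs.gz | python view_extractor.py | bzip2 -c > page_views.bz2.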
[19:43:01] but my utility is going to output all requests
[19:43:13] just with two new boolean fields that indicate which filter(s) the line matched
[19:43:58] I guess I haven't read the docs on it. Hard to comment without knowing what you are imagining.
[19:44:10] But FWIW, I write custom data processing jobs all of the time.
[19:44:23] fair
[19:44:39] The more basic and flexible the tool, the more I can do with it.
[19:45:24] By writing programs that read uncompressed text and output uncompressed text, you leave me with options about what I want to do with it.
[19:47:02] E.g. zcat request_logs.gz | ironholds_page_view | grep "1$" | shuf -n 100000 | extract_state -k4 | count > state_count.tsv
[19:47:08] fair
[19:47:16] sorry, my brain is...this is not a good year for my brain.
[19:47:51] The above line would filter only views out of the logs, randomly sample 100k of them, locate the state and save a file of requests per state.
[19:48:09] This is something I really wanted to do last week.
[19:49:05] The best part of this is that I can do it in hadoop :)
[19:49:37] Streaming isn't just a kludgy way for me to run python in hadoop. It's super powerful in that it lets unix operators be distributed in a hadoop environment.
[19:49:42] I know!
[19:49:50] OK :)
[19:49:54] * halfak stops ranting
[20:40:06] Blog post complete: http://socio-technologist.blogspot.com/2014/11/fitting-hadoop-streaming-into-my-python.html
[20:42:31] yay!
[20:42:38] how goeth the dataset uploading?
[20:47:12] Not bad. I just realized that I didn't prepare the MS desktop views. I'm getting that together now.
[20:52:15] cool!
[20:53:51] hey halfak, have you seen the star wars trailer? https://pbs.twimg.com/media/B3itd5bIUAApNOt.jpg
[20:54:14] I did. What's with the semi?
[20:58:18] ....
[20:58:21] damn you
[20:58:53] https://en.wikipedia.org/wiki/Semi-trailer
[21:01:42] Goddamn. I need to know when one of those is coming.
[21:01:56] There needs to be an emote for punny-face or something.
[21:02:01] I'll make a sign
[21:02:17] *PUN ALERT* halfak have you seen the new star wars trailer *PUN ALERT*
[21:02:24] lol
[21:02:54] anyway, I'm heading off to a makerspace for the evening. Gonna carve some wood.
[21:02:55] * Ironholds waves
[21:03:01] have fun!
[21:03:02] o/
[21:03:12] Will have datasets uploaded by the time you are back.
[21:14:21] Upload complete.
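As an illustration of the same text-in, text-out style at the other end of the pipeline quoted earlier (the count stage in zcat ... | ... | count > state_count.tsv), a small Python tally filter; it is a stand-in sketch, not the actual count utility, and it would also work as a Hadoop streaming step in the spirit of the blog post linked above:

    import sys
    from collections import Counter

    # Tally identical keys (e.g. state names) arriving one per line on
    # stdin and write "key<TAB>count" rows, most common first.
    counts = Counter(line.strip() for line in sys.stdin if line.strip())
    for key, n in counts.most_common():
        sys.stdout.write("{0}\t{1}\n".format(key, n))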