[00:13:56] halfak, I just let something crazy run on stat1003. If you can wait for it to finish before running something memory-intensive, that'd be great.
[00:14:05] trying to save some data and get out of R
[00:14:16] Sure. No problem for now.
[00:14:38] I just sent a message about raising priority. I'm curious why that is necessary.
[00:14:47] it has been crashing while writing. :-(
[00:15:14] Hmm... that shouldn't be priority related. Was it segfaulting?
[00:16:03] this is not priority related. agreed.
[00:16:13] This one is memory related I think
[00:16:27] it's a huge data.frame and R hates to write
[00:16:38] I'm converting it to a matrix right now, with hopes.
[00:16:47] Gotcha. Godspeed.
[00:17:05] thanks! Ironholds was picking on my number of columns. ;-)
[00:17:23] I was telling him 17M by 200 is normal. ;-)
[00:17:33] it's not!
[00:17:38] 17m by 14 is normal :D
[00:17:47] 17m columns?
[00:17:48] :D
[00:17:49] wut
[00:17:51] rows
[00:17:55] Oh. that'
[00:17:55] and 200 columns
[00:17:56] s fine
[00:18:07] Oh. 200 is a little much. Not too crazy though.
[00:18:14] really? huh
[00:18:15] yeah, I agree with you
[00:18:20] I don't think I've ever had an object with more than 20 ;p
[00:18:31] you will, sooooon! ;-)
[00:19:11] I was just playing around with a datatable that had 35.
[00:19:12] Ironholds, just turning it into a matrix is taking forever. I'm giving it a try; if it doesn't work, I'll have to split it
[00:19:18] *nods*
[00:19:19] makes sense!
[00:19:24] still, the code runs fast :D
[00:19:39] I really made your day, Ironholds. ;-)
[00:19:52] uhhuh, that's the order it went in ;p
[00:20:12] If it wasn't for this I would've been trying to puzzle out why with_tz objects to running within data.table expressions.
[00:22:25] :D
[00:22:56] you would be happy to hear that as.matrix has resulted in Error: cannot allocate vector of size 23.8 Gb
[00:23:04] aww
[01:07:52] halfak, I'm giving up for the next 2 hours. Feel free to use as much RAM as you need
[18:22:25] halfak, Ironholds, are you around?
[18:23:21] Hey leila
[18:23:23] what's up?
[18:24:39] I'm back on stat1003. I see that it has 65Gb RAM, but only 15Gb is available. I need somewhere around 26Gb. Do you have an idea how I should free up some RAM on the machine?
[18:24:55] I only see you there when I run who, but you're not using much
[18:25:03] I checked stat1002, that wasn't much better
[18:27:21] * halfak looks
[18:28:23] Looks like you are using 49GB right now
[18:28:55] I think that the machine will reserve some memory for file caching until you ask for it.
[18:29:02] I know that mongo does that like crazy.
[18:30:01] I hate to turn into mini-halfak
[18:30:10] but this is a real argument for us relying more on things like python
[18:30:23] being able to stream is a godsend with tremendously large datasets
[18:30:46] WOOOOO!! OH YEAH! "yield" changes the way you look at the world.
[18:31:05] * halfak puts the koolaid man back in his cage.
[18:31:14] So, even when I stop my current job, it shows me 15Gb free when I type free
[18:31:40] soo, how does Python help in this case?
[18:32:55] you wouldn't have to load the entire dataset at once, if you structured the file correctly.
[18:33:16] you could read in all the events associated with user1, process entirely, pipe to output, user2, process entirely, pipe... and so on.
[18:33:40] R is a "you have a TSV, perform vectorised operations on it" language. Python is "a tsv, eh? That's just stdin by another name".
[18:33:46] +1 I almost never need much memory.
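A minimal sketch of the streaming pattern described above, assuming a TSV that is sorted by a user ID in its first column (the function name and column position are illustrative, not anything from an actual library):

```python
import csv
import sys
from itertools import groupby
from operator import itemgetter

def events_by_user(lines, user_col=0):
    """Yield (user_id, events) pairs from TSV lines sorted by user ID.

    Because the input is sorted, only one user's events are ever held
    in memory at a time, no matter how large the file is.
    """
    reader = csv.reader(lines, delimiter="\t")
    for user_id, events in groupby(reader, key=itemgetter(user_col)):
        yield user_id, list(events)

if __name__ == "__main__":
    # Read the TSV from stdin, process each user entirely, pipe to output.
    for user_id, events in events_by_user(sys.stdin):
        sys.stdout.write("{0}\t{1}\n".format(user_id, len(events)))
```

Piped a sorted TSV on stdin, this processes one user at a time, so peak memory stays proportional to the busiest single user rather than to the whole dataset.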
[18:34:21] me, I'm spending my weekend implementing a Markov chain algorithm in R. Have yet to decide if that's a good thing.
[18:35:29] I see. my main worry is not R right now, but in a few hours when I have to feed all the data at once to an algorithm for the prediction
[18:36:01] k, thanks for the hints. Let me play with it a bit more.
[18:36:31] Indeed. This is an issue. I generally limit the observations I give a model by doing stream sampling in python. But in the end, you're going to have a giant table in memory.
[18:36:43] e.g. 100k observations usually does the trick.
[18:37:06] * Ironholds nods
[18:37:09] that would make sense as an approach.
[18:37:12] * halfak still uses R once he gets down to 100k obs
[18:37:18] totally!
[18:37:20] depending on the problem
[18:37:33] awk to filter, python to parse, R to interpret, C++ to interpret quickly.
[18:37:37] and in the darkness bind them
[18:38:20] on that note, keysplit() now lives in WMUtils. Tell your friends!
[18:38:30] keysplit?
[18:38:49] so, you want to parallelise an operation over multiple cores or machines. Cool! You'd use a list.
[18:38:58] Problem: you probably start with a data frame. Okay, split it into a list.
[18:39:18] Except...crap. It's a table of userID-month-count. All the rows associated with each unique userID need to be stored together.
[18:40:04] you pass keysplit a data frame or table, a column that's acting as a key, and the number of list elements you want. It returns a list of data.tables containing the input data.table/frame, split pseudorandomly, with all actions with the same key value grouped together.
[18:40:07] * Ironholds jazz hands
[18:40:25] and now you can take your session analysis code and fork it off to 25 machines and not worry that you have half of user1's requests on machine 3, and half on machine 12.
[18:40:39] trivial to implement but a useful problem to solve for.
[18:45:51] Ironholds, how can you generate sessions when you split a user's activity across different threads?
[18:46:29] you can't! That's my point
[18:46:43] Oh! I get it.
[18:46:48] it's a way of splitting a data.table that guarantees all rows associated with a particular key value will remain unified
[18:46:50] I'm not good at words
[18:46:53] It keeps keys together, but splits between them
[18:46:57] yes!
[18:46:59] :)
[18:47:02] Cool
[18:47:05] and randomly sorts them before doing so, as well
[18:47:15] so in theory you should end up with roughly equally-sized list entries
[18:47:22] regardless of how the dt/df was sorted when you inputted it.
[18:47:48] Sounds like a useful bit of code.
[18:48:04] download the library, use the library, report bugs or feature ideas!
[18:51:31] It's on the list.
[18:54:02] cool :)
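keysplit() itself is R code living in WMUtils. As a rough illustration of the grouping logic Ironholds describes (an illustration only, not the actual implementation, which works on data.frames/data.tables and returns a list of data.tables), the same idea sketched in Python:

```python
import random
from collections import defaultdict

def keysplit(rows, key, n, seed=None):
    """Split rows (dicts) into n chunks, keeping every row that shares a
    key value in the same chunk; key groups are assigned pseudorandomly."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)       # group rows by key value
    keys = list(groups)
    random.Random(seed).shuffle(keys)      # randomise group order so the
                                           # input sort order doesn't matter
    chunks = [[] for _ in range(n)]
    for i, k in enumerate(keys):
        chunks[i % n].extend(groups[k])    # deal groups out round-robin
    return chunks
```

Something like keysplit(events, key="userID", n=25) then gives 25 chunks that can be farmed out to 25 machines without ever splitting one user's rows. Shuffling the keys rather than the rows is what keeps each key's rows unified while making the split order-independent; chunk sizes are only roughly equal, since a heavy key drags all of its rows into a single chunk.

Looping back to halfak's stream-sampling remark a little earlier: the log doesn't say which method he uses, but reservoir sampling is one common way to cap a stream at a fixed number of observations. The 100k default below just echoes his figure:

```python
import random

def stream_sample(stream, k=100000, seed=None):
    """Keep a uniform random sample of k items from a stream of
    unknown length, using O(k) memory (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)    # inclusive: i + 1 possible slots
            if j < k:
                reservoir[j] = item  # replace with probability k / (i + 1)
    return reservoir
```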
[18:54:13] ow, C++. Why must you be an ass.
[18:54:38] too close to the metal.
[18:55:20] heh
[18:55:26] I realised R didn't have a Markov text generator, see?
[18:55:41] ...so now I'm building one. It has a legitimate purpose (an art project)
[18:56:00] https://pypi.python.org/pypi/PyMarkovChain/
[18:56:02] Just sayin
[18:56:39] I know! I used that for SandBot
[18:56:59] does Python also have a composite image generator?
[18:57:07] those are the two components I need. *strokes beard*
[18:57:22] http://stackoverflow.com/questions/2563822/how-do-you-composite-an-image-onto-another-image-with-pil-in-python
[18:57:29] aww
[18:57:48] if it exists, it's implemented in python
[18:57:49] ...
[18:57:57] actually has anyone tried just "import worldpeace" and seeing what happens?
[18:58:03] heh.
[18:58:17] ImportError: No module named worldpeace
[18:58:18] bummer
[18:58:28] soo, Ironholds, what are you doing with MCs?
[18:58:37] I'm making an art project.
[18:58:56] and what are the states?
[18:59:13] oh, it's going to be a text generator
[18:59:50] just a simple "build a map based on the tokenised text and possible follow-on elements, go through chaining them together"
[19:00:26] TL;DR I'm making an art project called "Hypothetical Art" that consumes massive amounts of open metadata and images about the art world and, when you hit a button, generates a composite image and an associated, markov'd description
[19:01:16] Excellent
[19:01:20] meta-art
[19:01:23] why? Because it amused me.
[19:01:24] it's a cool idea, though I don't see the connection between that and MCs yet
[19:01:36] MC generates the text description
[19:01:46] leila, because MCs are a really good way, when you get the order number high enough, of generating plausible but pseudorandom text from an input corpus
[19:02:23] I give it the real descriptions, it (hopefully) generates a sensible output. I might move away from that, though, and opt for extracting the valuable bits of existing description text and fitting them into prebuilt structures. I'll see.
[19:02:46] so, "[composite of titles] is a work by [composite of author names] created in [average of creation years]..." and so on.
[19:02:55] does the Markov property hold in your case? It seems from what you describe that each state doesn't depend only on the previous state, but on the history of states it has been to?
[19:03:17] just the previous! Check out pymarkovgen, as halfak linked above; it's a lovely little implementation
[19:04:23] :)
[19:04:24] ah! checking it
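For concreteness, a minimal sketch of the scheme Ironholds describes: tokenise the text, map each token to its possible follow-on elements, then chain them together. It is first-order ("just the previous"), and all names are illustrative; PyMarkovChain, linked above, is a fuller implementation:

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each token to the tokens observed to follow it."""
    tokens = text.split()
    chain = defaultdict(list)
    for current, follower in zip(tokens, tokens[1:]):
        chain[current].append(follower)
    return chain

def generate(chain, length=30, seed=None):
    """Walk the chain: start anywhere, then repeatedly pick a random
    observed follower of the current token."""
    rng = random.Random(seed)
    token = rng.choice(list(chain))
    output = [token]
    while len(output) < length:
        followers = chain.get(token)
        if not followers:
            break                     # dead end: token only appeared last
        token = rng.choice(followers)
        output.append(token)
    return " ".join(output)
```

generate(build_chain(corpus)) walks the map from a random starting token until it reaches the requested length or hits a token with no observed followers. Appending followers to a list (with duplicates) means frequent continuations are picked proportionally more often, which is what makes the output plausible rather than uniformly random.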