[11:37:58] halfak, You there?
[11:38:06] I just saw your ping for me on meta
[13:56:59] Hey tos. Sorry to miss you.
[13:57:12] Hey
[13:57:52] Did any of the others contact you yet
[13:59:44] halfak, ping
[13:59:57] Nope. I didn't hear from anyone but you.
[14:00:54] Maybe you should have pinged them on enwiki/email?
[14:01:05] I suggest a group mail to all
[14:01:17] So any reply goes to all, like a mailing list
[14:01:28] That way, a time could be set up easier
[14:01:48] Btw I have plenty of time so consider me available
[14:05:34] Awesome.
[14:05:43] Yeah. I'm kind of hoping to not become the leader.
[14:05:49] I'm spread too thin as it is.
[14:06:06] But I don't want to see this project fizzle out.
[14:06:33] I suppose mentions on meta don't work as well as mentions on enwiki.
[14:07:00] I've got to head to the office. I'll be back online in about an hour.
[15:10:39] halfak, Welcome back
[15:11:19] Thanks. The ride in was nice this morning. It's almost almost above freezing here.
[15:12:30] !coffee halfak
[15:12:40] Dang
[15:14:26] book wm-bot2
[15:14:30] *boo
[15:15:06] BTW, I'm almost ready to start using the library that I've been working on for the back-end of Snuggle.
[15:15:15] https://bitbucket.org/halfak/mediawiki
[15:15:20] Not much to see yet.
[15:15:27] Unless you want to fork the repo.
[15:15:41] fork = make a copy to work with
[15:15:49] repo = Source code repository
[15:15:55] words
[15:16:42] (fwiw, I don't mean to patronize -- I just hate it when people knowingly talk in jargon)
[15:16:58] tos2: ^
[15:20:22] One sec. Was afk
[15:21:32] What language is the code in, halfak?
[15:21:45] Python
[15:22:00] That's a good thing to note at the top of the README
[15:22:02] Gotcha. I should be able to look into the code then
[15:22:20] (The page hadn't loaded yet, so I wasn't able to tell :P )
[15:22:34] Right now, it's super under-documented. My focus is on the examples in the "scripts/" directory.
[15:23:05] My goal is to design the API that makes the library as easy to use as possible.
[15:23:26] So I'm starting by pretending the library is finished and writing scripts that make use of it.
[15:24:48] Okay... So what will be there in the library?
[15:26:43] There's connectors for querying Wikipedia's API, the database and XML dumps. There's also revert detectors, session caches, timestamp management and title parsing.
[15:27:47] The idea is that I solve all of the annoying problems in this library and encourage others to make use of it too. Snuggle's back-end will start using this library too.
[15:28:46] So the library is kind of an accessible way to access Wikipedia's API using queries, and manipulating the data received, so you can use it on multiple locations, including Snuggle?
[15:28:57] Yup. That's a good way to put it.
[15:29:19] Got it.
[15:29:41] And the current scripts you're working on, what exactly do they help with?
[15:30:35] Those example scripts help me think better about constructing an API that's nice to use.
[15:30:52] It's sort of like doing mockups of a UI change before making the change.
[15:31:02] By thinking about how it will be used before I start writing the code, I make something more useful.
[15:31:10] * tos2 does not follow, really
[15:31:13] But in this case, the mockup is code too.
[15:32:02] Take this script for example: https://bitbucket.org/halfak/mediawiki/src/6d8c770737c4104df83466344858806e8f55c95a/scripts/example.dump.reverts.py?at=default
[15:32:30] It demonstrates a really powerful thing that you can do with the library.
[15:33:03] The XML dumps that Wikipedia produces are huge. They contain all of the revisions to all articles, ever.
[15:33:34] As you can imagine [[Anarchism]] and other large articles represent gigabytes of data.
[15:34:29] So, in order to process these dumps, you have to use a "streaming" strategy. This allows you to read the large file a chunk at a time rather than trying to load the whole thing into memory.
[15:35:11] The dump library I wrote makes it trivial to do this. All you need is the dump file and I'll give you a set of "Page" and "Revision" objects to work with.
[15:35:58] Ok so I'm getting to understand it a bit.
[15:36:04] Given this nice structure, it's trivial for me to pass the right information to my revert detector to detect the reverts.
[15:36:15] Usually the script I linked to would be 500-1000 lines.
[15:36:50] But with my library, it's only 12 lines (21 with whitespace and comments).
[15:36:58] So the scripts are basically ways to use the library for snuggle specific queries and work around with them?
[15:37:40] Captured in this library are a lot of the ways that I think about processing wiki data.
[15:38:11] Whether it is a new analysis for a research project or Snuggle.
[15:39:35] * tos2 nods
[15:39:42] I think I understood most of it.
[15:40:30] So is there anything I can do to help at this point? Mostly, what do you think I should be doing to try and understand most of it?
[15:41:21] Hmmm. Good question. Right now, I just wanted to talk progress, but it would be great to have a hand. I'll have to get back to you about where I think that the best progress can be made.
[15:41:47] Gotcha.
[15:41:53] I'm not sure that hopping into this library is a good idea, but making use of the library and talking to me about what breaks/doesn't make sense would be great.
[15:42:19] It's not quite ready for general use yet, but that should be coming within a week.
[15:42:59] The next big task for Snuggle is exploring redis as a database back-end. Have you had a class that has covered set theory yet?
[15:44:00] I assume set theory does not imply the set theory of maths, so no
[15:48:37] Yes. Set theory of maths
[15:48:39] :P
[15:48:53] Math === Computer science
[15:49:51] How advanced set theory are we talking of?
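A minimal sketch of the "streaming" strategy described earlier in this log, reading a dump one page at a time instead of loading the whole file. It uses only Python's standard library, not the library being discussed (whose API is not shown here). The function name `iter_pages`, the tiny `SAMPLE_DUMP`, and the simplified element layout are assumptions for illustration; real MediaWiki dumps also carry an XML namespace and many more fields.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a MediaWiki XML dump.
SAMPLE_DUMP = b"""<mediawiki>
  <page>
    <title>Anarchism</title>
    <revision><id>1</id><text>first</text></revision>
    <revision><id>2</id><text>second</text></revision>
  </page>
  <page>
    <title>Redis</title>
    <revision><id>3</id><text>only</text></revision>
  </page>
</mediawiki>"""


def iter_pages(stream):
    """Yield (title, [revision ids]) for one page at a time."""
    # iterparse reads the stream incrementally and reports each element
    # as soon as its closing tag has been seen.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            rev_ids = [int(rev.findtext("id")) for rev in elem.iter("revision")]
            yield title, rev_ids
            # Drop the finished page's children so they can be freed.
            elem.clear()


pages = list(iter_pages(io.BytesIO(SAMPLE_DUMP)))
for title, rev_ids in pages:
    print(title, rev_ids)
```

This sketch still holds one complete page in memory at a time; for pages whose history runs to gigabytes, the same idea would have to be applied per revision rather than per page.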
[15:50:18] I know set theory, but I doubt it's of the level you're talking of
[15:50:47] It should be fine. The thing I was worried about is the foundation concepts.
[15:50:54] So redis is a database.
[15:51:01] It does two weird things.
[15:51:23] 1. It stores everything in memory (RAM). Most databases store data on a hard drive.
[15:51:53] This makes it super fast, but we also need to be cautious to limit the data that gets to live in redis.
[15:52:20] 2. Redis is a key-value store. Most of the time, when I say "database" I mean a "Relational Database Management System".
[15:52:28] (RDBMS)
[15:52:51] Relational systems use SQL to access "relations".
[15:53:06] A "relation" is actually just a set with some cool operators.
[15:53:32] So when I'm querying the MySQL database behind Wikipedia, I'm really just performing some set operations.
[15:53:36] Well, redis is different.
[15:53:54] Redis *has* sets, but most things are expressed as key-value pairs.
[15:54:28] Keys can get quite complex and that's good for us. Let me try an example.
[15:54:47] What programming language are you most familiar with?
[15:55:05] JAVA
[15:55:21] * halfak hopes you don't mind the impromptu lecture on important bits of redis.
[15:55:41] No, I don't. I just plan to google most of the stuff I didn't understand
[15:55:44] OK. Damn. Stupid Java. How do you make a hash map in Java?
[15:56:28] I have no idea, but let me google what a hash map is
[15:56:34] OK. Here: db = java.util.HashMap;
[15:56:49] so we can say db['foo'] = "bar"
[15:56:59] And then ask for db['foo'] later to get "bar"
[15:57:27] Does that work in Java?
[15:58:11] Are you talking of pointers?
[15:58:24] Oh no. Well sort of. Everything in Java is a pointer.
[15:58:36] Rather, a reference.
[15:59:00] Okay
[15:59:04] I just want something that stores key->value pairs.
[15:59:22] Got it.
[15:59:25] It looks like java wants me to be more explicit. I'll start over again.
[15:59:39] db = HashMap()
[15:59:46] db.put("foo", "bar")
[15:59:59] db.get("foo") // returns "bar"
[16:00:03] Make sense?
[16:00:23] Yes
[16:00:48] I can get basic concepts. So what about hash maps
[16:01:36] OK. So in redis, the whole system is one giant hash map.
[16:01:56] But we'd like to store a lot of different kinds of things in it -- like data about specific users.
[16:02:05] So we use some conventions to produce interesting keys.
[16:02:45] db.put("users:123456789:user_name", "EpochFail")
[16:03:15] db.put("users:123456789:registration_date", "2013-01-10T03:04:05Z")
[16:03:19] etc.
[16:03:51] Got it
[16:03:53] So we can use this type of structure to store all of the things we care about with regards to users.
[16:03:56] And pages:
[16:04:24] db.put("pages:12:title", "Anarchism")
[16:04:30] OK That's enough.
[16:04:38] So all the details about your username etc is linked to a given ID
[16:04:44] Or pages
[16:04:52] Exactly. But how do you query the users?
[16:05:02] Right now, you need to know the key name in order to get the values.
[16:05:35] So, redis also has sets. I'm going to switch from java land to redis commands.
[16:05:49] Ok
[16:06:17] the way this works is " "
[16:06:27] the command to add a value to a set is "SADD"
[16:07:07] "SADD users:ids 123456789" adds the user ID for "EpochFail" to the set located at "users:ids".
[16:07:36] Got it
[16:08:58] Using these datatypes, I think that we can increase Snuggle's performance by 100X. But the trick is figuring out how to turn all of the "Relational database" thinking into "Redis key-value + sets" thinking.
[16:09:41] So, I've got to run now, but we should plan to sit down again to look at this. Do you have a Wikimedia Labs account?
[16:09:49] No, I don't
[16:10:03] https://wikitech.wikimedia.org/w/index.php?title=Special:UserLogin&type=signup&returnto=Main+Page
[16:10:14] Got it
[16:10:28] If you get an account, I'll add you to the "Snuggle" project and we can work together on redis testing.
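The key conventions and the "users:ids" set from the redis walkthrough above can be imitated with a plain Python dict and set. This is a rough sketch only: it is not Redis and not a Redis client. `db`, `sadd` and `smembers` are hypothetical stand-ins, named after the Redis commands SADD and SMEMBERS, and the data is the example data from the conversation.

```python
db = {}  # stand-in for the Redis keyspace: one giant hash map


def sadd(key, member):
    """Mimic Redis SADD: add member to the set stored at key."""
    db.setdefault(key, set()).add(member)


def smembers(key):
    """Mimic Redis SMEMBERS: return the set stored at key (empty if absent)."""
    return db.get(key, set())


# Plain key -> value pairs, following the "<kind>:<id>:<field>" convention.
db["users:123456789:user_name"] = "EpochFail"
db["users:123456789:registration_date"] = "2013-01-10T03:04:05Z"
db["pages:12:title"] = "Anarchism"

# The set of known user IDs is what makes users discoverable without
# already knowing their key names.
sadd("users:ids", 123456789)

# "Query" the users: walk the ID set, then build each key from its ID.
user_names = [
    db[f"users:{user_id}:user_name"]
    for user_id in sorted(smembers("users:ids"))
]
print(user_names)
```

The point of the set is the last step: without "users:ids" there is no way to enumerate users, only to look one up when its key is already known.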
[16:10:31] Btw, maybe you'd want a hangout next time, so you don't have to type it all out
[16:10:34] Ok
[16:10:55] Do you know how public/private keys work?
[16:11:57] If we're talking about data encryption then I have a basic, though not a very firm idea of how things work
[16:12:31] What operating system are you using?
[16:12:52] Windows 7
[16:13:06] https://wikitech.wikimedia.org/wiki/User_talk:TheOriginalSoni
[16:14:43] OK. Windows 7 might be a little bit of a struggle, but I know we can generate a key for you somehow.
[16:15:01] Have you done much work with Linux?
[16:15:10] None at all. Should I?
[16:16:47] Labs is 100% linux, so I'd highly recommend it.
[16:17:10] I can help you set up a "Virtual" ubuntu environment on your computer to work with when we meet next.
[16:17:34] Ok, got it
[16:17:35] If you want to get started without me, google "Install Ubuntu in Virtual Box".
[16:17:51] OK. Time for me to get back to WMF stuff.
[16:18:05] I'm stoked that you want to help. :) This is fun.
[16:18:52] See you later