[11:37:58] <tos>	 halfak, You there?
[11:38:06] <tos>	 I just saw your ping for me on meta
[13:56:59] <halfak>	 Hey tos.  Sorry to miss you.
[13:57:12] <tos>	 Hey
[13:57:52] <tos>	 Did any of the others contact you yet
[13:59:44] <tos>	 halfak, ping
[13:59:57] <halfak>	 Nope.  I didn't hear from anyone but you.
[14:00:54] <tos>	 Maybe you should have pinged them on enwiki/email?
[14:01:05] <tos>	 I suggest a group mail to all
[14:01:17] <tos>	 So any reply goes to all, like a mailing list
[14:01:28] <tos>	 That way, a time could be set up easier
[14:01:48] <tos>	 Btw I have plenty of time so consider me available
[14:05:34] <halfak>	 Awesome.
[14:05:43] <halfak>	 Yeah.  I'm kind of hoping to not become the leader.
[14:05:49] <halfak>	 I'm spread too thin as it is.
[14:06:06] <halfak>	 But I don't want to see this project fizzle out.
[14:06:33] <halfak>	 I suppose mentions on meta don't work as well as mentions on enwiki.
[14:07:00] <halfak>	 I've got to head to the office.  I'll be back online in about an hour.
[15:10:39] <tos2>	 halfak, Welcome back
[15:11:19] <halfak>	 Thanks.  The ride in was nice this morning.  It's almost above freezing here.
[15:12:30] <tos2>	 !coffee halfak
[15:12:40] <tos2>	 Dang
[15:14:26] <halfak>	 book wm-bot2
[15:14:30] <halfak>	 *boo
[15:15:06] <halfak>	 BTW, I'm almost ready to start using the library that I've been working on for the back-end of Snuggle.
[15:15:15] <halfak>	 https://bitbucket.org/halfak/mediawiki
[15:15:20] <halfak>	 Not much to see yet.
[15:15:27] <halfak>	 Unless you want to fork the repo.
[15:15:41] <halfak>	 fork = make a copy to work with
[15:15:49] <halfak>	 repo = Source code repository
[15:15:55] <halfak>	 words
[15:16:42] <halfak>	 (fwiw, I don't mean to patronize -- I just hate it when people knowingly talk in jargon)
[15:16:58] <halfak>	 tos2: ^
[15:20:22] <tos2>	 One sec. Was afk
[15:21:32] <tos2>	 What language is the code in, halfak?
[15:21:45] <halfak>	 Python
[15:22:00] <halfak>	 That's a good thing to note at the top of the README
[15:22:02] <tos2>	 Gotcha. I should be able to look into the code then
[15:22:20] <tos2>	 (The page hadn't loaded yet, so I wasn't able to tell :P )
[15:22:34] <halfak>	 Right now, it's super under-documented.  My focus is on the examples in the "scripts/" directory.
[15:23:05] <halfak>	 My goal is to design the API that makes the library as easy to use as possible.
[15:23:26] <halfak>	 So I'm starting by pretending the library is finished and writing scripts that make use of it.
[15:24:48] <tos2>	 Okay... So what will be there in the library?
[15:26:43] <halfak>	 There are connectors for querying Wikipedia's API, the database and XML dumps.  There are also revert detectors, session caches, timestamp management and title parsing.
[15:27:47] <halfak>	 The idea is that I solve all of the annoying problems in this library and encourage others to make use of it too.  Snuggle's back-end will start using this library too.
[15:28:46] <tos2>	 So the library is kind of an accessible way to query Wikipedia's API and manipulate the data received, so you can use it in multiple places, including Snuggle?
[15:28:57] <halfak>	 Yup.  That's a good way to put it.
[15:29:19] <tos2>	 Got it.
[15:29:41] <tos2>	 And the current scripts you're working on, what exactly do they help with?
[15:30:35] <halfak>	 Those example scripts help me think better about constructing an API that's nice to use.
[15:30:52] <halfak>	 It's sort of like doing mockups of a UI change before making the change.
[15:31:02] <halfak>	 By thinking about how it will be used before I start writing the code, I make something more useful.
[15:31:10] * tos2  does not follow, really
[15:31:13] <halfak>	 But in this case, the mockup is code too.
[15:32:02] <halfak>	 Take this script for example: https://bitbucket.org/halfak/mediawiki/src/6d8c770737c4104df83466344858806e8f55c95a/scripts/example.dump.reverts.py?at=default
[15:32:30] <halfak>	 It demonstrates a really powerful thing that you can do with the library.
[15:33:03] <halfak>	 The XML dumps that Wikipedia produces are huge.  They contain all of the revisions to all articles, ever.
[15:33:34] <halfak>	 As you can imagine [[Anarchism]] and other large articles represent gigabytes of data.
[15:34:29] <halfak>	 So, in order to process these dumps, you have to use a "streaming" strategy.  This allows you to read the large file a chunk at a time rather than trying to load the whole thing into memory.
[15:35:11] <halfak>	 The dump library I wrote makes it trivial to do this.  All you need is the dump file and I'll give you a set of "Page" and "Revision" objects to work with.
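A minimal sketch of the streaming strategy described above, using only Python's standard library rather than halfak's dump library (whose interface is not shown in this log). The sample XML, the `iter_revisions` name and the tuple layout are assumptions for illustration; real dumps are far larger and use an XML namespace that this sample leaves out.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump file (hypothetical content, no XML namespace).
SAMPLE_DUMP = b"""<mediawiki>
  <page>
    <title>Anarchism</title>
    <revision><id>1</id><text>first draft</text></revision>
    <revision><id>2</id><text>second draft</text></revision>
  </page>
  <page>
    <title>Autism</title>
    <revision><id>3</id><text>stub</text></revision>
  </page>
</mediawiki>"""


def iter_revisions(stream):
    """Yield (page_title, revision_id, text) one revision at a time."""
    title = None
    # iterparse reads the stream incrementally instead of loading it whole.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "title":
            title = elem.text
        elif elem.tag == "revision":
            yield title, int(elem.findtext("id")), elem.findtext("text")
            elem.clear()  # free this revision's content once it has been handed out
        elif elem.tag == "page":
            elem.clear()  # drop the emptied revision shells of the finished page


revisions = list(iter_revisions(io.BytesIO(SAMPLE_DUMP)))
for title, rev_id, text in revisions:
    print(title, rev_id, text)
```

Collecting everything with `list(...)` is only reasonable for the tiny sample; on a real dump you would loop over the generator directly so that only one revision is held in memory at a time.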
[15:35:58] <tos2>	 Ok so I'm getting to understand it a bit.
[15:36:04] <halfak>	 Given this nice structure, it's trivial for me to pass the right information to my revert detector to detect the reverts.
[15:36:15] <halfak>	 Usually the script I linked to would be 500-1000 lines.
[15:36:50] <halfak>	 But with my library, it's only 12 lines (21 with whitespace and comments).
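The log does not say how the revert detector works. One common technique is the "identity revert": a revision counts as a revert when its content is byte-for-byte identical to an earlier revision, which is cheap to check by comparing checksums. The sketch below assumes that technique; the `detect_reverts` name, the `radius` parameter and the sample history are hypothetical and are not taken from the library.

```python
import hashlib


def detect_reverts(texts, radius=15):
    """Return (reverting_index, reverted_to_index) pairs for a list of revision texts.

    Only the last `radius` revisions are searched for identical content.
    """
    seen = []  # (checksum, index) for recent revisions, oldest first
    reverts = []
    for index, text in enumerate(texts):
        checksum = hashlib.sha1(text.encode("utf-8")).hexdigest()
        # Search newest-first for an earlier revision with identical content.
        for old_checksum, old_index in reversed(seen):
            if old_checksum == checksum:
                if index - old_index > 1:  # at least one revision in between was undone
                    reverts.append((index, old_index))
                break
        seen.append((checksum, index))
        seen = seen[-radius:]
    return reverts


# Hypothetical page history: revision 2 restores the content of revision 0.
history = ["stub", "stub plus vandalism", "stub", "stub with a new section"]
reverts = detect_reverts(history)
print(reverts)
```

Fed from a streaming reader, a detector like this only needs the checksums of the most recent revisions of the current page, so memory use stays small no matter how large the dump is.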
[15:36:58] <tos2>	 So the scripts are basically ways to use the library for snuggle specific queries and work around with them?
[15:37:40] <halfak>	 Captured in this library are a lot of the ways that I think about processing wiki data.
[15:38:11] <halfak>	 Whether it is a new analysis for a research project or Snuggle.
[15:39:35] * tos2  nods
[15:39:42] <tos2>	 I think I understood most of it.
[15:40:30] <tos2>	 So is there anything I can do to help at this point? Mostly, what do you think I should be doing to try and understand most of it?
[15:41:21] <halfak>	 Hmmm.  Good question.  Right now, I just wanted to talk progress, but it would be great to have a hand.  I'll have to get back to you about where I think that the best progress can be made.
[15:41:47] <tos2>	 Gotcha.
[15:41:53] <halfak>	 I'm not sure that hopping into this library is a good idea, but making use of the library and talking to me about what breaks/doesn't make sense would be great.
[15:42:19] <halfak>	 It's not quite ready for general use yet, but that should be coming within a week.
[15:42:59] <halfak>	 The next big task for Snuggle is exploring redis as a database back-end.  Have you had a class that has covered set theory yet?
[15:44:00] <tos2>	 I assume set theory does not imply the set theory of maths, so no
[15:48:37] <halfak>	 Yes.  Set theory of maths
[15:48:39] <halfak>	 :P
[15:48:53] <halfak>	 Math === Computer science
[15:49:51] <tos2>	 How advanced is the set theory we're talking about?
[15:50:18] <tos2>	 I know set theory, but I doubt it's of the level you're talking of
[15:50:47] <halfak>	 It should be fine.  The thing I was worried about is the foundation concepts.
[15:50:54] <halfak>	 So redis is a database.
[15:51:01] <halfak>	 It does two weird things.
[15:51:23] <halfak>	 1. It stores everything in memory (RAM).  Most databases store data on a hard drive.
[15:51:53] <halfak>	 This makes it super fast, but we also need to be cautious to limit the data that gets to live in redis.
[15:52:20] <halfak>	 2. Redis is a key-value store.  Most of the time, when I say "database" I mean a "Relational Database Management System".
[15:52:28] <halfak>	 (RDBMS)
[15:52:51] <halfak>	 Relational systems use SQL to access "relations".
[15:53:06] <halfak>	 A "relation" is actually just a set with some cool operators.
[15:53:32] <halfak>	 So when I'm querying the MySQL database behind Wikipedia, I'm really just performing some set operations.
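To make "a relation is just a set with some cool operators" concrete, here is a small sketch that models two relations as Python sets of tuples and runs the usual relational operations on them. The column layouts and every row are made up for illustration (only the user name and ID for EpochFail appear elsewhere in this log); this is not Wikipedia's actual schema.

```python
# Two "relations", each modeled as a set of tuples (hypothetical sample rows).
users = {  # (user_id, user_name)
    (123456789, "EpochFail"),
    (42, "ExampleUser"),
}
edits = {  # (rev_id, user_id, page_title)
    (10, 123456789, "Anarchism"),
    (11, 123456789, "Autism"),
    (12, 42, "Anarchism"),
}

# Selection (SQL: WHERE page_title = 'Anarchism') keeps the matching rows.
anarchism_edits = {row for row in edits if row[2] == "Anarchism"}

# Projection (SQL: SELECT user_id) keeps one column; duplicates vanish because it is a set.
anarchism_editor_ids = {user_id for _rev_id, user_id, _title in anarchism_edits}
autism_editor_ids = {user_id for _rev_id, user_id, title in edits if title == "Autism"}

# Join (SQL: JOIN users USING (user_id)) matches rows across the two relations.
anarchism_editor_names = {name for user_id, name in users if user_id in anarchism_editor_ids}

# Plain set operators work as well: who edited Anarchism but not Autism?
only_anarchism = anarchism_editor_ids - autism_editor_ids

print(sorted(anarchism_editor_names), only_anarchism)
```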
[15:53:36] <halfak>	 Well, redis is different.
[15:53:54] <halfak>	 Redis *has* sets, but most things are expressed as key-value pairs.
[15:54:28] <halfak>	 Keys can get quite complex and that's good for us.  Let me try an example.
[15:54:47] <halfak>	 What programming language are you most familiar with?
[15:55:05] <tos2>	 JAVA
[15:55:21] * halfak  hopes you don't mind the impromptu lecture on important bits of redis.
[15:55:41] <tos2>	 No, I don't. I just plan to google most of the stuff I didn't understand
[15:55:44] <halfak>	 OK.  Damn.  Stupid Java.  How do you make a hash map in Java?
[15:56:28] <tos2>	 I have no idea, but let me google what a hash map is
[15:56:34] <halfak>	 OK.  Here:  db = java.util.HashMap<String, String>;
[15:56:49] <halfak>	 so we can say db['foo'] = "bar"
[15:56:59] <halfak>	 And then ask for db['foo'] later to get "bar"
[15:57:27] <halfak>	 Does that work in Java?
[15:58:11] <tos2>	 Are you talking of pointers?
[15:58:24] <halfak>	 Oh no.  Well sort of.  Everything in Java is a pointer.
[15:58:36] <halfak>	 Rather, a reference (for objects, anyway; primitives like int are plain values).
[15:59:00] <tos2>	 Okay
[15:59:04] <halfak>	 I just want something that stores key->value pairs.
[15:59:22] <tos2>	 Got it.
[15:59:25] <halfak>	 It looks like java wants me to be more explicit.  I'll start over again.
[15:59:39] <halfak>	 HashMap<String, String> db = new HashMap<String, String>();  // needs: import java.util.HashMap;
[15:59:46] <halfak>	 db.put("foo", "bar");
[15:59:59] <halfak>	 db.get("foo"); // returns "bar"
[16:00:03] <halfak>	 Make sense?
[16:00:23] <tos2>	 Yes
[16:00:48] <tos2>	 I can get basic concepts. So what about hash maps
[16:01:36] <halfak>	 OK.  So in redis, the whole system is one giant hash map.
[16:01:56] <halfak>	 But we'd like to store a lot of different kinds of things in it -- like data about specific users.
[16:02:05] <halfak>	 So we use some conventions to produce interesting keys.
[16:02:45] <halfak>	 db.put("users:123456789:user_name", "EpochFail")
[16:03:15] <halfak>	 db.put("users:123456789:registration_date", "2013-01-10T03:04:05Z")
[16:03:19] <halfak>	 etc.
[16:03:51] <tos2>	 Got it
[16:03:53] <halfak>	 So we can use this type of structure to store all of the things we care about with regards to users.
[16:03:56] <halfak>	 And pages:
[16:04:24] <halfak>	 db.put("pages:12:title", "Anarchism")
[16:04:30] <halfak>	 OK  That's enough.
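The key convention above can be tried without a Redis server. In this sketch a plain Python dict stands in for the "one giant hash map"; the `user_key` helper is a hypothetical convenience, not a Redis or Snuggle function, and the keys and values are the ones from the conversation.

```python
# A plain dict stands in for Redis here: one flat map from string keys to string values.
db = {}


def user_key(user_id, field):
    """Build a key following the "users:<id>:<field>" convention (hypothetical helper)."""
    return f"users:{user_id}:{field}"


db[user_key(123456789, "user_name")] = "EpochFail"
db[user_key(123456789, "registration_date")] = "2013-01-10T03:04:05Z"
db["pages:12:title"] = "Anarchism"

# Lookups only work when you can rebuild the exact key.
print(db[user_key(123456789, "user_name")])
print(sorted(db))
```

The structure lives entirely in the key names: the store itself has no notion of "users" or "pages", which is why a separate index is needed before you can list them.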
[16:04:38] <tos2>	 So all the details about your username etc. are linked to a given ID
[16:04:44] <tos2>	 Or pages
[16:04:52] <halfak>	 Exactly.  But how do you query the users?
[16:05:02] <halfak>	 Right now, you need to know the key name in order to get the values.
[16:05:35] <halfak>	 So, redis also has sets.  I'm going to switch from java land to redis commands.
[16:05:49] <tos2>	 Ok
[16:06:17] <halfak>	 the way this works is "<command> <arguments>"
[16:06:27] <halfak>	 the command to add a value to a set is "SADD"
[16:07:07] <halfak>	 "SADD users:ids 123456789" Adds the user ID for "EpochFail" to the set located at "users:ids";
[16:07:36] <tos2>	 Got it
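A rough sketch of how a set of IDs answers the "how do you query the users?" question. Python dicts and sets stand in for a Redis server; `sadd`, `smembers` and `sinter` are local functions that mimic the Redis commands SADD, SMEMBERS and SINTER. The second user and the "users:newcomers" set are hypothetical additions for illustration.

```python
# Key-value pairs, as in the conversation, plus one hypothetical second user.
db = {
    "users:123456789:user_name": "EpochFail",
    "users:42:user_name": "ExampleUser",
}
sets = {}  # set-valued keys, kept apart from the string values for clarity


def sadd(key, member):
    """Mimic Redis SADD: add a member to the set stored at key."""
    sets.setdefault(key, set()).add(member)


def smembers(key):
    """Mimic Redis SMEMBERS: return all members of the set at key."""
    return set(sets.get(key, set()))


def sinter(*keys):
    """Mimic Redis SINTER: return the members common to all the given sets."""
    members = [smembers(key) for key in keys]
    return set.intersection(*members) if members else set()


sadd("users:ids", "123456789")
sadd("users:ids", "42")
sadd("users:newcomers", "42")  # hypothetical secondary index

# "List all users": walk the ID set, then rebuild each key.
all_names = sorted(db[f"users:{user_id}:user_name"] for user_id in smembers("users:ids"))

# A WHERE-style filter becomes a set intersection.
newcomer_ids = sinter("users:ids", "users:newcomers")

print(all_names, newcomer_ids)
```

The ID set plays the role that a table scan or index plays in a relational database: it is the only place that records which user keys exist.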
[16:08:58] <halfak>	 Using these datatypes, I think that we can increase Snuggle's performance by 100X.  But the trick is figuring out how to turn all of the "Relational database" thinking into "Redis key-value + sets" thinking.
[16:09:41] <halfak>	 So, I've got to run now, but we should plan to sit down again to look at this.  Do you have a Wikimedia Labs account?
[16:09:49] <tos2>	 No, I don't
[16:10:03] <halfak>	 https://wikitech.wikimedia.org/w/index.php?title=Special:UserLogin&type=signup&returnto=Main+Page
[16:10:14] <tos2>	 Got it
[16:10:28] <halfak>	 If you get an account, I'll add you to the "Snuggle" project and we can work together on redis testing.
[16:10:31] <tos2>	 Btw, maybe you'd want a hangout next time, so you don't have to type it all out
[16:10:34] <tos2>	 Ok
[16:10:55] <halfak>	 Do you know how public/private keys work?
[16:11:57] <tos2>	 If we're talking about data encryption then I have a basic, though not a very firm idea of how things work
[16:12:31] <halfak>	 What operating system are you using?
[16:12:52] <tos2>	 Windows 7
[16:13:06] <tos2>	 https://wikitech.wikimedia.org/wiki/User_talk:TheOriginalSoni
[16:14:43] <halfak>	 OK.  Windows 7 might be a little bit of a struggle, but I know we can generate a key for you somehow.
[16:15:01] <halfak>	 Have you done much work with Linux?
[16:15:10] <tos2>	 None at all. Should I?
[16:16:47] <halfak>	 Labs is 100% linux, so I'd highly recommend it.
[16:17:10] <halfak>	 I can help you set up a "Virtual" ubuntu environment on your computer to work with when we meet next.
[16:17:34] <tos2>	 Ok, got it
[16:17:35] <halfak>	 If you want to get started without me, google "Install Ubuntu in Virtual Box".
[16:17:51] <halfak>	 OK.  Time for me to get back to WMF stuff.
[16:18:05] <halfak>	 I'm stoked that you want to help. :)  This is fun.
[16:18:52] <tos2>	 See you later