[14:21:25] halfak: I'm online, tell me when you are online
[15:06:41] Hey Amir1_
[15:12:33] hey halfak
[15:12:36] :)
[15:13:20] So. Want to talk WikiCredit?
[15:21:33] Maybe not.
[15:25:46] :P
[15:31:51] halfak: sorry, something happened
[15:31:58] I was afk and I forgot
[15:32:04] No worries. Back now?
[15:32:05] please ping me
[15:32:07] yes
[15:32:20] or private
[15:32:35] Amir1_, So I was thinking about the parts of WikiCredit that would be easiest to collab on.
[15:32:48] good :)
[15:33:17] Right now, no one is even considering the UI/presentation bits.
[15:33:33] Have you done much work in wsgi/flask before?
[15:33:42] Or a JS framework like knockout?
[15:33:53] Maybe a graphing library like D3?
[15:34:14] I new a little bit flask
[15:34:21] *know
[15:34:36] but mainly I'm not good at UI
[15:35:03] Gotcha.
[15:35:19] my work has always been computing stuff, like bot operation, pywikibot development, etc.
[15:35:51] So, for server-side stuff, we've got two servers that I want to stand up. They perform relatively simple operations.
[15:36:06] DiffEngine: [recentchanges] --> [diffs]
[15:36:28] PersistenceEngine: [diffs] --> [persistence stats]
[15:36:58] ok I see
[15:37:05] Right now, DiffEngine is 90% ready to go. It can keep up with active wikis, but just barely.
[15:37:25] It would be nice if we had something like python's celery behind it.
[15:37:53] The PersistenceEngine hasn't been developed, but I have *a lot* of code that we can work from.
[15:38:04] hmm
[15:38:32] For example: https://github.com/halfak/MediaWiki-Streaming
[15:38:50] This set of streaming processors implements the whole pipeline for Hadoop.
[15:39:22] * Amir1_ is looking
[15:42:12] I should look at it more carefully later
[15:42:42] but AFAIK I can help in these parts
[15:43:49] I'm not familiar with Hadoop but I can learn how the I/O works in hadoop and work with it
[15:46:22] As you can tell from these scripts, the land of "Hadoop Streaming" is quite simple. If you can work with pipes (stdin/stdout) in unix, you can hadoop.
[15:46:39] Have you ever worked with celery before?
[15:46:45] Amir1_, ^
[15:47:10] no, can you give me a link?
[15:47:53] http://www.celeryproject.org/
[15:47:55] ?
[15:50:07] it seems like a rather easy library http://docs.celeryproject.org/en/latest/getting-started/first-steps-with-celery.html
[15:55:48] +1 Amir1_
[15:56:02] Nothing too fancy -- just distributed processing queues
[15:56:53] ok :)
[16:01:01] wbrb
[16:01:04] *brb
[16:02:11] kk
[16:02:19] I'll be around :)
[16:30:04] back
[16:30:26] halfak: What can I do?
[16:30:40] what do you suggest for me?
[16:31:05] One thing that I haven't had time to think about is running the DiffEngine in celery.
[16:31:16] Let me show you some bits of the DiffEngine.
[16:31:22] goo
[16:31:24] *good
[16:31:44] Root of repo: https://github.com/halfak/Difference-Engine
[16:32:24] So, this is the abstraction that I've been working with: https://github.com/halfak/Difference-Engine/blob/master/diffengine/synchronizers/synchronizer.py
[16:33:00] A "synchronizer" takes some feed of data and dumps diffs into an internal store (Mongo at this point, but I think we should switch to Postgres+JSONB)
[16:33:24] It gets at least an "engine", which is a thing that knows how to process new revisions.
[16:33:33] And a "store", which is a thing that knows how to store new diffs.
[16:33:44] I think that a Celery synchronizer is in order.
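
(A minimal sketch of the "pipes in, pipes out" Hadoop Streaming style described above. The record format below is made up for illustration; it is not taken from the MediaWiki-Streaming repository.)

```python
#!/usr/bin/env python3
"""Toy Hadoop Streaming mapper: reads lines from stdin and writes
tab-separated (key, value) pairs to stdout. Hadoop Streaming wires up
these pipes for you; locally you can test with `cat data.tsv | ./mapper.py`."""
import sys

for line in sys.stdin:
    # Assume one revision record per line: "<page_id>\t<rev_id>\t...".
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        continue  # skip malformed lines
    page_id, rev_id = fields[0], fields[1]
    # Emit page_id as the key so all revisions of a page land on one reducer.
    sys.stdout.write("{0}\t{1}\n".format(page_id, rev_id))
```

(A minimal Celery sketch, modeled on the "first steps" tutorial linked above. The broker URL and the `process_revision` task are placeholders and not part of the actual Difference-Engine code.)

```python
# tasks.py -- minimal distributed task queue with Celery
from celery import Celery

app = Celery('diffengine_tasks', broker='redis://localhost:6379/0')

@app.task
def process_revision(page_id, rev_id):
    # In a real DiffEngine worker this is where an "engine" would diff the
    # new revision against the page's prior text and a "store" would save it.
    return {'page_id': page_id, 'rev_id': rev_id, 'status': 'diffed'}

# A producer enqueues work like this:
#   process_revision.delay(12345, 678901)
# and any number of workers (`celery -A tasks worker`) drain the queue.
```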
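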
[16:33:58] So, this line might be an issue: https://github.com/halfak/Difference-Engine/blob/master/diffengine/synchronizers/synchronizer.py#L46
[16:34:08] *line --> method
[16:34:41] That method gets a PageProcessor (maintains diff state) from the store, or it constructs a new one if we have never seen this page before.
[16:37:09] So, we'll probably want to group changes by page.
[16:37:30] E.g. if there's already a processor in the queue, it would be good if we could just give it more revisions to process.
[16:37:41] I'm not quite sure how to think about that yet.
[16:43:04] I need a little time, like a week, to examine the code
[16:43:39] is it okay for you halfak?
[16:43:43] Sure! I've been away from this code-base for a bit, but it's time to pick it back up.
[16:43:54] So I'd appreciate any notes you have about stuff that doesn't make sense.
[16:44:10] I'll do that :)
[17:51:34] hi, halfak, Amir1_, I just saw your discussion and wondered if wikiwho / the API we are working on can contribute to the WikiCredit idea? I have not been following WikiCredit, but it seems to overlap quite a bit
[17:52:34] FaFlo, it already has. :) But seriously though, right now I want to get an efficient diff synchronizer in place so that it is easy to play around with persistence metrics.
[17:52:49] It seems that the current API only generates data upon request.
[17:53:06] I think we need to get good at generating the data before it is requested so that we can have more of a real-time response.
[17:54:50] ah, sure
[17:55:28] "FaFlo, it already has." --> are you using the code? :)
[17:56:17] Not really. Remember me showing you http://pythonhosted.org/deltas/?
[17:56:24] Specifically http://pythonhosted.org/deltas/detection.html#module-deltas.detection.segment_matcher
[17:56:42] It implements a modular version of what you are doing in WikiWho.
[17:56:49] You can define new segmenters and all that.
[17:57:05] ah, cool
[17:57:12] It won't look back more than one revision though.
[17:57:19] i see
[17:57:30] So it can stay true to the way that diff utilities generally work.
[17:57:53] sure, for revision-to-revision comparison that is fine
[17:58:17] Indeed. :) And it works really well for tracking content moves.
[17:58:24] and do you want to do authorship tracking as well?
[17:58:37] Yes, that's done in a second pass because it is far less CPU intensive.
[17:59:08] Tracking the entirety of persistence/authorship is super storage intensive.
[17:59:19] I know :)
[17:59:21] So I am only tracking per-token and per-revision stats.
[17:59:39] And I'd like to re-process the diff dataset whenever we want to update/extend the stats.
[17:59:50] i see
[18:00:06] and so you would run authorship detection on the precomputed diffs for example
[18:00:13] or any other metric/analysis
[18:00:20] +1
[18:00:21] https://meta.wikimedia.org/wiki/File:Content_persistence.system_architecture.diagram.svg
[18:01:30] I was just discussing with Amir1_ how it would be nice if we could horizontally parallelize the diffengine and persistence tracker.
[18:01:48] So, if we start getting behind, we can just stand up some more machines.
[18:03:12] FaFlo, next time you see Emufarmers around, you should say "hi".
[18:03:21] He's really interested in your working API for WikiWho.
[18:03:54] Oh wait.. He is online, but away.
[18:04:20] Maybe if I say his name three times, he'll appear. Emufarmers Emufarmers EMUFARMERS!
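
(One hypothetical way to read the "group changes by page" idea above: batch pending revisions per page so an already-scheduled processor just gets handed more work. Everything below uses made-up names and is a sketch, not code from the Difference-Engine repository.)

```python
from collections import defaultdict, deque

class PageBatcher:
    """Toy illustration of grouping incoming revisions by page so that a
    single PageProcessor-like worker consumes them in order."""

    def __init__(self):
        self.pending = defaultdict(deque)  # page_id -> queued revision ids

    def add(self, page_id, rev_id):
        # If the page already has work queued, this just extends its batch
        # instead of scheduling a brand-new processor for the same page.
        self.pending[page_id].append(rev_id)

    def take_batch(self, page_id):
        # A worker holding the page's diff state drains everything queued
        # for that page in one go.
        batch = list(self.pending[page_id])
        self.pending[page_id].clear()
        return batch
```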
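
(A short usage sketch of the SegmentMatcher mentioned above, adapted from the deltas documentation linked in the log. The import paths and the Operation attributes (name, a1, a2, b1, b2) are assumptions that may differ between versions of the library.)

```python
# Diff two revisions' text with deltas' segment matcher, which detects
# moved segments rather than reporting them as a delete plus an insert.
from deltas import segment_matcher, text_split

a = text_split.tokenize("Apples are red.  Bananas are yellow.")
b = text_split.tokenize("Bananas are yellow.  Apples are red.")

for op in segment_matcher.diff(a, b):
    removed = "".join(a[op.a1:op.a2])
    added = "".join(b[op.b1:op.b2])
    print(op.name, repr(removed), "->", repr(added))
```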
[18:05:23] FaFlo, anyway, I hope to get you some users *now* and work together on getting some powerful/real-time systems behind this work in the long term.
[18:05:39] I'd love to have you review the SegmentMatcher to see if I'm missing anything.
[18:05:44] :)
[18:05:56] It seems to work well in practice. :D
[18:06:43] +1
[18:06:46] sure
[18:07:19] we still have to move this API to a more powerful machine, because big articles tend to cause hiccups
[18:07:31] right now it's just a showcase
[18:08:37] Indeed. There's more capacity in labs these days.
[18:08:43] How much machine do you need?
[18:08:57] CPU/RAM/etc
[18:10:38] We also have some beefy machines that are out of wartantee, but would work fine for testing.
[18:10:47] *warrantee
[18:11:07] *warranty
[18:11:10] * halfak needs more coffee
[18:12:35] hehe, ja, let's see
[18:13:13] I hope soon I will have time to look into that more
[21:02:03] * Emufarmers yawns.