[00:00:39] o/ halfak
[00:00:47] I was trying to sleep
[00:00:54] Hey dude. Sorry if I woke you
[00:00:56] :(
[00:01:05] No I couldn't sleep
[00:01:15] so I tried reading some email while I'm in bed
[00:01:20] OK. I have some annoying drama for you then.
[00:01:27] and I ended up with my laptop, writing this :)
[00:01:38] I answered your email :)
[00:01:50] I was waiting for this drama
[00:02:05] OK. I'm going in full force then.
[00:02:06] :)
[00:02:16] especially read our last emails
[00:02:21] ooooodraaame
[00:02:28] Yup.
[00:02:33] I even say your name in the email.
[00:02:35] ;)
[00:02:38] they are reluctant to help and they want credit :D
[00:02:50] * Amir1 gets some popcorn
[00:03:00] waah
[00:03:06] do I want to know what's going on?
[00:03:36] Na.
[00:03:42] I'll give you a TL;DR though
[00:03:43] YuviPanda: hey :)
[00:03:55] heh
[00:04:20] also about Arthur's clustering. The last column is reverted status, so I only check revisions that are reverted. Right halfak?
[00:04:44] I made this mistake initially and you corrected me :)
[00:05:04] Martin and Stefan are researchers who have been looking at vandalism in Wikidata. Lydia introduced us to each other. We had a couple of VOIP conversations. No code overlap. I'm honestly not sure if they are making progress. They saw the post about us releasing a model for Wikidata and were mad that they were not mentioned as contributors.
[00:05:12] Amir1, oh yeah!
[00:05:39] YuviPanda, ^ see TL;DR
[00:05:45] :D
[00:06:01] aaaah
[00:06:11] sounds like academia
[00:07:20] they were not making progress
[00:07:25] they weren't even working
[00:07:33] honestly
[00:09:22] I hate the US Department of State more than anything in this world
[00:09:25] right now
[00:09:42] "There are no available appointments at this time."
[00:10:55] Amir1, if you could get an appointment now, would it work?
[00:11:24] not right now, because I have to travel abroad
[00:11:28] but yeah
[00:11:32] the sooner the better
[00:11:41] Gotcha
[00:11:49] o/ aetilley
[00:11:53] hey :)
[00:12:15] The gods of chance have woken up Amir1 in the middle of the night, so we have an answer to your Q
[00:12:35] :P
[00:13:39] Yes I read your comment and replied.
[00:13:43] Amir1: ^
[00:14:25] and I'm waiting for halfak to answer that
[00:14:27] Was the task to run SigClust on edits generally or just those with reverted status?
[00:14:34] ok
[00:14:46] it was intended to work on reverted edits
[00:14:52] not all edits in general
[00:14:59] Also wasn't the last column damaging vs. not damaging?
[00:15:00] but since halfak is around
[00:15:30] I thought it would be good if he could confirm, maybe I was wrong the whole time
[00:15:46] I think both analyses are going to be good.
[00:15:48] Well either way I can easily change to that direction.
[00:15:51] For right now, this is exploratory.
[00:16:08] We were originally looking for evidence of good-faith and bad-faith reverts in the pool of reverted edits.
[00:16:41] It's quite possible that we also have different types of damage in the damage==True condition
[00:16:51] 3:46 AM here, won't sleep at all :(
[00:16:56] :D
[00:17:09] But it will be good to know if the clusters that overlap damage==True also show up in damage==False
[00:17:40] Well Amir, let me run SigClust on just the dam=True edits and see what happens.
[00:17:53] Here's what I propose. Follow SigClust where it leads you for both the entire set and just dam=True.
[00:18:03] ok
[00:18:05] Then try to see if you can find matching clusters
[00:18:09] If yes, interesting.
[00:18:12] If no, interesting
[00:18:19] ^ AKA the Academic win
[00:18:33] "matching"?
[00:18:42] Yeah. I'd use a set similarity function
[00:18:47] Like Jakart coef
[00:18:49] gotta run
[00:18:50] o/
[00:18:55] o/
[00:19:04] ok
[00:46:46] Amir1|afk: I'm having trouble finding anything called a Jakart coefficient
[00:47:24] Also do you know which sets Aaron was referring to above to compare by a similarity function? The pair of clusters?
[00:48:18] aetilley, https://en.wikipedia.org/wiki/Jaccard_index
[00:48:21] * halfak runs away again
[00:48:27] Spelling is hard
[00:54:23] Cool thanks
[00:55:33] Aaron did you want to apply this to the pair of clusters?
[00:55:40] halfak: ^
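(Editor's aside: a minimal sketch of the cluster-matching idea halfak proposes above -- compare a cluster from the full-set SigClust run against a cluster from the dam=True run using the Jaccard coefficient |A ∩ B| / |A ∪ B|. The revision IDs and cluster contents below are made up for illustration.)

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical clusters of revision IDs from the two SigClust runs.
cluster_from_all_edits = {1001, 1002, 1003, 1004, 1005}
cluster_from_dam_true = {1003, 1004, 1005, 1006}

print(jaccard(cluster_from_all_edits, cluster_from_dam_true))  # 0.5
```

Cluster pairs with a high coefficient would count as "matching"; as noted above, finding matches is interesting, and so is not finding them.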
[05:22:54] halfak so what else needs to be done to start the labels campaign for Dutch Wikipedia?
[05:23:29] the only item left on the ticket is loading it into labels.wmflabs.org
[05:23:58] people are kind of excited :3
[14:02:43] o/ joal
[14:11:06] halfak: hey
[14:11:08] :)
[14:11:14] have you seen the anwser?
[14:11:17] *answer
[15:04:28] Amir1, hey
[15:04:30] Yes I did.
[15:04:36] This is going to get heated
[15:04:41] I'm not worried.
[15:05:55] Honestly we can say to them: you didn't help us even though we asked you to (see Amir's emails)
[15:06:16] if you want to help, just do it and then claim credit
[15:07:45] Also, we didn't agree that you would necessarily get 1st author and that we would write just a section of some research paper.
[15:07:55] *you --> Martin and Stefan
[15:08:13] I'm pretty sure that you (Amir) are going to get 1st author on what we do write :)
[15:08:53] agreed
[15:09:01] :)
[15:09:39] I haven't started yet
[15:09:54] I will do some writing once I'm done with my TOEFL
[15:10:19] halfak: can you give me a hint on how to start?
[15:10:59] If we do it my way (which I'd argue is highly optimized for sanity) we'll do an outline first and talk through the story that we want to tell.
[15:11:24] Did we do classification in a new space? Are we part of a larger system? What makes this work interesting to a scholarly audience?
[15:11:39] Once we know the story we are telling, we flesh out the outline with the high points we want to hit.
[15:11:57] Iterate until we both like the outline...
[15:12:09] Then we flesh out prose around the outline.
[15:12:48] Usually I just do this once, but then follow it up with a sequence of cleanup passes -- reading the manuscript and re-writing sentence-level things.
[15:12:54] Then we'll have a paper.
[15:13:16] So, back to the beginning... the first thing we need to do is figure out what stories we might tell about what we did.
[15:13:54] I think a good strategy for knowing what stories would be interesting is to pick up the recent literature and find some dominant or missing conversation we'd like to reinforce.
[15:14:15] E.g. A.G. West has the conversation about the security of open systems like Wikipedia.
[15:16:04] what do you mean by security?
[15:16:18] security as in data not being corrupted
[15:16:32] or security in the sense of Ops
[15:21:11] West interprets security broadly and talks about Wikipedia as one big attack surface.
[15:21:37] STiki (and AI/human-computation generally) is a strategy for managing such a large attach surface.
[15:21:43] *attack surface
[15:22:06] This is just a way of talking about what we are doing.
[15:22:08] A lens.
[15:22:16] There are many lenses we can choose from.
[15:22:23] They will enable us to tell different stories.
[15:27:04] I know there are systems to catch attacks by using anomaly detection
[15:27:16] Arguably, that's what we are doing.
[15:27:18] :)
[15:30:34] Amir1, we're working on a structured knowledge base too. There are people who will be interested in the implications of broken structured information.
[15:31:31] So, we're not just preventing damage but maintaining structural integrity. We can do analyses of the resiliency of such integrity to damage before and after the introduction of ORES and a standard set of tools that take advantage of it.
[15:31:43] We can also show how one can use ORES to detect good-faith newcomers.
[15:31:49] So much good stuff.
[15:32:09] hmmm, I will think about your suggestions and I will try to figure out the best strategy for it
[15:32:14] Wikidata's newcomer survival rates are falling ATM. Not as bad as other wikis though.
[19:37:23] halfak: yay on the service request ticket
[19:39:09] :D
[19:39:17] I'm just finishing up my redis queue size hack.
[19:39:30] It is icky :D
[19:50:13] YuviPanda, can you take a look at https://github.com/wiki-ai/ores/pull/102 when you have a chance.
[19:53:55] halfak: left comments
[19:54:27] Thanks. Working through them :)
[19:56:46] halfak: re: the llen, leaving a comment around it saying 'hey this is for backpressure but celery does not have this natively so we gotta do this instead, sucks'
[19:56:49] might be nice
[19:58:58] kk
[19:59:07] This is a pain in the ass to test
[19:59:50] halfak: yeah :(
[19:59:59] halfak: hmm, does it work if you just don't run any celery workers?
[20:00:20] Not sure how I can do that *and* have a queue at all
[20:00:27] ah interesting
[20:02:25] OK. Cool. When I run pre-cached against it, the queue stays at maxsize items.
[20:02:39] I just put a time.sleep(50) in the scoring task
[20:02:44] ah nice
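(Editor's aside: a minimal sketch of the backpressure hack under discussion -- not the actual code in PR 102. Celery doesn't expose queue length natively, so the web frontend asks redis directly with LLEN and sheds load with a 503 once the queue holds maxsize tasks. The names and values here are assumptions.)

```python
import redis

QUEUE_NAME = "celery"  # Celery's default queue name -- an internal detail, see below
MAXSIZE = 100          # assumed config value for the queue cap

conn = redis.StrictRedis(host="localhost", port=6379, db=0)

def queue_full():
    # The Celery queue is backed by a redis list; LLEN returns its length.
    return conn.llen(QUEUE_NAME) >= MAXSIZE

# In the web frontend, roughly:
#   if queue_full():
#       return "Server overloaded", 503  # reject instead of queueing forever
#   score_task.delay(rev_id)
```

The time.sleep(50) trick above tests exactly this: with the workers stalled, the queue climbs to MAXSIZE and stops there.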
[20:23:24] halfak: doesn't urllib.parse split out the redis url?
[20:23:46] I posted the output. I'd still need to parse the path and the netloc.
[20:24:27] It'll turn 5 lines of code into 10
[20:24:34] But it can be done that way if you'd prefer.
[20:24:48] urlparse will be more robust for the splitting it does
[20:24:53] I understand that.
[20:24:55] :)
[20:25:26] I'm always nervous of regexes but since this isn't actually dealing with parsing user-provided input it doesn't matter :)
[20:25:58] Just whoever wrote the config file
[20:26:05] Darn sysop users ;)
[20:26:13] yeah
[20:26:15] :D
[20:26:54] halfak: ok if we can't find the name of the celery queue then putting a big warning around it (since we're using an internal implementation detail!) would be nice
[20:27:09] halfak: imagine they change the default queue name in a future version, and we upgrade and bam, everything goes boooooom
[20:27:22] This is why I didn't want to hack it into ORES :P
[20:27:30] yeah
[20:27:44] but between that and getting a page at 3am :D
[20:28:09] Yes.
[20:28:14] I'm with you.
[20:28:22] I hope we can convince upstream to take it
[20:28:29] or at least expose the name they use
[20:28:36] Also, it took me forever to realize that the default queue was called "celery" because they call it "default" a few times in the docs.
[20:28:40] So it has changed before.
[20:28:45] right
[20:28:45] Seriously
[20:29:00] are we the first to deal with backpressure seriously?
[20:29:04] You can find me cursing this out somewhere in the public logs ;)
[20:29:11] YuviPanda, I know, right?
[20:29:12] ok, OCG still doesn't have its backpressure issue fixed and falls over now and then :)
[20:29:16] so maybe not? I dunno
[20:29:22] err, so maybe? dunno
[20:29:24] OCG?
[20:29:30] Offline Content Generator
[20:29:30] Open Content Gangster
[20:29:35] Oh!
[20:29:43] the pdf / epub / etc generator we use in production
[20:29:56] nodejs app, similar-ish architecture to ores (web frontend + queueing backend)
[20:29:57] in nodejs
[20:30:20] it used to fall over every time there was a large burst of requests
[20:30:27] queue would overload and nothing would complete
[20:48:26] ARG! netloc is @host:port
[20:48:31] So I guess I need to regex that
[20:50:29] Basically, this is splitting my string on "://" and then "/"
[20:51:48] ah regexes :D
[20:51:55] halfak: this is one of my favorite things about perl6!
[20:52:06] I'd write a real parser for this and it'd be as many lines as this regex
[20:52:29] YuviPanda, we shouldn't have to do any of that
[20:52:47] yeah I know, just a wild tangent :)
[20:53:10] plus I seem to have become one of those people who go 'If only this was in this (wildly-impractical-for-current-situation language)'
[20:55:39] OK. Now I have a really complicated function that still has a regex.
[21:01:15] halfak: \o/
[21:14:42] YuviPanda, ^
[21:14:49] At least we won't be bitten by renames now
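(Editor's aside: a rough sketch of the redis URL parsing discussed above, not the actual ORES function. urlparse can in fact pick the netloc apart itself via .hostname/.port/.password, so only the path needs extra handling; going beyond that is where the leftover regex came in. The fallback defaults are assumptions.)

```python
from urllib.parse import urlparse

def parse_redis_url(url):
    # e.g. "redis://:secret@localhost:6379/2"
    parts = urlparse(url)
    return dict(
        host=parts.hostname or "localhost",   # from the "[password@]host[:port]" netloc
        port=parts.port or 6379,
        password=parts.password,              # the bit before "@", if any
        db=int(parts.path.lstrip("/") or 0),  # path "/2" -> db 2
    )

print(parse_redis_url("redis://:secret@localhost:6379/2"))
# {'host': 'localhost', 'port': 6379, 'password': 'secret', 'db': 2}
```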
[21:24:09] halfak: I'm slowly heading to the office, but LGTM :)
[21:24:13] and yay on the smaller regex
[21:49:12] halfak: http://dumps.wikimedia.org/other/cirrussearch/20151102/ JSON dumps of wiki contents
[21:49:29] not historical, only current, but they can be imported into Elasticsearch directly for fun things
[21:50:21] YuviPanda, know where I can find a schema?
[21:50:58] * halfak hammers his laptop with a real-world full queue situation
[21:51:05] Holy crap that was frustrating.
[21:51:13] I usually don't use all of my CPU and then try to do things.
[21:51:14] Ha
[21:51:23] So, it works.
[21:51:42] It'll keep the queue in check if the workers (my little laptop) can't keep up.
[21:51:47] Some requests get a 503
[21:51:52] Others go through.
[21:55:59] * halfak sends the change to staging.
[21:56:15] I'll just let it sit there for a while while I precache the bejesus out of it.
[21:57:21] halfak: nice!
[21:57:36] halfak: you should check the current queue length in staging and prod too before setting a config variable
[21:57:44] halfak: I'll ask them for the schema
[21:57:50] YuviPanda, always zero
[21:57:59] *nearly always zero
[21:58:10] We can run through a queue of 100 tasks in ~5 seconds.
[21:58:32] That's the longest someone should wait for their task to *start* executing.
[21:58:34] nice
[21:58:37] yay
[21:58:40] * halfak did the math
[21:59:03] of course :D
[21:59:17] halfak: at some point when you're here I need to sit down with you and learn all your good habits :)
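(Editor's aside: a back-of-the-envelope check of the math above. The numbers come from the log; the variable names and the assumption that the queue cap equals the measured batch are mine.)

```python
tasks_drained = 100  # observed: ~100 tasks processed...
drain_time = 5.0     # ...in ~5 seconds
maxsize = 100        # assumed queue cap from the config

throughput = tasks_drained / drain_time  # ~20 tasks/second
worst_case_wait = maxsize / throughput   # ~5 seconds before a task *starts*
print(f"{throughput:.0f} tasks/s; worst-case queueing delay ~{worst_case_wait:.0f}s")
```

Since requests get a 503 once the queue sits at maxsize, ~5 seconds is the longest any accepted task should wait to start executing.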