[13:43:57] halfak_: hey, ping me when are around
[13:44:04] *you are
[15:39:06] o/ Amir1
[15:39:21] sorry for the delay. Today is an uber meetings day.
[16:02:44] halfak: I was afk sorry
[16:02:48] okay
[16:03:02] In next meeting :/
[16:03:16] oh okay, tell me when you are done :)
[16:08:21] Amir1, not for hours.
[16:08:30] Message away and I'll respond when I can
[16:08:31] o/ Shilad
[16:08:40] Shilad is the mastermind behind WikiBrainAPI. :)
[16:08:56] http://shilad.github.io/wikibrain/
[16:09:18] Indeed. Happy to answer questions about it.
[16:09:21] https://meta.wikimedia.org/wiki/Grants:IEG/WikiBrainTools
[16:09:28] o/ Shilad :)
[16:09:41] How long until we can hit it with HTTP requests? What will be the first end points made available.
[16:10:11] By the end of the week!
[16:10:14] \o/
[16:10:17] First endpoints are:
[16:10:24] hare, ^
[16:10:25] 1) relatedness scores between words, articles, etc.
[16:10:28] * Amir1 is reading the IEG
[16:10:31] You're going to be interested in this stuff.
[16:10:54] What are wiki brain tools
[16:11:34] More fun AI services for wikis. Shilad listed out one of the endpoints that is coming first (by the end of the week): Semantic Relatedness
[16:11:37] 2) pagerank and distance from articles to arbitrary categories. (can be used for finding pages within a category).
[16:11:45] 3) Some basic geo
[16:11:50] So, you can ask: How related are "orange" and "apple"
[16:11:59] 4) Wikification (add wiki links to free text).
[16:12:22] hare, we can use pagerank for importance prediction
[16:12:24] Exactly. And what articles in EN are related to the EN word "apple?"
[16:15:50] So maybe work recommendation too.
[16:17:03] Nice
[16:17:24] halfak: I've a wip patch for the precached daemon puppet work
[16:17:33] That I can deploy at some point
[16:17:38] Prolly not this week heh
[16:17:55] No worries. I might just go start the job on the web node while you're away ;)
[16:18:04] Sound OK?
[16:18:32] halfak: yup. Run it in a screen and !log it on the labs channel
[16:18:39] !logs are super useful!
[16:18:45] When in doubt, !log
[16:18:47] Will do.
[16:18:51] Should I log for staging too?
[16:18:57] I suppose so.
[16:19:06] DO I just think of logs like a commit message?
[16:19:17] halfak: yeah.
[16:19:23] halfak: but one line only.
[16:19:30] Gotcha
[16:19:54] halfak: but provide as much detail as you can. Exact command line / hostname and reason for it is usually good
[16:20:34] OK. Makes sense.
[16:25:58] hare, no strong need to register wikibrainapi stuff now. I'll be digging into it and bugging you with more details later. :)
[16:30:01] Mmm yay more ai stuff
[16:30:32] YuviPanda, you're gonna be sad though. These guys are 100% java.
[16:31:00] o/ Amir1
[16:31:15] looks like this meeting is slow and I shouldn't be here. I can pay attention almost fully :)
[16:32:24] halfak: heh that's OK. I've come around on java being useful :)
[16:32:38] :)
[16:32:41] Although I'll think 2/3 times before volunteering for puppet support there :)
[16:32:45] If I ever get asked even
[16:33:16] Awight BTW I created a role::aptly for managing our debs :) it is applied on ores-misc-01 and can be used by root.
[16:33:19] Just a fyi
[16:34:47] shweet. You might have noticed I made a task for finishing my homework :)
[16:35:22] awight: I saw you had done all the ones and even started on sklearn!!!
[16:35:23] Awesome
[16:35:24] It's all 90% over the finish line, builds and stuff, but we need to host in Gerrit, then do the footwork of pushing our objects to a repo
[16:35:28] hehe
[16:35:38] sklearn is kicking my butt
[16:35:39] awight: nah let's out it under wiki-ai for now
[16:35:56] If we can't contribute it upstream
[16:36:00] well, just leave notes on that task if there's stuff I can do towards that
[16:36:03] I convinced mwparserfromhell to accept it
[16:36:06] Ok will do
[16:37:30] Yeah I got one bite, jsonpify, but the other author who got back to me was much more measured: oh hell no you aren't making me the maintainer, he said.
[16:38:15] Heh
[16:52:02] I was afk
[16:52:05] halfak: hey
[16:52:15] first of all the pywikibase is ready now :)
[16:52:29] I removed all of unnecessary "site"s
[16:52:45] YuviPanda: I have several things to ask you
[16:52:51] when do you have time to talk?
[16:53:32] Amir1, great! I'll have a look at it and familiarize myself and do some cleanup.
[16:53:49] awesome
[16:54:02] I forgot the second issue
[16:54:08] I will think and then let you know
[16:54:11] :D
[16:56:21] Wow shilad!
[16:57:00] Also, how did Minneapolis end up becoming the Wikipedia research hub?
[16:57:38] Just one of 'em. I guess we got in on the boom of Wikipedia research around 2006.
[16:58:08] I guess we're now a hub since we decided to stick with it and get several papers into exploring problems/opportunities.
[16:58:27] A lot of labs will milk the hot new datasets
[16:58:37] and get a lot of pubs and citations doing it :S
[17:07:37] Amir1: am packing and travelling. Just ask and I'll respond :)
[17:07:43] (Over here)
[17:07:53] * YuviPanda plans Minneapolis visit
[17:08:05] bon voyage
[17:08:07] :)
[17:08:24] :) not till later tonight so ask away
[17:08:38] YuviPanda: let me know when you plan the Minneapolis visit
[17:09:02] one thing that is okay to ask here is "I couldn't find any reasonable python3-based library for mysql, do you know?"
[17:09:03] December or October or November
[17:09:06] Not sure yet
[17:09:10] Amir1: pymysqp
[17:09:16] Err
[17:09:19] Pymysql
[17:09:25] Is pretty nice
[17:09:28] Is it installed in labs?
[17:09:29] And pytyon3
[17:09:34] awesome
[17:09:35] Think so
[17:09:35] thanks
[17:09:39] Also use virtualenv!!!!
[17:09:50] Pymysql is great
[17:10:14] halfak: do you have a timeline for the conference thing?
[17:10:26] * YuviPanda is eager to help and have it marinate in his head
[17:10:32] YuviPanda, our submission is due in sept.
[17:10:33] My next question is strange so I ask you private, is it okay?
[17:10:38] It -> ways I can hel
[17:10:40] *Nov
[17:10:40] Amir1: sure
[17:10:58] Deadline for WCUSA is August 31 :)
[17:11:10] Hmm
[17:11:15] * YuviPanda is still unsure
[17:11:17] Of wcusa
[17:11:21] But probably should
[18:45:43] halfak: i bought up ores deploy on the ops meeting \o/
[18:45:52] (nothing was said, but I'm keeping it on ops' radar to make our lives easier)
[18:45:58] +1
[18:46:05] I've been bringing it up in Scrum of Scrums
[18:46:18] No Ops there last time, so it's good you covered that base.
[18:53:11] halfak: yeah, https://etherpad.wikimedia.org/p/TechOps-2015-08-10 I've been mentioning it
[21:00:01] * halfak is trying to tackle our multilingual issue and the multiword badword detection strategy at the same time.
[21:00:05] It's going Okay
[21:02:57] Nice
[21:03:27] Soon, I want you to be able to specify different language-based features in parallel
[21:04:22] e.g. [english.diff.badwords_added, portuguese.diff.badwords_added]
[21:07:27] and generating new labor hours measurements
[21:58:39] halfak I'm back from PA. Talk to you friday?
[21:58:50] o/ aetilley
[21:58:56] Glad you're back :)
[21:59:14] Me too.
[21:59:14] And sure!
[22:01:29] So... I won't be able to meet at the regular time on Friday.
[22:01:35] aetilley, Amir1, ToAruShiroiNeko ^
[22:01:44] It turns out I'm traveling again.
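The parallel language features halfak sketches above ([english.diff.badwords_added, portuguese.diff.badwords_added]) amount to evaluating the same feature under several language utilities at once. A minimal sketch of the idea in Python; the word lists and function names here are toy stand-ins, not the real revscoring API:

```python
# Toy stand-ins: the real badword lists and revscoring feature API differ.
ENGLISH_BADWORDS = {"butt", "derp"}
PORTUGUESE_BADWORDS = {"caralho", "merda"}

def badwords_added(added_words, badwords):
    """Count words added by a diff that appear in a language's badword list."""
    return sum(1 for word in added_words if word.lower() in badwords)

# One revision's added words, scored under two languages in parallel.
added = ["Butt", "hello", "merda"]
features = {
    "english.diff.badwords_added": badwords_added(added, ENGLISH_BADWORDS),
    "portuguese.diff.badwords_added": badwords_added(added, PORTUGUESE_BADWORDS),
}
```

Each named feature stays independent, so a model for one wiki can pull in only the language utilities it actually needs.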
[22:01:56] Yay travelling
[22:02:09] I could meet a half hour earlier on Thursday.
[22:02:15] YuviPanda, out to copenhagen again.
[22:02:32] halfak: suddenly?
[22:02:33] I'm looking forward to biking around there :)
[22:02:42] Na. I forgot it was coming up though.
[22:02:46] halfak: also woo timezone overlap
[22:02:47] Nice
[22:02:56] Hey. On that note, how do you all schedule meetings? ping the hell out of one another in IRC? I'm trying to circle the wagons to discuss my initial findings and questions re: https://phabricator.wikimedia.org/T106838
[22:02:58] halfak: where is the cxcw conference btw?
[22:03:19] awight: I usually just put that in halfak'w calendar wherever I can find a slot
[22:03:35] Ha. Okay, everyone else optional then ;)
[22:03:39] awight, good Q. I mostly aim for my overlap with reasonable hours in Belium and Iran and then put something on the calendar and invite people to push back on the timing.
[22:03:57] +1 for YuviPanda's strategy
[22:04:37] k
[22:25:24] halfak: 8:30am Thursday works.
[22:25:30] brb
[22:25:51] kk.
[22:25:55] * halfak moves event boldly
[22:27:44] I don't think I have made it to even one of these have i
[22:27:45] Sigh
[22:31:30] YuviPanda, s'ok. We're mostly just sync'ing up.
[22:31:45] Every time someone wants to talk about something interesting, I tell them to wait.
[22:31:47] Yeah
[22:31:54] Oh heh
[22:31:58] Or we'd never get done.
[22:32:05] I don't actually know anything about the code anyway
[22:32:10] Gotta rip of the bandaid when it comes to coordination work.
[22:32:13] :P
[22:32:20] :)
[22:32:27] * halfak is doing dependency-injection judo now.
[22:32:33] Heh
[22:33:01] Generalizations on top of a generalization framework are an exercise in deciding between equivalent abstractions based on how intuitive they are.
[22:33:12] Double-generalizations!
[22:33:21] More like fixing one place that we were being constraining.
[22:33:43] When I'm done, you'll be able to specify a language that behaves however the hell you want.
[22:34:02] And you won't accidentally call words_added on wikidata and get a big pile of derp.
[22:34:19] Heh
[22:34:22] Nice?!
[22:34:30] But re. languages, we need to support WTF because WTF is what languages are.
[22:34:41] Heh :)
[22:35:14] We need to have some form of back pressure system in place before this goes to prod
[22:35:44] I'm thinking just 'abort current request immediately if celery queue is full'
[22:36:17] YuviPanda, that makes sense.
[22:36:35] I'd rather drop individual scoring attempts and make the requester retry.
[22:37:04] BTW YuviPanda https://github.com/wiki-ai/ores/pull/78
[22:37:37] Yeah ocg is a similar architecture but had a lot of problems from lack of backprssure
[22:37:44] acg?
[22:37:47] *ocg?
[22:38:40] * halfak completes first working example of languages-as-a-setoffeatures
[22:40:15] Oooh! I just realized that this will allow our users to not have to link about language dependencies if they don't have to. That's a nice perk.
[22:41:04] Nice
[22:41:16] halfak: offline content generator. PDF and stuff
[22:41:26] Oh!
[22:41:27] Gotcha.
[22:41:43] Lots of requests come in, queues get overwhelmed, everything dies, we lose full wueue
[22:41:45] Queue
[22:41:49] So, by backpressure, you mean a good way to encourage people to slow their requests down?
[22:42:03] By not serving some of them
[22:42:48] We should have something that will work nice with an exponential delay.
[22:43:00] It's tempting to just not serve a request.
[22:43:06] Oh! BTW, we do have that!
[22:43:09] I forgot.
[22:43:31] halfak: no, it is for a way for the back end to say 'heeeey sloowdownnn'
[22:43:34] It seems to be flask. It will just start killing requests once they have waited 30 seconds.
[22:44:08] That's sub optimal, but still backpressure.
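The scheme being discussed here — the service drops a scoring request when the Celery queue is full, and the client retries with an exponential delay — could be sketched as follows. The queue-depth limit and function names are illustrative, not ORES settings; in a real deployment the depth would come from the broker (e.g. LLEN on the Celery list in Redis):

```python
import time

MAX_QUEUE_DEPTH = 100  # illustrative capacity, not a real ORES setting

def should_reject(queue_depth, max_depth=MAX_QUEUE_DEPTH):
    """Server side: fail fast (e.g. HTTP 503 with Retry-After) instead of
    enqueueing more work when the Celery queue is already full."""
    return queue_depth >= max_depth

def score_with_backoff(request_fn, max_retries=5, base_delay=0.5):
    """Client side: retry a dropped request with exponentially growing delays."""
    for attempt in range(max_retries):
        result = request_fn()
        if result != "overloaded":  # stand-in for a 503-style response
            return result
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("service stayed overloaded; giving up")
```

Rejecting at the web layer keeps the narrow Celery/Redis stage from being flooded, which is exactly the pipeline analogy YuviPanda makes below.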
[22:44:18] halfak: the web layer can handle a lot more requests than the celery layer, so if the web layer was at 100% and passed 100% of things back to celery celery will crash
[22:44:22] And so will redis
[22:44:51] If you think of it as a pipeline it is to prevent the wide mouth from taking in so much liquid that the narrow bits later on burst
[22:44:51] YuviPanda, good point. When I was doing those tests, I forced flask to run single-core.
[22:45:13] So, celery's queues could probably hold more.
[22:45:23] And flask was were it broke down.
[22:45:41] Right.
[22:45:55] So, it seems like #1, this should he configurable in celery.
[22:46:10] And uwsgi should handle more requests than flask's built in server
[22:46:19] And #2, we need to get really good at delivering a "SLOW DOWN, but try this one again" message.
[22:46:31] I should open a bug for this as well
[22:46:37] Because we error out for a few reasons that are not "Try this one again plz <3"
[22:46:44] To the client?
[22:46:52] Yeah.
[22:47:03] Yeah
[22:47:27] Most of the time when something fails, it's never going to work so stop trying or it takes WAY too much CPU to process and please don't try again.
[22:47:29] Want me to open a bug for the back pressure thing?
[22:47:32] Or would you?
[22:47:34] Yeah
[22:47:39] Ok
[22:47:57] If you look at my PR, I got the bug
[22:48:13] halfak: standing in line at airport check in, so can't do or
[22:48:14] Pr
[22:48:22] No worries.
[22:48:28] Have a good trip :)
[22:48:30] Over an ocean?
[22:50:17] halfak: berlin
[22:50:29] And then Rome and then berlin
[22:51:13] Have a good one. Sleep well :)
[22:53:25] halfak: I can't quite suss out what WikiBrain is.
[22:53:49] It's going to be an API, like ORES that will give you some cool query paths that only AI can do.
[22:54:07] It will tell you how related two articles are to each other.
[22:54:07] Also, I have several ideas beyond WikiProjects that I want to unite together under some brand.
[22:54:15] Oh shit.
[22:54:21] You can use this to recommend work.
[22:54:30] When is it available?
[22:54:44] Or for browsing topics or for measuring someone's familiarity with a topic by testing them against the machine O.o
[22:54:52] As soon as I can convince Shilad to get an API up.
[22:55:09] He said he thinks that we'll have a couple end points by the end of the week. :)
[22:55:20] Long live the Apey Eye
[22:55:26] He'll also have url to hit to find out the PageRank of an wiki article.
[22:55:44] And another that will let you ask for articles that are *near* another article in the category graph.
[22:56:24] So you can probably find articles that belong in a WikiProject, but aren't labeled by either doing Semantic Relatedness, Link graph or Category Graph searches.
[22:56:42] And then you can use all that to build a bot to write relationships to WikiData.
[22:56:51] Yeah... I'm kind of excited for them to get it online. :)
[22:57:08] Wikimedia Convergence.
[22:57:30] Seriously. I want to add a fourth type of Labs to Labs_labs_labs
[22:57:44] The Research and Development kind. :)
[22:58:09] And this is (fingers crossed) just the beginning :)
[22:58:15] Yesh
[22:58:27] halfak, just now I was thinking about how to brand all this.
[22:58:29] I also hope to get the kubernetes/marathon thing
[22:58:44] We have ORES, part of Wikimedia AI. We have WikiBrain, WikiProject X, and I have ideas down the pipeline for a Wikipedia Bibliography and a WikiSprint tool.
[22:58:46] I think quarry will be running on it inside 6 months. And ores hopefully by end of year
[22:59:01] hare: we need extension:wikiproject
[22:59:06] We have things within WikiProject X as well.
[22:59:31] These all have something in common. And we're all working toward a common goal.
[22:59:49] Oh god why is the Comm Tech team not working on Extension:WikiProject
[22:59:58] And related technologies!?
[23:00:01] Our work constitutes the radical arm of Wikimedia Tech.
[23:00:06] That would be brilliant.
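A client for the relatedness endpoint described above might look like the following. The base URL, path, and parameter names are guesses — the actual WikiBrain HTTP API had not been published at the time of this conversation, so treat every identifier here as hypothetical:

```python
from urllib.parse import urlencode

# Hypothetical endpoint; consult the WikiBrain docs for the real paths.
WIKIBRAIN_BASE = "http://localhost:8080/wikibrain"

def relatedness_url(phrase1, phrase2, lang="en"):
    """Build a query URL asking how related two phrases are,
    e.g. "orange" and "apple" as in the discussion above."""
    query = urlencode({"lang": lang, "phrase1": phrase1, "phrase2": phrase2})
    return "{}/similarity?{}".format(WIKIBRAIN_BASE, query)

url = relatedness_url("orange", "apple")
```

The same pattern would extend to the PageRank and category-distance endpoints mentioned above, each with its own path and parameters.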
[23:00:36] Whenever I hear the word 'radical' I always think of skateboarding.
[23:01:14] halfak: that team is supposed to solve world hunger and bring peace on earth too :)
[23:01:24] Community Tech is focused on smaller fixes at the moment.
[23:01:41] Also I told them they should make their first priority developing the proper information systems to do their job right :)
[23:01:54] That's a good idea.
[23:02:17] What did you have in mind?
[23:02:20] hare, ^
[23:02:34] When they speak of "communities," they need to know exactly what the communities are and who belongs to those communities. The answer is in the database; they need tools for surfacing it.
[23:03:19] I wonder if we have clustering based off of areas of editing
[23:03:29] That feels like it'll show communities
[23:04:15] Ahh yes.
[23:04:32] Haitham did some work like that. He made a network visualization of talk page activity.
[23:04:44] It's a project that inspired a lot of people, but I'm not 100% sure we learned from it.
[23:05:26] It should information on too high a level.
[23:05:28] On the otherhand, if you could identify WikiProject clusters from editing patterns and *browse* it, that would probably teach us a lot.
[23:06:12] Indeed
[23:06:31] Imagine a hadoop cluster that volunteers could access :)
[23:06:49] For Community Tech purposes I think of communities more in terms of functional workflows
[23:06:57] The new page patrollers, the article writers, and so on.
[23:07:04] YuviPanda, Yeah.. With a good query UI on top of it.
[23:07:22] Ahh yes. "roles" in the lit.
[23:07:49] Yeah
[23:07:58] I'm doing a clustering project right now with a CMU student who is using an experimental edit type classifier to cluster editors.
[23:08:05] Like quarry but for hive over content maybe dunno
[23:08:25] It turns out that you can learn a lot by the rate at which people interact with various namespaces.
[23:08:40] E.g. vandal fighters do a lot of 0 and 3. Way more than anyone else.
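The namespace-rate signal halfak describes (vandal fighters editing mostly namespaces 0 and 3) reduces to a simple proportion vector per editor, which a clustering algorithm can then group into roles. A minimal sketch, with made-up edit data:

```python
from collections import Counter

def namespace_rates(edit_namespaces):
    """Proportion of an editor's edits landing in each namespace over the
    observation window (a month, per the discussion above)."""
    counts = Counter(edit_namespaces)
    total = sum(counts.values())
    return {ns: n / total for ns, n in counts.items()}

# A vandal-fighter-like pattern: articles (ns 0) and user talk (ns 3).
rates = namespace_rates([0, 0, 0, 3, 0, 3, 3, 0])
```

Feeding one such vector per editor into, say, k-means would surface groups like the patrollers and article writers hare mentions.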
[23:09:21] But there's a bunch you can see in the edit type classifier play out in the clusters.
[23:09:42] hare, what do you think would be an appropriate timespan to use when trying to identify an editor role?
[23:10:02] We're using a month right now.
[23:10:04] as in, how long must they do a thing?
[23:10:08] Yeah.
[23:11:06] What would the purpose of identifying this role be?
[23:11:32] Oh! We're trying to figure out if there's an optimal combination of roles for improving the quality level of an article. :)
[23:12:05] Also for ores - we should do an ores vs cluebot thing
[23:12:17] Pick a 100 000 edits from a time period
[23:12:19] Yeah. You're right. It should be pretty easy.
[23:12:26] Check how many cb reverted
[23:12:28] Just got to record some cluebot scores.
[23:12:35] Yep
[23:12:38] We need the prediction score too.
[23:12:50] Yesh
[23:12:59] We set our threshold and they set theirs and we can compare if at the same FP rate, can we revert more vandalism.
[23:13:08] Yep
[23:13:20] I think that's important work because people know cluebot
[23:13:30] I've also been calling ores cluebotasaservice
[23:13:38] CBaaS
[23:13:42] SeaBass
[23:13:46] I just read a *peer reviewed* paper on this exact problem that used *ACCURACY* as it's *ONLY* measurement of fitness and showed they could beat CLueBot.
[23:13:57] YuviPanda, fair description.
[23:14:15] They got a 92% accuracy.
[23:14:26] You can get 95% accuracy by never reverting anything!
[23:14:33] Haha
[23:14:34] AHHHH!
[23:14:44] I want to punch all of their reviewers in the face.
[23:15:00] But aren't scientists *supposed* to compare to a null hypothesis?
[23:16:15] Sorry. They claim 90%
[23:16:22] "Moreover, when temporal recency is considered, accuracy goes to almost 90%."
[23:16:27] ALMOST 90%
[23:16:33] * halfak throws arms up in the air.
[23:16:49] So is that one significant figure, or two?
[23:16:53] I think this is a very complicated strawman argument.
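halfak's objection is easy to check: if vandalism is roughly 5% of edits (the base rate implied in the conversation, not a measured figure), the trivial classifier that never flags anything already scores higher accuracy than the paper's claim:

```python
def accuracy(predictions, labels):
    """Fraction of predictions matching the true labels."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

# Assume ~5% of edits are vandalism (1 = vandalism, 0 = good faith).
labels = [1] * 5 + [0] * 95
never_revert = [0] * 100  # the do-nothing "classifier"

baseline = accuracy(never_revert, labels)  # 0.95, beating "almost 90%"
```

This is why the ORES-vs-ClueBot comparison above is framed around false-positive rate and recall at a fixed threshold rather than raw accuracy, which a majority-class baseline trivially games.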
[23:17:18] They say that ClueBot has 85% classification accuracy.
[23:17:28] And then claim to get *almost* 90%
[23:17:44] When you can get 95% accuracy by just saying "not vandalism" all the time.
[23:17:53] Is that a statistically significant improvement?
[23:18:00] 85->90?
[23:18:47] Just eyeballing it, but that 5% of a proportion over 770k revisions, so it should be *super* significant.
[23:18:55] So would 90-95%.
[23:18:57] So I WIN!
[23:19:10] AND NOW THERE'S NO VANDALISM!
[23:19:25] * halfak completes rank and calms down.
[23:19:28] *rant
[23:23:18] Here's where I am at right now. I could sell the Foundation on another six months of WikiProject X, with some enhancements and expansions to non-enwiki projects, but I have a broader vision than that.
[23:24:08] And it seems like my efforts come at the same time as these efforts, and there is synergy between the things.
[23:27:37] hare, I'm not sure what the other options you have are, but if you pursue the extension, I'd highly recommend that the enhancements and expansions are the most important thing you could work on next.
[23:28:03] Adding features to software is so last tech bubble. You're managing a product.
[23:28:28] I don't mean adding features to software; I mean adding products to my portfolio.
[23:28:59] Yeah. You've got 6 months. That's a long time to make a new product.
[23:30:17] It really isn't!
[23:34:56] And I would be spending those six months refining WikiProject X, not making new stuff.
[23:35:46] Should I just create an internal roadmap/strategy and then individually try to sell the Foundation on each phase?
[23:36:12] Why refine WikiProject X though?
[23:36:24] Is that worth a concentrated effort for 6 months?
[23:36:41] Isn't there something better that could help boost WikiProjects.
[23:37:33] My interest is in boosting WikiProjects by having cross-wiki workflows
[23:37:41] And having WikiProjects that are cross-wiki in nature.
[23:37:55] Yeah.
[23:37:58] FUckin' right.
[23:37:59] That
[23:38:17] Make some simple things working in labs.
[23:38:47] get a few people using them and it'll be really easy to make the case to a Product person.
[23:40:53] One thing is for sure -- if your WikiProject X polish doesn't get us more than having you work on something else for the next 6 months, no one wants that.
[23:40:56] :)
[23:43:36] My interest, as always, is addressing an actual need.
[23:44:08] So, the polish that WikiProject X needs is some new machinery
[23:44:28] It's really thick, metal-ish polish
[23:44:35] And it makes it really big and badass.
[23:44:39] There is a need to surface things to do. I know this as an editor. I've been editing for almost 11 years and I am generally clueless as to what I can do. I want to surface to-do lists. Not only that, I want to surface them across projects. We need to make the best of our diversity of content.
[23:45:05] Sounds like you and Danny H should hang out. Do you guys hang out?
[23:45:19] Mostly in the context of Flow, which I've decided needs work.
[23:45:28] Have you seen the Article Request Workshop?
[23:51:31] I haven't.
[23:51:48] But before I forgot, you basically just described the same set of ideas to me that Danny did at Wikimania.
[23:51:50] hare, ^
[23:52:10] Is he making them happen?
[23:52:16] Those are some big ideas and it'd probably take a lot more work than a few people can manage :)
[23:52:24] So, probably not terribly soon.
[23:52:38] Bit I know they are picking them up soon.