[17:19:59] * leila|thewebconf waves to the channel
[18:12:25] isaacj, j-mo: Do you know if claudia is on IRC?
[18:12:52] she is cwylo
[18:12:59] hello!
[18:13:10] who dares summon me etc. etc.
[18:13:25] halfak dares summon you, apparently
[18:13:37] \o/
[18:13:43] * halfak give you a voice flag
[18:13:56] Oh wait. We need to set up a "cloak" first.
[18:14:08] Do you mind doing a little bit of IRC magic with me for a bit?
[18:14:09] oh shoot totally forgot to mention this during the call, BUT: later today I'm giving a short presentation on disinformation on Reddit and Discord
[18:14:43] https://meta.wikimedia.org/wiki/IRC/Cloaks
[18:16:38] cwylo, what are the details of the presentation? Can we watch live?
[18:16:41] thanks, cloak requested
[18:16:46] Nice!
[18:16:58] +1 to halfak's request
[18:17:04] Let me know when you get the cloak and I'll flag you as "someone who knows what's up in this channel".
[18:17:05] Not a public presentation, but I'm happy to share slides + notes afterwards
[18:17:13] cool
[18:17:20] * halfak --> lunch
[18:34:33] cwylo Ima interested in yur slides and notes
[18:45:42] presentation will be later today so hopefully I'll be able to upload it somewhere this week
[18:48:27] The research team works on a task recommendation system, doesn't it?
[18:48:50] I'd be interested in learning more how exactly it works, or if there are any plans to productize it or incorporate it into other WMF products
[18:49:10] j-mo, bmansurov: is quicksurvey for reader trust live?
[18:51:39] harej: yup. an overview here is good?
[18:51:45] Sure
[18:52:19] harej: most of the task recommendation is happening under the "address knowledge gap" program.
[18:52:51] harej: the heavy focus right now is on Wikipedia, we have done some work in Wikidata. The future years can include other projects. We have at least specifically discussed Commons.
[18:53:40] harej: on WP, we do research on two types of task recommendations.
(borrowing the expression from Bob): we do vertical expansion recommendations and horizontal ones.
[18:53:55] harej: horizontal focuses on article creation, vertical on article expansion.
[18:55:13] harej: in article creation, we find missing articles in a given WP language, prioritize them, and recommend them. There is, for example, recommendation-api https://www.mediawiki.org/wiki/GapFinder/Developers
[18:55:53] harej: the focus of this research at the moment is to identify missing articles by looking inside the wikimedia world only. we don't look at external repositories of knowledge (future work)
[18:56:47] harej: recommendation API is used in Content Translation and some of the other community tools today. There are plans to productionize it, and you can see the list of tasks and what's left at https://phabricator.wikimedia.org/T148129
[18:57:26] bmansurov: if you have a sample of APIs you want to share with harej re article creation, go for it.
[18:57:51] harej: bmansurov is working heavily on improving the article creation recommendation API. he can tell you more if you're interested.
[18:58:05] * bmansurov is reading
[18:58:19] Research: User reporting systems now exists! https://meta.wikimedia.org/wiki/Research:User_reporting_systems
[18:58:43] leila|thewebconf: I'm wondering specifically what it does to organize topics. Is there a system that maps relationships between topics? Like "you are interested in African artists, maybe you are interested in Asian artists too"
[18:58:58] Or is this more a big flat list?
[18:59:24] harej: it's not a big flat list because you can provide a seed to the API
[18:59:39] The other day the scoring platform team discussed ideas for building topic networks based on article contents and I'm wondering how that intersects with your task recommendation work.
[18:59:58] harej: at the moment, you can provide a seed and the seed uses morelike to find other similar articles.
bmansurov is working to expand the morelike feature and improve it.
[19:00:33] harej: my understanding is that scoring platform is working on topics based on wikiprojects. is that correct?
[19:00:34] harej: this is the only productionized API so far: https://en.wikipedia.org/api/rest_v1/#!/Recommendation/get_data_recommendation_article_creation_translation_from_lang_seed_article
[19:00:42] more coming
[19:02:17] leila|thewebconf we tested it on the Beta cluster yesterday. scheduled to go live on Monday
[19:02:21] harej: in the absence of an operationalizable topical model (which is a big need for us, too), the way we deal with the problem of capturing user interests is through morelike.
[19:02:59] j-mo: great. so the issues that were identified yesterday are fixed now. great.
[19:03:50] leila: it looks like it may not show up on mobile? but we decided that for this first round, we would accept that risk, b/c wanted to get data before EOQ
[19:03:50] harej: while having a topical model is important for some applications, in edit task recommendation it is less important at the moment, given that there is a /lot/ of content missing.
[19:04:17] harej: for the re-routing of content for patrolling, of course, topical models are important.
[19:05:25] Right. We need topical models and it sounds like you haven't gotten there yet. I wanted to see if there was any overlap between existing work and halfak's investigations into doc2vec.
[19:05:44] Can you tell me more about 'morelike'?
[19:18:04] ^ leila
[19:18:10] bmansurov: do you want to take the morelike question?
[19:18:19] harej: what is the usecase of topical model for you?
[19:19:48] just as a sidenote, i find this all really fascinating. at just a really high-level, you've already got choices to make around what model to use (e.g., topic modeling, doc2vec, many others) and what data source(s) to use (e.g., article content, link graph, categories, user navigation).
probably should be a page somewhere on machine representations of Wikipedia with anecdotes/intuition on where they work well or fail
[19:20:27] In scoring platform's case, being able to direct draft articles to WikiProjects or to some cohort of people who would probably be interested in reviewing the draft. I've also been talking to Marshall Miller on the growth team about the onboarding process, including a new form newcomers fill out where they describe what they're interested in. Right now they use a fixed list of 27 topics but in the future it could be more sophisticated.
[19:21:38] harej: yup re scoring platform usecase.
[19:22:27] morelike API will help you find articles similar to Book (for enwiki) that are absent from enwiki
[19:22:41] harej: re onboarding, and Marshall and I have already talked about this, the topical models may not be a good answer for the long run. basically, you want something like the Netflix approach: show people n canonical videos when they join the website, and based on that, learn about their interests.
[19:23:04] It finds Book's Wikidata ID and looks up similar articles on other Wikipedias that are absent from enwiki.
[19:23:17] It uses the search team's morelike API
[19:23:48] harej: in the case of Wikipedia, we tried the n canonical WP articles, and we couldn't divide up the user interest space enough unless we would show a very long list of articles.
[19:24:17] And for ranking, it uses this paper: https://arxiv.org/abs/1604.03235
[19:24:25] harej: the solution we came up with was a questionnaire, where the user answers 20 "A or B" type questions and the answer to those will give us enough info to divide up the space: https://meta.wikimedia.org/wiki/Research:Voice_and_exit_in_a_voluntary_work_environment/Elicit_new_editor_interests
[19:25:47] harej: if we can divide the interest space, then we can use that information to do relevant recommendations for newcomers. For those who are around, no need for this system.
The edit history is already very telling, or you can ask the editor to provide a seed that tells you what they're interested in and you use morelike to get relevant tasks to them.
[19:30:42] dsaez: following up on Scholia from a private conversation (https://tools.wmflabs.org/scholia/), ORCID is one part of where the data is fetched from.
[19:31:27] got it
[19:31:29] dsaez: so it's not about publications about Wikipedia or cited on Wikipedia. it's about the sum of all bibliographic knowledge.
[19:31:57] I see, so publications in Wikidata
[19:31:59] dsaez: while we're at it, go populate your ORCID. ;)
[19:32:12] I see
[19:32:36] * leila checks her own WD. She's pretty sure everything in her case is coming from ORCID.
[19:33:45] dsaez: confirmed. In my case it's all from ORCID, and that's why you don't see my back-in-the-day publications. I never connected them. ;)
[19:33:59] I see
[19:34:04] Good to know
[19:34:17] ow, and of course DBLP
[19:36:13] dsaez: wikicite is amazing. I'm amazed by the doors it opens in terms of increasing transparency, and also usefulness of the bibliographic data which is now locked up.
[19:37:00] dsaez: I can't wait for the day to be able to search all bibliographic data to know who has funded how much research on Sugar, and what else they have funded in the meantime. :)
[19:37:15] Dblp? Hmm i have more than one paper there :)
[19:37:57] dsaez: I take DBLP back because some of my older stuff are there and they aren't shown in scholia. ORCID.
[19:47:17] yeah, i find it confusing. dsaez and i both have a single paper in scholia but none in ORCID and (at least for me, a much more complete record in DBLP). as far as i can tell then, two of my papers are in wikidata but only one actually links to my wikidata item and that's the one that shows up in scholia
[19:52:13] isaacj: I see. I see. /me reads as it may only be reading from Wikidata (which goes back to: bring more papers via wikicite)
[20:01:17] leila: yep yep, full circle!
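The seed-based lookup discussed earlier in this log (give a seed article, use the search team's morelike feature to find similar articles, keep the ones absent from the target wiki) could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the recommendation API's implementation: it assumes the public MediaWiki Action API search endpoint with the CirrusSearch `morelike:` keyword, the function names `morelike_search_url` and `missing_in_target` are hypothetical, and the `candidates` input shape stands in for the Wikidata-based lookup that the log mentions but does not detail. Ranking (the arXiv paper linked above) is not covered.

```python
from typing import Dict, List, Set
from urllib.parse import urlencode


def morelike_search_url(seed_title: str, lang: str = "en", limit: int = 10) -> str:
    """Build a search URL asking CirrusSearch for pages similar to seed_title.

    Assumption: the wiki is reachable at https://<lang>.wikipedia.org/w/api.php.
    """
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"morelike:{seed_title}",  # CirrusSearch similarity keyword
        "srlimit": limit,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


def missing_in_target(candidates: Dict[str, Set[str]], target_lang: str) -> List[str]:
    """Return candidate titles that have no article in target_lang.

    candidates maps a title to the set of language codes in which that
    article exists (hypothetical shape, e.g. derived from Wikidata sitelinks).
    """
    return sorted(
        title for title, langs in candidates.items() if target_lang not in langs
    )


if __name__ == "__main__":
    print(morelike_search_url("Book", lang="fr", limit=5))
    print(missing_in_target({"Livre": {"fr"}, "Buch": {"de", "en"}}, "en"))
```

The two steps are kept separate so that the similarity source (here morelike) could be swapped for a topical model later without touching the "absent from the target wiki" filter.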
[21:26:57] oh no j-mo is out! well! @isaacj and halfak here's the link to my presentation + notes https://docs.google.com/presentation/d/1XzoJlN60QPwbLlknIqmKcvaSIgJGdnvnoVCo-CmSI9E/edit?usp=sharing
[21:32:33] cwylo: awesome thanks!
[22:37:24] cwylo: nice to meet you.
[22:37:33] cwylo: I had a quick pass over the slides. Thanks for sharing them.
[22:38:22] * cwylo waves
[22:38:29] cwylo: question: is the focus of the talk disinformation in the context of harassment specifically?
[22:39:04] ehhh lil bit of column a, little bit of column b. Hard to talk about one without the other because of the makeup of said toxic communities
[22:39:30] as in, they spread disinformation but also engage in harassment, targeted disruption (trying specifically not to just say "trolling" because it's so broad as to lose all meaning IMO)
[22:39:45] and tend to organize around a single issue
[22:39:49] cwylo: I see. cuz I imagine disinformation can occur in subtle ways and doesn't necessarily result in harassment.
[22:40:16] cwylo: got you. so you're focusing on the space where the two interact with each other, in this talk. understood.
[22:40:17] thanks
[22:40:19] Oh, for sure, and those kinds of dogwhistles and coded language are things (some) mods know to watch for
[22:41:30] cwylo: reading that slide deck made me think of my experience with brocialist twitter and how their activity might not be misinformation per se, but is definitely built around cultivating a particular narrative and shaming/dismissing anyone who asserts otherwise
[22:42:04] yeah. Massanari's term, toxic technocultures, is super helpful for talking about those spaces
[22:46:30] cwylo: you may have already seen this line of research, but just in case: Kate Starbird does research in the space of disinformation on social media. She and her team have looked at specific recent events in the U.S.
and have studied how organized campaigns can enter the activist circles and through a series of concerted events change the way these groups work (or very massively divert them from their direction)
[22:46:32] https://scholar.google.com/citations?hl=en&user=C6KSF5gAAAAJ&view_op=list_works&sortby=pubdate
[22:47:50] interesting, thanks for the link! I gave it for a lab focused on disinfo, but I generally do work on volunteer moderators (hence my focus on those last two slides)
[22:48:09] and at WMF I'm a design researcher w/ the anti-harassment tools team
[22:48:18] so yeah will definitely pass this around
[22:55:46] cwylo: very nice. I trust that you are already aware of the recent research in the area of anti-harassment WMF has done with academia+industry. If you're not, I highly recommend you have a chat with Dario about it. And, I'm sure our paths will cross. :)
[22:56:06] specifically, Detox?
[22:57:15] idk I got a lot of materials thrown at me to read when onboarding, heh
[23:08:52] cwylo: ohoh. I'm sorry to hear that. it happens to all of us. ;)
[23:09:27] cwylo: Detox was the earlier research but then there is more research after that which would be great to be aware of.
[23:10:00] Let me grab links. but again: Dario can tell you a 30-min version of it, and you can go from there.
[23:11:03] cwylo: I think listening to June 2018's research showcase may be a good starting point. Check: https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#June_2018
[23:31:59] bmansurov: please take a look at my comments here: https://phabricator.wikimedia.org/T210757
[23:32:49] bmansurov: we want to use hadoop to generate datasets meant for consumption by production services not the stats boxes, you can ping us and we can help you create oozie/spark jobs.
[23:33:01] bmansurov: let me know if this makes sense.