[17:15:28] leila: o/ here are some results from yesterday's discussion: https://github.com/wikimedia/research-translation-recommendation-models/blob/master/wikidata_item_similarity.ipynb
[17:16:08] leila: they look promising, but we need language-specific stop words for this to work well.
[17:20:01] * leila makes a note to check the page bmansurov gave in 2 hours. she will report back.
[17:20:56] In the meantime, I just learned that our ex-colleague Kaity Hammerstein is working on http://publiceditor.io/ which is a really neat project that can change the future of fact-checking and verification on the web.
[17:33:17] leila, dsaez: o/ In the synonyms spreadsheet, I see we have 3 sheets per language. What's the difference between them?
[17:33:32] Which ones should I ask users to fill out?
[17:38:42] bmansurov, we have lots of stopwords assets in ORES.
[17:39:10] halfak: great. Do you have the list of languages?
[17:39:20] leila: dsaez: Also, I'm looking at the list of synonyms, and it's obvious that some of the items are not synonyms at all. Do we need experienced editors to label them too? I could go ahead and do the easy ones myself.
[17:39:31] https://github.com/wikimedia/revscoring/tree/master/revscoring/languages
[17:39:58] Most of those languages have stopwords. We have a process for acquiring stopwords for new languages (semi-automated -- requires human review)
[17:40:24] halfak: I see. So over time the list will increase?
[17:40:27] I'd be interested in splitting these language assets from revscoring too. Seems like they would be generally useful.
[17:40:47] Yes. As we get interest from new wiki communities, the first thing we ask for is help generating these language assets.
[17:40:53] halfak: agree. We could also use a plain text format.
[17:41:51] bmansurov, some of the assets are in the form of regex. Others use curated external resources (e.g. enchant dicts) so it is nice to have a python library.
[17:42:05] I see
[17:43:38] We can certainly split the plaintext bits from the complex bits if that would be worthwhile.
[17:43:53] I'd love to make it easier for Wikipedians to curate and expand the lists.
[17:44:50] halfak: yes something like this would be useful: https://github.com/apache/spark/tree/master/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords
[17:46:34] Makes sense. I wonder if we could even automate pulling such data from a wiki page.
[17:47:11] Either way, I agree re. text files. In the mid-term, I'd be happy to review some changes to revscoring to bring this to life.
[17:47:23] In the long-term, I'll try to find time to do it myself.
[17:50:02] halfak: ok, I'll help out with the task
[17:54:05] Thank you!
[17:54:07] * halAFK --> Away
[19:06:32] bmansurov: I'd recommend not labeling it yourself. It's best if experienced editors do it, to avoid potential issues with labels and language norms. :)
[19:07:14] is there a wikitech or related page for the microsurveys that are run on wikipedia? for instance describing what parameters can be considered in the randomization of who receives a link etc.
[19:08:20] isaacj: bmansurov is your first friend there ;) but in the meantime: https://www.mediawiki.org/wiki/Extension:QuickSurveys
[19:09:47] oh awesome, thanks!
[19:21:20] bmansurov: is there any way to target microsurveys to specific articles or is it only project-level random sampling at this point?
[20:08:21] isaacj: yeah pretty much random if I remember it correctly
[20:15:21] bmansurov: thanks but drat. so if we wanted to survey a sample of people who all read the same article, would you say that's a simple change or a much more involved change in quicksurveys?
[20:18:37] isaacj: I don't think it will be a simple change.
[20:19:08] Quicksurveys needs a major change; it was not designed with the requirements you mention in mind.
[20:20:06] bmansurov: mmkay, thanks. i will revisit the idea if it seems especially important but in the meantime, i'll think of alternative strategies then
[23:10:21] HaeB: can you add me back to the weekly research meeting? :D
[23:10:45] oh, did you drop off the event?
[23:10:49] HaeB: I just realized that I don't see it in my calendar anymore (unless it was intentional and all those who didn't attend for some time were kicked out, which is fair;)
[23:11:14] yeah you have to commit to a 50min presentation to get back in again ;)
[23:12:16] if anyone is willing to listen, sure! count me in!
[23:12:20] HaeB: ^
[23:12:42] added you in (from next week on)
[23:12:49] \o/ Thank you!
[23:12:53] ...but now i'm wondering who else got dropped, and when