[17:26:52] bmansurov: thanks for T203263. is it easy to put the article title in the update recommendation API? It can help with eye-balling.
[17:26:52] T203263: Measure translation recommendations against the baseline - https://phabricator.wikimedia.org/T203263
[17:27:02] bmansurov: no worries if it's too much work.
[17:27:25] leila: I can write a script
[17:28:06] bmansurov: it would be great if it's straightforward.
[17:28:21] leila: ok, I'll update the task with article titles
[17:31:12] bmansurov: thanks. the other question I have is: why is it that you pass n=1000 and only get ~120 back? Is there some limit in the code of the old recommendation API on the number of pageviews in the source?
[17:31:57] leila: there was a limit, but I increased it to 1000. I guess it doesn't have enough results, because if you look at ru-uz the number is different.
[17:32:42] dsaez: ^ for context: we want to see, by eye-balling, whether the results of the new recommendation API bmansurov has developed are better than the old one's. In the old one, we would find missing articles in es by comparing it with en (for example) and then sort by pageviews for the article in en. In the new one, there is a model that predicts pageviews in es if the article were created.
[17:34:25] bmansurov: got it. What the limit should be is actually an interesting question on its own. There is so much content missing right now that we decided not to tackle it back then. When you increase the threshold, you provide a larger set for the user to draw from, which can improve the topical matching results, but you're also introducing articles to the pool that have lower predicted pageviews.
[17:34:44] bmansurov: the question is, what is the optimal trade-off here?
[17:35:12] bmansurov: we don't have to address it systematically yet. just for you to have the context and think about it as you change the thresholds.
[17:35:27] leila, I'm not sure what the test is
[17:35:54] are you testing the model's predictions?
[17:35:59] leila: for testing, we should compare what we have now. For presenting results to the API consumers, we can come up with some number down the road.
[17:35:59] dsaez: you have two outputs from two different algorithms for ranking articles that exist in en but are missing in es.
[17:36:07] aha
[17:36:14] dsaez: we're asking you to eye-ball the two outputs and tell us which one is better.
[17:36:45] better in which sense?
[17:36:48] dsaez: per the research in Growing Wikipedia Across Languages (https://arxiv.org/abs/1604.03235) we know that the first API is worse than the second, but it's good to double-check.
[17:36:51] leila: imo a better test would be comparing suggestions for a given article, but that's blocked on storage
[17:37:07] dsaez: good question. relevance of results to Spanish Wikipedia.
[17:37:31] leila: right now, we're basically comparing pageviews over 6 months vs pageviews over 2 days.
[17:37:58] bmansurov: oh! that's good to know. this can skew the results significantly.
[17:38:05] so you want my subjective opinion?
[17:39:00] bmansurov: one thing we should not do is spend a lot of iterations on validation here. We have done the research in the past, and we know pageviews in the source are worse predictors of relevance than predicted pageviews in the destination. All we should do right now is eye-ball the results and make sure we don't see anything badly wrong.
[17:40:14] leila: ok, makes sense
[17:40:17] dsaez: how about this? Check the second API results and see if there are major red flags there. Then check the first API results and see if, in your subjective opinion, the first one is better. If it is, we should discuss.
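For orientation, here is a minimal Python sketch of the two ranking strategies being compared above. Everything in it is illustrative: the function names, the source_pageviews field, and the model.predict interface are hypothetical stand-ins, not the actual recommendation API internals.

```python
def rank_baseline(missing_articles, n=1000):
    """Old API: rank articles missing in the destination wiki by their
    raw pageviews in the source (en) wiki over a recent window."""
    return sorted(missing_articles,
                  key=lambda a: a["source_pageviews"],
                  reverse=True)[:n]


def rank_by_prediction(missing_articles, model, n=1000):
    """New API: rank by the model's predicted pageviews in the
    destination (e.g., es) wiki, were the article to be created."""
    return sorted(missing_articles,
                  key=lambda a: model.predict(a["features"]),
                  reverse=True)[:n]
```

The eye-balling exercise is essentially a side-by-side comparison of the heads of these two lists.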
[17:41:02] leila, bmansurov, what I see in both results is that you are giving importance to time-driven popularity
[17:41:26] for example, the Asian Games seem to be very important
[17:41:57] I would bet that this was trained at a moment when that specific event was popular.
[17:42:26] * leila checks fa
[17:45:16] if you're going to consider popularity a signal of importance, you might want to weight by region. What I see in both of them is a lot of things about Asia (the most populated continent) and the USA (where most of the visits to en.wikipedia come from)
[17:48:24] dsaez: the fa results are better in the second API, even in the current state where the pageviews only capture the past 2 days, which creates major skews.
[17:49:19] dsaez: the first one is giving topics that, although popular in English/the US, are not necessarily important to a broader fawiki audience (at least from Iran) over the long run.
[17:49:27] * leila checks the features used again
[17:49:28] bbl
[17:51:21] dsaez: there are two main features used for predicting the pageviews in the destination language at the moment. bmansurov: correct me if I'm wrong. These are Wikipedia user pageviews for the article (QID) in the top 50 Wikipedia languages (normalized, log rank, raw), and then sitelink counts.
[17:52:18] dsaez: for normalization, we're using Eq. 1 in https://arxiv.org/pdf/1604.03235.pdf
[17:53:00] dsaez: for the details of the pageview feature, check page 3, in the Features section (Page views)
[17:54:00] dsaez: we haven't implemented geo pageviews yet, which could address your point. one thing we noticed back then was that the geo pageviews don't change the results /that/ much. (sitelinks and pageviews per language were the most important predictors)
[17:55:28] bmansurov: (for when you're back) how hard is it to implement the geo pageview features? your matrix will grow by 200 or so columns, so computations can be slower, but not by much, as it will be sparse. I think the most expensive part will be figuring out how to efficiently fetch those numbers per country.
[18:02:59] leila, were those the most important predictors for pageviews?
[18:03:17] * leila in a meeting and will respond in 30 min.
[18:04:25] but pageviews from where? If you ask me what is important for the uz Wikipedia, I would check what the Uzbekistan-based visitors are looking for.
[18:05:23] Anyhow, if your question is which of the two results looks better, my answer is that most of them don't look especially interesting for 'es'
[18:23:12] leila: I haven't thought too much about geo pageviews, but your intuition seems right.
[18:34:46] * leila reads
[18:38:46] dsaez: got it. thanks for checking.
[18:39:38] Amir1: do you have some time for a fa question? :D
[18:40:03] (time to get experienced editors' input)
[18:41:51] leila: sure thing!
[18:42:43] Platonides, Amir1: whenever you have some time, and if you have some interest in checking the recommendation API output, can you check the description of T203263 for en-es and en-fa and let me know what you think about the results of the second API compared to the first one? (in a nutshell, what the API does is: take en as source and call all articles available in en but missing in es or fa "missing". Then it ranks the missing articles by predicted pageviews in the destination language.)
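To make the feature description earlier in this exchange concrete, here is a sketch of the two feature families leila lists: per-language pageview features (raw, log rank, normalized) over the top 50 languages, plus the sitelink count. The actual normalization is Eq. 1 of the paper; the rank-over-article-count form below is only an assumed stand-in, so treat it as illustrative.

```python
import math

def pageview_features(views, ranks, n_ranked):
    """views: {lang: raw pageview count for this QID}
    ranks: {lang: rank of this QID by pageviews within lang}
    n_ranked: {lang: number of ranked articles in lang}
    Returns raw, log-rank, and (assumed) normalized-rank features."""
    feats = {}
    for lang, v in views.items():
        feats[f"{lang}_raw"] = v
        feats[f"{lang}_log_rank"] = math.log(ranks[lang])
        # Stand-in for Eq. 1 in https://arxiv.org/pdf/1604.03235.pdf;
        # the real normalization may differ.
        feats[f"{lang}_normalized"] = ranks[lang] / n_ranked[lang]
    return feats

def sitelink_count(sitelinks):
    """Number of language editions that already have this article."""
    return len(sitelinks)
```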
[18:44:58] Sure
[18:45:18] FYI, I'm going to add page titles to that task soon.
[18:45:24] dsaez: re geo pageviews, what we did back then was to include pageviews for all countries (each country would be a feature). The idea was that if an article in en receives pageviews from many parts of the world, it should probably exist in es if it's missing, while if a topic doesn't have global pageviews/interest, it should get lower priority.
[18:45:31] bmansurov: thanks.
[18:46:15] leila, I see, so global attention is geo-weighted?
[18:46:34] dsaez: the case you're describing re geo pageviews is also interesting, and the model should be able to pick it up if correctly tuned. If a lot of pageviews for an article in en come from Uzbekistan, the article should be prioritized.
[18:47:09] dsaez: in the "ideal" model we would implement, yes. not yet, though.
[18:47:24] Amir1: thanks.
[18:47:41] dsaez: btw, is there a page for the work you're doing with the Ukrainian student?
[18:48:10] dsaez: it's in a related space. I want to dig a bit and see if there are things we can reuse from there, or that they can reuse from here. ;)
[18:48:23] dsaez: and it would be great if you could bring them to IRC. :D
[18:49:19] leila, I say this just from looking at the results. We can discuss whether a popular cricket player should have his page in es.wikipedia... but the popularity of that article is clearly an artefact of the huge population of the countries where cricket is popular
[18:49:52] dsaez: don't you expect that to be normalized to some extent by sitelinks?
[18:49:54] leila, not yet. I have a meeting with one of them next Monday and I'll ask her to create the page; same thing in 2 weeks with the other
[18:50:25] dsaez: if an article in en is super popular in the US and gets a lot of pageviews, but sitelinks show that the article exists only in en, that's already a weak signal that it's not a global topic.
[18:50:42] dsaez: that would be great. Looking forward to reading up on their work. :)
[18:51:05] There is a GitHub repo from their course project, wait
[18:52:25] https://github.com/olekscode/Power2TheWiki
[18:53:05] dsaez: nice. I'll read it soon.
[18:54:35] dsaez: have you looked at our hyperlink prediction work in the past?
[18:56:04] dsaez: does red page mean redlink? or some other notion?
[18:56:37] yes
[18:56:38] dsaez: this is a very nice problem.
[18:57:02] red page => red link
[18:57:17] which hyperlink prediction work?
[18:58:53] dsaez: https://arxiv.org/abs/1512.07258 (it may come in handy when you work with redlinks.)
[19:00:17] bmansurov: I have a wish that we build an API for https://arxiv.org/abs/1512.07258. When we get really bored, let's pick this up. It will be really useful.
[19:00:47] will see
[19:01:12] leila: haven't seen that paper yet, but the title looks interesting.
[19:11:17] leila, dsaez: I've fixed the formatting and added titles. Take a look.
[19:11:23] bmansurov: on it
[19:12:07] bmansurov: was something unusual happening related to Russia in the past 2 days? :D
[19:12:23] bmansurov: there are quite a few Russia-related topics in the second API.
[19:12:42] leila: a famous singer died, for example
[19:12:53] bmansurov: got you.
[19:13:08] leila: and that shows up in the first API: Q1980296
[19:13:26] Amir1: I found a fun example for us. Q795 doesn't have an article in fa. I find this fascinating. :D
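The Q795 case just mentioned turns out, a few lines below, to be a stale-dump artifact. A live check against Wikidata would catch it; wbgetentities is a real Wikidata API module, while the surrounding logic is just a sketch.

```python
import requests

def exists_in_language(qid, lang):
    """Return True if the Wikidata item has a sitelink to the given
    language's Wikipedia, e.g. exists_in_language("Q795", "fa")."""
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbgetentities", "ids": qid,
                "props": "sitelinks", "format": "json"},
    )
    sitelinks = resp.json()["entities"][qid].get("sitelinks", {})
    return f"{lang}wiki" in sitelinks
```

Against live data this returns True for Q795/fa, because (per the discussion below) the fa article was linked on Wikidata in April, after the January dumps the API relies on were taken.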
[19:13:33] and not in the 2nd API (which is expected, as the last date for pageviews was 7/31)
[19:13:43] bmansurov: got you.
[19:14:14] well, Wikidata says otherwise: https://fa.wikipedia.org/wiki/%D8%AD%D9%85%D8%A7%D9%82%D8%AA :P
[19:14:35] wooot
[19:15:22] bmansurov: can you check Q795? the API says it's missing in fa, but there is a sitelink to it in Wikidata
[19:15:59] leila: we have old Wikidata dumps from January, that may be why
[19:16:13] let me check the article creation date
[19:16:21] bmansurov: yup. April 11 was when it was linked on Wikidata
[19:16:30] ok, makes sense
[19:17:59] leila: my analysis: T203263#4549395
[19:27:02] bmansurov: thanks. added input for en-fa
[19:27:12] * leila steps away to find a snack
[19:58:32] bmansurov: this is super exciting. Thanks for all your work on this front. now we have a good model in place to start thinking about the no-source case. \o/
[19:59:02] yeah, results are looking good. yay
[19:59:06] leila: what is the no-source case?
[19:59:33] bmansurov: I'm very curious as to why the Wikidata dumps are not updated more frequently. I wonder how Lydia_WMDE let this go on for this long. ;)
[20:00:01] leila: I meant, the newest dumps are not updated on the Analytics cluster
[20:00:14] bmansurov: the case where the API consumer doesn't have a notion of a source language. They know only the destination language they're interested in, and we have to do the work in the background to show results to them in their language
[20:00:36] leila: aka morelike for missing articles?
[20:00:53] bmansurov: oh! that makes sense re the Wikidata dumps on the Analytics cluster.
[20:01:30] bmansurov: suppose we didn't know the seed to use morelike (that's the more basic case). how would we solve that?
[20:02:03] leila: oh I see. You're talking about the second use case.
[20:02:22] bmansurov: yup.
[20:02:24] leila: we'd just get the top N from different language pairs
[20:02:43] but that's the simplest way
[20:02:55] we can try more sophisticated options too
[20:05:38] bmansurov: agreed. to start, top N from language pairs where the specific language is the destination can work.
[20:06:32] bmansurov: or retraining the models for this case.
[20:07:01] yep
[20:07:55] bmansurov: It's an important iteration of the models, so we should be careful about how we aggregate. (we haven't done this in the past, so I don't actually know off the top of my head what can go wrong/right). What is nice is that it increases the audience of the API significantly. Suddenly we move from "you have to know at least two languages to talk to the API" to "tell us your language (with some limitations early on)"
[20:09:40] makes sense. One problem that immediately comes to mind is that for larger wikis, where many articles are viewed, the normalized ranks of those articles will be low (because there are many articles) compared to smaller wikis, where there aren't many. Thus, comparing normalized ranks from two different models may not make sense.
[20:10:15] bmansurov: agreed.
[21:08:51] HaeB: is there a product list in place of wmfproduct@lists.wikimedia.org ?
[21:09:26] or anyone else who can help re ^
[21:21:15] * leila gets ready for the overhaul of the documentation at https://meta.wikimedia.org/wiki/Research:Characterizing_Wikipedia_Reader_Behaviour
[21:48:35] leila: https://office.wikimedia.org/wiki/Mailing_lists
[21:55:00] HaeB: thanks!
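Here is a sketch of the simplest no-source aggregation discussed above: take the top N from every model whose destination is the user's language and merge. All names and interfaces are hypothetical; the inline comment flags the normalized-rank comparability problem bmansurov raises.

```python
def recommend_no_source(dest_lang, models_by_source, n=100):
    """models_by_source: {source_lang: model with a top_n(dest, n)
    method yielding (qid, score) pairs}. Hypothetical interface."""
    pool = {}
    for source_lang, model in models_by_source.items():
        for qid, score in model.top_n(dest_lang, n):
            # Caveat: scores derived from normalized ranks are not
            # directly comparable across wikis of different sizes, so
            # a naive max() can systematically favor small wikis.
            pool[qid] = max(pool.get(qid, 0.0), score)
    return sorted(pool.items(), key=lambda kv: kv[1], reverse=True)[:n]
```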
[22:55:08] dsaez: I'm here.
[22:56:05] hi leila, I was working until late, but now I'm going to bed
[22:58:26] dsaez: sure. nothing on my end, except that I'm starting the documentation for the reader research, and Sarah is going to help us as well. I'm putting the new structure in place for us to work on it next week. https://meta.wikimedia.org/wiki/Research:Characterizing_Wikipedia_Reader_Behaviour
[22:58:36] dsaez: enjoy your weekend. :)
[22:58:47] same
[22:58:55] yup. :)
[23:31:25] leila|lunch: so these are articles that the API thinks we should have?
[23:31:59] results from the new API look like more important articles to have
[23:32:12] however, something seems wrong
[23:32:29] Platonides: roughly. more accurately, it's the ranked list of articles that exist in en and are missing in es, and the ranking is done (in the second API's case) by predicting how many pageviews the article would receive if it were created.
[23:32:38] as I'm quite sure we have some of those articles
[23:32:50] Lego, Dictator, Whale?
[23:32:59] no way we don't have articles for those
[23:33:11] missing links, perhaps
[23:33:14] Platonides: the results are based on the Wikidata dumps from Jan. 2018, so if an article was created since then, it's not taken into account.
[23:33:30] Platonides: and yeah, missing links can be an issue (but I doubt it at the Whale level). can you give the QID?
[23:33:59] https://en.wikipedia.org/wiki/Lego is "missing"
[23:34:12] Platonides: it's good to know that the results from the second API look more important. bmansurov: FYI.
[23:34:12] we have https://es.wikipedia.org/wiki/LEGO
[23:34:19] but that is linked to The Lego Group
[23:35:05] * leila looks at Q170484
[23:35:35] it's indeed not linked from Wikidata
[23:35:48] Platonides: what is the Spanish page you would expect to see?
[23:36:35] Platonides: oh, I see. https://es.wikipedia.org/wiki/LEGO
[23:36:53] the problem seems to be that we have one article (LEGO) and en has two similar articles (Lego and The Lego Group)
[23:37:56] Platonides: yup, indeed. we have a similar problem with some of the fruits. For example, German is predicted to not have an article about Cherry, while Cherry was covered under Cherry tree the last time I checked.
[23:38:09] looking at Whale
[23:38:17] the translation is Ballena
[23:38:22] * leila checks
[23:38:26] redirecting to Balaenidae
[23:38:58] the thing is, the en Whale article is "an informal grouping within the infraorder Cetacea"
[23:39:00] Platonides: the redirects we should catch, I believe.
[23:39:19] we have both Balaenidae and Cetacea articles
[23:39:34] this en article is "in the middle"
[23:39:57] not really something missing
[23:40:03] but not sure how to tell the API that
[23:41:08] Platonides: ok. super helpful. The challenge is that it's hard to solve this accurately at the API level, as we don't have translations of the titles across languages. One thing we can do is look at the Wikidata label description and, if that exists, check it in the destination language (es in this case). This would solve the Whale problem.
[23:41:49] for Spirit (es: Espíritu), we have that page redirected to Alma (i.e. soul)
[23:42:21] plus, there will be articles for close concepts such as ghosts
[23:42:23] Platonides: more generally, however, I think the tool that calls the API may try to solve this. For example, if a tool recommends Whale as an article to be created in es, the tool should provide "the article exists" feedback. also, when the user, at the tool level, enters the article title in es (and hence provides the translation), the tool should do a search in es and show the top n results to the user, for the user to make the final call.
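Catching the Whale/Ballena case would mean following redirects in the destination wiki before declaring an article missing. action=query with redirects is a real MediaWiki API call; how a candidate es title is obtained (e.g., from the Wikidata label, as leila suggests above) is left open, so this is only a sketch.

```python
import requests

def title_exists(title, lang="es"):
    """Return True if the title exists in the given wiki, following
    redirects (so "Ballena" resolves to "Balaenidae" and counts as
    existing). Assumes a candidate title is already available."""
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={"action": "query", "titles": title,
                "redirects": 1, "format": "json"},
    )
    pages = resp.json()["query"]["pages"]
    # The API uses page id "-1" for titles that do not exist.
    return "-1" not in pages
```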
[23:43:03] feedback to the model, I mean.
[23:47:52] bmansurov: are you still around? (it's super late for you)
[23:49:23] Platonides: I /think/ we're getting a simple diff between en and es at the moment, and some of these issues can be resolved if we implement the approach developed in Section 2.1 of https://arxiv.org/pdf/1604.03235.pdf
[23:49:42] Platonides: I'll make a note of our conversation in the Phabricator task. thanks for looking into this with me.
[23:58:57] bmansurov: never mind. updated T203263
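For reference, the "simple diff" leila suspects the API is doing would look roughly like this. Dump parsing is elided, and items is assumed to be an iterable of Wikidata entities with their sitelinks; per the conversation, Section 2.1 of the paper describes a refinement that resolves some of the resulting mislabels.

```python
def missing_in_destination(items, source="enwiki", dest="eswiki"):
    """Yield QIDs of items with a source sitelink but no destination
    sitelink -- the naive en/es diff. This is exactly the kind of
    check that mislabels Whale and Lego as missing."""
    for item in items:
        sitelinks = item.get("sitelinks", {})
        if source in sitelinks and dest not in sitelinks:
            yield item["id"]
```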