[14:53:53] hey dsaez. did you see tizianop's email? it seems the results are not improved by the better dataset (which is surprising to me).
[14:54:50] leila: yep, I'm not sure why. I also don't think that AUC is a good metric
[14:55:20] I would like to see some examples
[14:55:35] examples of recommendations, dsaez?
[14:55:42] yep
[14:55:55] yeah. Tiziano is working on that one.
[14:57:18] dsaez: in the meantime: something we discussed yesterday that would be good to look into at some point: let's represent each article as a vector of sections. Find other articles that are close to a given article (with a measure such as cosine similarity). If we cannot find such articles, basically the data is too sparse for factorization.
[14:58:05] ok, I'm now uploading the notebook of my previous experiments. I didn't want to push in that direction, but results look better ...
[14:58:24] dsaez: better than what?
[14:58:37] than these results
[14:58:44] with what measure?
[15:00:31] * leila steps to a meeting with miriam
[15:00:33] prediction accuracy, on the task that I've defined, which is different.
[15:00:41] yeah
[15:01:09] but, at least, if you see some examples, recommendations are reasonable
[15:01:24] let's review them briefly in standup?
[15:02:44] sure
[15:11:48] dsaez, I did not have time to explore the results in detail. The first thing I noticed is that the system tends to recommend a small set of sections everywhere :/
[15:12:01] i see
[15:12:32] I'm still not sure that the task is correctly defined
[15:13:15] I know that I've asked before, but can you refresh exactly what is in the matrix, and how we evaluate the results?
[15:13:53] (I still think that we are learning and evaluating on the wrong sets)
[15:16:09] tizianop, so, the matrix is categories vs sections, true?
[15:16:25] no, articles vs sections
[15:17:24] for categories vs. section we have to define how to represent the "rating"
[15:17:49] raw count, probability or something mixed
[15:18:04] ok ok...
[15:18:14] I see
[15:19:05] so, the task will be given an article, see which are similar in terms of sections and try to predict what ... ?
[15:19:10] this is why I mentioned that maybe the problem is in using ALS where the rating is only 1 (or missing)
[15:20:36] given one article return the list of recommended sections
[15:22:11] the format of the dataset I shared is: article_title: [sorted top10 sections #1, #2...]
[15:26:40] ok, so AUC here will be ...?
[16:07:34] tizianop2, tizianop, my mistake: 7466 sections
[16:07:38] is that right?
[16:08:42] yes, 7466 sections and 34562 articles
[16:09:08] good, I was checking another dataset
[18:21:35] dsaez: did you and Tiziano figure out what was the source of discrepancy between the number of columns in your data and his?
[18:22:06] yep
[18:22:42] can you add one line after line 30 at https://etherpad.wikimedia.org/p/stubsExpansion what the problem was?
[18:24:41] thanks dsaez
[18:25:37] np
[23:35:31] anyone here working with the mwdb python library on the stats machines?
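
A rough sketch of the sparsity check suggested at 14:57:18 (articles as vectors of sections, nearest neighbours by cosine similarity). This is not the notebook code discussed in the channel: the function names, the 0.5 threshold, and the toy articles are all hypothetical, and each article is assumed to be a binary vector (a section is either present or not), which matches the "rating is only 1 (or missing)" remark.

```python
import math


def cosine_similarity(sections_a, sections_b):
    """Cosine similarity of two articles treated as binary section vectors.

    For 0/1 vectors this reduces to |A & B| / sqrt(|A| * |B|).
    """
    a, b = set(sections_a), set(sections_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def nearest_articles(articles, title, k=3):
    """Return up to k (similarity, other_title) pairs, most similar first."""
    target = articles[title]
    scored = [
        (cosine_similarity(target, sections), other)
        for other, sections in articles.items()
        if other != title
    ]
    # Sort by descending similarity, then by title for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


def fraction_without_neighbours(articles, threshold=0.5):
    """Share of articles whose best neighbour is below the threshold.

    The threshold is an arbitrary illustrative value, not one from the chat.
    """
    isolated = 0
    for title in articles:
        best = nearest_articles(articles, title, k=1)
        if not best or best[0][0] < threshold:
            isolated += 1
    return isolated / len(articles)


# Hypothetical toy data in the "article_title: [sections]" shape.
toy_articles = {
    "Article A": ["History", "Geography", "Economy"],
    "Article B": ["History", "Geography", "Demographics"],
    "Article C": ["Discography", "Awards"],
}

if __name__ == "__main__":
    print(nearest_articles(toy_articles, "Article A", k=2))
    print(fraction_without_neighbours(toy_articles))
```

If a large share of articles has no neighbour above the threshold, that would support the "too sparse for factorization" concern. The pairwise loop is quadratic in the number of articles, so for the 34562 articles mentioned above it would probably need a sparse-matrix implementation instead.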