[12:46:22] Say I do make an AI of Wikipedia, how do you compare it to Wikibooks, or something else, say Google, or another Wikipedia AI?
[14:53:44] I'm so confused as to what pressure679 was asking
[14:59:34] https://en.wikipedia.org/wiki/Precision_and_recall is a good place to start.
[15:47:31] halfak: o/
[15:47:41] * glorian_wd back online on IRC
[15:48:24] o/
[15:48:45] * halfak is waist deep in blog writing for WMF Comms
[15:48:49] halfak: I want to ask your opinion about something
[15:50:18] So, I am now working on the dataset for the item quality model. As a first step, I aim to develop a dataset which does not contain the relevant statements which should be obtained from wbs_propertypairs
[15:50:41] So, this dataset contains basic features such as number of statements, aliases, etc.
[15:50:53] halfak: do you think it is a good first step?
[15:51:26] no. That should be trivial to do with revscoring.
[15:51:35] I think you should leave that to me.
[15:51:44] And you should worry about getting signal for property completeness.
[15:52:13] halfak: oh I see. Glad that I asked you before digging deep into it!
[15:53:38] halfak: For the signal of property completeness, I thought you wanted to review my clustering result first?
[15:54:26] glorian_wd, I did and left a comment saying that I recommended using wbs_propertypairs.
[15:54:34] * glorian_wd checking
[15:54:46] halfak: did you leave your comment on the wiki page or the phab card?
[15:54:48] glorian_wd, didn't we discuss this on Saturday?
[15:54:52] The wiki page.
[15:55:33] Yeah, but we did not discuss this thoroughly
[15:56:50] halfak: hmm which wiki page? I thought you left your comment on https://www.wikidata.org/wiki/User:Glorian_WD/Clustering_Result_v2
[15:59:26] Oh I see your comment on the phab card
[15:59:46] ok got the wiki page
[19:58:04] halfak: still around?
[20:03:44] * halfak is in meetings
[20:04:20] halfak: are those gonna be finished in the next 30 mins?
[20:04:27] otherwise, I will ping you tomorrow
[20:04:45] yup
[20:05:01] halfak: Ok. poke me when you are done ;)
[20:32:15] glorian_wd, poke
[20:36:49] halfak: hey
[20:37:03] so I have tried to refresh my memory regarding the wbs_propertypairs table
[20:40:55] I guess if we want to engineer a feature out of it, could it simply be evaluating the probability of pid1 and pid2?
[20:40:57] For example:
[20:41:41] https://usercontent.irccloud-cdn.com/file/tuekNvj1/pid1%3A%2031%20and%20qid1%3A%201865345
[20:42:52] From the above screenshot, we know that if pid1:31 and qid1: 1865345, there should be properties in pid2 (from the example above, pid2 should be 641, 710, 585)
[20:44:26] halfak: I wonder if this approach is correct hmm
[20:45:56] sounds right to me.
[20:46:19] we can also consider the probability. For instance, if the probability of a pid1-qid1-pid2 occurrence is higher than 0.7, we could take that into account
[20:46:29] right
[20:46:35] meaning, we eliminate everything that has low probability
[20:47:01] Initially, I thought I was wrong
[20:47:03] :P
[20:49:41] Or maybe weight it appropriately.
[20:49:54] halfak: if you agree with this approach, tomorrow I will work on something concrete from the labeled data
[20:50:04] weight it appropriately? what do you mean?
[20:53:26] weight it usefully.
[20:53:41] E.g. turn the information into a vector that should contain signal.
[20:53:48] Some convenient correlation that we value
[20:54:03] E.g. number is low when an item is incomplete, and high when it's at least mostly complete.
[20:55:13] * glorian_wd is trying to understand what halfak just said
[20:56:01] halfak: "E.g. number is low when an item is incomplete, and high when it's at least mostly complete." how do we know if an item is complete or incomplete?
[20:56:15] wbs_propertypairs
[20:56:23] we only know some properties that have high probability given pid1-qid1
[20:57:58] halfak: yeah. Take a look at the above screenshot.
There are 8 properties that have high probability when pid1:31 and qid1: 1865345 occur. How do you know whether an item that contains these 8 properties is complete or incomplete? hmm
[21:02:22] glorian_wd, maybe if an item doesn't have items that seem high probability, that's an indication of low completeness ;)
[21:02:51] halfak: item doesn't have items?
[21:03:58] *properties
[21:09:43] halfak: "E.g. number is low when an item is incomplete, and high when it's at least mostly complete." Probably you meant, weight is low when the item is incomplete, and weight is high when the item is mostly complete?
[21:40:51] Maybe we could make something like this: if an item only has 3 of 5 properties from wbs_propertypairs, then the weight is 0.6 (from 3/5)
[21:41:21] halfak: did I understand you correctly?
[21:54:07] Sorry was in next meeting
[21:54:09] Back now
[21:54:54] glorian_wd, I think that would probably work but it would also lose a lot of signal
[21:55:03] I think you'll want to use the proportion in your calculations.
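
The three ideas floated above (a plain 3-of-5 proportion, dropping pairs below a probability cutoff such as 0.7, and weighting by probability instead of discarding) can be compared in a minimal sketch. Everything here is an assumption for illustration: the function name `completeness`, the probability values, and property IDs 17 and 276 are made up; only 641, 710 and 585 come from the screenshot discussed in the log, and the real wbs_propertypairs columns are not shown in this conversation.

```python
from typing import Dict, Set


def completeness(expected: Dict[int, float], present: Set[int],
                 min_probability: float = 0.0, weighted: bool = False) -> float:
    """Share of the expected properties that the item actually has.

    expected maps a suggested property ID (pid2) to its probability for the
    item's pid1/qid1 context; present is the set of property IDs on the item.
    """
    # Keep only suggestions whose probability is above the cutoff.
    kept = {pid: p for pid, p in expected.items() if p > min_probability}
    if not kept:
        return 0.0
    if weighted:
        # Each property counts in proportion to its probability.
        total = sum(kept.values())
        found = sum(p for pid, p in kept.items() if pid in present)
    else:
        # Each property counts equally.
        total = float(len(kept))
        found = float(sum(1 for pid in kept if pid in present))
    return found / total


# Hypothetical probabilities for an item with pid1=31, qid1=1865345.
expected = {641: 0.9, 710: 0.8, 585: 0.75, 17: 0.5, 276: 0.25}
present = {641, 710, 17}  # the item has 3 of the 5 suggested properties

plain = completeness(expected, present)                             # 3 of 5
thresholded = completeness(expected, present, min_probability=0.7)  # 2 of 3
weighted_score = completeness(expected, present, weighted=True)     # 2.2 / 3.2
print(plain, thresholded, weighted_score)
```

With these made-up numbers the thresholded variant ignores properties 17 and 276 entirely, while the weighted variant still credits the item for having property 17, which is one way to read the concern about losing signal.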