[12:41:05] awight is kicking ass at those debian packages!
[12:47:39] halfak: he totally is!
[12:48:07] If he didn't volunteer to help out, I don't know how we would have made it this far.
[12:48:16] Great timing.
[12:56:31] halfak: +1
[15:23:34] o/ Amir1 & ToAruShiroiNeko
[15:23:49] I'm going to need to miss the meeting tomorrow. I'm sorry to make a habit of this.
[15:23:59] I have travel again this weekend starting before our meeting.
[15:24:09] hey :)
[15:24:12] np
[15:24:23] we can have a make up session
[15:25:03] Yes. I'd like to do one early next week since I was late to schedule one for today.
[15:25:10] How does the same time on Tuesday sound?
[15:26:08] it works for me
[15:26:16] let's see if it works for ToAruShiroiNeko
[15:26:28] * halfak sends an email
[15:26:54] halfak: btw https://github.com/Ladsgroup/Kian
[15:27:45] Kian in current shape can add any kind of item-based statements like P31:Q5 if they have some identifiers in categories
[15:28:50] I will add tons and tons of statements today for ja.wp AUC for P31:Q5 of ja.wp was 99.95%
[15:29:15] * halfak looks through code
[15:29:27] Wow. :)
[15:33:40] halfak: about clustering, what else I need to do? besides uploading the graphs in commons. I forgot
[15:34:55] I think that we'll want to have a conversation about those graphs. Maybe post them on the talk page for revscoring?
[15:35:12] It seems that it would be good to do some analysis of the clusters too
[15:35:28] I think I only saw your plot of the loss in information as you increased cluster size
[15:35:36] *cluster k
[15:36:38] hmm I recall now, You asked me to give information of features when the n = 2
[15:38:23] all of them will be done in three or four hours
[17:25:36] halfak hello
[17:25:45] so this is the time I typically arrive home
[17:25:54] so pushing the meeting by two hours later would work for me
[17:26:04] mondays-thursdays that is
[17:31:05] halfak|FOOD food is good
[17:31:10] but you ought to meshi :3
[17:31:21] Can push back two hours.
[17:31:26] * White_Cat proposes multi cultural away messages
[17:31:30] halfak|FOOD yeah
[17:31:51] I will let you know if something happens
[17:32:05] something being a cancelled train or something more stupid
[19:00:20] halfak: ToAruShiroiNeko https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Clustering_in_reverted_edits
[19:00:31] I highly suggest you to enjoy this :)
[19:00:43] Helder: ^
[19:01:12] Amir1, could you produce a dataset labeled with the predicted clusters for each record?
[19:01:13] awight: you need to check this for bias research
[19:01:36] halfak: of course, which wiki?
[19:01:37] I just realized that we don't have rev_ids in that dataset, so it might be a bit of a problem to associate with the actual revisions.
[19:01:41] (or all of them)
[19:02:17] Once you have the clusters worked out, how hard is it to cluster new observations?
[19:02:36] *cluster --> categorize
[19:03:10] fairly easy, since I use scikit-learn and it can export parameters and we can use it somewhere else
[19:03:55] Cool.
So I think that what we should do in the short term is just re-extract the features for a random sample of reverted edits and apply the cluster classification to that
[19:04:06] So we can arrive at a dataset of
[19:04:07] beside that, if you give it to this code, it is pretty easy to handle those too, there are methods for that
[19:04:19] From there we can generate statistics and do some qualitative work.
[19:04:34] Amir1: Trying to cram learning how to read these graphs... it looks amazing to the untrained eye, so far.
[19:05:01] thanks awight
[19:05:02] :)
[19:05:38] What would you say it means?
[19:06:31] halfak: send me new set and I will return <#cluster> to you :)
[19:06:54] I don't have an easy way to add rev_id to the feature sets.
[19:07:00] But we can re-extract features.
[19:07:12] halfak: yeah, that sounds good
[19:07:12] I'm reading https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Clustering_in_reverted_edits
[19:07:14] awight: we are clustering reverted edits
[19:07:19] Do you want me to send the dataset that has ?
[19:07:24] So this means, there are definitely clusters cos the elbow is > K=1 ?
[19:07:50] we have several elbows
[19:08:13] the biggest elbow is 2
[19:08:18] no matter what
[19:08:37] but other elbows depend on wiki
[19:09:15] It sounds really big...
[19:09:30] halfak: rev_id, reverted and features
[19:09:55] This shows that an objective judge would say reverted is at least 2 distinct classifications?
[19:10:12] * halfak grumbles about generating a new dataset.
[19:10:21] Amir1, I won't be able to get to that until early next week.
[19:10:26] Tues or Weds
[19:10:38] awight, +1
[19:10:52] hot dog.
[19:10:59] If that judge's knowledge was purely born of our feature set :)
[19:12:25] right :) the objectively blindered judge
[19:12:25] halfak: we can run this clustering and get clusters of edits type for not-reverted edits
[19:12:27] *types
[19:12:37] cool!
[19:12:56] Amir1, Ooh! I already have some of that.
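The plan discussed above (re-extract features for a sample of reverted edits, then apply the already-fitted clusters to get a cluster per rev_id) can be sketched roughly as follows. This is a plain-Python illustration, not the project's code: the function names, the centers, the rev_ids and the feature vectors are all hypothetical. In scikit-learn, a fitted `KMeans` model offers `predict` for the same nearest-center step, and its `inertia_` is a sum of squared distances rather than the mean distance computed here.

```python
import math

def nearest_center(features, centers):
    """Return (index, distance) of the center closest to one feature vector."""
    distances = [math.dist(features, c) for c in centers]
    best = min(range(len(centers)), key=distances.__getitem__)
    return best, distances[best]

def label_observations(rows, centers):
    """Attach a cluster index to each (rev_id, features) row.

    Returns the labeled rows plus the mean distance to the assigned center.
    """
    labeled = []
    total = 0.0
    for rev_id, features in rows:
        cluster, distance = nearest_center(features, centers)
        labeled.append((rev_id, cluster))
        total += distance
    mean_distance = total / len(rows) if rows else 0.0
    return labeled, mean_distance

# Hypothetical centers exported from an earlier fit (2 clusters, 2 features).
centers = [(0.0, 0.0), (10.0, 10.0)]
# Hypothetical re-extracted sample: (rev_id, feature vector).
rows = [(101, (1.0, 0.0)), (102, (9.0, 10.0)), (103, (0.0, 2.0))]

labeled, mean_distance = label_observations(rows, centers)
print(labeled)        # list of (rev_id, cluster) pairs
print(mean_distance)  # mean distance to the assigned centers
```

Keeping rev_id next to the cluster label is what makes it possible to go back to the actual revisions for the qualitative work mentioned above.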
[19:13:05] Oh wait.
[19:13:09] That was supervised
[19:13:14] We specified the classes ahead of time
[19:13:38] That would be really interesting to compare to reverted clusters to see when our features aren't picking up on a distinction that humanoids are.
[19:13:45] halfak: I don't mind but I'm too excited
[19:14:17] Amir1, will see what I can do.
[19:14:50] (I was talking about extracting features)
[19:15:08] Yeah. I want to capitalize on your excitement by getting you data.
[19:15:18] Have results from SIMPLE for article -> top category mapping (still waiting for EN to finish loading).
[19:15:34] Overall it's very good, but there are some interesting edge cases.
[19:15:52] thanks
[19:15:56] :)
[19:16:12] For example, it gets confused about some historical figures. Are they people or history? Probably people, but the algorithm gets confused.
[19:16:12] shilad, cool! Anything online yet?
[19:16:25] shilad, that might be a feature more than a bug
[19:16:32] :)
[19:16:36] Is George Washington really a person?
[19:16:41] * halfak feels philosophical
[19:17:05] Still waiting for language import to load, but it's nearing completion.
[19:17:32] shilad, will you give us closeness measures so that we can tell that George Washington is both history and people?
[19:19:09] Right now the API returns distance only to the closest category. But I think you could issue a similarity query for what you're asking.
[19:20:21] Amir1: Does this analysis give us any idea of the overlap or disjointedness of the clusters?
[19:20:26] Yeah. I think that'd be useful. E.g. http://ores.wmflabs.org/scores/enwiki/wp10/677219613/
[19:20:35] ^ Doesn't just tell you the predicted class, but weights.
[19:22:53] clusters in kmeans work like this, there are centers and each data point is labeled based on distance to each center (and entry i gets label 1 if closest center is 1, etc.)
and cost function is mean of this distance per entry point
[19:23:05] yeah, https://github.com/yuvipanda/reference-php-tool/blob/master/public_html/index.php is my reference web tool now
[19:23:09] whoops, sorry, wrong channel
[19:23:20] Interesting. That's definitely more CPU intensive, but I could support it.
[19:24:16] There's also the family of similarity calls, (which are different than category graph distance). They support what you're asking on a more semantic level.
[19:24:24] But I've never tested them on categories.
[19:24:38] Gotcha
[19:34:50] Amir1, can you add a phab card for your clustering work (if you haven't already) and add a card for me to generate you a new dataset that has rev_id in it?
[19:35:09] sure thing
[19:35:25] I should make some cards for features in wb vandalism too
[19:35:49] +1
[19:35:54] We need a tag for that.
[20:07:25] halfak: Implemented your suggestion. For George Washington, in simple:
[20:07:35] I made them
[20:07:38] :)
[20:07:39] distances to top-level categories for LocalPage{nameSpace=ARTICLE, title=George Washington (simple), localId=5410, language=Simple English}
[20:07:39] 0.254 LocalPage{nameSpace=CATEGORY, title=Category:People (simple), localId=5904, language=Simple English}
[20:07:39] 0.501 LocalPage{nameSpace=CATEGORY, title=Category:Religion (simple), localId=6106, language=Simple English}
[20:07:41] 0.681 LocalPage{nameSpace=CATEGORY, title=Category:History (simple), localId=6602, language=Simple English}
[20:07:43] 0.875 LocalPage{nameSpace=CATEGORY, title=Category:Geography (simple), localId=5834, language=Simple English}
[20:07:45] 0.942 LocalPage{nameSpace=CATEGORY, title=Category:Everyday life (simple), localId=5865, language=Simple English}
[20:07:47] 1.051 LocalPage{nameSpace=CATEGORY, title=Category:Science (simple), localId=5833, language=Simple English}
[20:07:49] 1.662 LocalPage{nameSpace=CATEGORY, title=Category:Knowledge (simple), localId=65947, language=Simple English}
[20:10:44] Religion!
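One way to read the request for closeness measures (so that an article can count as both People and History rather than only its closest category) is to turn the per-category distances into weights. The sketch below does that with a softmax over negative distances. That choice, the `closeness_weights` name and the `temperature` value are assumptions made for illustration only; nothing in the log says the category API or ORES computes weights this way. The distances are the George Washington values pasted above.

```python
import math

# Distances reported in the log for George Washington (Simple English).
distances = {
    "People": 0.254,
    "Religion": 0.501,
    "History": 0.681,
    "Geography": 0.875,
    "Everyday life": 0.942,
    "Science": 1.051,
    "Knowledge": 1.662,
}

def closeness_weights(distances, temperature=0.25):
    """Map category distances to weights that sum to 1 (softmax of -d/T).

    A smaller hypothetical `temperature` concentrates weight on the
    closest category; a larger one spreads it across categories.
    """
    scores = {name: math.exp(-d / temperature) for name, d in distances.items()}
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

weights = closeness_weights(distances)
# Print categories from most to least heavily weighted.
for name, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14s} {w:.3f}")
```

Because the mapping is monotonic, the ranking is the same as ranking by distance; the weights only add a sense of how far ahead the top category is.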
[20:15:57] Yeah. There must be some path upwards through the graph that lands on Religion in Simple.
[20:15:57] But pretty unambiguously People.
[20:15:57] A "semantic" query would probably tell you something different.
[20:15:57] A more interesting example:
[20:15:57] distances to top-level categories for LocalPage{nameSpace=ARTICLE, title=Jesus (simple), localId=219585, language=Simple English}
[20:15:57] 0.335 LocalPage{nameSpace=CATEGORY, title=Category:Religion (simple), localId=6106, language=Simple English}
[20:15:57] 0.371 LocalPage{nameSpace=CATEGORY, title=Category:People (simple), localId=5904, language=Simple English}
[20:15:57] Hopefully I'll have it up for consumption very soon.
[20:17:26] Can you run soylent green through the alg?
[20:17:27] lol
[20:17:32] * halfak wants to know if it is people
[20:19:53] Hah! MUST do that. I don't think it's in Simple, though. Need to wait for EN.
[20:20:10] :)
[20:21:21] I really want to know about Abraham or Noah, since some people consider them human, some people consider them fictional characters
[20:54:31] Amir1, see https://phabricator.wikimedia.org/project/profile/1470/ and https://phabricator.wikimedia.org/project/profile/1468/
[20:55:25] halfak: oh nice
[20:57:13] I listed out all the new tags here: https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service#See_also
[20:57:21] So, if you need to look them up, that would be a good spot.
[20:57:37] wt_vandalism is where I plan to move all of the revert-specific stuff.
[20:57:53] Right now, there's a 'label_reverted' script sitting in ORES and it doesn't belong there.
[21:05:13] halfak: https://phabricator.wikimedia.org/tag/wb_vandalism/
[21:05:16] https://phabricator.wikimedia.org/tag/bwds/
[21:06:01] :) We should start tagging cards and then start up the workboard.
:)
[21:06:05] sorry wrong link
[21:06:05] https://phabricator.wikimedia.org/project/board/1470/
[21:06:13] I meant I already did that
[21:06:31] Sweet :)
[21:07:45] * halfak starts on the rev_id/feature extractor
[21:11:15] awesome
[21:11:40] I added columns and moved tasks to their own columns (Done, Active)