[01:01:37] 06Revision-Scoring-As-A-Service, 10revscoring: Implement "thresholds", deprecate "pile of tests_stats" - https://phabricator.wikimedia.org/T162217#3170258 (10Halfak) I was just looking at T159196. I think that we can use this to re-scale our outputs to intuitive values so that 50% really means "50% precision"...
[06:37:39] 06Revision-Scoring-As-A-Service, 10MediaWiki-extensions-Translate, 06translatewiki.net: qqq for a wiki-ai message cannot be loaded - https://phabricator.wikimedia.org/T132197#3170461 (10Nikerabbit) >>! In T132197#3168696, @Halfak wrote: >> MediaWiki limits what page titles are valid. > > I see. So all mess...
[13:03:06] 10Revision-Scoring-As-A-Service-Backlog: DRAFT: Use rate limiting for ORES Action API score retrieval - https://phabricator.wikimedia.org/T162484#3171224 (10Tgr) > Is it possible to do the following? > # Allow scores to be returned in Action API responses provided there are corresponding records in the recent ch...
[15:10:46] o/ halfak
[15:11:14] o/ Nettrom
[15:11:22] sorry to be away yesterday. It was crazy
[15:11:31] noticed you got busy, no worries
[15:11:33] I'll be much more present today :)
[15:11:46] stop peeking over my shoulder ;)
[15:12:11] :) So we were going to talk about WPMED and moving forward, right?
[15:13:01] yep, it appears that I’m pushing up against the boundaries of WPMED’s importance ratings, so I’d like to do some evaluation
[15:13:25] not sure what approach to take, I noticed your ORES evaluation listed some articles and asked for assessments
[15:13:38] (the ORES audit, that is)
[15:14:44] Oh yeah. That's mostly unrelated to importance and rather showing that some "external reviews" of Wikipedia are crappy and outdated.
[15:14:53] What did you have in mind for a WPMED evaluation?
[15:16:43] best-case scenario is perhaps importance assessments of a selection of articles (my test set is 160, but more is better) from multiple participants in order to see if 1: they agree with each other, and 2: they agree with our predictions
[15:16:51] but that sounds costly
[15:17:34] a lightweight way of gathering feedback on article ratings is perhaps a good alternative?
[15:18:25] Nettrom, I wonder if we could use the prediction model for something WPMED wants to do to see if it ends up being useful.
[15:18:47] E.g. finding articles they have mis-classified by importance or finding articles that should be tagged but are not yet tagged.
[15:18:52] gather a set of candidates for reassessment like we’ve done with quality ratings in the past?
[15:18:56] Right
[15:19:16] I feel like that could evaluate the *usefulness* of the model and do something that is worth volunteer time :)
[15:19:27] Our statistician reviewers won't like it though :/
[15:19:45] yeah, they won't
[15:20:06] but maybe we should aim to satisfy them later, running something with more than a single project?
[15:20:50] +1
[15:21:13] Or just push back against them and say "your statistical validity is simply less important than direct assessments of utility"
[15:21:25] Either way :)
[15:21:34] * J-Mo is excited to hear the computer scientists chatter about making AI's useful
[15:21:40] :)
[15:21:52] Nettrom, What do you think about the weird quirks in WPMED's importance assessments?
[15:22:13] I'm starting to land on ignoring them and building a generalized model.
[15:22:41] well, in this case I think this is also about having a conversation with the target audience… because I’m seeing predictions that push against WPMED’s boundaries of importance… so talking to them about whether they should have only 90 articles of Top-importance is part of it all
[15:23:04] We can let query filtering happen after the fact. E.g. "show me all of the articles needing re-assessment that are not people because I know we're weird about that."
[15:23:25] how about build two versions and ask people which is more useful?
[15:23:57] J-Mo, one version would be mostly unsustainable. :\ But it might be academically interesting to find out if it is more useful.
[15:24:11] I suspect that many people who consider themselves "WikiProject Medicine" don't follow the "downgrade biographies" rule.
[15:24:38] they have the “society and medicine” task force to hang out in, though
[15:24:42] you mean the Med-specific version would be unsustainable? then it sounds like the choice is clear
[15:25:14] J-Mo, yeah. I'm trying to think about what it would take to learn those rules and I think that applying them outside of the predictive model is much easier.
[15:25:32] And it would give the wikiprojects the flexibility to come up with new rules more easily.
[15:25:47] agree. just apply some heuristic filtering after the model runs, to push down the bios?
[15:26:18] you have to filter them before the run too, because if you want to treat them specially, you can’t train on them
[15:26:24] ah
[15:27:38] +1 Nettrom. I thought about that too. We might be able to just leave them to be drowned out as noise.
[15:27:48] Maybe some other project will have a boost for bios?
[15:27:50] I dunno
[15:28:59] in WPMED there are enough articles (e.g. about 10% of it is bios) that it should bias the classifier if we leave them in… so basically it means the system needs a sidechain
[15:29:46] slightly more than 1/5 of WPMED is by default Low-importance
[15:30:04] (or somewhere in that ballpark, I need to ask them if my set of categories is sane)
[15:30:09] What do you think it would take to remove these kinds of observations for all wiki projects?
[15:34:38] based on my experience with WPMED, I’m hopeful that we can write a discovery algorithm on top of Wikidata to identify majority categories for review… question is, how many do we find? WPMED is reasonably straightforward since we know we’re looking for Low-importance articles
[15:35:11] not sure how this would work if we’re looking for combinations of things (at which point it might not make sense)
[15:41:26] hm, it’s also a question of how active a WikiProject is. WPMED is large and has active participants, we might not get feedback from other projects
[15:41:58] Nettrom, I'm not sure we need strong feedback right now.
[15:42:20] My sense is that we should get something that *works* (even minimally) in front of people so that they can use it to do something.
[15:42:38] I expect that, if this is useful, we'll have volunteers flocking to us to tell us what isn't working right :)
[15:48:40] okay! I propose that I start by training a classifier for WPMED that ignores certain categories, create lists of candidates for reassessment, and solicit their feedback
[15:49:02] I’m also working on identifying other projects we can approach to see if they’re interested
[15:49:06] +1 that sounds great.
[15:49:14] What do you think of adding other wikiprojects into the model?
[15:49:27] So that you start with a generalized wikiproject importance recommender?
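
A minimal sketch of the "filter before training, filter again after scoring" idea discussed above, assuming a hypothetical list of article dicts with a `title` and a `categories` set; none of these names come from revscoring or any existing ORES code:

    def is_rule_governed(article, rule_categories):
        """True if a project rule (e.g. WPMED's "downgrade biographies")
        fixes this article's rating regardless of what a model would say."""
        return bool(article["categories"] & rule_categories)

    def split_for_training(articles, rule_categories):
        """Separate rule-governed observations from the ones the model trains on."""
        trainable, rule_governed = [], []
        for article in articles:
            bucket = rule_governed if is_rule_governed(article, rule_categories) else trainable
            bucket.append(article)
        return trainable, rule_governed

    def apply_rule_overrides(predictions, rule_governed, default_rating="Low"):
        """After scoring, re-apply the rule as a side-channel: rule-governed
        articles get the project's default rating instead of the prediction."""
        overrides = {article["title"]: default_rating for article in rule_governed}
        return {**predictions, **overrides}

Training only on `trainable` keeps the by-rule observations (e.g. the ~10% of WPMED that is biographies) from biasing the classifier, while `apply_rule_overrides` is the post-hoc "sidechain" that pushes them down to the project's default rating.
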
[15:51:40] I suspect that it won’t do as well, because when we’re building project-specific models there’s an opportunity for utilizing the scope of the project to get better information (the project-specific model has an additional predictor)
[15:52:04] but, now that we’ve defined a bunch of WPMED as out of scope, I can retest that assumption
[15:53:54] It would be great if we could get view rates on a per-project basis. E.g. this article gets a lot of views for a WikiProject Mycology article, but not that many views for a WikiProject Medicine article -- so it's probably important to Mycology but maybe not Medicine.
[15:54:57] yeah, and that is one of the challenges, that importance is very dependent on scope… and there isn’t really a good notion or dataset of “global importance”
[15:57:40] it’s something that I’ve been thinking about for WPMED as well… when it comes to diseases, their importance ratings appear to be based on global prevalence, but that might not correlate with “what people read on Wikipedia”
[15:57:55] I suspect we’ll know more about that when I put some reassessment candidates in front of them
[15:58:02] Nettrom -- a thought: what if we train a model on within-WikiProject importance and then apply that same model to all of Wikipedia to get global importance?
[15:58:28] Yeah. WPMED's weirdness is an independent thing.
[15:59:34] halfak: I’ve been thinking along those lines, building a hybrid classifier… question is, should you weight projects equally? I still don’t have a good answer to that.
[16:00:01] Don't weight them at all.
[16:00:02] :)
[16:00:31] Essentially, we'd build a classifier that predicts "What's the importance of Article A in article group G"
[16:00:44] Where G is usually a WikiProject, but is sometimes the set of all articles.
[16:00:57] We just train it on WikiProjects and see what it says for all articles
[16:01:06] if you figure that out, it'd be useful for search relevance too :)
[16:01:27] we use a generic popularity score today, which is just page views(article) / all page views
[16:01:40] ebernhardson, right, this is one of the goals of this model.
[16:01:57] :)
[16:03:31] halfak: a potential problem here is that there are enough articles where projects disagree about the ratings to make this just a lot of noise… our dataset on unanimous ratings isn’t very big
[16:04:11] Nettrom, I'm being unclear. I'm proposing to base our features on per-wikiproject measurements.
[16:04:37] yeah, but popularity and number of wikilinks are global
[16:04:38] And to train on the model's ability to predict the importance of an article to a specific WikiProject, where a WikiProject is formalized as a set of articles.
[16:04:58] Right, but wikilinks can be calculated within a WikiProject
[16:05:07] Pageviews -- well, we'll need to adjust for that somehow.
[16:05:29] Maybe by comparing to the total views of the articles in the WikiProject of interest.
[16:05:48] maybe views are log-normal so we can use standard deviations? that could work
[16:07:08] or use view rank instead… hm, might work, I can test it with the WPMED dataset
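
A rough illustration of the two per-project view normalizations floated just above: a z-score of log-transformed views within the WikiProject, and a within-project view rank. `project_views` is a hypothetical {title: view count} mapping for a single project, not an existing dataset or API:

    import math
    from statistics import mean, stdev

    def log_view_zscores(project_views):
        """Standardize log-transformed views within one WikiProject
        (reasonable if views are roughly log-normal)."""
        logged = {title: math.log(views + 1) for title, views in project_views.items()}
        mu = mean(logged.values())
        sigma = stdev(logged.values()) or 1.0  # guard against a project where all views are equal
        return {title: (lv - mu) / sigma for title, lv in logged.items()}

    def view_rank_percentiles(project_views):
        """Rank articles by views within the project (smallest value = most viewed)
        and scale to (0, 1] so projects of different sizes stay comparable."""
        ordered = sorted(project_views, key=project_views.get, reverse=True)
        n = len(ordered)
        return {title: (rank + 1) / n for rank, title in enumerate(ordered)}

Either value could stand in for raw page views as a per-WikiProject feature, which is the "a lot of views for a Mycology article, not many for a Medicine article" comparison described above.
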
[19:12:17] 06Revision-Scoring-As-A-Service, 10Wikidata, 10Wikilabels: Deploy Wikidata item quality campaign - https://phabricator.wikimedia.org/T157493#3172907 (10Halfak) There was a bug. So I re-deployed again.
See http://labels.wmflabs.org/campaigns/wikidatawiki/53/?campaign=stats
[19:12:26] 06Revision-Scoring-As-A-Service, 10Wikidata, 10Wikilabels: Complete Wikidata item quality campaign - https://phabricator.wikimedia.org/T157495#3172909 (10Halfak)
[20:43:24] 06Revision-Scoring-As-A-Service, 10Wikidata, 10Wikilabels: Deploy Wikidata item quality campaign - https://phabricator.wikimedia.org/T157493#3173311 (10Glorian_WD) @Halfak : could you briefly explain about the bug?
[21:06:36] 06Revision-Scoring-As-A-Service, 10Wikidata, 10Wikilabels: Deploy Wikidata item quality campaign - https://phabricator.wikimedia.org/T157493#3173362 (10Halfak) https://github.com/wiki-ai/wikiclass/commit/950f693d789f8512e30f483f18e2d13483d13749
[21:36:59] 06Revision-Scoring-As-A-Service, 10Analytics, 10ChangeProp, 10EventBus, and 3 others: Switch `/precache` to be a POST end point - https://phabricator.wikimedia.org/T162627#3173433 (10Ladsgroup) https://github.com/wiki-ai/ores/pull/192
[21:46:55] 06Revision-Scoring-As-A-Service, 10Analytics, 10ChangeProp, 10EventBus, and 3 others: Switch `/precache` to be a POST end point - https://phabricator.wikimedia.org/T162627#3173436 (10Ladsgroup) https://github.com/wiki-ai/ores/pull/192
[21:59:49] o/ Amir1
[21:59:52] Just saw your PR
[22:00:04] halfak: Hey
[22:00:19] I'm working on trwiki right now :)
[22:00:19] 1, I think we should support both a GET param and form body.
[22:00:30] Gotcha. I might take a pass at your PR then ^_^
[22:00:46] you can write it here
[22:00:49] I will get to it
[22:00:53] don't worry :D
[22:01:21] OK. I've been working on the thresholds thing a little bit.
[22:01:27] I'll have some news about that soon.
[22:02:19] For each threshold (max 200), I'll store (recall, precision, accuracy, f1)
[22:02:52] I'll also store !precision and !recall
[22:03:08] !precision is the precision of a False prediction actually being False at a given threshold.
[22:03:08] 04Error: Command “precision” not recognized. Please review and correct what you’ve written.
[22:03:09] Key was added
[22:03:15] AsimovBot, calm down
[22:03:15] 04Error: Command “calm” not recognized. Please review and correct what you’ve written.
[22:03:19] lol
[22:03:35] AsimovBot, despacito
[22:03:35] 04Error: Command “despacito” not recognized. Please review and correct what you’ve written.
[22:06:33] Oh yeah, and !recall too.
[22:06:43] * halfak waits for AsimovBot to get excited
[22:07:06] The problem I'm working on right now is re-scaling based on known input set biases.
[22:07:30] E.g. with Wikidata, we know that we're biasing towards damaging edits when we train and test the model.
[22:07:36] We did that on purpose.
[22:07:55] Yeah
[22:08:51] I'm thinking that we'll add some class weight params when testing and the stats can use those to re-scale.
[22:20:45] Regretfully, it also means that I need to do lots of math myself.
[22:20:53] Happily, fitness statistics are easy math :)
[22:21:02] Just need to not do something terribly wrong :/
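
For reference, a small sketch of the per-threshold statistics described above: at a given score threshold it computes recall, precision, accuracy, and f1, plus !precision and !recall (the same measures for the negative/"False" side). This only illustrates the definitions given in the conversation; it is not revscoring's actual implementation:

    def threshold_stats(scores, labels, threshold):
        """`scores` are predicted probabilities of the positive class,
        `labels` are the true booleans; predict True when score >= threshold."""
        preds = [score >= threshold for score in scores]
        tp = sum(1 for p, l in zip(preds, labels) if p and l)
        fp = sum(1 for p, l in zip(preds, labels) if p and not l)
        tn = sum(1 for p, l in zip(preds, labels) if not p and not l)
        fn = sum(1 for p, l in zip(preds, labels) if not p and l)

        precision = tp / (tp + fp) if tp + fp else None
        recall = tp / (tp + fn) if tp + fn else None
        not_precision = tn / (tn + fn) if tn + fn else None  # precision of False predictions
        not_recall = tn / (tn + fp) if tn + fp else None     # recall of actual Falses
        accuracy = (tp + tn) / len(labels)
        f1 = (2 * precision * recall / (precision + recall)
              if precision and recall else None)
        return {"precision": precision, "recall": recall, "accuracy": accuracy,
                "f1": f1, "!precision": not_precision, "!recall": not_recall}

Re-scaling these for a deliberately biased test sample (the class-weight idea mentioned above) would then amount to reweighting the tp/fp/tn/fn counts before taking the ratios.
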
[22:45:36] SPAM
[22:45:41] _o_
[22:45:46] \o_
[22:46:32] Thresholds is going to be a bit ugly.
[22:46:45] OK time for me to head out.
[22:46:52] * halfak --> dinner
[22:46:58] thanks for working on cleanup, Amir1
[22:46:59] o/
[22:47:13] :D
[22:47:37] The autolabeler is working so I thought I'd do some cleanup
[23:04:16] 06Revision-Scoring-As-A-Service, 10Wikilabels, 10rsaas-editquality, 15User-Ladsgroup: Start v2 editquality campaign for trwiki - https://phabricator.wikimedia.org/T161977#3173712 (10Ladsgroup) https://github.com/wiki-ai/editquality/pull/65
[23:06:47] 06Revision-Scoring-As-A-Service, 10Wikilabels, 10rsaas-editquality, 15User-Ladsgroup: Start v2 editquality campaign for trwiki - https://phabricator.wikimedia.org/T161977#3173733 (10Ladsgroup) I will start the new campaign once the patch is merged.