[01:19:31] Jade, Scoring-platform-team, CommRel-Specialists-Support (Jan-Mar-2020): Design Jade pilot deployment plan with the Scoring Platform team - https://phabricator.wikimedia.org/T246486 (Johan) @Halfak I'll start poking a few people next week to see if I can interest them in this.
[03:38:18] (CR) Ppchelko: [C: +2] extension.json - don't use array syntax when hooks only have 1 handler [extensions/ORES] - https://gerrit.wikimedia.org/r/578914 (owner: DannyS712)
[03:42:21] (Merged) jenkins-bot: extension.json - don't use array syntax when hooks only have 1 handler [extensions/ORES] - https://gerrit.wikimedia.org/r/578914 (owner: DannyS712)
[14:06:45] ORES, Scoring-platform-team: [Discuss] Future ORES architecture - https://phabricator.wikimedia.org/T226193 (Ottomata) That's a great article! Nice to learn about [[ https://cloud.google.com/blog/products/ai-machine-learning/introducing-feast-an-open-source-feature-store-for-machine-learning | Feast ]]...
[14:35:52] Scoring-platform-team (Research), Outreach-Programs-Projects, Google-Summer-of-Code (2020), Outreachy (Round 20), artificial-intelligence: Proposal (GSoC / Outreachy 2020): Implement an NSFW image classifier with open_nsfw - https://phabricator.wikimedia.org/T247614 (Chtnnh)
[14:40:57] \o/
[14:48:08] halfak, I am working on changing articlequality/extractors of each wiki to make from_template generators as of now
[14:48:40] and change extract_template to expect multiple values of labels
[14:48:48] will let you know if I hit a roadblock
[14:57:58] Sounds great! Thanks.
[15:03:23] o/ kevinbazira. Around for retro?
[15:57:59] halfak: Hi👋 This is Nikhil. Looking forward to having a chat with you.
[16:08:54] Hey nikhil07! In meeting. Out in a couple hours.
[16:08:57] Sorry for the delay :(
[16:09:58] No problem
[16:52:13] Back now for a bit!
[16:52:15] nikhil07, ^
[16:52:59] good to hear that
[16:53:23] So yeah. Let's talk projects. Tell me more about your interests :)
[16:55:07] I am broadly interested in human-AI interaction (human-centered AI),
[16:56:22] which covers a broad spectrum, from how humans interact with AI to understanding its social impact.
[16:56:33] We're working on Jade, which might be interesting to you. https://www.mediawiki.org/wiki/JADE
[17:01:08] It does look good,
[17:01:32] but how do I get involved
[17:01:50] to explore further?
[17:04:09] We're user-testing the initial bits of the UI right now. What kind of experience do you have with UI design and user testing?
[17:04:32] In the near future, I'd like to write a systems paper about Jade that includes some case studies from the pilot deployment.
[17:08:20] I do have some prior experience with UI design and user testing, but I don't have any formal training in it. But I am completely open to learning new skills based on the problems I will tackle.
[17:10:44] Hmm. I'd like to find something that lines up most closely with your interests and skills. Tell me more about the human-centered AI work you have experience with.
[17:10:57] I wonder if a more ethnographic or quantitative approach would be more in line.
[17:12:02] could you maybe introduce a quantitative approach to the user testing?
[17:13:49] We probably wouldn't do this during user testing. I think we might like a quant approach to what people start submitting into Jade.
We'll likely not do the pilot deployment for a while yet. Maybe a month.
[17:14:45] Most of my prior work has revolved around computational social science and data science projects. For example, I worked on a project in which I mined a large number of English Wikipedia articles to train a classifier for article quality.
[17:15:53] Human-centered AI is something I aspire to explore, but I am open to other domains also.
[17:16:37] So one of the ways I look at Jade is that it gives humans control over labeled data in a new way.
[17:16:53] I'd argue that the most important component of an AI is the labeled data you use to train/test/monitor it.
[17:17:03] What do you think of that argument?
[17:17:35] I completely agree with that statement. Data is the new oil.
[17:21:39] truly
[17:22:05] Right on. So what kind of governance actions do we expect to see when a community has access to the data? How might we use that to track issues with deployed models and use that to re-adjust?
[17:22:09] and using deliberate human input to create labeled data is a very smart move
[17:23:15] chtnnh makes a good point. These labels are *explicit* and *curate-able*.
[17:23:28] Jade allows their production and maintenance to be intentional.
[17:25:19] Regarding the governance actions, I think the community would try to remove or reduce the underlying bias in the labeled dataset to make a better AI for themselves.
[17:27:55] nikhil07, how would we know that they are doing that? What types of bias might that not really work for?
[17:27:57] Regarding the usage, I think there could be some evaluation parameters for the AI, like fairness, newcomer experience, and trust, and based on these metrics we could modify the underlying labeled dataset to improve the overall experience. I mean, this could be made an iterative process.
[17:28:54] Right on. One of the processes I'm curious about is meta-moderation. Essentially, we could (for our own political reasons) organize a campaign for re-considering edits by newcomers that are marked as bad.
[17:28:56] nikhil07, I think in this process the model may tend to overfit
[17:29:05] Oooh interesting. Tell me more.
[17:29:33] if we are reducing bias by giving it more labeled data
[17:29:44] generated by deliberate action,
[17:29:50] fit vs. overfit in the context of community direction is a really interesting pattern.
[17:29:54] there is a slight chance of overfit imo
[17:30:06] yes i think so too
[17:30:22] One of the things I think about a lot is trying to make sure the testing/training context reflects the work context. This works out to trying to get random samples for training/testing.
[17:30:59] Essentially, the stream of edits happening to articles in Wikipedia is adequately modeled by a random sample of all edits.
[17:31:14] But with Jade, labeling is self-directed.
[17:31:28] exactly why overfit might be a problem
[17:32:17] We could find ourselves in a feedback loop pattern too. E.g. ORES saying an edit is bad might increase the likelihood that we get a bad label -- regardless of the True(TM) quality of the edit.
[17:32:32] Could work the other way. No flag == more likely to get a "good" label.
[17:33:29] true, but how do you counter this problem?
[17:36:04] I didn't quite get the feedback loop pattern point
[17:36:14] in the context of JADE
[17:37:21] JADE enables the feedback loop
[17:37:34] because without JADE it would be only one-way transmission
[17:37:44] right halfak?
[17:37:49] do you mean that the AI response will degrade the human response to it?
[17:38:02] it may
[17:38:18] cause and effect, so to speak
[17:38:49] Ok. So, we have labeled data, we train an AI that has some biases. The biases direct the human behavior that creates labels with the bias infused. Then we retrain the model. And now it is more biased. Rinse and repeat.
[17:39:51] Jade records where a label comes from -- i.e., what UI someone was looking at when they submitted it. So we know when someone was seeing an ORES prediction.
[17:40:06] We could maybe use that to look for evidence of incoming bias in the labels.
[17:40:28] Or use that to run follow-up campaigns to ask people to re-consider old labels without ORES' influence.
[17:41:14] I'm less concerned about our "damaging" model, as that has a pretty low precision; people are used to it being wrong. But our article quality models -- I worry that people are starting to favor them over their own, human judgment.
[17:41:39] Yeah, I agree with that. But if you look at it from the Linus's Law perspective, it may happen just the correct way also. I mean, as more and more diverse people are able to produce the underlying labeled dataset, the chances of any kind of bias in it are reduced.
[17:45:09] yeah i agree with nikhil07 on the reproducibility part
[17:45:25] but that would take time to achieve, don't you think?
[17:45:31] nikhil07, I agree. Seems equally likely.
[17:45:45] How would we know which scenario we're under? That sounds like a great quant question :)
[17:48:17] oh wow
[17:48:19] XD
[17:49:23] Or maybe we can experiment with different mechanisms where on one side no one will be shown ORES results and on the other everyone will be shown ORES results. And we try to combine the answers from a large group to find the "correct/unbiased" labels for our dataset.
[17:50:16] Oh interesting. We have some tools to do that too. Our current way to get labels removes a bunch of sources of bias when asking "is this damaging?"
[17:50:21] something similar to the wisdom-of-crowds effect under no information diffusion vs. complete information diffusion
[17:50:41] We could get a bunch of Jade labels and then do a targeted campaign for a sample of edits that have Jade labels to look for differences in judgment.
[17:51:50] Let's say that we do find there is a difference and that getting a second, unbiased opinion is important. We could use this in some interesting ways. E.g., we could have the lower quality data train the model and use the higher quality data for testing/looking for bias.
[17:52:04] The re-review pattern is expensive, so in this scenario we'd make the best use of it, I think.
[17:52:47] I wonder if there is a collaboration pattern we could encourage that would reduce the likelihood of getting a "biased label".
[17:52:51] We could build a model!
[17:53:00] A model that predicts when a label is likely to be biased ^_^
[17:53:12] It's models all the way down XD
[17:54:47] I need to run to eat lunch. nikhil07, I think this is good fodder for a project proposal. If you like this direction, I think that's a good next step towards seeing if we can do a project together. I'm also interested in other ideas if something else occurs to you :)
[17:57:12] Let me think a little longer. I will definitely get back to you if I have any issues/doubts and with my views or plans for this project.
[18:08:11] wow nikhil07 i really liked this discussion :)
[18:09:54] Same here. Suddenly, I am getting a lot of cool ideas in this space!😅
[18:10:45] because of the discussion
[18:11:21] exactly XD
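To make the bias check discussed above a bit more concrete: since Jade records whether a labeler was looking at an ORES prediction when they submitted a label, one simple quant approach is to compare how often human labels agree with ORES for "exposed" versus "blind" labels. The Python sketch below only illustrates that comparison; the LabelRecord fields (saw_ores_prediction, ores_damaging, human_damaging) are made-up names for illustration, not part of any actual Jade or ORES schema.

from collections import namedtuple

# Hypothetical shape of a Jade label record; the real schema may differ.
LabelRecord = namedtuple(
    "LabelRecord", ["saw_ores_prediction", "ores_damaging", "human_damaging"]
)

def agreement_rate(records):
    """Fraction of records where the human label matches the ORES prediction."""
    if not records:
        return float("nan")
    agree = sum(1 for r in records if r.human_damaging == r.ores_damaging)
    return agree / len(records)

def ores_influence_gap(records):
    """Difference in human/ORES agreement between labelers who saw the
    prediction and those who labeled blind.  A large positive gap is weak
    evidence of the anchoring / feedback-loop effect discussed above."""
    exposed = [r for r in records if r.saw_ores_prediction]
    blind = [r for r in records if not r.saw_ores_prediction]
    return agreement_rate(exposed) - agreement_rate(blind)

# Toy example: exposed labelers agree with ORES more often than blind ones.
sample = [
    LabelRecord(True, True, True),
    LabelRecord(True, False, False),
    LabelRecord(True, True, True),
    LabelRecord(False, True, False),
    LabelRecord(False, False, False),
]
print(round(ores_influence_gap(sample), 2))  # 0.5

A gap like this would not prove bias on its own (exposed and blind labels cover different edits), but it points at where a targeted re-review campaign, as suggested above, would be most informative.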
[20:45:01] \o/
[20:45:08] woops, forgot to change my nick.
[20:45:18] Was just in a big long meeting and got back to design work.
[20:45:40] Just finishing up the diff/historical revision view of a Jade "diff" entity.
[20:45:55] Next I'm working on diffs of Jade content.
[21:18:01] OK! I have the diff proposal for both how Jade diffs should look and how the diff page should render. I have to do some formatting to get things to line up.
[21:18:07] I'll finish that up next week.
[21:18:12] I'm off. Have a good weekend!
[21:44:48] nikhil07, chtnnh, halAFK: That was indeed a very interesting discussion. I have like zero experience in user testing, but here are some thoughts about the bias problem. (Feel free to tell me I don't know what I'm talking about if I'm way off base :P )
[21:46:17] What if users could see *why* the AI assigned the label it did? Would that change our minds about the potential bias the user might be introducing, having seen the label?
[21:47:56] In other words, if the model's reasoning were interpretable to some degree, the question we'd be asking the user is, "Does your analysis of this edit agree with the model's reasoning?"
[21:52:44] To drive this home, let's imagine that the user is asked to provide a label *before* seeing the label from JADE. Then, once the user has provided a label, the label from JADE is revealed with the corresponding reasoning. Finally, the user is given a chance to refine his/her own label.
[21:55:17] My hypothesis is that we should be able to infer *something* useful about the quality of the user's label from this. For example, if the user's label disagrees with JADE's label, yet the user chose not to change his/her own label, then that may imply the user had a very strong reason for assuming the label that the AI model has not yet captured.
[22:00:17] To clarify: the user may have had a very strong reason for assuming the label, which reasoning the AI model has not yet captured.
[22:04:12] Scoring-platform-team (Research), Outreach-Programs-Projects, Google-Summer-of-Code (2020), Outreachy (Round 20), artificial-intelligence: Proposal (GSoC / Outreachy 2020): Implement an NSFW image classifier with open_nsfw - https://phabricator.wikimedia.org/T247614 (srishakatux) @chtnnh Hi!...
[22:05:01] I'm inclined to think that such a user's label is of high quality, because it stems from a sort of implicit critical analysis from the model to the user, and the user has to come up with sufficient reasoning to override the model's criticism.
[22:50:50] the user could as well be wrong
[22:50:58] ignorant (not understanding the issue)
[22:51:03] or even malicious
[22:51:35] oh, lazy is an option, too
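The "label first, reveal later" flow proposed above implies a small piece of inference logic: compare the user's initial label, the model's revealed label, and the user's final label, and treat persistent disagreement as a potentially high-value signal (with the caveats about wrong, ignorant, malicious, or lazy labelers raised at the end). The sketch below only illustrates that classification step; the record shape and the outcome categories are assumptions, not an existing Jade feature.

from dataclasses import dataclass

@dataclass
class BlindThenRevealLabel:
    """Hypothetical record of the 'label first, reveal later' flow described above."""
    initial_label: bool   # user's judgment before seeing the model's label
    model_label: bool     # label (plus reasoning) revealed to the user afterwards
    final_label: bool     # user's judgment after the reveal

def classify_outcome(rec: BlindThenRevealLabel) -> str:
    """Rough interpretation of the user's final label, per the hypothesis above."""
    if rec.initial_label == rec.model_label:
        return "independent agreement"      # user and model agreed before any influence
    if rec.final_label == rec.model_label:
        return "conformed after reveal"     # user switched; possible anchoring on the model
    return "persistent disagreement"        # user held firm; label may carry reasoning the model lacks

# Example: the user disagreed with the model and kept their own label.
print(classify_outcome(
    BlindThenRevealLabel(initial_label=False, model_label=True, final_label=False)
))  # persistent disagreement

"Persistent disagreement" labels would still need a second opinion or re-review before being treated as high quality, since a stubbornly wrong or malicious labeler produces exactly the same pattern.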