[01:20:14] Scoring-platform-team, Discovery-Search, Elasticsearch, revscoring, artificial-intelligence: Improve the performance and quality of tokenization in revscoring - https://phabricator.wikimedia.org/T248480 (Halfak) Thanks for the reminder. See P10868. A few surprising things to me. (1) I didn...
[06:44:13] Jade, Scoring-platform-team (Current), MW-1.35-notes (1.35.0-wmf.27; 2020-04-07): Add a confirmation dialog to Jade - https://phabricator.wikimedia.org/T247462 (kevinbazira) Thanks @ACraze, I am going to implement it for all the other api modules too.
[08:39:36] MediaWiki-extensions-ORES, Scoring-platform-team, Edit-Review-Improvements-RC-Page, Growth-Team, MediaWiki-Recent-changes: Expose ORES topics in recent changes filters - https://phabricator.wikimedia.org/T245906 (dcausse) Using elastic sounds complicated here I think. I don't know enough about RCF...
[09:44:22] MediaWiki-extensions-ORES, Scoring-platform-team, Edit-Review-Improvements-RC-Page, Growth-Team, MediaWiki-Recent-changes: Expose ORES topics in recent changes filters - https://phabricator.wikimedia.org/T245906 (Pginer-WMF) >>! In T245906#6022206, @kostajh wrote: > @RHo (and @Pginer-WMF as t...
[11:21:40] (PS1) Esanders: build: Update eslint-config-wikimedia [extensions/ORES] - https://gerrit.wikimedia.org/r/585727
[11:25:49] (PS2) Esanders: build: Update eslint-config-wikimedia [extensions/ORES] - https://gerrit.wikimedia.org/r/585727
[13:35:02] hello halfak!
[13:35:20] hey chtnnh
[13:35:33] just wanted to talk about text complexity a bit
[13:35:37] I'll need to step away for a bit. But share your thoughts and I'll respond when I can.
[13:35:42] sure!
[13:39:01] when we implemented text complexity, we were expecting incremental improvements in model fitness because the correlation between reading ease and the quality of an article is quite intuitive. We were not able to achieve that. But I think we may have missed out on a few things. To begin with, we did get marginal improvements in B and GA classification. The correlation between reading ease and quality gets stronger as quality goes to either extreme.
[13:39:01] maybe instead of adding text_complexity to wp10 directly, we should pre-process it to better represent the correlation between complexity and quality
[13:50:33] Scoring-platform-team (Research), Structured-Data-Backlog, artificial-intelligence: Implement NSFW image classifier using Open NSFW - https://phabricator.wikimedia.org/T214201 (Chtnnh) @MusikAnimal Some follow-up questions after the preliminary work I have done on the task: 1. What is the maximum th...
[14:19:30] back!
[14:20:08] chtnnh, I'm not sure what you mean in your proposal
[14:20:28] pre-process it and represent the correlation directly? What would that give us?
[14:20:41] I'm unsure myself
[14:20:53] but what I am asking for is another shot at text complexity
[14:21:01] I think there is something we can achieve here
[14:25:03] Gotcha. We only tried Flesch. Are there other measures that have more promise?
[14:26:05] there are
[14:26:25] there's Flesch-Kincaid
[14:26:41] maybe we need to supply the scores to the model in a different way
[14:26:51] apply a function to them, maybe
[14:37:06] The gradient boosting models are pretty good at dealing with non-linearity.
[14:44:55] hmm. I wonder if we could measure variability across sections.
[14:45:23] e.g. if you split the article into paragraphs and measure the stddev.
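A minimal sketch of the paragraph-level variability idea above, assuming the textstat package for Flesch reading ease; the blank-line paragraph split and the 10-word minimum are stand-in heuristics, not how revscoring actually segments articles:

```python
import statistics

import textstat  # third-party readability metrics package


def readability_profile(article_text):
    """Flesch reading ease per paragraph, plus its spread across the article."""
    # Crude segmentation: split on blank lines and skip tiny fragments.
    paragraphs = [p for p in article_text.split("\n\n") if len(p.split()) >= 10]
    scores = [textstat.flesch_reading_ease(p) for p in paragraphs]
    if len(scores) < 2:
        return {"mean": scores[0] if scores else 0.0, "stddev": 0.0}
    return {"mean": statistics.mean(scores), "stddev": statistics.stdev(scores)}
```

The stddev could then be fed to the model as its own feature, alongside (or instead of) the article-level score.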
[14:52:24] yes, that could help account for variation in complexity across the article and, in turn, quality
[14:52:29] halfak ^
[14:54:16] Right. I'm not sure what level of readability is "good", but I would bet that variable readability is "bad"
[14:55:16] right
[14:55:34] actually the correlation between readability and quality depends on the article
[14:55:37] imo
[15:01:14] Scoring-platform-team, Discovery, drafttopic-modeling: Add drafttopic predictions to ElasticSearch index for the Draft namespace where available - https://phabricator.wikimedia.org/T249341 (Halfak)
[15:04:29] Scoring-platform-team, Discovery-Search, drafttopic-modeling: Add drafttopic predictions to ElasticSearch index for the Draft namespace where available - https://phabricator.wikimedia.org/T249341 (dcausse)
[15:15:23] Scoring-platform-team, Discovery-Search, drafttopic-modeling: Add drafttopic predictions to ElasticSearch index for the Draft namespace where available - https://phabricator.wikimedia.org/T249341 (dcausse) Pinging @EBernhardson. The mediawiki_revision_score schema does include the page namespace, my...
[16:24:27] https://etherpad.wikimedia.org/p/textComplexity
[16:24:56] hello guys, do check out the etherpad if you are working on anything NLP-related in ORES ^
[17:40:09] posting our async update notes --
[17:40:26] kevinbazira:
[17:40:31] Y:
[17:40:32] Reviewed Andy's patchsets
[17:40:34] - 585315 (Rename ProposalValidator -> EntityValidator)
[17:40:36] - 585346 (Rename ProposalTarget -> EntityTarget)
[17:40:38] Added MW bubble notification for jade-updateproposal
[17:40:40] T:
[17:40:42] Reviewed Andy's patchset
[17:40:44] - 585575 (Rename ProposalEntityType -> EntityType)
[17:40:46] Fixed Jade bubble notification reload bug
[17:40:48] Added MW message key for jade-updateendorsement
[17:40:50] - This is content for the bubble notification displayed when a user successfully updates an endorsement.
[17:40:52] halfak:
[17:40:54] Y: It was basically all meetings -- mostly about the upcoming tuning session. But I did work with chtnnh on some ptwiki articlequality features and some data pipeline stuff he'll need. I also did some outreach to WikiProject organizers re. topic models and I emailed a group of patroller tool developers about dev-user testing Jade.
[17:40:56] T: I'm working on filling in details for the tuning session deck. I'll be submitting a PR for the ptwiki data pipeline I have been working on for chtnnh and I'll be reviewing his work on features.
[17:40:58] and me:
[17:41:09] Y: Finished cleaning up naming conflicts around writing secondary schema data for Jade, also reviewed Kevin's mw notify patchset for update-proposal
[17:41:11] T: Working on fixing the test fixtures for the hooks related to the secondary integrations, also more code review for Kevin related to the mw notify
[18:09:41] So many meetings
[18:10:33] XD
[18:18:08] Scoring-platform-team, Discovery-Search, drafttopic-modeling: Add drafttopic predictions to ElasticSearch index for the Draft namespace where available - https://phabricator.wikimedia.org/T249341 (Halfak) See P10884 for the output of my thresholds script.
[18:26:02] grabbing lunch!
[18:55:45] Scoring-platform-team (Current), Wikilabels, articlequality-modeling, artificial-intelligence: Build article quality model for ptwikipedia - https://phabricator.wikimedia.org/T246663 (Chtnnh) @GoEThe Apart from the Infobox and Citation needed templates, can we assume the other templates are the s...
[20:02:08] Scoring-platform-team (Current), Wikilabels, articlequality-modeling, artificial-intelligence: Build article quality model for ptwikipedia - https://phabricator.wikimedia.org/T246663 (Halfak) I just took a pass and figured a bunch of things out by navigating around. I gave some notes to @chtnnh.
[20:28:15] Just started fetching text for ptwiki's articlequality model. I should be ready for our feature list on Monday :)
[21:04:01] Jade, Scoring-platform-team (Current): Implement secondary schemas for joining Jade data to other tables - https://phabricator.wikimedia.org/T229977 (ACraze) Dropping my notes here on the current state of the secondary integrations. We will need to do some refactoring due to our new jsonschema for J...
[21:16:16] wikimedia/revscoring#1881 (readme-badges - 02dab7f : Andy Craze): The build passed. https://travis-ci.org/wikimedia/revscoring/builds/670758167
[21:20:37] Nice, accraze. Just merged.
[21:20:45] :)
[21:20:56] The badges look great :)
[21:21:21] Unrelated, check this out: https://en.wikipedia.org/wiki/User_talk:SD0001#consider_this_a_barnstar
[21:21:45] I just got a ping here. Looks like our topic models are having a big impact on new page patrolling via a volunteer's bot using them to build worklists.
[21:24:04] very nice
[21:27:03] \o/
[21:34:14] halfak: Sorry I missed your message from earlier. Work has ramped up a bit for the past week or so, so I haven't made a lot more progress. This next week should be better.
[21:35:02] In terms of where I'm at, I have a kubeflow instance up and running and have started using the pipelines feature to pull in the different bits of data for training the topic model.
[21:35:07] no sweat clemons! I just wanted to make sure you had a chance to connect with accraze.
[21:35:20] Cool!
[21:36:27] Once you get something that works, I'd love to have you come to one of our meetings to present on what you have learned and how it came together.
[21:37:16] BTW, I was hoping to connect you and accraze because he's our senior engineer and he has a lot more experience than I do in high-performance ML. He's been interested in exploring KubeFlow too :)
[21:39:51] accraze: We should definitely get in touch on some of the details of kubeflow. For example, one concern I have is that kubeflow doesn't seem to natively support a way to store pipelines in VCS.
[21:40:49] halfak: One nice feature of KubeFlow you might be interested in is its kfserving component. It provides the ability to do canary rollouts and A/B testing with different models.
[21:41:10] Oooh. That could be painful. It's a useful observation. I imagine we'd want to follow a code review process for changes to the pipeline.
[21:41:56] Interesting! We have nothing like that now. We'd just release two models and do the A/B split when choosing which to consume from.
[21:42:56] clemons, one question I had was around memory sharing. E.g. we have some relatively large assets that we'd need to be part of the pipeline. E.g. we have word vectors that can be GBs of data. Does KubeFlow's structure allow for sharing memory between workers?
[21:43:17] Like, could we have several workers referencing the same keyed vectors without loading them into memory multiple times?
[21:45:11] Since KubeFlow rests on kubernetes, the underlying hardware is abstracted away and it becomes difficult to share resources in that way.
[21:45:27] Ideally we'd have a single service that provides the key vector mapping as an API
[21:46:35] That's interesting. Seems viable. That service could take a list of strings and map them to a list of vectors.
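A minimal sketch of the vector-lookup service floated here, assuming gensim KeyedVectors saved to disk and Flask; the file path and route name are hypothetical:

```python
from flask import Flask, jsonify, request
from gensim.models import KeyedVectors

app = Flask(__name__)

# mmap="r" maps the vectors read-only from disk, so multiple worker
# processes on the same host share one copy via the OS page cache
# instead of each loading GBs into private memory.
kv = KeyedVectors.load("word_vectors.kv", mmap="r")  # hypothetical path


@app.route("/vectors", methods=["POST"])
def vectors():
    """Map a JSON list of strings to their vectors (null when out-of-vocabulary)."""
    words = request.get_json()["words"]
    result = {}
    for word in words:
        try:
            result[word] = kv[word].tolist()
        except KeyError:
            result[word] = None
    return jsonify(result)
```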
[21:47:21] That'd be really handy for sentence parsers too. Those are also memory-intensive and that's a big reason why we avoided them.
[21:47:31] But you can do really cool things with sentence parsing.
[21:47:48] Like, we can map a probability that a particular sentence has a particular issue.
[21:51:02] In this case, issue refers to something like writing style?
[21:51:13] Right.
[21:51:37] Like, "spaminess" or "passive verb-ness"
[21:54:12] Gotcha, that would be interesting. I've also wondered about whether syntactic parsing would be useful in deriving word embeddings, since the context of a word doesn't necessarily have to be its immediate window.
[21:56:04] That's a good point. I mean, even a subject/object relationship would potentially carry a much clearer signal.
[21:58:18] I wonder if that's really an optimization we only need to think about for small datasets. With Wikipedia, we've got a heck of a lot of data.
[22:00:23] Right. It may require us to manually generate the context rules too, at which point we're back to rule-based learning.
[22:02:56] Right. For a lot of this stuff, I really want to remain as language-independent as we can.
[22:03:19] It's amazing how well using the raw text and building word embeddings has been working for us in that regard.
[22:05:41] Depending on droves of data is definitely dandy :)
[22:05:58] (I couldn't resist the opportunity for alliteration)
[22:06:34] ^_^
[22:07:14] By the way, I had a few questions on some basic Wikipedia stuff. What exactly is a "template" used for and why does the topic model use them to normalize the article tags?
[22:08:16] Aha. So templates are basically macros. You use them to transclude content from one page to another and you can give them arguments.
[22:08:43] So {{citation_needed}} adds a citation needed tag.
[22:08:44] [1] https://meta.wikimedia.org/wiki/Template:citation_needed
[22:08:50] Thanks AsimovBot
[22:09:18] Anyway, let's say you wanted to add a reference to the text of an article.
[22:09:30] You could straight up do <ref>my reference stuff</ref>
[22:10:03] Or you could build a template called "ref" and do something like this: {{ref|my reference stuff}} and have it render just like above.
[22:10:04] [2] https://meta.wikimedia.org/wiki/Template:ref
[22:10:28] So when we're looking for tags/images/etc., we also need to scan for common templates.
[22:11:29] A good alternative would be to actually parse the page, but that is complicated and slightly intractable -- or at least it was. It seems that there's a new-ish parsing API that could allow us to parse the page and then just count the images/links/references/etc. that way.
[22:12:10] It takes some flexibility away from us. Sometimes we want to scan for things that don't get rendered or aren't rendered in predictable ways.
[22:12:13] https://www.mediawiki.org/wiki/Parsoid
[22:12:15] But it's an option.
[22:13:02] It's nice *most of the time* to parse the page directly because it *means something different* to a Wikipedian whether you used a template or not.
[22:13:37] In other cases, it's a big pain in the butt because we need to match image inclusion 10 different ways and we still only have partial coverage.
[22:13:47] * halfak gets a bit ranty on Fridays.
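A sketch of the template-scanning approach described above, using the mwparserfromhell library to count template transclusions in raw wikitext; the set of name variants is illustrative and deliberately incomplete, which is exactly the partial-coverage problem halfak mentions:

```python
import mwparserfromhell

# Illustrative aliases for the "citation needed" template; a real scan
# would need to cover many more redirects and spellings.
CITATION_NEEDED = {"citation needed", "citation_needed", "cn", "fact"}


def count_citation_needed(wikitext):
    """Count citation-needed-style template transclusions in raw wikitext."""
    code = mwparserfromhell.parse(wikitext)
    return sum(
        1
        for template in code.filter_templates()
        if str(template.name).strip().lower() in CITATION_NEEDED
    )
```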
[22:14:10] Is this the "old way" you're referring to? https://github.com/wikimedia/drafttopic/blob/master/drafttopic/utilities/fetch_wikiprojects.py
[22:16:28] Oh. That's a... Ha. That's a bit of code we can happily get rid of. It is for turning a bunch of wiki pages into a tree structure.
[22:16:40] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Directory
[22:16:47] Whoops. Wrong one.
[22:17:01] This one: https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory
[22:17:21] We now maintain our own taxonomy here: https://github.com/halfak/wikitax/blob/master/taxonomies/wikiproject/halfak_20191202/taxonomy.yaml
[22:20:42] Ah, yes, that taxonomy file is the one I've been looking at. I thought it was generated from fetch_wikiprojects and was confused about how the code didn't match up.
[22:20:53] wikitax is definitely much simpler.
[22:21:13] wikitax is pretty new. I should do some cleanup of the old stuff ASAP.
[22:25:30] Another question about drafttopic: I noticed revscoring is used to vectorize and aggregate the words for an article. I haven't looked into the implementation of revscoring yet, but it seems to just be wrapping gensim and xgboost. Should I continue to use revscoring in the prototype, or would it be preferable to directly use gensim/xgboost?
[22:29:28] Hmm. Good question. I think the answer is: skip revscoring if it is slowing you down. Eventually, I'll want to talk to you about some of the things that revscoring does and how we might do those things in KubeFlow.
[22:30:07] The main advantage of revscoring's dependency injection system might disappear with KubeFlow. Or we might re-engineer it for KubeFlow.
[22:31:06] Scoring-platform-team (Current), drafttopic-modeling: Remove old drafttopic utilities and update utility docs. - https://phabricator.wikimedia.org/T249385 (Halfak)
[22:31:06] Scoring-platform-team (Current), drafttopic-modeling: Remove old drafttopic utilities and update utility docs. - https://phabricator.wikimedia.org/T249385 (Halfak) a: Halfak https://github.com/wikimedia/drafttopic/pull/48
[22:31:35] I've got to hit the road. Glad to catch up a bit, clemons! Have a good weekend.
[22:31:47] You too! Thanks!
[23:11:28] just catching up on this conversation, had wandered off to make food :)
[23:12:32] clemons: it was my understanding that you can store kubeflow pipelines in VCS using the pipelines sdk
[23:13:45] Yeah, I just found that. I was assuming I had to use the UI to create the pipeline altogether, but it seems it's only needed to upload a pipeline that was compiled with the sdk.
[23:14:31] And there is a REST API for pipelines, so we could implement a 1-click compile->submit script.
[23:15:26] yep, that was what I had been envisioning; something like that would dramatically improve our current workflow
[23:22:25] In terms of architecture, I'm thinking that each data source should be containerized in a separate component of the training pipeline. For example, there should be a component that produces a tarball of raw wiki text. We can provide configurable options to this component such as "use_cache" or "fetch_anew". The goal is that each component can be as simple or complex as needed, but downstream components (like the vectorizer or trainer) are completely isolated.
[23:40:25] yeah absolutely clemons, I really like the idea of having a number of isolated components to work with
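A sketch of the 1-click compile->submit idea combined with the isolated-component layout, assuming the kfp v1 SDK; the container images, host, and output paths are placeholders:

```python
import kfp
from kfp import dsl


@dsl.pipeline(name="topic-model-training",
              description="Fetch wikitext, then train the topic model.")
def topic_pipeline(use_cache: str = "true"):
    # Each data source is its own containerized component.
    fetch = dsl.ContainerOp(
        name="fetch-wikitext",
        image="example.org/fetch-wikitext:latest",  # placeholder image
        arguments=["--use-cache", use_cache],
        # The component drops the corpus tarball on shared storage and
        # writes its URI to this file; only the URI flows downstream.
        file_outputs={"corpus_uri": "/out/corpus_uri.txt"},
    )
    # The trainer only sees the declared output, never the fetch internals.
    dsl.ContainerOp(
        name="train-topic-model",
        image="example.org/train-topic-model:latest",  # placeholder image
        arguments=["--corpus-uri", fetch.outputs["corpus_uri"]],
    )


if __name__ == "__main__":
    # Compile the pipeline to a versionable artifact, then submit it
    # through the pipelines REST API in one step.
    kfp.compiler.Compiler().compile(topic_pipeline, "topic_pipeline.yaml")
    client = kfp.Client(host="http://ml-pipeline.example.org")  # placeholder host
    client.create_run_from_pipeline_package("topic_pipeline.yaml",
                                            arguments={"use_cache": "true"})
```

The compiled topic_pipeline.yaml (and the Python that generates it) is what would live in VCS, so pipeline changes can go through normal code review.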