[00:03:29] Jade, Beta-Cluster-Infrastructure, MediaWiki-ContentHandler, Patch-For-Review, User-DannyS712: Beta cluster: The content model 'JadeJudgment' is not registered - https://phabricator.wikimedia.org/T247476 (ACraze) >>! In T247476#5962359, @DannyS712 wrote: > So all existing pages should just be...
[00:05:13] Jade, Beta-Cluster-Infrastructure, MediaWiki-ContentHandler, Patch-For-Review, User-DannyS712: Beta cluster: The content model 'JadeJudgment' is not registered - https://phabricator.wikimedia.org/T247476 (Reedy) Looks to be, at least on beta enwiki `lang=sql MariaDB [enwiki]> select * from p...
[00:07:20] Jade, Beta-Cluster-Infrastructure, MediaWiki-ContentHandler, Patch-For-Review, User-DannyS712: Beta cluster: The content model 'JadeJudgment' is not registered - https://phabricator.wikimedia.org/T247476 (DannyS712) https://en.wikipedia.beta.wmflabs.org/wiki/Jade:Diff/376901 is fine, but tryi...
[02:09:59] Jade, Scoring-platform-team (Current), MW-1.35-notes (1.35.0-wmf.23; 2020-03-10), Patch-For-Review: Handle empty Jade pages - https://phabricator.wikimedia.org/T246033 (ACraze) Ok, so I finally got this working using the [[ https://www.mediawiki.org/wiki/Manual:Hooks/ShowMissingArticle | ShowMiss...
[13:01:25] o/
[14:45:12] Async Update 2020.03.13
[14:45:16] T: Reviewed Andy's patchset - 579116 (Show empty facet on non-existing entity pages) - Gave Andy more usability feedback on the empty Jade page based on the latest changes he has made. Aligned diffWidget Visual editor and Wikitext buttons to the right - These buttons had been stacked vertically and this change stacks them horizontally to match wireframes.
[14:45:21] No blockers for now. Have a great day everyone 😊
[14:46:37] Nice! thanks kevinbazira!
[14:47:27] 😊 halfak 👋
[15:20:43] Scoring-platform-team, drafttopic-modeling: Compress Gensim models with term hashing - https://phabricator.wikimedia.org/T247523 (Halfak)
[16:40:59] halfak: Hey!
I got your email. I actually joined the channel yesterday.
[16:41:34] Hey clemons! I'm in a meeting now, but I'll be done in an hour. Will you be around then?
[16:42:27] Yup, that should be fine.
[17:27:03] o/ clemons!
[17:27:09] Finally got to come up for air.
[17:27:14] Ha!
[17:27:20] Ready to talk projects?
[17:31:02] Hey! Sure
[17:36:27] halfak
[17:37:01] It sounds like you have some substantial experience with data engineering. Tell me more about that. I'd like to learn your interests to help you find a fun project :)
[17:41:15] clemons, ^
[17:41:43] I can't provide a lot of details on the most recent project I'm working on, but in a nutshell I've been working on a pipeline to process features extracted from millions of binary procedures.
[17:42:21] The goal of the project is to use the extracted features to determine if two binary procedures are the same, even if they've been compiled with different compilers, versions, or optimization flags.
[17:43:44] Interesting. I imagine you're using a lot of information-theoretic measures as features. Is that right?
[17:43:47] The pipeline groups the procedures into related procedures and unrelated procedures, and uses several ML techniques to determine the best subset of features that separate the related and unrelated features.
[17:44:04] *related and unrelated procedures
[17:45:14] That's right
[17:46:12] Cool. The project that we have that sounds most related is topic modeling. Essentially we are using word embeddings to build a fingerprint for articles in Wikipedia. Then using the fingerprint (a vector) as features, we build a multi-label classifier for a set of known topic labels.
[17:46:55] But we're looking for a more label-free strategy. E.g. I want to be able to give a few example articles and a few counter-example articles and have that represent a query to vector space.
[17:47:09] Essentially I'd be looking for articles that are close to the examples but far from the counter-examples.
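The example/counter-example query described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the team's implementation: the scoring rule (cosine similarity to the mean example vector minus cosine similarity to the mean counter-example vector), the function names, and the toy 2-d article vectors are all hypothetical. It also uses an exact linear scan; an approximate nearest-neighbor index would replace that scan at Wikipedia scale.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def rank_by_examples(article_vectors, examples, counter_examples, top_k=3):
    """Rank articles that are close to the examples and far from the counter-examples.

    Hypothetical scoring rule: similarity to the example centroid minus
    similarity to the counter-example centroid. The query articles themselves
    are excluded from the results.
    """
    pos = mean_vector([article_vectors[t] for t in examples])
    neg = mean_vector([article_vectors[t] for t in counter_examples])
    seen = set(examples) | set(counter_examples)
    scored = [
        (cosine(vec, pos) - cosine(vec, neg), title)
        for title, vec in article_vectors.items()
        if title not in seen
    ]
    scored.sort(reverse=True)  # highest score first
    return [title for _, title in scored[:top_k]]


# Toy vectors invented for illustration; real embeddings have hundreds of dimensions.
ARTICLES = {
    "Physics": [1.0, 0.1],
    "Chemistry": [0.9, 0.2],
    "Astronomy": [0.8, 0.3],
    "Painting": [0.1, 1.0],
    "Sculpture": [0.2, 0.9],
}

ranked = rank_by_examples(ARTICLES, ["Physics"], ["Painting"])
```

With these toy vectors, the science-like articles rank ahead of "Sculpture", which sits close to the counter-example.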
[17:47:21] We've been looking at Approximate NN a bit.
[17:48:29] See http://nokomis.macalester.edu/wmf_en/static/iui2017.html#cluster/4/0.04/0.00 for one of the things we can do with these vector embeddings. This is a semantic map of Wikipedia topic space.
[17:49:06] It's built using embeddings trained on navigation patterns. We're currently working on embeddings trained on text patterns. I'm curious about combining the two.
[17:51:46] From an infra perspective, we're also looking at different ML frameworks to invest in. If you have experience with KubeFlow or something like that, we'd love to talk.
[17:54:13] That semantic map looks amazing.
[17:55:14] The word embedding approach sounds similar to another project I've been involved in: creating vector embeddings for binary functions.
[17:55:59] The principle is the same w.r.t. computing embeddings for the individual terms (words or instructions) in order to produce a single embedding for the grouping abstraction (document or function)
[17:57:59] How do you do the group embedding?
[17:58:16] We've been getting pretty good performance by generating the avg embedding.
[18:00:56] It's based on a deep siamese network that uses a self-attention mechanism.
[18:01:15] Just a moment, I'll find the paper.
[18:01:46] Regretfully, I've got to run for lunch now or I'll never get a chance. I'll be back in an hour or so.
[18:02:08] No problem!
[18:04:15] Oh! before I leave, here's my async standup!
[18:04:36] Y: Ran a quick few rounds of user-testing with Jade and filed a bunch of tasks -- some of which I'll need to do some design work for. I also moved chtnnh and haksoat forward on their respective work -- mostly looking at performance issues. I met with Erika to strategize about re-engineering ORES' service layer. I have a proposal for something to have Andy try when we wrap up the next couple stages of Jade.
[18:04:37] T: I'm working on some Jade communications and supporting chtnnh.
I did a media training session with the comms team. I have a few proposed projects that I'd like to spec out for potential interns/volunteers. I'll have an overview at staff. Otherwise, I made some progress on Jade diff stuff that I'll hopefully wrap up today.
[18:14:00] This is the paper that focuses on binary similarity: https://arxiv.org/abs/1811.05296. It uses this self-attentive sentence embedding approach: https://arxiv.org/pdf/1703.03130.pdf
[18:28:57] KubeFlow looks very useful, but I haven't used the entire framework before. I have experience with some of the underlying components/tools: Jupyter, scikit-learn, TensorFlow, containers (Docker).
[18:56:27] oh derp, forgot to post my async update too
[18:56:56] Y: Finally got non-existing pages in the Jade namespace rendering an empty facet w/ Propose New Label Button. Also looked into a bug report related to old Jade pages on beta
[18:58:03] T: Currently fixing an uncaught exception related to the diff widget for jade entities, also reviewing kevin's patchset, writing jade docs and maybe taking a look at the secondary integration stuff for jade
[19:11:00] back!
[19:13:22] Scoring-platform-team, NewcomerTasks 1.1, Discovery-Search (Current work), Growth-Team (Current Sprint): Once the ORES articletopic - ElasticSearch pipeline is set up, update data about all articles - https://phabricator.wikimedia.org/T243357 (EBernhardson)
[19:21:18] Scoring-platform-team, NewcomerTasks 1.1, Discovery-Search (Current work), Growth-Team (Current Sprint): Once the ORES articletopic - ElasticSearch pipeline is set up, update data about all articles - https://phabricator.wikimedia.org/T243357 (EBernhardson) a:EBernhardson
[19:21:27] halfak: Welcome back! So, the topic modeling project does sound pretty interesting. Is it entirely separate from the projects listed on the scoring team homepage?
[19:22:21] Not exactly. We use the revscoring framework and ORES to host the classifiers we have in production now.
But it seems like an investment into Approx NN or something like that would require totally different infrastructure.
[19:22:29] E.g. faiss or something like that.
[19:23:36] There's also the pipeline that generates re-usable embeddings. That's outside of ORES/revscoring/wikilabels now. I'm really interested in talking about a more general framework for building these kinds of pipelines if you have experience with that.
[19:23:59] Right now, we're using fasttext to generate the embeddings and gensim to manage their use in production.
[19:34:15] Yeah, don't have much experience with all-in-one frameworks that manage the whole ML life-cycle. I've mainly been using DAG-management frameworks that run locally, and then leveraging Ansible+Docker Swarm to deploy distributed computations.
[19:35:21] Something like Kubeflow would be ideal, but it also seems like a heavy investment.
[19:37:14] Are the classifiers in production hosted in containers? Is there a CI/CD process for deploying the classifiers?
[19:41:22] halfak: How about this for my initial task? I could try prototyping an integration of the topic model training process into Kubeflow and exposing it as an endpoint. It'd give me a chance to work with the topic model and evaluate the Kubeflow experience.
[19:41:56] (in a meeting now. will be done in 20 mins)
[19:53:04] @clemons can you elaborate a little on the model you were using for the binary similarity
[19:53:04] Error: Command “clemons” not recognized. Please review and correct what you’ve written.
[19:54:59] clemons can you elaborate a little on the model you were using for the binary similarity
[20:02:22] clemons, wow. That would be fantastic!
[20:03:41] I have a few assets but I wonder if you would like to re-consider how they are generated. I have a dataset of chunks of text with labels. That might be a good place to start. Alternatively, you could start with our vectors and a labeled dataset.
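For the "vectors and a labeled dataset" starting point, the averaged embedding mentioned earlier in the conversation can be sketched as below. This is a minimal illustration, not the production pipeline: the function name, the toy word vectors, and the choice to skip out-of-vocabulary tokens are assumptions, and no fasttext or gensim API is used or implied.

```python
def average_embedding(tokens, word_vectors, dim):
    """Average the vectors of known tokens into one document fingerprint.

    Tokens missing from word_vectors are skipped (an assumption; other
    strategies exist). Returns a zero vector if no token is known.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


# Toy 2-d word vectors invented for illustration.
WORD_VECTORS = {
    "star": [1.0, 0.0],
    "planet": [0.8, 0.2],
    "paint": [0.0, 1.0],
}

# "unknownword" has no vector, so only "star" and "planet" are averaged.
fingerprint = average_embedding("star planet unknownword".split(), WORD_VECTORS, dim=2)
```

The resulting fingerprint is what would be paired with topic labels to train a multi-label classifier.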
[20:03:56] Let me make a task in our tracker and we can work from there.
[20:06:00] Yeah, it might be nice to start with the dataset of text chunks and labels.
[20:06:23] OK. I'll get you some notes and links on our datasets server.
[20:09:12] BRB
[20:13:02] Jade, Scoring-platform-team: Move "alternatives V" to the top of the list of alternative labels. - https://phabricator.wikimedia.org/T247461 (ACraze) Ah, I see what happened here. Right now we have a single `ProposalListWidget` that contains a number of labels. The first is the "best" or "preferred" labe...
[20:13:07] Getting details together in the task and then I'll have a link for you.
[20:15:13] Jade, Scoring-platform-team: Move "alternatives V" to the top of the list of alternative labels. - https://phabricator.wikimedia.org/T247461 (Halfak) Gotcha. This is how I originally designed it so it makes sense for how we got here. This showed up in user-testing as confusing. It seems like it could...
[20:20:28] accraze, thanks for your notes on that task.
[20:21:01] It pains me to say, "Hey. This design didn't work in practice. Let's change it." So please push back on stuff that's annoying and hard and I can try to get clever. :)
[20:24:24] haha ok will do halfak
[20:38:34] accraze, it's amazing to have this all running on beta. So easy to just have someone take a look at it and to lead them through a test of the UI. :) ITS HAPPENING :D
[20:41:24] awesome!
[20:49:41] ORES, Scoring-platform-team, drafttopic-modeling: Experiment with Topic modeling in KubeFlow - https://phabricator.wikimedia.org/T247564 (Halfak)
[20:49:49] clemons, https://phabricator.wikimedia.org/T247564
[20:50:13] I put a bunch of detail in that task. I hope it is helpful. It took me a while to look up links to all the input files.
[20:51:14] I focused on getting links to the base input data but I can help with anything above that. E.g. I think "text chunks with labels" might be handy.
Let me know and I can get more datasets online.
[20:52:01] Also, I should say, accraze is our resident senior engineer. I'm just a researcher who plays an engineer on TV :)
[21:07:04] I'm heading out. Back on at 1300 UTC tomorrow. Have a good one, folks!
[21:09:41] ORES, Scoring-platform-team, drafttopic-modeling: Experiment with Topic modeling in KubeFlow - https://phabricator.wikimedia.org/T247564 (ClaytonLLemons) a:ClaytonLLemons
[21:10:38] Thanks halfak: this task looks like a great start. I'll start digging into it some tomorrow.
[21:10:48] \o/
[21:10:57] excellent
[23:47:43] ORES, Scoring-platform-team: [Discuss] Future ORES architecture - https://phabricator.wikimedia.org/T226193 (ClaytonLLemons) I'll be looking into T247564 to try out the Kubeflow, but here's a great [[ https://medium.com/weareservian/the-cheesy-analogy-of-mlflow-and-kubeflow-715a45580fbe | article ]] desc...
[23:47:44] [1] https://meta.wikimedia.org/wiki/https://medium.com/weareservian/the%2Dcheesy%2Danalogy%2Dof%2Dmlflow%2Dand%2Dkubeflow%2D715a45580fbe