[09:40:04] MediaWiki-extensions-ORES, Scoring-platform-team, Discovery-Search, Growth-Team, NewcomerTasks: Expose ORES drafttopic data in ElasticSearch via a custom CirrusSearch keyword - https://phabricator.wikimedia.org/T240559 (dcausse) I suggest a keyword slightly less ambiguous such as `hastopi...
[14:31:21] o/
[14:31:43] Happy new year!
[14:45:04] o/
[14:45:16] Happy New Year to you too halfak :)
[15:13:42] Bad news. We're getting far worse fitness from the new embeddings. I'm not sure why. I'm digging into it.
[15:15:51] Even the 200-cell embeddings are underperforming. They look good for lots of other reasons though.
[15:16:30] Also, isaacj has shown more success with a fasttext-based classifier. I want to look into the performance characteristics of that modeling strategy. But if it's good, we can move in that direction.
[15:19:46] Scoring-platform-team, articlequality-modeling, editquality-modeling, revscoring, and 2 others: Add English Language idioms to revscoring - https://phabricator.wikimedia.org/T205545 (Halfak) I would use `mwapi` and get what I needed from the API directly. ` import mwapi session = mwapi.Sessio...
[15:20:48] halfak: i was looking around a bit at how you calculate your model statistics and I think some of the poor performance might just come from how imbalanced the classes are. with mathematics, given that it's such a small proportion of articles, you need near-perfect performance to achieve a pr-auc that looks ok, because true positives and false negatives are scaled down to almost nothing while false positives / true negatives remain unchanged.
[15:20:48] this makes your precision look awful even when it's pretty okay.
[15:21:27] Oh! Good point. I forgot that we rescale our stats based on population rates.
[15:22:11] i also played around with gradient boosting vs. fasttext and fasttext was wayyyyy faster to train for me. i wasn't using the revscoring code though, so it might be that something was off. but the fasttext models train in a few minutes tops.
[15:22:45] Nice. What are predictions like?
[15:22:59] Performance-wise
[15:23:22] And any chance you know how much memory the process requires to load the model and make a prediction?
[15:23:25] quite fast too, though i didn't try to compare against gradient boosting. it's just averaging word embeddings with a single fully-connected layer on top, and it's all written in optimized python.
[15:26:00] looks like 72MB for the model, and my understanding is that includes the word vectors (this was the 50-dimension skipgram)
[15:35:01] Holy moley. That sounds pretty good. I would like to cram that into our "ScoringModel" concept and see what we get.
[15:35:05] Where can I find your code?
[15:35:18] (Sorry, a bit distracted catching up on email)
[15:44:31] preprocessing: stat1007:/home/isaacj/fastText/drafttopic/drafttopic_article_fasttext_preprocess.py
[15:44:44] model training/testing: stat1007:/home/isaacj/fastText/drafttopic/drafttopic_article_fasttext_model.py
[15:54:35] "WORDNGRAMS = 1"?
[15:59:03] only deal with unigrams -- no word vectors for bigrams, trigrams, etc. (documentation, though not as great as i'd like, is here: https://fasttext.cc/docs/en/supervised-tutorial.html)
[15:59:40] this is where i can't fully follow what they're doing though, because i'm not sure to what degree they pay attention to some of those hyperparameters when i provide pretrained word vectors (as i did)
[16:00:01] Gotcha.
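(Editor's note: a minimal sketch of the supervised fasttext setup discussed above, using the official `fasttext` Python bindings. The file names `train.txt` and `pretrained.vec` and the example label are hypothetical stand-ins; isaacj's actual code lives in the stat1007 paths pasted at 15:44.)

```python
import fasttext

# fasttext's supervised input format is one document per line, with the
# label(s) prefixed, e.g.: "__label__STEM.Mathematics article plain text ..."
model = fasttext.train_supervised(
    input="train.txt",                   # hypothetical training file
    dim=50,                              # must match the pretrained vector dimension
    wordNgrams=1,                        # unigrams only, per "WORDNGRAMS = 1" above
    pretrainedVectors="pretrained.vec",  # e.g. the 50-dimension skipgram vectors
    epoch=25,
    lr=0.1,
)

# Top-3 topic labels (with probabilities) for a new article's text.
labels, probs = model.predict("article plain text goes here", k=3)

# The saved model bundles the word vectors, which is consistent with the
# ~72MB figure mentioned at 15:26.
model.save_model("drafttopic_fasttext.bin")
```

Note that fasttext continues updating the pretrained input vectors during supervised training, which matches the fine-tuning effect described at 16:00:14.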
It looks like I might need to work with this a bit to get it into our tuning system, but nothing looks crazy to me.
[16:00:14] there definitely is some fine-tuning of the word vectors that the model does during training too, and that seems to be adding to the model performance as well
[16:01:21] yeah, and i'm not sure if kevin's word vectors include subwords, but they suggest that learning subword word vectors (3-6 characters) can be very helpful for model performance too
[16:01:40] might be more valuable for the draft article model than the full article model
[16:02:02] Might be valuable for vandalism detection too, as misspellings are common there.
[17:02:50] accraze, standup!
[17:25:14] kevinbazira, test!
[17:41:01] oh man, my jade ui branch has a TON of merge conflicts w/ the master branch
[17:41:32] Even after rebase? Can git not solve them?
[17:42:14] yeah, a bunch of these need manual cleanup
[17:42:20] damn
[17:42:33] yeah... anyone know of any tools that might make this a little less gruesome?
[17:44:22] My editor makes it easier to select "theirs" vs. "mine" on a per-block basis.
[17:53:34] nice, just found a vim plugin for git merge conflict resolution
[17:53:42] https://github.com/christoomey/vim-conflicted
[17:53:46] seems to be working so far
[17:57:57] oooh!
[22:10:10] Just finished up modifications for the new XML format.
[22:10:25] I have some new results for topic modeling. They are better but not amazing.
[22:11:52] I managed to get a substantial boost in fitness by increasing our model size to 1.9GB :|
[22:11:55] ha!
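(Editor's note: for reference, a sketch of the subword-vector training isaacj suggested at 16:01:21 — learning character n-grams of length 3-6 alongside the word vectors. This is the generic `fasttext` API with a hypothetical corpus file, not the code from stat1007.)

```python
import fasttext

# Train 50-dimension skipgram vectors that also learn subword (character
# n-gram) vectors of length 3-6, as suggested at 16:01:21.
vec_model = fasttext.train_unsupervised(
    input="corpus.txt",  # hypothetical plaintext corpus, one document per line
    model="skipgram",
    dim=50,
    minn=3,              # shortest character n-gram to learn
    maxn=6,              # longest character n-gram to learn
)

# Because vectors are composed from subwords, out-of-vocabulary tokens --
# including misspellings -- still get a usable embedding, which is why this
# could help vandalism detection (16:02:02).
vec = vec_model.get_word_vector("misspeling")
```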