[09:40:04] <wikibugs>	 10MediaWiki-extensions-ORES, 10Scoring-platform-team, 10Discovery-Search, 10Growth-Team, 10NewcomerTasks 1.1: Expose ORES drafttopic data in ElasticSearch via a custom CirrusSearch keyword - https://phabricator.wikimedia.org/T240559 (10dcausse) I suggest a keyword slightly less ambiguous such as `hastopi...
[14:31:21] <halfak>	 o/
[14:31:43] <halfak>	 Happy new year!
[14:45:04] <kevinbazira>	 o/
[14:45:16] <kevinbazira>	 Happy New Year to you to halfak :)
[15:13:42] <halfak>	 Bad news.  We're getting far worse fitness from the new embeddings.  I'm not sure why.  I'm digging into it. 
[15:15:51] <halfak>	 Even the 200 cell embeddings are underperforming.  They look good for lots of other reasons though. 
[15:16:30] <halfak>	 Also, isaacj has shown more success with a fasttext-based classifier.  I want to look into the performance characteristics of that modeling strategy.  But if it's good, we can move that direction. 
[15:19:46] <wikibugs>	 10Scoring-platform-team, 10articlequality-modeling, 10editquality-modeling, 10revscoring, and 2 others: Add English Language idioms to revscoring - https://phabricator.wikimedia.org/T205545 (10Halfak) I would use `mwapi` and get what I needed from the API directly.    ` import mwapi  session = mwapi.Sessio...
[15:20:48] <isaacj>	 halfak: i was looking around a bit at how you calculate your model statistics and I think some of the poor performance might just come from how imbalanced the classes are. with mathematics, given that it's such a small proportion of articles, you need near perfect performance to achieve a pr-auc that looks ok because true positives and false negatives are scaled to almost nothing while false positives / true negatives remain unchanged. 
[15:20:48] <isaacj>	 this makes your precision look awful even when it's pretty okay. 
[15:21:27] <halfak>	 Oh!  Good point.  I forgot that we rescale our stats based on population rates. 
[15:22:11] <isaacj>	 i also played around with gradient boosting vs. fasttext though and fasttext was wayyyyy faster to train for me. i wasn't using the revscoring code though so it might be that something was off. but the fasttext models train in a few minutes tops
[15:22:45] <halfak>	 Nice.  What are predictions like?  
[15:22:59] <halfak>	 Performance-wise
[15:23:22] <halfak>	 And any chance you know how much memory the process requires to load the model and make a prediction. 
[15:23:25] <isaacj>	 quite fast too though i didn't try to compare against gradient-boosting. because it's just averaging word embeddings and then a single fully-connected layer overtop that. and it's all written in the optimized python
[15:26:00] <isaacj>	 looks like 72MB for the model and my understanding is that includes the word vectors (this was the 50 dimension skipgram)
[15:35:01] <halfak>	 Holy moley.  That sounds pretty good.  I would like to cram that into our "SoringModel" concept and see what we get.  
[15:35:05] <halfak>	 Where can I find your code? 
[15:35:18] <halfak>	 (Sorry a bit distracted catching up on email)
[15:44:31] <isaacj>	 preprocessing: stat1007:/home/isaacj/fastText/drafttopic/drafttopic_article_fasttext_preprocess.py
[15:44:44] <isaacj>	 model training/testing: stat1007:/home/isaacj/fastText/drafttopic/drafttopic_article_fasttext_model.py
[15:54:35] <halfak>	 "WORDNGRAMS = 1"? 
[15:59:03] <isaacj>	 only deal with unigrams -- no word vectors for bigrams, trigrams etc. (documentation, though not as great as i'd like, is here: https://fasttext.cc/docs/en/supervised-tutorial.html)
[15:59:40] <isaacj>	 this is where i can't fully follow what they're doing though because i'm not sure to what degree they pay attention to some of those hyperparameters when i provide pretrained word vectors (as i did)
[16:00:01] <halfak>	 Gotcha.  It looks like I might need to work with this a bit to get it into our tuning system, but nothing looks crazy to me. 
[16:00:14] <isaacj>	 there definitely is some fine-tuning of the word vectors too that the model does during training and that seems to be adding to the model performance as well
[16:01:21] <isaacj>	 yeah, and i'm not sure if kevin's word vectors include subwords but they suggest that learning subword word vectors (3-6 characters) can be very helpful for model performance too
[16:01:40] <isaacj>	 might be more valuable for the draft article model than full article model
[16:02:02] <halfak>	 Might be valuable for vandalism detection too as there are commonly mispellings. 
[17:02:50] <halfak>	 accraze, standup!
[17:25:14] <halfak>	 kevinbazira, test!  
[17:41:01] <accraze>	 oh man, my jade ui branch has a TON of merge conflicts w/ the master branch
[17:41:32] <halfak>	 Even after rebase?  Can git not solve them?
[17:42:14] <accraze>	 yeah a bunch of these need manual clean up
[17:42:20] <halfak>	 damn
[17:42:33] <accraze>	 yeah... anyone know of any tools that might make this a little less gruesome?
[17:44:22] <halfak>	 My editor make it easier to select "theirs" vs. "mine" on a per-block basis. 
[17:53:34] <accraze>	 nice, just found a vim plugin for git merge conflict resolution
[17:53:42] <accraze>	 https://github.com/christoomey/vim-conflicted
[17:53:46] <accraze>	 seems to be working so far
[17:57:57] <halfak>	 oooh!
[22:10:10] <halfak>	 Just finished up modifications for the new XML format. 
[22:10:25] <halfak>	 I have some new results for topic modeling.  They are better but not amazing. 
[22:11:52] <halfak>	 I managed to get a substantial boost in fitness by increasing our model size to 1.9GB :| 
[22:11:55] <halfak>	 ha!