[00:55:36] halfak: I'm playing with new features in the edit intention repo, but every time I define new features I have to do a new extract, most of which is API calls. I know that to save time I can extract and save the diff text once, then run offline feature extraction on the saved text. I'm not sure which datasource to use for that
[01:01:36] Scoring-platform-team: Configure ORES to publish new drafttopic scores to Kafka - https://phabricator.wikimedia.org/T240549 (Tgr) Thanks both! Where is the list of scores defined? I don't see it either in the [[https://github.com/wikimedia/mediawiki-extensions-EventBus/blob/0f22159a1f27ddf7c4da5806c868ce04b9...
[01:05:16] Scoring-platform-team, Discovery-Search: Consume ORES drafttopic data from Kafka and store it in HDFS - https://phabricator.wikimedia.org/T240553 (Tgr) The relevant conversation happened in {T240549}; I read it as saying that this task is actually a no-op, since the mechanism (which uses EventGate with t...
[01:24:59] Scoring-platform-team: Configure ORES to publish new drafttopic scores to Kafka - https://phabricator.wikimedia.org/T240549 (Pchelolo) yeah. AFAIK it's set up in the ORES config. @Ladsgroup knows much more about this.
[08:31:08] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Gehel) >>! In T240517#5734191, @Tgr wrote: > One thing we haven't really discussed is how the fake non-English drafttopic will work. Would that be done within...
[08:58:56] Scoring-platform-team, drafttopic-modeling, revscoring, artificial-intelligence: Build WikiProject directory topic models for ar, cs, and kowiki - https://phabricator.wikimedia.org/T235181 (Gehel) Given the various discussions in various channels, it is unclear to me if this is about training new...
[09:05:12] Scoring-platform-team: Configure ORES to publish new drafttopic scores to Kafka - https://phabricator.wikimedia.org/T240549 (Gehel) While the [[ https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/revision/score/2.0.0.yaml | current schema ]] seems flexible enough to support...
[14:00:59] Hey kevinbazira!
[14:01:26] Hi halfak o/
[14:01:56] How's hacking going?
[14:03:08] It's going well. I was able to go through the fastText unsupervised learning examples, get the Wikipedia data, and train word vectors.
[14:10:19] Nice! Did you get the wikifil.pl script working and everything?
[14:10:59] Did the analogies work as expected?
[14:11:29] Sorry for the delay. I looked at an email and spiraled into that because I haven't had my coffee yet :|
[14:11:47] hihi that's fine :)
[14:12:12] Yes, the wikifil.pl script worked fine
[14:13:30] Nice. And the analogies?
[14:16:27] I think we want to try making a 50-cell vector for English Wikipedia first. What do you think about giving that a try?
[14:16:53] By analogies do you mean "expected output based on the examples"? If so, yes, they worked as expected, with a few discrepancies in the output. I think that's because the word vectors in the models I created on the stat server aren't exactly the same as those in the examples.
[14:19:45] Right. That makes sense.
[14:24:51] So, about doing a full training run...
[14:30:07] kevinbazira, ^
[14:31:13] I'd be down for getting on a call if that'd be easier.
[14:31:31] Yep, that's fine with me too.
[14:31:45] OK great. Call when ready.
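[Editor's note] The analogy check discussed above is nearest-neighbour search over simple vector arithmetic: vec(king) - vec(man) + vec(woman) should land closest to vec(queen). In the fastText Python module this is exposed as `model.get_analogies(...)`, and the 50-cell model mentioned would come from `fasttext.train_unsupervised(corpus, dim=50)`. Below is a minimal stdlib sketch of the arithmetic itself; the words and 3-dimensional vectors are made up for illustration, not output from a trained model.

```python
import math

# Toy word vectors, invented for this example. Real vectors would come
# from a trained fastText model (e.g. dim=50 for English Wikipedia).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.9, 0.1, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.1, 0.8, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word whose vector is nearest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # -> queen
```

The small discrepancies kevinbazira saw are expected: training is stochastic, so two runs produce different vectors and the nearest neighbour of a target point can shift for borderline analogies.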
[14:34:29] kevinbazira, /mnt/data/xmldatadumps/public/enwiki/20191120
[14:54:16] kevinbazira, https://gist.github.com/halfak/e57fbf53aae29a95c7282c6fd7bb701b
[14:55:34] kevinbazira, https://fasttext.cc/docs/en/unsupervised-tutorial.html#advanced-readers-playing-with-the-parameters
[14:58:04] https://github.com/facebookresearch/fastText/blob/master/wikifil.pl
[15:00:15] https://stackoverflow.com/questions/14922272/perl-while-file-handling
[15:01:48] bzcat enwiki-latest....xml.bz2 | perl wikifil.pl > text_output.txt
[15:04:14] bzcat enwiki-latest....xml.bz2 | perl wikifil.pl | bzip2 -c > text_output.txt.bz2
[15:05:51] $ wget https://github.com/facebookresearch/fastText/archive/v0.9.1.zip
[15:05:53] $ unzip v0.9.1.zip
[15:06:59] $ cd fastText-0.9.1
[15:07:17] $ make
[15:09:23] https://fasttext.cc/docs/en/python-module.html#train_unsupervised-parameters
[17:02:38] halfak: I'm looking at https://en.wikipedia.org/wiki/?diff=710668715&diffmode=source and found that wikitext.revision.diff.tokens_added gives all the tokens in the added lines, while I was expecting only the highlighted words to appear. What am I missing here?
[18:03:13] o/ Sorry, was in meetings.
[18:03:50] Checking
[18:09:33] codezee_, indeed this looks strange.
[18:11:48] I wonder if the diff isn't processing the lines starting with "|" correctly.
[18:20:15] Oh man. I see what is going on here. This is a hard case.
[18:20:36] This is a weird diff where someone moves a bunch of content around *and* makes minor changes to the content.
[18:21:42] halfak: so if it's a move and change, it'll treat them as additions and deletions, right?
[18:21:49] Right.
[18:22:05] I'm not quite sure why we aren't processing this better, though.
[18:22:34] Generally we do better than MW's diff ^_^
[18:22:39] Here, we are not
[18:23:14] and I'm thinking these examples may not be that rare, because moves may happen now and then as part of revisions
[18:25:23] Moves are usually well handled.
[18:25:28] Moves with minor changes are harder.
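[Editor's note] The move-plus-minor-change case diagnosed above can be reproduced with any plain line-based diff. A small sketch using Python's difflib (not revscoring's deltas library; the lines are invented for illustration): a table row is moved below "footer" and one word in it is edited, so the diff cannot pair the two versions of the row and reports the whole row as deleted and re-inserted. That is why every token on such a line ends up in tokens_added, not just the word that changed.

```python
import difflib

old = ["heading", "| foo | bar |", "footer"]
new = ["heading", "footer", "| foo | baz |"]  # row moved down, bar -> baz

# A line-based matcher can't pair the moved-and-edited row, so it
# reports the whole row as a delete plus an insert.
opcodes = difflib.SequenceMatcher(a=old, b=new).get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(tag, old[i1:i2], "->", new[j1:j2])
# -> equal ['heading'] -> ['heading']
#    delete ['| foo | bar |'] -> []
#    equal ['footer'] -> ['footer']
#    insert [] -> ['| foo | baz |']
```

Handling this well requires a second pass that pairs "similar enough" deleted and inserted segments and diffs them against each other, which is exactly the hard part when the moved content was also edited.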
[18:25:42] In general, if a simple diff handles something, we should handle it too.
[18:26:03] okay. Also, can you tell me which parent datasource I should cache locally if I want to regenerate diff features without making API calls every time?
[18:26:20] So, I don't have time to dig into this right now, but you could take a look at our deltas library to see if you can figure something out.
[18:26:41] codezee_, sorry, not sure what you're looking at.
[18:27:04] halfak: okay, that's fine, I'll look.
[18:27:15] Oh! Wait. I do think I understand.
[18:27:18] one sec.
[18:28:37] from revscoring.dependencies import dig
[18:28:56] root_datasources = dig(features_I_want)
[18:29:24] You can then extract and store those root_datasources in the cache and re-use them when extracting diff features and stuff.
[18:29:40] Note that it'll be big because it'll contain the full text of a revision and its parent.
[18:29:53] Gotta get lunch now while the getting is good!
[18:30:54] halfak|Lunch: big is fine if it saves time :P but thanks for the dig pointer, seems really useful :)
[19:40:43] HAKSOAT, just saw your question re. running features and responded. I think the reference I gave will help a lot.
[19:42:50] I'm going AFK for a bit to meet up with the Minneapolis WMF crew
[19:51:30] Okay. I'll check. Thanks.
[20:13:35] grabbing some lunch, back in a bit
[20:26:14] halfak: hey, just checking in. How are things going?
[20:26:51] Hey Zppix! It's good. We're nearing the end of the "quarter", so we're working on getting a few new things deployed.
[20:27:17] Good. Try to make sure you blow up some servers, or you're not deploying it right, halfak
[20:27:39] :P well, we're not melting any icebergs yet ;)
[20:28:40] halfak: tsk tsk, I guess I need to teach you guys better :P
[20:31:09] Ha. Well, there are some proposals floating around for AI-powering iceberg-melting infrastructure.
[20:31:30] I'm not a big fan yet because I don't see it giving us any major benefit.
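[Editor's note] Conceptually, what `dig` does is walk the feature dependency graph down to the datasources that have no dependencies of their own (for diff features, the revision text and parent revision text fetched from the API); those roots are what's worth caching for offline re-extraction. Below is a stdlib sketch of that idea only — the graph, the datasource names, and this `dig` implementation are illustrative stand-ins, not revscoring's actual internals; the real call is `revscoring.dependencies.dig` as quoted above.

```python
# Hypothetical dependency graph, for illustration only: each name maps
# to the datasources it depends on; roots have no dependencies.
DEPENDS_ON = {
    "diff.tokens_added":    ["diff.operations"],
    "diff.operations":      ["revision.text", "parent_revision.text"],
    "revision.text":        [],  # root: fetched from the API
    "parent_revision.text": [],  # root: fetched from the API
}

def dig(features):
    """Return the root dependencies (those with no dependencies of their own)."""
    roots, stack, seen = set(), list(features), set()
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        deps = DEPENDS_ON[name]
        if deps:
            stack.extend(deps)  # keep walking down the graph
        else:
            roots.add(name)     # no dependencies: this is a root
    return roots

# Fetch each root once (the expensive API step), then cache it; later
# extractions can be fed from the cache instead of the API.
cache = {root: f"<cached text for {root}>" for root in dig(["diff.tokens_added"])}
print(sorted(cache))  # -> ['parent_revision.text', 'revision.text']
```

As noted in the channel, such a cache is large — it holds the full text of each revision and its parent — but it trades disk for the API round-trips that dominate extraction time.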
[20:33:09] hah
[20:35:20] if we had some iceberg-preserving AI-driven proposals, I'd be interested in seeing those. Well, no, I'd be interested in someone else seeing those :-D
[21:34:04] lol @ apergos. I hear you.
[21:34:14] :-)
[21:34:45] isaacj, /home/halfak/projects/drafttopic/datasets/wikiproject_to_templates.20191212.yaml
[21:34:56] thanks!
[22:26:40] CODE REVIEW COMPLETE. Congratulations on your hard work, accraze!
[22:26:55] I just +2'd
[22:28:13] \o/
[22:28:21] thanks halfak!
[22:48:49] OK! 2.6.2 is released.
[22:48:52] ^revscoring
[22:48:57] and I started regenerating models with it
[22:49:10] *and* I started work on the next ORES deployment.
[22:49:20] Including work that we need to do to get drafttopic predictions in the right place.
[22:49:36] It looks like we have a complete labeled dataset for the new topic models too.
[22:49:45] Feels successful enough for Friday. I'm outta here.
[22:49:48] have a good one, folks!
[22:50:04] o/
[22:50:17] wikimedia/ores#1385 (main_edit_event - f523510 : halfak): The build passed. https://travis-ci.org/wikimedia/ores/builds/624834762