[00:55:36] halfak: I'm playing with new features in the edit intention repo, but every time I define new features I have to do a new extract, most of which is API calls. I know that to save time I can extract and save the diff text once, then run offline feature extraction on the saved text. I'm not sure which datasource to use for that
[01:01:36] Scoring-platform-team: Configure ORES to publish new drafttopic scores to Kafka - https://phabricator.wikimedia.org/T240549 (Tgr) Thanks both! Where is the list of scores defined? I don't see it either in the [[https://github.com/wikimedia/mediawiki-extensions-EventBus/blob/0f22159a1f27ddf7c4da5806c868ce04b9...
[01:05:16] Scoring-platform-team, Discovery-Search: Consume ORES drafttopic data from Kafka and store it in HDFS - https://phabricator.wikimedia.org/T240553 (Tgr) The relevant conversation happened in {T240549}; I read it as saying that this task is actually a no-op, since the mechanism (which uses EventGate with t...
[01:24:59] Scoring-platform-team: Configure ORES to publish new drafttopic scores to Kafka - https://phabricator.wikimedia.org/T240549 (Pchelolo) yeah. AFAIK it's set up in the ORES config. @Ladsgroup knows much more about this.
[08:31:08] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Gehel) >>! In T240517#5734191, @Tgr wrote: > One thing we haven't really discussed is how the fake non-English drafttopic will work. Would that be done within...
[08:58:56] Scoring-platform-team, drafttopic-modeling, revscoring, artificial-intelligence: Build WikiProject directory topic models for ar, cs, and kowiki - https://phabricator.wikimedia.org/T235181 (Gehel) Given the various discussions in various channels, it is unclear to me if this is about training new...
[09:05:12] Scoring-platform-team: Configure ORES to publish new drafttopic scores to Kafka - https://phabricator.wikimedia.org/T240549 (Gehel) While the [[ https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/revision/score/2.0.0.yaml | current schema ]] seems flexible enough to support...
[14:00:59] Hey kevinbazira!
[14:01:26] Hi halfak o/
[14:01:56] How's hacking going?
[14:03:08] It's going well. I was able to go through the fastText unsupervised learning examples, get the Wikipedia data, and train word vectors.
[14:10:19] Nice! Did you get the wikifil.pl script working and everything?
[14:10:59] Did the analogies work as expected?
[14:11:29] Sorry for the delay. I looked at an email and spiraled into that because I haven't had my coffee yet :|
[14:11:47] hihi that's fine :)
[14:12:12] Yes, the wikifil.pl script worked fine
[14:13:30] Nice. And the analogies?
[14:16:27] I think we want to try making a 50-cell vector for English Wikipedia first. What do you think about giving that a try?
[14:16:53] By analogies do you mean "expected output based on the examples"? If so, yes, they worked as expected, with a few discrepancies in the output. I think that's because the word vectors in the models I created on the stat server aren't exactly the same as those in the examples.
[14:19:45] Right. That makes sense.
[14:24:51] So, about doing a full training run...
[14:30:07] kevinbazira, ^
[14:31:13] I'd be down for getting on a call if that'd be easier.
[14:31:31] Yep, that's fine with me too.
[14:31:45] OK great. Call when ready.
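[Editor's note] The analogy check discussed above is nearest-neighbour search over simple vector arithmetic: vec(king) - vec(man) + vec(woman) should land closest to vec(queen). In the fastText Python module this is exposed as `model.get_analogies(...)`, and the 50-cell model mentioned would come from `fasttext.train_unsupervised(corpus, dim=50)`. Below is a minimal stdlib sketch of the arithmetic itself; the words and 3-dimensional vectors are made up for illustration, not output from a trained model.

```python
import math

# Toy word vectors, invented for this example. Real vectors would come
# from a trained fastText model (e.g. dim=50 for English Wikipedia).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.9, 0.1, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.1, 0.8, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word whose vector is nearest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # -> queen
```

The small discrepancies kevinbazira saw are expected: training is stochastic, so two runs produce different vectors and the nearest neighbour of a target point can shift for borderline analogies.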
[14:34:29] kevinbazira, /mnt/data/xmldatadumps/public/enwiki/20191120
[14:54:16] kevinbazira, https://gist.github.com/halfak/e57fbf53aae29a95c7282c6fd7bb701b
[14:55:34] kevinbazira, https://fasttext.cc/docs/en/unsupervised-tutorial.html#advanced-readers-playing-with-the-parameters
[14:58:04] https://github.com/facebookresearch/fastText/blob/master/wikifil.pl
[15:00:15] https://stackoverflow.com/questions/14922272/perl-while-file-handling
[15:01:48] bzcat enwiki-latest....xml.bz2 | perl wikifil.pl > text_output.txt
[15:04:14] bzcat enwiki-latest....xml.bz2 | perl wikifil.pl | bzip2 -c > text_output.txt.bz2
[15:05:51] $ wget https://github.com/facebookresearch/fastText/archive/v0.9.1.zip
[15:05:53] $ unzip v0.9.1.zip
[15:06:59] $ cd fastText-0.9.1
[15:07:17] $ make
[15:09:23] https://fasttext.cc/docs/en/python-module.html#train_unsupervised-parameters
[17:02:38] halfak: I'm looking at https://en.wikipedia.org/wiki/?diff=710668715&diffmode=source and found that wikitext.revision.diff.tokens_added gives all the tokens in the added lines, while I was expecting only the highlighted words to appear. What am I missing here?
[18:03:13] o/ Sorry, was in meetings.
[18:03:50] Checking
[18:09:33] codezee_, indeed this looks strange.
[18:11:48] I wonder if the diff isn't processing the lines starting with "|" correctly.
[18:20:15] Oh man. I see what is going on here. This is a hard case.
[18:20:36] This is a weird diff where someone moves a bunch of content around *and* makes minor changes to the content.
[18:21:42] halfak: so if it's a move and change, it'll treat them as additions and deletions, right?
[18:21:49] Right.
[18:22:05] I'm not quite sure why we aren't processing this better, though.
[18:22:34] Generally we do better than MW's diff ^_^
[18:22:39] Here, we are not
[18:23:14] and I'm thinking these examples may not be that rare, because moves may happen now and then as part of revisions
[18:25:23] Moves are usually well handled.
[18:25:28] Moves with minor changes are harder.
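[Editor's note] The move-plus-minor-change case diagnosed above can be reproduced with any plain line-based diff. A small sketch using Python's difflib (not revscoring's deltas library; the lines are invented for illustration): a table row is moved below "footer" and one word in it is edited, so the diff cannot pair the two versions of the row and reports the whole row as deleted and re-inserted. That is why every token on such a line ends up in tokens_added, not just the word that changed.

```python
import difflib

old = ["heading", "| foo | bar |", "footer"]
new = ["heading", "footer", "| foo | baz |"]  # row moved down, bar -> baz

# A line-based matcher can't pair the moved-and-edited row, so it
# reports the whole row as a delete plus an insert.
opcodes = difflib.SequenceMatcher(a=old, b=new).get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(tag, old[i1:i2], "->", new[j1:j2])
# -> equal ['heading'] -> ['heading']
#    delete ['| foo | bar |'] -> []
#    equal ['footer'] -> ['footer']
#    insert [] -> ['| foo | baz |']
```

Handling this well requires a second pass that pairs "similar enough" deleted and inserted segments and diffs them against each other, which is exactly the hard part when the moved content was also edited.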
[18:25:42] In general, if a simple diff handles something, we should handle it too.
[18:26:03] okay. Also, can you tell me which parent datasource I should cache locally if I want to regenerate diff features without making API calls every time?
[18:26:20] So, I don't have time to dig into this right now, but you could take a look at our deltas library to see if you can figure something out.
[18:26:41] codezee_, sorry, not sure what you're looking at.
[18:27:04] halfak: okay, that's fine, I'll look.
[18:27:15] Oh! Wait. I do think I understand.
[18:27:18] one sec.
[18:28:37] from revscoring.dependencies import dig
[18:28:56] root_datasources = dig(features_I_want)
[18:29:24] You can then extract and store those root_datasources in the cache and re-use them when extracting diff features and stuff.
[18:29:40] Note that it'll be big because it'll contain the full text of a revision and its parent.
[18:29:53] Gotta get lunch now while the getting is good!
[18:30:54] halfak|Lunch: big is fine if it saves time :P but thanks for the dig pointer, seems really useful :)
[19:40:43] HAKSOAT, just saw your question re. running features and responded. I think the reference I gave will help a lot.
[19:42:50] I'm going AFK for a bit to meet up with the Minneapolis WMF crew
[19:51:30] Okay. I'll check. Thanks.
[20:13:35] grabbing some lunch, back in a bit
[20:26:14] halfak: hey, just checking in. How are things going?
[20:26:51] Hey Zppix! It's good. We're nearing the end of the "quarter", so we're working on getting a few new things deployed.
[20:27:17] Good. Try to make sure you blow up some servers, or you're not deploying it right, halfak
[20:27:39] :P well, we're not melting any icebergs yet ;)
[20:28:40] halfak: tsk tsk, I guess I need to teach you guys better :P
[20:31:09] Ha. Well, there are some proposals floating around for AI-powering iceberg-melting infrastructure.
[20:31:30] I'm not a big fan yet because I don't see it giving us any major benefit.
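[Editor's note] Conceptually, what `dig` does is walk the feature dependency graph down to the datasources that have no dependencies of their own (for diff features, the revision text and parent revision text fetched from the API); those roots are what's worth caching for offline re-extraction. Below is a stdlib sketch of that idea only — the graph, the datasource names, and this `dig` implementation are illustrative stand-ins, not revscoring's actual internals; the real call is `revscoring.dependencies.dig` as quoted above.

```python
# Hypothetical dependency graph, for illustration only: each name maps
# to the datasources it depends on; roots have no dependencies.
DEPENDS_ON = {
    "diff.tokens_added":    ["diff.operations"],
    "diff.operations":      ["revision.text", "parent_revision.text"],
    "revision.text":        [],  # root: fetched from the API
    "parent_revision.text": [],  # root: fetched from the API
}

def dig(features):
    """Return the root dependencies (those with no dependencies of their own)."""
    roots, stack, seen = set(), list(features), set()
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        deps = DEPENDS_ON[name]
        if deps:
            stack.extend(deps)  # keep walking down the graph
        else:
            roots.add(name)     # no dependencies: this is a root
    return roots

# Fetch each root once (the expensive API step), then cache it; later
# extractions can be fed from the cache instead of the API.
cache = {root: f"<cached text for {root}>" for root in dig(["diff.tokens_added"])}
print(sorted(cache))  # -> ['parent_revision.text', 'revision.text']
```

As noted in the channel, such a cache is large — it holds the full text of each revision and its parent — but it trades disk for the API round-trips that dominate extraction time.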
[20:33:09] hah
[20:35:20] if we had some iceberg-preserving AI-driven proposals, I'd be interested in seeing those. Well, no, I'd be interested in someone else seeing those :-D
[21:34:04] lol @ apergos. I hear you.
[21:34:14] :-)
[21:34:45] isaacj, /home/halfak/projects/drafttopic/datasets/wikiproject_to_templates.20191212.yaml
[21:34:56] thanks!
[22:26:40] CODE REVIEW COMPLETE. Congratulations on your hard work, accraze!
[22:26:55] I just +2'd
[22:28:13] \o/
[22:28:21] thanks halfak!
[22:48:49] OK! 2.6.2 is released.
[22:48:52] ^revscoring
[22:48:57] and I started regenerating models with it
[22:49:10] *and* I started work on the next ORES deployment.
[22:49:20] Including work that we need to do to get drafttopic predictions in the right place.
[22:49:36] It looks like we have a complete labeled dataset for the new topic models too.
[22:49:45] Feels successful enough for Friday. I'm outta here.
[22:49:48] have a good one, folks!
[22:50:04] o/
[22:50:17] wikimedia/ores#1385 (main_edit_event - f523510 : halfak): The build passed. https://travis-ci.org/wikimedia/ores/builds/624834762