[14:43:58] Hey kevinbazira! How's hacking?
[14:45:00] halfak: still want to meet to discuss embeddings/topics?
[14:45:01] Hi halfak! Hacking is going well. I'm digging into doc2vec.
[14:45:23] Hey isaacj! Yes.
[14:45:29] :thumbs up:
[14:45:31] It might be nice to pull in kevinbazira.
[14:45:37] Could you get on a call right now, kevinbazira?
[14:45:49] Yep
[16:09:24] 10Scoring-platform-team (Current), 10drafttopic-modeling: Implement English pronoun count features in topic models - https://phabricator.wikimedia.org/T242345 (10Halfak)
[16:13:19] 10Scoring-platform-team (Current), 10drafttopic-modeling: Implement English pronoun count features in topic models - https://phabricator.wikimedia.org/T242345 (10Halfak) Re. adding it to drafttopic feature lists: https://gist.github.com/halfak/a2073ae3fd59ad0f0fdbebd5dedcafa3
[16:15:14] kevinbazira, https://phabricator.wikimedia.org/T242345
[16:24:22] Thanks halfak!
[16:43:10] Hello halfak. Is there anything else for me to modify on the PR?
[16:44:49] hey haksoat! I had a quick look this morning. I think we'll want to get that asset pulled into a feature in languages/english.py. We could do that in a follow-up PR. What do you think?
[16:45:27] Depends on your appetite for pushing this all the way to production ^_^
[16:48:32] It can be done in this same PR, right? Instead of creating another.
[16:49:48] I don't fully understand what is meant by "a feature" though.
[16:52:06] Gotcha. I can help with that. So we have these other things defined in languages/english.py using a RegexExtractor. I think we'll want to do the same here.
[16:53:38] See https://github.com/wikimedia/revscoring/blob/master/revscoring/languages/english.py#L138 as an example
[16:53:59] I think we'll want to follow the same pattern here.
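(Editor's note: the asset-loading step discussed here might look roughly like the sketch below. The function name `load_idioms`, the path handling, and the exact prefix list are illustrative assumptions, not revscoring's actual implementation.)

```python
import re

# Namespace prefixes to skip, per the discussion in this log
# ("Category", "Appendix", "Citation", etc.). Illustrative only.
SKIP_PREFIXES = re.compile(r"^(Category|Appendix|Citations?):")

def load_idioms(path="assets/idioms.txt"):
    """Load idiom phrases from a text asset, skipping namespaced
    entries such as "Appendix:Snowclones/few X short of a Y"."""
    idioms = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            phrase = line.strip()
            if not phrase or SKIP_PREFIXES.match(phrase):
                continue
            idioms.append(phrase)
    return idioms
```

The resulting phrase list could then feed whatever regex-based feature extractor the project settles on.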
[16:54:12] Except we'll be loading in from assets/idioms.txt
[16:54:40] I noticed that there are a few lines that start with "Appendix:" and "Citation:" that we'll probably want to filter out.
[16:54:58] Okay. I'll study other examples and try to do something.
[16:54:58] There are some other ones that use "X" and "Y" as placeholders.
[16:55:17] Oh. I missed those. I'll take a look.
[16:55:26] Oh wait. That's in "Appendix:Snowclones"
[16:55:41] E.g. "Appendix:Snowclones/few X short of a Y"
[16:56:59] I think values with a colon in them should be filtered
[16:58:03] But some look to have actual idioms in them
[16:58:32] "Citations:" for example
[16:59:00] hey halfak & kevinbazira o/
[16:59:15] should we async today due to the staff meeting?
[16:59:16] hey accraze o/
[16:59:43] haksoat, +1 We can probably match the prefixes we know before the ":" -- Category, Appendix, Citation, etc.
[16:59:47] hey accraze!
[16:59:49] +1 for async
[17:00:07] cool
[17:05:57] Y: Had started implementing a native fasttext model in revscoring but ran into issues where the way fasttext generates models couldn't fit into the way ORES/revscoring generates models. Aaron advised looking into doc2vec.
[17:06:16] T: I've been exploring doc2vec. Got to learn that doc2vec takes inspiration from word2vec and computes vectors for documents as opposed to only words. Also, doc2vec uses PV-DM and PV-DBOW the same way word2vec uses Skip-gram and CBOW. Will continue exploring ...
[17:06:34] T: halfak and I also had a meeting today with Isaac so that we can see how to fit fasttext into the ORES pipeline. Isaac shared his knowledge on fasttext, doc2vec, and keras.
[17:07:50] Y: ORES deployment! Worked with Kevin on vector stuff. Talked to Erika about the SWE and EM positions. Will have an update for the next staff meeting. I produced a new version of the ORES systems paper for CSCW (deadline next week). I also uploaded a CC-BY-SA version of the ORES values paper.
See https://commons.wikimedia.org/wiki/File:Keeping_Community_in_the_Loop-_Understanding_Wikipedia_Stakeholder_Values_for_Machine_Learning-Based_Systems.pdf
[17:07:50] T: Worked with Kevin on vector stuff. We met with Isaac and discussed investing in gensim, fasttext, and keras. My main todo is to work on a spec for the pipelines and APIs I want around embeddings. I put together a follow-up task for kevinbazira (gender pronouns). I also reviewed haksoat's work on idioms and will continue to advise on next steps. accraze, I'm starting to get close to taking on Jade work. I'll be interested to hear where I can contribute the most productively. CSS?
[17:13:14] ^ yeah most likely css
[17:14:33] Y: finished up refactoring the JS client for the Jade API, wired up the Move Endorsement form, squashed more eslint bugs
[17:17:10] T: fix the edit Proposal note & edit endorsement comment forms (readonly bug), squash more eslint bugs, refactor the UI to use i18n messages
[21:05:51] @halfak do you use Huggle or STiki labels for training ORES?
[21:05:51] Error: Command “halfak” not recognized. Please review and correct what you’ve written.
[21:06:22] Hey! We don't. Huggle doesn't produce good labels. I haven't reached out to the STiki dev yet though.
[21:07:00] xinbenlv, I've been meaning to ask you where I could grab data from your tool though. I'd love to review it and have one of my grad students try to put it to use.
[21:07:38] Why were they not good labels?
[21:07:45] Is there a way to access those labels?
[21:20:45] xinbenlv, there's no historical dataset. It only has negative labels.
[21:20:51] No one marks anything as good.
[21:21:00] The only way to get data is to listen in on an IRC channel.
[21:26:05] I see
[21:27:37] Btw, WikiLoop Battlefield recently crossed 15K labels. We will soon start looking into using its labels for training.
[21:28:05] No historical dataset means not even stored?
[21:39:01] Are you up for a meeting to discuss how we can make WikiLoop Battlefield labels useful in ORES training?
[21:39:13] @half a
[21:39:13] Error: Command “half” not recognized. Please review and correct what you’ve written.
[21:39:21] Definitely.
[21:39:21] @halfak
[21:39:21] Error: Command “halfak” not recognized. Please review and correct what you’ve written.
[21:39:36] Heh. The "@" symbol is not necessary :)
[21:39:41] Just say my name and I'll get a ping.
[21:39:44] what timezone are you in?
[21:43:09] CST
[21:43:12] UTC-6
[21:54:08] Nice. All set :)
[22:32:46] I got pretty excited about text processing today, so I didn't make it to CSS. But I'm planning to set up my vagrant tomorrow. Today I managed to get a pretty fast and effective text processing system in place. I think we might get better signal from our embeddings with this. I worked it out so that the algorithm will learn embeddings one paragraph at a time.
[22:32:56] And with that, I'm off. Have a good one, folks.
[22:38:37] * halfak sneakily figured out that we can process the giant Alan Turing article in about 0.02 seconds on a single CPU. That's more than 50 Turings per second!
[22:38:47] * halfak actually leaves now.