[00:13:00] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (MMiller_WMF)
[00:59:01] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Tgr)
[02:08:43] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (EBernhardson)
[10:45:11] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Tgr)
[10:48:21] Scoring-platform-team: Configure ORES to publish new drafttopic scores to Kafka - https://phabricator.wikimedia.org/T240549 (Tgr)
[10:57:39] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Tgr)
[11:02:39] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Tgr)
[11:02:52] Scoring-platform-team: Configure ORES to publish new drafttopic scores to Kafka - https://phabricator.wikimedia.org/T240549 (Tgr)
[11:08:42] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Tgr)
[11:09:24] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Tgr)
[11:23:09] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Tgr)
[11:31:22] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Tgr)
[11:32:27] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Tgr)
[11:34:55] MediaWiki-extensions-ORES, Scoring-platform-team, Discovery-Search, Growth-Team, NewcomerTasks 1.1: Expose ORES drafttopic data in ElasticSearch via a custom CirrusSearch keyword - https://phabricator.wikimedia.org/T240559 (Tgr)
[11:35:12] MediaWiki-extensions-ORES, Scoring-platform-team, Discovery-Search, Growth-Team, NewcomerTasks 1.1: Expose ORES drafttopic data in ElasticSearch via a custom CirrusSearch keyword - https://phabricator.wikimedia.org/T240559 (Tgr)
[14:15:19] Scoring-platform-team: Configure ORES to publish new drafttopic scores to Kafka - https://phabricator.wikimedia.org/T240549 (Ottomata) I think if drafttopic is added to the list of 'precache' scores for changeprop, it will automatically get added. Ping @Pchelolo
[14:19:02] Scoring-platform-team: Configure ORES to publish new drafttopic scores to Kafka - https://phabricator.wikimedia.org/T240549 (Pchelolo) We have the 'revision-score' topic where an event is pushed on every page edit via https://github.com/wikimedia/change-propagation/blob/master/sys/ores_updates.js and calling...
[14:24:22] Scoring-platform-team: Configure ORES to publish new drafttopic scores to Kafka - https://phabricator.wikimedia.org/T240549 (Ottomata) I think (hope!) this event will be fine!
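Pchelolo's pointer above is to the existing 'revision-score' Kafka topic that change-propagation fills on every edit. A minimal sketch of what consuming those events from Python might look like, assuming kafka-python and JSON-serialized events; the topic name, broker address, and event layout here are illustrative guesses, not the production values (see the mediawiki/revision/score schema for the real structure):

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical broker and topic; the production Kafka cluster differs.
consumer = KafkaConsumer(
    'eqiad.mediawiki.revision-score',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda raw: json.loads(raw.decode('utf-8')),
)

for message in consumer:
    event = message.value
    # Only look at events that carry a drafttopic score (assumed layout).
    scores = event.get('scores', {})
    if 'drafttopic' in scores:
        print(event.get('rev_id'), scores['drafttopic'])
```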
[14:24:49] Scoring-platform-team: Configure ORES to publish new drafttopic scores to Kafka - https://phabricator.wikimedia.org/T240549 (Ottomata) If you just put this score into the existing topic, it will show up in the event.mediawiki_revision_score hive table.
[15:07:38] o/ kevinbazira
[15:07:42] Hey man. How's hacking?
[15:17:46] I just left some feedback on the PR. Looks like it will probably fail or put out some errors in the output.
[15:17:59] Did you try a test run with some WikiProjects that are missing templates?
[15:43:48] Scoring-platform-team, Discovery-Search: Consume ORES drafttopic data from Kafka and store it in HDFS - https://phabricator.wikimedia.org/T240553 (Aklapper)
[15:56:48] o/ halfak
[15:57:58] Thanks for the review on the PR. Yes, I did test it with an example that is missing templates.
[15:59:01] Oh interesting. Did I guess the behavior wrong?
[15:59:12] What happens in the output when a template is missing?
[15:59:23] kevinbazira, ^
[16:00:04] It was returning the canonical name (i.e. the given string) and a list with the string 'error'.
[16:00:13] For example:
[16:00:41] Aha. We probably don't want "error" in the output file.
[16:01:03] We probably just want to return the canonical name with a list containing the canonical name.
[16:02:34] Here is an example of the response:
[16:02:35] WikiProject Women scientistsxxx: ["error"]
[16:04:12] Right. Should probably log the error and instead have:
[16:04:35] WikiProject Women scientistsxxx: ["WikiProject Women scientistsxxx"]
[16:05:48] Alright, thanks for the clarification. Let me fix this now.
[16:11:40] hey halfak, sorry for missing the meeting, I was on a delayed flight :S
[16:12:10] No worries! I proposed that we reschedule for next week. Does that work for you?
[16:12:20] Isaac and I had some progress to catch up on.
[16:12:49] I'd love to sync up on embedding generation though. We're a half-step away from being ready to experiment with different length vectors.
[16:14:22] OK, I'll try to do some work on it before the meeting.
[16:20:37] dsaez, one other thing that isaacj and I talked about was text cleanup from the XML dumps.
[16:20:48] How are you planning to get good raw text to generate embeddings from?
[16:22:07] I haven't thought about that; one option is using the Python API and parsing on the fly with mwparserfromhell
[16:22:43] another option is to use the Facebook script to clean up the XML dumps
[16:24:34] I'm searching; I remember there was a Perl script to do the cleaning
[16:25:18] https://github.com/jind11/word2vec-on-wikipedia halfak
[16:26:50] That extractor is pretty complicated.
[16:27:27] I think I've tried the gensim one
[16:27:32] https://radimrehurek.com/gensim/wiki.html
[16:27:49] "This pre-processing step makes two passes over the 8.2GB compressed wiki dump (one to extract the dictionary, one to create and store the sparse vectors) and takes about 9 hours on my laptop, so you may want to go have a coffee or two."
[16:32:41] Yeah, so I think we want to generate embeddings using the same text cleanup as we'll use when processing things later.
[16:33:03] That makes me lean strongly towards doing it in Python and not parsing templates.
[16:33:08] *Or* using the MWAPI.
[16:33:32] Could we get away with a smaller sample of text for generating embeddings using the mwapi?
[17:00:09] halfak, I'm not sure... it might depend on the sample.
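The behavior halfak asks for above — log the lookup failure, then map the canonical name to a list containing just itself instead of ["error"] — could look roughly like this. `canonicalize` and `resolve_redirects` are hypothetical stand-ins, not the actual names in kevinbazira's PR:

```python
import logging

logger = logging.getLogger(__name__)

def canonicalize(template_name, resolve_redirects):
    """Map a WikiProject template name to its list of redirect names.

    `resolve_redirects` is a hypothetical callable that raises when the
    template doesn't exist. On failure, log the error and fall back to
    mapping the canonical name to a list containing just itself, rather
    than writing "error" into the output file.
    """
    try:
        return template_name, resolve_redirects(template_name)
    except Exception:
        logger.exception("Could not resolve %r; using it as-is", template_name)
        return template_name, [template_name]

# canonicalize("WikiProject Women scientistsxxx", broken_lookup)
# -> ("WikiProject Women scientistsxxx", ["WikiProject Women scientistsxxx"])
```

For dsaez's first option — fetching wikitext via the API and parsing on the fly — a minimal sketch using mwapi and mwparserfromhell; the page title is only an example:

```python
import mwapi  # pip install mwapi
import mwparserfromhell  # pip install mwparserfromhell

session = mwapi.Session('https://en.wikipedia.org',
                        user_agent='embedding-text-demo')

# Fetch the current wikitext of one page.
response = session.get(action='query', prop='revisions', titles='Ada Lovelace',
                       rvprop='content', rvslots='main', formatversion=2)
wikitext = (response['query']['pages'][0]
            ['revisions'][0]['slots']['main']['content'])

# strip_code() drops templates, refs, and markup, leaving plain-ish text.
plain_text = mwparserfromhell.parse(wikitext).strip_code()
print(plain_text[:500])
```

And for the gensim route halfak links to, the dump cleanup lives in gensim's WikiCorpus, which streams cleaned, tokenized articles straight out of the compressed dump. A rough sketch, assuming a local dump file at the path shown:

```python
from gensim.corpora.wikicorpus import WikiCorpus

# Passing an empty dictionary skips the expensive vocabulary-building
# pass; we only want the cleaned token streams here.
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary={})

for i, tokens in enumerate(wiki.get_texts()):
    print(' '.join(tokens)[:200])
    if i >= 2:  # just peek at the first few articles
        break
```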
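Note the tradeoff the conversation circles around: the API route gives exactly the text-cleanup pipeline that would be reused at scoring time, while the dump-based routes (gensim, the word2vec-on-wikipedia extractor) are faster over the full corpus but bake in a different cleanup.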
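For the Hive side Ottomata mentions at the top of this block, reading the event.mediawiki_revision_score table from pyspark is straightforward; this assumes a SparkSession wired to the analytics Hive metastore, and the column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('drafttopic-scores')
         .enableHiveSupport()
         .getOrCreate())

# Assumed columns; check the actual table schema before relying on this.
scores = spark.sql("""
    SELECT rev_id, scores
    FROM event.mediawiki_revision_score
    LIMIT 10
""")
scores.show(truncate=False)
```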
[17:00:25] another option is to use the integrated version of word2vec in Spark
[17:00:31] that might be the fastest option
[17:12:16] I'd be down for that. Whatever gets us vectors we can experiment with in the short term will probably be the fastest way to make progress.
[17:12:26] We can switch out our method for generating embeddings later as we see fit.
[18:57:42] Scoring-platform-team, Research: Extract cross-wiki WikiProject tags - https://phabricator.wikimedia.org/T240273 (Halfak) OK. I adjusted this in https://github.com/halfak/wikitax/pull/4 I excluded the following: * wikiproject disambiguation 104354 (Not topical) ** wpdab 2605 ** wp disambiguation 3807 *...
[20:19:40] Scoring-platform-team, Discovery-Search: Produce drafttopic score events on every edit to English Wikipedia articles - https://phabricator.wikimedia.org/T240609 (Halfak)
[20:49:41] headin out for lunch and a couple errands, back in a bit
[22:35:30] So, I think I sleep-wrote some code.
[22:36:33] I have a hazy recollection of a couple parts of a complicated data processing script. I don't know when I did it, but it must have been some time this week. Well, I went to go actually get it done and I found a complete, functioning, well-documented script.
[22:36:36] Hooray, I guess?
[22:36:43] :D
[22:38:04] hooray indeed :D
[22:47:08] Hehe
[22:47:35] * Platonides suggests looking at the file modification time
[22:50:51] Apparently I wrote it on Tuesday!
[22:51:42] in the middle of the night?
[22:59:10] halfak: can you tell me the purpose of "wikitext.revision.diff.token_prop_delta_increase"? when there's already a "wikitext.revision.diff.token_delta_increase"
[22:59:40] Yeah! So token_delta_increase is a raw count of the increase in the number of tokens.
[23:00:09] Any "prop_increase" or "prop_decrease" refers to the proportional increase of a token. An example will help:
[23:00:45] Let's say I add "foo" to the article content "foo bar baz buzz". I've increased the number of "foo"s by 100%, so it would add a prop increase of 1.
[23:01:16] But if I added "foo" to the article content "foo bar foo baz foo buzz", this would only result in a 33% increase in the "foo"s.
[23:01:47] This is really useful for badwords. If I add the word "butt" to the article about "butt", that will likely have a very low proportional increase of the word.
[23:02:16] But if I add the word "butt" to the article on Abe Lincoln, that will likely lead to a large proportional increase.
[23:02:19] Make sense?
[23:03:20] oh, that's nice, thanks!
[23:53:03] Scoring-platform-team, articlequality-modeling, artificial-intelligence: Improve ORES articlequality feature extraction for images - https://phabricator.wikimedia.org/T180822 (HAKSOAT) Oh. I think "test" is the wrong word here. I meant "run". So how do I run the code and see my changes in action?
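The Spark option dsaez raises at the top of this exchange is the Word2Vec estimator in Spark MLlib. A minimal pyspark sketch with a toy DataFrame standing in for tokenized article text; vectorSize is the embedding length the team would vary between experiments:

```python
from pyspark.ml.feature import Word2Vec
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('wiki-embeddings').getOrCreate()

# Toy stand-in for a DataFrame of tokenized article text.
docs = spark.createDataFrame(
    [(['foo', 'bar', 'baz'],), (['foo', 'bar', 'buzz'],)],
    ['tokens'],
)

word2vec = Word2Vec(vectorSize=100, minCount=0,
                    inputCol='tokens', outputCol='vector')
model = word2vec.fit(docs)
model.getVectors().show()  # one learned vector per token
```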
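As a back-of-the-envelope illustration of halfak's prop_delta explanation (not the actual revscoring implementation), the proportional increase of a token works out to tokens added divided by tokens previously present; the function name and the handling of brand-new tokens below are my assumptions:

```python
from collections import Counter

def prop_delta_increase(old_tokens, new_tokens):
    """Sum of per-token proportional increases, per halfak's examples."""
    old, new = Counter(old_tokens), Counter(new_tokens)
    total = 0.0
    for token, count in new.items():
        added = count - old.get(token, 0)
        if added > 0:
            # Each added copy counts relative to how common the token
            # already was; brand-new tokens are treated as 100% each.
            total += added / max(old.get(token, 0), 1)
    return total

# "foo" added to "foo bar baz buzz": 1/1 = 1.0 (a 100% increase)
print(prop_delta_increase('foo bar baz buzz'.split(),
                          'foo foo bar baz buzz'.split()))
# "foo" added to "foo bar foo baz foo buzz": 1/3, roughly 0.33
print(prop_delta_increase('foo bar foo baz foo buzz'.split(),
                          'foo foo bar foo baz foo buzz'.split()))
```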