[00:13:00] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (MMiller_WMF)
[00:59:01] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Tgr)
[02:08:43] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (EBernhardson)
[10:45:11] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Tgr)
[10:48:21] Scoring-platform-team: Configure ORES to publish new drafttopic scores to Kafka - https://phabricator.wikimedia.org/T240549 (Tgr)
[10:57:39] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Tgr)
[11:02:39] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Tgr)
[11:02:52] Scoring-platform-team: Configure ORES to publish new drafttopic scores to Kafka - https://phabricator.wikimedia.org/T240549 (Tgr)
[11:08:42] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Tgr)
[11:09:24] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Tgr)
[11:23:09] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Tgr)
[11:31:22] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Tgr)
[11:32:27] Scoring-platform-team, Discovery-Search, Growth-Team: Allow searching articles by ORES drafttopic - https://phabricator.wikimedia.org/T240517 (Tgr)
[11:34:55] MediaWiki-extensions-ORES, Scoring-platform-team, Discovery-Search, Growth-Team, NewcomerTasks 1.1: Expose ORES drafttopic data in ElasticSearch via a custom CirrusSearch keyword - https://phabricator.wikimedia.org/T240559 (Tgr)
[11:35:12] MediaWiki-extensions-ORES, Scoring-platform-team, Discovery-Search, Growth-Team, NewcomerTasks 1.1: Expose ORES drafttopic data in ElasticSearch via a custom CirrusSearch keyword - https://phabricator.wikimedia.org/T240559 (Tgr)
[14:15:19] Scoring-platform-team: Configure ORES to publish new drafttopic scores to Kafka - https://phabricator.wikimedia.org/T240549 (Ottomata) I think if drafttopic is added to the list of 'precache' scores for changeprop, it will automatically get added. Ping @Pchelolo
[14:19:02] Scoring-platform-team: Configure ORES to publish new drafttopic scores to Kafka - https://phabricator.wikimedia.org/T240549 (Pchelolo) We have the 'revision-score' topic where an event is pushed on every page edit via https://github.com/wikimedia/change-propagation/blob/master/sys/ores_updates.js and calling...
[14:24:22] Scoring-platform-team: Configure ORES to publish new drafttopic scores to Kafka - https://phabricator.wikimedia.org/T240549 (Ottomata) I think (hope!) this event will be fine!
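Pchelolo's pointer above is to the existing 'revision-score' Kafka topic that change-propagation fills on every edit. A minimal sketch of what consuming those events from Python might look like, assuming kafka-python and JSON-serialized events; the topic name, broker address, and event layout here are illustrative guesses, not the production values (see the mediawiki/revision/score schema for the real structure):

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical broker and topic; the production Kafka cluster differs.
consumer = KafkaConsumer(
    'eqiad.mediawiki.revision-score',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda raw: json.loads(raw.decode('utf-8')),
)

for message in consumer:
    event = message.value
    # Only look at events that carry a drafttopic score (assumed layout).
    scores = event.get('scores', {})
    if 'drafttopic' in scores:
        print(event.get('rev_id'), scores['drafttopic'])
```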
[14:24:49] Scoring-platform-team: Configure ORES to publish new drafttopic scores to Kafka - https://phabricator.wikimedia.org/T240549 (Ottomata) If you just put this score into the existing topic, it will show up in the event.mediawiki_revision_score hive table.
[15:07:38] o/ kevinbazira
[15:07:42] Hey man. How's hacking?
[15:17:46] I just left some feedback on the PR. Looks like it will probably fail or put out some errors in the output.
[15:17:59] Did you try a test run with some WikiProjects that are missing templates?
[15:43:48] Scoring-platform-team, Discovery-Search: Consume ORES drafttopic data from Kafka and store it in HDFS - https://phabricator.wikimedia.org/T240553 (Aklapper)
[15:56:48] o/ halfak
[15:57:58] Thanks for the review on the PR. Yes, I did test it with an example that is missing templates.
[15:59:01] Oh interesting. Did I guess the behavior wrong?
[15:59:12] What happens in the output when a template is missing?
[15:59:23] kevinbazira, ^
[16:00:04] It was returning the canonical name (i.e. the given string) and a list with the string 'error'.
[16:00:13] For example:
[16:00:41] Aha. We probably don't want "error" in the output file.
[16:01:03] We probably just want to return the canonical name with a list containing the canonical name.
[16:02:34] Here is an example of the response:
[16:02:35] WikiProject Women scientistsxxx: ["error"]
[16:04:12] Right. Should probably log the error and instead have:
[16:04:35] WikiProject Women scientistsxxx: ["WikiProject Women scientistsxxx"]
[16:05:48] Alright, thanks for the clarification. Let me fix this now.
[16:11:40] hey halfak, sorry for missing the meeting, I was on a delayed flight :S
[16:12:10] No worries! I proposed that we reschedule for next week. Does that work for you?
[16:12:20] Isaac and I had some progress to catch up on.
[16:12:49] I'd love to sync up on embedding generation though. We're a half-step away from being ready to experiment with different length vectors.
[16:14:22] OK, I'll try to do some work on it before the meeting.
[16:20:37] dsaez, one other thing that isaacj and I talked about was text cleanup from the XML dumps.
[16:20:48] How are you planning to get good raw text to generate embeddings from?
[16:22:07] I haven't thought about that; one option is using the Python API and parsing on the fly with mwparserfromhell
[16:22:43] another option is to use the Facebook script to clean up the XML dumps
[16:24:34] I'm searching; I remember there was a Perl script to do the cleaning
[16:25:18] https://github.com/jind11/word2vec-on-wikipedia halfak
[16:26:50] That extractor is pretty complicated.
[16:27:27] I think I've tried the gensim one
[16:27:32] https://radimrehurek.com/gensim/wiki.html
[16:27:49] "This pre-processing step makes two passes over the 8.2GB compressed wiki dump (one to extract the dictionary, one to create and store the sparse vectors) and takes about 9 hours on my laptop, so you may want to go have a coffee or two."
[16:32:41] Yeah, so I think we want to generate embeddings using the same text cleanup as we'll use when processing things later.
[16:33:03] That makes me lean strongly towards doing it in Python and not parsing templates.
[16:33:08] *Or* using the MWAPI.
[16:33:32] Could we get away with a smaller sample of text for generating embeddings using the mwapi?
[17:00:09] halfak, I'm not sure... it might depend on the sample.
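The behavior halfak asks for above — log the lookup failure, then map the canonical name to a list containing just itself instead of ["error"] — could look roughly like this. `canonicalize` and `resolve_redirects` are hypothetical stand-ins, not the actual names in kevinbazira's PR:

```python
import logging

logger = logging.getLogger(__name__)

def canonicalize(template_name, resolve_redirects):
    """Map a WikiProject template name to its list of redirect names.

    `resolve_redirects` is a hypothetical callable that raises when the
    template doesn't exist. On failure, log the error and fall back to
    mapping the canonical name to a list containing just itself, rather
    than writing "error" into the output file.
    """
    try:
        return template_name, resolve_redirects(template_name)
    except Exception:
        logger.exception("Could not resolve %r; using it as-is", template_name)
        return template_name, [template_name]

# canonicalize("WikiProject Women scientistsxxx", broken_lookup)
# -> ("WikiProject Women scientistsxxx", ["WikiProject Women scientistsxxx"])
```

For dsaez's first option — fetching wikitext via the API and parsing on the fly — a minimal sketch using mwapi and mwparserfromhell; the page title is only an example:

```python
import mwapi  # pip install mwapi
import mwparserfromhell  # pip install mwparserfromhell

session = mwapi.Session('https://en.wikipedia.org',
                        user_agent='embedding-text-demo')

# Fetch the current wikitext of one page.
response = session.get(action='query', prop='revisions', titles='Ada Lovelace',
                       rvprop='content', rvslots='main', formatversion=2)
wikitext = (response['query']['pages'][0]
            ['revisions'][0]['slots']['main']['content'])

# strip_code() drops templates, refs, and markup, leaving plain-ish text.
plain_text = mwparserfromhell.parse(wikitext).strip_code()
print(plain_text[:500])
```

And for the gensim route halfak links to, the dump cleanup lives in gensim's WikiCorpus, which streams cleaned, tokenized articles straight out of the compressed dump. A rough sketch, assuming a local dump file at the path shown:

```python
from gensim.corpora.wikicorpus import WikiCorpus

# Passing an empty dictionary skips the expensive vocabulary-building
# pass; we only want the cleaned token streams here.
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary={})

for i, tokens in enumerate(wiki.get_texts()):
    print(' '.join(tokens)[:200])
    if i >= 2:  # just peek at the first few articles
        break
```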
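Note the tradeoff the conversation circles around: the API route gives exactly the text-cleanup pipeline that would be reused at scoring time, while the dump-based routes (gensim, the word2vec-on-wikipedia extractor) are faster over the full corpus but bake in a different cleanup.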
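For the Hive side Ottomata mentions at the top of this block, reading the event.mediawiki_revision_score table from pyspark is straightforward; this assumes a SparkSession wired to the analytics Hive metastore, and the column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('drafttopic-scores')
         .enableHiveSupport()
         .getOrCreate())

# Assumed columns; check the actual table schema before relying on this.
scores = spark.sql("""
    SELECT rev_id, scores
    FROM event.mediawiki_revision_score
    LIMIT 10
""")
scores.show(truncate=False)
```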
[17:00:25] another option is to use the integrated version of word2vec in Spark
[17:00:31] that might be the fastest option
[17:12:16] I'd be down for that. Whatever gets us vectors we can experiment with in the short term will probably be the fastest way to make progress.
[17:12:26] We can switch out our method for generating embeddings later as we see fit.
[18:57:42] Scoring-platform-team, Research: Extract cross-wiki WikiProject tags - https://phabricator.wikimedia.org/T240273 (Halfak) OK. I adjusted this in https://github.com/halfak/wikitax/pull/4 I excluded the following: * wikiproject disambiguation 104354 (Not topical) ** wpdab 2605 ** wp disambiguation 3807 *...
[20:19:40] Scoring-platform-team, Discovery-Search: Produce drafttopic score events on every edit to English Wikipedia articles - https://phabricator.wikimedia.org/T240609 (Halfak)
[20:49:41] headin out for lunch and a couple errands, back in a bit
[22:35:30] So, I think I sleep-wrote some code.
[22:36:33] I have a hazy recollection of a couple parts of a complicated data processing script. I don't know when I did it, but it must have been some time this week. Well, I went to go actually get it done and I found a complete, functioning, well-documented script.
[22:36:36] Hooray, I guess?
[22:36:43] :D
[22:38:04] hooray indeed :D
[22:47:08] Hehe
[22:47:35] * Platonides suggests looking at the file modification time
[22:50:51] Apparently I wrote it on Tuesday!
[22:51:42] in the middle of the night?
[22:59:10] halfak: can you tell me the purpose of "wikitext.revision.diff.token_prop_delta_increase"? when there's already a "wikitext.revision.diff.token_delta_increase"
[22:59:40] Yeah! So token_delta_increase is a raw count of the increase in the number of tokens.
[23:00:09] Any "prop_increase" or "prop_decrease" refers to the proportional increase of a token. An example will help:
[23:00:45] Let's say I add "foo" to the article content "foo bar baz buzz". I've increased the number of "foo"s by 100%, so it would add a prop increase of 1.
[23:01:16] But if I added "foo" to the article content "foo bar foo baz foo buzz", this would only result in a 33% increase in the "foo"s.
[23:01:47] This is really useful for badwords. If I add the word "butt" to the article about "butt", that will likely have a very low proportional increase of the word.
[23:02:16] But if I add the word "butt" to the article on Abe Lincoln, that will likely lead to a large proportional increase.
[23:02:19] Make sense?
[23:03:20] oh, that's nice, thanks!
[23:53:03] Scoring-platform-team, articlequality-modeling, artificial-intelligence: Improve ORES articlequality feature extraction for images - https://phabricator.wikimedia.org/T180822 (HAKSOAT) Oh. I think "test" is the wrong word here. I meant "run". So how do I run the code and see my changes in action?
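The Spark option dsaez raises at the top of this exchange is the Word2Vec estimator in Spark MLlib. A minimal pyspark sketch with a toy DataFrame standing in for tokenized article text; vectorSize is the embedding length the team would vary between experiments:

```python
from pyspark.ml.feature import Word2Vec
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('wiki-embeddings').getOrCreate()

# Toy stand-in for a DataFrame of tokenized article text.
docs = spark.createDataFrame(
    [(['foo', 'bar', 'baz'],), (['foo', 'bar', 'buzz'],)],
    ['tokens'],
)

word2vec = Word2Vec(vectorSize=100, minCount=0,
                    inputCol='tokens', outputCol='vector')
model = word2vec.fit(docs)
model.getVectors().show()  # one learned vector per token
```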
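As a back-of-the-envelope illustration of halfak's prop_delta explanation (not the actual revscoring implementation), the proportional increase of a token works out to tokens added divided by tokens previously present; the function name and the handling of brand-new tokens below are my assumptions:

```python
from collections import Counter

def prop_delta_increase(old_tokens, new_tokens):
    """Sum of per-token proportional increases, per halfak's examples."""
    old, new = Counter(old_tokens), Counter(new_tokens)
    total = 0.0
    for token, count in new.items():
        added = count - old.get(token, 0)
        if added > 0:
            # Each added copy counts relative to how common the token
            # already was; brand-new tokens are treated as 100% each.
            total += added / max(old.get(token, 0), 1)
    return total

# "foo" added to "foo bar baz buzz": 1/1 = 1.0 (a 100% increase)
print(prop_delta_increase('foo bar baz buzz'.split(),
                          'foo foo bar baz buzz'.split()))
# "foo" added to "foo bar foo baz foo buzz": 1/3, roughly 0.33
print(prop_delta_increase('foo bar foo baz foo buzz'.split(),
                          'foo foo bar foo baz foo buzz'.split()))
```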