[02:24:05] codezee, how are you reading the file in?
[02:24:10] Can I see your code?
[03:50:55] halAFK: sorry for the late reply, was away, here's the paste - https://dpaste.org/T34j
[03:50:57] it's just 4 lines
[12:12:27] Scoring-platform-team, Bad-Words-Detection-System, editquality-modeling, User-Ladsgroup, and 2 others: migrate bad words detection to editquality repo - https://phabricator.wikimedia.org/T131861 (jeropbrenda) a:jeropbrenda
[14:21:22] Scoring-platform-team (Current), revscoring, Chinese-Sites, artificial-intelligence: Tokenization of "word" things for CJK - https://phabricator.wikimedia.org/T111179 (Pavol86) @Halfak I need your feedback on following, per our call last week I did following : 1. make the CJK, jap, kor tokenizer...
[14:24:56] hey halAFK
[14:25:14] Hey chtnnh!
[14:25:29] aha! codezee, you'll either need to open that file with gzip.open() or you can use our utility that looks for filename extensions.
[14:26:19] mwtypes.files.reader(filename)
[14:26:49] halfak: oh, i see, thanks! will use the utility :)
[14:27:46] codezee, when you use mwxml.map(), it uses mwtypes.files.reader() internally.
[14:27:50] That's what I figured you were doing.
[14:28:12] mwxml.map() is way more useful than working with Dump() directly in most cases.
[14:28:16] i see, yeah, before using map, i wanted to explore the data in the dump in a simple loop
[14:28:34] i'll use map
[14:32:12] That's a good use of Dump() then!
[14:32:17] Exploration is a good idea :)
[14:40:52] Scoring-platform-team (Current), drafttopic-modeling: Compress Gensim models - https://phabricator.wikimedia.org/T247523 (Halfak)
[14:41:37] Scoring-platform-team, drafttopic-modeling: Fit more topic models into ORES - https://phabricator.wikimedia.org/T249520 (Halfak)
[14:41:39] Scoring-platform-team (Current), drafttopic-modeling: Compress Gensim models - https://phabricator.wikimedia.org/T247523 (Halfak) Open→Resolved a:Pavol86 We now have models that are built using the compressed vectors. They seem to give us good fitness.
[14:42:59] Scoring-platform-team (Current), Growth-Scaling, Growth-Team, Serbian-Sites: Scale: ORES topic models for uk, hu, hy, eu, sr (needed as soon as available) - https://phabricator.wikimedia.org/T249382 (Halfak) a:Halfak→HAKSOAT
[14:43:38] Scoring-platform-team (Current), Growth-Scaling, Growth-Team, Serbian-Sites: Scale: ORES topic models for uk, hu, hy, eu, sr (needed as soon as available) - https://phabricator.wikimedia.org/T249382 (Halfak) We've managed to compress our vectors and reduce the memory footprint of ORES. That mean...
[16:18:07] Scoring-platform-team, articlequality-modeling, artificial-intelligence: Identify articles that should be de-prod'ed. - https://phabricator.wikimedia.org/T258082 (Halfak)
[16:18:49] Scoring-platform-team, articlequality-modeling, artificial-intelligence: Identify articles that should be de-prod'ed. - https://phabricator.wikimedia.org/T258082 (Halfak) How would we get some good labeled data for this? Is there a log event when an article is Prod'ed that we can look for?
[20:36:07] Scoring-platform-team, Research ideas, Research-Backlog, Wiki-Loves-Monuments, artificial-intelligence: General image classifier for commons - https://phabricator.wikimedia.org/T155538 (Aklapper)
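
Editor's note on the dump-reading exchange (14:25 to 14:32): the advice there was to open the compressed file with gzip.open(), or to use a utility that chooses how to open a file from its filename extension. Below is a minimal sketch of that idea using only the Python standard library. The names open_by_extension, OPENERS and demo are hypothetical and invented for illustration. This is not the implementation of mwtypes.files.reader(), and the one-line XML sample is made up for the demo.

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical mapping from filename extension to an opener function.
# A real utility may support more formats; this covers two common ones.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_by_extension(path):
    """Open `path` for text reading, decompressing based on its extension.

    Falls back to the built-in open() for unrecognised extensions.
    """
    _, ext = os.path.splitext(path)
    opener = OPENERS.get(ext, open)
    return opener(path, mode="rt", encoding="utf-8")


def demo():
    """Write a tiny gzip-compressed file, then read it back as text."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.xml.gz")
        # Made-up sample content standing in for a real dump.
        with gzip.open(path, "wt", encoding="utf-8") as f:
            f.write("<page><title>Example</title></page>\n")
        with open_by_extension(path) as f:
            return f.read()


if __name__ == "__main__":
    print(demo(), end="")
```

Opening a .gz file with the plain built-in open() yields compressed bytes rather than XML, which is why the extension check matters. As noted in the log, mwxml.map() uses mwtypes.files.reader() internally, so code that goes through map() would not need a helper like this.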