[09:45:05] Scoring-platform-team, ORES, Operations, Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3718040 (hoo)
[12:33:21] Scoring-platform-team, ORES, Operations, Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3715032 (ema) >>! In T179156#3717847, @BBlack wrote: > For future reference by another opsen who might be looking at this: on...
[13:45:02] Scoring-platform-team, ORES, Operations, Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3718221 (BBlack) Does Echo have any kind of push notification going on, even in light testing yet?
[14:02:19] Scoring-platform-team, ORES, Operations, Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3718225 (BBlack) Now that I'm digging deeper, it seems there are one or more projects in progress built around Push-like thin...
[15:32:55] o/
[15:34:25] halfak: there?
[15:34:32] yup
[15:34:34] What's up?
[15:36:40] halfak: I've been able to generate 2k categories per mid-level category and it looks like - https://dpaste.de/V9Zy
[15:36:48] the dataset has about 93.5k rows
[15:37:19] Oooh sounds great :)
[15:37:20] * halfak looks
[15:37:31] and was generated using paws, what i'm thinking is for now we can upload this dataset somewhere online like editquality datasets which can then be fetched in an automated pipeline
[15:38:24] the last column basically specifies to which mid-level topic the article belongs to
[15:38:40] Looks like page_id is wrong
[15:38:55] Oh! I think I'm seeing it offset
[15:38:59] halfak: it got misaligned during paste
[15:39:08] the first column is just a serial no
[15:39:37] Otherwise looks pretty good. We'll need to re-label the observations with all tagged WikiProjects and a set of mid-level categories
[15:39:56] Also, look up some early rev_id to represent the "draft version" of the article for training.
[15:40:02] Otherwise, I think this looks great :)
[15:41:31] halfak: yes we can have an additional step to qualify all these entries with all categories they belong to,
[15:41:57] but what would that change in classification?
[15:45:56] halfak: would it not be beneficial to have only pure samples for training, i.e., those that have only one mid-level category?
[15:50:44] codezee, no way.
[15:50:47] Not representative
[15:51:00] Many articles fall into many mid-level categories.
[15:51:20] We should try our best to model the reality of the world rather than some pure slice of it.
[15:51:36] halfak: so you're saying it's best to take such a random sample as above and then just label them with every category they belong to?
[15:57:37] getting template info for each article i think amounts to 93.5k queries separately
[16:20:08] codezee, right. That's what we do to detect reverts and perform other labeling activities. We can build a python script and use the API.
[16:20:13] * halfak gets an example
[16:20:28] Sorry I'm working around the house so not in front of my computer for long.
[16:21:11] Here's a good example of a script that annotates a big dataset based on lots of mwapi queries: https://github.com/wiki-ai/wikiclass/blob/master/wikiclass/utilities/fetch_item_info.py
[16:21:21] Oh this one too: https://github.com/wiki-ai/wikiclass/blob/master/wikiclass/utilities/fetch_text.py
[16:21:37] I'm just about to head out. Good luck on the hacking. :)
[16:22:57] sure, thanks for the above scripts :)
[17:25:58] Scoring-platform-team, ORES, Operations, Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#3718297 (Legoktm) >>! In T179156#3718221, @BBlack wrote: > Does Echo have any kind of push notification going on, even in lig...
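
Note on the labeling step discussed between 15:39 and 16:20: a minimal sketch, not the script from the wikiclass repository, of how the re-labeling could be organized. It assumes the dataset can be read as (page_id, topic) rows, and that page lookups can be grouped; the MediaWiki action API generally accepts up to 50 page IDs per request for ordinary clients, so 93.5k pages would need far fewer than 93.5k requests. The function names and the topic strings are hypothetical, and the actual API call is left out.

```python
from collections import defaultdict
from itertools import islice


def batched(items, size=50):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def merge_labels(rows):
    """Collapse (page_id, topic) rows into one sorted topic list per page,
    so an article in several mid-level categories keeps all of them."""
    labels = defaultdict(set)
    for page_id, topic in rows:
        labels[page_id].add(topic)
    return {page_id: sorted(topics) for page_id, topics in labels.items()}


# Hypothetical sample rows; real topic names come from the dataset.
rows = [
    (12, "Culture.Music"),
    (12, "Geography.Europe"),
    (7, "STEM.Physics"),
]
page_labels = merge_labels(rows)

# Grouping 93,500 page IDs into requests of 50 IDs each.
batches = list(batched(range(93500), 50))
```

Each batch would then be sent as a single query (for example through an mwapi session, as the linked scripts do), and the returned WikiProject templates merged per page with `merge_labels`.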