[01:59:13] ok done for the day, see y'all tomorrow
[10:16:11] 10ORES, 10Scoring-platform-team, 10Growth-Team, 10MediaWiki-Recent-changes, and 2 others: SpecialRecentChanges::doMainQuery needs tunning - https://phabricator.wikimedia.org/T244569 (10matej_suchanek)
[14:32:59] Hello halfak_
[14:51:02] Hey haksoat!
[14:51:19] I made a PR for the image issue, but then I remembered infobox is yet to be worked on
[14:51:38] From my checks, the images in an infobox do not have a specific structure
[14:52:03] some have image_file, map, image, etc
[14:52:34] haksoat, we could probably capture the most common template parameter names.
[14:52:53] image, file, photo, map, image_file, etc.
[14:53:17] Okay. Great.
[14:53:40] When you get the chance, could you help take a look at the PR in its current state?
[15:02:42] Sure! Got a link handy?
[15:02:47] haksoat, ^
[15:03:29] Yeah
[15:03:31] https://github.com/wikimedia/articlequality/pull/102
[16:12:31] Feedback seen halfak
[16:12:47] Solid. Sorry I forgot to ping. I'm in a meeting block :|
[16:12:48] The gallery_images you talked about will be for both the tag and template gallery types right? As we can have <gallery> and {{gallery...
[16:12:53] Okay
[16:51:05] haksoat, I was only thinking about the tag-based.
[16:51:26] You might call it "tag_images" then.
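(A minimal sketch of the approach discussed above, assuming mwparserfromhell is used for the parsing: the IMAGE_PARAMS list, the "infobox" name check, and both function names are illustrative guesses, not the code in the articlequality PR.)

```python
# Sketch only: capture image-bearing infobox parameters and <gallery> tag
# contents with mwparserfromhell. IMAGE_PARAMS and the "infobox" prefix check
# are assumptions based on the chat above, not the actual feature code.
import mwparserfromhell

# Common image-carrying parameter names mentioned in the discussion.
IMAGE_PARAMS = {"image", "file", "photo", "map", "image_file"}


def infobox_images(wikitext):
    """Return values of image-like parameters found in infobox templates."""
    code = mwparserfromhell.parse(wikitext)
    values = []
    for template in code.filter_templates():
        # Treat any template whose name starts with "infobox" as an infobox.
        if str(template.name).strip().lower().startswith("infobox"):
            for param in template.params:
                if str(param.name).strip().lower() in IMAGE_PARAMS:
                    values.append(str(param.value).strip())
    return values


def tag_images(wikitext):
    """Return the raw contents of <gallery> tags (the tag-based galleries)."""
    code = mwparserfromhell.parse(wikitext)
    galleries = []
    for tag in code.filter_tags():
        if str(tag.tag).strip().lower() == "gallery" and tag.contents is not None:
            galleries.append(str(tag.contents))
    return galleries
```

The tag_images helper only covers <gallery> tags, matching the tag-based scope halfak settled on; {{gallery}}-style templates would need separate handling.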
[17:32:03] wikimedia/ores#1408 (master - 83bc66e : Andy Craze): The build has errored. https://travis-ci.org/wikimedia/ores/builds/655455077
[17:32:43] Arg! I didn't escape a character >:(
[17:35:41] ahhh :(
[17:35:53] is this on the travis side?
[17:37:58] Yup. Fixed and restarted the build.
[17:50:43] hmm weird... it looks like the auto-deploy still fails
[17:50:51] "Could not restore untracked files from stash entry"
[17:54:23] wat
[17:55:28] Oh! I think I mixed up the pypi creds.
[17:58:20] if that doesn't work, I'm also seeing "ORES needs Python 3 to run properly. Your version is 2.7.12" in the logs
[17:58:39] accraze, I just PM'd a quick question
[17:59:28] might need to add something like what we have in revscoring: https://github.com/wikimedia/revscoring/blob/master/.travis.yml#L5
[18:04:05] wikimedia/wikilabels#536 (install_docs - 9f25b7c : halfak): The build was fixed. https://travis-ci.org/wikimedia/wikilabels/builds/655467171
[19:12:39] Woo! Meeting block complete.
[19:12:41] Lunch!
[20:14:56] back
[20:22:00] wikimedia/ores#1408 (master - 83bc66e : Andy Craze): The build has errored. https://travis-ci.org/wikimedia/ores/builds/655455077
[20:23:20] I bet the problem is that I need to escape periods.
[20:25:36] ^ yep just remembered i had to do this
[21:18:28] halfak: I'm thinking of using our previously published WikiProjects dataset of 93k articles for the new work. To work on article states as they evolve, I will have to store the version of the article at each point in history. Do you think the API should be fine to get this much data or should I use the dumps?
[21:22:40] wikimedia/ores#1408 (master - 83bc66e : Andy Craze): The build has errored. https://travis-ci.org/wikimedia/ores/builds/655455077
[21:35:07] 10Jade, 10Scoring-platform-team (Current), 10MW-1.35-notes (1.35.0-wmf.22; 2020-03-03), 10Patch-For-Review: Address Jade UI issues. - https://phabricator.wikimedia.org/T245311 (10ACraze)
[21:53:21] codezee! Hey! So we've moved beyond that strategy recently.
[21:53:59] oh, better ways available now?
[21:54:02] We've switched to a manual taxonomy described here: https://github.com/halfak/wikitax/blob/master/taxonomies/wikiproject/halfak_20191202/taxonomy.yaml
[21:54:27] The taxonomy has been adjusted and improved in a lot of ways. This has produced better fitness and more useful topic categories.
[21:55:09] I'd be happy to share some labeled data with you. But you can also generate it with our updated makefile.
[21:55:11] great! can you point to documentation which i can use to extract articles corresponding to these topics again? possible ~100k
[21:55:23] *possibly
[21:55:28] https://github.com/wikimedia/drafttopic/blob/master/Makefile
[21:56:15] You probably want to start with "datasets/enwiki.labeled_article_items.json.bz2"
[21:56:35] Alternatively, you could just use the model we have in production to label some articles :)
[21:58:06] halfak: the dataset itself is a representative sample of articles of varying qualities, hence it seemed like a useful starting point
[21:58:20] do you know roughly how many articles are part of the new dataset?
[21:59:45] The new dataset is ~5 million.
[21:59:51] But you should be able to sub-sample that.
[22:00:07] It's basically every page on enwiki with a sitelink in wikidata.
[22:00:35] that seems easy if it's just the article titles and some metadata, i'll take about 100k articles for analysis out of that
[22:01:40] halfak: in terms of getting the article history (including content at each step) for these 100k articles, do you think the API is an okay choice?
[22:02:06] i'm planning to store them on myself locally, then run further analysis with regards to categories like clarification, verification, etc
[22:02:07] Yeah, 100k is reasonable for a few queries.
[22:02:20] *on mysql, not myself :P
[22:02:37] Yourself using mysql :D
[22:03:32] haha, true! ;) okay then, I'll use that 5M article dataset, subsample it, and fetch the entire history of these articles locally, lot of incoming data this week ;)
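(A minimal sketch of the subsample-and-fetch plan discussed above, assuming the labeled dataset is newline-delimited JSON with a "title" field; that field name, the sample size, the User-Agent string, and the helper names are assumptions for illustration, not part of the drafttopic pipeline.)

```python
# Sketch only: subsample the labeled-article dataset and pull full revision
# history (with content) for each title from the MediaWiki API.
import bz2
import json
import random

import requests

API_URL = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "article-history-research (example sketch)"}


def sample_titles(path, k=100_000, seed=0):
    """Read the bz2 JSON-lines dataset and return a random sample of titles."""
    with bz2.open(path, mode="rt") as f:
        items = [json.loads(line) for line in f]
    random.seed(seed)
    sample = random.sample(items, min(k, len(items)))
    return [item["title"] for item in sample]  # "title" field is an assumption


def revision_history(title, session):
    """Yield (revid, timestamp, text) for every revision of a page, oldest first."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|content",
        "rvslots": "main",
        "rvlimit": "max",
        "rvdir": "newer",
    }
    while True:
        data = session.get(API_URL, params=params, headers=HEADERS).json()
        for page in data["query"]["pages"]:
            for rev in page.get("revisions", []):
                text = rev["slots"]["main"].get("content", "")
                yield rev["revid"], rev["timestamp"], text
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow API continuation


if __name__ == "__main__":
    session = requests.Session()
    # Small sample here; bump k once the schema and rate limits are confirmed.
    for title in sample_titles("datasets/enwiki.labeled_article_items.json.bz2", k=5):
        for revid, ts, text in revision_history(title, session):
            print(title, revid, ts, len(text))
```

If pulling full histories for 100k articles through the API turns out to be too slow, the dumps mentioned in the question are the usual fallback.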