[14:50:48] o/
[16:26:00] Hi halfak, I'm trying to download all articles in a given wikiproject, and I'm using your queries on Quarry. Interestingly, for WikiProject Medicine I need to do cl_to = "All_WikiProject_Medicine_articles" instead of cl_to = "WikiProject_Medicine_articles", which gives an empty result... do you know if there is any logic behind this?
[16:30:56] Can you share the query you are looking at, dsaez
[16:30:57] ?
[16:31:23] sure
[16:31:23] https://quarry.wmflabs.org/query/14033
[16:33:23] Looks like you get a result with cl_to = "WikiProject_Medicine_articles"
[16:33:42] Oh I see. The query has women scientists
[16:34:49] yep, but I get an empty result if I do this... I need to put All_WikiProject_Medicine_articles to get something... I'm trying to understand the logic behind it
[16:34:49] Looks like there's some inconsistency with the naming of WikiProject categories. :\
[16:35:03] I see
[16:35:22] Maybe you want to use isaac's dataset
[16:35:35] The number that I get is also not consistent with the number here https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Directory/Science,_technology,_and_engineering_WikiProjects#Health_WikiProjects
[16:35:36] Definitely has some limitations
[16:35:46] got you.
[16:36:14] I just wanted to check if I was doing something wrong, but what you say makes complete sense
[16:36:30] isaacj, are you around?
[16:36:37] Cool. Yeah. I think you're understanding it right.
[16:36:48] o/
[16:37:55] isaacj, I'm trying to get all the articles under WikiProject Medicine
[16:38:37] But just replacing "Women_Scientists" with "Medicine" in this query: https://quarry.wmflabs.org/query/14033
[16:38:43] does not work
[16:39:01] I need to add an "All_WikiProject_Medicine_articles"
[16:39:08] have you seen this problem in the past?
[16:39:53] ahh okay, a few thoughts: you can look for all articles in the .bz2 file in https://figshare.com/articles/Wikipedia_Articles_and_Associated_WikiProject_Templates/10248344 that have "WikiProject Medicine" as a wp_template
[16:40:16] dsaez, right yes. This is why we use templates instead of categories when doing a full scan.
[16:40:21] Slightly more consistent :|
[16:41:42] isaacj, yep, that's what I'm doing now, I was just trying to understand if this is just noise or if there is some logic, because the numbers using that trick are different from the ones here: https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Directory/Science,_technology,_and_engineering_WikiProjects#Health_WikiProjects
[16:43:12] btw, this is an amazing dataset, I'll add it to my talks about resources
[16:43:16] at one point I did actually look for all the wikiproject medicine articles via categories but I was using the categories API and pretty sure i used that category too (All_WikiProject_Medicine_articles) but the category names are super wikiproject specific and often don't exist -- i.e. not all wikiprojects have a standard category to track their articles
[16:45:48] thanks dsaez ! in the next week or so hopefully i'm going to make an update too because Morten pointed to an alternative way that brings in even more articles. so you could actually also query the enwiki database on mariadb with the following query too: "select * from page_assessments where pa_project_id = 1;" (pa_project_id = 1 is for WikiProject Medicine; check page_assessments_projects for the full list)
[16:46:12] the page_assessments pointer was courtesy of Morten!
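For reference, a minimal sketch of the two lookups discussed above -- the tracking-category count and the page_assessments query -- run against an enwiki replica with pymysql. The replica host, the credentials file, and the bare COUNT(*) framing are assumptions for illustration; the actual Quarry query (14033) joins more tables.

```python
# Sketch only: counting WikiProject Medicine pages two ways on a Wiki Replicas
# enwiki database. Host and credentials below are placeholders for your own setup.
import pymysql

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",  # placeholder replica host
    read_default_file="~/replica.my.cnf",            # Toolforge-style credentials
    database="enwiki_p",
    charset="utf8mb4",
)

# Approach 1: the project's tracking category (note the "All_..." prefix that
# WikiProject Medicine uses, unlike many other projects).
CATEGORY_SQL = """
SELECT COUNT(*)
FROM categorylinks
WHERE cl_to = 'All_WikiProject_Medicine_articles'
"""

# Approach 2: the PageAssessments extension table (Morten's pointer).
# pa_project_id = 1 is WikiProject Medicine; see page_assessments_projects
# for the full project list.
ASSESSMENTS_SQL = """
SELECT COUNT(*)
FROM page_assessments
WHERE pa_project_id = 1
"""

with conn.cursor() as cur:
    for label, sql in [("categorylinks", CATEGORY_SQL),
                       ("page_assessments", ASSESSMENTS_SQL)]:
        cur.execute(sql)
        print(label, cur.fetchone()[0])
```

The category route only works when a project maintains an "All_..._articles" tracking category, which is part of why those counts drift from the template-based and page_assessments numbers.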
[16:46:24] (oh i already said that)
[16:46:48] oh, interesting
[16:47:09] the mariadb query returns 47133 wikipedia articles
[16:47:58] whereas the All_WikiProject_Medicine_articles category has 47297 articles, so some small difference
[16:48:24] i assume the category also tags non-article pages but wouldn't know without checking
[16:48:37] the weird thing is that the numbers in the directory link are completely different
[16:48:57] 28,825 articles
[16:49:11] gah.
[16:49:11] here: https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Directory/Science,_technology,_and_engineering_WikiProjects#Health_WikiProjects
[16:58:14] dsaez, looks like the WikiProject assessment table uses this category: https://en.wikipedia.org/wiki/Category:Medicine_articles_by_quality
[16:58:35] I'm guessing it queries all sub-categories.
[16:58:54] I see... cool thx
[17:00:56] yeah and weirdly that number hasn't changed in at least a year so i think the reporting there is broken
[17:01:07] (at the link you shared diego)
[17:20:42] oh, maybe it is ... manual? O:
[17:36:26] ooof i think you might be right. looks like Reports bot doesn't touch that column
[18:10:01] halfak and accraze sorry I jumped off the call, the cloud cover interrupted my connection. Enjoy your weekend!
[18:10:15] no worries kevinbazira, have a good weekend!
[18:29:24] Aha! I forgot to handle lower case!
[18:29:42] This might make a big difference for locations as the relevant terms tend to be capitalized!
[18:29:48] geography, I'm coming for you ;)
[18:39:07] heading out to lunch while this runs.
[19:41:04] Oops - "Could not find a version that satisfies the requirement enchant" (installing and making `editquality`). Does pip provide something equivalent to nodejs's `package-lock.json`?
[19:42:15] xinbenlv, there's a bit of complication with installing revscoring. Let me get you the guide.
[19:42:34] https://github.com/wikimedia/revscoring#ubuntu--debian
[19:42:47] A bunch of dictionaries and enchant.
[19:45:02] ok okk
[19:45:20] FITNESS THROUGH THE ROOF!
[19:45:22] YES
[19:45:31] ```
[19:45:31] Error: invalid option: --with-all-languages
[19:45:31] ```
[19:45:32] Topic model works amazingly!
[19:45:55] What produced that error, xinbenlv?
[19:45:58] This is my homebrew version
[19:45:58] ```
[19:45:58] Homebrew 2.2.3
[19:45:58] Homebrew/homebrew-core (git revision cb5e4; last commit 2020-01-17)
[19:45:59] ```
[19:46:47] Aha. I don't recommend installing this on macOS.
[19:47:05] Our working environment is Linux -- as we'd discussed on the phone.
[19:47:45] xinbenlv/revscoring#1 (patch-1 - 2c471a7 : xinbenlv): The build passed. https://travis-ci.com/xinbenlv/revscoring/builds/144995955
[19:47:50] isaacj! I now get pr_auc (micro=0.8, macro=0.672):
[19:48:19] The trick was to lowercase my words! I had a ton of words that weren't matching vectors because they had upper-case chars!
[19:48:59] xinbenlv, merged!
[19:49:35] @halfak, my work Debian is a Google-special versioned Linux, and it couldn't even install Git LFS, because it blocks regular Debian sources unless they're trusted by the company
[19:50:22] so I guess if I can't install it on my laptop, which is macOS, I will have to use a virtual machine, either on Google Cloud Compute Engine or a Dockerized version
[19:51:29] now I got the contributor badge for revscoring hahahahahaha
[19:51:40] Hmm. VM or docker might be the way to go.
[19:52:00] We have enough info to set up a docker/VM in our travis config.
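On the vector-matching fix mentioned at 18:29 and 19:48 above, here is a minimal, self-contained sketch of a case-folding fallback when looking tokens up in an embedding table. The `vectors` dict and `embed` helper are hypothetical stand-ins for illustration, not revscoring or fasttext APIs.

```python
# Sketch of falling back to the lowercased token before giving up on a lookup.
import numpy as np

vectors = {
    "medicine": np.random.rand(100),   # pretend 100-dimensional fasttext vectors
    "geography": np.random.rand(100),
}


def embed(token, vectors, dim=100):
    """Return the vector for a token, falling back to its lowercased form."""
    if token in vectors:
        return vectors[token]
    lowered = token.lower()              # "Medicine" -> "medicine"
    if lowered in vectors:
        return vectors[lowered]
    return np.zeros(dim)                 # out-of-vocabulary fallback


print(embed("Medicine", vectors)[:3])    # matches only because of the lowercasing
print(embed("Geography", vectors)[:3])
```

Falling back to the lowercased form (rather than lowercasing everything unconditionally) keeps exact-case matches when the vocabulary has them, while still catching capitalized terms like the location names mentioned above.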
[19:52:12] https://github.com/wikimedia/editquality/blob/master/.travis.yml
[19:52:39] I'm so excited about this topic modeling stuff. WOO
[19:53:38] still
[19:53:38] ```
[19:53:38] ImportError: No enchant-compatible dictionary found for 'id'. Consider installing 'aspell-id'.
[19:53:38] ```
[19:54:22] Yes, you'll need to install an Indonesian dictionary for enchant.
[19:54:34] aspell, myspell, and hunspell are all probably good.
[19:56:06] :face palm: but been there. i should check on mine too for that :)
[19:56:25] that's great news though! what type of model is this?
[19:59:17] Gradient boosting. 150 estimators, depth of 5.
[19:59:28] Using the supervised vectors from fasttext
[19:59:34] 100 dimensions
[21:06:40] nice -- i should test the 100-dimensional embeddings to see if they bump my performance much. do you know what the numbers are if you don't do the corrections for population_rate? i assume they look even better then?
[21:10:00] Most likely, yeah. Hmm. I guess I could just drop the pop-rate config and re-CV the model
[21:11:50] * halfak waits 15 minutes for his model to train >:(
[21:14:40] Actually more like 20 mins
[21:15:53] i think i was down to 5-10 minutes for the keras model on GPU but yeah :/
[21:50:12] Releasing revscoring 2.6.4
[21:56:36] Oh shoot. I should have released this before rebuilding the models :(
[21:56:52] here we go again
[21:57:50] Hmm. From pickle's output, it seems like we might get some weird memory issues with our refactoring of our word2vec datasource.
[21:57:55] I'll check.
[21:58:48] The current models are 48MB. If they get much bigger, we'll want to revert back to the old pattern.
[22:36:11] Yup. It's bad.
[22:36:18] Oof. There was a reason we were using that old pattern.
[22:36:54] OK I'm just going to revert to 2.6.3 and we'll release a 2.6.5 version with a revert of the change.
[22:37:45] 185MB vs. 48MB
[22:47:02] https://github.com/wikimedia/revscoring/pull/469
[22:49:07] wikimedia/revscoring#1819 (rollback_word2vec_init - 954da23 : halfak): The build failed. https://travis-ci.org/wikimedia/revscoring/builds/638658024
[22:49:16] curses.
[22:49:49] Oh interesting. A bunch of flake8 errors from the revert.
[22:54:56] wikimedia/revscoring#1821 (rollback_word2vec_init - 3deb694 : halfak): The build was fixed. https://travis-ci.org/wikimedia/revscoring/builds/638659800
[22:55:07] BAM
[22:55:14] OK almost ready with the topic models too.
[22:56:22] This is going to make our assets way smaller. Even after I add all 5 langs, it'll still be less than half the size.
[23:30:40] 1.6GB --> 604MB
[23:32:20] Scoring-platform-team (Current), drafttopic-modeling: Retrain enwiki drafttopic models on supervised vectors - https://phabricator.wikimedia.org/T243107 (Halfak)
[23:32:30] Scoring-platform-team (Current), drafttopic-modeling: Retrain enwiki drafttopic models on supervised vectors - https://phabricator.wikimedia.org/T243107 (Halfak) a: Halfak
[23:32:33] Scoring-platform-team (Current), drafttopic-modeling: Retrain enwiki drafttopic models on supervised vectors - https://phabricator.wikimedia.org/T243107 (Halfak) https://github.com/wikimedia/drafttopic/pull/45
[23:32:58] ORES, Scoring-platform-team (Current), drafttopic-modeling: Add new vectors to deployment assets - https://phabricator.wikimedia.org/T243108 (Halfak)
[23:35:17] I'm still waiting on assets to upload, but I've got to run.
[23:35:22] Have a good weekend, folks!
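On the 48MB-vs-185MB model size issue above, a toy illustration of the pickling trade-off. It assumes the refactored datasource ended up serializing the vector table with each model (an inference from the chat, not the actual revscoring code); the vector loader and class names here are invented for the sketch.

```python
# Illustrative only: why a model object that holds its vectors as instance state
# balloons the pickle, while a lazily resolved, process-level cache keeps only a
# short name in the pickled object.
import pickle

_VECTOR_CACHE = {}  # name -> {token: vector}, loaded once per process


def load_vectors(name):
    """Pretend loader; in practice this would read fasttext/word2vec data from disk."""
    if name not in _VECTOR_CACHE:
        _VECTOR_CACHE[name] = {"word": [0.0] * 100}  # stand-in for ~100-dim vectors
    return _VECTOR_CACHE[name]


class EmbeddedVectors:
    """Problem pattern: the whole vector table rides along in the pickle."""

    def __init__(self, name):
        self.vectors = load_vectors(name)


class NamedVectors:
    """Keeps only the name; the table is re-fetched from the cache after unpickling."""

    def __init__(self, name):
        self.name = name

    @property
    def vectors(self):
        return load_vectors(self.name)


print(len(pickle.dumps(EmbeddedVectors("en_100d"))))  # grows with the vector table
print(len(pickle.dumps(NamedVectors("en_100d"))))     # stays tiny regardless of table size
```

Pickling only a reference and re-resolving the vectors at load time is the kind of "old pattern" that keeps model files small, at the cost of an extra load step when the model is unpickled.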
[23:37:07] (PS1) Halfak: Replaces gnews vectors with ar,cs,en,ko, and viwiki fasttext vectors. [scoring/ores/assets] - https://gerrit.wikimedia.org/r/565700