[03:13:04] Jade, User-DannyS712, ci-test-error: Tests failing for the master branch of Jade - https://phabricator.wikimedia.org/T251854 (DannyS712)
[03:30:33] Jade, User-DannyS712, ci-test-error: Tests failing for the master branch of Jade - https://phabricator.wikimedia.org/T251854 (DannyS712) @ACraze until this is resolved, it is blocking work related to {T246284}, eg https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Jade/+/594343/ - can you take a...
[05:19:00] Jade, Scoring-platform-team: Render usernames in Jade edit comments. - https://phabricator.wikimedia.org/T248135 (kevinbazira) a:kevinbazira
[05:19:32] Jade, Scoring-platform-team (Current): Render usernames in Jade edit comments. - https://phabricator.wikimedia.org/T248135 (kevinbazira)
[06:11:00] (Abandoned) Ashuro07: Upgrade tests to WebdriverIO v-5 [extensions/ORES] - https://gerrit.wikimedia.org/r/590259 (https://phabricator.wikimedia.org/T248223) (owner: Ashuro07)
[10:15:01] Jade, User-DannyS712, ci-test-error: Tests failing for the master branch of Jade - https://phabricator.wikimedia.org/T251854 (hashar) Looks like Jade hasn't had much changes recently. It sounds like a regression in mediawiki/core.
[11:12:56] Scoring-platform-team, articlequality-modeling, artificial-intelligence: Text fetched by articlequality's `fetch_text` might not match the talk page label (for moved pages) - https://phabricator.wikimedia.org/T251608 (He7d3r) While the dumps are processed, we could store the `` of the talk pages...
[12:55:34] Scoring-platform-team, Outreach-Programs-Projects, Google-Summer-of-Code (2020), artificial-intelligence: Proposal (GSoC 2020): Implement articlequality model for ptwiki - https://phabricator.wikimedia.org/T247847 (Chtnnh)
[12:55:36] Scoring-platform-team (Current), artificial-intelligence: Add `words_to_watch` to articlequality and draftquality models in ptwiki - https://phabricator.wikimedia.org/T251171 (Chtnnh)
[12:55:38] ORES, Scoring-platform-team (Current), artificial-intelligence: Review model performance for ptwiki 'articlequality' and 'draftquality' - https://phabricator.wikimedia.org/T250809 (Chtnnh)
[12:55:40] Scoring-platform-team (Current), Wikilabels, editquality-modeling, artificial-intelligence: Build draft quality model for ptwikipedia - https://phabricator.wikimedia.org/T246667 (Chtnnh)
[12:55:42] Scoring-platform-team (Current), Wikilabels, articlequality-modeling, artificial-intelligence: Build article quality model for ptwikipedia - https://phabricator.wikimedia.org/T246663 (Chtnnh)
[13:20:51] hello halfak__
[13:21:02] you in yet?
[13:21:22] yes. But just preparing for a meeting.
[13:22:53] that's alright! have a couple things I need your help with. Let me know when you are available halfak
[14:44:22] ORES, Scoring-platform-team (Current), artificial-intelligence: Review model performance for ptwiki 'articlequality' and 'draftquality' - https://phabricator.wikimedia.org/T250809 (Halfak) p:Triage→Medium
[14:45:19] ORES, Scoring-platform-team (Current), artificial-intelligence: Write report about misclassification reports - https://phabricator.wikimedia.org/T251905 (Halfak)
[14:46:03] chtnnh, see https://phabricator.wikimedia.org/T251905. I think that's the right next step. I suspect that words_to_watch may help with some of these issues.
[14:46:28] do we not need to finalize what steps to take for w2w first tho halfak ?
[14:46:49] Good question. You could use the model to re-score some of the misclassifications.
[14:46:56] And check to see if it does better.
[14:47:18] Right now, the fitness statistics don't look better, but I'm interested in merging if we think there's reason to believe that it works better in practice.
[14:48:21] alright then maybe we can merge this code and mark the corresponding tasks as resolved with proper reasons as to why we did so despite the low increase in fitness and then we can create new tasks for the update to w2w
[14:48:39] sorry my sentences are too long :O
[14:49:09] halfak ^
[14:49:51] "create new tasks for the update to w2w"?
[14:50:05] Like removing some words that commonly show up in high quality articles?
[14:50:35] yes exactly and also see if we can tweak the representation of w2w in the feature_list to gain performance
[14:50:52] what do you think?
[14:51:22] Right now, we have no evidence that w2w makes an improvement.
[14:52:53] we can however merge the code as you suggested, clearly documenting it as an experiment to see whether it helps reduce the misclassifications
[14:53:15] We don't need to merge for you to test. Look at the "revscoring score" utility
[14:53:29] You can run the model on revisions and look at the prediction outside of ORES.
[14:54:38] wow that's cool! let me try doing that
[15:05:57] Jade, Scoring-platform-team, Documentation, User-srodlund: Review and improve mw:Jade - https://phabricator.wikimedia.org/T206150 (srodlund) Open→Resolved @Halfak I did a quick glance, and it looks good! :-) I'm going to move this to resolve since it's been off my radar for a while.
[15:17:46] halfak , what mediawiki api should i use for extracting features?
[15:34:17] revscoring score takes care of that chtnnh
[15:34:33] Oh. Like the host-name?
[15:34:39] https://pt.wikipedia.org
[15:35:09] right! thank you
[15:54:03] Jade, Scoring-platform-team (Current), Documentation, User-srodlund: Review and improve mw:Jade - https://phabricator.wikimedia.org/T206150 (Halfak) Makes sense. Thanks!
[16:08:17] https://gist.github.com/chtnnh/15a77653279d50a0b90179aa83db4fca
[16:08:30] halfak the results are interesting ^
[16:09:49] Looks like some general improvements with those specific examples.
[16:10:10] yes that is right
[16:10:18] ORES, Scoring-platform-team (Current), artificial-intelligence: Write report about misclassification reports - https://phabricator.wikimedia.org/T251905 (Chtnnh) https://gist.github.com/chtnnh/15a77653279d50a0b90179aa83db4fca This is the difference between model performance before and after adding w...
[16:10:20] what do you think caused it
[16:11:29] Quite possibly it is words_to_watch. It could also be improvements in the data pipeline.
[16:11:36] Could you try with the model that is currently in master?
[16:11:57] yes gimme a minute
[16:13:27] the copy of master i have doesn't have the model built already, i will have to build the model first. do you want me to do it?
[16:13:59] Should have it.
[16:15:18] yeah it's weird
[16:15:26] idk :/
[16:24:30] what should I do now senpai?
[16:24:34] halfak ^
[17:09:37] Scoring-platform-team (Research), Outreach-Programs-Projects, Google-Summer-of-Code (2020), artificial-intelligence: Proposal (GSoC 2020): Implement an NSFW image classifier with open_nsfw - https://phabricator.wikimedia.org/T247614 (Mholloway) A Gerrit repository has been created for the open-ns...
[17:23:42] o/ chtnnh
[17:23:45] Just got out of meetings.
[17:24:05] but i think you must be running for lunch now?
[17:24:20] So help me understand what you mean when you say "master doesn't have the model built".
[17:24:29] I'd like to but I want to unblock you first :)
[17:25:29] when i switch to master and run the revscoring score utility, it returns no such file or directory for the model
[17:25:49] and when i go through the makefile, i see no portuguese wikipedia in it :/
[17:31:15] Which repo?
[17:31:21] articlequality
[17:31:50] https://github.com/wikimedia/articlequality/blob/master/Makefile#L521
[17:33:00] https://gist.github.com/chtnnh/3b702556f569d9bcf72d99010ce7d16a
[17:33:32] Hello halfak
[17:33:36] Does
[17:33:38]
[17:33:43] count as a tag or a ref?
[17:34:01] The tokenizer currently catches it as a tag, instead of a ref.
[17:34:14] So I'm trying to confirm if this is an expected behavior
[17:34:40] I would say that should be a ref
[17:35:11] (and also its self-closed version: )
[17:35:15] Thoughts too. I guess I'll go with that.
[17:35:20] True
[17:35:27] Thanks Helder
[17:37:10] It should count as both.
[17:37:31] But ref is more specific than tag.
[17:37:44] Oh! tokenizer-wise should be a "ref_open"
[17:37:46] Looked deeper into the text now
[17:37:56] Yes. It's a ref_open.
[17:38:00] cool :)
[17:38:13] The closing ref was quite far from the opening one, so I didn't pick initially
[17:39:12] Scoring-platform-team, articlequality-modeling, artificial-intelligence: Text fetched by articlequality's `fetch_text` might not match the talk page label (for moved pages) - https://phabricator.wikimedia.org/T251608 (Halfak) I like it. Thanks for the PR. :)
[17:45:51] halfak ^
[17:45:56] https://gist.github.com/chtnnh/3b702556f569d9bcf72d99010ce7d16a
[17:46:58] That's very weird. It's there. https://github.com/wikimedia/articlequality/blob/master/models/ptwiki.wp10.gradient_boosting.model
[17:47:08] Can you see it in the models/ folder?
[17:47:37] maybe it is related to git-lfs?
[17:47:52] no i do not see it in the models/ folder
[17:48:25] i even performed a git pull upstream
[17:50:09] i switched to upstream/master
[17:50:13] this one now has the model
[17:50:16] running on this
[17:51:13] Nice.
[17:51:17] OK heading to lunch
[17:57:24] https://gist.github.com/chtnnh/15a77653279d50a0b90179aa83db4fca
[19:01:43] chtnnh, looks good. There's definitely something different going on here. In the examples where the new prediction does better than master, I'd like to review those examples.
[19:02:07] Could you make a table on the wiki?
[19:02:26] Each column would be a model and each row is a misclassification.
[19:02:28] i could definitely try :D
[19:10:19] Scoring-platform-team (Research), Outreach-Programs-Projects, Google-Summer-of-Code (2020), artificial-intelligence: Proposal (GSoC 2020): Implement an NSFW image classifier with open_nsfw - https://phabricator.wikimedia.org/T247614 (Pavithraes) Open→Declined @Chtnnh Congratulations on ge...
[19:11:08] Scoring-platform-team, articlequality-modeling, artificial-intelligence: Text fetched by articlequality's `fetch_text` might not match the talk page label (for moved pages) - https://phabricator.wikimedia.org/T251608 (Halfak) Just left some notes there.
[19:14:26] Scoring-platform-team (Research), Outreach-Programs-Projects, Google-Summer-of-Code (2020), artificial-intelligence: Proposal (GSoC 2020): Implement an NSFW image classifier with open_nsfw - https://phabricator.wikimedia.org/T247614 (Chtnnh) @Pavithraes Thank you! 😄 @Mholloway this task continu...
[19:17:37] kevinbazira, was just looking back at the async standup. Looks like the edit comment stuff is really coming together :)
[19:17:57] It's exciting stuff.
[19:18:58] brb
[19:20:03] https://www.mediawiki.org/wiki/ORES/Issues/Article_quality#Portuguese_Wikipedia halfak here is a preview, is this what you want
[19:32:26] chtnnh, I was thinking you might make in the bottom ^_^
[19:32:34] And just include the IDs and the scores.
[19:32:43] But good job getting wiki tables working :)
[19:32:43] for sure
[19:32:48] xD
[19:33:13] What do you think about having the original score in the left most column and adding iterations to it as we work on the models.
[19:33:46] See the bottom of this page: https://www.wikidata.org/wiki/Wikidata:ORES/Report_mistakes
[19:33:56] yeah sure i will do that. I will revert my changes first then add a table at the bottom with model iterations
[19:34:14] Great :)
[19:34:17] wow that is so much better xD
[19:57:00] https://www.mediawiki.org/wiki/ORES/Issues/Article_quality#Portuguese_Wikipedia
[19:57:15] halfak have a look please ^
[20:24:44] Looks great!
[21:34:53] ORES, Scoring-platform-team (Current), artificial-intelligence: Write report about misclassification reports - https://phabricator.wikimedia.org/T251905 (Halfak) I just reviewed @chtnnh's post at https://www.mediawiki.org/wiki/ORES/Issues/Article_quality#Summary I made some modifications to the tabl...
[22:16:48] halfak:
[22:17:02] Any idea how often we come across non-http or https urls in wiki text
[22:17:07] ?
[22:17:32] It's got to be super rare.
[22:17:43] I don't know if I have seen it ever.
[22:17:53] Okay. Great then.
[22:18:07] I ask because the regex for urls has been picking two urls as one
[22:18:21] so I intend using http and https directly
[22:18:52] Current performance:
[22:18:54] We can tokenize 6.473469273228868 Alan Turing's per second with wsplit
[22:19:01] We can tokenize 9.006297144220843 Alan Turing's per second with wsplit_lex_2
[22:19:16] And wsplit_lex_2 picks up more tokens too
[22:20:52] haksoat, can you give an example of an url which is "non-http or https"?
[22:21:12] Hello Helder
[22:21:39] The regex currently takes bitcoin, geo, magnet, mailto, etc urls into consideration
[22:22:03] Jade, User-DannyS712, ci-test-error: Tests failing for the master branch of Jade - https://phabricator.wikimedia.org/T251854 (ACraze) @DannyS712 this might be related to a WIP patchset https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Jade/+/591503 where some tables were accidentally dropped duri...
[22:22:31] haksoat, how about relative urls? E.g.: the ones in these search results: https://pt.wikipedia.org/w/index.php?sort=relevance&search=insource%3A%2F%5C%5B%5C%2F%5C%2Fen.wikipedia%2F&title=Especial:Pesquisar&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1&ns1=1&ns2=1&ns3=1&ns4=1&ns5=1&ns6=1&ns7=1&ns8=1&ns9=1&ns10=1&ns11=1&ns12=1&ns13=1&ns14=1&ns15=1&ns100=1&ns101=1&ns104=1&ns105=1&ns446=1&ns447=1&ns710=1&ns711=1&ns828=1&ns829=1&ns2300=1&ns2301=1&ns2302=1&ns2303=1
[22:22:45] does that count?
[22:22:58] E.g: [//en.wikipedia.org/wiki/Aftenposten Aftenposten]
[22:23:32] I don't think it considers relative urls
[22:23:38] I'll take a look
[22:23:43] Before that though...
[22:23:48] It picked:
[22:23:49] https://web.archive.org/web/20171209152236/https://www.repository.cam.ac.uk/handle/1810/245090|archivedate=9
[22:23:52] To be one url
[22:24:10] Hence, I am trying to split at the 'https' point
[22:24:51] Which is why I asked if we usually have other types show up a lot.
[22:24:54] but that *is* one, isn't it?
[22:25:04] (an archived one)
[22:25:55] Hmmmm
[22:26:05] more specifically, only this part
[22:26:05] https://web.archive.org/web/*/https://www.repository.cam.ac.uk/handle/1810/245090
[22:26:25] the rest is part of some template parameter, I think
[22:26:55] I get you now
[22:27:04] I'll take another look
[22:42:39] Helder: you are right
[22:43:09] It's only the links from the web archive that look that way
[22:45:46] Indeed
[22:48:24] But the |archivedate=9
[22:48:31] isn't supposed to be caught right?
[22:50:31] Helder:
[23:08:40] Seen. It's from the template.
[23:13:43] yep. That is a template parameter, I think
[23:14:25] Yeah
[23:15:10] Also the relative urls usually take the form
[23:15:11] //en.wikipedia.org/wiki/Aftenposten Aftenposten
[23:15:14] right?
[23:15:30] talking about the //en...org part
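
Editor's note: the URL points discussed above (a web archive link with an embedded `https://` is a single URL, `|archivedate=9` is a template parameter and should not be captured, and relative urls start with `//`) can be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions: `URL_RE`, `SCHEMES`, `find_urls` and `SAMPLE` are hypothetical names, and the pattern is not the regex the tokenizer actually uses.

```python
import re

# Hypothetical pattern for illustration; not the tokenizer's actual regex.
# The scheme list mirrors the ones named in the log (bitcoin, geo, magnet, mailto).
SCHEMES = r"(?:https?|ftp|mailto|magnet|geo|bitcoin)"
URL_RE = re.compile(
    r"(?:\b" + SCHEMES + r":(?://)?"   # absolute URL with a known scheme
    r"|(?<![\w:/])//)"                 # or a protocol-relative URL such as //en.wikipedia.org
    r"[^\s|\[\]<>{}]+",                # stop at whitespace and wikitext delimiters like | ] }
    re.IGNORECASE,
)


def find_urls(wikitext):
    """Return every URL-like span found in a piece of wikitext."""
    return [match.group(0) for match in URL_RE.finditer(wikitext)]


# Sample built from the two examples quoted in the log.
SAMPLE = (
    "{{cite web|archiveurl=https://web.archive.org/web/20171209152236/"
    "https://www.repository.cam.ac.uk/handle/1810/245090|archivedate=9}} "
    "[//en.wikipedia.org/wiki/Aftenposten Aftenposten]"
)

if __name__ == "__main__":
    for url in find_urls(SAMPLE):
        print(url)
```

Because `:` and `/` stay inside the allowed character class, the archived link is kept whole, while excluding `|` keeps the template parameter out of the match. Whether this fits the real tokenizer's lexicon is an open question.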