[11:10:59] Scoring-platform-team, Outreach-Programs-Projects, Google-Summer-of-Code (2020), artificial-intelligence: Proposal (GSoC 2020): Implement articlequality and draftquality model for ptwiki and apply insights to models for bs, uk, hi wikis - https://phabricator.wikimedia.org/T247847 (Chtnnh)
[13:41:53] o/
[14:02:16] Hello halfak
[14:03:12] I intend to push the changes made to wiki_split's regex to the deltas package
[14:03:16] Is this fine?
[14:18:33] Yes. Put a PR together and I'll take a look.
[14:18:35] haksoat, ^
[14:19:12] Would be nice to get this included in a new version of revscoring and make a cross-cutting update. We have some other new features that revscoring needs soon.
[14:30:58] o/ haksoat1! Not sure if you saw it, but I said yes to submitting a PR. I'll make time to review it today. I have some thoughts about getting this deployed soon too.
[14:59:55] haksoat, connection troubles?
[15:00:07] I've been responding to you but I'm not sure if you saw it.
[15:00:34] FWIW, you can always see what you might have missed here: https://wm-bot.wmflabs.org/browser/index.php?display=%23wikimedia-ai
[15:00:38] The channel is publicly logged.
[17:17:39] kevin:
[17:17:56] Y:
[17:17:56] Localized number of endorsements in edit comments
[17:17:56]     Used the wfMessage plural syntax to localize when it's one endorsement or many endorsements.
[17:18:04] T:
[17:18:04] Thanks to Andy for clearing the Jenkins build issue, I have pushed to gerrit all patchsets that were blocked.
[17:18:04] These include:
[17:18:04]     - Localized Jade history page comment prefixes
[17:18:05]     - Localized Jade history page comment facet and labeldata parts
[17:18:05]     - Made comment prefixes bold
[17:18:05]     - Replaced user ID with user name in edit comments
[17:18:05]     - Localized number of endorsements in edit comments
[17:18:16] Aaron:
[17:18:34] Y: Worked on updating the designs for 2ndary integrations on Special:Diff, undo, and rollback.
[17:18:46] T: Presented diffs and RC filters designs to the design team. Finishing up 2ndary integrations for diff/undo/rollback. Reaching out to people re. mentoring newcomers for the hackathon this weekend.
[17:18:52] Andy:
[17:19:01] Y: Got the Jade db hooks patchset merged (thanks Kevin!), things seem to be working on Beta and Jenkins seems to be back to normal. Also did a bit of work on improving ORES sphinx docs
[17:19:08] T: Gonna finish up ORES sphinx docs and add them to CI. Also will do some code review for Kevin and also take a look at the ORES CapEx estimates.
[17:19:13] haksoat:
[17:19:24] Y:
[17:19:24] I tested the new regex in elasticsearch. It gave better performance than the previous regex, but overall elasticsearch performance is still not good enough.
[17:19:32] T: Working to raise a PR for the new regex in the deltas package. Checking the tests now, and some tests fail because word tokens appear before cjk and japan_punct in the regex order. To fix this, I have to move the cjk and japan_punct tokens ahead of the word tokens. But this drops the performance again. I am trying to see which generic pattern in the word regex causes it to match cjk and japan_punct content.
[17:45:19] Thanks haksoat!
[17:53:12] halfak: I just opened a PR on the deltas package
[17:59:53] Checking it out now.
[18:00:00] In the 2 minutes I have between meetings ^_^
[18:06:38] Ok
[18:16:31] haksoat, left a note re testing. Otherwise, it looks great.
[18:16:47] Is there somewhere I can read about the speed up you were able to achieve?
[18:23:27] Here, halfak https://gist.github.com/HAKSOAT/c142a1f41d8f6ada8b3ff6d9c400503d
[18:23:47] Currently writing up something on phabricator though
[18:53:57] Scoring-platform-team, articlequality-modeling, artificial-intelligence: Extracted labels might not be accurate when there are multiple reverts - https://phabricator.wikimedia.org/T252152 (He7d3r)
[18:58:27] Scoring-platform-team, articlequality-modeling, artificial-intelligence: Extracted labels might not be accurate when there are multiple reverts - https://phabricator.wikimedia.org/T252152 (He7d3r) See https://github.com/wikimedia/articlequality/pull/127 for a possible solution.
[19:45:11] Scoring-platform-team (Current), Discovery-Search, Elasticsearch, revscoring, artificial-intelligence: Improve the performance and quality of tokenization in revscoring - https://phabricator.wikimedia.org/T248480 (HAKSOAT) I opened a [[ https://github.com/halfak/deltas/pull/11 | pull request...
[19:45:29] halfak: https://phabricator.wikimedia.org/T248480
[19:45:36] I made a comment on Phabricator
[19:45:45] Working on the tests now
[20:05:33] Cool. Thanks!
[20:22:57] halfak: I have modified the tests
[20:23:19] Do I have to show the previous regex mismatching tokens?
[20:23:41] That will require having two lexicons
[20:35:03] haksoat, no you don't. I think it's fine to just have a demonstration that we match with the new ones.
[20:35:16] Sorry for the late reply. I'm overloaded on pings right now :)
[20:38:50] Merged! Nice work.
[20:41:02] Ooops
[20:41:06] Well done
[20:41:14] :)
[20:51:01] I'm working to get CI set up on there, and then I'll cut a new version for deltas.
[20:51:45] Great
[21:01:29] In the meantime, I'm curious to see a demo of the overall performance improvement :)
[21:09:41] Hehe
[21:09:52] When you get a chance, can you go through the comment on Phabricator?
[21:10:47] I'm thinking we could remove the cjk regex and have a different tokenizer for cjk content. Not sure how good a decision that would be for getting more performance.
[21:11:38] Hmm. We could potentially drop it.
[21:11:52] We'd have to re-think part of revscoring to make it work.
[21:12:36] Okay.
[21:13:44] As an alternative, we could add CJK to "word". What do you think of that?
[21:14:05] At the very least, we should probably include Japanese and Korean in word anyway.
[21:14:16] It's Chinese that needs a more clever tokenization strategy.
[21:17:25] That could work. I don't fully know how Chinese characters work though.
[21:17:59] I could probably look into that sometime later, to see how tokenization can be done on them.
[21:18:31] For right now, I think we should merge your changes.
[21:19:00] And get our models rebuilt. This will give us a nice performance/quality improvement. But I think looking into this as a next step is awesome.
[21:19:41] BTW, what languages are you familiar with? I wonder if we could take advantage of your not being a stupid American like me ^_^
[21:20:47] Oof. Deltas is going to need some work to get the flake8 tests passing.
[21:20:57] I just added CI to it.
[21:21:04] Haha... Not a lot. Just English, Yoruba. I can read and write Arabic, but I'm not fully familiar with the meaning of many words.
[21:21:11] Okay
[21:21:17] How can I help?
[21:21:39] If you would be interested, I'd love it if you would take a pass over deltas and clean up the flake8 issues.
[21:22:07] If you install flake8 and run this command in the base dir:
[21:22:08] flake8 . --max-line-length=85 --exclude=.svn,CVS,.bzr,.hg,.git,__pycache__,.tox,.eggs,*.egg,docs
[21:22:31] It'll show you all of the issues. Looks like I mostly put together this library before I was familiar with flake8 :|
[21:23:03] I'm sorry to suggest you should clean up my mess. "No way man" is a totally reasonable response.
[21:23:19] But re languages, I wonder if we have a Yoruba Wikipedia community.
[21:24:14] Arabic is very valuable too. I struggle to read the regular expressions we use for Arabic. Maybe you could help vet them. I don't think they have gotten much love.
[21:24:34] I don't want to throw a bunch of stuff on your plate.
[21:24:56] So take this as "thinking out loud" and not "Haksoat should do all these things". So let's talk more about that later :)
[21:25:38] Hehe
[21:25:49] I could take a look at the deltas package tomorrow
[21:26:21] Help with some of the issues
[21:26:36] Yeah. I'm interested in the Arabic part.
[21:26:45] Cool :) Will keep that in mind
[21:27:09] Going off here now... hehe. Talk to you tomorrow.
[21:27:13] :)
[21:28:38] have a good night!
[21:30:06] I'm out too.
[21:30:10] Take care all.
[22:06:37] ORES, Scoring-platform-team (Current), Documentation: Automate Sphinx docs for ORES repo - https://phabricator.wikimedia.org/T252173 (ACraze)
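
Note on the regex-order issue raised at 17:19:32: the sketch below is a minimal illustration, not the deltas code. It assumes a tokenizer built as one alternation of named groups, where the first alternative that matches at a position wins. The patterns and the names `CJK`, `JAPAN_PUNCT`, `WORD`, and `build_tokenizer` are hypothetical simplifications. It shows why a generic word pattern placed first can swallow CJK text: in Python 3, `\w` matches CJK characters too.

```python
import re

# Hypothetical, simplified lexicon. These are NOT the patterns used in deltas.
CJK = r"[\u4e00-\u9fff\u3040-\u30ff]"   # a single CJK ideograph or kana
JAPAN_PUNCT = r"[\u3000-\u303f]"        # CJK Symbols and Punctuation block
WORD = r"\w+"                           # in Python 3, \w also matches CJK characters
WHITESPACE = r"\s+"


def build_tokenizer(lexicon):
    """Compile (name, pattern) pairs into one alternation.

    Alternatives are tried left to right, so earlier entries take priority.
    """
    combined = "|".join(
        "(?P<%s>%s)" % (name, pattern) for name, pattern in lexicon
    )
    regex = re.compile(combined)

    def tokenize(text):
        # m.lastgroup is the name of the alternative that matched
        return [(m.lastgroup, m.group()) for m in regex.finditer(text)]

    return tokenize


word_first = build_tokenizer([
    ("word", WORD), ("cjk", CJK),
    ("japan_punct", JAPAN_PUNCT), ("whitespace", WHITESPACE),
])
cjk_first = build_tokenizer([
    ("cjk", CJK), ("japan_punct", JAPAN_PUNCT),
    ("word", WORD), ("whitespace", WHITESPACE),
])

text = "wiki 日本語。"
print(word_first(text))  # the CJK run is labelled as a single "word" token
print(cjk_first(text))   # each CJK character is labelled "cjk"
```

In this toy setup, moving the CJK patterns ahead of `word` fixes the labels, which mirrors the fix described in the log. The log also reports that the reordering cost performance in the real package; this sketch does not measure that.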