[11:10:59] <wikibugs>	 Scoring-platform-team, Outreach-Programs-Projects, Google-Summer-of-Code (2020), artificial-intelligence: Proposal (GSoC 2020): Implement articlequality and draftquality model for ptwiki and apply insights to models for bs, uk, hi wikis - https://phabricator.wikimedia.org/T247847 (Chtnnh)
[13:41:53] <halfak>	 o/
[14:02:16] <haksoat>	 Hello halfak
[14:03:12] <haksoat>	 I intend to push the changes made to wiki_split's regex to the deltas package
[14:03:16] <haksoat>	 Is this fine?
[14:18:33] <halfak>	 Yes.  Put a PR together and I'll take a look. 
[14:18:35] <halfak>	 haksoat, ^ 
[14:19:12] <halfak>	 Would be nice to get this included in a new version of revscoring and make a cross-cutting update.  We have some other new features that revscoring needs soon. 
[14:30:58] <halfak>	 o/ haksoat1!  Not sure if you saw it, but I said yes to submitting a PR.  I'll make time to review it today.  I have some thoughts about getting this deployed soon too. 
[14:59:55] <halfak>	 haksoat, connection troubles? 
[15:00:07] <halfak>	 I've been responding to you but I'm not sure if you saw it. 
[15:00:34] <halfak>	 FWIW, you can always see what you might have missed here: https://wm-bot.wmflabs.org/browser/index.php?display=%23wikimedia-ai
[15:00:38] <halfak>	 The channel is publicly logged. 
[17:17:39] <haksoat>	 kevin:
[17:17:56] <haksoat>	 Y:
[17:17:56] <haksoat>	 Localized number of endorsements in edit comments
[17:17:56] <haksoat>	     Used the wfMessage plural syntax to localize when it's one endorsement or many endorsements.
[17:18:04] <haksoat>	 T:
[17:18:04] <haksoat>	 Thanks to Andy for clearing the Jenkins build issue, I have pushed to gerrit all patchsets that were blocked.
[17:18:04] <haksoat>	 These include:
[17:18:04] <haksoat>	     - Localized Jade history page comment prefixes
[17:18:05] <haksoat>	     - Localized Jade history page comment facet and labeldata parts	
[17:18:05] <haksoat>	     - Made comment prefixes bold
[17:18:05] <haksoat>	     - Replaced user ID with user name in edit comments
[17:18:05] <haksoat>	     - Localized number of endorsements in edit comments
[17:18:16] <haksoat>	 Aaron:
[17:18:34] <haksoat>	 Y: Worked on updating the designs for 2ndary integrations on Special:Diff, undo, and rollback.  
[17:18:46] <haksoat>	 T: Presented diffs and RC filters designs to the design team. Finishing up 2ndary integrations for diff/undo/rollback. Reaching out to people re. mentoring newcomers for the hackathon this weekend.
[17:18:52] <haksoat>	 Andy:
[17:19:01] <haksoat>	 Y: Got the Jade db hooks patchset merged (thanks Kevin!), things seem to be working on Beta and Jenkins seems to be back to normal. Also did a bit of work on improving ORES sphinx docs
[17:19:08] <haksoat>	 T: Gonna finish up ORES sphinx docs and add them to CI. Also will do some code review for Kevin and also take a look at the ORES CapEx estimates.
[17:19:13] <haksoat>	 haksoat:
[17:19:24] <haksoat>	 Y:
[17:19:24] <haksoat>	 I tested the new regex in elasticsearch. It gave better performance than the previous regex, but overall elasticsearch performance is still not good enough.
[17:19:32] <haksoat>	 T: Working to raise a PR for the new regex in the deltas package. Checking the tests now, and some tests fail because word tokens appear before cjk and japan_punct in the regex order. To fix this, I have to move the cjk and japan_punct tokens so they appear before word tokens. But this drops the performance again. I am trying to see what generic pattern in the word regex causes it to match cjk and japan_punct content.
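(Editor's note) The ordering problem described above can be sketched in a few lines. This is only an illustration, assuming a tokenizer built as one alternation of named groups; the patterns and the helper names `build_tokenizer` and `tokenize` are hypothetical and are not the actual deltas lexicon. In this simplified version only the cjk pattern is shadowed by word; the real word regex apparently also overlaps japan_punct.

```python
import re

# Hypothetical, simplified patterns; NOT the actual deltas lexicon.
WORD = r"\w+"  # in Python 3, \w also matches CJK ideographs and kana
CJK = r"[\u4e00-\u9fff\u3040-\u30ff]+"
JAPAN_PUNCT = r"[\u3000-\u303f]"


def build_tokenizer(lexicon):
    """Compile (name, pattern) pairs into one alternation of named groups."""
    return re.compile(
        "|".join(f"(?P<{name}>{pattern})" for name, pattern in lexicon)
    )


def tokenize(regex, text):
    """Return (token_type, token_text) pairs in order of appearance."""
    return [(m.lastgroup, m.group()) for m in regex.finditer(text)]


# Alternatives are tried left to right, so the first one that matches wins.
word_first = build_tokenizer(
    [("word", WORD), ("cjk", CJK), ("japan_punct", JAPAN_PUNCT)]
)
cjk_first = build_tokenizer(
    [("cjk", CJK), ("japan_punct", JAPAN_PUNCT), ("word", WORD)]
)

text = "wiki 日本語。"
print(tokenize(word_first, text))  # the CJK run is labelled "word"
print(tokenize(cjk_first, text))   # the CJK run is labelled "cjk"
```

Putting the narrower patterns first fixes the labels, but every position in plain Latin text then has to fail the cjk and japan_punct alternatives before word is tried, which is one plausible reason for the slowdown mentioned above.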
[17:45:19] <halfak>	 Thanks haksoat!
[17:53:12] <haksoat>	 halfak: I just opened a PR on the deltas package
[17:59:53] <halfak>	 Checking it out now. 
[18:00:00] <halfak>	 In the 2 minutes I have between meetings ^_^
[18:06:38] <haksoat>	 Ok
[18:16:31] <halfak>	 haksoat, left a note re testing.  Otherwise, it looks great. 
[18:16:47] <halfak>	 Is there somewhere I can read about the speed up you were able to achieve? 
[18:23:27] <haksoat>	 Here, halfak https://gist.github.com/HAKSOAT/c142a1f41d8f6ada8b3ff6d9c400503d
[18:23:47] <haksoat>	 Currently writing up something on phabricator though
[18:53:57] <wikibugs>	 Scoring-platform-team, articlequality-modeling, artificial-intelligence: Extracted labels might not be accurate when there are multiple reverts - https://phabricator.wikimedia.org/T252152 (He7d3r)
[18:58:27] <wikibugs>	 Scoring-platform-team, articlequality-modeling, artificial-intelligence: Extracted labels might not be accurate when there are multiple reverts - https://phabricator.wikimedia.org/T252152 (He7d3r) See https://github.com/wikimedia/articlequality/pull/127 for a possible solution.
[19:45:11] <wikibugs>	 Scoring-platform-team (Current), Discovery-Search, Elasticsearch, revscoring, artificial-intelligence: Improve the performance and quality of tokenization in revscoring - https://phabricator.wikimedia.org/T248480 (HAKSOAT) I opened a [[ https://github.com/halfak/deltas/pull/11 | pull request...
[19:45:29] <haksoat>	 halfak: https://phabricator.wikimedia.org/T248480
[19:45:36] <haksoat>	 I made a comment on Phabricator
[19:45:45] <haksoat>	 Working on the tests now
[20:05:33] <halfak>	 Cool.  Thanks!
[20:22:57] <haksoat>	 halfak: I have modified the tests
[20:23:19] <haksoat>	 Do I have to show the previous regex mismatching tokens?
[20:23:41] <haksoat>	 That will require having two lexicons
[20:35:03] <halfak>	 haksoat, no you don't.  I think it's fine to just have demonstration that we match with the new ones. 
[20:35:16] <halfak>	 Sorry for the late reply.  I'm overloaded on pings right now :) 
[20:38:50] <halfak>	 Merged!  Nice work. 
[20:41:02] <haksoat>	 Ooops
[20:41:06] <haksoat>	 Well done
[20:41:14] <haksoat>	 :)
[20:51:01] <halfak>	 I'm working to get CI set up on there and then, I'll cut a new version for deltas. 
[20:51:45] <haksoat>	 Great
[21:01:29] <halfak>	 In the meantime, I'm curious to see a demo of the overall performance improvement :) 
[21:09:41] <haksoat>	 Hehe
[21:09:52] <haksoat>	 When you get a chance, can you go through the comment on Phabricator?
[21:10:47] <haksoat>	 I'm thinking we could remove the cjk regex and have a different tokenizer for cjk content. Not sure how much of a good decision that will be to get more performance.
[21:11:38] <halfak>	 Hmm.  We could potentially drop it.  
[21:11:52] <halfak>	 We'd have to re-think part of revscoring to make it work. 
[21:12:36] <haksoat>	 Okay. 
[21:13:44] <halfak>	 As an alternative, we could add CJK to "word".  What do you think of that? 
[21:14:05] <halfak>	 At the very least, we should probably include Japanese and Korean in word anyway. 
[21:14:16] <halfak>	 It's Chinese that needs a more clever tokenization strategy. 
[21:17:25] <haksoat>	 That could work. I don't fully know how Chinese characters work though.
[21:17:59] <haksoat>	 I could probably look into that sometime later, to see how tokenization can be done on them.
[21:18:31] <halfak>	 For right now, I think we should merge your changes. 
[21:19:00] <halfak>	 And get our models rebuilt.  This will give us a nice performance/quality improvement.  But I think looking into this as a next step is awesome. 
[21:19:41] <halfak>	 BTW, what languages are you familiar with?  I wonder if we could take advantage of your not being a stupid American like me ^_^ 
[21:20:47] <halfak>	 Oof.  Deltas is going to need some work to get the flake8 tests passing. 
[21:20:57] <halfak>	 I just added CI to it.  
[21:21:04] <haksoat>	 Haha... Not a lot. Just English, Yoruba. I can read and write Arabic, but I'm not fully familiar with the meaning of many words.
[21:21:11] <haksoat>	 Okay
[21:21:17] <haksoat>	 How can I help?
[21:21:39] <halfak>	 If you would be interested, I'd love it if you would take a pass over deltas and clean up the flake8 issues. 
[21:22:07] <halfak>	 If you install flake8 and run this command in the base dir:
[21:22:08] <halfak>	 flake8 . --max-line-length=85 --exclude=.svn,CVS,.bzr,.hg,.git,__pycache__,.tox,.eggs,*.egg,docs
[21:22:31] <halfak>	 It'll show you all of the issues.  Looks like I mostly put together this library before I was familiar with flake8 :| 
[21:23:03] <halfak>	 I'm sorry to suggest you should clean up my mess.  "No way man" is a totally reasonable response. 
[21:23:19] <halfak>	 But re languages, I wonder if we have a Yoruba Wikipedia community. 
[21:24:14] <halfak>	 Arabic is very valuable too.  I struggle to read the regular expressions we use for Arabic.  Maybe you could help vet them.  I don't think they have gotten much love. 
[21:24:34] <halfak>	 I don't want to throw a bunch of stuff on your plate. 
[21:24:56] <halfak>	 So take this as "thinking out loud" and not "Haksoat should do all these things". So let's talk more about that later :) 
[21:25:38] <haksoat>	 Hehe
[21:25:49] <haksoat>	 I could take a look at the deltas package tomorrow
[21:26:21] <haksoat>	 Help with some of the issues
[21:26:36] <haksoat>	 Yeah. I'm interested in the Arabic part.
[21:26:45] <halfak>	 Cool :)  Will keep that in mind 
[21:27:09] <haksoat>	 Going off here now... hehe. Talk to you tomorrow.
[21:27:13] <haksoat>	 :)
[21:28:38] <halfak>	 have a good night!
[21:30:06] <halfak>	 I'm out too.  
[21:30:10] <halfak>	 Take care all. 
[22:06:37] <wikibugs>	 ORES, Scoring-platform-team (Current), Documentation: Automate Sphinx docs for ORES repo - https://phabricator.wikimedia.org/T252173 (ACraze)