[00:28:42] Scoring-platform-team, VisualEditor, edittypes-modeling, Editing-team (Q3 2019-2020 Kanban Board), and 2 others: Change from source code editing to visual editing: minor edit property not kept - https://phabricator.wikimedia.org/T250388 (ppelberg) Open→Resolved
[04:43:39] Scoring-platform-team (Current), editquality-modeling, Hindi-Sites, artificial-intelligence: Create editquality labeling campaign for Hindi Wikipedia - https://phabricator.wikimedia.org/T252594 (CptViraj)
[06:08:39] Scoring-platform-team, Analytics: [Discuss] ORES model development and deployment processes - https://phabricator.wikimedia.org/T216246 (Aklapper) Stalled→Open The previous comments don't explain who or what (task?) exactly this task is stalled on (["If a report is waiting for further input (e.g....
[09:11:56] Scoring-platform-team, editquality-modeling, Spanish-Sites, artificial-intelligence: Missing observations from eswikiquote - https://phabricator.wikimedia.org/T254785 (MarcoAurelio) Noting that [[ https://es.wikiquote.org/w/index.php?title=Usuario_discusi%C3%B3n:MarcoAurelio&oldid=406713#Re:Limpi...
[12:12:00] o/
[14:29:28] Scoring-platform-team, VisualEditor, edittypes-modeling, Editing-team (Q3 2019-2020 Kanban Board), and 2 others: Change from source code editing to visual editing: minor edit property not kept - https://phabricator.wikimedia.org/T250388 (matmarex) Thanks @Ryasmeen!
[15:22:55] Scoring-platform-team, Wikilabels, articlequality-modeling, artificial-intelligence: Build article quality model for Dutch Wikipedia - https://phabricator.wikimedia.org/T223782 (Halfak) Thanks for the notes! I've added this to our sync meeting today.
[15:24:49] Hey haksoat1! How far did you get with looking through https://github.com/wikimedia/editquality/pull/223 ?
[16:57:46] Confirmed. English Wikipedia still does not have VE turned on for new editors.
[18:12:54] VE?
[18:13:33] ^ visual editor
[18:13:40] ahhh got it
[18:13:42] thanks
[19:55:10] ORES, Scoring-platform-team (Current), revscoring, artificial-intelligence: Rebuild all models with revscoring-2.8.2 - https://phabricator.wikimedia.org/T254505 (Halfak) https://github.com/wikimedia/drafttopic/pull/49
[19:55:28] o/ haksoat
[19:55:39] How far did you get with https://github.com/wikimedia/editquality/pull/223 ?
[19:56:06] That's the last model repo to merge with the new revscoring. And then we can merge your ukwiki work.
[21:22:30] Missed your message halfak
[21:23:50] I was able to go through
[21:24:12] That was yesterday though. No questions.
[21:26:48] Regarding tokenization. Help take a look at: https://github.com/halfak/deltas/blob/4a939fdf6d58451c8597bd9669ad77a6b2cc2438/deltas/tokenizers/tests/test_wikitext_split.py#L13
[21:27:02] Are the Chinese characters there from a text somewhere?
[21:27:22] I ask because
[21:27:25] It's likely I pasted them in from zhwiki
[21:28:24] 克·科伊尔
[21:28:24] When I run the regex I'm currently writing on the above, it matches the character · as a latin character
[21:28:40] Oh
[21:30:14] Found something interesting
[21:36:48] shouldn't that be a neutral character?
[21:38:19] What's a "neutral character"?
[21:38:31] hmm
[21:38:36] maybe that's not the unicode term
[21:38:47] I know there are some that depend on the context
[21:39:30] It falls into the latin character range though
[21:40:13] U+00B7 MIDDLE DOT
[21:40:15] unicode character range for latin
[21:40:17] Yeah
[21:40:20] Punctuation, other
[21:40:52] with alias Georgian comma, and Greek middle dot :P
[21:41:00] I've got to head out soon.
[21:41:12] haksoat, anything you need before I go?
[21:41:30] what were you expecting to do?
[21:41:34] Nah. I'm fine at the moment.
[21:41:41] OK have a good one, folks!
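Editor's aside on the U+00B7 exchange above: a minimal sketch of why a range-based "latin" class can swallow the middle dot. The two patterns are hypothetical stand-ins, not the expressions actually used in deltas; the only point is that U+00B7 sits inside the Latin-1 Supplement block even though its general category is punctuation.

```python
import re
import unicodedata

MIDDLE_DOT = "\u00b7"

# What the Unicode database says about the character.
print(unicodedata.name(MIDDLE_DOT))      # MIDDLE DOT
print(unicodedata.category(MIDDLE_DOT))  # Po (Punctuation, other)

# Hypothetical over-wide "word" class: it covers the whole
# U+0080-U+00FF block, so punctuation and symbols come along too.
too_wide = re.compile(r"[a-zA-Z\u0080-\u00FF]+")

# Hypothetical narrower class: only the letter runs of U+00C0-U+00FF,
# skipping U+00D7 (multiplication sign) and U+00F7 (division sign).
letters_only = re.compile(r"[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+")

print(bool(too_wide.fullmatch(MIDDLE_DOT)))      # True
print(bool(letters_only.fullmatch(MIDDLE_DOT)))  # False
```

The narrower class is still an approximation: it drops a few letters below U+00C0 (for example the micro sign), so it illustrates the trade-off rather than a finished definition of "word".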
[21:41:42] o/
[21:41:47] if you were breaking on punctuation, i think it should break on that one
[21:42:36] Platonides: If you take a look at: https://github.com/halfak/deltas/blob/master/deltas/tokenizers/wikitext_split.py
[21:42:55] You'll notice that cjk is above word in the LEXICON
[21:43:15] I am trying to switch positions so I can move cjk down the list
[21:43:34] But some characters are being matched by word as a result
[21:43:51] One of which is that dot from 克·科伊尔
[21:43:53] hmm
[21:44:02] Originally, that dot falls into etc
[21:44:03] why isn't it matched as an etc otherwise?
[21:44:52] I'm not so sure though. I think I picked a unicode range too wide for my word characters
[21:45:27] My concern is I could drop that single character, but there will certainly be others.
[21:45:37] For context, the reason I am trying to switch their positions
[21:45:59] Is that since we run the tokenizer on non-cjk text more
[21:46:50] The regex keeps running the cjk part of the regex through every text and not matching them.
[21:47:02] it should be quite fast, though
[21:47:08] that's just some ranges
[21:47:22] Yes, it actually is faster than before at the moment
[21:47:41] It's quite fine in its current state, just looking at the possibility of getting more
[21:48:35] the definition of word is strange
[21:48:59] have you considered using unicode properties in the regex?
[21:49:13] they are likely to slow you down, though
[21:50:52] I don't get what you mean by unicode properties
[21:51:18] You mean ranges like say [\u0080-\u00FF]
[21:51:21] ?
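Editor's aside on the LEXICON ordering discussed above: a minimal sketch, assuming the tokenizer joins its lexicon entries into one alternation where the first alternative that matches wins. The three patterns and the helper names `build` and `tokenize` are hypothetical and far simpler than the real lexicon in deltas.

```python
import re

# Hypothetical, heavily simplified lexicon entries.
CJK = r"[\u4E00-\u9FFF]"
WORD_WIDE = r"[a-zA-Z\u0080-\u00FF]+"      # includes U+00B7
WORD_NARROW = r"[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+"
ETC = r"."                                  # catch-all, always last


def build(lexicon):
    """Join (name, pattern) pairs into one alternation of named groups."""
    parts = [f"(?P<{name}>{pattern})" for name, pattern in lexicon]
    return re.compile("|".join(parts), re.DOTALL)


def tokenize(regex, text):
    """Return (token type, token text) pairs; earlier alternatives win."""
    return [(m.lastgroup, m.group()) for m in regex.finditer(text)]


# word is tried before cjk in both regexes; only the word class differs.
wide_first = build([("word", WORD_WIDE), ("cjk", CJK), ("etc", ETC)])
narrow_first = build([("word", WORD_NARROW), ("cjk", CJK), ("etc", ETC)])

text = "克·科伊尔"
print(tokenize(wide_first, text))    # the dot comes out as "word"
print(tokenize(narrow_first, text))  # the dot falls through to "etc"
```

With `word` placed ahead of `cjk`, the result depends entirely on how tightly the `word` class is drawn: any non-letter inside its ranges stops falling through to `etc`.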
[21:58:38] no
[22:00:21] https://www.regular-expressions.info/unicode.html#category
[22:06:05] Hmmm
[22:06:12] I'll take a look at those
[22:06:20] I actually once tried unicode scripts
[22:06:27] But Python doesn't support those
[22:06:48] hmm, I guess that could be a problem
[22:06:51] This library does though
[22:06:52] \u0080-\u00FF
[22:07:02] Hehe wrong paste
[22:07:05] https://bitbucket.org/mrabarnett/mrab-regex/src/hg/
[22:07:15] Sadly, it drops the speed in half
[22:08:21] Which is why we are currently specifying the ranges using []
[22:09:12] https://stackoverflow.com/questions/1832893/python-regex-matching-unicode-properties suggests the regex module
[22:10:32] Yeah. It's the same module as the one I shared in the link
[22:12:57] oh, I didn't know that regex package was the same as the mrab-regex
[22:14:52] maybe parse unicode database, then automatically extract the ranges from it?
[22:15:11] Okay. I've been stuck on this for a while now. Dropped before and just picking it back again. I may end up leaving it as it is, considering I already got a ~40% speed increase.
[22:15:15] you could end up with something like https://difnet.com.br/opensource/unicode_hack.py
[22:15:32] Checking...
[22:16:22] the point is writing something sensible
[22:16:38] even if then in the background, it ends up in a horrible regex string :P
[22:16:58] :-D
[22:18:16] Aha! This looks great.
[22:22:53] Thanks for sharing Platonides. I'll take a better look tomorrow and see what I can make out of it.
[22:25:52] this one seems to have extracted the properties manually
[22:26:05] this doesn't seem the most future-proof path
[22:26:54] Yeah. May lead to a better option though.
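Editor's aside on the "parse unicode database, then automatically extract the ranges" idea above: a minimal sketch using only the standard library, so the compiled pattern still runs on the built-in `re` module. The helper names `category_ranges` and `char_class` are hypothetical, and this is not the approach taken by the linked unicode_hack.py or by deltas. The ranges come from whatever Unicode version the running Python bundles (`unicodedata.unidata_version`).

```python
import re
import sys
import unicodedata


def category_ranges(prefix):
    """Collapse every code point whose general category starts with
    `prefix` (e.g. "L" for letters, "P" for punctuation) into a list
    of inclusive (start, end) ranges."""
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)).startswith(prefix):
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges


def char_class(ranges):
    """Render ranges as a regex character class using \\UXXXXXXXX escapes,
    so no code point needs manual escaping inside the brackets."""
    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append(f"\\U{lo:08X}")
        else:
            parts.append(f"\\U{lo:08X}-\\U{hi:08X}")
    return "[" + "".join(parts) + "]"


# Built once at import time; the generated string is long and ugly,
# but the source that produces it stays readable.
LETTER = re.compile(char_class(category_ranges("L")) + "+")

print(bool(LETTER.fullmatch("café")))  # True
print(bool(LETTER.search("·")))        # False: U+00B7 is Po, not a letter
```

Scanning all code points takes a noticeable fraction of a second, so a real tokenizer would likely build the class once, or generate it offline and commit the resulting string.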
[22:29:30] *nod*
[22:29:48] got that link from the above SO thread
[22:30:16] for the sake of stating one's sources :)
[22:31:49] Thanks
[22:32:06] :)
[23:45:28] (PS1) DannyS712: Remove use of the Revision object returned in WikiPage::doEditContent [extensions/ORES] - https://gerrit.wikimedia.org/r/604187 (https://phabricator.wikimedia.org/T254952)
[23:47:36] (PS2) DannyS712: Remove use of the Revision object returned in WikiPage::doEditContent [extensions/ORES] - https://gerrit.wikimedia.org/r/604187 (https://phabricator.wikimedia.org/T254952)