[07:42:46] Jade, Scoring-platform-team (Current): [Spike] What facilities are available to us when rendering edit comments? - https://phabricator.wikimedia.org/T250723 (kevinbazira) Also, [[ https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/7e950abc80ee9e3e528adc7654d4422b2c391984/repo/Wikibase.php#L1...
[12:10:52] ORES, Scoring-platform-team, artificial-intelligence: Review model performance for ptwiki 'articlequality' and 'draftquality' - https://phabricator.wikimedia.org/T250809 (He7d3r) Could the number of labels per article have a negative impact on the quality of the model? These are the frequency of the...
[15:01:23] Scoring-platform-team, articlequality-modeling, artificial-intelligence: Text fetched by articlequality's `fetch_text` might not match the talk page label (for moved pages) - https://phabricator.wikimedia.org/T251608 (He7d3r)
[15:01:59] halfak_, here is a subtle one ^
[17:04:09] hello Helder
[17:04:23] hello!
[17:04:35] I was hoping to get some insights from you on the w2w feature we've been working on
[17:05:00] https://gist.github.com/chtnnh/caab4c8e1def5d65a002503e7b6751af
[17:06:38] it seems like the words to watch have a much higher frequency in highly rated articles than in low-rated articles
[17:07:20] this is probably why we did not achieve any significant improvement in model fitness when we added w2w to the model
[17:07:23] what do you think?
[17:12:00] this is for the articlequality model, right?
[17:13:00] yes, that's right
[17:13:55] maybe articles containing such words get deleted before someone assesses their quality on the talk page? (e.g. because they don't think it is worth it, as the article will be deleted anyway?)
[17:14:52] hmm, that seems like a possible explanation, although I don't find it 100% convincing
[17:14:56] Or maybe the articles are so short that "words to watch" do not fit there
[17:15:43] could you also print the number of words (of any kind) next to the samples?
[17:15:46] what can we do to help the model better relate quality and w2w?
[17:16:22] the number of words in the article or the number of w2w found?
[17:17:30] I see you used
[17:17:31] words_to_watch_count,
[17:17:31] words_to_watch_count / max(wikitext.revision.words, 1)
[17:17:42] Is that the number of unique words?
[17:17:53] Or does it count the repetitions?
[17:18:20] it counts the repetitions, as far as I know
[17:18:29] for example, in your gist, there are many 'grande's
[17:18:51] maybe using the count of unique words changes something?
[17:19:19] or maybe some of these "words to watch" are too common, and they should be removed from the list?
[17:19:55] yes, I think the second suggestion would make sense, as the community doesn't seem to penalize articles containing these words
[17:20:10] how exactly was this list compiled, Helder?
[17:21:14] I used the list from the Portuguese version of Wikipedia:Manual of style/Words to watch,
[17:21:14] https://pt.wikipedia.org/wiki/Wikip%C3%A9dia:Palavras_a_evitar
[17:23:51] this is odd
[17:24:29] if the manual contains these words, then they must be what the community finds undesirable in articles, and yet many high-rated articles have a fair amount of these words
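A minimal sketch of the two counting strategies discussed above -- repetitions vs. unique words. The word list here is a made-up sample for illustration, not the actual ptwiki words-to-watch list:

    import re

    # Hypothetical sample entries; the real list comes from
    # https://pt.wikipedia.org/wiki/Wikip%C3%A9dia:Palavras_a_evitar
    WORDS_TO_WATCH = ["grande", "famoso", "lendário"]
    PATTERN = re.compile(r"\b(?:" + "|".join(WORDS_TO_WATCH) + r")\b",
                         re.IGNORECASE)

    def w2w_counts(text):
        matches = [m.group(0).lower() for m in PATTERN.finditer(text)]
        total = len(matches)        # counts repetitions (current behaviour)
        unique = len(set(matches))  # Helder's suggested alternative
        return total, unique

    print(w2w_counts("O Grande Prêmio foi um grande evento famoso."))
    # (3, 2): three matches overall, but only two distinct words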
[17:35:08] chtnnh, when I ran your gist locally, I found e.g. this article: https://pt.wikipedia.org/w/index.php?title=Lista_de_vencedores_de_corridas_da_F%C3%B3rmula_1&action=edit
[17:35:36] it has 198 occurrences of "grande" because it is part of the title of articles about races
[17:36:10] Similar to "Grand" in the title of https://en.wikipedia.org/wiki/2007_Canadian_Grand_Prix
[17:37:03] On the other hand, if I only search in the HTML output (instead of the source wiki code), the word appears "only" 19 times
[17:37:53] we are only looking at the wikitext and not the wikicode
[17:40:11] In what I said above, consider "source wiki code" = "wikitext"
[17:47:07] oh sorry, I am still learning about the wiki procedures
[17:48:01] even if we compare 19 to the 0/1 occurrences in the lower-rated articles, it doesn't add up
[17:48:17] I am definitely confused as to why this happened
[18:06:32] chtnnh, maybe this feature would allow us to count words_to_watch only in what I called the HTML output: https://github.com/wikimedia/revscoring/blob/master/revscoring/features/wikitext/datasources/parsed.py#L32-L37
[18:06:47] o/
[18:06:51] Reading scrollback.
[18:09:17] I don't think it's practical to look at HTML output. It would slow us down to pull a second kind of text blob for the page.
[18:09:42] I do agree that "Rio Grande" should be OK but other uses may be problematic.
[18:10:43] * Helder checks if "grande" is considered a stopword in nltk
[18:12:38] It's quite possible that words get added to the Manual of style without people considering that they have many legitimate uses.
[18:12:39] * Helder concludes that it isn't, and none of the regexes added to WORDS_TO_WATCH match stopwords. Good
[18:13:08] Ultimately these are words to "watch" -- not "inherently problematic words"
[18:13:08] halfak, yeah, "grande" is possibly one of those
[18:13:25] I wonder if we might consider trimming the list to remove the words we find to be very common in high-quality articles.
[18:13:26] which have many legitimate uses
[18:13:50] chtnnh, you could adapt that script to get a count of common words in level 6 articles.
[18:14:12] I bet this is a problem for the English words_to_watch too, as we don't get a big fitness boost from it either.
[18:14:32] that would make sense
[18:14:51] but why is the English words_to_watch included if the fitness boost is insignificant?
[18:15:07] halfak, "common words" here refers to?
[18:15:39] common == commonly occurring.
[18:16:07] E.g. just grab a count of how often a word_to_watch appears in level 6 articles and print out the top 20 or something like that.
[18:16:17] lemme try that
[18:16:23] halfak: Some great news here
[18:16:29] \o/
[18:16:35] What's up haksoat?
[18:16:39] First improvement
[18:16:45] We can tokenize 5.34710626315318 Alan Turing's per second with wsplit
[18:16:47] We can tokenize 6.665738235089714 Alan Turing's per second with wsplit_lex
[18:17:04] The wsplit_lex is the modified one
[18:17:04] Oooh. That's about 25% faster. What was the change?
[18:17:21] Changing the order of some of the tokens
[18:17:38] Without losing matches
[18:17:40] Oh interesting.
[18:17:44] Which tokens?
[18:18:03] I noticed whitespaces were the most common tokens
[18:18:16] and they were somewhere in the middle of the whole expression
[18:18:37] Hence the engine would try a lot of alternatives at every character before reaching them
[18:18:42] So I moved them to the top
[18:19:00] wow, that's ingenious xD
[18:19:22] haksoat, where are these regexes?
[18:19:29] Ha! That's crazy! Nice.
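A toy illustration of the reordering effect haksoat describes, using plain Python re rather than the real wikitext_split lexicon. The alternatives are the same in both patterns, so the tokenization is unchanged, but putting the whitespace branch first means fewer failed attempts at every whitespace position:

    import re
    import timeit

    text = "word word 1234 (word). " * 2000

    # Same alternatives, different order; the engine tries them left to right.
    original = re.compile(r"[a-z]+|[0-9]+|[A-Z]+|[()]|[.,;:]|\s+|.")
    reordered = re.compile(r"\s+|[a-z]+|[0-9]+|[A-Z]+|[()]|[.,;:]|.")

    # Reordering must not lose or change matches.
    assert original.findall(text) == reordered.findall(text)

    for name, pattern in [("original", original), ("whitespace first", reordered)]:
        seconds = timeit.timeit(lambda: pattern.findall(text), number=10)
        print(f"{name}: {seconds:.3f}s")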
[18:19:29] * Helder is curious
[18:19:59] Helder: https://github.com/halfak/deltas/blob/master/deltas/tokenizers/wikitext_split.py
[18:20:16] The LEXICON variable
[18:20:25] We are doing a pipe of each value
[18:21:05] So I moved break and whitespace to the top... I had to move break because I feared a whitespace might be caught before a break, causing us to lose the break.
[18:22:13] haksoat, is r'[\d]+' equivalent to r'\d+'?
[18:22:27] at https://github.com/halfak/deltas/blob/master/deltas/tokenizers/wikitext_split.py#L70
[18:22:36] Helder, yeah. Should be.
[18:22:39] yes
[18:24:10] haksoat, how about changing
[18:24:11] r'(\n|\n\r|\r\n)\s*(\n|\n\r|\r\n)+'
[18:24:12] to
[18:24:15] r'(\n\r?|\r\n)\s*(\n\r?|\r\n)+'
[18:24:17] ?
[18:24:42] That is, factor out the "\n"
[18:25:50] Also, do you need all the capturing groups?
[18:26:12] if not, maybe
[18:26:12] r'(?:\n\r?|\r\n)\s*(?:\n\r?|\r\n)+'
[18:26:13] ?
[18:27:27] Yes
[18:27:34] I think for some
[18:27:39] I could check those again
[18:27:48] Shouldn't the dot be escaped at
[18:27:49] ("etc", r".")
[18:27:50] ?
[18:28:12] or is it about any character?
[18:28:14] We are using that to capture characters that the preceding matches couldn't get
[18:28:21] ah, ok
[18:28:21] Like an etcetera
[18:28:42] I was confused because sometimes we use "etc..." in the text
[18:28:57] Okay
[18:31:35] btw, haksoat, in case it is useful, take a look at this tool: https://regexper.com/#%28%5Cn%5Cr%3F%7C%5Cr%5Cn%29%5Cs*%28%5Cn%5Cr%3F%7C%5Cr%5Cn%29%2B
[18:31:58] (if you don't know it already)
[18:33:14] oooh
[18:33:16] this is cool
[18:33:33] Nice
[18:33:44] it is like having a map for complex regexes :)
[18:33:53] I saw something similar, but was actually looking for something as powerful as this.
[18:34:31] https://www.debuggex.com/
[18:34:44] Thanks for sharing. It will be very useful.
[18:35:43] nice!
[18:36:24] wow
[18:54:39] posting our async update notes --
[18:54:57] kevinbazira-
[18:54:59] Y:
[18:55:01] Worked on localizing the second part of the edit comments on the history page.
[18:55:03] - I ended up parsing comments in their HTML format as provided by the PageHistoryLineEnding MW hook.
[18:55:05] - I'll probably demo this in one of the sync meetings, but the basic workflow was: parse the DOM, traverse it, pick the comment node, update old parts of the comment with new localized ones.
[18:55:07] T:
[18:55:09] Updated the spike task with what facilities are available to us when rendering edit comments. (benchmarked on Wikidata)
[18:55:11] Continued work on rendering localized edit comments on the history page. I made the assumptions below about the parts of an edit comment. @Aaron and @Andy please confirm whether this is true.
[18:55:13] {"damaging":false,"goodfaith":true} === Productive / Good-faith
[18:55:15] {"damaging":true,"goodfaith":true} === Damaging / Good-faith
[18:55:17] {"damaging":true,"goodfaith":false} === Damaging / Bad-faith
[18:55:19] {"damaging":false,"goodfaith":false} === Can't be :)
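kevinbazira's mapping above is easy to encode directly. A minimal sketch, with the label strings still pending Aaron's and Andy's confirmation:

    # Assumed mapping from (damaging, goodfaith) predictions to comment labels.
    COMMENT_LABELS = {
        (False, True): "Productive / Good-faith",
        (True, True): "Damaging / Good-faith",
        (True, False): "Damaging / Bad-faith",
        # (False, False) is deliberately absent: a non-damaging,
        # bad-faith edit is not expected to occur.
    }

    def comment_label(damaging: bool, goodfaith: bool) -> str:
        return COMMENT_LABELS[(damaging, goodfaith)]

    print(comment_label(damaging=True, goodfaith=False))  # Damaging / Bad-faith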
[18:55:21] halfak-
[18:55:23] Y: I explored the effectiveness of words_to_watch and did some iterations of ptwiki-related modeling with Helder and chtnnh. Otherwise, it was all meetings yesterday. Notably, one of the meetings was related to the ORES paper. I have some new todos from that related to scouring the literature for anyone who has written about decoupling products from AIs (or other types of platform infrastructure).
[18:55:25] So far the only thing people really talk about is standardizing interfaces -- not enabling varied product directions.
[18:55:27] T: Met with the researchers on our SWE interview panel to answer their questions. I'll bring some discussion of this to our next staff meeting. Continuing work on ptwiki. I am hoping to get the design work for RC filters wrapped up. Maybe I'll get some more paper reading in for the ORES stuff.
[18:55:31] and me-
[18:55:33] Y: Continued working on unraveling our table names inside all the link table helper classes and tests to support the ad-hoc approach. Also need to figure out how to merge the db patchset without breaking beta (most likely I will just leave the code for the old tables and then manually delete it later).
[18:55:35] T: More of the same. I demo'd some of the Jade secondary integration during sync today, and will continue cleaning up the WIP patchset for re-enabling the db hooks in hopes to deploy to beta next week.
[18:55:45] \o/ doing the design work for RC filters right now ^_^
[19:12:54] haksoat:
[19:12:56] T: Spoke to the team about progress on the tokenization project during sync. Tested out the speed of different tokenization methods. Got some speed improvements on tokenization from tweaking existing regexes.
[19:25:27] halfak: hi! I have a small query - wikitext.revision.datasources.operations depends on paragraphs_sentences_and_whitespace instead of directly depending on revision.text. What extra preprocessing does paragraphs_sentences_and_whitespace introduce?
[19:25:28] https://github.com/wikimedia/revscoring/blob/master/revscoring/features/wikitext/datasources/edit.py#L20
[19:25:38] feel free to take a look when you're free :)
[19:27:30] I want to get the section names under which given diff segments are contained... it seems like a non-trivial task :/
[19:28:40] The operations should give you positions. If you have positions for the headers, you know the section.
[19:29:01] https://github.com/halfak/deltas/blob/master/deltas/segmenters/paragraphs_sentences_and_whitespace.py
[19:29:56] codezee, ^
[19:30:39] thanks! looking...
[19:30:41] I have a question about this: https://github.com/wikimedia/articlequality/blob/master/Makefile#L553-L558
[19:30:51] If I make some changes to e.g. "extract_from_text", and then I run
[19:30:56] $ make models/ptwiki.wp10.gradient_boosting.model
[19:31:07] will it parse the dumps again? Or will it use the datasets/ptwiki.labeled_revisions.with_text.9k_2020.json from the last run?
[19:31:16] (given that the changes do not affect the beginning of the pipeline)
[19:31:47] It should re-use the text dataset
[19:32:08] you can use "make -n" to check
[19:32:16] (PS: the third attempt at running it finished successfully)
[19:32:22] -n means "output the commands but don't execute them"
[19:32:38] Helder, that's the weirdest thing I've heard.
[19:33:23] I mean, that dump processing which was not finishing yesterday
[19:33:44] I stopped it again, ran it again tonight, and it worked fine
[19:34:28] halfak, thanks for the tip. I'll try that
[19:39:52] halfak: sorry, I didn't quite understand the part about getting sections. For example, here - https://en.wikipedia.org/wiki/?diff=709934164&diffmode=source - I want to determine that the first {{cite insertion belongs to the Cast section; the only info operations gives me is Insert(name='insert', a1=5105, a2=5105, b1=5105, b2=5113)
[19:40:14] regarding that insertion
[19:41:09] I'm guessing I'll have to separately map sections to their index in the text, right? Then match these 'b' indexes to the closest preceding index in the section mapping?
[19:41:29] *map section names
[20:04:40] codezee, yes that is right.
[20:04:47] I think that might be the best strategy.
[20:05:13] I would take the tokens and make a really simple parser that could extract section headers. They are the easiest thing to match, so it shouldn't be too hard.
[20:05:31] If you see a sequence of equals signs at the beginning of a line, you have a header.
[20:05:48] Then let the differ do its work on the tokens.
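A rough sketch of the whole strategy halfak outlines above: tokenize both revisions with deltas' wikitext_split, record the token index of every heading (a run of '=' at the start of a line), and resolve each insert operation's b1 index to the closest preceding header. The header parsing here is deliberately naive (it ignores '=' runs inside templates and the like), and the sample texts are made up:

    from deltas import segment_matcher
    from deltas.tokenizers import wikitext_split

    def header_positions(tokens):
        # (token_index, section_title) for every heading line.
        sections = []
        at_line_start = True
        for i, token in enumerate(tokens):
            text = str(token)
            if at_line_start and text.startswith("="):
                title = []
                for t in tokens[i:]:
                    if "\n" in str(t):
                        break
                    title.append(str(t))
                sections.append((i, "".join(title).strip("= ")))
            at_line_start = "\n" in text
        return sections

    def section_of(b_index, sections):
        # The closest preceding header wins, as codezee proposed.
        name = "(lead)"
        for start, title in sections:
            if start > b_index:
                break
            name = title
        return name

    old_text = "Intro.\n\n== Cast ==\nAlice plays the lead.\n"
    new_text = "Intro.\n\n== Cast ==\nAlice plays the lead.{{cite web}}\n"

    a = wikitext_split.tokenize(old_text)
    b = wikitext_split.tokenize(new_text)
    sections = header_positions(b)

    for op in segment_matcher.diff(a, b):
        if op.name == "insert":
            inserted = "".join(str(t) for t in b[op.b1:op.b2])
            print(section_of(op.b1, sections), repr(inserted))
    # e.g. prints: Cast '{{cite web}}'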
[20:37:42] Jade, Scoring-platform-team: Design New Filters controls for Jade - https://phabricator.wikimedia.org/T229976 (Halfak) {F31790284} I like putting these labels near the predicted labels, but I've also designed an "advanced filters" button that could be used to hide these filters deeper in the menu.
[20:38:00] Jade, Scoring-platform-team (Current): Design New Filters controls for Jade - https://phabricator.wikimedia.org/T229976 (Halfak)
[20:39:47] Jade, Scoring-platform-team (Current): Design New Filters controls for Jade - https://phabricator.wikimedia.org/T229976 (Halfak) Mostly I carefully reviewed the structures in place in the RCFilters menu and applied them to Jade. I made modifications to the i18n messages we use to describe the labels so...
[20:40:04] Jade, Scoring-platform-team (Current), Design: Design New Filters controls for Jade - https://phabricator.wikimedia.org/T229976 (Halfak)
[20:40:10] I'll bring these to the design review meeting next week.
[20:40:30] halfak, how did you get these feature importances? https://phabricator.wikimedia.org/T251171#6087216
[20:40:47] Assuming I have a trained model at hand
[20:40:57] https://gist.github.com/halfak/53203c62f54dd9b83a4f2abc293b8534
[20:41:27] thanks
[20:45:41] No problem :)
[20:48:19] Scoring-platform-team, artificial-intelligence: Add `words_to_watch` to articlequality and draftquality models in ptwiki - https://phabricator.wikimedia.org/T251171 (Halfak) Indeed. When I merged that PR, it had minor positive effects on quality.
[21:20:30] halfak: How often do we come across CJK in the articles we tokenize?
[21:22:11] I'm assuming not a lot of the time
[21:22:22] But not sure if this assumption is valid
[21:49:12] Good question. We do have models for Korean and Japanese.
[21:49:32] We could use a different tokenizer for them, but we'll always need to handle those chars.
[21:52:29] halfak, any idea why you got 87 features at https://phabricator.wikimedia.org/T251171#6087216 but I only get 34 when I try chtnnh's model on my machine?
[21:53:33] In particular, I don't see e.g.
[21:58:26] Okay halfak
[21:58:59] Yeah. It's better if we use a different tokenizer for them, I think.
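One way to "handle those chars" without a whole separate tokenizer is to give CJK runs their own branch in the lexicon and emit one token per character, a common approximation for scripts without word delimiters. This is a sketch, not what deltas actually does; the character ranges below (CJK Unified Ideographs, Hiragana, Katakana, Hangul) are an assumption:

    import re

    CJK = r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]"
    TOKEN = re.compile(CJK + r"|\w+|\s+|.")

    def tokenize(text):
        # The CJK branch comes first, so each ideograph/syllable becomes
        # its own token instead of being swallowed by \w+.
        return TOKEN.findall(text)

    print(tokenize("ORES supports 한국어 and 日本語."))
    # ['ORES', ' ', 'supports', ' ', '한', '국', '어', ' ', 'and', ' ', '日', '本', '語', '.']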
[22:07:55] Helder, I think this is for the draftquality model.
[22:09:25] hm, that could be it
[22:09:36] Helder, are you working with the articlequality model?
[22:09:43] I need to step away pretty soon, FYI
[22:09:49] I was
[22:10:47] Say, do you have access to ores-misc-01? That might help you with your work if you'd like to experiment with building models.
[22:11:10] ores-misc-01 matches our production environment, so I'd like to have any models going to prod be built there.
[22:11:15] Helder, ^
[22:11:49] I assume I don't have access
[22:12:03] Hmmm
[23:17:54] halAFK: thanks... I implemented that kind of thing, and can get sections from operations now :)
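For reference, halfak's gist above boils down to reading the feature importances off the trained estimator, which would also explain the 87-vs-34 feature count puzzle (draftquality vs. articlequality model). A sketch only -- the attribute names (.features, .estimator) are assumptions based on revscoring's sklearn-based classifiers, so check them against the gist:

    from revscoring import Model

    with open("models/ptwiki.wp10.gradient_boosting.model", "rb") as f:
        model = Model.load(f)

    # A quick length check distinguishes the articlequality model
    # from the draftquality one.
    print(len(model.features))

    # Top 20 features by importance, as in the Phabricator comment.
    for feature, importance in sorted(
            zip(model.features, model.estimator.feature_importances_),
            key=lambda pair: pair[1], reverse=True)[:20]:
        print(round(importance, 4), feature)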