[00:10:13] Hi, does anyone know if there are anti-vandalism bots for languages other than English? Seems ClueBot doesn't cover many languages.
[00:13:39] https://phabricator.wikimedia.org/T184778
[00:19:50] PatruBOT works on eswiki based on ORES scores
[00:20:00] 82
[00:21:26] cool
[00:21:27] thanks!
[00:24:00] (PS1) Awight: [WIP] Tests for PopulateDatabase [extensions/ORES] - https://gerrit.wikimedia.org/r/403870 (https://phabricator.wikimedia.org/T184140)
[00:34:08] (CR) jerkins-bot: [V: -1] [WIP] Tests for PopulateDatabase [extensions/ORES] - https://gerrit.wikimedia.org/r/403870 (https://phabricator.wikimedia.org/T184140) (owner: Awight)
[00:51:40] (PS2) Awight: Tests for PopulateDatabase [extensions/ORES] - https://gerrit.wikimedia.org/r/403870 (https://phabricator.wikimedia.org/T184140)
[01:21:22] Scoring-platform-team, MediaWiki-Core-Tests, MediaWiki-extensions-ORES, Release-Engineering-Team: How do I test my extension's maintenance scripts? - https://phabricator.wikimedia.org/T184775#3896057 (Legoktm) > If what I'm looking for doesn't exist, I'd be happy to split the maintenance testing...
[10:21:28] Scoring-platform-team, MediaWiki-Core-Tests, MediaWiki-extensions-ORES, Release-Engineering-Team: How do I test my extension's maintenance scripts? - https://phabricator.wikimedia.org/T184775#3895786 (Ladsgroup) Count me in if you want help,
[10:22:14] (CR) Ladsgroup: [C: 2] Namespace maintenance scripts so they're discoverable from tests [extensions/ORES] - https://gerrit.wikimedia.org/r/403663 (https://phabricator.wikimedia.org/T184140) (owner: Awight)
[10:24:21] (Merged) jenkins-bot: Namespace maintenance scripts so they're discoverable from tests [extensions/ORES] - https://gerrit.wikimedia.org/r/403663 (https://phabricator.wikimedia.org/T184140) (owner: Awight)
[10:25:07] (CR) Ladsgroup: [C: 2] Steal model fixtures for TestHelper; add dirty tricks [extensions/ORES] - https://gerrit.wikimedia.org/r/403838 (https://phabricator.wikimedia.org/T184140) (owner: Awight)
[10:26:59] (Merged) jenkins-bot: Steal model fixtures for TestHelper; add dirty tricks [extensions/ORES] - https://gerrit.wikimedia.org/r/403838 (https://phabricator.wikimedia.org/T184140) (owner: Awight)
[10:27:50] (CR) jenkins-bot: Namespace maintenance scripts so they're discoverable from tests [extensions/ORES] - https://gerrit.wikimedia.org/r/403663 (https://phabricator.wikimedia.org/T184140) (owner: Awight)
[10:28:16] (CR) Ladsgroup: [C: 2] Add a maintenance script test [extensions/ORES] - https://gerrit.wikimedia.org/r/403839 (https://phabricator.wikimedia.org/T184140) (owner: Awight)
[10:29:45] (CR) Ladsgroup: [C: -1] Tests for PopulateDatabase (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/403870 (https://phabricator.wikimedia.org/T184140) (owner: Awight)
[10:30:46] (Merged) jenkins-bot: Add a maintenance script test [extensions/ORES] - https://gerrit.wikimedia.org/r/403839 (https://phabricator.wikimedia.org/T184140) (owner: Awight)
[10:31:05] (CR) jenkins-bot: Steal model fixtures for TestHelper; add dirty tricks [extensions/ORES] - https://gerrit.wikimedia.org/r/403838 (https://phabricator.wikimedia.org/T184140) (owner: Awight)
[10:38:37] (CR) jenkins-bot: Add a maintenance script test [extensions/ORES] - https://gerrit.wikimedia.org/r/403839 (https://phabricator.wikimedia.org/T184140) (owner: Awight)
[10:56:43] (PS1) Ladsgroup: Remove extension beta mode parts [extensions/ORES] - https://gerrit.wikimedia.org/r/403902 (https://phabricator.wikimedia.org/T184554)
[13:21:31] Amir1: Hello! Do you know if there is any revscoring feature modifier which splits wikitext by regular expression? Thanks in advance!
[13:32:36] Phantom42: for splitting, I'm not sure but we use regex to detect bad words
[13:33:24] https://github.com/wiki-ai/revscoring/blob/master/revscoring/languages/english.py
[13:34:46] https://github.com/wiki-ai/revscoring/blob/master/revscoring/languages/features/regex_matches/regex_matches.py
[13:35:14] Amir1: Hm. Thanks. I will try to use `RegexMatches` to split text into paragraphs or sections
[14:01:50] (PS3) Zfilipin: WIP Create Selenium UI tests for ORES damaging and good faith filters. [extensions/ORES] - https://gerrit.wikimedia.org/r/402863 (https://phabricator.wikimedia.org/T184451) (owner: Etonkovidova)
[14:02:46] (CR) Zfilipin: "> Awight" (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/402863 (https://phabricator.wikimedia.org/T184451) (owner: Etonkovidova)
[14:56:55] Scoring-platform-team, Documentation, Easy: Mock JADE discussion page - https://phabricator.wikimedia.org/T179301#3896841 (Dheerajmalisetty) p:Normal>High
[14:57:01] Scoring-platform-team, Documentation, Easy: Mock JADE discussion page - https://phabricator.wikimedia.org/T179301#3719822 (Dheerajmalisetty) p:High>Normal
[16:20:54] Scoring-platform-team (Current), MediaWiki-Core-Tests, MediaWiki-extensions-ORES, Release-Engineering-Team: How do I test my extension's maintenance scripts? - https://phabricator.wikimedia.org/T184775#3896986 (awight) a:awight
[16:24:53] awight: hey, do you have a minute to review my patch in ext. ores?
[16:25:13] Amir1: hi. Sure!
[16:25:18] likewise :)
[16:25:23] Thank you~
[16:25:27] *!
[16:25:38] I merged two of yours and gave a -1 to the third one
[16:27:12] (CR) Awight: [C: 2] Remove extension beta mode parts [extensions/ORES] - https://gerrit.wikimedia.org/r/403902 (https://phabricator.wikimedia.org/T184554) (owner: Ladsgroup)
[16:28:11] (CR) Awight: Tests for PopulateDatabase (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/403870 (https://phabricator.wikimedia.org/T184140) (owner: Awight)
[16:29:17] Amir1: oh, hey. Shall we close this chapter of test writing? https://doc.wikimedia.org/cover-extensions/ORES/index.html
[16:29:46] once your last patch is merged, I think so :)
[16:29:57] Cool. We can always come back for another round, later.
[16:30:32] I will write integration tests for ScoreFetcher as part of refactoring it
[16:31:36] (Merged) jenkins-bot: Remove extension beta mode parts [extensions/ORES] - https://gerrit.wikimedia.org/r/403902 (https://phabricator.wikimedia.org/T184554) (owner: Ladsgroup)
[16:32:48] awight: also, reading from json files is also good IMO
[16:32:56] in some cases we do that in Wikibase
[16:35:01] Amir1: One thing I noticed while reading code from a testing perspective: a few of our classes are responsible for reading from a table, but not for writing to it. That’s fine, if both parts of the functionality are reasonably parallel... e.g. SqlScoreStorage vs FetchScoreJob
[16:35:16] Amir1: +1, I’m happy to clean up the fixtures
[16:36:22] awight: I don't understand the comment
[16:36:28] :(
[16:36:54] i.e. maybe SqlScoreStorage should include the interface for reading from ores_classification
[16:37:55] That is very valid
[16:38:19] (PS3) Awight: Tests for PopulateDatabase [extensions/ORES] - https://gerrit.wikimedia.org/r/403870 (https://phabricator.wikimedia.org/T184140)
[16:38:31] I tried to clean that part in more depth but that sounded like too much work for too little gain for now
[16:38:52] this software will rot after a while and we need to revisit this in the next round of refactoring
[16:39:33] (CR) jenkins-bot: Remove extension beta mode parts [extensions/ORES] - https://gerrit.wikimedia.org/r/403902 (https://phabricator.wikimedia.org/T184554) (owner: Ladsgroup)
[16:40:44] +1
[16:40:49] just a random thought
[16:45:01] (PS1) Ladsgroup: Move helper methods of Hooks.php to Hooks\Helpers class [extensions/ORES] - https://gerrit.wikimedia.org/r/403943
[16:45:44] ^ This is a monster of a patch but it's just moving code around
[16:47:40] (CR) jerkins-bot: [V: -1] Move helper methods of Hooks.php to Hooks\Helpers class [extensions/ORES] - https://gerrit.wikimedia.org/r/403943 (owner: Ladsgroup)
[16:48:40] (PS2) Ladsgroup: Move helper methods of Hooks.php to Hooks\Helpers class [extensions/ORES] - https://gerrit.wikimedia.org/r/403943
[16:48:49] coverage report builder :/
[16:50:43] gaming the system, I see :D
[16:51:44] Now this patch is ready for review
[16:52:30] * Amir1 looks at awight :D
[16:52:45] * awight whistles about being off-duty :p
[16:58:19] * awight can’t resist the curiosity
[16:58:43] pleaaaase
[17:06:58] (CR) Awight: [C: 2] "Great! I've added various unhelpful notes here and there." (5 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/403943 (owner: Ladsgroup)
[17:08:55] (Merged) jenkins-bot: Move helper methods of Hooks.php to Hooks\Helpers class [extensions/ORES] - https://gerrit.wikimedia.org/r/403943 (owner: Ladsgroup)
[17:11:47] (CR) Ladsgroup: Move helper methods of Hooks.php to Hooks\Helpers class (2 comments) [extensions/ORES] - https://gerrit.wikimedia.org/r/403943 (owner: Ladsgroup)
[17:12:25] (CR) jenkins-bot: Move helper methods of Hooks.php to Hooks\Helpers class [extensions/ORES] - https://gerrit.wikimedia.org/r/403943 (owner: Ladsgroup)
[17:17:32] (CR) Ladsgroup: Move helper methods of Hooks.php to Hooks\Helpers class (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/403943 (owner: Ladsgroup)
[17:18:53] I'm calling it a day
[17:19:07] see you on Monday
[17:25:12] o/
[17:25:23] * halfak is at AI bias roundtable thingie
[17:25:53] ooohh sounds nice
[17:26:02] will they have a video or something later?
[17:26:39] (PS1) Ladsgroup: Remove $user from Helpers::OresUiEnabled() [extensions/ORES] - https://gerrit.wikimedia.org/r/403959 (https://phabricator.wikimedia.org/T184554)
[17:26:44] Not likely. Lots of policy folks here from big corps. Sitting next to a guy from JP Morgan Chase
[17:27:01] I just watched The Big Short yesterday. Feels weird.
[17:27:31] Just said "We're not tweeting this. This is not a public conversation."
[17:27:46] ಠ_ಠ
[17:27:47] halfak: hey, please check telegram when you're free :D
[17:27:53] kk
[17:27:54] and have fun
[17:28:07] lolol IRC != Twitter, yeah
[17:28:27] that was a very smart move of halfak :P
[17:28:28] heh
[17:28:30] heheheh
[17:28:43] ಠ_ಠ <"hehehe")
[17:28:44] for review for awight|away when he's back: last but not least: https://gerrit.wikimedia.org/r/403959
[17:29:49] Going to leave now, for real
[17:29:52] (CR) Awight: [C: 2] Remove $user from Helpers::OresUiEnabled() (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/403959 (https://phabricator.wikimedia.org/T184554) (owner: Ladsgroup)
[17:30:27] me too
[17:31:41] (CR) Ladsgroup: "facepalm. It has been 12 hours that I'm in the office, forgive my tiredness" [extensions/ORES] - https://gerrit.wikimedia.org/r/403959 (https://phabricator.wikimedia.org/T184554) (owner: Ladsgroup)
[17:33:13] (CR) jerkins-bot: [V: -1] Remove $user from Helpers::OresUiEnabled() [extensions/ORES] - https://gerrit.wikimedia.org/r/403959 (https://phabricator.wikimedia.org/T184554) (owner: Ladsgroup)
[17:38:36] (PS2) Ladsgroup: Remove $user from Helpers::OresUiEnabled() [extensions/ORES] - https://gerrit.wikimedia.org/r/403959 (https://phabricator.wikimedia.org/T184554)
[17:45:54] (re. roundtable) Looks like we're going to have a conversation about disparate impact and training data bias.
[17:46:56] hrm
[17:47:08] well I hope it is a good conversation and that some of it sticks
[17:47:47] heh. Seems like this is a discussion that will lead toward policy recommendations.
[17:48:04] interesting, sure hope those are made public at least!
[17:48:22] There's a theme about moving discuss-ability away from algorithmic specialists and toward policy, marketing, etc.
[17:51:34] FWIW, I think this is starting out well. Seems like this is a workshop series with a good grasp on central issues.
[18:01:03] people who know the algorithms need to set out in plain language the way data is gathered, cleaned up, consumed, and finally what is produced; once this is done, the field is open for everyone to discuss how things can go awry
[18:01:08] And now we do the terminology dance. I'll have a lot on this in my report.
[18:01:34] tbh I feel like a nmube of people have written such texts, but this hasn't made it out to the broader public
[18:01:44] except after a lot of levels of filtering...
[18:01:49] *number of people
[18:02:20] I fear to ask what steps the terminology dance may have
[18:26:43] Right now, it's mostly discussion about not using the word "bias" anymore.
[18:26:52] But there's lots of discussion about the origins of bias.
[18:33:01] Now with credit as an example... What is a "true" credit risk as opposed to a bias in the data?
[18:33:17] How do we establish true risk?
[18:45:14] what is ground truth really
[18:45:38] it's not what the data shows, because the data reflects the brokenness of the existing system
[18:49:04] right. There are methods that can help.
[18:49:17] E.g. randomly give people loans who get poor predictions and observe outcomes.
[18:49:49] Also, experiment with strategies for helping people reduce their own risk.
[18:49:49] ^ like reminders, a support agent, etc.
[18:50:01] In the latter, risk is not assumed to be inherent.
[18:50:03] yes, because that's another thing about the current system
[18:50:15] it's set up so that certain classes of people are much more likely to be credit risks
[18:50:20] do we just... perpetuate that?
[18:52:07] what sort of use cases are being discussed? (I assume folks in the room are talking about things that impact their businesses)
[19:37:26] PROBLEM - puppet on ORES-worker08.experimental is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:37:30] No cases really, more like gesturing to real world examples (or hypothetical ones) and then pointing at generalities and terminological nuance that are needed to describe certain situations.
[19:37:38] apergos, ^
[19:38:08] Now working on a template for describing an "algorithmic decision system"
[19:38:12] Will be in my report ^_^
[19:41:34] ah ha
[19:42:30] so we can talk for example about how long it will be until people walk around with apps that are designed to "out" trans folk
[19:43:48] although that's not bias in the sense of your session, it's reproducing societal belief systems
[19:43:56] and helping to perpetuate injustices
[19:44:06] that's a large part of what bias is about
[19:48:48] Not sure that this fits into an algorithmic decision system statement -- it's more of a societal impact statement.
[19:49:26] I don't think bias is the right framework to use there unless we make "stranger interactions" a decision-making system.
[19:49:56] In this case, it's hard to differentiate the individual's goals and right to be judgmental from others' right to feel safe and unharassed.
[19:50:23] well a goal 'categorize this person as trans or not'
[19:50:27] I mean
[19:50:32] what do you do with that
[19:50:45] regardless of who uses the app
[19:51:27] you can just talk right there about invasion of privacy
[19:52:35] regardless of accuracy
[19:54:24] Hmm... Not sure how this is an algorithmic bias discussion.
[19:57:00] Seems like it's sort of a different type of discussion. E.g. we can't make people stop coming to conclusions from what they see but we don't need to build tools to help them make certain types of judgments.
[19:57:42] people come to conclusions based on all kinda crap
[19:57:45] The problem wouldn't be that the prediction contains bias (errors are distributed differently for different groups), but rather the invasion of privacy as you say.
[19:57:53] may or may not be accurate conclusions
[19:58:03] which is another set of issues of course
[19:59:22] if we build a tool that follows their way of drawing conclusions, that's bias for sure
[19:59:25] we
[19:59:29] someone
[20:03:08] apergos, maybe. Bias has a formal definition. Does it have prediction error that is represented in some groups more than others?
[20:03:18] anyways, no intent to derail the discussion
[20:03:42] Bias != disparate impact -- which would be having real, negative effects on some group
[20:04:43] :)
[20:04:58] so this opens up the whole discussion of "passing" and those who do won't get flagged (presumably)
[20:05:09] and the criteria for that are fscking all over the map
[20:05:11] Not derailing. In a way, it's nice to discuss this. The room is repeating itself and going on and on a bit
[20:06:18] so if we are going for accuracy (and tbh I would go for jamming the classifier via some adversarial method :-P) then those who
[20:06:30] let's say have had some hormone treatment but not much
[20:06:48] those who have not had top surgery maybe (depends)
[20:07:01] depends on what people wear and how they carry themselves and a bunch of other crap
[20:07:36] anyways some small subgroup will be classified as "trans yes" and many more will be "nope"
[20:07:44] of those that are
[20:07:57] RECOVERY - puppet on ORES-worker08.experimental is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[20:08:16] apergos, maybe it would be more fun to discuss this in the context of sports.
[20:08:18] trans women I guess would be more likely to be flagged as such
[20:08:28] and trans men not so much
[20:08:30] E.g. right now we already have a clear bias and some notion of rights.
[20:08:33] and then nonbinary folk
[20:08:41] will simply be misgendered by any current classifier
[20:09:05] I'm thinking about the first use of these things which is going to be
[20:09:14] well it'll be about sex cause that's what drives the internet :-P
[20:10:05] apergos, seems like a stretch to imagine the user of the "find the trans" app
[20:10:26] It's just kind of creepy. Whereas making sure that people compete with their like-sex is a real problem in sports right now.
[20:10:32] it is creepy
[20:10:37] and that's a problem
[20:10:42] and we're walking right towards it
[20:11:05] are we? I'm still struggling to imagine this being highly desired.
[20:11:10] do people need always to compete with their like-sex
[20:11:14] is that a given?
[20:12:17] apergos, it's an old divide that's been upheld under inquiry on a regular basis. It's not necessarily a given.
[20:12:54] https://sports.vice.com/en_uk/article/vv95a4/what-actually-happens-when-a-trans-athlete-transitions
[20:12:56] But there's a history of "cheaters" so a need for better (and maybe less intrusive) testing standards. or I guess a re-adjustment of sex-divides.
[20:13:20] people are finally studying it a bit instead of knee-jerk "no no no"
[20:13:24] so that is at least improvement
[20:13:44] now doping, there's a thing where if we had an app
[20:13:45] damn
[20:14:12] apergos, maybe. because that app would likely try to measure testosterone and now we have the same problem.
[20:14:22] it would?
[20:14:38] I mean people use all kinds of enhancers, I thought most of them are not about testosterone levels
[20:14:57] Oh it's one of the more useful illegal supplements.
[20:14:59] they can't be really, because those levels vary widely from one person to the next
[20:15:14] Right and that directly affects performance in many sports.
[20:16:37] so do we divide people into groups by their levels, like weight groups for boxing?
[20:16:42] I mean, nothing wrong with that
[20:16:49] groups seems pretty coarse
[20:16:52] Oh I think that would be really interesting!
[20:16:52] *2 groups
[20:17:12] E.g. compete with people who have similar age, weight, testosterone levels.
[20:17:20] why not
[20:17:49] we still have the enhancers issue; doping seems to be really endemic in a lot of sports
[20:18:05] I'm kind of astounded how widespread it is (from reports in the last decade anyways)
[20:18:10] Good Q!
[20:18:19] maybe that's not something that can be tackled by an app
[20:18:21] (too bad)
[20:18:41] Why not. It already has an app. It just doesn't run on your phone (yet).
[20:18:53] it does?
[20:18:54] Lots of computational tech around doping detection
[20:19:04] afaik you have to take pee and/or blood tests and something something
[20:19:05] dunno really
[20:20:14] Sure. That's just a measurement. I thought we were talking about the algorithm.
[20:20:21] ah
[20:20:31] well part of the process is: what data do we get
[20:20:33] how do we clean it
[20:20:37] how does it get ingested
[20:20:42] +1
[20:21:24] they must be really deadly dull in that discussion given our active chat here :-D
[20:21:35] I totally agree. Maybe there's something about "enough data" that we'd want to talk about there
[20:21:43] But now it's lunch!
[20:21:47] good!
[20:21:48] back in a bit
[20:21:52] see ya
[20:21:53] enjoy
[22:09:41] (CR) jenkins-bot: Localisation updates from https://translatewiki.net. [extensions/ORES] - https://gerrit.wikimedia.org/r/404024 (owner: L10n-bot)
[22:30:02] o/
[22:32:09] how was your lunch?
[22:32:56] Grumpy. Watched a lot of people talk over others who had their thoughts together better. >:(
[22:33:49] Oh and tasty.
[22:34:22] Now we're back to debating whether or not regulations should require businesses to lose money gathering more data about minorities in order to more effectively assess loan risk.
[22:35:19] cost of gathering data > potential profits --> lose money
[22:36:01] ah
[22:36:28] gathering more data? gathering data better?
[22:36:45] Na. This would be strictly *more* data.
[22:36:49] ic
[22:37:09] because many people who can't get loans are in that state because they don't have a credit history. Arguably a credit history is good data.
[22:37:40] the whole credit history dance is a very weird one
[22:37:44] I suppose there could be "better" data in the sense that you'd gather data with better coverage.
[22:37:46] Agreed.
[22:38:15] by gathering data better, I meant being cleverer about it or more efficient or something
[22:38:24] not necessarily that the data itself is better
[22:38:51] of course it is cheaper to cut more people out of the loan loop
[22:39:18] otoh the banks certainly managed to overlook credit history guring the subprime mortgage fiasco
[22:39:24] *during the
[22:40:00] on the third hand, what sort of data are they agitating to get?
[22:40:10] (for folks with no credit history)
[22:40:58] anyways, enjoy (I hope) the rest of your sessions, ttyl!
[22:45:09] o/
[22:45:09] Additional data not clear.
[22:45:09] We're generally trying to make the whole world more fair here. ;)
[23:13:39] Regulations! It's our obligation to go to lawmakers and inform them "constantly and consistently" so that they don't regulate poorly.
[23:14:00] It seems problematic (though practical) that they would not instead be responsible to inform themselves.
[23:14:43] A chief of staff for a senator is here and she's explained that legislators are used to people coming to them to explain.
[23:17:18] And that if you don't do that first, they'll show up later with a hammer.
[23:28:44] Hello! I am working on adding a new feature to the enwiki feature lists (it is my GCI task, related phab task is https://phabricator.wikimedia.org/T174384). In fact, I have implemented the feature and a test for it, and it works fine when run from the test. However when I try to build the model I get a `NotImplemented` exception: https://dpaste.de/SJK4 . Feature line which causes the problem is `paragraphs =
[23:28:44] features.RegexMatches('enwiki.paragraphs', [r'((?:[^\n][\n]?)+)'], wrapping=(r'\s*', r''))`. I am currently trying to figure this out. Any tips on what I am doing wrong would be highly appreciated. Thank you!
[23:32:41] Phantom42, I'm not sure you want RegexMatches for this. I think you can read this from the wikitext.revision.datasource.tokens.
[23:32:42] * halfak looks
[23:33:43] my connection is really bad so I'll be on and off every 30s or so
[23:34:12] halfak: My idea was to split text into paragraphs using RegexMatches. But I will check now if `...datasource.tokens` can be used for this too
[23:34:35] There should be a paragraph break token
[23:34:41] * halfak waits for docs to load up :)
[23:35:26] https://github.com/wiki-ai/revscoring/blob/master/revscoring/features/wikitext/datasources/tokenized.py#L20
[23:35:32] Phantom42, ^ that
[23:35:40] You can get paragraphs themselves to process :)
[23:36:16] It's essentially a list of paragraphs -- which is a list of sentences -- which is a list of tokens.
[23:37:35] halfak: Wow! Thank you! I spent so much time with regular expressions and now there is an easier solution :)
[23:37:44] \o/ happy to help
[23:37:54] Sorry I've been absent. I'm in travel mode this week
[23:40:21] In fact, my plan is to: split text into paragraphs -> take only those which don't have references in them -> sum their lengths. I implemented it and it seems to work. Now I will rebuild the model and see if it gives some improvement :) I will also think later if there are better strategies for this
[23:46:04] Phantom42, sounds reasonable.
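The plan described at 23:40:21 (split text into paragraphs, keep only those without references, sum their lengths) can be sketched in plain Python roughly as follows. This is only an illustration and not the revscoring implementation discussed above: the helper names, the blank-line paragraph rule, and the `<ref` check are all assumptions made for the example.

```python
import re

# Hypothetical helpers for illustration only; this is plain Python,
# not the revscoring datasource API mentioned in the chat.

# Assumption: paragraphs are separated by one or more blank lines.
PARAGRAPH_BREAK = re.compile(r"\n\s*\n")
# Assumption: a paragraph "has a reference" if it contains a <ref ...> tag.
REF_TAG = re.compile(r"<ref\b", re.IGNORECASE)


def split_paragraphs(wikitext):
    """Return the non-empty paragraphs of a wikitext string."""
    parts = PARAGRAPH_BREAK.split(wikitext)
    return [part.strip() for part in parts if part.strip()]


def unreferenced_paragraph_chars(wikitext):
    """Sum the character lengths of paragraphs that contain no <ref> tag."""
    return sum(
        len(paragraph)
        for paragraph in split_paragraphs(wikitext)
        if not REF_TAG.search(paragraph)
    )


sample = "First paragraph.<ref>Source</ref>\n\nSecond paragraph.\n\nThird one."
print(split_paragraphs(sample))
print(unreferenced_paragraph_chars(sample))
```

Counting characters is just one possible length measure; counting tokens from the paragraph/sentence/token structure halfak points to would follow the same filter-then-sum shape.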