[00:22:17] I updated the German numbers. Coverage went up by more than 1%! (re @Nikki: "beispielsweise" (how appropriate))
[00:23:07] How long does one of these updates take? (re @vrandecic: I updated the German numbers. Coverage went up by more than 1%!)
[00:25:26] yay \o/
[00:29:21] Once they're correctly set up, a few minutes
[00:29:43] The slowest step is downloading the data
[00:30:42] ok cool
[00:54:06] @Csisc1994 yes, you are very welcome to try out your idea for semantic annotation! I'd be happy to figure out how your annotator could be integrated with the annotation interface
[00:56:01] @বোধিসত্ত্ব: re bn, I'd love to, but I need a reasonably clean, encyclopedic corpus of bn text - preferably a cleaned-up version of bn.wikipedia - if we can get that somewhere, I'd be happy to have that wired up
[01:10:35] @carn1: re "would abstract wiki aid global templates initiative?" yes and no. We will help with some of the use cases from global templates, but we will not implement global templates and modules as they are currently envisioned
[01:11:02] we plan to support Lua as one of the early languages
[01:11:31] (I expect it to be in the first five)
[01:12:44] @Dennis0123, @mahir256: the talk was recorded and will be published. The chat, to the best of my knowledge, will not
[01:18:44] @mahir256 - I take this back, the downloading is not the slowest step, creating the wordlist is.
[01:19:04] about half an hour, it looks like?
[01:21:00] @Nikki - thanks for all the suggestions, I put them in a list. I probably should turn them into Phabricator tasks, but for now I'll just see how many of them get resolved quickly without process
[01:23:36] ok :)
[02:48:26] Resolved. It was distinguishing between capitalized and uncapitalized forms (but then not displaying that difference) (re @Nikki: some of the words appear multiple times, is there a reason for that? (e.g. dadurch, noch, schon))
[02:49:07] ahh, I see
[02:49:51] Resolved means for the languages that are already in the new Python 3 pipeline. I'm trying to move them all over. German is there already.
[02:50:09] Three missing file for German has been updated
[02:50:53] *The, not Three
[06:16:26] it doesn't seem to consider a dash part of a word, e.g. nordrhein and westfalen are listed separately, that doesn't seem right to me
[06:18:05] but then it leaves „ alone
[06:49:55] 👍 (re @carn1: My term as a ruwiki ArbCom member will end in two weeks and I plan to focus on template globalization)
[07:00:50] There are a lot of different approaches; apparently it will be necessary to write a code analyzer to check whether TemplateData is up to date
[07:02:34] At the moment, the task is to find out which modules and templates in different sections perform the same functionality. It can be done through Wikidata or through a JSON-on-Commons parameter mapping
[07:04:23] Sometimes the functionality is the same, but the same info (a date, for example) has different formats
[07:04:49] At the moment, the task is to find out which modules and templates in different wikis perform the same functionality. It can be done through Wikidata or through a JSON-on-Commons parameter mapping
[15:20:00] We currently have two Outreachy students working on that (re @carn1: At the moment, the task is to find out which modules and templates in different wikis perform the same functionality. It can be done through Wikidata or through a JSON-on-Commons parameter mapping)
[15:20:16] Working on this (re @Nikki: but then it leaves „ alone)
[15:20:23] \o/
[15:21:48] Not sure what to do about that. I could try to have a run that doesn't remove dashes, and we can see which gives better results? (re @Nikki: it doesn't seem to consider a dash part of a word, e.g. nordrhein and westfalen are listed separately, that doesn't seem right to me)
[15:22:12] that would be nice
[15:22:43] Will do. But I'll be away from a computer until Tuesday, so it'll need a bit of patience
[15:22:58] Can I join? (re @vrandecic: We currently have two Outreachy students working on that)
[15:23:08] Carn@narod.ru
[15:23:09] that's fine, there's no rush
[15:23:20] Also for the Hindi runs you should strip U+0964 and U+0965 from the ends of tokens (the copula है appears twice in the current list)
[15:23:28] also yeah, no rush
[15:24:33] Thanks, noted in my to-do list (re @mahir256: Also for the Hindi runs you should strip U+0964 and U+0965 from the ends of tokens (the copula है appears twice in the current list))
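Not the actual pipeline code, but a minimal sketch in Python (the wordlist pipeline is Python 3) of the token clean-up discussed above: punctuation such as „ and the Devanagari danda characters U+0964/U+0965 is stripped only from the edges of tokens, so an internal dash as in Nordrhein-Westfalen survives. The function names and the whitespace splitting are illustrative assumptions, not how the real pipeline tokenizes.

```python
import unicodedata

def is_punct(ch: str) -> bool:
    """True for any Unicode punctuation character, which covers „ as well as
    the Devanagari danda (U+0964) and double danda (U+0965)."""
    return unicodedata.category(ch).startswith("P")

def clean_token(token: str) -> str:
    """Strip punctuation from both ends of a token but leave the inside
    untouched, so 'Nordrhein-Westfalen' keeps its dash while 'है।' loses
    the trailing danda."""
    start, end = 0, len(token)
    while start < end and is_punct(token[start]):
        start += 1
    while end > start and is_punct(token[end - 1]):
        end -= 1
    return token[start:end]

def tokenize(text: str) -> list:
    """Whitespace-split, clean each token, and drop anything left empty."""
    return [t for t in (clean_token(tok) for tok in text.split()) if t]

if __name__ == "__main__":
    print(tokenize("„Nordrhein-Westfalen liegt im Westen."))  # dash kept, quote mark dropped
    print(tokenize("यह ठीक है।"))  # danda stripped, so है is counted only once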
[15:26:47] Yes! You can find more information here https://meta.m.wikimedia.org/wiki/Abstract_Wikipedia/Updates/2020-12-09 (re @carn1: Can I join?)
[16:28:37] Hi, I'm one of the Outreachy interns working on this task here.
[16:28:38] Our approach is a bit different from parsing JSON - we are basically fetching all the modules from the database and doing some analytics based on the source code itself (a modified Levenshtein distance and word embeddings are the two most promising methods for now, but each has its flaws).
[16:28:40] I'm not sure how you would like to use Wikidata, but I have to warn you that sometimes modules and templates with the same functionality are mixed; see the comments on https://phabricator.wikimedia.org/T272003.
[16:28:41] Either way, our internship will end in a few weeks, and the project is of course open for development and improvements. :) (re @carn1: At the moment, the task is to find out which modules and templates in different wikis perform the same functionality. It can be done through Wikidata or through a JSON-on-Commons parameter mapping)
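Not the interns' actual code, just a rough baseline illustration of the Levenshtein idea mentioned above: compute a character-level edit distance between two modules' source texts and normalise it into a similarity score, so near-identical copies across wikis come out close to 1.0. Their approach uses a modified distance plus word embeddings; this unmodified O(n*m) version is only a sketch and gets slow on long modules.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein edit distance, iterative DP with two rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(source_a: str, source_b: str) -> float:
    """Normalise the distance to [0, 1]; 1.0 means identical source."""
    longest = max(len(source_a), len(source_b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(source_a, source_b) / longest

if __name__ == "__main__":
    # Hypothetical snippets standing in for two wikis' copies of a module.
    module_a = "local p = {}\nfunction p.hello(frame)\n  return 'Hello'\nend\nreturn p"
    module_b = "local p = {}\nfunction p.hallo(frame)\n  return 'Hallo'\nend\nreturn p"
    print(round(similarity(module_a, module_b), 3))
```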
[16:40:50] Thank you for the comment and link! In Wikidata there is no property for templates yet: https://www.wikidata.org/wiki/Wikidata:Property_proposal/Wikipedia_infobox_field
[16:44:42] Wiki-code analytics will be easier than parsing Lua code, I think
[16:47:55] It might be - it's just that the initially formulated task was more about analysing the code itself.
[16:52:43] Thanks for jumping in Jade! (re @lostEnchanter: Hi, I'm one of the Outreachy interns working on this task here. …)
[20:02:28] That is excellent. The idea is simple. Instead of just annotating named entities, why don't we annotate relations? (re @wmtelegram_bot: @Csisc1994 yes, you are very welcome to try out your idea for semantic annotation! I'd be happy to figure out how your annotator could be integrated with the annotation interface)
[20:04:14] I am currently reading several papers about Quantum NLP. I think that the principles of Quantum Information Science can drive the automatic annotation of texts.
[20:05:19] See this example: Guide staff and colleagues
[20:09:39] If we take "Guide staff" alone, there are two possible readings here.
[20:09:40] Choice 1: Guide (Adj.) and Staff (N.)
[20:09:41] Choice 2: Guide (Verb) and Staff (N.)
[20:09:43] But because we have "and colleagues", Choice 2 is the accurate one, as "Guide (Adj.) colleagues (N.)" does not exist.
[20:12:30] @vrandecic I have seen https://www.dropbox.com/s/zqmgf7cdtyayrmk/Denny%20Vrande%C4%8Di%C4%87.mp4?dl=0. It is an excellent talk about the Abstract Wikipedia project.
[20:13:53] Congratulations. I invite all the members to see it too. However, I think that there should be several presentations about Wikifunctions as a stand-alone topic.
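Coming back to the "Guide staff and colleagues" example above: the ambiguity described there is exactly what a part-of-speech tagger has to resolve from context. A small illustrative sketch with spaCy (assuming the en_core_web_sm model is installed; this is not part of any project code, and the tagger's actual output may differ):

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

for text in ("Guide staff", "Guide staff and colleagues"):
    doc = nlp(text)
    # Print each token with the part of speech the tagger chose; the added
    # coordination "and colleagues" is the extra context that can shift
    # which reading of "Guide" the model prefers.
    print(text, "->", [(token.text, token.pos_) for token in doc])
```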