[05:02:12] I recalculated the lexicographical coverage for ar, bg, ca, cs, da, de, el, en, hr, lv, and sv. The biggest changes were to Swedish with +2%, Latvian, which went from 0.5% to 3.3%, and German with a whopping +6% in coverage! https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage
[05:02:57] Let me know if you want any of the existing languages recalculated, or else I will slowly work through them until they are all in PAWS
[05:03:28] Can you do hi right quick? (re @wmtelegram_bot: Let me know if you want any of the existing languages recalculated, or else I will slowly work through them until they are all in PAWS)
[05:03:58] I'll try right now!
[05:04:26] Some of the more common particles in that language now have lexemes, so there should be a considerably larger jump in coverage
[05:05:10] (Reminder to strip U+0964 and U+0965, the danda punctuation marks in Indic scripts, from tokens.)
[05:09:57] Hmm, this script doesn't recognize a single Hindi lexeme
[05:10:06] That's a regression, because last time I had 94
[05:10:14] Can you give me an example lexeme?
[05:10:33] Which language item are you filtering with?
[05:10:49] Q1568
[05:10:57] The hi forms and ur forms are used with Q11051
[05:11:07] That would explain it
[05:11:32] And the shift to using that item happened before the first time you ran the coverage stats
[05:12:28] The previous version of the script relied only on the language tag on the lemma, not on the language item (a mistake that leads to small discrepancies, or to larger ones in the case of Hindi)
[05:12:36] It's a work in progress
[05:14:04] Wow, it went up from 1% to 14.9%!
[05:14:33] https://www.wikidata.org/w/index.php?title=Wikidata:Lexicographical_coverage&diff=1365673338&oldid=1365664785
[05:14:58] Neat!
[05:15:18] I didn't do the suggested stripping yet
[05:16:49] Now let's see whether I understood that suggestion right
[05:18:43] Do you have an example of that?
[05:18:45] https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage/Missing/hi
[05:18:52] The character at the right end of entry #2?
[05:19:14] Entries 2 and 4 are the same, but 2 has U+0964 at the end (re @wmtelegram_bot: https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage/Missing/hi)
[05:20:20] (The other entries ending with U+0964 in that list are mostly conjugations of the verb "to be", with a few others in there as well.)
[05:21:24] OK, thanks!
[05:37:25] I hope it's fixed now: https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage/Missing/hi
[05:37:36] The coverage went up to 15.1% now
[05:37:47] Indeed it is. Thanks a bunch!
[05:38:35] So why do hi and ur have the same item? (Or shall I just read the respective Wikipedia articles?)
[05:39:03] Yeah, reading the article on the Hindustani language should suffice (re @wmtelegram_bot: so why do hi and ur have the same item? (or shall I just read the respective Wikipedia articles?))
[05:39:14] Thanks!
[05:40:11] I'd suggest merging the standardized variants of Shtokavian similarly, but perhaps you know better than I whether this would start a war later ;-)
[05:46:40] Related: what are your thoughts on the [[en:Declaration on the Common Language]]?
[05:53:25] Not using the WMF account to answer that: I think there would be advantages in treating these languages as a single language, from the perspective of making more and better knowledge available to more people.
[05:54:38] Awesome, that's one of the same motivations behind "Hindustani" (re @vrandecic: Not using the WMF account to answer that: I think there would be advantages in treating these languages as a single language from the perspective of making more and better knowledge available to more people.)
[05:54:43] Then I understand the situation :)
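(Editor's note: the token cleanup discussed above, stripping U+0964 DEVANAGARI DANDA and U+0965 DEVANAGARI DOUBLE DANDA before the coverage lookup, could look roughly like the sketch below. This is a minimal illustration, not the actual PAWS script; the function names and data structures are assumptions.)

```python
# Illustrative sketch of the danda-stripping step and the coverage
# calculation discussed in the conversation; names are hypothetical.
DANDAS = "\u0964\u0965"  # DEVANAGARI DANDA and DOUBLE DANDA


def clean_token(token: str) -> str:
    """Strip Indic danda punctuation from both ends of a token."""
    return token.strip(DANDAS)


def coverage(corpus_tokens, lexeme_forms) -> float:
    """Fraction of (cleaned) corpus tokens found among known lexeme forms."""
    forms = set(lexeme_forms)
    tokens = [clean_token(t) for t in corpus_tokens]
    tokens = [t for t in tokens if t]  # drop tokens that were pure punctuation
    hits = sum(1 for t in tokens if t in forms)
    return hits / len(tokens) if tokens else 0.0
```

As the conversation notes, the other half of the fix was to match lexemes by their language *item* (Q11051, Hindustani, for both hi and ur forms) rather than by the language tag on the lemma alone; that part depends on how the script queries Wikidata and is not shown here.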