[02:02:55] [[abstract:Q467]] returns the error "Wikifunctions returned a failed response: Reached time limit in orchestrator" when selected Japanese. I think this probably means that in function Z33078 AW tried to return the lexeme "lexical category" = Q34698 as an adjective and it was not found.
[02:02:56] There is no adjective(Q34698) in Japanese that directly corresponds to the adjective "grown(L337192)" used in this article. Instead, in Japanese, we use expressions that correspond to adjectives with words in other "lexical category" such as "na-adjective/adjectival noun(Q1091269)" ex. L1561541#F6 or "collocation(Q1122269)" ex. L1561528, and so on.
[02:02:58] I think it may be necessary to think about mapping the "lexical category" of lexemes set, which are categorized by the grammar of each language, from the aspect of its role, so to speak, like a super lexical category. Are there any ideas or past discussions?
[06:20:07] Good question... Mutations can be quite complex, depending on word gender for instance. Also there may be homographic words that trigger different mutations, I'll check. If not then your idea may be the best solution. (re @Al: 🤔 Would it be too painful to think of everything operating without mutation and only mutating the whole sentence/paragraph at th...)
[07:12:41] yes, it was outdated 😅 (re @Al: Oh, and the test case provided only a reference, not a lexeme 😎)
[09:00:37] I should begin by referring you to [[Wikifunctions:Type proposal/Syntactic table]]. It doesn’t address your points directly, but it’s a useful place to start. In a sense, Abstract Wikipedia is an exercise in “pure”, or at least minimally committed, semantics: the language-neutral functions are there to support knowledge representation, not to commit to any particular linguistic form.
[09:00:38] In practice, however, functions are often discovered via the kinds of sentences they produce in a small and unrepresentative set of languages.
That is best seen as a provisional indexing device rather than a defining property. At present we move rather quickly from a high-level function to a language-specific realisation, but there is no reason in principle why groups of languages could not route themselves via an intermediate layer when they share a mode of expression. That said, I cannot think of any current cases where this occurs.
[09:00:41] Considering a familiar example, and using English purely for illustration, “Paris is the capital of France”, but also “Paris is the French capital”, “Paris is a city in France and its capital”, and “Paris is a city. It is in France. It is the capital of that country.” That is already assuming that Paris is the focus, which is probably a commitment made at the highest level, even if some languages may find that awkward to realise. (re @higa4: [[abstract:Q467]] returns the error "Wikifunctions returned a failed response: Reached time limit in orchestrator" when selected...)
[09:27:39] Ah, yes… the string is just the tip of the iceberg. I’m just thinking about simple old English, where the choice between “a” and “an” can only be made when the following word is settled (to use an, if I may say so, contrived construction). (re @NicolasVIGNERON: Good question... Mutations can be quite complex, depending on word gender for instance. Also there may be homographic words that...)
[10:33:10] Thanks. I’ll try to understand that type proposal. (re @Al: I should begin by referring you to [[Wikifunctions:Type proposal/Syntactic table]]. It doesn’t address your points directly, but...)
[11:36:40] The interesting thing about "a" and "an" is that it is decided according to how the word after it is pronounced and not how it is written. (re @Al: Ah, yes… the string is just the tip of the iceberg. I’m just thinking about simple old English, where the choice between “a” and...)
[11:37:31] A United /junajted/.
[11:37:32] An umbrella /^mbrElla/.
[11:38:35] So, probably, you will need to fetch the pronunciation, retrieve the first phoneme, and then decide.
[11:38:49] Absolute Cinema.
[12:36:07] lexemes are so cool! (re @Csisc1994: So, probably, you will need to fetch the pronunciation, retrieve the first phoneme, and then decide.)
[12:42:35] Al I just got Z33123 to work thanks to you 🤩
[12:45:37] now I'm going to use it to reimplement Z22018 because I dont like the inputs nor outputs
[12:46:16] Yes, but perhaps only in the less than 1% of cases where the spelling is an unreliable guide. 🤔
[12:47:26] ⬆️ (re @Csisc1994: So, probably, you will need to fetch the pronunciation, retrieve the first phoneme, and then decide.)
[12:49:05] AW the rabbit hole of edge cases (because language) (re @Al: Yes, but perhaps only in the less than 1% of cases where the spelling is an unreliable guide. 🤔)
[12:55:11] Does that support Z32645? 🤷‍♂️ (re @Npriskorn: now I'm going to use it to reimplement Z22018 because I dont like the inputs nor outputs)
[12:57:28] my function output monolingual text "the" and not a kleenean, we could create an implementation using Z32645 but I'm thinking of rolling my own logic first. Let the best function win! 😎 (re @Al: Does that support Z32645? 🤷‍♂️)
[13:00:17] According to chatgpt the difference when it comes to "the" for countries is:
[13:00:17] What matters is the meaning pattern of the name.
[13:00:19] A. Needs “the” → descriptive entities
[13:00:20] These are not pure names, but descriptions:
[13:00:22] political structures →
[13:00:23] the United States (a union of states)
[13:00:25] the United Kingdom
[13:00:26] plural/geographical groups →
[13:00:28] the Netherlands
[13:00:29] the Philippines
[13:00:31] 👉 These behave like common nouns, so they take “the”.
[13:00:32] So if the proper noun has P31 = common noun then it needs "the"
[13:01:46] Sure 👍 It is our most common word and far from straightforward… (re @Npriskorn: my function output monolingual text "the" and not a kleenean, we could create an implementation using Z32645 but I'm thinking of...)
[13:03:43] oh it's not that simple
[13:03:44] ❌ It’s context-dependent
[13:03:46] Even for something like:
[13:03:47] “the United States”
[13:03:49] You can still see:
[13:03:50] “United States policy” (no “the”)
[13:03:52] headlines dropping articles
[13:03:53] So it’s not an absolute property of the lexeme itself.
[13:03:55] So we can't encode this on the lexeme I guess (re @Npriskorn: According to chatgpt the difference when it comes to "the" for countries is:
[13:03:56] What matters is the meaning pattern of the name.
[13:03:58] A...)
[13:04:13] I rather doubt your inference. Items are not “nouns”, as such. (re @Npriskorn: According to chatgpt the difference when it comes to "the" for countries is:
[13:04:13] What matters is the meaning pattern of the name.
[13:04:14] A...)
[13:07:45] We need to know the context of the item in the sentence to know if it needs "the". I'm not sure how to do that.
[13:09:49] Indications of need for definite article:
[13:09:49] * country name is not the first word of the sentence
[13:09:50] * country name has multiple parts (combines)
[13:09:52] * lexeme is a choronym (name for country or region)
[13:09:53] * has "the" in any alias?
[13:12:30] so a function that helps users state country names with these settings:
[13:12:31] * start of sentence (bool)
[13:12:32] * predicate exists before (bool)
[13:12:34] might be valuable?
[13:12:59] I think it is lexicographically correct that some countries have or take the definite article whilst most do not. But perhaps in so doing, they function as common nouns and therefore may be unmarked for definiteness in certain grammatical contexts.
In particular, one of English’s stronger rules is the single determiner, so *“a the United States policy” will never do! The determiner (even if null) attaches to the head of the phrase, not necessarily the following noun (which is used attributively in many cases). (re @Npriskorn: oh it's not that simple
[13:13:01] ❌ It’s context-dependent
[13:13:02] Even for something like:
[13:13:04] “the United States”
[13:13:05] You can still see:
[13:13:07] “United Sta...)
[13:22:37] Here are two examples from the guardian today:
[13:22:38] "US rescues second crew member of downed F-15E fighter jet from Iran"
[13:22:40] "Middle East crisis live: Trump uses expletive-ridden social media post to threaten Iran’s infrastructure"
[13:26:59] I asked chatgpt to give me examples of no the for US and not in the beginning of a sentence:
[13:26:59] 1. As a modifier (attributive use)
[13:27:01] When it modifies another noun, the article drops:
[13:27:02] She studies United States history.
[13:27:04] They discussed United States foreign policy.
[13:27:05] This is United States law.
[13:27:07] 👉 Here “United States” behaves like an adjective, so no “the”.
[13:27:08] We have Q692218 but no lexemes for US law or United States law currently.
[13:27:57] “Headlines always ignored in shock AW guidance” according to Guardian journalist… 😏 (re @Npriskorn: Here are two examples from the guardian today:
[13:27:58] "US rescues second crew member of downed F-15E fighter jet from Iran"
[13:27:59] "Middle E...)
[13:28:01] I'm thinking we don't need to account for this in the function I'm building. So the signals the function needs are:
[13:28:02] * the item
[13:28:04] * whether in the beginning of a sentence
[13:28:50] And perhaps we shouldn't? We can just combine the lexemes for United States and law to get what is needed. (re @Npriskorn: I asked chatgpt to give me examples of no the for US and not in the beginning of a sentence:
[13:28:50] 1.
As a modifier (attributive use)
[13:28:52] ...)
[13:28:55] do we need "is headline" signal also to the function then? (re @Al: “Headlines always ignored in shock AW guidance” according to Guardian journalist… 😏)
[13:29:48] but in that case we need the following signal also: "is used (as an adj.) to modify the subsequent noun" (re @Jan_ainali: And perhaps we shouldn't? We can just combine the lexemes for United States and law to get what is needed.)
[13:30:17] I'm adding these 2 signals, then we have 4 in total
[13:31:57] Is it an adjective or in genitive? Compare "Det här är Sveriges lag." and "Det här är svensk lag." Both can be useful depending on what one is trying to say. (re @Npriskorn: but in that case we need the following signal also: "is used (as an adj.) to modify the subsequent noun")
[13:32:39] so another signal? "is genitive"? (re @Jan_ainali: Is it an adjective or in genitive? Compare "Det här är Sveriges lag." and "Det här är svensk lag." Both can be useful depending ...)
[13:33:43] Yes, and it’s simply the law of the United States, with no article when the United States is used attributively, just as we do not talk about *”the French law” but “French law”. (re @Jan_ainali: And perhaps we shouldn't? We can just combine the lexemes for United States and law to get what is needed.)
[13:44:38] I created a WIP implementation but get an error Z33138
[13:45:16] i havent learnt how to debug yet 🙈 : https://tools-static.wmflabs.org/bridgebot/01414697/file_79209.jpg
[13:45:50] Simple, check the type of the arguments you have. (re @Npriskorn: i havent learnt how to debug yet 🙈)
[13:46:00] They should be the same.
[13:47:02] ah its because I changed the inputs and the old tests dont have them
[13:49:26] now I get 🙈 : https://tools-static.wmflabs.org/bridgebot/7845fb7d/file_79210.jpg
[13:51:13] does anyone know how the joining of strings work with empty strings? e.g. "first", "", "last" joined with space does that get 2 spaces in the middle?
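On the join question just above, a minimal Python sketch may help. It uses plain Python `str.join`, not the Wikifunctions join function, so it only illustrates the general behaviour; the helper name `join_non_empty` is hypothetical. It shows that an empty string in the middle keeps both separators, and that dropping empty strings before joining avoids the doubled space.

```python
def join_non_empty(parts: list[str], sep: str = " ") -> str:
    """Join parts with sep, skipping empty strings (hypothetical helper)."""
    return sep.join(p for p in parts if p != "")


parts = ["first", "", "last"]

naive = " ".join(parts)          # the empty middle element keeps both separators
cleaned = join_non_empty(parts)  # the empty element is removed before joining

print(repr(naive))
print(repr(cleaned))
```

Filtering before joining keeps the join itself simple, at the cost of silently hiding a missing value.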
[13:52:36] *Z32645*
[13:52:37] Takes Wikidata item and not Wikidata item reference. (re @Npriskorn: now I get 🙈)
[13:53:24] this is a common failure mode, we should improve that error to be crystal clear IMO (re @Csisc1994: Z32645
[13:53:25] Takes Wikidata item and not Wikidata item reference.)
[13:53:54] Debugging the calls and manually checking does not work well for the end user. (re @Npriskorn: this is a common failure mode, we should improve that error to be crystal clear IMO)
[13:54:44] I agree, but still the system is so cool you just gotta love it despite some quirks and rough edges! (re @Csisc1994: Debugging the calls and manually checking does not work well for the end user.)
[13:55:43] Perhaps it pays to step back. The rules for using determiners in English sentences are one problem. Whether a proper noun omits the definite article where a contextually unique common noun retains it is conceptually fairly simple: names felt as such omit it. That, I think, is why Mars has no article whereas the Moon does. In effect, this just rephrases the question, but it feels closer to the real problem. When the phrase requires a definite article, the head resists it if it is felt as a name (whatever that means). (re @Npriskorn: but in that case we need the following signal also: "is used (as an adj.) to modify the subsequent noun")
[13:58:13] Yes, you get two, but HTML reduces those to the appearance of one, which is either helpful or not 😏 (re @Npriskorn: does anyone know how the joining of strings work with empty strings? e.g. "first", "", "last" joined with space does that get 2 ...)
[13:59:43] we dont seem to have anything in the Mars lexeme to indicate that "it feels like a name". (re @Al: Perhaps it pays to step back. The rules for using determiners in English sentences are one problem. Whether a proper noun omits ...)
[13:59:50] https://tools-static.wmflabs.org/bridgebot/316b8f03/file_79211.jpg
[14:00:02] the derived lexeme is not marked as a name either
[14:00:17] https://tools-static.wmflabs.org/bridgebot/dab4365d/file_79212.jpg
[14:00:25] Indeed. That is hardly a surprise! (re @Npriskorn: we dont seem to have anything in the Mars lexeme to indicate that "it feels like a name".)
[14:00:26] how would we encode this "feels like a name"
[14:06:26] That depends on the lexicographic community. They might take the view that proper nouns generally feel like names so only those that do not should be marked. That, I think, is how we arrive at item aliases that make the article (and capitalisation) explicit. (re @Npriskorn: how would we encode this "feels like a name")
[14:16:12] I guess you want a "remove blank string from typed list of strings" function. (re @Npriskorn: does anyone know how the joining of strings work with empty strings? e.g. "first", "", "last" joined with space does that get 2 ...)
[14:23:06] I think that’s just Filter with the is-empty-string predicate, but it may already be wrapped as a named function. (re @Winston_Sung: I guess you want a "remove blank string from typed list of strings" function.)
[14:24:26] And please note that surname ≠ "sequential order" last name and given name ≠ "sequential order" first name in all people names. (re @Winston_Sung: I guess you want a "remove blank string from typed list of strings" function.)
[14:28:43] It has lexical category:proper noun. That feels like a name to me. (re @Npriskorn: we dont seem to have anything in the Mars lexeme to indicate that "it feels like a name".)
[14:45:33] And so the wheel turns 🤣 (re @Jan_ainali: It has lexical category:proper noun. That feels like a name to me.)
[14:52:02] There's further nuance, too:
[14:52:02] Imperial/colonial usage treats *territories* as requiring definite articles despite being proper names, e.g. "the Sudan", "the Congo", "the Ukraine".
[14:52:04] Citizens of these now-independent-countries rightly object to this vestigial usage, which implicitly denies their country's statehood. (re @Al: Perhaps it pays to step back. The rules for using determiners in English sentences are one problem. Whether a proper noun omits ...)
[14:53:38] Is "the Congo" colonial? I thought it just had a "the" because it's named after a river like "the Bronx" (re @abartov: There's further nuance, too:
[14:53:40] Imperial/colonial usage treats *territories* as requiring definite articles despite being proper n...)
[14:58:13] "The Congo river" (or just "the Congo" when the river is the context, e.g. "they sailed down the Congo") is fine, but yes, referring to (either country) as "the Congo" harks back to colonial times, and if it feels natural to you, it's simply because it hasn't been rooted out yet in your sociolect.
[14:59:42] Ukrainians have had to point this out to many well-intentioned people in the years since the full-scale Russian invasion, when folks started paying attention and said things like "we want to help folks in the Ukraine"...
[15:03:19] never even heard about this 🙈 (re @abartov: There's further nuance, too:
[15:03:20] Imperial/colonial usage treats *territories* as requiring definite articles despite being proper n...)
[15:04:31] it seems that by just working with languages we invite a ton of historical complexities. Reminds me about https://www.youtube.com/watch?v=-5wpm-gesOY which is great!
[15:05:37] maybe if Tom was in here he would say: "You should never ever deal with generating language if you can help it." 😂
[15:08:16] Yes, it was previously noted that Ukraine’s item lost its alias with the definite article in 2012. Usage changes over time but we would generally favour the reasonably contemporary. (re @abartov: There's further nuance, too:
[15:08:17] Imperial/colonial usage treats *territories* as requiring definite articles despite being proper n...)
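The signals discussed earlier for the country-name function (the item, start of sentence, headline, attributive use) can be sketched in plain Python. This is an illustration under assumptions, not the Wikifunctions implementation: `country_phrase` and all of its parameters are hypothetical, `takes_article` stands in for whatever marking the item or lexeme ends up carrying, and the genitive signal is left out because the thread did not settle it.

```python
def country_phrase(name: str, takes_article: bool, *,
                   sentence_start: bool = False,
                   headline: bool = False,
                   attributive: bool = False) -> str:
    """Return the name with or without "the" (hypothetical helper).

    takes_article: assumed per-name marking, e.g. True for "United States".
    headline / attributive: contexts in which the article is dropped,
    as in "United States policy".
    """
    if not takes_article or headline or attributive:
        return name
    article = "The" if sentence_start else "the"
    return f"{article} {name}"


print(country_phrase("United States", True))
print(country_phrase("United States", True, sentence_start=True))
print(country_phrase("United States", True, attributive=True) + " law")
print(country_phrase("France", False))
```

Keeping the per-name marking separate from the context flags means the context rules apply unchanged whichever way the marking question is resolved.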
[15:18:55] The lexical category already indicates that (re @Npriskorn: we dont seem to have anything in the Mars lexeme to indicate that "it feels like a name".)
[15:21:14] Do you know any proper noun that doesn't behave like a name?
[15:21:14] As a lexicographer and a French speaker (noun and name both are "nom"), I'm very curious (re @Al: That depends on the lexicographic community. They might take the view that proper nouns generally feel like names so only those ...)
[15:26:59] and there is also "Astronym" indicating it's like a name... (re @NicolasVIGNERON: The lexical category already indicates that)
[15:29:00] In Zhuang, WeChat (veihsin), TikTok (doujyinh) are not capitalized, while new borrowings from Western languages, like _via ferrata_ (Feihlahdaz) are capitalized (re @NicolasVIGNERON: Do you know any proper noun that doesn't behave like a name?
[15:29:01] As a lexicographer and a French speaker (noun and name both ar...)
[15:29:25] Pretty sure (90%) it's not a lexeme and thus it can't be a Wikidata Lexeme (re @Npriskorn: I asked chatgpt to give me examples of no the for US and not in the beginning of a sentence:
[15:29:26] 1. As a modifier (attributive use)
[15:29:28] ...)
[15:29:34] Yes, the United Kingdom, the Moon (arguably) and the Solar System. The capitalisation comes from their status as proper nouns, but they don’t omit the definite article. (re @NicolasVIGNERON: Do you know any proper noun that doesn't behave like a name?
[15:29:35] As a lexicographer and a French speaker (noun and name both ar...)
[15:29:46] https://en.wiktionary.org/wiki/veihsin
[15:29:47] https://en.wiktionary.org/wiki/Feihlahdaz (re @OverflowCat: In Zhuang, WeChat (veihsin), TikTok (doujyinh) are not capitalized, while new borrowings from Western languages, like via ferrat...)
[15:32:29] Isn't it true for both proper nouns and names?
[15:32:31] I'm trying to understand the difference between the two (re @Al: Yes, the United Kingdom, the Moon (arguably) and the Solar System. The capitalisation comes from their status as proper nouns, b...)
[15:34:27] And in any case, I don't think that we can assume that a name/noun should be capitalized (or not) nor it should have an article
[15:35:02] I had iMac or iPhone in mind 😉 (re @OverflowCat: In Zhuang, WeChat (veihsin), TikTok (doujyinh) are not capitalized, while new borrowings from Western languages, like via ferrat...)
[15:36:15] We say just “France” but “the Netherlands”, “Brittany” but “the Algarve”…
[15:37:59] Capitalising proper nouns is the English rule, but there are always exceptions 😏 (re @NicolasVIGNERON: And in any case, I don't think that we can assume that a name/noun should be capitalized (or not) nor it should have an article)
[15:43:08] But so are names, no? (re @Al: Capitalising proper nouns is the English rule, but there are always exceptions 😏)
[15:44:43] Oh, yes… I see what you mean. Names are usually classed as proper nouns too. (re @NicolasVIGNERON: But so are names, no?)
[16:00:13] ah thanks 🙏
[16:00:14] and vice versa? (re @Al: Oh, yes… I see what you mean. Names are usually classed as proper nouns too.)
[16:06:23] 🤷‍♂️ enwiki says
[16:06:25] “A distinction is normally made in current linguistics between _proper nouns_ and _proper names_. By this strict distinction, because the term _noun_ is used for a class of single words (_tree_, _beauty_), only single-word proper names are proper nouns: _Peter_ and _Africa_ are both proper names and proper nouns; but _Peter the Great_ and _South Africa_, while they are proper names, are not proper nouns. The term _common name_ is not much used to contrast with _proper name_, but some linguists have used it for that purpose.
While proper names are sometimes called simply _names_, this term is often used more broadly…” 😏 (re @NicolasVIGNERON: ah thanks 🙏
[16:06:28] and vice versa?)
[23:17:21] Given that our database is never assumed complete, there will always be unmarked items. So I'd be in favour of explicit marking in either direction. (re @Al: That depends on the lexicographic community. They might take the view that proper nouns generally feel like names so only those ...)
[23:27:08] If it is rare, it could be marked by having its own lexical subcategory item. (re @NicolasVIGNERON: Do you know any proper noun that doesn't behave like a name?
[23:27:10] As a lexicographer and a French speaker (noun and name both ar...)
[23:34:50] The hardest will be the least documented items. Rivers around my local area usually get "the", but most creeks don't. (re @Al: We say just “France” but “the Netherlands”, “Brittany” but “the Algarve”…)
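The "explicit marking in either direction" idea can be sketched as a lookup that distinguishes "marked as taking no article" from "not marked at all". Everything here is an assumption made for illustration: the `ArticleMarking` enum, the hand-entered `MARKINGS` table and the `place_phrase` helper are hypothetical and do not reflect any existing Wikidata property or Wikifunctions function. The example names come from the thread; "Example Creek" is a made-up unmarked name.

```python
from enum import Enum
from typing import Optional


class ArticleMarking(Enum):
    """Explicit marking in either direction; absence means 'not yet marked'."""
    TAKES_THE = "takes the"
    NO_ARTICLE = "no article"


# Hypothetical hand-entered markings for illustration; not Wikidata data.
MARKINGS: dict[str, ArticleMarking] = {
    "France": ArticleMarking.NO_ARTICLE,
    "Netherlands": ArticleMarking.TAKES_THE,
    "Brittany": ArticleMarking.NO_ARTICLE,
    "Algarve": ArticleMarking.TAKES_THE,
}


def lookup_marking(name: str) -> Optional[ArticleMarking]:
    """Return the explicit marking, or None when the name is unmarked."""
    return MARKINGS.get(name)


def place_phrase(name: str,
                 default: ArticleMarking = ArticleMarking.NO_ARTICLE) -> tuple[str, bool]:
    """Return (phrase, was_marked); unmarked names fall back to `default`."""
    marking = lookup_marking(name)
    was_marked = marking is not None
    effective = marking if marking is not None else default
    phrase = f"the {name}" if effective is ArticleMarking.TAKES_THE else name
    return phrase, was_marked


for place in ["France", "Netherlands", "Algarve", "Example Creek"]:
    print(place_phrase(place))
```

Returning the `was_marked` flag alongside the phrase lets a caller list the least documented names for review instead of treating a guessed default as settled.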