[09:16:08] An interesting article relevant to Abstract Wikipedia, published by the Vinogradov Russian Language Institute, about using Wikipedias in minority languages of Russia in corpora: http://nevmenandr.net/personalia/wiki.pdf
[09:16:09] It's in Russian, but I'll give a quick summary here: They say that Wikipedia is tempting as a source for building digital corpora, and it's good for well-developed languages like Russian, and in theory it could be good for the minority languages of Russia too, but in practice it's problematic. They analyzed the word frequency in the Russian Wikipedia and compared it to Tatar, Bashkir, Erzya, Moksha, Komi, and Komi-Permyak
[09:16:10] The result was that in Russian, the 20 most frequent words were prepositions, conjunctions, and particles, which is comparable to other Russian texts. The only exception was the word "year", which is understandable given that an encyclopedia talks a lot about history and chronology.
[09:16:12] But in the other languages, the frequency was quite different. In some languages, the most common words were "river" and "water", because they have a lot of articles about rivers, probably filled by bots. In another, the more common words were about time: day, year, holiday.
[09:16:13] In one language, many of the most common words were about plants, with "orchid" in 6th place! So they assume that it was filled, automatically or semi-automatically, by articles about plants from a database. They called the whole phenomenon "the orchid syndrome".
[09:16:15] The conclusion is that while it would be nice in theory to use Wikipedia to build corpora for low-resource languages, it's problematic in practice because of the many bots.
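A rough sketch of the kind of check described in the summary above: list the most frequent words of a corpus and see which of them are not function words. This is not the article's actual method. The tokenizer, the function names `top_words` and `content_words_in_top`, and the function-word list passed in are all assumptions made for illustration.

```python
import re
from collections import Counter


def top_words(text, n=20):
    """Return the n most frequent lowercase word tokens with their counts."""
    # Naive tokenizer (an assumption): any run of Unicode word characters.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common(n)


def content_words_in_top(text, function_words, n=20):
    """Return the words among the top n that are NOT in function_words.

    function_words is a caller-supplied set of prepositions, conjunctions,
    particles and so on for the language being checked (hypothetical input).
    Many content words in the result may hint at bot-filled articles.
    """
    return [word for word, _ in top_words(text, n) if word not in function_words]


# Toy usage: a tiny "corpus" dominated by river articles.
sample = "the river and the water of the river"
print(top_words(sample, 2))
print(content_words_in_top(sample, {"the", "and", "of"}, 3))
```

In a healthy corpus the second list would be nearly empty for the top 20; a list full of words like "river" or "orchid" would be the pattern the article calls "the orchid syndrome".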
[09:29:52] That's one example of why it's important to tag AW-generated text in a machine-readable way: responsible researchers know that some texts in Wikipedias are auto-generated, but if they can be certain that the auto-generated ones are tagged, they can ignore them or treat them differently when using them for corpora (and corpora are, in turn, used for machine learning).
[23:55:55] amire80: interesting!
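A minimal sketch of what a machine-readable tag would let a corpus builder do. The chat does not specify any tag format, so the `auto_generated` field and the `split_by_origin` function are hypothetical, and the documents are made-up placeholders.

```python
def split_by_origin(documents, tag_field="auto_generated"):
    """Split documents into (human_written, generated) lists.

    Each document is a dict. tag_field is a hypothetical boolean marker for
    auto-generated text; documents without it are treated as human-written,
    which is only safe if the tagging is reliable.
    """
    human, generated = [], []
    for doc in documents:
        if doc.get(tag_field, False):
            generated.append(doc)
        else:
            human.append(doc)
    return human, generated


# Toy usage with placeholder documents.
docs = [
    {"title": "A", "text": "generated text", "auto_generated": True},
    {"title": "B", "text": "hand-written text"},
]
human, generated = split_by_origin(docs)
print([d["title"] for d in human], [d["title"] for d in generated])
```

A researcher could then build the corpus from the first list only, or keep the second list and weight it differently.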