[16:47:56] Welcome Genoveva!
[18:02:05] The title of Timnit Gebru's paper, which has been making so much noise in the media in the last few days before it's even published, includes the words "Stochastic Parrots", and I love these two words.
[18:02:06] Abstract Wikipedia, if I understand correctly, is not going to be a language model that learns from texts, according to the current plan. But someone may use AW's output to train language models, so let's hope they won't become stochastic parrots. Or better, design AW in a way that will minimize the likelihood of this.
[18:02:07] https://www.technologyreview.com/2020/12/04/1013294/google-ai-ethics-research-paper-forced-out-timnit-gebru/amp/
[18:14:05] Right, Abstract Wikipedia is not planning to use probabilistic language models for its text generation. I don't know how, or whether, to minimize the chance of our output becoming input for model training. In fact, I think it would be a great contribution, because of the ability to create parallel corpora well beyond the language pairs that already have them, for example (among many other possibilities).
[18:17:11] The main concern here is bias. If that is solved, there will be few problems with using Wikifunctions for machine learning. (re @vrandecic: Right, Abstract Wikipedia is not planning to use probabilistic language models for its text generation…)
[18:18:20] Yes, it can be a great contribution, but I'm nevertheless concerned about possible misuse. For example, if there's a mistake in a renderer function, it can easily create a lot of text that is wrong, grammatically or factually (the mistake in the function can be made in both good faith and bad faith; it doesn't matter very much). But an ML algorithm that isn't thoughtful enough may treat it as correct.
[18:19:47] This is an interesting idea. I am exploring several approaches to how this could be solved. (re @amire80: Yes, it can be a great contribution, but I'm nevertheless concerned about possible misuse…)
[18:20:44] The first one is quantum NLP techniques.
[18:21:08] The second one is integrating semantic similarity into language models.
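A minimal sketch of what such a semantic-similarity check could look like, assuming multilingual sentence embeddings via the sentence-transformers library; the model name and threshold are illustrative assumptions, not anything decided for Abstract Wikipedia:

```python
# A rough sketch, not an agreed design: compare a rendered sentence
# against a trusted reference rendering using multilingual sentence
# embeddings, and flag output whose meaning drifts too far.
# The model name and threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def looks_semantically_consistent(candidate: str, reference: str,
                                  threshold: float = 0.75) -> bool:
    """Return True if the candidate's meaning stays close to the reference."""
    embeddings = model.encode([candidate, reference], convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    return score >= threshold

# Example: comparing a candidate rendering with a reference phrasing.
print(looks_semantically_consistent(
    "The city has about two million inhabitants.",
    "Roughly two million people live in the city."))
```

Such a check could catch renderer output whose meaning drifts from a trusted reference, though it would not catch errors that both texts share.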
[18:21:45] The best thing I can think of that can be done to avoid it is what I already mentioned here once or twice: to mark all the output in a machine-readable manner, so that consumers will know to treat it accordingly. Consumers may be machine learning algorithms that train themselves on a lot of text. The marking is supposed to say that it was machine-generated, and it should probably also include the version of the function that was used.
[18:21:46] And even then we can only hope that the consumers will actually use this metadata correctly and ethically. I cannot think of a way to enforce it.
[20:51:14] Yes. Most of the people using Wikipedia for training data use it via the wikitext. That wouldn't include the generated texts, so we're good.
[20:51:21] You'd need to explicitly add it.
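A minimal sketch of the machine-readable marking described above, assuming a JSON envelope around the generated text; all field names and values here are illustrative assumptions, not an agreed format:

```python
# A rough sketch of the marking idea above: wrap generated text in a
# small provenance envelope so that consumers (including ML training
# pipelines) can recognize it and filter it out.
# All field names and values are illustrative assumptions.
import json
from datetime import datetime, timezone

def mark_generated_text(text: str, renderer_id: str, renderer_version: str) -> str:
    envelope = {
        "text": text,
        "provenance": {
            "machine_generated": True,
            "renderer_id": renderer_id,            # which renderer produced it
            "renderer_version": renderer_version,  # so bad output can be traced to a bug
            "generated_at": datetime.now(timezone.utc).isoformat(),
        },
    }
    return json.dumps(envelope, ensure_ascii=False)

# Example: a consumer can skip anything marked as machine-generated.
record = json.loads(mark_generated_text("Some rendered sentence.", "Z12345", "7"))
if record["provenance"]["machine_generated"]:
    pass  # exclude from the training corpus
```

A downstream training pipeline could then filter on the machine-generated flag before ingesting text, which is exactly the cooperative behavior the metadata can enable but, as noted above, cannot enforce.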