AI models at risk: Scientists warn of 'model collapse'
AI models trained on data generated by artificial intelligence can fail. Scientists emphasize that the quality of the information given to the models is crucial for their functionality.
In an article published in the prestigious scientific journal "Nature," scientists argue that artificial intelligence (AI) models can experience "model collapse" when trained on data generated by other AI models. They highlight the necessity of using reliable, accurate data during the AI model training process to ensure their proper functioning.
The foundation of their argument is the concept of "model collapse," the degradation that occurs when AI models are trained on datasets generated by other AI models. The scientists claim that this process can "contaminate" the results, so that the original content of the data is gradually replaced with unrelated nonsense. Consequently, after several generations, AI models can begin to generate content that makes no sense at all.
The scientists point to generative AI tools such as large language models (LLMs), which have gained popularity and were trained mainly on human-generated data. However, as the output of these models proliferates on the internet, there is a risk that computer-generated content will be used to train other AI models, or even the very models that produced it. This process is known as a recursive loop.
Ilia Shumailov of the University of Oxford in the United Kingdom and his team used mathematical models to illustrate how AI models can experience "collapse." They showed that a model may overlook certain outputs in its training data (e.g., less common text fragments), so that each new generation ends up learning from only part of the original dataset.
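The loss of less common fragments can be illustrated with a toy simulation. The sketch below is not the researchers' method; it is a minimal illustration under stated assumptions: a hypothetical 50-token vocabulary with Zipf-like frequencies, a "model" that is nothing more than empirical token frequencies, and each generation trained only on a finite sample drawn from the previous one. All names (`next_generation`, `simulate`, `tok1` and so on) are invented for this example.

```python
import random
from collections import Counter


def next_generation(probs, sample_size, rng):
    """Draw a finite sample from the current model, then refit on that sample.

    The 'model' here is just a table of token probabilities (an assumption
    made for illustration, not an actual language model).
    """
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    sample = rng.choices(tokens, weights=weights, k=sample_size)
    counts = Counter(sample)
    # Tokens that were never sampled get no entry, so they are lost for good.
    return {t: c / sample_size for t, c in counts.items()}


def simulate(vocab_size=50, sample_size=200, generations=10, seed=0):
    """Return the number of distinct tokens the model knows at each generation."""
    rng = random.Random(seed)
    # Zipf-like starting distribution: a few common tokens, many rare ones.
    raw = [1.0 / rank for rank in range(1, vocab_size + 1)]
    total = sum(raw)
    probs = {f"tok{rank}": w / total for rank, w in enumerate(raw, start=1)}

    support_sizes = [len(probs)]
    for _ in range(generations):
        probs = next_generation(probs, sample_size, rng)
        support_sizes.append(len(probs))
    return support_sizes


if __name__ == "__main__":
    for generation, size in enumerate(simulate()):
        print(f"generation {generation}: {size} distinct tokens")
```

In this toy setting the number of distinct tokens can only stay the same or shrink, because a token that is missing from one generation's sample has zero probability in every later generation. Rare tokens are the most likely to be missed, which mirrors the effect described above.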
Collapsing AI models
The researchers also analyzed how AI models respond to a training dataset created primarily by artificial intelligence. They found that feeding a model AI-generated data degrades the learning ability of subsequent generations of models, ultimately leading to "model collapse."
All of the recursively trained language models tested by the scientists showed a tendency to repeat phrases. In one test, for example, a text about medieval architecture was used for training. By the ninth generation, the model was generating content about hares instead of architecture.
The study authors emphasize that "model collapse" is inevitable if AI training indiscriminately uses datasets generated by previous generations of models. They claim that effectively training artificial intelligence on its own output is possible, but only with careful filtering of the generated data. The scientists also note that technology companies that decide to use only human-generated content for AI training may gain a competitive advantage.