
Asked ChatGPT anything lately? Talked with a customer service chatbot? Read the results of Google's "AI Overviews" summary feature? If you've used the Internet lately, chances are, you've consumed content created by a large language model. These models, like DeepSeek-R1 or OpenAI's ChatGPT, are kind of like the predictive text feature in your phone on steroids. In order for them to "learn" how to write, the models are trained on millions of examples of human-written text. Thanks in part to these same large language models, a lot of content on the Internet today is written by generative AI. That means that AI models trained nowadays may be consuming their own synthetic content ... and suffering the consequences.
This message comes from Nature on PBS, producers of Going Wild with Dr. Rae Wynn-Grant. Back for a brand new season, Going Wild highlights champions of nature and what led them to create change within themselves and the natural world. Follow Going Wild wherever you get your podcasts.
You're listening to Short Wave from NPR. It seems like these days generative AI is everywhere. It's in my Google searches. It's suggested as a tool on TikTok. It's running customer service chats. And generative AI can take a lot of forms: it can create images or video.
But the ones that have been in the news recently, DeepSeek-R1, OpenAI's ChatGPT, Google Gemini, Apple Intelligence, all of those are large language models. A large language model is kind of like the predictive text feature in your phone, but on steroids.
Large language models are statistical beasts that learn from examples of human-written text and learn to produce text that is similar to the text the model was taught on.
That's Ilya Shumailov. He's a computer scientist, and he says in order to teach these models, scientists have to train them on a lot of human-written examples. Like, they basically make the models read the entire internet. Which works for a while. But nowadays, thanks in part to these same large language models, a lot of the content on our internet is written by generative AI.
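The "predictive text on steroids" idea can be boiled down to a toy sketch: count which word follows which in some human-written text, then generate new text by sampling from those counts. Real large language models use neural networks trained on vastly more data, but the loop of "learn the statistics of human text, then imitate them" is the same. The corpus string below is a made-up stand-in for "the entire internet".

```python
import random
from collections import defaultdict

# A made-up miniature "corpus" standing in for human-written text.
corpus = (
    "the cat sat on the mat and the dog sat on the rug "
    "and the cat saw the dog and the dog saw the cat"
)

# "Training": tally which word follows which in the human text.
words = corpus.split()
follows = defaultdict(list)
for current, nxt in zip(words, words[1:]):
    follows[current].append(nxt)

# "Generation": predictive text on repeat — sample each next word
# from the statistics learned above.
random.seed(0)
word = "the"
output = [word]
for _ in range(8):
    word = random.choice(follows[word])
    output.append(word)

print(" ".join(output))
```

Every word the sketch emits came out of the statistics of its training text, which is why flooding that training text with machine output changes what future models learn.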
If you were today to sample data from the internet randomly, I'm sure you'll find that a bigger proportion of it is generated by machines. But this is not to say that the data itself is bad. The main question is how much of this data is generated
In the spring of 2023, Ilya was a research fellow at the University of Oxford. And he and his brother were talking over lunch. They were like, OK, if the Internet is full of machine-generated content and that machine-generated content goes into future machines, what's going to happen?
Quite a lot of these models, especially back at the time, were relatively low quality. So there are errors and there are biases, systematic biases, inside of those models. And thus you can kind of imagine the case where, rather than learning useful concepts, you can actually learn things that don't exist. They are purely hallucinations.
Ilya and his team did a research study indicating that eventually, any large language model that learns from its own synthetic data would start to degrade over time, producing results that got worse and worse and worse.
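The degradation effect can be illustrated with a deliberately simple stand-in for a language model: fit a Gaussian (just a mean and a standard deviation) to some data, then train each new "generation" only on samples drawn from the previous generation's fit. Estimation error compounds across generations and the learned spread collapses, so rare "tail" events vanish. This is a minimal sketch of the idea, not the actual experiment from the study; the tiny sample size per generation is chosen to make the collapse fast.

```python
import random
import statistics

random.seed(42)
N = 10  # deliberately tiny "dataset" per generation, to speed up collapse

# Generation 0 trains on "real" human data: a standard normal distribution.
data = [random.gauss(0.0, 1.0) for _ in range(N)]

stds = []
for generation in range(200):
    # "Train" the model: estimate mean and spread from the current data.
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    stds.append(sigma)
    # The next generation sees only synthetic samples from this model.
    data = [random.gauss(mu, sigma) for _ in range(N)]

print(f"learned spread at generation 0:   {stds[0]:.4f}")
print(f"learned spread at generation 199: {stds[-1]:.6f}")
```

Each generation's estimate of the spread is noisy, and once the spread shrinks there is no real data left to pull it back up, so the "model" drifts toward producing nearly identical output: worse and worse and worse.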