I have a much better understanding of Sutton's perspective now. I wanted to reflect on it a bit.

Chapters:
(00:00:00) - The steelman
(00:02:42) - TLDR of my current thoughts
(00:03:22) - Imitation learning is continuous with and complementary to RL
(00:08:26) - Continual learning
(00:10:31) - Concluding thoughts
Boy, do you guys have a lot of thoughts about this interview. I've been thinking about it myself, and I think I have a much better understanding now of Sutton's perspective than I did during the interview itself. So I wanted to reflect on how I understand his worldview now. And Richard, apologies if there are still any errors or misunderstandings.
It's been very productive to learn from your thoughts. Okay, so here's my understanding of the steelman of Richard's position. Obviously, he wrote the famous essay, The Bitter Lesson. And what is this essay about?
Well, it's not saying that you just want to throw as much compute as you possibly can at the problem. The Bitter Lesson says that you want to come up with techniques which most effectively and scalably leverage compute. Most of the compute that's spent on an LLM is used in running it during deployment. And yet it's not learning anything during this entire period.
It's only learning during this special phase that we call training. And so this is obviously not an effective use of compute. And what's even worse is that this training period by itself is highly inefficient because these models are usually trained on the equivalent of tens of thousands of years of human experience. And what's more, during this training phase,
all of their learning is coming straight from human data. Now, this is an obvious point in the case of pre-training data, but it's even kind of true for the RLVR (RL with verifiable rewards) that we do with these LLMs. These RL environments are human-furnished playgrounds to teach LLMs the specific skills that we have prescribed for them.
The agent is in no substantial way learning from organic and self-directed engagement with the world. Having to learn only from human data, which is an inelastic and hard to scale resource, is not a scalable way to use compute.
Furthermore, what these LLMs learn from training is not a true world model, which would tell you how the environment changes in response to different actions that you take. Rather, they're building a model of what a human would say next. And this leads them to rely on human-derived concepts. A way to think about this would be, suppose you trained an LLM on all the data up to the year 1900.
That LLM probably wouldn't be able to come up with relativity from scratch. And maybe here's a more fundamental reason to think this whole paradigm will eventually be superseded. LLMs aren't capable of learning on the job, so we'll need some new architecture to enable this kind of continual learning. And once we do have this architecture, we won't need a special training phase.
The agents will just be able to learn on the fly, like all humans, and in fact, like all animals are able to do. And this new paradigm will render our current approach with LLMs and their special training phase that's super sample-inefficient totally obsolete. So that's my understanding of Rich's position.