Richard Sutton
Well, yes, I think it's really quite a different point of view.
And the two camps can easily get separated and lose the ability to talk to each other.
And yeah, large language models have become such a big thing.
Generative AI in general, a big thing.
And our field is subject to bandwagons and fashions.
So we lose track of the basic, basic things.
Because I consider reinforcement learning to be basic AI.
And what is intelligence?
The problem is to understand your world.
And reinforcement learning is about understanding your world.
Whereas large language models are about mimicking people, doing what people say you should do.
They're not about figuring out what to do.
I would disagree with most of the things you just said.
Just to mimic what people say is not really to build a model of the world at all, I don't think.
You know, you're mimicking things that have a model of the world, the people.
But I don't want to approach the question in an adversarial way.
But I would question the idea that they have a world model.
So a world model would enable you to predict what would happen.
They have the ability to predict what a person would say.
They don't have the ability to predict what will happen.
What we want, I think, to quote Alan Turing, is a machine that can learn from experience.
Right.
Where experience is the things that actually happen in your life.
You do things, you see what happens, and that's what you learn from.
The large language models learn from something else.
They learn from here's a situation and here's what a person did.
And implicitly, the suggestion is you should do what the person did.
No, I agree that it's the large language model perspective.
I don't think it's a good perspective.
Yeah, curious why.
So, to be a prior for something, there has to be a real thing.
I mean, a prior bit of knowledge should be the basis for actual knowledge.
What is actual knowledge?
There's no definition of actual knowledge in that large language framework.
What makes an action a good action to take?
You recognize the value, the need for continual learning.
Right.
So if you need to learn continually, continually means learning during normal interaction with the world.
Yeah.
And so then there must be some way during the normal interaction to tell what's right.
Yep.
Okay, so...
Is there any way for it to tell, in the large language model setup, what's the right thing to say?
You will say something and you will not get feedback about what the right thing to say is because there's no definition of what the right thing to say is.
There's no goal.
And if there's no goal, then one thing to say is as good as another.
There's no right thing to say.
So there's no ground truth.
You can't have prior knowledge if you don't have ground truth.
Because the prior knowledge is supposed to be a hint or an initial belief about what the truth is.
But there isn't any truth.
There's no right thing to say.
Now, in reinforcement learning, there is a right thing to say or a right thing to do because the right thing to do is the thing that gets you reward.
So we have a definition of what the right thing to do is.
And so we can have prior knowledge or knowledge provided by people about what the right thing to do is.
And then we can check it to see because we have a definition of what the actual right thing to do is.
Now, an even simpler case is when you're trying to make a model of the world.
When you predict what will happen, you predict and then you see what happens.
Okay, so there's ground truth.
There's no ground truth in large language models because you don't have a prediction about what will happen next.
If you say something in your conversation, the large language models have no prediction about what the person will say in response to that or what the response will be.
Oh, no, they will respond to that question, right?
But they have no prediction in the substantive sense that they won't be surprised by what happens.
And if something happens that isn't what you might say they predicted, they will not change because an unexpected thing has happened.
And to learn that, they'd have to make an adjustment.
I'm just saying they don't have, in any meaningful sense, they don't have a prediction of what will happen next.
They will not be surprised by what happens next.
They won't make any changes based on what happens.
What they predict is not what the world will give them in response to what they do.
Let's go back to their lack of goal.
For me, having a goal is the essence of intelligence.
Right.
Something is intelligent if it can achieve goals.
I like John McCarthy's definition that intelligence is the computational part of the ability to achieve goals.
Yeah.
So you have to have goals.
Otherwise, you're just a behaving system.
You're not anything special.
You're not intelligent.
Right.
And you agree that large language models don't have goals.
I think they have a goal.
What's the goal?
Next token prediction.
That's not a goal.
It doesn't change the world.
You know, tokens come at you, and if you predict them, you don't influence them.
Yeah, it's not a goal.
It's not a substantive goal.
You can't look at a system and say, oh, it has a goal if it's just sitting there predicting and being happy with itself that it's predicting accurately.
Well, the math problems are different.
Making a model of the physical world and carrying out the consequences of mathematical assumptions or operations, those are very different things.
The empirical world has to be learned.
You have to learn the consequences.
Whereas the math is more just computational.
It's more like standard planning.
So there they can have a goal to find the proof.
And they are in some way given that goal to find the proof.
It's an interesting question whether large language models are a case of the bitter lesson.
Because they are clearly a way of using massive computation, things that will scale with computation up to the limits of the internet.
But they're also a way of putting in lots of knowledge.
And so this is an interesting question.
It's a sociological or industry question.
Will they reach the limits of the data and be superseded by things that can get more data just from experience rather than from people?
In some ways, it's a classic case of the bitter lesson.
The more human knowledge we put into the large language models, the better they can do.
And so it feels good.
And yet, one, well, I in particular expect there to be systems that can learn from experience, which could well perform much, much better and be much more scalable, in which case it will be another instance of the bitter lesson that the things that used human knowledge were eventually superseded by things that just trained from experience and computation.
Well, in every case of the bitter lesson, you know, you could start with human knowledge.
Right.
And then do the scalable things.
Yeah.
That's always the case.
And there's never any reason why that has to be bad.
Right.
But in fact, and in practice, it has always turned out to be bad.
Because people get locked into the human knowledge approach, and psychologically... you know, now I'm speculating about why it is, but this is what has always happened.
Yeah.
Yeah, their lunch gets eaten by the methods that are truly scalable.
Yeah, give me a sense of what the scalable method is.
The scalable method is you learn from experience.
You try things, you see what works.
No one has to tell you.
First of all, you have a goal.
So without a goal, there's no sense of right or wrong or better or worse.
So large language models are trying to get by without having a goal or a sense of better or worse.
That's just, you know, it's exactly starting in the wrong place.
How old are these kids?
It's surprising.
You can have such a different point of view.
When I see kids, I see kids just trying things and waving their hands around and moving their eyes around.
And no one tells them... There's no imitation for how they move their eyes around or even the sounds they make.
They may want to create the same sounds, but the actions, the thing that the...
The large language model is learning from training data.
It's not learning from experience.
It's learning from something that will never be available during its normal life.
There's never any training data that says you should do this action in normal life.
Okay, I shouldn't have said never.
But I don't know.
I think I would even say it about school.
But formal schooling is the exception.
Don't be difficult.
I mean, this is obvious.
So I don't think learning is really about training.
I think learning is about learning.
It's about an active process.
The child tries things and sees what happens.
Right.
Yeah, it does not.
We don't think about training when we think of an infant growing up.
These things are actually rather well understood.
If you go and look at how psychologists think about learning, there's nothing like imitation.
Maybe there are some extreme cases where humans might do that or appear to do that, but there is no basic animal learning process called imitation.
The basic animal learning process is for prediction and for trial and error control.
I mean, it's really interesting how sometimes the hardest things to see are the obvious ones.
It's obvious if you just look at animals and how they learn, and you look at psychology and our theories of them, it's obvious that supervised learning is not part of the way animals learn.
We don't have examples of desired behavior.
What we have is examples of things that happened, one thing that followed another.
And we have examples of we did something and there were consequences.
But there are no examples of supervised learning.
Supervised learning is not something that happens in nature.
And, you know...
Even if that were the case for school, we should forget about it, because that's some special thing that happens in people.
It doesn't happen broadly in nature.
Squirrels don't go to school.
Squirrels can learn all about the world.
It's absolutely obvious, I would say, that supervised learning doesn't happen in animals.
Why are you trying to distinguish humans?
Humans are animals.
What we have in common is more interesting.
What we have, what distinguishes us, we should be paying less attention to.
So I like the way you consider that obvious because I consider the opposite obvious.
Yeah, I think we have to understand how we are animals.
And if we understood a squirrel, I think we'd be almost all the way there to understanding human intelligence.
The language part is just a small veneer on the surface.
Okay, so this is great.
You know, we're finding out the very different ways that we're thinking.
We're not arguing.
We're trying to share our different ways of thinking with each other.
No, I think about it the same way.
But still, it's a small thing on top of basic trial and error learning, prediction learning.
And that's what distinguishes us, perhaps, from many animals.
But we're an animal first.
And we were an animal before we had language and all those other things.
Morphics.
Let's lay out a little bit about what it is.
It says that experience, sensation, action, reward, and then this happening on and on and on, makes up your life.
It says that this is the foundation and the focus of intelligence.
Intelligence is about taking that stream and altering the actions to increase the rewards in the stream.
Right.
So learning, then, is from the stream, and learning is about the stream.
So that second part is particularly telling.
What you learn, your knowledge, your knowledge is about the stream.
Your knowledge is about if you do some action, what will happen?
Or it's about which events will follow other events.
It's about the stream.
The content of the knowledge is statements about the stream.
And so because it's a statement about the stream, you can test it by comparing it to the stream and you can learn it continually.
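To make that concrete, here is a minimal sketch (my own illustration, nothing from the conversation) of the stream being described: sensation, action, reward, repeating on and on, with knowledge being whatever the agent can state and test about that stream. The Environment and Agent classes and all their names are hypothetical placeholders.

```python
# A minimal sketch of the experience stream: sensation, action, reward, on and on.
# Everything here is a placeholder for illustration.
import random

class Environment:
    def step(self, action):
        # The world responds to the action with a new sensation and a reward.
        sensation = random.random()
        reward = 1.0 if action == "explore" else 0.0
        return sensation, reward

class Agent:
    def act(self, sensation):
        # Placeholder policy; a real agent would adapt this to increase reward.
        return random.choice(["explore", "exploit"])

    def learn(self, sensation, action, reward, next_sensation):
        # Knowledge is statements about the stream, tested against the stream.
        pass

env, agent = Environment(), Agent()
sensation = 0.0
for _ in range(10):  # the stream: sensation, action, reward, on and on
    action = agent.act(sensation)
    next_sensation, reward = env.step(action)
    agent.learn(sensation, action, reward, next_sensation)
    sensation = next_sensation
```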
So when you're imagining this future continual learning agent.
They're not future.
Of course, they exist all the time.
This is what the reinforcement learning paradigm is: learning from experience.
The reward function is arbitrary.
And so if you're playing chess, it's to win the game of chess.
If you're a squirrel, maybe the reward has to do with getting nuts.
Right.
In general, for an animal, you would say the reward is to avoid pain and to acquire pleasure.
Right.
And I think there should also be a component having to do with your increasing understanding of your environment.
That would be sort of an intrinsic motivation.
I don't like the word model when used the way you just did.
I think a better word would be the network.
So I think you mean the network.
Maybe there's many networks.
So anyway, things would be learned and then you'd have copies and many instances.
And sure, you'd want to share knowledge across all the instances.
And there would be lots of possibilities for doing that.
In a way there is not today: one child can't grow up and learn about the world and then hand that over; every new child has to repeat that process.
Whereas with AIs, with the digital intelligence, you could hope to do it once and then copy it into the next one as a starting place.
So this would be a huge savings and I think actually it would be much more important than trying to learn from people.
So this is something we know very well, and the basis of it is temporal difference learning, where the same thing happens on a less grandiose scale, like when you learn to play chess.
The long-term goal is winning the game, and yet you want to be able to learn from shorter-term things, like taking your opponent's pieces.
And so you do that by having a value function, which predicts the long-term outcome.
And then if you take the guy's pieces, well, your prediction about the long-term outcome is changed.
It goes up.
You think you're going to win.
And then that change, that increase in your belief, immediately, quote, reinforces the move that led to taking the piece. Okay, so we have this long-term, ten-year goal of making a startup and making a lot of money, and so when we make progress, we say, oh, I'm more likely to achieve the long-term goal, and that rewards the steps along the way.
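As a concrete illustration of the idea just described, here is a minimal tabular TD(0) sketch (my own hedged example; the tiny chess-like chain, its transition probabilities, and the constants are invented). The value function predicts the long-term outcome, and the change in that prediction, the TD error, is what "reinforces" the step that produced it.

```python
# Minimal tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
import random

V = {"start": 0.0, "took_piece": 0.0, "win": 0.0}  # predicted long-term outcome per state
alpha, gamma = 0.1, 0.9                            # step size and discount

def step(state):
    # Hypothetical dynamics: taking a piece usually leads to winning (reward 1).
    if state == "start":
        return "took_piece", 0.0
    return ("win", 1.0) if random.random() < 0.8 else ("start", 0.0)

for episode in range(2000):
    s = "start"
    while s != "win":
        s_next, r = step(s)
        td_error = r + gamma * V[s_next] - V[s]  # did the predicted outcome go up or down?
        V[s] += alpha * td_error                 # a positive error immediately credits this step
        s = s_next

print(V)  # V["took_piece"] settles above V["start"]: taking the piece raised the prediction
```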
I think the crux of this, and I'm not sure, but...
The big world hypothesis seems very relevant, and the reason why humans become useful on the job is because they are encountering their particular part of the world, which can't have been anticipated and can't all have been put in in advance.
The world is so huge that you can't... The dream, as I see it, the dream of large language models is you can teach the agent everything and it will know everything and it won't have to learn anything online.
Right, during its life. Okay. And your examples are all, well, really saying that there's a lot you can teach it, but there are all the little idiosyncrasies of the particular life they're leading and the particular people they're working with and what they like, as opposed to what average people like. And so that's just saying the world is really big, and so you're going to have to learn it along the way.
And I'm- So I would say you're just doing regular learning.
Maybe using context, because in large language models, all that information has to go into the context window.
Right.
But in a continual learning setup, it just goes into the weights.
Maybe, yeah, so maybe context is the wrong word to use, because I mean a more general thing.
You learn a policy that's specific to the environment that you're finding yourself in.
So maybe the question we're trying to ask is: it seems like the reward is too small a thing to do all the learning that we need to do.
But, of course, we have the sensations, right?
We have all the other information we can learn from.
Right.
We don't just learn from the reward.
We learn from all the data.
So now I want to talk about the common model of the agent, with its four parts.
Right.
So we need a policy.
The policy says...
In the situation I'm in, what should I do?
We need a value function.
The value function is the thing that is learned with TD learning.
And the value function produces a number.
The number says, how well is it going?
And then you watch if that's going up and down and use that to adjust your policy.
Okay, so those two things.
And then there's also the perception component, which is the construction of your state representation, your sense of where you are now.
And the fourth one is what we're really getting at, most transparently anyway.
The fourth one is the transition model of the world.
That's why I am uncomfortable just calling everything models, because I want to talk about the model of the world.
The transition model of the world: your belief that if you do this, what will happen, what will be the consequences of what you do. So, your physics of the world. But it's not just physics, it's also abstract models, like, you know, your model of how you traveled from California up to Edmonton for this podcast. That was a model, and that's a transition model, and that would be learned. And it's not learned from reward; it's learned from... you did things, you saw what happened, and you made that model of the world.
That will be learned very richly from all the sensation that you receive, not just from the reward.
It has to include the reward as well, but that's a small part of the whole model, small crucial part of the whole model.
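A minimal skeleton of the four-part common model of the agent described above, policy, value function, perception, and transition model, might look like the following. This is my own illustrative sketch, not Sutton's code; every name in it is a placeholder.

```python
# A minimal skeleton of the four-part common model of the agent.
class Agent:
    def __init__(self):
        self.policy = {}      # state -> action: in the situation I'm in, what should I do?
        self.values = {}      # state -> number: how well is it going?
        self.transition = {}  # (state, action) -> predicted next state: if I do this, what happens?

    def perceive(self, observation, last_state):
        # Perception: construct the state representation, the sense of where you are now.
        # (A real agent would combine the new observation with its history; this is trivial.)
        return observation

    def act(self, state):
        # Policy: pick an action for the current state.
        return self.policy.get(state, "default_action")

    def evaluate(self, state):
        # Value function: the number that says how well it's going (learned with TD).
        return self.values.get(state, 0.0)

    def predict(self, state, action):
        # Transition model: belief about the consequences of what you do.
        return self.transition.get((state, action))

    def update_model(self, state, action, next_state):
        # Learned not from reward, but from doing things and seeing what happened.
        self.transition[(state, action)] = next_state
```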
The idea is totally general.
I do use, all the time, as my canonical example, the idea that an AI agent is like a person.
And people, in some sense, they have just one world they live in.
And that world may involve chess and it may involve Atari games.
But those are not a different task or a different world.
Those are different states that they encounter.
Right.
And so the general idea is not limited at all.
They just set it up.
It was not their ambition to have one agent across those games.
If we want to talk about transfer, we should talk about transfer, not across games or across tasks, but transfer between states.
We're not seeing transfer anywhere.
We're not seeing general... Critical to good performance is that you can generalize well from one state to another state.
We don't have any methods that are good at that.
What we have is people trying different things, and they settle on a representation that transfers well or generalizes well.
But we have very few automated techniques to promote transfer.
And none of them are used in modern deep learning.
The researchers did it.
Because there's no other explanation.
Gradient descent will not make you generalize well.
It will make you solve the problem.
It will not make you generalize in a good way on new data.
Generalization means train on one thing that affects what you do on the other things.
So we know deep learning is really bad at this.
For example, we know that if you train on some new thing, it will often catastrophically interfere with all the old things that you knew.
So this is exactly bad generalization.
Right.
Generalization, as I said, is some kind of influence of training on one state on other states.
And generalization is not necessarily good or bad.
Just the fact that you generalize is not necessarily good or bad.
You can generalize poorly, you can generalize well.
So generalization always will happen, but we need algorithms that will cause the generalization to be good rather than bad.
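The interference point above can be demonstrated in a few lines. The following is a small NumPy sketch of my own (the network size, target function, and learning settings are arbitrary choices): a tiny network is fit on one region of a function, then trained only on a second region, and its error on the first region typically gets much worse.

```python
# Catastrophic interference: training on a new region degrades the old one.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.5, (1, 32)), np.zeros(32)
W2, b2 = rng.normal(0, 0.5, (32, 1)), np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)
    return h, h @ W2 + b2

def train(x, y, steps=4000, lr=0.05):
    global W1, b1, W2, b2
    for _ in range(steps):
        h, pred = forward(x)
        err = (pred - y) / len(x)            # gradient of 0.5 * mean squared error
        gW2, gb2 = h.T @ err, err.sum(0)
        dh = (err @ W2.T) * (1 - h ** 2)
        gW1, gb1 = x.T @ dh, dh.sum(0)
        W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

def error(x, y):
    return float(((forward(x)[1] - y) ** 2).mean())

xA = np.linspace(-2, 0, 100)[:, None]; yA = np.sin(3 * xA)   # the "old" region
xB = np.linspace(0, 2, 100)[:, None];  yB = np.sin(3 * xB)   # the "new" region

train(xA, yA)
print("error on old region after learning it:", error(xA, yA))
train(xB, yB)                                                 # train only on the new region
print("error on old region after learning the new one:", error(xA, yA))
```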
Well, large language models are so complex.
We don't really know what information they had prior.
We have to guess because they've been fed so much.
This is one reason why they're not a good way to do science.
It's just so uncontrolled, so unknown.
But if you come up with an entirely new... They're getting a bunch of things right, perhaps.
And so the question is why?
Well, it may be that they don't need to generalize to get them right because the only way to get some of them right is to form something which gets all of them right.
Right.
So if there's only one answer and you find it, that's not called generalization.
It's the only way to solve it, and so they find the only way to solve it.
Generalization is when it could be this way, it could be that way, and they do it the good way.
Well, there's nothing in them which will cause it to generalize well.
Gradient descent will cause them to find a solution to the problems they've seen.
And if there's only one way to solve them, they'll do that.
But there may be many ways to solve it, some of which generalize well, some of which generalize poorly.
There's nothing in them, in the algorithms, that will cause them to generalize well.
But people, of course, are involved.
And if it's not working out, they fiddle with it.
Right.
until they find a way, perhaps until they find a way which it generalizes well.
Okay.
So, yeah, I...
thought a little bit about this.
There are many things, or a handful of things.
First, the large language models are surprising.
It's surprising how effective artificial neural networks are at language tasks.
That was a surprise.
It wasn't expected.
Language seemed different.
So that's impressive.
There's a long-standing controversy in AI between the simple, basic-principle methods, the general-purpose methods like search and learning, and the systems imbued with human knowledge, like the symbolic methods.
So in the old days, it was interesting because things like search and learning were called weak methods because they just use general principles.
They're not using the power that comes from imbuing a system with human knowledge.
So those were called strong and weak.
And so I think the weak methods have just totally won.
That's the biggest question from the old days of AI.
What would happen?
Learning and search have just won the day.
But there's a sense in which that was not surprising to me, because I was always voting for, or hoping for, or rooting for, the simple basic principles.
And so even with the large language models, it's surprising how well it worked, but it was all good and gratifying.
And things like AlphaGo, it's sort of surprising how well that was able to work.
And AlphaZero in particular, how well it was able to work.
But it's all very gratifying because, again, it's simple basic principles are winning the day.
So the whole AlphaGo thing has a precursor, which is TD Gammon.
Gerry Tesauro did exactly that: used reinforcement learning, temporal difference learning methods, to play backgammon.
And it beat the world's best players.
And it worked really well.
And so in some sense, AlphaGo was merely a scaling up of that process.
It was quite a bit of scaling up, and there was also an additional innovation in how the search was done.
But it made sense.
It wasn't surprising in that sense.
AlphaGo actually didn't use TD Learning.
It waited to see the final outcomes.
But AlphaZero used TD and AlphaZero was applied to all the other games and did extremely well.
I've always been very impressed by the way AlphaZero plays chess, because I'm a chess player, and it just sacrifices material for sort of positional advantages.
And it's just content and patient to sacrifice that material for a long period of time.
And so that was surprising that it worked so well, but also gratifying and fitting into my worldview.
Yeah.
So this has led me where I am.
Where I am is, I'm in some sense a contrarian, thinking differently from the way the field is thinking.
And I am personally just kind of content being out of sync with my field for a long period of time, perhaps decades, because occasionally I have been proved right in the past.
And the other thing I do to help me not feel I'm out of sync and thinking in a strange way is to look not at my local environment or my local field, but to look back in time and into history and to see what people have thought classically about the mind in many different fields.
And I don't feel I'm out of sync with the larger traditions.
I really view myself as a classicist rather than as a contrarian.
I go to what the larger community of thinkers about the mind have always thought.
You want to presume that it's been done.
Well, but you're using it to get AGI again.
So these AGIs, if they're not superhuman already, then the knowledge that they might impart would be not superhuman.
I'm not sure your idea makes sense because it seems to presume the existence of AGI.
And then we've already worked that out.
And the way AlphaZero was an improvement was it did not use the human knowledge, but just went from experience.
Right.
So why do you say bring in other agents' expertise to teach it when it's worked so well from experience and not by help from another agent?
Right.
I think more interesting is just think about that case.
Which when you have many AIs, will they help each other?
the way cultural evolution works in people.
Let's just, maybe we should talk about that.
The bitter lesson? Oh, who cares about that? That's an empirical observation about a particular period in history, seventy years, no longer; it doesn't necessarily have to apply to the next seventy years.
So the interesting question is, you're an AI, you get some more computer power.
Should you use it to make yourself more computationally capable?
Or should you use it to spawn off a copy of yourself to go learn something interesting on the other side of the planet or on some other topic and then report back to you?
Yep.
I think that's a really interesting question that will only arise in the age of digital intelligences.
I'm not sure what the answer is, but I think it will... More questions.
Will it be possible to really spawn it off, send it out, learn something new, something perhaps very new, and then will it be able to be reincorporated into the original?
Or will it have changed so much that...
It can't really be done.
Is that possible or is it not?
And you can carry this to its limit; I saw one of your videos the other night that suggested you could, where you spawn off many, many copies, doing different things, highly decentralized, but reporting back to the central master.
And that this will be such a powerful thing.
Well, I think one thing that, so this is my attempt to add something to this view, is that a big question, a big issue will become corruption.
You know, if you really could just get information from anywhere and bring it into your central mind, you could become more and more powerful.
And it's all digital and they all speak some internal digital language.
Maybe it'll be easy and possible, but it will not be as easy as you're imagining, because you can lose your mind this way.
If you pull in something from the outside and build it into your inner thinking, it could take over you.
It could change you.
It could be your destruction rather than your increment in knowledge.
I think this will become a big concern, particularly when, oh, this one has figured out all about how to play some new game, or has studied Indonesia, and you want to incorporate that into your mind.
Yeah.
So you think, oh, just read it all in.
And that'll be fine.
But no, you've just read a whole bunch of bits into your mind.
And they could have viruses in them.
They could have hidden goals.
They can warp you and change you.
And this will become a big thing.
How do you have cybersecurity in the age of digital spawning and reforming again?
Yeah, so I do think succession to digital or digital intelligence or augmented humans is inevitable.
So the argument... I have a four-part argument.
Step one is, there's no government or organization that gives humanity a unified point of view, one that dominates and can arrange things.
There's no consensus about how the world should be run.
And number two, we will figure out how intelligence works.
Researchers will figure it out eventually.
And number three, we won't stop just with human-level intelligence; we will reach superintelligence.
And number four is that it's inevitable over time that the most intelligent things around would gain resources and power.
And so, put all that together, and it's sort of inevitable that you're going to have succession to AI or to AI-enabled augmented humans.
Those four things seem clear and sure to happen.
But within that set of possibilities, there can be good outcomes as well as less good outcomes, bad outcomes.
So I'm just trying to be realistic about where we are and ask how we should feel about it.
Right.
And so then I do encourage people to think positively about it, first of all, because it's something we humans have always tried to do, for thousands of years: tried to understand ourselves, tried to make ourselves think better.
And, you know, just to understand ourselves.
So this is a great success of science, of the humanities.
We're finding out what this essential part of humanness is, what it means to be intelligent.
And then what I usually say is that this is all kind of human-centric.
What if you step aside from being a human and just take the point of view of the universe?
And this is, I think, a major stage in the universe, a major transition: a transition from replicators... humans and animals and plants, we're all replicators, and that gives us some strengths and some limitations... to the age of design. Our AIs are designed, all of our physical objects are designed, our buildings are designed, our technology is designed, and we're now designing AIs, things that can be intelligent themselves and that are themselves capable of design. And so this is a key step in the world and in the universe, I think.
So it's the transition from a world in which most of the interesting things are replicated.
Replicated means you can make copies of them, but you don't really understand them.
Like right now we can make more intelligent beings, more children, but we don't really understand how intelligence works.
Right.
Whereas we're now reaching the point of having designed intelligence, intelligence where we do understand how it works, and therefore we can change it in different ways and at different speeds than otherwise.
And in our future, they might not be replicated at all.
We may just design AIs, and those AIs will design other AIs, and everything will be done by design and construction rather than by replication.
Yeah, I mark this as one of the four great stages of the universe.
First there's dust, and then stars, and the stars make planets, and the planets give rise to life, and now life is giving rise to designed entities.
And so I think we should be proud that we are giving rise to this great transition in the universe.
Yeah, so it's an interesting thing.
Should we consider them part of humanity or different from humanity?
It's our choice.
It's our choice whether we say, oh, they are our offspring and we should be proud of them and we should celebrate their achievements.
Or we could say, oh, no, they're not us and we should be horrified.
It's interesting that it feels to me like a choice, and yet it's such a strongly held thing that, how can it be a choice?
I like these sort of contradictory implications of thought.
So are you thinking, like, maybe we are like the Neanderthals who gave rise to Homo sapiens?
Maybe Homo sapiens will give rise to a new group of people.
Well, I think it's relevant to point out that for most of humanity, they don't have much influence on what happens.
Most of humanity doesn't influence who can control the atom bomb or who controls the nation states.
Even as a citizen, I often feel that we don't control the nation states very much.
They're out of control.
A lot of it has to do with just how you feel about change.
And if you think the current situation is really, really good, then you're more likely to be suspicious of change and averse to change than if you think...
it's imperfect.
And I think it's imperfect.
In fact, I think it's pretty bad.
So I'm open to change.
And I think humanity has had a super good track record.
And maybe it's the best thing that there's been, but it's far from perfect.
We should be concerned about our future, the future.
We should try to make it good.
We also, though, should recognize the limits, our limits.
And I think we want to avoid the feeling of entitlement, avoid the feeling, oh, we're here first.
We should always have it in a good way.
How should we think about the future and how much control a particular species on a particular planet should have over it?
And how much control do we have?
You know, a counterbalance to our limited control over the long-term future of humanity should be how much control we do have over our own lives.
Like we have our own goals and we have our families and those things are much more controllable than like trying to control the whole universe.
So I think it's appropriate for us to really work towards our own local goals.
And it's kind of aggressive for us to say, oh, the future has to evolve this way that I want it to.
Sure.
Because then we'll have arguments.
Different people think the future, the global future should evolve in different ways.
And they have conflict.
So you're saying we're trying to design the future and the principles by which it will evolve and come into being.
Right.
And so the first thing you're saying is, well, we try to teach our children general principles which will make the better evolutions more likely.
Maybe we should also seek for things being voluntary.
If there is change, we want it to be voluntary rather than imposed on people.
I think that's a very important point.
And yeah, that's all good.
I think this is, you know, one of the really big human enterprises: to design society.
And that's been ongoing for thousands of years again.
And so it's like, the more things change, the more they stay the same.
We still have to figure out how to be.
The children will still come up with different values that seem strange to their parents and their grandparents and things will evolve.
Okay.
Thank you very much.