Sergey Levine
Yeah.
So physical intelligence aims to build robotic foundation models.
And that basically means general purpose models that could, in principle, control any robot to perform any task.
We care about this because we see this as a very fundamental aspect of the AI problem.
Like the robot is essentially encompassing all AI technology.
So if you can get a robot that's truly general, then you can do hopefully a large chunk of what people can do.
And where we're at right now is I think we've kind of gotten to the point where we've built out a lot of the basics.
And, you know, I think those basics actually are pretty cool.
Like they work pretty well.
We can get a robot that will like fold laundry and that will go into a new home and like try to clean up the kitchen.
But in my mind, what we're doing at Physical Intelligence right now is really the very, very early beginning.
It's just like putting in place the basic building blocks on top of which we can then tackle all these like really tough problems.
So there are a few things that we need to get right.
I mean, dexterity obviously is one of them.
And in the beginning, we really wanted to make sure that
we understand whether the methods that we're developing have the ability to tackle like the kind of intricate tasks that people can do.
As you mentioned, like folding a box, folding different articles of laundry, cleaning up a table, making a coffee, that sort of thing.
And that's like, that's good.
Like that works.
I think that the results we've been able to show are pretty cool.
But again, the end goal of this is not to fold a nice t-shirt.
The end goal is to just confirm our initial hypothesis that the basics are kind of solid.
But from there, there are a number of really major challenges.
And I think that sometimes when results get abstracted to the level of a three-minute video, someone can look at this video and it's like, oh, that's cool.
That's what they're doing.
But it's not; it's a very simple and basic version of what I think is to come.
Like what you really want from a robot is not to tell it like, hey, please fold my T-shirt.
What you want from a robot is to tell it, like, hey, robot:
You're now doing all sorts of home tasks for me.
I like to have dinner made at 6 p.m.
I wake up and go to work at 7 a.m.
I'd like, you know, I like to do my laundry on Saturday.
So make sure that's ready.
This and this and this.
And by the way, check in with me like every Monday to see like, you know, what I want you to pick up when you do the shopping.
Right.
Like that's the prompt.
And then the robot should go and do this for like, you know, six months, a year.
Like that's the duration of the task.
So it's...
Ultimately, if this stuff is successful, it should be a lot bigger.
And it should have that ability to learn continuously.
It should have the
understanding of the physical world, the common sense, the ability to go in and pull in more information if it needs it.
Like, if I ask you, like, hey, tonight, like, you know, can you make me this type of salad?
Okay, you should, like, figure out what that entails, like, look it up, go and buy the ingredients.
So there's a lot that goes into this.
It requires common sense.
It requires understanding that there are certain edge cases you need to handle intelligently, cases where you need to think harder.
It requires the ability to improve continuously.
It requires understanding safety, being reliable at the right time, being able to fix your mistakes when you do make those mistakes.
So there's a lot more that goes into this.
But the principles there are you need to leverage prior knowledge and you need to have the right representations.
I think it's something where –
It's not going to be a case where we develop everything in the laboratory and then it's done, and then come 2030-something, you get a robot in a box.
I think it'll be the same as what we've seen with AI assistants, that once we reach some basic level of competence where the robot is delivering something useful, it'll go out there in the world.
The cool thing is that once it's out there in the world, it can collect experience and leverage that experience to get better.
To me, like what I tend to think about a lot in terms of timelines is not the date when it will be done but the date when it will – when like the flywheel starts basically.
So when does the flywheel start?
I think that could be very soon.
And I think there are some decisions to be made.
Like the tradeoff there is the more narrow you scope the thing, the earlier you can get it out into the real world.
But soon as in, like, this is something we're already exploring.
We're already trying to figure out like what are like the real things this thing could do that could allow us to start spinning the flywheel.
But I think in terms of like stuff that you would actually care about that you would want to see.
So I don't know.
But I think that single digit years is very realistic.
I'm really hoping it'll be more like one or two before something is like actually out there.
But it's hard to say.
It means that there is a robot that does a thing that you actually care about, that you want done.
And it does so competently enough to like actually do it for real, for real people that want it done.
Well, I think it's actually...
I think it's actually very close to working.
And I am 100% certain that many organizations are working on exactly this.
In fact, arguably, there is already a flywheel, in the sense that, not an automated flywheel, but a human-in-the-loop flywheel, where everybody who's deploying an LLM is, of course, going to look at what it's doing.
And they're going to use that to then modify its behavior.
It's complex because it comes back to this question of representations and figuring out the right way to derive supervision signals and ground those supervision signals in the behavior of the system so that it actually improves on what you want.
And I don't think that's like a profoundly impossible problem.
It's just something where the details get like pretty gnarly.
and challenges with algorithms and stability become pretty complex.
So it's just, it's something that's taken a while for the community collectively to get their hands around.
Yeah, I don't think there's a profound reason why robotics is that different, but there are a few small differences that I think make things a little bit more manageable.
Especially if you have a robot that's doing something in cooperation with people, whether it's a person that's supervising it or directing it, there are very natural sources of supervision.
There's a big incentive for the person to provide the assistance that will make things succeed.
There are a lot of dynamics where you can make mistakes and recover from those mistakes and then reflect back on what happened and avoid that mistake in the future.
And I think that when you're doing physical things in the real world, that kind of stuff just happens more often than it does if you're like,
an AI assistant answering a question.
Like, if you answer a question and you just, like, answered it wrong, it's like, well, it's not like you can just, like, go back and, like, tweak a few things.
Like, the person you told the answer to might not even know that it's wrong.
Whereas if you're, like, folding the T-shirt and you messed up a little bit, like, yeah, it's pretty obvious.
Like, you can reflect on that, figure out what happened, and do it better next time.
Well, I think it's actually not that different than what we've seen with LLMs in some ways, that it's a matter of scope.
Like if you think about coding assistants, right?
Like initially, the best tools for coding, they could do like a little bit of completion.
Like you give them a function signature and they'll like try their best to type out like the whole function and they'll maybe like get half of it right.
And as that stuff progresses, then you're willing to give these things a lot more agency.
So that like the very best coding systems now, like if you're doing something relatively formulaic, maybe it can like put together most of a PR for you for something, you know, fairly accessible.
So I think it'll be the same thing.
That we'll see an increase in the scope that we're willing to give to the robots as they get better and better.
Where initially the scope might be like there is a particular thing you do, like
you're making the coffee or something.
Whereas as they get more capable, as their ability to have common sense and a broader repertoire of tasks increases, then we'll give them greater scope.
Now you're running the whole coffee shop.
I mean my sense there too is that this is probably a single-digit thing rather than a double-digit thing.
But the reason it's so hard to really pin down is because as with all research, it does depend on figuring out a few question marks.
And I think my answer in terms of the nature of those question marks is I don't think these are things that require –
profoundly, deeply different ideas, but it does require the right synthesis of the kinds of things that we already know.
And, you know, sometimes synthesis, to be clear, is just as difficult as coming up with like profoundly new stuff, right?
So I think it's intellectually a very deep and profound problem and figuring that out is going to be like very exciting.
But it
I think we kind of like know like roughly the puzzle pieces and it's something that we need to work on.
And I think if we work on it and we're a bit lucky and everything kind of goes as planned, I think single digit is reasonable.
So I think there's a nuance here.
And the nuance is it becomes more obvious if we consider the analogy to the coding assistants, right?
It's not like the nature of coding assistants today is that there's a switch that flips and suddenly, instead of writing software, all software engineers get fired and everyone's using LLMs for everything.
And that actually makes a lot of sense that the biggest gain in productivity comes from experts, which is software engineers, whose productivity is now augmented by these really powerful tools.
It's a very subtle question.
I think what it probably will come down to is this question of scope.
The reason that LLMs aren't doing all software engineering is because they're good within a certain scope, but there's limits to that.
Those limits are increasing, to be clear, every year.
I think that there's no reason that we wouldn't see the same kind of thing with robots.
The scope will have to start out small because there will be certain things that these systems can do very well and certain other things where more human oversight is really important.
And the scope will grow and what that will translate into is increased productivity.
And some of that productivity will come from
like the robots themselves being valuable, and some of it will come from the people using the robots are now more productive in their work.
That's a very hard question to answer.
I think...
I'm probably not prepared to tell you what percentage of all labor work can be done by robots because I don't think right now off the cuff I have a sufficient understanding of what's involved in that big of a cross-section of all physical labor.
I think what I can tell you is this, that I think it's much easier to get effective systems rolled out gradually in a human-in-the-loop setup.
And again, I think this is exactly what we've seen with coding systems.
And I think we'll see the same thing with automation where
Basically, robot plus human is much better than just human or just robot.
And that just like makes total sense.
It also makes it much easier to get all the technology bootstrapped because when it's robot plus human, now there's a lot more potential for the robot to like actually learn on the job, acquire new skills.
It's just like, you know.
And also because the human can help.
The human can give hints.
You know, let me tell you this story.
When we were working on the π0.5 project, this was the paper that we released last April, we initially controlled our robots with teleoperation in a variety of different settings.
And then at some point, we actually realized that we can actually make significant headway once the model was good enough by supervising it, not just with low-level actions, but actually literally instructing it through language.
Now, you need a certain level of competence before you can do that, but once you have that level of competence, just standing there and telling the robot, okay, now pick up the cup, put the cup in the sink, put the dish in the sink, just with words, already actually gives the robot information that it can use to get better.
Now, imagine what this implies for the human plus robot dynamic.
Like now, basically learning is not – for these systems, it's not just learning from real action.
It's also learning from words, eventually learning from observing what people do, from the kind of natural feedback that you receive when you're doing a job together with somebody else.
And –
This is also the kind of stuff where the prior knowledge that comes from these big models is tremendously valuable because that lets you understand that interaction dynamic.
So I think that there's a lot of potential for these kind of human plus robot deployments to make the model better.
Yeah, that's a really good question.
So one of the big things that is different now than it was in 2009 actually has to do with
the technology for machine learning systems that understand the world around them.
Principally, for autonomous driving, this is perception.
For robots, it can mean a few other things as well.
And perception certainly was not in a good place in 2009.
The trouble with perception is that it's one of those things where you can nail a really good demo with a somewhat engineered system, but
hit a brick wall when you try to generalize it.
Now at this point in 2025, we have much better technology for generalizable and robust perception systems and more generally generalizable and robust systems for understanding the world around us.
Like when you say that the system is scalable, in machine learning, scalable really means generalizable.
So that gives us a much better starting point today.
So that's not an argument about robotics being easier than autonomous driving.
It's just an argument for 2025 being a better year than 2009.
But there's also other things about robotics that are a bit different than driving.
Like in some ways, robotic manipulation is a much, much harder problem.
But in other ways, it's a problem space where it's easier to get rolling, to start that flywheel with a more limited scope.
So to give you an example,
If you're learning how to drive, you would probably be pretty crazy to learn how to drive on your own without somebody helping you.
Like, you would not trust your teenage child to learn to drive just on their own, just drop them in the car and say, like, go for it.
And that's like a, you know, a 16-year-old who's had a significant amount of time to learn about the world.
You would never even dream of putting a 5-year-old in a car and telling him to get started.
But if you want somebody to, like, clean the dishes – like, dishes can break too, but you would probably be okay with a child trying to do the dishes –
without somebody constantly, like, you know, sitting next to them with a brake, so to speak.
For a lot of tasks that we want to do with robotic manipulation, there's potential to make mistakes and correct those mistakes.
And when you make a mistake and correct it, well, first you've achieved the task because you've corrected, but you've also gained knowledge that allows you to avoid that mistake in the future.
With driving, because of the dynamics of how it's set up, it's very hard to make a mistake, correct it, and then learn from it because the mistakes themselves have significant ramifications.
Now, not all manipulation tasks are like that.
There are truly some, like, very safety-critical stuff.
And this is where the next thing comes in, which is common sense.
Common sense, meaning the ability to make inferences about what might happen that are reasonable guesses, but that do not require you to experience that mistake and learn from it in advance.
That's tremendously important, and that's something that we basically had no idea how to do about five years ago.
But now, you...
we can actually use LLMs and VLMs, ask them questions, and they will make reasonable guesses.
Like, they will not give you expert behavior, but you can say, like, hey, there's a sign that says slippery floor.
Like, what's going to happen when I walk over that?
Kind of pretty obvious, right?
And no autonomous car in 2009 would have been able to answer that question.
So common sense plus the ability to make mistakes and correct those mistakes, like that's sounding like an awful lot like what a person does when they're trying to learn something.
All of that doesn't make robotic manipulation easy necessarily, but it allows us to get started with a smaller scope and then grow from there.
Yeah, that's a really good question.
So I'll start out with maybe a slight modification to your comment, which is that I think they've made a lot of progress.
And in some ways, a lot of the work that we're doing now at Physical Intelligence is built on the backs of lots of other great work that was done, for example, at Google.
Like many of us were actually at Google before.
We were involved in some of that work.
Some of it is work that we're drawing on that others did.
So there's definitely been a lot of progress there.
But –
To make robotic foundation models really work, it's not just a laboratory science kind of experiment.
It also requires kind of industrial scale building effort.
It's more like the Apollo program than it is like a science experiment.
The excellent research that was done in the past in industrial research labs, and I know I was involved in much of that, was very much framed as a fundamental research effort.
And that's good.
Like the fundamental research is really important, but it's not enough by itself.
You need the fundamental research and you also need the impetus to make it real.
And make it real means, like, actually putting the robots out there, getting data that is representative of the kind of tasks that they need to do in the real world, getting that data at scale, building out the systems, getting all that stuff right.
And that requires a degree of focus, a singular focus on really nailing the robotic foundation model for its own sake, not just as a way to do more science, not just as a way to publish a paper, and not just as a way to have a research lab.
Yeah, that's a really good question.
The challenge here is in understanding which axis of scale contributes to which axis of capability.
So if we want to expand capability horizontally, meaning like the robot knows how to do 10 things now and I'd like it to do 100 things later, that can be addressed by just directly horizontally scaling what we already have.
But we want to get robots to a level of capability where they can do practical useful things in the real world, and that requires expanding
Along other axes too, it requires, for example, getting to very high robustness.
It requires getting them to perform tasks very efficiently, quickly.
It requires them to recognize edge cases and respond intelligently.
And those things, I think, can also be addressed with scaling.
But we have to identify the right axes for that, which means figuring out what kind of data to collect, what settings to collect it in, what kind of methods consume that data, how those methods work.
So answering those questions more thoroughly will give us greater clarity on the axes, on those dependent variables, on the things that we need to scale.
And
We don't fully know right now what that will look like.
I think we'll figure it out pretty soon.
It's something we're working on actively.
But we want to really get that right so that when we do scale it up, it'll directly translate into capabilities that are very relevant to practical use.
It's very hard to do because robotic experience consists of time steps that are very correlated with each other.
So like the raw like byte representation is enormous, but probably the information density is comparatively low.
Maybe a better comparison is to the data sets that are used for multimodal training.
And there it's, I believe last time we did that count, it was like between one and two orders of magnitude.
The vision you have of robotics will not be possible until you collect, like, what, 100x, 1000x more data?
Well, that's the thing that we don't know. It's certainly very reasonable to infer that, you know, robotics is a tough problem, and probably it requires as much experience as the language stuff. But because we don't know the answer to that, to me a much more useful way to think about it is not
How much data do we need to get before we're fully done?
But how much data do we need to get before we can get started?
Meaning before we can get a data flywheel that represents a self-sustaining and ever-growing data collection.
Learning on the job or acquiring data in a way that the process of acquisition of that data itself is useful and valuable.
I see.
Like, just some kind of RL.
Like doing something, like, actually real.
Yeah.
I mean, ideally, I would like it to be RL because you can get away with the robot acting autonomously.
Which is easier.
But it's not out of the question that you can have mixed autonomy.
You can, you know, as I mentioned before, robots can learn from all sorts of other signals.
I described how we can have a robot that learns from a person talking to it.
So there's a lot of middle ground in between fully teleoperated robots and fully autonomous robots.
Yeah, so the current model that we have basically is a vision language model that has been adapted for motor control.
So to give you a little bit of like a fanciful brain analogy, a VLM, a vision language model, is basically an LLM that has had a little like pseudo visual cortex grafted to it, a vision encoder, right?
So our models, they have a vision encoder, but they also have an action expert, an action decoder essentially.
So it has like a little visual cortex and notionally a little motor cortex.
And the way that the model actually makes decisions is it reads in the sensory information from the robot.
It does some internal processing and that could involve actually outputting intermediate steps.
Like you might tell it clean up the kitchen and it might think to itself like, hey, to clean up the kitchen, I need to pick up the dish and I need to pick up the sponge and I need to put this and this.
And then eventually it works its way through that chain of thought generation down to the action expert, which actually produces continuous actions.
And that has to be a different module because the actions are continuous.
They're high frequency, so they have a different data format than text tokens.
But structurally, it's still an end-to-end transformer.
And roughly speaking, technically, it corresponds to a kind of mixture of experts architecture.
That's right.
With the exception that the actions are actually not represented as discrete tokens.
It actually uses flow matching, a kind of diffusion, because the actions are continuous and you need to be very precise with your actions for dexterous control.
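To make that split concrete, here is a minimal, hypothetical PyTorch sketch of the idea as described: a vision encoder feeding a shared transformer backbone, with a separate action expert that produces a continuous action chunk by integrating a learned flow from noise. The module names, sizes, and the simple Euler sampler are illustrative assumptions, not the actual Physical Intelligence architecture.

```python
# A minimal, hypothetical sketch of the "VLM backbone + action expert" split
# described above. Module names, sizes, and the Euler flow-matching sampler are
# illustrative assumptions, not the actual Physical Intelligence model.
import torch
import torch.nn as nn

class VisionEncoder(nn.Module):
    """Stand-in 'visual cortex': projects image patch features to backbone tokens."""
    def __init__(self, patch_dim=768, d_model=512):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)

    def forward(self, patches):              # patches: (B, num_patches, patch_dim)
        return self.proj(patches)            # -> (B, num_patches, d_model)

class ActionExpert(nn.Module):
    """Stand-in 'motor cortex': generates a continuous action chunk via flow matching,
    i.e. it learns a velocity field that carries noise toward real action sequences."""
    def __init__(self, d_model=512, action_dim=14, horizon=16):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(d_model + horizon * action_dim + 1, 1024),
            nn.GELU(),
            nn.Linear(1024, horizon * action_dim),
        )

    def velocity(self, context, noisy_actions, t):
        # context: (B, d_model), noisy_actions: (B, horizon*action_dim), t: (B, 1)
        return self.net(torch.cat([context, noisy_actions, t], dim=-1))

    @torch.no_grad()
    def sample(self, context, steps=10):
        # Integrate the learned flow from pure noise to an action chunk (Euler steps).
        B = context.shape[0]
        actions = torch.randn(B, self.horizon * self.action_dim)
        for i in range(steps):
            t = torch.full((B, 1), i / steps)
            actions = actions + self.velocity(context, actions, t) / steps
        return actions.view(B, self.horizon, self.action_dim)

class RobotPolicy(nn.Module):
    """Vision and language tokens go through one shared backbone; its output
    conditions the action expert, which emits continuous, high-frequency actions."""
    def __init__(self, d_model=512):
        super().__init__()
        self.vision = VisionEncoder(d_model=d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.action_expert = ActionExpert(d_model=d_model)

    def forward(self, patches, text_tokens):
        tokens = torch.cat([self.vision(patches), text_tokens], dim=1)
        hidden = self.backbone(tokens)        # joint vision + language processing
        context = hidden.mean(dim=1)          # crude summary, just for the sketch
        return self.action_expert.sample(context)

policy = RobotPolicy()
img = torch.randn(1, 64, 768)     # fake image patch features
txt = torch.randn(1, 12, 512)     # fake embedded instruction, e.g. "clean up the kitchen"
print(policy(img, txt).shape)     # torch.Size([1, 16, 14]): a chunk of continuous actions
```

The "mixture of experts" comment above refers to the action expert being a separate set of weights inside the same end-to-end transformer; the mean-pooled context here is just a crude stand-in for that tighter coupling.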
Yeah, so one theme here that like I think is important to keep in mind is that
The reason that those building blocks are so valuable is because the AI community has gotten a lot better at leveraging prior knowledge.
And a lot of what we're getting from the pre-trained LLMs and VLMs is prior knowledge about the world.
And it's a little bit abstracted knowledge.
You can identify objects.
You can figure out roughly where things are in an image, that sort of thing.
But I think if I had to summarize in one sentence
the big benefit that recent innovations in AI give to robotics is really the ability to leverage prior knowledge.
And I think the fact that the model is the same model, that's kind of always been the case in deep learning, but it's that ability to pull in that prior knowledge, that abstract knowledge that can come from many different sources.
That's really powerful.
What's up with that?
Yeah, yeah.
Yeah, so I have maybe two things I can say there.
I have some bad news and some good news.
So the bad news is what you're saying is really getting at the core of a long-running challenge with video and image generation models.
Yeah.
In some ways, the idea of getting intelligent systems by predicting video is even older than the idea of getting intelligent systems by predicting text.
But the text stuff turned into practically useful things earlier than the video stuff did.
I mean, the video stuff is great.
You can generate cool videos, and I think that the work there that's been done recently is amazing.
But it's not like just generating videos and images has already resulted in systems that have this kind of, like, deep understanding of the world where you can, like, ask them to do stuff beyond just generating more images and videos.
Whereas with language, clearly it has.
And I think that this point about representations is really key to it.
One way we can think about it is this, that if you...
Imagine pointing a camera outside this building.
There's the sky.
There's the clouds are moving around, the water, cars driving around, people.
If you want to predict everything that will happen in the future, you can do so in many different ways.
You can say, okay, there's people around, so let me get really good at understanding, like, the psychology of how people behave in crowds and predict the pedestrians.
But you could also say, like, well, there's clouds moving around.
Let me, like, understand everything about water molecules and ice particles in the air.
And you can go super deep on that.
Right.
If you want to fully understand down to the subatomic level everything that's going on, as a person, you could spend decades just thinking about that, and you'll never even get to the pedestrians or the water, right?
So if you want to really predict everything that's going on in that scene, there's just so much stuff.
that even if you're doing a really great job and capturing like 100% of something, by the time you get to everything else, like, you know, ages will have passed.
Whereas with text, it's already been abstracted into those bits that we as humans care about.
So the representations are already there, and they're not just good representations.
They actually focus in on what really matters.
Okay, so that's the bad news.
Here's the good news.
The good news is that...
we don't have to just get everything out of like pointing a camera outside this building.
Because when you have a robot, that robot is actually trying to do a job.
So it has...
Yeah.
And its perception is in service to fulfilling that purpose.
And that is like a really great focusing factor.
We know that for people this really matters.
Like literally what you see is affected by what you're trying to do.
Like there's been no shortage of psychology experiments showing that people have like almost a shocking degree of tunnel vision where they will like literally not see things right in front of their eyes if it's not relevant to what they're trying to achieve.
Yeah.
And that is tremendously powerful.
There must be a reason why people do that because certainly if you're out in the jungle, seeing more is better than seeing less.
So if you have that powerful focusing mechanism, it must be darn important for getting you to achieve your goal.
And I think robots will have that focusing mechanism because they're trying to achieve a goal.
Well, let me put it this way.
Like let's say that I gave you lots of videotapes or lots of recordings of different sporting events and gave you a year to just watch sports.
And then after that year, I told you, okay, now your job, you're going to be playing tennis.
Okay, that's pretty dumb, right?
Whereas if I told you first, you're going to be playing tennis, and then I let you study up, right?
Now you really know what you're looking for.
So I think that actually...
There's a very real challenge here.
I don't want to understate the challenge, but I do think that there's also a lot of potential for foundation models that are embodied, that learn from interaction, from controlling robotic systems, to actually be better at absorbing the other data sources because they know what they're trying to do.
I don't think that that by itself is like a silver bullet.
I don't think it solves everything.
But I think that it does help a lot.
And I think that we've already seen the beginnings of that where we can see that including web data in training for robots really does help with generalization.
And I actually have the suspicion that in the long run, it'll make it easier to use those sources of data that have been tricky to use up until now.
Yeah.
So there's a subtlety here.
Emergent capabilities don't just come from the fact that internet data has a lot of stuff in it.
They also come from the fact that generalization, once it reaches a certain level, becomes compositional.
There was a cute example that one of my students really liked to use in some of his presentations, which is –
You know what International Phonetic Alphabet is?
No.
So if you look in a dictionary, they'll have the pronunciation of a word written in kind of funny letters.
That's basically International Phonetic Alphabet.
So it's an alphabet that is pretty much exclusively used for writing down pronunciations of individual words in dictionaries.
And you can ask an LLM to write you a recipe for making some meal in International Phonetic Alphabet, and it will do it.
And that's like, holy crap.
That is definitely not something that it has ever seen because—
IPA is only ever used for writing down pronunciations of individual words.
So that's compositional generalization.
It's putting together things you've seen like that in new ways.
And it's like, you know, arguably there's nothing like profoundly new here because like, yes, you've seen different words written that way, but you figured out that now you can compose the words in this other language the same way that you've composed words in English.
So that's actually where the emergent capabilities come from.
And
Because of this, in principle, if we have a sufficient diversity of behaviors, the model should figure out that those behaviors can be composed in new ways as the situation calls for it.
We've actually seen things even with our current models, which, I should say, in the grand scheme of things, looking back five years from now, we'll probably think are tiny in scale, but we've already seen what I would call emergent capabilities.
When we were playing around with some of our laundry folding policies,
Actually, we discovered this by accident.
The robot accidentally picked up two T-shirts out of the bin instead of one, starts folding the first one, the other one gets in the way, picks up the other one, throws it back in the bin.
And we're like, we didn't know it would do that.
Like, holy crap.
And then we tried to play around with it, and it's like, yep, it does that every time.
Like, you can...
drop in, you know, it's doing its work, drop something else on the table, just pick it up, put it back, right?
Okay, that's cool.
Shopping bag, it starts putting things in the shopping bag, the shopping bag tips over, it picks it back up and stands it upright.
We didn't tell anybody to collect data for that.
I'm sure somebody accidentally at some point or maybe intentionally picked up the shopping bag, but it's just...
You have this kind of compositionality that emerges when you do learning at scale.
And that's really where all these remarkable capabilities come from.
And now you put that together with language.
You put that together with all sorts of chain of thought reasoning.
And there's a lot of potential for the model to compose things in new ways.
I mean it's not that there's something good about having less memory to be clear.
Like I think that adding memory, adding longer context, all that stuff, adding higher resolution images, I think those things will make the model better.
But the reason why it's not the most important thing for the kind of skills that you saw when you visited us –
At some level, I think it comes back to Moravec's paradox.
So Moravec's paradox is basically that it's like, you know, if you want to know one thing about robotics, it's like that's the thing.
Moravec's paradox says that basically in AI, the easy things are hard and the hard things are easy, meaning like the things that we take for granted, like picking up objects, perceiving the world, all that stuff, those are all the hard problems in AI.
And the things that we find challenging, like playing chess and doing calculus, actually are often the easier problems.
And I think this memory stuff is actually Moravec's paradox in disguise, where we think that the cognitively demanding tasks that we do, that we find hard, that kind of cause us to think like, oh, man, I'm sweating.
I'm working so hard.
Those are the ones that require us to keep lots of stuff in memory, lots of stuff in our minds.
Like if you're solving some big math problem, if you're having a complicated technical conversation on a podcast, like those are things we have to keep all those pieces, all those puzzle pieces in your head.
If you're
doing a well-rehearsed task, if you are an Olympic swimmer and you're swimming with perfect form and you're like right there in the zone, like people even say like it's in the moment,
It's in the moment, right?
It's like you've practiced it so much, you've baked it into your neural network in your brain that you don't have to think carefully about keeping all that context, right?
So it really is just Moravec's paradox manifesting itself, but that doesn't mean that we don't need the memory.
It just means that if we want to match the level of dexterity and physical proficiency that people have, there's other things we should get right first and then gradually go up that stack
into the more cognitively demanding areas, into reasoning, into context, into planning, all that kind of stuff.
And that stuff will be important too.
Yeah.
Well, that's a very big question.
Yeah, let's try to unpack this a little bit.
I think there's a lot going on in there.
One thing that I would say is a really interesting technical problem, and I think that it's something where we'll see
perhaps a lot of really interesting innovation over the next few years is the question of representation for context.
So if you imagine the, like some of the examples you gave, like if you have a home robot that's doing something and needs to keep track,
As a person, there are certainly some things where you keep track of them very symbolically, like almost in language.
Like, you know, I have my checklist.
I'm going shopping and I, you know, at least for me, I can like literally visualize in my mind like my checklist.
Like, you know, pick up the yogurt, pick up the milk, pick up whatever.
And I'm not like picturing the milk shelf with the milk sitting there.
I'm just thinking like milk, right?
But then there's other things that are much more spatial, almost visual things.
When I was trying to get to your studio, I was thinking like, okay, here's what this street looks like.
Here's what that street looks like.
Here's what I expect the doorway to look like.
So representing your context in the right form that captures what you really need to achieve your goal and otherwise kind of discards all the unnecessary stuff, that's a really important thing.
Yeah.
And I think we're seeing the beginnings of that with multimodal models.
But I think that multimodality has so much more to it than just like image plus text.
And I think that that's a place where there's a lot of room for really exciting innovation.
Yeah, how we represent both context, both what happened in the past, and also plans or reasoning, as you can call it in our world, which is what we would like to happen in the future or intermediate processing stages in solving a task.
I think doing that in a variety of modalities, including potentially learned modalities that are suitable for the job, is something that has, I think, enormous potential to overcome some of these challenges.
Interesting.
Yeah, that's a really good question.
So I definitely don't know the answer to this.
I am not by any means well-versed in neuroscience.
But if I had to guess and also provide an answer that leans more on things I know, it's something like this, that the brain is extremely parallel.
It kind of has to be just because of the biophysics.
But
It's even more parallel than your GPU.
If you think about how a modern multimodal language model processes the input, if you give it some images and some text, first it reads in the images, then it reads in the text, and then proceeds one token at a time to generate the output.
It makes a lot more sense to me for an embodied system to have parallel processes.
Now, mathematically, you can actually make close equivalences between parallel and sequential stuff.
Like transformers aren't actually fundamentally sequential.
Like you kind of make them sequential by putting in position embeddings.
Transformers are fundamentally actually very parallelizable things.
That's what makes them so great.
So I don't think that, mathematically, this highly parallel thing where you're doing perception and proprioception and planning all at the same time actually necessarily needs to look that different from a transformer, although its practical implementation will be different.
And you could imagine that the system will in parallel think about
okay, here's like my long-term memory, like here's what I've seen, you know, a decade ago.
Here's my short-term kind of spatial stuff.
Here's my semantic stuff.
Here's what I'm seeing now.
Here's what I'm planning.
And all of that can be implemented in a way that there's some, you know, very familiar kind of attentional mechanism, but in practice, all running in parallel, maybe at different rates, maybe with the more complex things running slower, the faster reactive stuff running faster.
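As a toy illustration of that multi-rate idea, here is a hypothetical sketch in which a slow, deliberate planning step updates a waypoint only occasionally while a cheap reactive loop runs every control step; the planner, controller, and rates are invented for the example.

```python
# Toy sketch of processes running at different rates: slow deliberate planning,
# fast reactive control. Everything here (dynamics, gains, rates) is made up.
import numpy as np

target = np.array([1.0, 2.0])

def slow_planner(state):
    # Expensive "thinking": recompute a waypoint partway toward the goal.
    return state + 0.5 * (target - state)

def fast_controller(state, waypoint):
    # Cheap reactive step toward the current waypoint.
    return state + 0.1 * (waypoint - state)

state = np.zeros(2)
waypoint = slow_planner(state)
for step in range(200):           # imagine a 50 Hz control loop
    if step % 25 == 0:            # re-plan at a much lower rate, e.g. 2 Hz
        waypoint = slow_planner(state)
    state = fast_controller(state, waypoint)

print(state)  # close to the target even though planning ran only a handful of times
```

In a real system the slow process might be something like chain-of-thought or replanning on a bigger model, and the fast loop the reactive low-level policy, but the structure is the same.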
I think there are a lot of things to this question.
I think certainly there's like a really fascinating systems problem.
I'm by no means a systems expert, but I would imagine that the right architecture in practice, especially if you want an affordable low-cost system, would be to externalize at least part of the thinking.
You know, you could imagine maybe in the future you'll have a robot where, if your internet connection is not very good, the robot is in kind of a dumber reactive mode.
But if you have a good internet connection, then it can like be a little smarter.
Right.
It's pretty cool.
But I think there are also research and algorithmic things that can help here.
Like figuring out the right representations, concisely representing both your past observations, but also changes in observation, right?
Like, you know, your sensory stream is extremely temporally correlated, which means that the marginal information gained from each additional observation is not the same as the entirety of that observation.
because the image that I'm seeing now is very correlated to the image I saw before.
So in principle, if I want to represent it concisely, I could get away with a much more compressed representation than if I represent the images independently.
So there's a lot that can be done on the algorithm side to get this right, and that's really interesting algorithms work.
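As a toy illustration of that compression argument, here's a hypothetical sketch that keeps an observation embedding in the context only when it differs enough from the last one kept; the embeddings, the cosine test, and the threshold are all made up for the example.

```python
# Toy illustration: consecutive observations are highly correlated, so keeping
# only the "sufficiently new" ones compresses the context a lot. The embeddings,
# similarity test, and threshold below are invented for the example.
import numpy as np

def compress_stream(embeddings, threshold=0.95):
    """Keep an embedding only if its cosine similarity to the last kept one
    drops below `threshold`, i.e. only when something actually changed."""
    kept = [embeddings[0]]
    for e in embeddings[1:]:
        last = kept[-1]
        cos = float(e @ last / (np.linalg.norm(e) * np.linalg.norm(last) + 1e-8))
        if cos < threshold:
            kept.append(e)
    return kept

# Fake sensory stream: long stretches of near-identical frames, a real change
# every 200 steps.
rng = np.random.default_rng(0)
frames, current = [], rng.normal(size=128)
for t in range(1000):
    if t % 200 == 0:
        current = rng.normal(size=128)
    frames.append(current + 0.01 * rng.normal(size=128))

kept = compress_stream(np.stack(frames))
print(len(frames), "raw observations ->", len(kept), "kept in context")
```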
I think there's also like a really fascinating systems problem.
To be truthful, like, I haven't gotten to the systems problem because, you know, you want to implement the system once you sort of know the shape of the machine learning solution.
But I think there's a lot of cool stuff to do there.
I don't know.
But if I were to guess, I would guess that we'll actually see both.
That we'll see low-cost systems with off-board inference and more reliable systems, for example, in settings where, like if you have an outdoor robot or something where you can't rely on connectivity, that are costlier and have on-board inference.
A few things I'll say...
from a technical standpoint, that might contribute to understanding this.
While a real-time system obviously needs to be controlled in real time, often at high frequency, the amount of thinking you actually need to do for every time step might be surprisingly low.
And again, we see this in humans and animals.
When we...
plan out movements, there is definitely a real planning process that happens in the brain.
If you record from a monkey brain, you will actually find neural correlates of planning.
And there is something that happens in advance of a movement, and when that movement actually takes place, the shape of the movement correlates with what happened before the movement.
Like that's planning, right?
So that means that you put something in place and, you know, set the initial conditions of some kind of process and then unroll that process and that's the movement.
And that means that during that movement, you're doing less processing and you kind of batch it up in advance.
But you're not like entirely an open loop.
It's not like you're playing back a tape recorder.
You are actually reacting as you go.
You're just reacting at a different level of abstraction, a more basic level of abstraction.
And, again, this comes back to representations.
Figure out which representations are sufficient for kind of planning in advance and then unrolling, and which representations require a tight feedback loop.
And for that tight feedback loop, like, what are you doing feedback on?
Like, you know, if I'm driving a vehicle, maybe I'm doing feedback on the position of the lane marker so that I stay straight.
And then at a lower frequency, I sort of gauge where I am in traffic.
Yeah.
So the key here is prior knowledge.
Yeah.
So, in order to effectively learn from your own experience, it turns out that it's really, really important to already know something about what you're doing.
Otherwise, it takes far too long.
It's just like it takes a person when they're a child a very long time to learn very basic things, to learn to write for the first time, for example.
Once you already have some knowledge, then you can learn new things very quickly.
The point of training the models with supervised learning now is to build out that foundation that provides the prior knowledge, so they can figure things out much more quickly later.
And, again, this is not a new idea.
This is exactly what we've seen with LLMs, right?
LLMs started off being trained purely with next token prediction, and that provided an excellent starting point first for all sorts of synthetic data generation and then for RL.
So I think it makes total sense that we would expect basically any foundation model effort to follow that same trajectory where we first build out the foundation, essentially in like a somewhat brute force way.
And the stronger that foundation gets, the easier it is to then make it even better with much more accessible training.
I really hope that they will actually be the same.
And, you know, obviously I'm extremely biased.
I love robotics.
I think it's like it's very fundamental to AI.
But I think, optimistically, that it's actually the other way around, that the robotics
element of the equation will make all the other stuff better.
There are two reasons for this that I could tell you about.
One has to do with representations and focus.
What I said before, with video prediction models, if you just want to predict everything that happens, it's very hard to figure out what's relevant.
If you have the focus that comes from actually trying to do a task, now that acts to structure how you see the world in a way that allows you to more fruitfully utilize the other signals.
That could be extremely powerful.
The second one is that understanding the physical world at a very deep fundamental level, at a level that goes beyond just what we can articulate with language, can actually help you solve other problems.
And we experience this all the time.
Like when we talk about abstract concepts, we say like this company has a lot of momentum.
We'll use social metaphors to describe inanimate objects like my computer hates me.
We experience the world in a particular way and our subjective experience shapes how we think about it in very profound ways.
And then we use that as a hammer to basically hit all sorts of other nails that are far too abstract to handle any other way.
Well, and I should say that the coding is probably like the pinnacle of abstract knowledge work in the sense that like just by the mathematical nature of computer programming, it's an extremely abstract activity, which is why people struggle with it so much.
This is a very subtle question.
Your example with the airplane pilot using simulation is really interesting.
But something to remember is that
When a pilot is using a simulator to learn to fly an airplane, they're extremely goal-directed.
So their goal in life is not to learn to use a simulator.
Their goal in life is to learn to fly the airplane.
They know there will be a test afterwards, and they know that eventually they'll be in charge of like a few hundred passengers, and they really need to not crash that thing.
And when we train –
models on data from multiple different domains, the models don't know that they're supposed to solve a particular task. They just see, like, hey, here's one thing I need to master, here's another thing I need to master. So maybe a better analogy there is if you're playing a video game where you can fly an airplane, and then eventually someone puts you in the cockpit of a real one. It's not that the video game is useless, but it's not the same thing. And if you're
trying to play that video game and your goal is to like really like master the video game, you're not going to go about it in quite the same way.
Yeah, yeah.
So I think what you're trying to say is basically that, well, maybe if we have like a really smart model that's doing meta-learning, perhaps it can figure out that its performance on a downstream problem, a real-world problem, is increased by doing something in a simulator.
Yeah, that's right.
But here's the thing with this.
There's –
A set of these ideas that are all going to be something like: train to make it better on the real thing by leveraging something else.
And the key linchpin for all of that is the ability to train to be better on the real thing.
The thing is like I actually suspect in reality we might not even need to do something quite so explicit because –
Meta-learning is emergent, as you pointed out before, right?
Like, LLMs essentially do a kind of meta-learning via in-context learning.
I mean, we can debate as to how much that's learning or not, but the point is that large, powerful models trained on the right objective on real data get much better at leveraging all the other stuff.
And I think that's actually the key.
And coming back to your airplane pilot, like, the airplane pilot is trained on a real-world objective.
Like, their objective is to be a good airplane pilot, to be successful, to have a good career.
And all of that kind of propagates back into the actions they take in leveraging all these other data sources.
So what I think is actually the key here to leveraging auxiliary data sources, including simulation, is to build the right foundation model that is really good, that has those emergent abilities.
And to your point...
To get really good like that, it has to have the right objective.
Now, we know how to get the right objective out of real-world data.
Maybe we can get it out of other things, but that's harder right now.
And I think that, again, we can look to the examples of what happened in other fields.
Like these days, if someone trains an LLM for solving complex problems, they're using lots of synthetic data.
But the reason they're able to leverage that synthetic data effectively is because they have this starting point that is trained on lots of real data that kind of gets it.
And once it gets it, then it's more able to leverage all this other stuff.
So I think perhaps ironically, the key to leveraging other data sources, including simulation, is to get really good at using real data, understand what's up with the world, and then now you can fruitfully use all this other stuff.
So here's what I would say, that deep down at a very fundamental level,
The synthetic experience that you create yourself doesn't allow you to learn more about the world.
It allows you to rehearse things.
It allows you to consider counterfactuals.
But somehow information about the world needs to get injected into the system.
So – and I think the way you pose this question actually elucidates this very nicely because –
In robotics, classically, people have often thought about simulation as a way to inject human knowledge because a person knows how to write down differential equations.
They can code it up, and that gives the robot more knowledge than it had before.
But I think that increasingly what we're learning from experiences in other fields, from how the video generation stuff goes, from synthetic data for LLMs, is that actually probably the most powerful way to create synthetic experience is from a really good model.
Because, you know, the model probably knows more than a person does about those fine-grained details.
But then, of course, where does that model get the knowledge from experiencing the world?
So, in a sense, what you said, I think, is actually quite right in that a very powerful AI system can simulate a lot of stuff.
But also, at that point, it kind of almost doesn't matter because viewed as a black box, what's going on with that system is that information comes in and capability comes out.
And whether the way it processes that information is by imagining some stuff and simulating or by some model-free method is kind of irrelevant in understanding its capabilities.
Well, yeah, I mean, certainly when you sleep, your brain does stuff that looks an awful lot like what it does when it's awake, that looks an awful lot like playing back experience or perhaps generating new statistically similar experience.
And
So I think it's very reasonable to guess that perhaps simulation through a learned model is like part of how your brain figures out like counterfactuals basically.
But something that's kind of even more fundamental than that is that
Optimal decision-making at its core, regardless of how you do it, requires considering counterfactuals.
You basically have to ask yourself, if I did this instead of that, would it be better?
And you have to answer that question somehow.
And whether you answer that question by using a learned simulator or whether you answer that question by using a value function or something like that, by using a reward model, in the end, it's kind of all the same.
Like, as long as you have some mechanism for considering counterfactuals and figuring out which counterfactual is better, you've got it.
Yeah.
So...
I like thinking about it this way because it kind of simplifies things.
It tells us that the key is not necessarily to do really good simulation.
The key is to figure out how to answer counterfactuals.
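To make that equivalence concrete, here's a toy sketch (every function in it is invented for illustration) where the same question, "which of these candidate actions is better?", is answered once by rolling out a learned model and once by querying a value function; seen from the outside, both are just counterfactual rankers.

```python
# Toy sketch of the point above: two interchangeable ways to answer a
# counterfactual question. The dynamics, reward, and Q-function are made up.
import numpy as np

rng = np.random.default_rng(0)

def learned_dynamics(state, action):
    # Stand-in for a learned simulator / world model.
    return state + 0.1 * action

def reward(state):
    # Stand-in for a reward model: closer to the origin is better.
    return -float(np.linalg.norm(state))

def rank_by_simulation(state, candidates, horizon=5):
    """Answer the counterfactual by imagining a rollout for each candidate action."""
    scores = []
    for a in candidates:
        s, total = state.copy(), 0.0
        for _ in range(horizon):
            s = learned_dynamics(s, a)
            total += reward(s)
        scores.append(total)
    return int(np.argmax(scores))

def rank_by_value_function(state, candidates, q):
    """Answer the same counterfactual directly with a learned Q(s, a)."""
    return int(np.argmax([q(state, a) for a in candidates]))

def toy_q(state, action):
    # A Q-function that happens to encode the same preferences as the simulator.
    return reward(learned_dynamics(state, action))

state = rng.normal(size=3)
candidates = [rng.normal(size=3) for _ in range(4)]
print("best action via rollouts:      ", rank_by_simulation(state, candidates))
print("best action via value function:", rank_by_value_function(state, candidates, toy_q))
```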
That's cool.
So you're basically saying, like, how much concrete should I buy now to build a data center so that by 2030 I can power all the robots?
Yeah, yeah.
That is a more ambitious way of thinking about it than has occurred to me.
But it's a cool question.
I mean, the good thing, of course, is that the robots can help you build that stuff.
I mean in principle quite a lot, right?
I think that –
We have a tendency sometimes to think about robots as like mechanical people, but that's not the case, right?
Like people are people and robots are robots.
Like the better analogy for the robot, it's like your car or a bulldozer.
Like it has much lower maintenance requirements.
You can put them into all sorts of weird places and they don't have to look like people at all.
You can make a robot that's, you know, 100 feet tall.
You can make a robot that's tiny.
So I think that if you have the intelligence to power
very heterogeneous robotic systems, you can probably actually do a lot better than just having like, you know, mechanical people in effect.
And it can be a big productivity boost for the real people, and it can allow you to solve problems that are very difficult to solve now.
You can, you know, for example,
I'm not an expert on data centers by any means, but you could build your data centers in a very remote location because the robots don't have to worry about whether there's like a shopping center nearby.
Yeah, these are very tough questions.
And also, you know...
Economies of scale in robotics so far have not functioned the same way that they probably would in the long term.
Just to give you an example, when I started working in robotics in 2014, I used a very nice research robot called the PR2 that cost $400,000 to purchase.
When I started my research lab at UC Berkeley, I bought robot arms that were $30,000.
The kind of robots that we are using now at Physical Intelligence, each arm costs about $3,000, and we think they can be made for a small fraction of that.
So these things...
What is the cause of that learning rate?
Well, there are a few things.
So one, of course, has to do with economies of scale.
So custom-built, high-end research hardware, of course, is going to be much more expensive than kind of more productionized hardware.
But the other, of course, is a technological element, which is that as we get better at
building actuated machines, they become cheaper.
But there's also a software element, which is the smarter your AI system gets, the less you need the hardware to satisfy certain requirements.
So traditional robots and factories, they need to make motions that are highly repeatable, and therefore it requires a degree of precision and robustness that you don't need if you can use cheap visual feedback.
So AI also makes robots more affordable and lowers the requirements on the hardware.
That is a great question for my co-founder, Adnan Esmail, who is probably like the best person arguably in the world to ask that question of.
But certainly the drop in costs that I've seen has surprised me year after year.
So I don't know the answer to that question, but it's also a tricky question to answer because not all arms are made equal.
Like arguably the kind of robots that are, like, assembling cars in a factory are just not the right kind to think about.
Very few, because they are not currently commercially deployed, unlike the factory robots.
Well, you know, economies are very good at filling demand when there's a lot of demand, right?
Like how many iPhones were in the world in 2001, right?
That's right.
So I think there's definitely a challenge there.
And I think it's something –
that is worth thinking about.
And a particularly important question for researchers like myself is how can AI affect how we think about hardware?
Because there are some things that I think are going to be really, really important.
Like you probably want your thing to like not break all the time.
There's some things that are firmly in that category of like question marks.
Like how many fingers do we need?
Like you said yourself before that you were kind of surprised that a robot with two fingers can do a lot.
Okay, maybe you still want like more than that, but still like finding the bare minimum that still lets you have
good functionality, that's important; that's in the question mark box. And there are some things that I think we probably don't need. Like, we probably don't need the robot to be, like, super duper precise, because we know that feedback can compensate for that. So I think my job, as I see it right now, is to figure out what's sort of the minimal package we can get away with. And I really like to think about robots in terms of a minimal package, because I don't think that we will have, like, the one ultimate robot, like sort of the mechanical person, basically.
I think what we will have is a bunch of things that good, effective robots need to satisfy, just like good smartphones need to have a touchscreen.
That's something that we all kind of agreed on.
And then a bunch of other stuff that's kind of optional depending on the need, depending on the cost point, et cetera.
And I think there will be a lot of innovation where once we have very capable AI systems that can be plugged into any robot to endow it with some basic level of intelligence, then lots of different people can innovate on how to get the robot hardware to be optimal for each niche it needs to be in.
Not right now.
Maybe there will be someday.
Maybe I'm being idealistic, but I would really like to see a world where there's a lot of heterogeneity in robots.
It's a tough question to answer, mainly because things are changing so fast.
I think that, to me, the things that I spend a significant amount of time thinking about on the hardware side is really more like reliability and cost.
It's not that I'm that worried about cost.
It's just that cost translates to
number of robots, which translates to amount of data.
And being an ML person, I really like having lots of data.
So I really like having robots that are low cost because then I can have more of them and therefore more data.
And reliability is important more or less for the same reason.
But I think it's something that we'll get more clarity on as things progress, because the AI systems of today are not pushing the hardware to the limit.
So as the AI systems get better and better, the hardware will get pushed to the limit, and then we'll hopefully have a much better answer to your question.
Yeah.
So this is a very complex question.
I'll start with the broader themes and then try to drill a little bit into the details.
So
One broader theme here is that if you want to have an economy where you get ahead by having a highly educated workforce, by having people with high productivity, meaning that for each person's hour of work a lot gets done, then automation is really, really good, because automation is what multiplies the amount of productivity that each person has.
Again, same as like LLM coding tools.
LLM coding tools amplify the productivity of a software engineer.
Robots will amplify the productivity of basically everybody that is doing work.
Now, that's kind of like a final state, like a desirable final state.
Now, there's a lot of complexity in how you get to that state, how you make that an appealing journey to society, how you navigate the geopolitical dimension of that.
All of that stuff is actually pretty complicated, and it requires making a number of really good decisions, like good decisions about investing in a balanced robotics ecosystem, supporting
both software innovation and hardware innovation.
I don't think any of those are insurmountable problems.
It just requires a degree of kind of long-term vision and the right kind of balance of investment.
But what makes me really optimistic about this is that final state.
I think we can all agree that in the United States, we would like to have the kind of society where people are highly productive, where we have highly educated people doing high-value work.
And because that end state seems to me very compatible with automation, with robotics, at some level there should be a lot of incentive to get to that state.
And then from there, we have to solve for all the details that will help us get there.
And that's not easy.
I think there are a lot of complicated decisions that need to be made in terms of private industry, in terms of investment, in terms of the political dimension.
But I'm very optimistic about it, because it seems to me that the light at the end of the tunnel is in the right direction.
So, again, for the specifics of how we make that happen, I think that's a very long conversation that I'm probably not the most qualified to speak to.
But I think that in terms of the ingredients, the ingredient here that I think is important is that robots help with –
physical things, physical work.
And if producing robots is itself physical work, then getting really good at robotics should help with that.
It's a little circular, of course, and as with all circular things, you have to kind of bootstrap it and try to get that engine going.
It seems like it is an easier problem to address than, for example, the problem of digital devices where work goes into creating computers, phones, et cetera, but the computers and phones don't themselves help with the work.
Well, yeah.
And this is why I said before that I think something really important to get right here is a balanced robotics ecosystem, right?
I think AI is tremendously exciting, but I think we should also recognize that getting AI right is not the only thing that we need to do.
And we need to think about how to balance our priorities, our investment, the kind of things that we spend our time on.
Just as an example, at Physical Intelligence, we do take hardware very seriously, actually.
We build a lot of our own things, and we want to have a hardware roadmap alongside our AI roadmap.
But I think that that's just us.
I think that for the United States, arguably for human civilization as a whole, I think we need to think about these problems very holistically.
And I think it is easy to get distracted sometimes when there's a lot of excitement and a lot of progress in one area like AI, and we're tempted to lose track of other things, including things you've said: there's a hardware component.
There's an infrastructure component with compute and things like that.
So I think that in general it's good to have a more holistic view of these things and I wish we had, you know, more holistic conversations about that sometimes.
So I think at some level that's a very reasonable way to look at things.
But I think that if there's one thing that I've learned about technology, it's that it rarely evolves quite the way that people expect.
And sometimes the journey is just as important as the destination.
So I think it's actually very difficult to plan ahead for an end state.
But I think directionally what you said makes a lot of sense.
And I do think that it's very important for us collectively to think about how to structure the world around us in a way that is amenable to greater and greater automation across all sectors.
But I think we should really think about the journey just as much as the destination because things evolve in all sorts of unpredictable ways.
And we'll find automation showing up in all sorts of places, probably not the places we expect first.
So, you know, I think the constant here that is really important is that education is really, really valuable.
Education is the best buffer somebody has against the negative effects of change.
So if there is like one single lever that we can pull collectively as a society, it's like more education because that's very helpful.
Well, what education gives you is flexibility.
So it's less about the particular facts you know than about your ability to acquire skills and understanding.
So it has to be good education.
Right.
Yeah, this was intense.