
Lex Fridman Podcast

#416 – Yann LeCun: Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI

Thu, 07 Mar 2024

Description

Yann LeCun is the Chief AI Scientist at Meta, professor at NYU, Turing Award winner, and one of the most influential researchers in the history of AI.

Please support this podcast by checking out our sponsors:
- HiddenLayer: https://hiddenlayer.com/lex
- LMNT: https://drinkLMNT.com/lex to get free sample pack
- Shopify: https://shopify.com/lex to get $1 per month trial
- AG1: https://drinkag1.com/lex to get 1 month supply of fish oil

Transcript: https://lexfridman.com/yann-lecun-3-transcript

EPISODE LINKS:
Yann's Twitter: https://twitter.com/ylecun
Yann's Facebook: https://facebook.com/yann.lecun
Meta AI: https://ai.meta.com/

PODCAST INFO:
Podcast website: https://lexfridman.com/podcast
Apple Podcasts: https://apple.co/2lwqZIr
Spotify: https://spoti.fi/2nEwCF8
RSS: https://lexfridman.com/feed/podcast/
YouTube Full Episodes: https://youtube.com/lexfridman
YouTube Clips: https://youtube.com/lexclips

SUPPORT & CONNECT:
- Check out the sponsors above, it's the best way to support this podcast
- Support on Patreon: https://www.patreon.com/lexfridman
- Twitter: https://twitter.com/lexfridman
- Instagram: https://www.instagram.com/lexfridman
- LinkedIn: https://www.linkedin.com/in/lexfridman
- Facebook: https://www.facebook.com/lexfridman
- Medium: https://medium.com/@lexfridman

OUTLINE:
Here's the timestamps for the episode. On some podcast players you should be able to click the timestamp to jump to that time.
(00:00) - Introduction
(09:10) - Limits of LLMs
(20:47) - Bilingualism and thinking
(24:39) - Video prediction
(31:59) - JEPA (Joint-Embedding Predictive Architecture)
(35:08) - JEPA vs LLMs
(44:24) - DINO and I-JEPA
(45:44) - V-JEPA
(51:15) - Hierarchical planning
(57:33) - Autoregressive LLMs
(1:12:59) - AI hallucination
(1:18:23) - Reasoning in AI
(1:35:55) - Reinforcement learning
(1:41:02) - Woke AI
(1:50:41) - Open source
(1:54:19) - AI and ideology
(1:56:50) - Marc Andreessen
(2:04:49) - Llama 3
(2:11:13) - AGI
(2:15:41) - AI doomers
(2:31:31) - Joscha Bach
(2:35:44) - Humanoid robots
(2:44:52) - Hope for the future

Transcription

0.109 - 15.018 Lex Fridman

The following is a conversation with Yann LeCun, his third time on this podcast. He is the chief AI scientist at Meta, professor at NYU, Turing Award winner, and one of the seminal figures in the history of artificial intelligence.

15.678 - 47.41 Lex Fridman

He and Meta AI have been big proponents of open sourcing AI development and have been walking the walk by open sourcing many of their biggest models, including Llama 2 and eventually Llama 3. Also, Yann has been an outspoken critic of those people in the AI community who warn about the looming danger and existential threat of AGI. He believes that AGI will be created one day, but it will be good.

47.93 - 74.872 Lex Fridman

It will not escape human control, nor will it dominate and kill all humans. At this moment of rapid AI development, this happens to be somewhat a controversial position. And so it's been fun seeing Yann get into a lot of intense and fascinating discussions online, as we do in this very conversation. And now a quick few second mention of each sponsor. Check them out in the description.

75.313 - 102.063 Lex Fridman

It's the best way to support this podcast. We've got Hidden Layer for securing your AI models, Element for electrolytes, Shopify for shopping for stuff online, and AG1 for delicious health. Choose wisely, my friends. Also, if you want to get in touch with me for whatever reason, maybe to work with our amazing team, go to lexfridman.com slash contact.

102.564 - 125.398 Lex Fridman

And now onto the full ad reads, never any ads in the middle. I try to make these interesting. I don't know why I'm talking like this, but I am. There's a staccato nature to it. Speaking of staccato, I've been playing a bit of piano. Anyway, if you skip these ads, please still check out the sponsors. We love them. I love them. I enjoy their stuff. Maybe you will too.

127.13 - 154.805 Lex Fridman

This episode is brought to you by an on-theme, in-context, see what I did there, sponsor. Since this is Yann LeCun, artificial intelligence, machine learning, one of the seminal figures in the field. So of course you're going to have a sponsor that's related to artificial intelligence. Hidden Layer, they provide a platform that keeps your machine learning models secure.

156.141 - 177.537 Lex Fridman

The ways to attack machine learning models, large language models, all the stuff we talk about with Yann, there's a lot of really fascinating work, not just large language models, but the same for video, video prediction, tokenization, where the tokens are in the space of concepts versus the space of literally letters, symbols.

178.377 - 205.971 Lex Fridman

JEPA, V-JEPA, all of that stuff that they're open sourcing, all the stuff they're publishing on, just really incredible. But that said, all of those models have security holes in ways that we can't even anticipate or imagine at this time. And so you want good people to be trying to find those security holes, trying to be one step ahead of the people that are trying to attack. So if you're, especially a company that's relying on these models, you need to

207.13 - 235.01 Lex Fridman

have a person who's in charge of saying, yeah, this model that you got from this place has been tested, has been secured. Whether that place is Hugging Face or any other kind of stuff, or any other kind of repository or model zoo kind of place. I think the more and more we rely on large language models or just AI systems in general, the more the security threats that are always going to be there

236.191 - 264.824 Lex Fridman

become dangerous and impactful. So protect your models. Visit hiddenlayer.com slash Lex to learn more about how Hidden Layer can accelerate your AI adoption in a secure way. This episode is also brought to you by Element. A thing I drink throughout the day. I'm drinking now. When I'm on a podcast, you'll sometimes see me with a mug and clear liquid in there that looks like water.

265.264 - 292.424 Lex Fridman

In fact, it is not simply water. It is water mixed with Element. Watermelon salt. Cold. What I do is I take one of them Powerade 28 fluid ounce bottles, fill it up with water, one packet of watermelon salt, shake it up, put it in the fridge, that's it. I reuse the bottles and drink from a mug. Or sometimes from the bottle.

293.584 - 322.917 Lex Fridman

Either way, delicious, good for you, especially if you're doing fasting, especially if you're doing low-carb kinds of diets, which I do. You can get a sample pack for free with any purchase. Try it at drinkLMNT.com/lex. This episode is brought to you by Shopify, as I take a drink of Element. It is a platform designed for anyone, even me, to sell stuff anywhere on a great looking store.

322.937 - 346.869 Lex Fridman

I use a basic one, like a really minimalist one. You can check it out if you go to lexfridman.com slash store. There's a few shirts on there. If that's your thing, it was so easy to set up. I imagine there's like a million features they have that can make it look better and all kinds of extra stuff you can do with the store, but I use the basic thing, and the basic thing is pretty damn good.

347.609 - 374.668 Lex Fridman

I like basic. I like minimalism. And they integrate with a lot of third-party apps, including what I use, which is on-demand printing. So, like you buy the shirt on Shopify, but it gets printed and shipped by another company that I always keep forgetting, but I think it's called Printful. Or Printify. I think it's Printful. I'm not sure. It doesn't matter. I think there's several integrations.

375.068 - 401.741 Lex Fridman

You can check it out yourself. For me, it works. I'm using the most popular one. Printful, I think it's called. Anyway. I look forward to your letters correcting me on my pronunciation. Shopify is great. I'm a big fan of the good side of the machinery of capitalism. Selling stuff on the internet, connecting people to the thing that they want, or rather the thing that would make their life better.

403.018 - 427.225 Lex Fridman

both in advertisement and e-commerce, shopping in general, I'm a big believer when that's done well, your life legitimately in the long term becomes better. And so whatever system can connect one human to the thing that makes their life better is great. And I believe that Shopify is sort of a platform that enables that kind of system.

428.183 - 456.424 Lex Fridman

You can sign up for a $1 per month trial period at shopify.com slash Lex. That's all lowercase. Go to shopify.com slash Lex to take your business to the next level today. This episode is also brought to you by AG1, an all-in-one daily drink to support better health and peak performance. It is delicious. It is nutritious. And I ran out of words that rhyme with those two.

457.645 - 486.477 Lex Fridman

Actually, let me use a large language model to figure out what rhymes with delicious. Words that rhyme with delicious include ambitious, auspicious, capricious, fictitious, suspicious. So there you have it. Anyway, I drink it twice a day. Also put it in the fridge. And sometimes in the freezer, like it gets a little bit frozen. Just like a little bit, just a little bit frozen.

486.497 - 510.579 Lex Fridman

You got that like slushy consistency. I'll do that too sometimes. And it's freaking delicious. It's delicious no matter what. It's delicious warm. It's delicious cold. It's delicious slightly frozen. All of it is just incredible. And of course it covers like the basic multivitamin stuff, the foundation of what I think of as a good diet. So it's just a great multivitamin.

510.659 - 534.066 Lex Fridman

That's the way I think about it. So all the crazy stuff I do, the physical challenges, the mental challenges, all of that, at least I got AG1. They'll give you one month supply of fish oil when you sign up at drinkag1.com slash Lex. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Yann LeCun.

550.831 - 576.664 Lex Fridman

You've had some strong statements, technical statements about the future of artificial intelligence recently, throughout your career actually, but recently as well. You've said that autoregressive LLMs are not the way we're going to make progress towards superhuman intelligence. These are the large language models like GPT-4, like Llama 2 and 3 soon and so on.

577.145 - 579.926 Lex Fridman

How do they work and why are they not going to take us all the way?

580.649 - 611.632 Yann LeCun

for a number of reasons. The first is that there is a number of characteristics of intelligent behavior. For example, the capacity to understand the world, understand the physical world, the ability to remember and retrieve things, persistent memory, the ability to reason, and the ability to plan. Those are four essential characteristics of intelligent systems or entities, humans, animals.

613.436 - 637.314 Yann LeCun

LLMs can do none of those, or they can only do them in a very primitive way. They don't really understand the physical world. They don't really have persistent memory. They can't really reason, and they certainly can't plan. If you expect the system to become intelligent just without having the possibility of doing those things, you're making a mistake.

638.68 - 666.085 Yann LeCun

That is not to say that autoregressive LLMs are not useful. They're certainly useful. That's not to say that they're not interesting, or that we can't build a whole ecosystem of applications around them. Of course we can. But as a path towards human-level intelligence, they're missing essential components. And then there is another tidbit or fact that I think is very interesting.

666.905 - 689.8 Yann LeCun

Those LLMs are trained on enormous amounts of text, basically the entirety of all publicly available texts on the internet, right? That's typically on the order of 10 to the 13 tokens. Each token is typically two bytes. So that's 2 times 10 to the 13 bytes of training data. It would take you or me 170,000 years to just read through this at eight hours a day.

691.542 - 723.943 Yann LeCun

So it seems like an enormous amount of knowledge that those systems can accumulate. But then you realize it's really not that much data. If you talk to developmental psychologists and they tell you a four-year-old has been awake for 16,000 hours in his or her life, and the amount of information that has reached the visual cortex of that child in four years... is about 10 to the 15 bytes.

724.924 - 750.085 Yann LeCun

And you can compute this by estimating that the optic nerve carries about 20 megabytes per second, roughly. And so 10 to the 15 bytes for a four-year-old versus 2 times 10 to the 13 bytes for 170,000 years' worth of reading. What that tells you is that through sensory input, we see a lot more information than we do through language.

751.306 - 768.736 Yann LeCun

And that despite our intuition, most of what we learn and most of our knowledge is through our observation and interaction with the real world, not through language. Everything that we learn in the first few years of life and certainly everything that animals learn has nothing to do with language.
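
A quick back-of-the-envelope check of the magnitudes quoted above; the reading speed is an assumed ballpark figure, while the token count, bytes per token, waking hours, and optic-nerve bandwidth are the values mentioned in the conversation:

```python
# Rough sanity check of the text-vs-vision data comparison; not exact figures.

text_tokens = 1e13                 # ~10^13 tokens of public text
bytes_per_token = 2
text_bytes = text_tokens * bytes_per_token           # ~2 x 10^13 bytes

tokens_per_minute = 330            # assumed adult reading speed
minutes_per_day = 8 * 60           # reading eight hours a day
years_to_read = text_tokens / (tokens_per_minute * minutes_per_day) / 365
print(f"text corpus: {text_bytes:.1e} bytes, ~{years_to_read:,.0f} years to read")

awake_hours = 16_000               # a four-year-old's waking hours
optic_nerve_bytes_per_s = 2e7      # ~20 MB/s, as estimated in the conversation
visual_bytes = awake_hours * 3600 * optic_nerve_bytes_per_s
print(f"visual input by age four: ~{visual_bytes:.1e} bytes "
      f"({visual_bytes / text_bytes:.0f}x the text corpus)")
```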

769.856 - 795.632 Lex Fridman

So it would be good to maybe push against some of the intuition behind what you're saying. So it is true there's several orders of magnitude more data coming into the human mind much faster, and the human mind is able to learn very quickly from that, filter the data very quickly. Somebody might argue your comparison between sensory data versus language, that language is already very compressed.

796.093 - 810.94 Lex Fridman

It already contains a lot more information than the bytes it takes to store them, if you compare it to visual data. So there's a lot of wisdom in language, there's words, and the way we stitch them together, it already contains a lot of information. So is it possible that

812.279 - 828.705 Lex Fridman

language alone already has enough wisdom and knowledge in there to be able to, from that language, construct a world model and understanding of the world, an understanding of the physical world that you're saying LLMs lack.

829.542 - 855.105 Yann LeCun

So it's a big debate among philosophers and also cognitive scientists, like whether intelligence needs to be grounded in reality. I'm clearly in the camp that, yes, intelligence cannot appear without some grounding in some reality. It doesn't need to be physical reality, it could be simulated, but the environment is just much richer than what you can express in language.

855.205 - 881.984 Yann LeCun

Language is a very approximate representation of our percepts and our mental models. There's a lot of tasks that we accomplish where we manipulate a mental model of the situation at hand, and that has nothing to do with language. Everything that's physical, mechanical, whatever, when we build something, when we accomplish a task, a model task of grabbing something, etc.,

883.085 - 906.13 Yann LeCun

We plan our action sequences, and we do this by essentially imagining the result of the outcome of a sequence of actions that we might imagine. And that requires mental models that don't have much to do with language. And that's, I would argue, most of our knowledge is derived from that interaction with the physical world.

906.61 - 936.864 Yann LeCun

So a lot of my colleagues who are more interested in things like computer vision are really in that camp that AI needs to be embodied, essentially. And then other people coming from the NLP side or maybe some other motivation don't necessarily agree with that. And philosophers are split as well. And the complexity of the world is hard to imagine. It's hard to

940.266 - 962.844 Yann LeCun

represent all the complexities that we take completely for granted in the real world that we don't even imagine require intelligence, right? This is the old Moravec paradox from the pioneer of robotics, Hans Moravec, who said, you know, how is it that with computers it seems to be easy to do high-level complex tasks like playing chess and solving integrals and doing things like that, whereas

964.056 - 986.482 Yann LeCun

The thing we take for granted that we do every day, like, I don't know, learning to drive a car or, you know, grabbing an object. We can't do it with computers. And, you know, we have LLMs that can pass the bar exam. So they must be smart. But then they can't learn to drive in 20 hours like any 17-year-old.

988.223 - 1012.692 Yann LeCun

They can't learn to clear out the dinner table and fill up the dishwasher like any 10-year-old can learn in one shot. Why is that? Like, you know, what are we missing? What type of learning or reasoning architecture or whatever are we missing that basically prevent us from, you know, having level five self-driving cars and domestic robots?

1013.686 - 1029.031 Lex Fridman

Can a large language model construct a world model that does know how to drive and does know how to fill a dishwasher, but just doesn't know how to deal with visual data at this time? So it can operate in a space of concepts.

1030.014 - 1061.059 Yann LeCun

So yeah, that's what a lot of people are working on. So the short answer is no. And the more complex answer is you can use all kinds of tricks to get an LLM to basically digest visual representations of images, or video, or audio for that matter. And a classical way of doing this is you train a vision system in some way.

1062.2 - 1089.074 Yann LeCun

And we have a number of ways to train vision systems, either supervised, semi-supervised, self-supervised, all kinds of different ways. That will turn any image into a high-level representation, basically a list of tokens that are really similar to the kind of tokens that a typical LLM takes as an input. And then you just feed that to the LLM in addition to the text.

1089.995 - 1105.366 Yann LeCun

And you just expect the LLM to kind of, during training, to kind of be able to use those representations to help make decisions. I mean, there's been work along those lines for quite a long time. And now you see those systems, right?

1105.427 - 1122.659 Yann LeCun

I mean, there are LLMs that have some vision extension, but they're basically hacks in the sense that those things are not like trained end-to-end to handle, to really understand the world. They're not trained with video, for example. They don't really understand intuitive physics, at least not at the moment.
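
A minimal sketch of the "vision extension" idea described here: a separately trained vision encoder turns the image into a short sequence of token-like vectors, which are simply concatenated with the text token embeddings before being fed to the language model. Both encoders and all shapes below are placeholders, not any particular system:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

def vision_encoder(image):
    """Stand-in for a pretrained vision encoder: image -> (num_patches, d_model)."""
    num_patches = 16
    return rng.normal(size=(num_patches, d_model))

def embed_text(token_ids):
    """Stand-in for the LLM's token-embedding lookup."""
    return rng.normal(size=(len(token_ids), d_model))

image = np.zeros((224, 224, 3))          # placeholder image
text_ids = [17, 42, 7, 99]               # placeholder token ids

visual_tokens = vision_encoder(image)
text_tokens = embed_text(text_ids)
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)                   # (16 + 4, 64): image tokens prepended to text
```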

1123.922 - 1135.07 Lex Fridman

So you don't think there's something special to you about intuitive physics, about sort of common sense reasoning about the physical space, about physical reality. That to you is a giant leap that LLMs are just not able to do.

1135.631 - 1158.677 Yann LeCun

We're not going to be able to do this with the type of LLMs that we are working with today. And there's a number of reasons for this. But the main reason is the way LLMs are trained is that you take a piece of text, you remove some of the words in that text, you mask them, you replace them by blank markers, and you train a gigantic neural net to predict the words that are missing.

1160.338 - 1180.321 Yann LeCun

And if you build this neural net in a particular way so that it can only look at words that are to the left of the one it's trying to predict, then what you have is a system that basically is trying to predict the next word in a text, right? So then you can feed it a text, a prompt, and you can ask it to predict the next word. It can never predict the next word exactly.

1180.982 - 1201.964 Yann LeCun

And so what it's going to do is produce a probability distribution over all the possible words in your dictionary. In fact, it doesn't predict words, it predicts tokens that are kind of subword units. And so it's easy to handle the uncertainty in the prediction there, because there is only a finite number of possible words in the dictionary, and you can just compute a distribution over them.

1203.806 - 1224.38 Yann LeCun

Then what the system does is that it picks a word from that distribution. Of course, there's a higher chance of picking words that have a higher probability within that distribution. So you sample from the distribution to actually produce a word. And then you shift that word into the input. And so that allows the system now to predict the second word, right?
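
A toy sketch of the sampling loop just described: the model outputs a probability distribution over a finite vocabulary, one token is sampled from it, shifted into the input, and the process repeats. The "model" here is a made-up bigram table, purely for illustration, not a real LLM:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "."]
V = len(vocab)
bigram_logits = rng.normal(size=(V, V))       # fake next-token scores (stand-in model)

def next_token_distribution(context):
    """Distribution over the next token given the last one (toy bigram 'model')."""
    logits = bigram_logits[context[-1]]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

tokens = [0]                                  # start from "the"
for _ in range(8):
    p = next_token_distribution(tokens)       # the model predicts a distribution...
    tokens.append(int(rng.choice(V, p=p)))    # ...sample a token and shift it into the input

print(" ".join(vocab[t] for t in tokens))
```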

1225.06 - 1250.577 Yann LeCun

And once you do this, you shift it into the input, et cetera. That's called autoregressive prediction, which is why those LLMs should be called autoregressive LLMs. But we just call them LLMs. And there is a difference between this kind of process and a process by which, before producing a word, when you talk, when you and I talk, you and I are bilingual.

1251.318 - 1271.594 Yann LeCun

We think about what we're going to say, and it's relatively independent of the language in which we're going to say it. When we talk about, I don't know, let's say a mathematical concept or something, the kind of thinking that we're doing and the answer that we're planning to produce is not linked to whether we're going to say it in French or Russian or English.

1272.234 - 1282.72 Lex Fridman

Chomsky just rolled his eyes, but I understand. So you're saying that there's a bigger abstraction that goes before language and maps onto language.

1283.08 - 1286.562 Yann LeCun

Right. It's certainly true for a lot of thinking that we do.

1286.762 - 1292.665 Lex Fridman

Is that obvious that we don't? Like you're saying your thinking is same in French as it is in English.

1293.126 - 1293.666 Yann LeCun

Yeah, pretty much.

1294.807 - 1300.531 Lex Fridman

pretty much, or is this like, how flexible are you? Like if there's a probability distribution.

1301.892 - 1308.997 Yann LeCun

Well, it depends what kind of thinking, right? If it's just, if it's like producing puns, I'm much better in French than English at that.

1309.697 - 1323.887 Lex Fridman

No, but is there an abstract representation of puns? Like is your humor an abstract, like when you tweet and your tweets are sometimes a little bit spicy, is there an abstract representation in your brain of a tweet before it maps onto English?

1324.608 - 1331.111 Yann LeCun

There is an abstract representation of imagining the reaction of a reader to that text.

1331.531 - 1334.613 Lex Fridman

Or you start with laughter and then figure out how to make that happen.

1334.833 - 1355.684 Yann LeCun

Or figure out a reaction you want to cause and then figure out how to say it so that it causes that reaction. But that's really close to language. But think about a mathematical concept or imagining something you want to build out of wood or something like this, right? The kind of thinking you're doing has absolutely nothing to do with language, really.

1356.304 - 1373.955 Yann LeCun

It's not like you have necessarily an internal monologue in any particular language. You're imagining mental models of the thing, right? If I ask you to imagine what this water bottle will look like if I rotate it 90 degrees, that has nothing to do with language.

1375.345 - 1407.694 Yann LeCun

And so clearly there is a more abstract level of representation in which we do most of our thinking and we plan what we're gonna say if the output is you know, uttered words as opposed to an output being, you know, muscle actions, right? We plan our answer before we produce it. And LLMs don't do that. They just produce one word after the other instinctively, if you want.

1407.714 - 1428.201 Yann LeCun

It's like, it's a bit like the, you know, subconscious actions where you don't, Like you're distracted, you're doing something, you're completely concentrated, and someone comes to you and asks you a question, and you kind of answer the question. You don't have time to think about the answer, but the answer is easy, so you don't need to pay attention. You sort of respond automatically.

1428.921 - 1444.895 Yann LeCun

That's kind of what an LLM does, right? It doesn't think about its answer, really. It retrieves it because it's accumulated a lot of knowledge, so it can retrieve some things, but it's going to just spit out one token after the other without planning the answer.

1445.915 - 1471.238 Lex Fridman

But you're making it sound like just one token after the other, one-token-at-a-time generation, is bound to be simplistic. But if the world model is sufficiently sophisticated, that one token at a time, the most likely thing it generates as a sequence of tokens is going to be a deeply profound thing.

1471.258 - 1477.442 Yann LeCun

Okay, but then that assumes that those systems actually possess an internal world model.

1477.723 - 1491.054 Lex Fridman

So it really goes to the, I think the fundamental question is can you build a really, complete world model, not complete, but one that has a deep understanding of the world.

1491.414 - 1513.585 Yann LeCun

Yeah. So can you build this, first of all, by prediction? Right. And the answer is probably yes. Can you build it by predicting words? And the answer is most probably no, because language is very poor: it's weak, or low bandwidth, if you want. There's just not enough information there.

1514.285 - 1543.114 Yann LeCun

So building world models means observing the world and understanding why the world is evolving the way it is. And then the extra component of a world model is something that can predict how the world is going to evolve as a consequence of an action you might take, right? So a world model really is: here is my idea of the state of the world at time t. Here is an action I might take.

1543.854 - 1566.547 Yann LeCun

What is the predicted state of the world at time t plus 1? Now that state of the world does not need to represent everything about the world. It just needs to represent enough that's relevant for this planning of the action, but not necessarily all the details. Now here is the problem. You're not going to be able to do this with generative models.

1567.708 - 1580.374 Yann LeCun

So a generative model that's trained on video, and we've tried to do this for 10 years. You take a video, show a system a piece of video, and then ask it to predict the remainder of the video. Basically, predict what's going to happen.

1580.774 - 1587.257 Lex Fridman

One frame at a time. Do the same thing as sort of the autoregressive LLMs do, but for video. Right.

1587.797 - 1605.404 Yann LeCun

Either one frame at a time or a group of frames at a time. But yeah, a large video model, if you want. The idea of doing this has been floating around for a long time, and at FAIR, some of my colleagues and I have been trying to do this for about 10 years.

1607.784 - 1630.892 Yann LeCun

And you can't really do the same trick as with LLMs, because LLMs, as I said, you can't predict exactly which word is going to follow a sequence of words, but you can predict the distribution of words. Now, if you go to video, what you would have to do is predict the distribution over all possible frames in a video. And we don't really know how to do that properly.

1633.013 - 1659.298 Yann LeCun

We do not know how to represent distributions over high-dimensional continuous spaces in ways that are useful. And there lies the main issue. And the reason we can't do this is because the world is incredibly more complicated and richer in terms of information than text. Text is discrete. Video is high-dimensional and continuous. A lot of details in this.

1660.078 - 1682.04 Yann LeCun

So if I take a video of this room, and the video is a camera panning around, there is no way I can predict everything that's going to be in the room as I pan around. The system cannot predict what's going to be in the room as the camera is panning. Maybe it's going to predict this is a room where there's a light and there is a wall and things like that.

1682.14 - 1693.996 Yann LeCun

It can't predict what the painting on the wall looks like or what the texture of the couch looks like. Certainly not the texture of the carpet. So there's no way it can predict all those details. So the way to handle this

1695.26 - 1722.959 Yann LeCun

is one way possibly to handle this, which we've been working on for a long time, is to have a model that has what's called a latent variable, and the latent variable is fed to a neural net, and it's supposed to represent all the information about the world that you don't perceive yet, and that you need to augment the system for the prediction to do a good job at predicting pixels, including the fine texture of the

1724.059 - 1745.764 Yann LeCun

the carpet and the couch, and the painting on the wall. That has been a complete failure, essentially. And we've tried lots of things. We tried just straight neural nets, we tried GANs, we tried VAEs, all kinds of regularized autoencoders, we tried many things.

1746.764 - 1776.022 Yann LeCun

We also tried those kinds of methods to learn good representations of images or video that could then be used as input to, for example, an image classification system. And that also has basically failed. All the systems that attempt to predict missing parts of an image or video from a corrupted version of it, basically. So I take an image or a video, corrupt it or transform it in some way,

1776.942 - 1799.374 Yann LeCun

And then try to reconstruct the complete video or image from the corrupted version. And then hope that internally the system will develop good representations of images that you can use for object recognition, segmentation, whatever it is. That has been essentially a complete failure. And it works really well for text. That's the principle that is used for LLMs, right?

1799.955 - 1822.988 Lex Fridman

So where is the failure exactly? Is it that it's very difficult to form a good representation of an image, like a good embedding of all the important information in the image? Is it in terms of the consistency of image to image to image to image that forms the video? Like what are the, if we do a highlight reel of all the ways you failed, what's that look like?

1823.484 - 1847.633 Yann LeCun

Okay, so the reason this doesn't work is, first of all, I have to tell you exactly what doesn't work because there is something else that does work. So the thing that does not work is training the system to learn representations of images by training it to reconstruct a good image from a corrupted version of it. That's what doesn't work.

1848.493 - 1871.413 Yann LeCun

And we have a whole slew of techniques for this that are a variant of denoising autoencoders. Something called MAE, developed by some of my colleagues at FAIR, masked autoencoder. So it's basically like the you know, LLMs or things like this, where you train the system by corrupting text, except you corrupt images, you remove patches from it, and you train a gigantic neural net to reconstruct.
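
A minimal sketch of the reconstruction-based setup described here, in the spirit of a masked autoencoder: the image is split into patches, most patches are dropped, and a network is trained to reconstruct the missing ones in pixel space. The encoder-decoder is a placeholder; only the masking and the pixel-space loss are illustrated:

```python
import numpy as np

rng = np.random.default_rng(0)

def to_patches(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

image = rng.random((224, 224, 3)).astype(np.float32)
patches = to_patches(image)                       # (196, 768)
mask = rng.random(len(patches)) < 0.75            # drop ~75% of the patches

def encoder_decoder(visible_patches):
    """Placeholder network: should reconstruct the masked patches from the visible ones."""
    return np.zeros((int(mask.sum()), patches.shape[1]), dtype=np.float32)

reconstruction = encoder_decoder(patches[~mask])
loss = np.mean((reconstruction - patches[mask]) ** 2)   # reconstruction loss in pixel space
print(patches.shape, int(mask.sum()), f"loss={loss:.3f}")
```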

1872.273 - 1898.055 Yann LeCun

The features you get are not good. And you know they're not good because if you now train the same architecture, but you train it supervised, with label data, with textual descriptions of images, et cetera, you do get good representations. And the performance on recognition tasks is much better than if you do this self-supervised pre-training. So the architecture is good. The architecture is good.

1898.095 - 1908.94 Yann LeCun

The architecture of the encoder is good. But the fact that you train the system to reconstruct images does not lead it to produce, to learn good generic features of images.

1909.04 - 1910.561 Lex Fridman

When you train in a self-supervised way.

1911.222 - 1921.367 Yann LeCun

Self-supervised by reconstruction. Yeah, by reconstruction. Okay, so what's the alternative? The alternative is joint embedding. What is joint embedding?

1921.667 - 1923.907 Lex Fridman

What are these architectures that you're so excited about?

1924.067 - 1943.39 Yann LeCun

Okay, so now instead of training a system to encode the image and then training it to reconstruct the full image from a corrupted version, you take the full image, you take the corrupted or transformed version. You run them both through encoders, which in general are identical, but not necessarily.

1944.431 - 1969.703 Yann LeCun

And then you train a predictor on top of those encoders to predict the representation of the full input from the representation of the corrupted one. So joint embedding, because you're taking the full input and the corrupted version, or transformed version, run them both through encoders, so you get a joint embedding.

1970.143 - 1992.434 Yann LeCun

And then you're saying, can I predict the representation of the full one from the representation of the corrupted one? And I call this a JEPA, so that means joint embedding predictive architecture, because there's joint embedding and there is this predictor that predicts the representation of the good guy from the bad guy. And the big question is, how do you train something like this?
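
A minimal sketch of the joint-embedding predictive setup just described: the full input and a corrupted view each go through an encoder, and a predictor maps the corrupted-view representation toward the full-view representation, so the error is measured in representation space rather than pixel space. The encoder and predictor below are stand-ins, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
W_enc = rng.normal(scale=0.02, size=(784, d))   # placeholder encoder weights
W_pred = rng.normal(scale=0.02, size=(d, d))    # placeholder predictor weights

def encoder(x):
    return np.tanh(x @ W_enc)

def predictor(s):
    return s @ W_pred

x_full = rng.random(784)
x_corrupted = x_full.copy()
x_corrupted[rng.random(784) < 0.5] = 0.0        # corrupt the input, e.g. mask half of it

s_full = encoder(x_full)                        # target representation (the "good guy")
s_pred = predictor(encoder(x_corrupted))        # prediction from the corrupted view
loss = np.mean((s_pred - s_full) ** 2)          # error in representation space, not pixels
print(f"representation-space loss: {loss:.4f}")
```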

1993.655 - 2003.7 Yann LeCun

And until five years ago, six years ago, we didn't have particularly good answers for how you train those things, except for one called contrastive learning.

2006.362 - 2030.66 Yann LeCun

And the idea of contrastive learning is you take a pair of images that are, again, an image and a corrupted version or degraded version somehow, or transformed version of the original one, and you train the predicted representation to be the same as that of the original. If you only do this, the system collapses. It basically completely ignores the input and produces representations that are constant.

2032.902 - 2050.287 Yann LeCun

So the contrastive methods avoid this. And those things have been around since the early 90s. I had a paper on this in 1993. You also show pairs of images that you know are different. And then you push away the representations from each other.

2050.367 - 2070.68 Yann LeCun

So you say, not only should representations of things that we know are the same be the same or similar, but representations of things that we know are different should be different. And that prevents the collapse, but it has some limitation. And there's a whole bunch of techniques that have appeared over the last six, seven years that can revive this type of method.
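
A minimal sketch of the contrastive idea: representations of two views of the same image are pulled together, while representations of views known to come from different images are pushed apart, up to a margin. The encoder is a placeholder; only the loss structure is shown:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    """Placeholder encoder that just normalizes its input."""
    return x / (np.linalg.norm(x) + 1e-8)

anchor   = encode(rng.random(32))   # an image
positive = encode(rng.random(32))   # a transformed view of the same image
negative = encode(rng.random(32))   # a view of a different image
margin = 1.0

d_pos = np.sum((anchor - positive) ** 2)           # should be small
d_neg = np.sum((anchor - negative) ** 2)           # should be large
loss = d_pos + max(0.0, margin - d_neg)            # attract positives, repel negatives
print(f"contrastive loss: {loss:.3f}")
```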

2072.02 - 2093.998 Yann LeCun

Some of them from FAIR, some of them from Google and other places. But there are limitations to those contrasting methods. What has changed in the last... you know, three, four years, is now we have methods that are non-contrastive. So they don't require those negative contrastive samples of images that we know are different.

2094.439 - 2107.807 Yann LeCun

You train them only with images that are, you know, different versions or different views of the same thing. And you rely on some other tweaks to prevent the system from collapsing. And we have half a dozen different methods for this now.

2108.863 - 2129.63 Lex Fridman

So what is the fundamental difference between joint embedding architectures and LLMs? So can JEPA take us to AGI? Though we should say that you don't like the term AGI, and we'll probably argue. I think every single time I've talked to you, we've argued about the G in AGI.

2129.67 - 2129.83 Unknown

Yes.

2130.151 - 2144.294 Lex Fridman

I get it. I get it. I get it. Well, we'll probably continue to argue about it. It's great. You like AMI, because you like French, and ami is, I guess, friend in French.

2145.114 - 2148.295 Lex Fridman

And AMI stands for Advanced Machine Intelligence.

2148.715 - 2148.915 Unknown

Right.

2150.455 - 2155.296 Lex Fridman

But either way, can JEPA take us to that, towards that Advanced Machine Intelligence? Yes.

2155.467 - 2181.378 Yann LeCun

Well, so it's a first step. So first of all, what's the difference with generative architectures like LLMs? So LLMs or vision systems that are trained by reconstruction generate the inputs. They generate the original input that is non-corrupted, non-transformed. So you have to predict all the pixels.

2182.799 - 2201.501 Yann LeCun

And there is a huge amount of resources spent in the system to actually predict all those pixels, all the details. In a JEPA, you're not trying to predict all the pixels. You're only trying to predict an abstract representation of the inputs, right? And that's much easier in many ways.

2202.361 - 2217.48 Yann LeCun

So what the JEPA system, when it's being trained, is trying to do is extract as much information as possible from the input, but yet only extract information that is relatively easily predictable. Okay. So there's a lot of things in the world that we cannot predict.

2217.5 - 2239.758 Yann LeCun

For example, if you have a self-driving car driving down the street or road, there may be trees around the road and it could be a windy day. So the leaves on the tree are kind of moving in kind of semi-chaotic random ways that you can't predict and you don't care. You don't want to predict. So what you want is your encoder to basically eliminate all those details.

2239.778 - 2251.324 Yann LeCun

It will tell you there's moving leaves, but it's not going to keep the details of exactly what's going on. And so when you do the prediction in representation space, you're not going to have to predict every single pixel of every leaf.

2252.697 - 2277.724 Yann LeCun

And that not only is a lot simpler, but also it allows the system to essentially learn an abstract representation of the world where what can be modeled and predicted is preserved, and the rest is viewed as noise and eliminated by the encoder. So it kind of lifts the level of abstraction of the representation. If you think about this, this is something we do absolutely all the time.

2278.304 - 2289.549 Yann LeCun

Whenever we describe a phenomenon, we describe it at a particular level of abstraction. We don't always describe every natural phenomenon in terms of quantum field theory. That would be impossible.

2290.409 - 2317.266 Yann LeCun

We have multiple levels of abstraction to describe what happens in the world, starting from quantum field theory to atomic theory and molecules and chemistry, materials, and all the way up to concrete objects in the real world and things like that. We can't just only model everything at the lowest level. That's what the idea of JEPA is really about.

2317.426 - 2339.167 Yann LeCun

Learn abstract representation in a self-supervised manner. You can do it hierarchically as well. That, I think, is an essential component of an intelligent system. In language, we can get away without doing this because language is already, to some level, abstract, and already has eliminated a lot of information that is not predictable.

2340.707 - 2347.951 Yann LeCun

So we can get away without doing the joint embedding, without lifting the abstraction level, and by directly predicting words.

2349.232 - 2372.79 Lex Fridman

So joint embedding, it's still generative, but it's generative in this abstract representation space. Yeah. And you're saying language, we were lazy with language because we already got the abstract representation for free. And now we have to zoom out, actually think about generally intelligent systems. We have to deal with a full mess of physical reality, of reality.

2373.01 - 2390.014 Lex Fridman

And you do have to do this step of jumping from, uh, the full, rich, detailed reality to an abstract representation of that reality based on which you can then reason and all that kind of stuff.

2390.194 - 2409.82 Yann LeCun

Right. And the thing is, those self-supervised algorithms that learn by prediction, even in representation space, they learn more concepts if the input data you feed them is more redundant. The more redundancy there is in the data, the more they're able to capture some internal structure of it.

2410.66 - 2432.203 Yann LeCun

And so there, there is way more redundancy and structure in perceptual inputs, sensory input, like vision, than there is in text, which is not nearly as redundant. This is back to the question you were asking a few minutes ago. Language might represent more information really because it's already compressed. You're right about that, but that means it's also less redundant.

2433.043 - 2435.946 Yann LeCun

And so self-supervised learning will not work as well.

2436.547 - 2460.847 Lex Fridman

Is it possible to join the self-supervised training on visual data and self-supervised training on language data? There is a huge amount of knowledge, even though you talk down about those 10 to the 13 tokens. Those 10 to the 13 tokens represent the entirety, a large fraction of what us humans have figured out.

2462.068 - 2474.398 Lex Fridman

Both the shit talk on Reddit and the contents of all the books and the articles and the full spectrum of human intellectual creation. So is it possible to join those two together?

2475.026 - 2501.06 Yann LeCun

Well, eventually, yes. But I think if we do this too early, we run the risk of being tempted to cheat. And in fact, that's what people are doing at the moment with vision language model. We're basically cheating. We're using language as a crutch to help the deficiencies of our vision systems to kind of learn good representations from images and video. And the problem with this is that

2502.14 - 2523.989 Yann LeCun

We might improve our vision language system a bit, I mean, our language models by feeding them images, but we're not going to get to the level of even the intelligence or level of understanding of the world of a cat or a dog, which doesn't have language. They don't have language, and they understand the world much better than any LLM.

2524.87 - 2545.375 Yann LeCun

They can plan really complex actions and sort of imagine the result of a bunch of actions. How do we get machines to learn that before we combine that with language? Obviously, if we combine this with language, this is going to be a winner. But before that, we have to focus on how do we get systems to learn how the world works.

2546.089 - 2562.254 Lex Fridman

So this kind of joint embedding, predictive architecture, for you, that's going to be able to learn something like common sense, something like what a cat uses to predict how to mess with its owner most optimally by knocking over a thing.

2562.714 - 2592.514 Yann LeCun

That's the hope. In fact, the techniques we're using are non-contrastive. So not only is the architecture non-generative, the learning procedures we're using are non-contrastive. We have two sets of techniques. One set is based on distillation, and there's a number of methods that use this principle. One by DeepMind called BYOL, a couple by FAIR, one called VICReg, and another one called I-JEPA.

2592.934 - 2613.357 Yann LeCun

And VICReg, I should say, is not a distillation method, actually, but I-JEPA and BYOL certainly are. And there's another one also called DINO, also produced at FAIR. And the idea of those things is that you take the full input, let's say an image, you run it through an encoder, it produces a representation.

2614.097 - 2637.192 Yann LeCun

And then you corrupt that input or transform it, running through essentially what amounts to the same encoder, with some minor differences. And then train a predictor, sometimes the predictor is very simple, sometimes it doesn't exist, but train a predictor to predict a representation of the first uncorrupted input from the corrupted input. But you only train the second branch.

2638.233 - 2659.122 Yann LeCun

You only train the part of the network that is fed with the corrupted input. The other network you don't train, but since they share the same weight, when you modify the first one, it also modifies the second one. And with various tricks, you can prevent the system from collapsing, with the collapse of the type I was explaining before, where the system basically ignores the input.
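
A toy sketch of the distillation-style training just described: both branches share the same encoder weights, only the branch fed with the corrupted input receives gradients, and the full-input branch is treated as a fixed target at each step. Published methods such as BYOL, DINO, and I-JEPA add further anti-collapse tricks (for example moving-average target weights) that are not shown here; everything below is a stand-in, not any of those implementations:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 16
W = rng.normal(scale=0.1, size=(d_in, d_out))       # shared encoder weights

def encode(x, W):
    return np.tanh(x @ W)

for step in range(200):
    x_full = rng.random(d_in)
    x_corrupt = x_full * (rng.random(d_in) > 0.5)    # corrupted view of the same input

    target = encode(x_full, W)        # full-view branch: used as a target, no gradient
    pred = encode(x_corrupt, W)       # corrupted-view branch: this one gets trained
    err = pred - target

    # gradient of 0.5 * ||pred - target||^2 w.r.t. W, through the corrupted branch only
    W -= 0.05 * np.outer(x_corrupt, err * (1 - pred ** 2))

print("final representation error:", float(np.mean(err ** 2)))
```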

2661.022 - 2671.344 Yann LeCun

So that works very well. The two techniques we've developed at FAIR, DINO and I-JEPA, work really well for that.

2672.085 - 2674.205 Lex Fridman

So what kind of data are we talking about here?

2674.634 - 2692.548 Yann LeCun

So there's several scenarios. One scenario is you take an image, you corrupt it by changing the cropping, for example, changing the size a little bit, maybe changing the orientation, blurring it, changing the colors. doing all kinds of horrible things to it.

2692.809 - 2694.009 Lex Fridman

But basic horrible things.

2694.41 - 2713.705 Yann LeCun

Basic horrible things that sort of degrade the quality a little bit and change the framing, you know, crop the image. And in some cases, in the case of I-JEPA, you don't need to do any of this. You just mask some parts of it, right? You just basically remove some regions, like a big block, essentially. Yeah.

2714.686 - 2733.489 Yann LeCun

And then run through the encoders and train the entire system, encoder and predictor, to predict the representation of the good one from the representation of the corrupted one. So that's I-JEPA. It doesn't need to know that it's an image, for example, because the only thing it needs to know is how to do this masking.

2735.13 - 2754.716 Yann LeCun

Whereas with DINO, you need to know it's an image because you need to do things like geometry transformation and blurring and things like that that are really image-specific. A more recent version of this that we have is called V-JEPA. So it's basically the same idea as I-JEPA, except it's applied to video. So now you take a whole video and you mask a whole chunk of it.

2755.497 - 2762.764 Yann LeCun

And what we mask is actually kind of a temporal tube. So like a whole segment of each frame in the video over the entire video. Mm-hmm.

2763.084 - 2766.828 Lex Fridman

And that tube was statically positioned throughout the frames?

2766.888 - 2788.147 Yann LeCun

Throughout the tube, yeah. Typically it's 16 frames or something, and we masked the same region over the entire 16 frames. It's a different one for every video, obviously. And then again, train that system so as to predict the representation of the full video from the partially masked video. That works really well.
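
A minimal sketch of the temporal-tube masking just described: the same spatial block is removed from every frame of a 16-frame clip, and the system would then be trained to predict the representation of the full clip from the masked one. Only the masking step is shown, with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W, C = 16, 224, 224, 3
video = rng.random((T, H, W, C)).astype(np.float32)

# pick one rectangular region and mask it in every frame of this clip
y0, x0, h, w = 64, 96, 96, 96
mask = np.zeros((H, W), dtype=bool)
mask[y0:y0 + h, x0:x0 + w] = True

masked_video = video.copy()
masked_video[:, mask, :] = 0.0      # the same "tube" removed across all 16 frames

print(video.shape, f"masked fraction per frame: {mask.mean():.2%}")
```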

2788.207 - 2807.553 Yann LeCun

It's the first system that we have that learns good representations of video so that when you feed those representations to a supervised classifier head, it can tell you what action is taking place in the video with pretty good accuracy. So that's the first time we get something of that quality.

2809.075 - 2812.616 Lex Fridman

That's a good test that a good representation is formed. That means there's something to this.

2813.096 - 2834.22 Yann LeCun

Yeah. We have also preliminary results that seem to indicate that the representation allows our system to tell whether the video is physically possible or completely impossible because some object disappeared or an object suddenly jumped from one location to another or changed shape or something.

2834.66 - 2860.582 Lex Fridman

So it's able to capture some physics-based constraints about the reality represented in the video? Yeah. About the appearance and the disappearance of objects? Yeah. That's really new. Okay. But can this actually... get us to this kind of world model that understands enough about the world to be able to drive a car?

2862.044 - 2884.764 Yann LeCun

Possibly. This is going to take a while before we get to that point, but there are robotic systems that are based on this idea. And what you need for this is a slightly modified version of this, where imagine that you have a complete video.

2885.444 - 2908.378 Yann LeCun

And what you're doing to this video is that you're either translating it in time towards the future, so you only see the beginning of the video, but you don't see the latter part of it that is in the original one. Or you just mask the second half of the video, for example. And then you train a JEPA system of the type I described to predict the representation of the full video from the shifted one.

2909.039 - 2929.387 Yann LeCun

But you also feed the predictor with an action. For example, the wheel is turned 10 degrees to the right or something. So if it's a dash cam in a car and you know the angle of the wheel, you should be able to predict to some extent what's going to happen to what you see.

2929.407 - 2942.411 Yann LeCun

You're not going to be able to predict all the details of objects that appear in the view, obviously, but at an abstract representation level, you can probably predict what's going to happen. So now what you have is...

2944.489 - 2960.041 Yann LeCun

an internal model that says, here is my idea of state of the world at time t, here is an action I'm taking, here is a prediction of the state of the world at time t plus one, t plus delta t, t plus two seconds, whatever it is. If you have a model of this type, you can use it for planning.

2960.842 - 2991.542 Yann LeCun

So now you can do what LLMs cannot do, which is planning what you're going to do so as to arrive at a particular outcome or satisfy a particular objective. So you can have a number of objectives. I can predict that if I have an object like this and I open my hand, it's going to fall. And if I push it with a particular force on the table, it's going to move.

2991.723 - 3016.988 Yann LeCun

If I push the table itself, it's probably not going to move with the same force. So we have this internal model of the world in our mind, which allows us to plan sequences of actions to arrive at a particular goal. And so now if you have this world model, we can imagine a sequence of actions, predict what the outcome of the sequence of action is going to be,

3018.149 - 3043.281 Yann LeCun

measure to what extent the final state satisfies a particular objective, like moving the bottle to the left of the table, and then plan a sequence of actions that will minimize this objective at runtime. We're not talking about learning, we're talking about inference time. So this is planning, really. And in optimal control, this is a very classical thing. It's called model predictive control.

3043.321 - 3064.426 Yann LeCun

You have a model of the system you want to control that can predict the sequence of states corresponding to a sequence of commands. And you're planning a sequence of commands so that, according to your world model, the end state of the system will satisfy an objective that you fix. This is the way...

3067.528 - 3072.071 Yann LeCun

Rocket trajectories have been planned since computers have been around, so since the early 60s, essentially.
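
A minimal sketch of model-predictive control in the sense described: given a world model s(t+1) = f(s(t), a(t)) and an objective on the final state, candidate action sequences are rolled through the model at inference time and the one whose predicted end state best satisfies the objective is kept. The dynamics and the random-shooting search below are toy stand-ins, not a real planner:

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model(state, action):
    """Toy dynamics: position and velocity, action is an acceleration."""
    pos, vel = state
    vel = vel + 0.1 * action
    pos = pos + 0.1 * vel
    return np.array([pos, vel])

def objective(state, goal=1.0):
    """How far the predicted final position is from the goal we fixed."""
    return (state[0] - goal) ** 2

horizon, n_candidates = 10, 256
best_cost, best_plan = np.inf, None
for _ in range(n_candidates):
    plan = rng.uniform(-1, 1, size=horizon)      # a candidate sequence of commands
    state = np.array([0.0, 0.0])
    for a in plan:                               # roll the world model forward
        state = world_model(state, a)
    cost = objective(state)
    if cost < best_cost:                         # keep the plan that best meets the objective
        best_cost, best_plan = cost, plan

print(f"best final cost: {best_cost:.4f}, first action: {best_plan[0]:+.2f}")
```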

3072.091 - 3081.338 Lex Fridman

So yes, for model predictive control, but you also often talk about hierarchical planning. Can hierarchical planning emerge from this somehow?

3081.789 - 3105.411 Yann LeCun

Well, so no, you will have to build a specific architecture to allow for hierarchical planning. So hierarchical planning is absolutely necessary if you want to plan complex actions. If I want to go from, let's say, from New York to Paris, it's the example I use all the time, and I'm sitting in my office at NYU, my objective that I need to minimize is my distance to Paris at a high level.

3106.311 - 3131.856 Yann LeCun

a very abstract representation of my location, I would have to decompose this into two sub-goals. First one is go to the airport. Second one is catch a plane to Paris. Okay, so my sub-goal is now going to the airport. My objective function is my distance to the airport. How do I go to the airport? Well, I have to go in the street and hail a taxi, which you can do in New York.

3134.097 - 3154.525 Yann LeCun

Okay, now I have another sub-goal. Go down on the street. What that means, going to the elevator, going down the elevator, walk out the street. How do I go to the elevator? I have to... Stand up from my chair, open the door of my office, go to the elevator, push the button. How do I get up from my chair?

3155.225 - 3174.437 Yann LeCun

Like, you know, you can imagine going down, all the way down to basically what amounts to millisecond by millisecond muscle control. Okay. And obviously you're not going to plan your entire trip from New York to Paris in terms of millisecond by millisecond muscle control. First, that would be incredibly expensive.

3175.097 - 3193.963 Yann LeCun

But it will also be completely impossible because you don't know all the conditions of what's going to happen. How long it's going to take to catch a taxi or to go to the airport with traffic. You would have to know exactly the condition of everything to be able to do this planning. And you don't have the information.

3194.284 - 3213.782 Yann LeCun

So you have to do this hierarchical planning so that you can start acting and then sort of replanning as you go. And nobody really knows how to do this in AI. Nobody knows how to train a system to learn the appropriate multiple levels of representation so that hierarchical planning works.
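
A small sketch of the hierarchical decomposition in the New York-to-Paris example, with each goal expanding into sub-goals at a lower level of abstraction. The decomposition is written by hand here precisely because, as noted above, nobody yet knows how to learn these levels:

```python
# Hand-written goal hierarchy for the New York-to-Paris example; in a real system
# every level below the top would itself be planned, down to low-level control.
plan = {
    "go from NYU office to Paris": [
        {"go to the airport": [
            {"get to the street and hail a taxi": [
                "stand up from chair",
                "open office door",
                "walk to elevator, push button",
                "exit building onto the street",
            ]},
            "ride taxi to the airport",
        ]},
        "catch a plane to Paris",
    ]
}

def print_goals(goal, depth=0):
    """Walk the hierarchy and print each goal indented by its level of abstraction."""
    if isinstance(goal, str):
        print("  " * depth + "- " + goal)
    else:
        for name, subgoals in goal.items():
            print("  " * depth + name)
            for sub in subgoals:
                print_goals(sub, depth + 1)

print_goals(plan)
```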

3214.142 - 3241.401 Lex Fridman

Does something like that already emerge? So like, can you use an LLM, a state-of-the-art LLM, to get you from New York to Paris by doing exactly the kind of detailed set of questions that you just did, which is, can you give me a list of 10 steps I need to do to get from New York to Paris? And then for each of those steps, can you give me a list of 10 steps how I make that step happen?

3242.081 - 3253.207 Lex Fridman

And for each of those steps, can you give me a list of 10 steps to make each one of those until you're moving your individual muscles? Maybe not. Whatever you can actually act upon using your mind.

3254.348 - 3269.563 Yann LeCun

Right, so there's a lot of questions that are actually implied by this, right? So the first thing is, LLMs will be able to answer some of those questions down to some level of abstraction. under the condition that they've been trained with similar scenarios in their training set.

3270.123 - 3277.009 Lex Fridman

They would be able to answer all of those questions, but some of them may be hallucinated, meaning non-factual.

3277.089 - 3289.879 Yann LeCun

Yeah, true. I mean, they will probably produce some answer, except they're not going to be able to really kind of produce millisecond by millisecond muscle control of how you stand up from your chair, right? But down to some level of abstraction where you can describe things by words,

3290.6 - 3305.365 Yann LeCun

They might be able to give you a plan, but only under the condition that they've been trained to produce those kind of plans, right? They're not going to be able to plan for situations that they never encountered before. They basically are going to have to regurgitate the template that they've been trained on.

3305.725 - 3324.354 Lex Fridman

But where, like, just for the example of New York to Paris, is it going to start getting into trouble? Like, at which layer of abstraction do you think you'll start? Because, like, I can imagine almost every single part of that, an LLM would be able to answer somewhat accurately, especially when you're talking about New York and Paris, major cities.

3324.374 - 3354.379 Yann LeCun

Certainly, LLM would be able to solve that problem if you fine-tune it for it. I can't say that LLM cannot do this. It can do this if you train it for it. There's no question. Down to a certain level, where things can be formulated in terms of words. But if you want to go down to how you climb down the stairs or just stand up from your chair in terms of words, you can't do it.

3354.599 - 3363.505 Yann LeCun

That's one of the reasons you need experience of the physical world, which is much higher bandwidth than what you can express in words. In human language.

3363.845 - 3386.211 Lex Fridman

So everything we've been talking about on the joint embedding space, is it possible that that's what we need for the interaction with physical reality on the robotics front? And then just the LLMs are the thing that sits on top of it for the bigger reasoning about the fact that I need to book a plane ticket and I need to know how to go to the websites and so on.

3386.511 - 3415.655 Yann LeCun

Sure. And, you know, a lot of plans that people know about that are relatively high level are actually learned. Most people don't invent the, you know, plans. They... We have some ability to do this, of course, obviously, but most plans that people use are plans that they've been trained on. They've seen other people use those plans or they've been told how to do things.

3417.096 - 3433.401 Yann LeCun

You can't invent plans from scratch. If you take a person who's never heard of airplanes and tell them, like, how do you go from New York to Paris, they're probably not going to be able to deconstruct the whole plan unless they've seen examples of that before. So certainly LLMs are going to be able to do this.

3433.481 - 3452.289 Yann LeCun

But then how you link this from the low level of actions, that needs to be done with things like JEPA that basically lifts the abstraction level of the representation without attempting to reconstruct every detail of the situation. That's why we need JEPAs for it.

3453.581 - 3484.749 Lex Fridman

I would love to sort of linger on your skepticism around autoregressive LLMs. So one way I would like to test that skepticism is: everything you say makes a lot of sense. But if I applied everything you said today, and in general, to, I don't know, 10 years ago, maybe a little bit less, no, let's say three years ago, I wouldn't have been able to predict the success of LLMs.

3485.69 - 3508.911 Lex Fridman

So does it make sense to you that autoregressive LLMs are able to be so damn good? Yes. Yes. Can you explain your intuition? Because if I were to take your wisdom and intuition at face value, I would say there's no way autoregressive LLMs, one token at a time, would be able to do the kind of things they're doing.

3509.15 - 3533.68 Yann LeCun

No, there's one thing that autoregressive LLMs, or that LLMs in general, not just the autoregressive one, but including the BERT-style bidirectional ones, are exploiting, and it's self-supervised learning. And I've been a very, very strong advocate of self-supervised learning for many years. So those things are an incredibly impressive demonstration that self-supervised learning actually works.

3534.78 - 3568.355 Yann LeCun

It didn't start with BERT, but BERT was really a good demonstration of this. The idea that you take a piece of text, you corrupt it, and then you train some gigantic neural net to reconstruct the parts that are missing, has produced an enormous amount of benefits. It allowed us to create systems that understand language, systems that can translate hundreds of languages in any direction.
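
A small sketch of the corrupt-and-reconstruct recipe described above, in the spirit of BERT-style masked prediction; the mask token ID, the 15% corruption rate, and the -100 ignore index are assumed conventions chosen for illustration, not code from any particular system:

```python
import random

MASK_ID = 0          # assumed ID of a special [MASK] token
MASK_PROB = 0.15     # fraction of tokens to corrupt (BERT used roughly 15%)

def corrupt(tokens: list[int]) -> tuple[list[int], list[int]]:
    """Corrupt a token sequence for denoising self-supervised training.

    Returns the corrupted input and, for each position, the original
    token to reconstruct (-100 meaning 'no loss at this position',
    the usual ignore-index convention).
    """
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < MASK_PROB:
            corrupted.append(MASK_ID)   # hide the token
            targets.append(tok)         # the net must predict it back
        else:
            corrupted.append(tok)
            targets.append(-100)        # position not scored
    return corrupted, targets

# A "gigantic neural net" is then trained so that
# model(corrupted) reconstructs the masked-out targets.
```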

3569.335 - 3583.547 Yann LeCun

systems that are multilingual, so it's a single system that can be trained to understand hundreds of languages and translate in any direction, and produce summaries, and then answer questions and produce text.

3584.551 - 3607.247 Yann LeCun

And then there's a special case of it, which is the autoregressive trick, where you constrain the system not to elaborate a representation of the text by looking at the entire text, but to predict a word only from the words that come before. And you do this by constraining the architecture of the network. And that's what you build an autoregressive LLM from.
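
The architectural constraint mentioned here is typically implemented as a causal attention mask, so that position i can only attend to positions at or before i. A minimal sketch with NumPy, assuming a plain matrix of attention scores:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention(scores: np.ndarray) -> np.ndarray:
    """Apply the causal constraint before the softmax over attention scores."""
    seq_len = scores.shape[-1]
    blocked = np.where(causal_mask(seq_len), scores, -np.inf)
    weights = np.exp(blocked - blocked.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)
```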

3607.967 - 3633.408 Yann LeCun

So there was a surprise many years ago with what's called decoder-only LLMs. So, you know, systems of this type that are just trying to produce words from the previous ones. And the fact that when you scale them up, when you train them on lots of data and make them really big, they tend to really kind of understand more about language. That was kind of a surprise.

3633.488 - 3649.296 Yann LeCun

And that surprise occurred quite a while back, like, you know, with work from Google, Meta, OpenAI, et cetera, going back to the GPT line of work, generative pre-trained transformers.

3649.736 - 3659.26 Lex Fridman

You mean like GPT-2? There's a certain place where you start to realize scaling might actually keep giving us an emergent benefit.

3659.505 - 3670.158 Yann LeCun

Yeah, I mean, there were work from various places, but if you want to kind of place it in the GPT timeline, that would be around GPT-2, yeah.

3671.809 - 3678.114 Lex Fridman

Well, I just, because you said it, you're so charismatic, and you said so many words, but self-supervised learning, yes.

3678.634 - 3703.331 Lex Fridman

But again, the same intuition you're applying to saying that autoregressive LLMs cannot have a deep understanding of the world, if we just apply that same intuition, does it make sense to you that they're able to form enough of a representation of the world to be damn convincing? Essentially, passing the original Turing test with flying colors.

3703.572 - 3718.343 Yann LeCun

Well, we're fooled by their fluency, right? We just assume that if a system is fluent in manipulating language, then it has all the characteristics of human intelligence. But that impression is false. We're really fooled by it.

3719.343 - 3723.967 Lex Fridman

What do you think Alan Turing would say? Without understanding anything, just hanging out with it.

3724.187 - 3733.877 Yann LeCun

Alan Turing would decide that the Turing test is a really bad test. Okay. This is what the AI community has decided many years ago, that the Turing test was a really bad test of intelligence.

3734.818 - 3738.763 Lex Fridman

What would Hans Moravec say about the large language models?

3739.003 - 3745.209 Yann LeCun

Hans Moravec would say the Moravec paradox still applies. Okay, we can pass.

3745.229 - 3746.81 Lex Fridman

You don't think you would be really impressed?

3746.97 - 3763.54 Yann LeCun

No, of course, everybody would be impressed. But, you know, it's not a question of being impressed or not. It's the question of knowing the limits of what those systems can do. Again, they are impressive. They can do a lot of useful things. There's a whole industry that is being built around them. They're going to make progress.

3764.821 - 3784.788 Yann LeCun

But there is a lot of things they cannot do and we have to realize what they cannot do and then figure out how we get there. I'm seeing this from basically 10 years of research on the idea of self-supervised learning.

3785.028 - 3804.597 Yann LeCun

Actually, that's going back more than 10 years, but the idea of self-supervised learning, so basically capturing the internal structure of a set of inputs without training the system for any particular task, learning representations. You know, the conference I co-founded 14 years ago is called International Conference on Learning Representations.

3804.637 - 3817.951 Yann LeCun

That's the entire issue that deep learning is dealing with, right? And it's been my obsession for almost 40 years now. So learning representation is really the thing. For the longest time, you could only do this with supervised learning.

3818.531 - 3843.635 Yann LeCun

And then we started working on what we used to call unsupervised learning and sort of revived the idea of unsupervised learning in the early 2000s with Yoshua Bengio and Jeff Hinton. Then discovered that supervised learning actually works pretty well if you can collect enough data. And so the whole idea of unsupervised self-supervised learning kind of took a backseat for a bit.

3843.715 - 3864.672 Yann LeCun

And then I kind of tried to revive it in a big way, starting in 2014, basically when we started FAIR. and really pushing for finding new methods to do self-supervised learning, both for text and for images and for video and audio. And some of that work has been incredibly successful.

3865.853 - 3886.523 Yann LeCun

I mean, the reason why we have multilingual translation systems, you know, things to do content moderation on Meta, for example, on Facebook, that are multilingual, that understand whether a piece of text is hate speech or not or something, is due to that progress using self-supervised learning for NLP, combining this with transformer architectures and so on.

3886.543 - 3902.733 Yann LeCun

But that's the big success of self-supervised learning. We had similar success in speech recognition, a system called wav2vec, which is also a joint embedding architecture, by the way, trained with contrastive learning. And that system also can produce speech recognition systems that are multilingual,

3903.413 - 3919.062 Yann LeCun

with mostly unlabeled data and only need a few minutes of labeled data to actually do speech recognition. That's amazing. We have systems now based on those combination of ideas that can do real-time translation of hundreds of languages into each other.
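
A contrastive joint-embedding loss of the kind described here (an InfoNCE-style objective) can be sketched as follows; the shapes, the temperature value, and the use of in-batch negatives are assumptions for illustration, not the actual wav2vec training code:

```python
import numpy as np

def info_nce(context: np.ndarray, targets: np.ndarray, temperature: float = 0.1) -> float:
    """Contrastive loss for joint embeddings.

    context:  (N, D) context representations
    targets:  (N, D) representations of the matching positives;
              for row i, every other row j != i serves as a negative.
    """
    # cosine similarities between every context and every target
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = (c @ t.T) / temperature            # (N, N)

    # cross-entropy where the correct "class" for row i is column i
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```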

3920.062 - 3928.268 Lex Fridman

Speech-to-speech. Speech-to-speech, even including, which is fascinating, languages that don't have written forms. That's right. They're spoken only.

3928.348 - 3945.179 Yann LeCun

That's right. We don't go through text. It goes directly from speech-to-speech using an internal representation of kind of speech units that are discrete. But it's called textless NLP. We used to call it this way. But yeah, so that, I mean, incredible success there. And then, you know, for 10 years, we tried to apply this idea

3945.959 - 3968.532 Yann LeCun

to learning representations of images by training a system to predict videos, learning intuitive physics by training a system to predict what's going to happen in a video, and tried and tried and failed and failed with generative models, with models that predict pixels. We could not get them to learn good representations of images. We could not get them to learn good representations of videos.

3969.232 - 3985.753 Yann LeCun

And we tried many times. We published lots of papers on it. They kind of sort of worked, but not really great. It started working when we abandoned this idea of predicting every pixel and basically just did joint embedding, predicting in representation space. That works.
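
The shift from reconstructing pixels to predicting in representation space can be written down compactly. This is a rough sketch under assumed interfaces; encoder, target_encoder, and predictor are placeholders, not Meta's actual JEPA implementation:

```python
import numpy as np

def jepa_loss(encoder, target_encoder, predictor,
              context: np.ndarray, target: np.ndarray) -> float:
    """Joint-embedding predictive loss: predict the *representation* of the
    target view from the context view, never the raw pixels.

    encoder, target_encoder, predictor are assumed callables returning arrays;
    in practice the target encoder is a frozen or momentum copy of the encoder
    so that the representations cannot collapse trivially.
    """
    s_context = encoder(context)          # representation of the visible part
    s_target = target_encoder(target)     # representation of the hidden part
    s_pred = predictor(s_context)         # prediction made in latent space

    # error is measured between representations, not pixels
    return float(np.mean((s_pred - s_target) ** 2))
```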

3986.973 - 4003.043 Yann LeCun

So there's ample evidence that we're not going to be able to learn good representations of the real world using generative models. So I'm telling people, everybody's talking about generative AI. If you're really interested in human-level AI, abandon the idea of generative AI.

4004.304 - 4019.877 Lex Fridman

Okay, but you really think it's possible to get far with a joint embedding representation? So there's common sense reasoning, and then there's high-level reasoning. I feel like those are two...

4021.394 - 4041.685 Lex Fridman

The kind of reasoning that LLMs are able to do, okay, let me not use the word reasoning, but the kind of stuff that LLMs are able to do seems fundamentally different than the common sense reasoning we use to navigate the world. It seems like we're going to need both. You're saying we're not? Would you be able to get there with the joint embedding, with a JEPA-type of approach looking at video?

4042.005 - 4068.242 Lex Fridman

Would you be able to learn, let's see, well, how to get from New York to Paris? Or how to understand the state of politics in the world today? Right? These are things about which various humans generate a lot of language and opinions in the space of language, but don't visually represent in any clearly compressible way.

4068.888 - 4089.24 Yann LeCun

Right. Well, there's a lot of situations that might be difficult for a purely language-based system to know. Like, okay, you can probably learn from reading texts, the entirety of the publicly available texts in the world, that I cannot get from New York to Paris by snapping my fingers. That's not going to work, right? Yes.

4091.201 - 4117.702 Yann LeCun

But there's probably more complex scenarios of this type, which an LLM may never have encountered and may not be able to determine whether it's possible or not. So that link from the low level to the high level. The thing is that the high level that language expresses is based on a common experience of the low level, which LLMs currently do not have.

4119.623 - 4126.608 Yann LeCun

When we talk to each other, we know we have a common experience of the world. A lot of it is similar.

4128.969 - 4158.046 Lex Fridman

And LLMs don't have that. But see, it's present. You and I have a common experience of the world in terms of the physics of how gravity works and stuff like this. And that... common knowledge of the world, I feel like is there in the language. We don't explicitly express it, but if you have a huge amount of text, you're going to get this stuff that's between the lines.

4158.386 - 4184.176 Lex Fridman

In order to form a consistent model of the world, you're going to have to understand how gravity works, even if you don't have an explicit explanation of gravity. So even though, in the case of gravity, there are explicit explanations of gravity in Wikipedia. But the stuff that we think of as common sense reasoning, I feel like to generate language correctly, you're going to have to figure that out.

4184.656 - 4190.318 Lex Fridman

Now you could say, as you have, there's not enough text. Sorry, okay. You don't think so.

4190.638 - 4200.381 Yann LeCun

No, I agree with what you just said, which is that to be able to do high-level common sense, to have high-level common sense, you need to have the low-level common sense to build on top of.

4201.842 - 4202.842 Lex Fridman

But that's not there.

4203.082 - 4219.328 Yann LeCun

And that's not there in LLMs. LLMs are purely trained from text. So then the other statement you made, I would not agree with the fact that implicit in all languages in the world is the underlying reality. There's a lot about underlying reality which is not expressed in language.

4219.668 - 4239.192 Lex Fridman

Is that obvious to you? Yeah, totally. So like all the conversations we had, okay, there's the dark web, meaning whatever, the private conversations like DMs and stuff like this, which is much, much larger probably than what's available, what LLMs are trained on.

4239.772 - 4241.653 Yann LeCun

You don't need to communicate the stuff that is common.

4242.824 - 4258.311 Lex Fridman

but the humor, all of it. No, you do. You don't need to, but it comes through. If I accidentally knock this over, you'll probably make fun of me. In the content of you making fun of me will be an explanation of the fact that

4258.931 - 4274.515 Lex Fridman

cups fall and then, you know, gravity works in this way and then you'll have some very vague information about what kind of things explode when they hit the ground and then maybe you'll make a joke about entropy or something like this and we'll never be able to reconstruct this again.

4274.855 - 4298.525 Lex Fridman

Like, okay, you'll make a little joke like this, and there'll be a trillion other jokes, and from the jokes you can piece together the fact that gravity works and mugs can break and all this kind of stuff. You don't need to see it; it'll be very inefficient. It's easier to just knock the thing over. But I feel like it would be there if you have enough of that data.

4299.565 - 4312.613 Yann LeCun

I just think that most of the information of this type that we have accumulated when we were babies is just not present in text, in any description, essentially.

4312.733 - 4317.156 Lex Fridman

And the sensory data is a much richer source for getting that kind of understanding.

4317.176 - 4342.209 Yann LeCun

I mean, that's the 16,000 hours of wake time of a four-year-old, and 10^15 bytes, you know, going through vision, just vision, right? There is a similar bandwidth of touch, and a little less through audio. And then text, language, doesn't come in until about a year into life. And by the time you are nine years old, you've learned everything

4342.955 - 4365.125 Yann LeCun

about gravity, about inertia, about stability, about the distinction between animate and inanimate objects. You know by 18 months about why people want to do things, and you help them if they can't. I mean, there's a lot of things that you learn mostly by observation, really not even through interaction.

4365.185 - 4378.953 Yann LeCun

In the first few months of life, babies don't really have any influence on the world. They can only observe, right? And you accumulate a gigantic amount of knowledge just from that. So that's what we're missing from current AI systems.
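
The orders of magnitude quoted here are easy to check with a back-of-the-envelope calculation; the roughly 20 MB/s figure for visual input is an assumed round number used only for illustration:

```python
# Back-of-the-envelope check of the numbers quoted above.
# The ~20 MB/s visual data rate is an assumption for illustration.

waking_hours = 16_000                      # ~4 years at ~11 waking hours/day
seconds = waking_hours * 3600              # ~5.8e7 seconds
bytes_per_second = 2e7                     # assumed ~20 MB/s through vision

total_bytes = seconds * bytes_per_second
print(f"{total_bytes:.1e} bytes")          # ~1.2e+15, i.e. on the order of 10^15
```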

4380.359 - 4401.433 Lex Fridman

I think in one of your slides you have this nice plot that is one of the ways you show that LLMs are limited. I wonder if you could talk about hallucinations from your perspective: why hallucinations happen with large language models, and to what degree that is a fundamental flaw of large language models?

4402.273 - 4431.663 Yann LeCun

Right. So because of the autoregressive prediction, every time an LLM produces a token or a word, there is some level of probability for that word to take you out of the set of reasonable answers. And if you assume, which is a very strong assumption, that those errors are independent across the sequence of tokens being produced.

4432.324 - 4441.007 Yann LeCun

What that means is that every time you produce a token, the probability that you stay within the set of correct answers decreases, and it decreases exponentially.

4441.502 - 4450.887 Lex Fridman

So there's a strong, like you said, assumption there that if there's a non-zero probability of making a mistake, which there appears to be, then there's going to be a kind of drift.

4451.547 - 4463.594 Yann LeCun

Yeah. And that drift is exponential. It's like errors accumulate, right? So the probability that an answer would be nonsensical increases exponentially with the number of tokens.
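
Under the independence assumption stated above, if each token has probability e of leaving the set of reasonable answers, the chance that an n-token answer is still correct is (1 - e)^n, which decays exponentially. A small illustration with an arbitrary per-token error rate:

```python
# Probability that an answer is still "in the set of correct answers"
# after n tokens, assuming independent per-token errors (a strong assumption).

per_token_error = 0.01     # illustrative value, not a measured number

for n in (10, 100, 1000):
    p_correct = (1 - per_token_error) ** n
    print(f"n={n:5d}  P(still correct) = {p_correct:.3f}")

# n=   10  P(still correct) = 0.904
# n=  100  P(still correct) = 0.366
# n= 1000  P(still correct) = 0.000
```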

4464.269 - 4481.237 Lex Fridman

Is that obvious to you, by the way? Well, mathematically speaking, maybe, but isn't there a kind of gravitational pull towards the truth? Because on average, hopefully, the truth is well represented in the training set.

4481.883 - 4504.973 Yann LeCun

No, it's basically a struggle against the curse of dimensionality. So the way you can correct for this is that you fine-tune the system by having it produce answers for all kinds of questions that people might come up with. And people are people, so a lot of the questions that they have are very similar to each other, so you can probably cover 80% or whatever.

4506.614 - 4533.766 Yann LeCun

of questions that people will ask by collecting data. And then you fine-tune the system to produce good answers for all of those things. And it's probably going to be able to learn that because it's got a lot of capacity to learn. But then there is... you know, the enormous set of prompts that you have not covered during training. And that set is enormous.

4533.826 - 4551.935 Yann LeCun

Like within the set of all possible prompts, the proportion of prompts that have been used for training is absolutely tiny. It's a tiny, tiny, tiny subset of all possible prompts. And so the system will behave properly on the prompts that it has been either pre-trained or fine-tuned on.

4554.308 - 4582.457 Yann LeCun

But then there is an entire space of things that it cannot possibly have been trained on because the number is gigantic. So whatever training the system has been subject to to produce appropriate answers, you can break it by finding out a prompt that will be outside of the set of prompts it's been trained on, or things that are similar, and then it will just spew complete nonsense.

4583.158 - 4598.953 Lex Fridman

When you say prompt, do you mean that exact prompt, or do you mean a prompt that's in many parts very different? Like, is it that easy to ask a question or to say a thing that hasn't been said before on the internet?

4599.161 - 4621.727 Yann LeCun

I mean, people have come up with things where you put essentially a random sequence of characters in a prompt, and that's enough to kind of throw the system into a mode where it's going to answer something completely different than it would have answered without this. So that's a way to jailbreak the system, basically go outside of its conditioning, right?

4622.107 - 4638.376 Lex Fridman

So that's a very clear demonstration of it, but of course... that goes outside of what it's designed to do. If you actually stitch together reasonably grammatical sentences, is it that easy to break it?

4639.356 - 4657.263 Yann LeCun

Yeah, some people have done things like: you write a sentence in English, or you ask a question in English, and it produces a perfectly fine answer. And then you just substitute a few words by the same words in another language. And all of a sudden, the answer is complete nonsense.

4657.283 - 4666.91 Lex Fridman

Yes, so I guess what I'm saying is like, which fraction of prompts that humans are likely to generate are going to break the system?

4667.21 - 4688.532 Yann LeCun

So the problem is that there is a long tail. Yes. This is an issue that a lot of people have realized in social networks and stuff like that, which is there's a very, very long tail of things that people will ask. And you can fine-tune the system for the 80% or whatever of the things that most people will ask.
