Menu
Sign In Pricing Add Podcast
Podcast Image

Lex Fridman Podcast

#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

Mon, 03 Feb 2025

Description

Dylan Patel is the founder of SemiAnalysis, a research & analysis company specializing in semiconductors, GPUs, CPUs, and AI hardware. Nathan Lambert is a research scientist at the Allen Institute for AI (Ai2) and the author of a blog on AI called Interconnects. Thank you for listening ❤ Check out our sponsors: https://lexfridman.com/sponsors/ep459-sc See below for timestamps, transcript, and to give feedback, submit questions, contact Lex, etc. Transcript: https://lexfridman.com/deepseek-dylan-patel-nathan-lambert-transcript CONTACT LEX: Feedback - give feedback to Lex: https://lexfridman.com/survey AMA - submit questions, videos or call-in: https://lexfridman.com/ama Hiring - join our team: https://lexfridman.com/hiring Other - other ways to get in touch: https://lexfridman.com/contact EPISODE LINKS: Dylan's X: https://x.com/dylan522p SemiAnalysis: https://semianalysis.com/ Nathan's X: https://x.com/natolambert Nathan's Blog: https://www.interconnects.ai/ Nathan's Podcast: https://www.interconnects.ai/podcast Nathan's Website: https://www.natolambert.com/ Nathan's YouTube: https://youtube.com/@natolambert Nathan's Book: https://rlhfbook.com/ SPONSORS: To support this podcast, check out our sponsors & get discounts: Invideo AI: AI video generator. Go to https://invideo.io/i/lexpod GitHub: Developer platform and AI code editor. Go to https://gh.io/copilot Shopify: Sell stuff online. Go to https://shopify.com/lex NetSuite: Business management software. Go to http://netsuite.com/lex AG1: All-in-one daily nutrition drinks. Go to https://drinkag1.com/lex OUTLINE: (00:00) - Introduction (13:28) - DeepSeek-R1 and DeepSeek-V3 (35:02) - Low cost of training (1:01:19) - DeepSeek compute cluster (1:08:52) - Export controls on GPUs to China (1:19:10) - AGI timeline (1:28:35) - China's manufacturing capacity (1:36:30) - Cold war with China (1:41:00) - TSMC and Taiwan (2:04:38) - Best GPUs for AI (2:19:30) - Why DeepSeek is so cheap (2:32:49) - Espionage (2:41:52) - Censorship (2:54:46) - Andrej Karpathy and magic of RL (3:05:17) - OpenAI o3-mini vs DeepSeek r1 (3:24:25) - NVIDIA (3:28:53) - GPU smuggling (3:35:30) - DeepSeek training on OpenAI data (3:45:59) - AI megaclusters (4:21:21) - Who wins the race to AGI? (4:31:34) - AI agents (4:40:16) - Programming and AI (4:47:43) - Open source (4:56:55) - Stargate (5:04:24) - Future of AI

Audio
Transcription

0.109 - 26.389 Lex Fridman

The following is a conversation with Dylan Patel and Nathan Lampert. Dylan runs SemiAnalysis, a well-respected research and analysis company that specializes in semiconductors, GPUs, CPUs, and AI hardware in general. Nathan is a research scientist at the Allen Institute for AI and is the author of the amazing blog on AI called Interconnects.

0
💬 0

27.298 - 56.998 Lex Fridman

They are both highly respected, read, and listened to by the experts, researchers, and engineers in the field of AI. And personally, I'm just a fan of the two of them. So I used the DeepSeek moment that shook the AI world a bit as an opportunity to sit down with them and lay it all out. From DeepSeek, OpenAI, Google, XAI, Meta, Anthropic, to NVIDIA and TSMC, and to US, China, Taiwan relations,

0
💬 0

57.638 - 67.522 Lex Fridman

and everything else that is happening at the cutting edge of AI. This conversation is a deep dive into many critical aspects of the AI industry.

0
💬 0

68.063 - 91.788 Lex Fridman

While it does get super technical, we try to make sure that it's still accessible to folks outside of the AI field by defining terms, stating important concepts explicitly, spelling out acronyms, and in general, always moving across the several layers of abstraction and levels of detail. There is a lot of hype in the media about what AI is and isn't.

0
💬 0

92.549 - 117.628 Lex Fridman

The purpose of this podcast, in part, is to cut through the hype through the bullshit and the low-resolution analysis, and to discuss in detail how stuff works and what the implications are. Let me also, if I may, comment on the new OpenAI 03 Mini reasoning model, the release of which we were anticipating during the conversation, and it did indeed come out right after.

0
💬 0

118.388 - 147.664 Lex Fridman

Its capabilities and costs are on par with our expectations, as we stated. OpenAI O3 Mini is indeed a great model, but it should be stated that DeepSeek R1 has similar performance on benchmarks, is still cheaper, and it reveals its chain of thought reasoning, which O3 Mini does not. It only shows a summary of the reasoning. Plus, R1 is open weight, and O3 Mini is not.

0
💬 0

149.025 - 170.954 Lex Fridman

By the way, I got a chance to play with O3 Mini, and anecdotal vibe check-wise, I felt that O3 Mini, specifically O3 Mini High, is better than R1. Still, for me personally, I find that Claude Sonnet 3.5 is the best model for programming, except for tricky cases where I will use O1 Pro to brainstorm.

0
💬 0

172.074 - 187.286 Lex Fridman

Either way, many more better AI models will come, including reasoning models, both from American and Chinese companies. They will continue to shift the cost curve. But the, quote, deep-seek moment is indeed real.

0
💬 0

187.907 - 209.47 Lex Fridman

I think it will still be remembered five years from now as a pivotal event in tech history, due in part to the geopolitical implications, but for other reasons too, as we discuss in detail from many perspectives in this conversation. And now a quick few second mention of each sponsor. Check them out in the description. It's the best way to support this podcast.

0
💬 0

209.87 - 232.699 Lex Fridman

We got InVideo AI for video generation, GitHub for coding, Shopify for selling stuff online, NetSuite for running your business, and AG1 for staying healthy. Choose wisely, my friends. Also, if you want to get in touch with me for whatever reason, go to alexhrieman.com contact. And now onto the follow ad reads. No ads in the middle.

0
💬 0

233.059 - 256.377 Lex Fridman

I try to make this interesting, but if you skip them, please still check out our sponsors. I enjoy their stuff. Maybe you will too. This video is brought to you by a new sponsor, but I've known these folks for a long time and perfect fit for this podcast. They're called NVIDIA AI. It's a video generating app that allows you to create full length videos using just text prompts.

0
💬 0

256.817 - 281.835 Lex Fridman

It's intuitive, works amazing. It's truly incredible what you can do. I've been playing quite a bit in using it for stock footage. And by the way, they make it super easy for you to switch between actually available stock footage and AI-generated footage. I've been preparing a lot for a conversation with Tim Sweeney, who is the creator of Unreal Engine.

0
💬 0

281.855 - 310.592 Lex Fridman

And there's 3D worlds, and you get to think about the role of AI in generating those 3D worlds. That's what's coming 5, 10, 20 years from now. In video games and simulations, a fundamental part of our lives will be generated with AI. And I think NVIDIA AI does a masterful job of pushing us in that direction. in the 2D plane of video. Now, I think this is not a tool that replaces human creativity.

0
💬 0

311.333 - 346.036 Lex Fridman

I think it supercharges human creativity. I think now and for a long, long time to come, humans will be in the loop of creating great art because we're creating for each other. And only humans truly, deeply know what makes other humans go, ah, like the old Kerouac line. If you want to try out NVIDIA AI, you can do so for free at nvidia.io slash lexpod, saving time and money on production costs.

0
💬 0

347.489 - 373.71 Lex Fridman

This episode is brought to you by the thing that's brought me joy for many, many years and created a community for hundreds of thousands, millions, I don't know how many developers, and that place is called GitHub. It is a company that really has supercharged the developer community. I mean, where would the world be without GitHub?

0
💬 0

374.651 - 402.282 Lex Fridman

And they're also, as a company, pushing the limits of what's possible in terms of AI code generation, AI-assisted coding. They were pioneers on Copilot. They are still pioneers in Copilot. It's super competitive space, and they are doing their best to win. I will forever be a supporter of GitHub Copilot. Now, it integrates in a bunch of IDEs, not just into VS Code. I am, of course,

0
💬 0

403.664 - 428.632 Lex Fridman

A VS Code guy at this time. I did use JetBrains for a long time. I still dabble a little bit. For people who don't know, JetBrains has a plethora. Don't like using that word. It seems elitist. There's gotta be a better word. There is a lot of different sort of sub-IDEs inside JetBrains. I've even used DataGrip, which manages the MySQL. I should mention...

0
💬 0

429.656 - 458.977 Lex Fridman

and this might be embarrassing, but I have not, ooh, this might be interesting, but I have not used anything like Copilot on any database management GUIs. I wonder if DataGrip integrates Copilot. I'm gonna have to check that out. But everything I use, I'm writing SQL queries from scratch. inside the database management GUI.

0
💬 0

459.457 - 481.452 Lex Fridman

If I want to do complicated queries, I'll go to any of the LMS probably going to be cloth on a three five. Or if it's part of the code, then I'm going to be inside my ID. I just like having a GUI management of a database. I'm going to have to check that out. If DataGrip integrates Copilot, it's going to be incredible.

0
💬 0

481.892 - 507.933 Lex Fridman

If not, I'm going to yell from the top of my lungs, hoping it will eventually, because it'll make my life a bit easier. To have the visual component of a database together with the code component of SQL queries, yeah, it'll be amazing. Anyway, go check out GitHub Copilot at gh.io.copilot. This episode is brought to you by Shopify. Not Spotify. Shopify. Easily confused.

0
💬 0

508.293 - 534.784 Lex Fridman

The CEOs are tagged on X often. They're both great CEOs. But this is Shopify. You can sell anywhere with a great looking online store. Using Shopify. I've been learning a lot about the Silk Road, actually. Not the digital one. The one that for a lot of human history served as a place for merchants to travel and trade goods.

0
💬 0

535.985 - 566.822 Lex Fridman

And I'm reading a lot about Genghis Khan who enforced the rule of law on the Silk Road and that actually had a big impact. invigorating effect on the economy of the Eurasian region. Anyway, that was before computers. Imagine if they had computers. Boy, would the Genghis Khan force be terrifying. Or maybe not. Maybe each technological age has their own

0
💬 0

568.163 - 598.954 Lex Fridman

kind of military tactician, their own human that matches perfectly for that time in order to conquer the land and people. Still, what a terrifying time that was. Much of human history. Lots of beauty, but lots of ways to die. So I'm glad to be living in the 21st century where I can sit back with that margarita. I don't drink margaritas, but if I wanted to, I could.

0
💬 0

599.394 - 622.264 Lex Fridman

And then buy stuff on stores created by Shopify. Anyway, you can sign up for a $1 per month trial period at Shopify.com slash Lex. Go to Shopify.com slash Lex to take your business to the next level today. This episode is also brought to you by NetSuite, an all-in-one business management system. I'm not sure why I said that so slowly, but I did.

0
💬 0

622.284 - 650.855 Lex Fridman

I actually did a little intermission for five, six minutes for this episode, where I added in the middle of it an addendum after having tried OpenAI 03 Mini. That was such a weird feeling to sort of insert myself in the middle of an episode. I felt like a third wheel to myself. It's like, hey, hey everyone, what are you doing? Why'd you guys not invite me to this party? That's what I felt like.

0
💬 0

652.256 - 669.049 Lex Fridman

Hey Lux from the past, it's me Lux from the future. Right, I should be talking about NetSuite, which is an all-in-one cloud business management system. It's the machine inside the machine. And boy, are we increasingly building stacks of machines.

0
💬 0

671.291 - 691.104 Lex Fridman

Layers and layers and layers of abstraction until we're just sitting back on a beach somewhere talking to an AI system that's taking care of everything else. Anyway, you can download the CFO's guide to AI and machine learning at at netsuite.com slash lex. That's netsuite.com slash lex.

0
💬 0

692.405 - 716.063 Lex Fridman

This episode is also brought to you by AG1, an all-in-one daily drink to support better health and peak performance. I drank it today. I enjoyed it today. I've been sleeping very, very little. The amount of work I have to do is insane. And last night at 6 a.m., I went to bed at 7 a.m., 8 a.m., thinking about doing an all-nighter. It's madness.

0
💬 0

716.683 - 738.546 Lex Fridman

But anyway, at 6 a.m., I drank an AG-1, and I was sitting on a couch, and I was watching like 10 minutes of American Primeval. I watched like 5, 10 minutes of a show at a time. I was sipping on the AG-1, and I was thinking how lucky, how fucking lucky I am to be alive.

0
💬 0

740.188 - 760.255 Lex Fridman

First of all, because I'm watching the American frontier and people being just brutal to each other, the brutal reality of nature and war during that time and the lawlessness during that time. But also just how lucky I am to be on the spinning rock, enjoying this green, healthy drink.

0
💬 0

762.282 - 786.096 Lex Fridman

being able to watch a show, being able to work hard towards the thing I love, being able to love, being able to breathe, all of it, just amazing. Anyway, they'll give you one month's supply of fish oil when you sign up at drinkag1.com slash Lex. This is the Lex Friedman Podcast. To support it, please check out our sponsors in the description.

0
💬 0

786.676 - 823.37 Lex Fridman

And now, dear friends, here's Dylan Patel and Nathan Lambert. A lot of people are curious to understand China's DeepSeek AI models. So let's lay it out. Nathan, can you describe what DeepSeek v3 and DeepSeek R1 are, how they work, how they're trained? Let's look at the big picture and then we'll zoom in on the details.

0
💬 0

823.669 - 849.783 Dylan Patel

Yeah, so DeepSeq v3 is a new mixture of experts, transformer language model from DeepSeq, who is based in China. They have some new specifics in the model that we'll get into. Largely, this is a open weight model, and it's a instruction model like what you would use in ChatGPT. They also released what is called the base model, which is before these techniques of post-training.

0
💬 0

850.884 - 872.343 Dylan Patel

Most people use instruction models today, and those are what served in all sorts of applications. This was released on, I believe, December 26th, or that week. And then weeks later, on January 20th, DeepSeq released DeepSeq R1, which is a reasoning model, which... really accelerated a lot of this discussion.

0
💬 0

873.064 - 894.219 Dylan Patel

This reasoning model has a lot of overlapping training steps to DeepSeq v3, and it's confusing that you have a base model called v3 that you do something to to get a chat model, and then you do some different things to get a reasoning model. I think a lot of the AI industry is going through this challenge of communications right now where OpenAI makes fun of their own naming schemes.

0
💬 0

894.359 - 909.011 Dylan Patel

They have GPT-4.0. They have OpenAI-01. And there's a lot of types of models. So we're going to break down what each of them are. There's a lot of technical specifics on training and go from high level to specific and kind of go through each of them.

0
💬 0

909.271 - 917.738 Lex Fridman

There's so many places we can go here, but maybe let's go to open weights first. What does it mean for a model to be open weights? And what are the different flavors of open source in general?

0
💬 0

917.938 - 940.498 Dylan Patel

Yeah, so this discussion has been going on for a long time in AI. It became more important since ChatGPT or more focal since ChatGPT at the end of 2022. Open weights is the accepted term for when model weights of a language model are available on the internet for people to download. Those weights can have different licenses, which is effectively the terms by which you can use the model.

0
💬 0

941.098 - 967.812 Dylan Patel

There are licenses that come from history and open source software. There are licenses that are designed by companies specifically. All of Lama, DeepSeek, Quen, Mistral, these popular names in... open weight models have some of their own licenses. It's complicated because not all the same models have the same terms. The big debate is on what makes a model open weight. Why are we saying this term?

0
💬 0

967.852 - 988.241 Dylan Patel

It's kind of a mouthful. It sounds close to open source, but it's not the same. there's still a lot of debate on the definition and soul of open source AI. Open source software has a rich history on freedom to modify, freedom to take on your own, freedom from any restrictions on how you would use the software, and what that means for AI is still being defined.

0
💬 0

988.881 - 1011.45 Dylan Patel

So for what I do, I work at the Allen Institute for AI. We're a nonprofit. We want to make AI open for everybody. And we try to lead on what we think is truly open source. There's not full agreement in the community. But for us, that means releasing the training data, releasing the training code, and then also having open weights like this. And we'll get into the details of the models and

0
💬 0

1012.37 - 1030.234 Dylan Patel

Again and again, as we try to get deeper into how the models were trained, we will say things like the data processing, data filtering, data quality is the number one determinant of the model quality. And then a lot of the training code is the determinant on how long it takes to train and how fast your experimentation is.

0
💬 0

1030.794 - 1051.792 Dylan Patel

So without fully open source models where you have access to this data, it is... hard to know, or it's harder to replicate. So we'll get into cost numbers for DeepSeq v3 on mostly GPU hours and how much you could pay to rent those yourselves. But without the data, the replication cost is going to be far, far higher. And same goes for the code.

0
💬 0

1052.153 - 1072.891 Lex Fridman

We should also say that this is probably one of the more open models out of the frontier models. So like, In this full spectrum, where probably the fullest open source, like you said, open code, open data, open weights. This is not open code. This is probably not open data. And

0
💬 0

1074.973 - 1087.596 Lex Fridman

this is open weights and the licensing is a MIT license or it's, I mean, there's some nuance in the different models, but it's towards the free in terms of the open source movement. These are the kind of the good guys.

0
💬 0

1087.696 - 1112.627 Dylan Patel

Yeah. DeepSeek is doing fantastic work for disseminating understanding of AI. Their papers are extremely detailed in what they do. And for other companies, teams around the world, they're very actionable in terms of improving your own training techniques. And we'll talk about licenses more. The DeepSeq R1 model has a very permissive license. It's called the MIT license.

0
💬 0

1112.767 - 1132.918 Dylan Patel

That effectively means there's no downstream restrictions on commercial use. There's no use case restrictions. You can use the outputs from the models to create synthetic data. And this is all fantastic. I think the closest peer is something like Lama, where you have the weights and you have a technical report. And the technical report is very good for Lama.

0
💬 0

1133.038 - 1151.83 Dylan Patel

One of the most read PDFs of the year last year is the Lama 3 paper. But in some ways, it's slightly less actionable. It has less details on the training specifics and less plots. And so on. And the Lama 3 license is more restrictive than MIT. And then between the DeepSeek custom license and the Lama license, we could get into this whole rabbit hole.

0
💬 0

1151.87 - 1155.853 Dylan Patel

I think we'll make sure we want to go down the license rabbit hole before we do specifics.

0
💬 0

1156.073 - 1179.403 Lex Fridman

Yeah. And I mean, so it should be stated that one of the implications of DeepSeek, it puts pressure on Lama and everybody else on open AI to push towards open source. And that's the other side of open source that you mentioned is how much is published in detail about it. So how open are you with the sort of the insights behind the code? So like, how good is the technical reports?

0
💬 0

1180.084 - 1186.546 Lex Fridman

Are they hand wavy? Or is there actual details in there? And that's one of the things that deep seek did well as they publish a lot of the details.

0
💬 0

1186.886 - 1204.913 Dylan Patel

Yeah, especially in the DeepSeq v3, which is their pre-training paper. They were very clear that they are doing interventions on the technical stack that go at many different levels. For example, to get highly efficient training, they're making modifications at or below the CUDA layer for NVIDIA chips.

0
💬 0

1206.214 - 1218.958 Dylan Patel

I have never worked there myself, and there are a few people in the world that do that very well, and some of them are at DeepSeq. And these types of people are... at DeepSeek and leading American frontier labs, but there are not many places.

0
💬 0

1219.505 - 1249.045 Lex Fridman

To help people understand the other implication of open weights. There's a topic we return to often here. There's a fear that China, the nation, might have interest in stealing American data, violating privacy of American citizens. What can we say about open weights to help us understand what the weights are able to do? in terms of stealing people's data.

0
💬 0

1249.305 - 1262.65 Dylan Patel

Yeah, so these weights that you can download from Hugging Face or other platforms are very big matrices of numbers. You can download them to a computer in your own house that has no internet and you can run this model and you're totally in control of your data.

0
💬 0

1263.79 - 1282.197 Dylan Patel

That is something that is different than how a lot of language model usage is actually done today, which is mostly through APIs, where you send your prompt to GPUs run by certain companies. And these companies will have different distributions and policies on how your data is stored, if it is used to train future models, where it is stored, if it is encrypted, and so on.

0
💬 0

1282.977 - 1291.2 Dylan Patel

So the open weights are, you have your fate of data in your own hands. And that is something that is deeply connected to the soul of open source computing.

0
💬 0

1291.533 - 1312.2 Lex Fridman

So it's not the model that steals your data, it's Quover's hosting the model, which could be China, if you're using the DeepSeq app, or it could be Perplexity. You know, you're trusting them with your data. Or OpenAI, you're trusting them with your data, and some of these are American companies, some of these are Chinese companies, but the model itself is not doing the stealing. It's the host.

0
💬 0

1313.32 - 1325.363 Lex Fridman

All right. So back to the basics. What's the difference between deep seek v3 and deep seek r1? Can we try to like, lay out the confusion potential?

0
💬 0

1325.856 - 1342.877 Dylan Patel

Yes. So for one, I have very understanding of many people being confused by these two model names. So I would say the best way to think about this is that when training a language model, you have what is called pre-training, which is when you're predicting the large amounts of mostly internet text. You're trying to predict the next token.

0
💬 0

1343.597 - 1359.389 Dylan Patel

And what to know about these new DeepSeq models is that they do this internet large-scale pre-training once to get what is called DeepSeq v3 base. This is a base model. It's just going to finish your sentences for you. It's going to be harder to work with than ChatGPT.

0
💬 0

1360.309 - 1382.38 Dylan Patel

And then what DeepSeek did is they've done two different post-training regimes to make the models have specific desirable behaviors. So what is the more normal model in terms of the last few years of AI, an instruct model, a chat model, a quote-unquote aligned model, a helpful model? There are many ways to describe this. is more standard post-training.

0
💬 0

1382.541 - 1400.112 Dylan Patel

So this is things like instruction tuning, reinforced learning from human feedback. We'll get into some of these words. And this is what they did to create the DeepSeq v3 model. This was the first model to be released, and it is very high-performance. It's competitive with GPT-4, LAMA-405b, so on.

0
💬 0

1402.333 - 1421.966 Dylan Patel

And then when this release was happening, we don't know their exact timeline or soon after they were finishing the training of a different training process from the same next token prediction based model that I talked about, which is when this new reasoning training that people have heard about comes in in order to create the model that is called DeepSeq R1.

0
💬 0

1423.007 - 1442.022 Dylan Patel

The R through this conversation is good for grounding for reasoning, and the name is also similar to OpenAI's O1, which is the other reasoning model that people have heard about. And we'll have to break down the training for R1 in more detail because for one, we have a paper detailing it, but also it is a far newer set of techniques for the AI community.

0
💬 0

1442.482 - 1445.745 Dylan Patel

So it's a much more rapidly evolving area of research.

0
💬 0

1446.252 - 1461.953 Lex Fridman

Maybe we should also say the big two categories of training of pre-training and post-training. These umbrella terms that people use. So what is pre-training and what is post-training? And what are the different flavors of things underneath post-training umbrella?

0
💬 0

1462.185 - 1480.675 Dylan Patel

Yeah, so pre-training, I'm using some of the same words to really get the message across is you're doing what is called autoregressive prediction to predict the next token in a series of documents. This is done over standard practices, trillions of tokens. So this is a ton of data that is mostly scraped from the web.

0
💬 0

1481.355 - 1498.349 Dylan Patel

In some of DeepSeq's earlier papers, they talk about their training data being distilled for math. I shouldn't use this word yet, but taken from Common Crawl. And that's a public access that anyone listening to this could go download data from the Common Crawl website. This is a crawler that is maintained publicly.

0
💬 0

1498.85 - 1517.323 Dylan Patel

Yes, other tech companies eventually shift to their own crawler, and DeepSeq likely has done this as well as most frontier labs do. But this sort of data is something that people can get started with. And you're just predicting text in a series of documents. This can be scaled to be very efficient.

0
💬 0

1517.763 - 1541.665 Dylan Patel

And there's a lot of numbers that are thrown around in AI training, like how many floating point operations or flops are used. And then you can also look at how many hours of these GPUs that are used. And it's largely one loss function taken to a very large amount of compute usage, you just you set up really efficient systems. And then at the end of that, you have the space model.

0
💬 0

1542.345 - 1566.403 Dylan Patel

And pre training is where there is a lot more of complexity in terms of how the process is emerging or evolving and the different types of training losses that you will use. I think this is a lot of techniques grounded in the natural language processing literature. The oldest technique which is still used today is something called instruction tuning or also known as supervised fine tuning.

0
💬 0

1566.923 - 1574.908 Dylan Patel

These acronyms will be IFT or SFT. People really go back and forth throughout them and I will probably do the same which is where you add this

0
💬 0

1576.309 - 1595.747 Dylan Patel

formatting to the model where it knows to take a question that is like, explain the history of the Roman Empire to me, or a sort of question you'll see on Reddit or Stack Overflow, and then the model will respond in a information-dense but presentable manner. The core of that formatting is in this instruction tuning phase.

0
💬 0

1596.508 - 1616.979 Dylan Patel

And then there's two other categories of loss functions that are being used today. One I will classify as preference fine tuning. Preference fine tuning is a generalized term for what came out of reinforcement learning from human feedback, which is RLHF. This reinforcement learning from human feedback is credited as the technique that helped

0
💬 0

1618.64 - 1637.149 Dylan Patel

ChatGPT breakthrough is a technique to make the responses that are nicely formatted, like these Reddit answers, more in tune with what a human would like to read. This is done by collecting pairwise preferences from actual humans out in the world to start. And now AIs are also labeling this data and we'll get into those trade-offs.

0
💬 0

1638.19 - 1656.157 Dylan Patel

And you have this kind of contrastive loss function between a good answer and a bad answer. And the model learns to pick up these trends. There's different implementation ways. You have things called reward models. You could have direct alignment algorithms. There's a lot of really specific things you can do, but all of this is about fine tuning to human preferences.

0
💬 0

1657.237 - 1678.278 Dylan Patel

And the final stage is much newer and will link to what is done in R1. And these reasoning models is, I think, OpenAI's name for this. They had this new API in the fall, which they called the Reinforcement Fine Tuning API. This is the idea that you use the techniques of reinforcement learning, which is a whole framework of AI. There's a deep literature here.

0
💬 0

1678.838 - 1699.536 Dylan Patel

To summarize, it's often known as trial and error learning or the subfield of AI where you're trying to make sequential decisions in a certain potentially noisy environment. There's a lot of ways we could go down that. but fine tuning language models where they can generate an answer and then you check to see if the answer matches the true solution.

0
💬 0

1699.736 - 1719.511 Dylan Patel

For math or code, you have an exactly correct answer for math. You can have unit tests for code. And what we're doing is we are checking the language models work and we're giving it multiple opportunities on the same questions to see if it is right. And if you keep doing this, the models can learn to improve in verifiable domains. to a great extent. It works really well.

0
💬 0

1719.851 - 1734.823 Dylan Patel

It's a newer technique in the academic literature. It's been used at frontier labs in the US that don't share every detail for multiple years. So this is the idea of using reinforcement learning with language models, and it has been taking off, especially in this deep-seek moment.

0
💬 0

1735.233 - 1755.465 Lex Fridman

And we should say that there's a lot of exciting stuff going on on the, again, across the stack, but the post-training probably this year, there's going to be a lot of interesting developments in the post-training. We'll talk about it. I almost forgot to talk about the difference between DeepSeek v3 and R1 on the user experience side. So, Forget the technical stuff, forget all of that.

0
💬 0

1755.625 - 1765.895 Lex Fridman

Just people that don't know anything about AI, they show up. What's the actual experience? What's the use case for each one when they actually type and talk to it? What is each good at? That kind of thing.

0
💬 0

1766.161 - 1789.597 Dylan Patel

So let's start with DeepSeq v3 again. It's what more people would have tried something like it. You ask it a question. It'll start generating tokens very fast. And those tokens will look like a very human legible answer. It'll be some sort of markdown list. It might have formatting to help you draw to the core details in the answer. And it'll generate tens to hundreds of tokens.

0
💬 0

1789.757 - 1813.366 Dylan Patel

A token is normally a word for common words or a subword part in a longer word. And it'll look like a very high quality Reddit or Stack Overflow answer. These models are really getting good at doing these across a wide variety of domains. Even things that if you're an expert, things that are close to the fringe of knowledge, they will still be fairly good at.

0
💬 0

1814.026 - 1834.597 Dylan Patel

Cutting edge AI topics that I do research on, these models are capable for study aid and they're regularly updated. Where this changes is with the DeepSeq R1, what is called these reasoning models, is when you see tokens coming from these models to start, it will be a large chain of thought process.

0
💬 0

1835.057 - 1852.311 Dylan Patel

We'll get back to chain of thought in a second, which looks like a lot of tokens where the model is explaining the problem. The model will often break down the problem and be like, okay, they asked me for this. Let's break down the problem. I'm going to need to do this. and you'll see all of this generating from the model. It'll come very fast in most user experiences.

0
💬 0

1852.371 - 1870.405 Dylan Patel

These APIs are very fast, so you'll see a lot of tokens, a lot of words show up really fast. It'll keep flowing on the screen, and this is all the reasoning process. And then eventually the model will change its tone in R1, and it'll write the answer, where it summarizes its reasoning process and writes a similar answer to the first types of model.

0
💬 0

1870.926 - 1883.829 Dylan Patel

But in DeepSeq's case, which is part of why this was so popular even outside the AI community, is that you can see how the language model is breaking down problems. And then you get this answer on a technical side.

0
💬 0

1883.869 - 1901.025 Dylan Patel

They train the model to do this specifically where they have a section, which is reasoning, and then it generates a special token, which is probably hidden from the user most of the time, which says, okay, I'm starting to answer. So the model is trained to do this two-stage process on its own. If you use a similar model in, say, OpenAI, OpenAI's user interface is...

0
💬 0

1901.986 - 1917.855 Dylan Patel

trying to summarize this process for you nicely by kind of showing the sections that the model is doing. And it'll kind of click through, it'll say breaking down the problem, making X calculation, cleaning the result, and then the answer will come for something like OpenAI.

0
💬 0

1918.195 - 1922.218 Lex Fridman

Maybe it's useful here to go through like an example of a DeepSeq R1 reasoning

0
💬 0

1923.502 - 1938.816 Dylan Patel

Yeah, so if you're looking at the screen here, what you'll see is a screenshot of the DeepSea chat app. And at the top is thought for 151.7 seconds with the dropdown arrow. Underneath that, if we were in an app that we were running, the dropdown arrow would have the reasoning.

0
💬 0

1939.195 - 1963.973 Lex Fridman

So in this case, the specific question which, you know, I'm philosophically slash pothead inclined. So this is asking Deep Sikhar I for one truly novel insight about humans. And it reveals the reasoning and basically the truly novel aspect is what's pushing the reasoning to constantly sort of the model asking itself, is this truly novel?

0
💬 0

1964.013 - 1987.285 Lex Fridman

So it's actually challenging itself to be more novel, more counterintuitive. less cringe, I suppose. So some of the reasoning says, this is just snapshots, alternatively, humans have a unique meta-emotion where they feel emotions about their own emotions, e.g. feeling guilty about being angry. This recursive emotional layering creates complex motivational drives that don't exist in other animals.

0
💬 0

1987.685 - 1995.429 Lex Fridman

The insight is that human emotions are nested. So it's like, it's reasoning through how humans feel emotions. It's reasoning about meta-emotions.

0
💬 0

1995.489 - 2000.712 Dylan Patel

It's going to have pages and pages of this. It's almost too much to actually read, but it's nice to skim as it's coming.

0
💬 0

2000.852 - 2021.025 Lex Fridman

It's a James Joyce-like stream of consciousness. And then it goes, wait, the user wants something that's not seen anywhere else. Let me dig deeper. And consider the human ability to hold contradictory beliefs simultaneously. Cognitive dissonance is known, but perhaps the function is to allow flexible adaptation, so on and so forth.

0
💬 0

2021.686 - 2038.267 Lex Fridman

I mean, that really captures the public imagination that holy shit, this isn't, uh, mean intelligence slash almost like an inkling of sentience because like you're thinking through, you're self-reflecting, you're deliberating.

0
💬 0

2038.948 - 2065.449 Lex Fridman

And the final result of that after 157 seconds is humans instinctively convert selfish desires into cooperative systems by collectively pretending abstract rules, money, laws, rights are real. These shared hallucinations act as, quote, games, where competition is secretly redirected to benefit the group, turning conflict into society's fuel. Pretty profound. I mean, you know.

0
💬 0

2066.11 - 2082.12 Dylan Patel

This is a potential digression, but a lot of people have found that these reasoning models can sometimes produce much more eloquent text. That is a at least interesting example, I think, depending on how open-minded you are, you find language models interesting or not, and there's a spectrum there. Yeah.

0
💬 0

2082.58 - 2107.678 Lex Fridman

Well, I mean, some of the, we'll talk about different benchmarks and so on, but some is just a vibe. Like that in itself is a, let's say, quote, fire tweet. Yeah. If I'm trying to produce something where people are like, oh, shit. Okay, so that's a chain of thought. We'll probably return to it more. How were they able to achieve such low cost on the training and the inference?

0
💬 0

2107.758 - 2109.379 Lex Fridman

Maybe you could talk the training first.

0
💬 0

2109.95 - 2130.036 Nathan Lambert

Yeah. So there's two main techniques that they implemented that are probably the majority of their efficiency. And then there's a lot of implementation details that maybe we'll gloss over or get into later that sort of contribute to it. But those two main things are, one is they went to a mixture of experts model, which we'll define in a second.

0
💬 0

2130.056 - 2145.963 Nathan Lambert

And then the other thing is that they invented this new technique called MLA latent attention. Both of these are big deals. Mixture of experts is something that's been in the literature for a handful of years. And OpenAI with GPT-4 was the first one to productize a mixture of experts model.

0
💬 0

2146.343 - 2168.135 Nathan Lambert

And what this means is when you look at the common models around that most people have been able to interact with that are open, right? Think LAMA. LAMA is a dense model. i.e. every single parameter or neuron is activated as you're going through the model for every single token you generate, right? Now, with a mixture of experts model, you don't do that, right?

0
💬 0

2168.195 - 2184.525 Nathan Lambert

How does the human actually work, right? It's like, oh, well, my visual cortex is active when I'm thinking about, you know, vision tasks and like, you know, other things, right? My amygdala is when I'm scared, right? These different aspects of your brain are focused on different things. A mixture of experts model attempts to approximate this to some extent.

0
💬 0

2184.685 - 2198.093 Nathan Lambert

It's nowhere close to what a brain architecture is, but different portions of the model activate, right? You'll have a set number of experts in the model and a set number that are activated each time. And this dramatically reduces both your training and inference costs.

0
💬 0

2198.473 - 2217.324 Nathan Lambert

Because now if you think about the parameter count as the sort of total embedding space for all of this knowledge that you're compressing down during training, When you're embedding this data in, instead of having to activate every single parameter every single time you're training or running inference, now you can just activate a subset.

0
💬 0

2217.945 - 2238.535 Nathan Lambert

And the model will learn which expert to route to for different tasks. And so this is a humongous innovation in terms of, hey, I can continue to grow the total embedding space of parameters. And so DeepSeq's model is, you know, 600 something billion parameters, right? Relative to LAMA-405b, it's 405 billion parameters, right? Relative to LAMA-70b, it's 70 billion parameters, right?

0
💬 0

2238.775 - 2257.848 Nathan Lambert

So this model technically has more embedding space for information, right? To compress all of the world's knowledge that's on the internet down. But at the same time, it is only activating around 37 billion of the parameters. So only 37 billion of these parameters actually need to be computed every single time you're training data or inferencing data out of it.

0
💬 0

2258.408 - 2271.52 Nathan Lambert

And so versus, again, the LAMA model, 70 billion parameters must be activated or 405 billion parameters must be activated. So you've dramatically reduced your compute cost when you're doing training and inference. with this mixture of experts architecture.

0
💬 0

2271.76 - 2282.382 Dylan Patel

Should we break down where it actually applies and go into the transformer? Is that useful? Let's go. Let's go into the transformer. So the transformer is a thing that is talked about a lot, and we will not cover every detail.

0
💬 0

2283.662 - 2298.766 Dylan Patel

Essentially, the transformer is built on repeated blocks of this attention mechanism and then a traditional, dense, fully connected, multilayer perception, whatever word you want to use for your normal neural network, and you alternate these blocks. There's other details.

0
💬 0

2299.286 - 2318.204 Dylan Patel

And where mixture of experts is applied is that this dense model, the dense model holds most of the weights if you count them in a transformer model. So you can get really big gains from those mixture of experts on parameter efficiency at training and inference because you get this efficiency by not activating all of these parameters.

0
💬 0

2318.504 - 2334.593 Lex Fridman

We should also say that a transformer is a giant neural network. And then there's for 15 years now, there's what's called the deep learning revolution. Networks gotten larger and larger and at a certain point, the scaling laws appeared where people realized

0
💬 0

2334.994 - 2358.791 Lex Fridman

is a scaling law shirt by the way representing scaling laws where it became more and more formalized that bigger is better across multiple dimensions of what bigger means so uh and but these are all sort of neural networks we're talking about and we're talking about different architectures of how construct to construct these neural networks such that the training and the inference on them is super efficient yeah

0
💬 0

2358.951 - 2376.466 Dylan Patel

Every different type of model has a different scaling law for it, which is effectively for how much compute you put in, the architecture will get to different levels of performance at test tasks. And mixture of experts is one of the ones at training time, even if you don't consider the inference benefits, which are also big.

0
💬 0

2376.927 - 2395.378 Dylan Patel

At training time, your efficiency with your GPUs is dramatically improved by using this architecture if it is well implemented. So you can get effectively the same performance model and evaluation scores with numbers like 30% less compute. I think there's going to be a wide variation depending on your implementation details and stuff.

0
💬 0

2395.898 - 2415.042 Dylan Patel

But it is just important to realize that this type of technical innovation is something that gives value. huge gains. And I expect most companies that are serving their models to move to this mixture of experts implementation. Historically, the reason why not everyone might do it is because it's an implementation complexity, especially when doing these big models.

0
💬 0

2415.503 - 2435.156 Dylan Patel

So this is one of the things that's DeepSeq gets credit for is they do this extremely well. They do mixture of experts extremely well. This architecture for what is called DeepSeq MOE, MOE is the shortened version of mixture of experts, is multiple papers old. This part of their training infrastructure is not new to these models. alone.

0
💬 0

2435.276 - 2447.306 Dylan Patel

And same goes for what Dylan mentioned with multi-head latent attention. It's all about reducing memory usage during inference and same things during training by using some fancy low-rank approximation math.

0
💬 0

2448.026 - 2466.519 Dylan Patel

If you get into the details with this latent attention, it's one of those things I look at and say, okay, they're doing really complex implementations because there's other parts of language models such as embeddings that are used to extend the context length. The common one that DeepSeq uses is rotary positional embeddings, which is called rope.

0
💬 0

2467.18 - 2488.072 Dylan Patel

And if you want to use rope with a normal MOE, it's kind of a sequential thing. You take two of the attention matrices and you rotate them by a complex value rotation, which is a matrix multiplication. With DeepSeq's MLA, with this new attention architecture, they need to do some clever things because they're not set up the same and it just makes the implementation complexity much higher.

0
💬 0

2488.652 - 2503.263 Dylan Patel

So they're managing all of these things. And these are probably the sort of things that OpenAI, these closed labs are doing. We don't know if they're doing the exact same techniques, but they actually shared them with the world, which is really nice to feel like this is the cutting edge of efficient language model training.

0
💬 0

2503.603 - 2514.911 Lex Fridman

And some of this requires low-level engineering. It's a giant mess and trickery. So as I understand, they went below CUDA. So they go super low programming of GPUs.

0
💬 0

2515.327 - 2532.903 Nathan Lambert

Effectively, NVIDIA builds this library called Nickel, right? In which, you know, when you're training a model, you have all these communications between every single layer of the model, and you may have over 100 layers. What does Nickel stand for? It's NCCL? NVIDIA Communications Collectives Library. Nice. And so...

0
💬 0

2535.846 - 2557.183 Nathan Lambert

When you're training a model, you're going to have all these all-reduces and all-gathers. Between each layer, between the multi-layer perceptron or feed-forward network and the attention mechanism, you'll have basically the model synchronized. Or you'll have all-reducer and all-gather. And this is a communication between all the GPUs in the network, whether it's in training or inference.

0
💬 0

2557.483 - 2570.473 Nathan Lambert

So NVIDIA has a standard library. This is one of the reasons why it's really difficult to use anyone else's hardware. for training is because no one's really built a standard communications library. And NVIDIA has done this at a sort of a higher level, right?

0
💬 0

2570.753 - 2589.006 Nathan Lambert

DeepSeq, because they have certain limitations around the GPUs that they have access to, the interconnects are limited to some extent by the restrictions of the GPUs that were shipped into China legally, not the ones that are smuggled, but legally shipped in that they use to train this model. They had to figure out how to get efficiencies,

0
💬 0

2589.606 - 2609.14 Nathan Lambert

And one of those things is that instead of just calling the NVIDIA library nickel, they instead scheduled their own communications, which some of the labs do. Emeta talked about in Lama 3 how they made their own custom version of nickel. They didn't talk about the implementation details. This is some of what they did.

0
💬 0

2609.381 - 2621.809 Nathan Lambert

Probably not as well as, maybe not as well as DeepSeq because DeepSeq, necessity is the mother of innovation and they had to do this. Whereas in the case, you know, OpenAI has people that do this sort of stuff, Anthropic, et cetera.

0
💬 0

2622.449 - 2645.784 Nathan Lambert

But, you know, DeepSeek certainly did it publicly and they may have done it even better because they were gimped on a certain aspect of the chips that they have access to. And so they scheduled communications, you know, by scheduling specific SMs. SMs you could think of as like the core on a GPU. Right. So there's hundreds of cores or there's, you know, a bit over 100 cores, SMs on a GPU.

0
💬 0

2645.824 - 2661.599 Nathan Lambert

And they were specifically scheduling, hey, which ones are running the model, which ones are doing all reduce, which one are doing all gather. Right. And they would flip back and forth between them. And this requires extremely low level programming. This is what Nickel does automatically or other NVIDIA libraries handle this automatically, usually. Yeah, exactly.

0
💬 0

2661.74 - 2681.351 Nathan Lambert

And so technically they're using, you know, PTX, which is like sort of like you could think of it as like an assembly type language. It's not exactly that or instruction set, right? Like coding directly to assembly instruction set. It's not exactly that, but that's still part of technically CUDA. But it's like, do I want to write in Python, you know, PyTorch equivalent and call NVIDIA libraries?

0
💬 0

2681.371 - 2697.876 Nathan Lambert

Do I want to go down to the C level? right? Or, you know, encode even lower level? Or do I want to go all the way down to the assembly or ISO level? And, and there are cases where you go all the way down there at the very big labs, but most companies just do not do that, right? Because it's a waste of time. And the efficiency gains you get are not worth it.

0
💬 0

2698.096 - 2718.085 Nathan Lambert

But deep seeks implementation is so complex, right? Especially with their mixture of experts, right? People have done mixture of experts, but they're generally eight, 16 experts, right? And they activate two. So, you know, one of the words that we like to use is like sparsity factor, right? Or usage, right? So you might have four, you know, one fourth of your model activate, right?

0
💬 0

2718.245 - 2738.596 Nathan Lambert

And that's what Mistral's, Mistral model, right? Their model that really catapulted them to like, oh my God, they're really, really good. OpenAI has also had models that are MOE and so have all the other labs that are major closed. But But what DeepSeq did that maybe only the leading labs have only just started recently doing is have such a high sparsity factor, right?

0
💬 0

2738.616 - 2744.878 Nathan Lambert

It's not one fourth of the model, right? Two out of eight experts activating every time you go through the model, it's eight out of 256.

0
💬 0

2745.878 - 2763.699 Dylan Patel

And there's different implementations for mixture of experts where you can have... some of these experts that are always activated, which this just looks like a small neural network. And then all the tokens go through that. And then they also go through some that are selected by this routing mechanism. And one of the

0
💬 0

2764.795 - 2783.762 Dylan Patel

innovations in DeepSeq's architecture is that they changed the routing mechanism in mixture of expert models. There's something called an auxiliary loss, which effectively means during training, you want to make sure that all of these experts are used across the tasks that the model sees. Why there can be failures in mixture of experts is that

0
💬 0

2784.927 - 2804.215 Dylan Patel

When you're doing this training, the one objective is token prediction accuracy. And if you just let training go with a mixture of expert model on your own, it can be that the model learns to only use a subset of the experts. And in the MOE literature, there's something called the auxiliary loss, which helps balance them.

0
💬 0

2804.695 - 2823.467 Dylan Patel

But if you think about the loss functions of deep learning, this even connects to the You want to have the minimum inductive bias in your model to let the model learn maximally. And this auxiliary loss, this balancing across experts could be seen as intention with the prediction accuracy of the tokens.

0
💬 0

2824.067 - 2842.56 Dylan Patel

So we don't know the exact extent that the deep seek MOE change, which is instead of doing an auxiliary loss, they have an extra parameter in their routing, which after the batches, they update this parameter to make sure that the next batches all have a similar use of experts. And this type of change can be big, it can be small, but they add up over time.

0
💬 0

2842.58 - 2860.035 Dylan Patel

And this is the sort of thing that just points to them innovating. And I'm sure all the labs that are training big MOEs are looking at this sort of things, which is getting away from the auxiliary loss. Some of them might already use it, but you just keep accumulating gains. And we'll talk about... the philosophy of training and how you organize these organizations.

0
💬 0

2860.095 - 2877.632 Dylan Patel

And a lot of it is just compounding small improvements over time in your data, in your architecture, in your post-training, and how they integrate with each other. And DeepSeek does the same thing, and some of them are shared. We have to take them on face value that they share their most important details. I mean, the architecture and the weights are out there, so we're seeing what they're doing.

0
💬 0

2878.493 - 2879.254 Dylan Patel

And it adds up.

0
💬 0

2879.714 - 2900.668 Nathan Lambert

Going back to sort of the like efficiency and complexity point, right? It's 32 versus four, right? For like mixed draw and other MOE models that have been publicly released. So this ratio is extremely high. And sort of what Nathan was getting at there was when you have such a different level of sparsity, you can't just have every GPU have the entire model, right? The model's too big.

0
💬 0

2900.708 - 2917.426 Nathan Lambert

There's too much complexity there. So you have to split up the model. um, with different types of parallelism. Right. And so you might have different experts on different GPU nodes, but now what, what happens when a, you know, this set of data that you get, Hey, all of it looks like this one way and all of it should route to one part of my, you know, model. Right. Um,

0
💬 0

2919.288 - 2940.791 Nathan Lambert

When all of it routes to one part of the model, then you can have this overloading of a certain set of the GPU resources or a certain set of the GPUs, and then the rest of the training network sits idle because all of the tokens are just routing to that. This is one of the biggest complexities with running a very sparse mixture of experts model.

0
💬 0

2941.691 - 2963.552 Nathan Lambert

I, you know, this 32 ratio versus this four ratio is that you end up with so many of the experts just sitting there idle. So how do I load balance between them? How do I schedule the communications between them? This is a lot of the like extremely low level detailed work that they figured out in the public first and potentially like second or third in the world and maybe even first in some cases.

0
💬 0

2964.176 - 2983.944 Lex Fridman

What lesson do you, in the direction of the better lesson, do you take from all of this? Is this going to be the direction where a lot of the gain is going to be, which is this kind of low-level optimization, or is this a short-term thing where the biggest gains will be more on the algorithmic, high-level side of post-training

0
💬 0

2985.024 - 2995.21 Lex Fridman

Is this like a short-term leap because they've figured out like a hack because constraints, necessities, the mother of invention, or is there still a lot of gain?

0
💬 0

2995.39 - 3018.303 Dylan Patel

I think we should summarize what the bitter lesson actually is about. The bitter lesson, essentially, if you paraphrase it, is that the types of training that will win out in deep learning as we go are those methods that are which are scalable in learning and search is what it calls out. And This scale word gets a lot of attention in this.

0
💬 0

3019.104 - 3030.852 Dylan Patel

The interpretation that I use is effectively to avoid adding human priors to your learning process. And if you read the original essay, this is what it talks about, is how

0
💬 0

3031.773 - 3051.204 Dylan Patel

Researchers will try to come up with clever solutions to their specific problem that might get them small gains in the short term, while simply enabling these deep learning systems to work efficiently and for these bigger problems in the long term might be more likely to scale and continue to drive success.

0
💬 0

3052.802 - 3071.396 Dylan Patel

And therefore, we were talking about relatively small implementation changes to the mixture of experts model. And therefore, it's like, okay, like, we will need a few more years to know if one of these are actually really crucial to the bitter lesson. But the bitter lesson is really this long term arc of how. Simplicity can often win.

0
💬 0

3072.057 - 3083.267 Dylan Patel

There's a lot of sayings in the industry like the models just want to learn. You have to give them the simple loss landscape where you put compute through the model and they will learn and getting barriers out of the way.

0
💬 0

3084.259 - 3099.303 Lex Fridman

That's where the power of something like Nickel comes in, where standardized code that can be used by a lot of people to create sort of simple innovations that can scale, which is why the hacks, I imagine that the code base for DeepSeek is probably a giant mess.

0
💬 0

3099.563 - 3113.972 Dylan Patel

I'm sure they have, DeepSeek definitely has code bases that are extremely messy where they're testing these new ideas. multi-head latent attention. Probably could start in something like a Jupyter notebook, where somebody tries something on a few GPUs, and that is really messy.

0
💬 0

3114.373 - 3124.502 Dylan Patel

But the stuff that trains the DeepSeq v3 and DeepSeq R1, those libraries, if you were to present them to us, I would guess are extremely high-quality code.

0
💬 0

3124.722 - 3126.864 Lex Fridman

High-quality, readable code. Yeah.

0
💬 0

3127.024 - 3147.745 Nathan Lambert

I think there is one aspect to note, though, right? Is that there is the general ability for that to transfer across different types of runs, right? You may make really, really high quality code for one specific model architecture at one size. And then that is not transferable to, hey, when I make this architecture tweak, everything's broken again, right?

0
💬 0

3147.785 - 3166.579 Nathan Lambert

Like that's something that could be, you know, with their specific low level coding of like scheduling SMs is specific to this model architecture and size. Right. And whereas like NVIDIA's collectives library is more like, hey, it'll work for anything. Right. You want to do an all reduce? Great. I don't care what your model architecture is. It'll work.

0
💬 0

3167.06 - 3178.167 Nathan Lambert

And you're giving up a lot of performance when you do that in many cases. But it's worthwhile for them to do the specific optimization for the specific run, given the constraints that they have regarding compute.

0
💬 0

3178.612 - 3194.518 Lex Fridman

I wonder how stressful it is to like, you know, these frontier models, like initiate training, like to have the code to push the button that like you're now spending a large amount of money and time to train this.

0
💬 0

3195.178 - 3206.902 Lex Fridman

Like there must, I mean, there must be a lot of innovation on the debugging stage of like making sure there's no issues that you're monitoring and visualizing every aspect of the training, all that kind of stuff.

0
💬 0

3207.368 - 3226.028 Nathan Lambert

When people are training, they have all these various dashboards, but like the most simple one is your loss, right? And it continues to go down. But in reality, especially with more complicated stuff like MOE, the biggest problem with it or FP8 training, which is another innovation, you know, going to a lower precision number format, i.e. less accurate, is that you end up with loss spikes.

0
💬 0

3226.708 - 3229.249 Nathan Lambert

And no one knows why the lost spike happened.

0
💬 0

3229.289 - 3246.432 Dylan Patel

Some of them you do. Some of them are bad data. Can I give an AI2's example of what blew up our earlier models? It's a subreddit called Microwave Gang. We love to shout this out. It's a real thing. You can pull up Microwave Gang. Essentially, it's a subreddit where everybody makes posts that are just the letter M. So it's like, mmm.

0
💬 0

3246.792 - 3264.064 Dylan Patel

So there's extremely long sequences of the letter M. And then the comments are like, beep, beep, because it's in the microwave ends. But if you pass this into a model that's trained to be a normal producing text, it's extremely high loss. Because normally you see an M. You don't predict M's for a long time. So this is something that causes a lot of spikes for us.

0
💬 0

3264.384 - 3277.595 Dylan Patel

But when you have much like this is old. This is not recent. And when you have more mature data systems, that's not the thing that causes the loss spike. And what Dylan is saying is true. But it's levels to this sort of idea. With regards to the stress, right?

0
💬 0

3277.675 - 3292.451 Nathan Lambert

Yeah. These people are like, you know, you'll go out to dinner with like a friend that works at one of these labs and they'll just be like looking at their phone every like 10 minutes. And they're not like, you know, it's one thing if they're texting, but they're just like, like, is the loss. Yeah.

0
💬 0

3292.471 - 3296.715 Dylan Patel

Tokens per second. Lost, not blown up. They're just walking, watching this.

0
💬 0

3297.296 - 3299.398 Lex Fridman

And the heart rate goes up if there's a spike.

0
💬 0

3299.85 - 3310.178 Nathan Lambert

And some level of spikes is normal, right? It'll recover and be back. Sometimes a lot of the old strategy was like, you just stop the run, restart from the old version, and then like change the data mix. And then it keeps going.

0
💬 0

3310.438 - 3325.668 Dylan Patel

There are even different types of spikes. So Dirk Greneveld has a theory that I do that's like fast spikes and slow spikes, where there are sometimes where you're looking at the loss and there are other parameters, you can see it start to creep up. and then blow up. And that's really hard to recover from. So you have to go back much further.

0
💬 0

3325.788 - 3341.193 Dylan Patel

So you have the stressful period where it's like flat or might start going up. And you're like, what do I do? Whereas there are also lost spikes that are, it looks good. And then there's one spiky data point. And what you can do is you just skip those. You see that there's a spike. You're like, okay, I can ignore this data. Don't update the model and do the next one and it'll recover quickly.

0
💬 0

3341.653 - 3352.382 Dylan Patel

But these like, on trickier implementations. So as you get more complex in your architecture and you scale up to more GPUs, you have more potential for your loss blowing up.

0
💬 0

3352.442 - 3368.417 Nathan Lambert

So it's like there's a distribution. The whole idea of grokking also comes in, right? It's like just because it slowed down from improving and loss doesn't mean it's not learning because all of a sudden it could be like this and it could just spike down and loss again because it learned, truly learned something, right? And it took some time for it to learn that.

0
💬 0

3368.777 - 3375.29 Nathan Lambert

It's not like a gradual process, right? And that's what humans are like. That's what models are like. So it's really a stressful task, as you mentioned.

0
💬 0

3375.53 - 3378.296 Lex Fridman

And the whole time the dollar count is going up.

0
💬 0

3378.746 - 3398.836 Dylan Patel

Every company has failed runs. You need failed runs to push the envelope on your infrastructure. So a lot of news cycles are made of X company had Y failed run. Every company that's trying to push the frontier of AI has these. So yes, it's noteworthy because it's a lot of money and it can be week to month setback, but it is part of the process.

0
💬 0

3399.256 - 3423.001 Lex Fridman

But how do you get, if you're deep seek, how do you get to a place where, holy shit, there's a successful combination of hyperparameters? A lot of small failed runs. So rapid iteration through failed runs until... And successful ones. And then you build up some intuition like this mixture of expert works and then this implementation of MLA works.

0
💬 0

3423.381 - 3443.896 Dylan Patel

Key hyperparameters like learning rate and regularization and things like this. And you find the regime that works for your code base. I've... Talking to people at Frontier Labs, there's a story that you can tell where training language models is kind of a path that you need to follow. So you need to unlock the ability to train a certain type of model or a certain scale.

0
💬 0

3443.976 - 3456.125 Dylan Patel

And then your code base and your internal know-how of which hyperparameters work for it is kind of known. And you look at the DeepSeq papers and models, they've scaled up, they've added complexity, and it's just continuing to build the capabilities that they have.

0
💬 0

3456.425 - 3474.888 Nathan Lambert

There's the concept of a YOLO run. So YOLO, you only live once. And what it is, is like, you know, there's all this experimentation you do at the small scale, right? Research ablations, right? Like you have your Jupyter notebook where you're experimenting with MLA on like three GPUs or whatever. And you're doing all these different

0
💬 0

3475.388 - 3500.235 Nathan Lambert

uh things like hey do i do four expert four active experts 128 experts do i arrange the experts this way you know all these different uh model architecture things you're testing at a very small scale right couple researchers few gpus tens of gpus hundreds of gpus whatever it is and then all of a sudden you're like okay guys no more no more fucking around right uh no more screwing around everyone take all the resources we have let's pick what we think will work and just go for it right yolo

0
💬 0

3500.675 - 3513.522 Nathan Lambert

And this is where that sort of stress comes in as like, well, I know it works here, but some things that work here don't work here. And some things that work here don't work down here, right? In terms of scale, right? So it's really truly a YOLO run.

0
💬 0

3514.482 - 3527.389 Nathan Lambert

And sort of like there is this like discussion of like certain researchers just have like this methodical nature, like they can find the whole search space and like figure out all the ablations of different research and really see what is best. And there's certain researchers who just kind of like

0
💬 0

3528.109 - 3550.715 Dylan Patel

you know have that innate gut instinct of like this is the yolo run like you know looking at the data this is it this is why you want to work in post-training because the gpu cost for training is lower so you can make a higher percentage of your training runs yolo runs yeah for now yeah for now for now so some of this is fundamentally luck still luck is skill right in many cases

0
💬 0

3551.124 - 3565.876 Dylan Patel

Yeah, I mean, it looks lucky, right, when you're... But the hill to climb, if you're on one of these labs, you have an evaluation, you're not crushing. There's a repeated playbook of how you improve things. There are localized improvements, which might be data improvements, and these add up into the whole model just being much better.

0
💬 0

3565.896 - 3585.409 Dylan Patel

And when you zoom in really close, it can be really obvious that this model is just really bad at this thing, and we can fix it, and you just add these up. So some of it feels like luck, but on the ground, especially with these new reasoning models we're talking to, it's just... so many ways that we can poke around. And normally it's that some of them give big improvements.

0
💬 0

3585.429 - 3605.197 Nathan Lambert

The search space is near infinite, right? And yet the amount of compute and time you have is very low. And you have to hit release schedules. You have to not get blown past by everyone. Otherwise, you know, what happened with DeepSeek, you know, crushing Meta and Mistral and Cohere and all these guys, they moved too slow, right? They maybe were too methodical. I don't know.

0
💬 0

3605.217 - 3612.261 Nathan Lambert

They didn't hit the YOLO run, whatever the reason was. Maybe they weren't as skilled. You can call it luck if you want, but at the end of the day, it's skill.

0
💬 0

3612.521 - 3619.025 Lex Fridman

So 2025 is the year of the YOLO run. It seems like all the labs are going in.

0
💬 0

3619.046 - 3635.459 Nathan Lambert

I think it's even more impressive what OpenAI did in 2022. At the time, no one believed in mixture of experts models at Google, who had all the researchers. OpenAI had such little compute. And they devoted all of their compute for many months, right?

0
💬 0

3635.619 - 3655.141 Nathan Lambert

All of it, 100% for many months to GPT-4 with a brand new architecture with no belief that, hey, let me spend a couple hundred million dollars, which is all of the money I have on this model, right? That is truly YOLO, right? Now, you know, people are like, all these like training run failures that are in the media, right? It's like, okay, great.

0
💬 0

3655.181 - 3673.675 Nathan Lambert

But like, actually a lot, a huge chunk of my GPs are doing inference. I still have a bunch doing research constantly. And yes, my biggest cluster is training, but like on, on this YOLO run, but like that YOLO run is much less risky than like what opening I did in 2022 or maybe what deep seek did now, or, you know, like sort of like, Hey, we're just going to throw everything at it.

0
💬 0

3673.956 - 3684.747 Lex Fridman

The big winners throughout human history are the ones who are willing to do YOLO at some point. Okay, what do we understand about the hardware it's been trained on, DeepSeq?

0
💬 0

3685.228 - 3699.655 Nathan Lambert

DeepSeq is very interesting. This is where it's second to take us to zoom out out of who they are, first of all, right? High Flyer is a hedge fund that has historically done quantitative trading in China as well as elsewhere. And they have always had a significant number of GPUs, right?

0
💬 0

3699.955 - 3718.639 Nathan Lambert

In the past, a lot of these high frequency trading algorithmic quant traders used FPGAs, but it shifted to GPUs definitely. And there's both, right? But GPUs especially and High Flyer, which is the hedge fund that owns DeepSeek and everyone who works for DeepSeek is part of High Flyer to some extent, right? Same parent company, same owner, same CEO.

0
💬 0

3719.42 - 3733.521 Nathan Lambert

They had all these resources and infrastructure for trading, and then they devoted a humongous portion of them to training models, both language models and otherwise, right? Because these techniques were heavily AI-influenced.

0
💬 0

3735.223 - 3752.73 Nathan Lambert

More recently, people have realized, hey, trading with... Even when you go back to Renaissance and all these quantitative firms, natural language processing is the key to trading really fast, understanding a press release and making the right trade. And so DeepSeek has always been really good at this.

0
💬 0

3753.39 - 3768.819 Nathan Lambert

And even as far back as 2021, they have press releases and papers saying, hey, we're the first company in China with an A100 cluster this large. It was 10,000 A100 GPUs, right? This is in 2021. Now, this wasn't all for training, you know, large language models.

0
💬 0

3768.839 - 3786.453 Nathan Lambert

This was mostly for training models for their quantitative aspects, their quantitative trading, as well as, you know, a lot of that was natural language processing, to be clear, right? And so this is the sort of history, right? So verifiable fact is that in 2021, they built the largest Chinese cluster. At least they claim it was the largest cluster in China, 10,000 GPUs.

0
💬 0

3786.653 - 3788.275 Dylan Patel

Before expert controls started.

0
💬 0

3788.755 - 3804.269 Nathan Lambert

Yeah. It's like they've had a huge cluster before any conversation of export controls. So then you step it forward to like, what have they done over the last four years since then, right? Obviously, they've continued to operate the hedge fund, probably make tons of money. And the other thing is that they've leaned more and more and more into AI.

0
💬 0

3804.63 - 3829.306 Nathan Lambert

The CEO, Liang Qingfeng, Liang... You're not putting me spot on this. We discussed this before. Yeah. Leon Feng, the CEO, he owns maybe a little bit more than half the company, allegedly, is an extremely Elon Jensen kind of figure where he's just involved in everything. And so over that time period, he's gotten really in-depth into AI. He actually has a bit of a...

0
💬 0

3830.427 - 3833.629 Nathan Lambert

If you see some of the statements, a bit of an EAC vibe almost, right?

0
💬 0

3833.989 - 3855.221 Dylan Patel

Total AGI vibes. They're like, we need to do this. We need to make a new ecosystem of open AI. We need China to lead on this sort of ecosystem because historically the Western countries have led on software ecosystems. And he straight up acknowledges, like, in order to do this, we need to do something different. DeepSeek is his way of doing this.

0
💬 0

3855.681 - 3857.682 Dylan Patel

Some of the translated interviews with him are fantastic.

0
💬 0

3857.702 - 3862.505 Lex Fridman

So he has done interviews? Do you think you would do a Western interview or no? Or is there controls on the channel?

0
💬 0

3862.525 - 3865.647 Dylan Patel

There hasn't been one yet, but I would try it.

0
💬 0

3866.327 - 3878.533 Lex Fridman

I just got a Chinese translator, so it was great. This is El Push. So fascinating figure, engineer, pushing full-on into AI, leveraging the success from the high-frequency trading.

0
💬 0

3878.613 - 3897.107 Dylan Patel

Very direct quotes, like, we will not switch to closed source when asked about this stuff. Very long-term motivated and... how the ecosystem of AI should work. And I think from a Chinese perspective, he wants a Chinese company to build this vision.

0
💬 0

3897.327 - 3917.559 Nathan Lambert

And so this is sort of like the quote-unquote visionary behind the company, right? This hedge fund still exists, right? This quantitative firm. And so... DeepSeek is the sort of, you know, slowly he got turned to this full view of like AI, everything about this, right? But at some point it slowly maneuvered and he made DeepSeek. And DeepSeek has done multiple models since then.

0
💬 0

3917.819 - 3934.229 Nathan Lambert

They've acquired more and more GPUs. They share infrastructure with the fund. Right. And so, you know, there is no exact number of public GPU resources that they have. But besides this 10,000 GPUs that they bought in 2021. Right. And they were fantastically profitable. Right.

0
💬 0

3934.489 - 3945.477 Nathan Lambert

And then this paper claims they did only 2,000 H800 GPUs, which are a restricted GPU that was previously allowed in China, but no longer allowed. And there's a new version. But it's basically NVIDIA's H100 for China.

0
💬 0

3946.417 - 3973.032 Nathan Lambert

right um and there's some restrictions on it specifically around the communications uh sort of uh speed the interconnect speed right which is why they had to do this crazy sm you know scheduling stuff right so going back to that right looks like this is obviously not true in terms of their total gpu count obvious available gpus but for this training run you think 2000 is the correct number or no so this is where it takes um you know a significant amount of sort of like zoning in right like

0
💬 0

3974.25 - 3987.719 Nathan Lambert

What do you call your training run, right? Do you count all of the research and ablations that you ran, right? Picking all this stuff, because yes, you can do a YOLO run, but at some level, you have to do the test at the small scale, and then you have to do some test at medium scale before you go to a large scale.

0
💬 0

3988.059 - 3997.385 Dylan Patel

Accepted practice is that for any given model that is a notable advancement, you're going to do two to four X compute of the full training run in experiments alone.

0
💬 0

3997.726 - 4002.869 Lex Fridman

So a lot of this compute that's being scaled up is probably used in large part at this time for research.

0
💬 0

4003.402 - 4007.449 Nathan Lambert

Yeah, and research begets the new ideas that let you get huge efficiency.

0
💬 0

4007.53 - 4011.937 Dylan Patel

Research gets you O1. Research gets you breakthroughs, and you need to bet on it.

0
💬 0

4012.078 - 4016.546 Lex Fridman

So some of the pricing strategy that we'll discuss has the research baked into the price.

0
💬 0

4016.932 - 4034.18 Nathan Lambert

So the numbers that DeepSeq specifically said publicly, right, are just the 10,000 GPUs in 2021 and then 2,000 GPUs for only the pre-training for V3. They did not discuss cost on R1. They did not discuss cost on all the other RL, right, for the instructive model that they made, right?

0
💬 0

4034.28 - 4050.525 Nathan Lambert

They only discussed the pre-training for the base model and they did not discuss anything on research and ablations. And they do not talk about any of the resources that are shared in terms of, hey, the fund is using all these GPUs, right? And we know that they're very profitable and that 10,000 GPUs in 2021.

0
💬 0

4050.885 - 4056.707 Nathan Lambert

So some of the research that we've found is that we actually believe they have closer to 50,000 GPUs.

0
💬 0

4058.021 - 4074.064 Lex Fridman

We is semi-analysis. So we should say that you're sort of one of the world experts in figuring out what everybody's doing in terms of the semiconductor, in terms of cluster build-outs, in terms of who's doing what in terms of training runs. So yeah, so that's the we. Okay, go ahead.

0
💬 0

4074.084 - 4080.866 Nathan Lambert

Yeah, sorry. We believe they actually have something closer to 50,000 GPUs, right? Now, this is split across many tasks, right? Again, the fund.

0
💬 0

4082.486 - 4095.363 Dylan Patel

Research and ablations. For ballpark, how much would OpenAI or Anthropic have? I think the clearest example we have, because Meta is also open, they talk about like order of 60k to 100k H100 equivalent GPUs in their training clusters.

0
💬 0

4095.851 - 4110.496 Nathan Lambert

Right. So like Llama 3, they trained on 16,000 H100s, right? But the company of Meta last year publicly disclosed they bought like 400 something thousand GPUs. Yeah. Right. So of course, tiny percentage on the training. Again, like most of it is like serving me the best Instagram reels, right?

0
💬 0

4110.736 - 4121.76 Dylan Patel

Or whatever, right? I mean, we could get into a cost of like, what is the cost of ownership for a 2000 GPU cluster, 10,000? There's just different sizes of companies that can afford these things. And DeepSeek is...

0
💬 0

4122.74 - 4134.026 Dylan Patel

Can you in general actually just zoom out and also talk about the Hopper architecture, the NVIDIA Hopper GPU architecture, and the difference between H100 and H800, like you mentioned, the interconnects?

0
💬 0

4144.071 - 4165.201 Nathan Lambert

Yeah, so there's, you know, Ampere was the A100 and then H100 Hopper, right? People use them synonymously in the US because really there's just H100 and now there's H200, right? But same thing, mostly. In China, there've been different salvos of export restrictions. So initially the US government limited on a two-factor scale, right? Which is chip interconnect versus flops, right?

0
💬 0

4165.521 - 4189.345 Nathan Lambert

So any chip that had interconnects above a certain level and flops above a certain floating point operations above a certain level was restricted. Later, the government realized that this was a flaw in the restriction and they cut it down to just floating point operations. And so H-800 had high flops, low communication? Exactly. So the H-800 was the same performance as H-100 on flops.

0
💬 0

4190.886 - 4209.785 Nathan Lambert

But it just had the interconnect bandwidth cut. DeepSeq knew how to utilize this. Hey, even though we're cut back on the interconnect, we can do all this fancy stuff to figure out how to use the GPU fully anyways. And so that was back in October 2022. But later in 2023, end of 2023, implemented in 2024, the US government banned the H800, right? Yeah.

0
💬 0

4214.669 - 4233.998 Nathan Lambert

And so, by the way, this H800 cluster, these 2,000 GPUs, was not even purchased in 2024, right? It was purchased in late 2023. And they're just getting the model out now, right, because it takes a lot of research, et cetera. H800 was banned, and now there's a new chip called the H20. The H20 is cut back on only flops, but the interconnect bandwidth is the same.

0
💬 0

4234.318 - 4246.365 Nathan Lambert

And in fact, in some ways, it's better than the H100 because it has better memory bandwidth and memory capacity. So there are, you know, NVIDIA is working within the constraints of what the government says and then builds the best possible GPU for China.

0
💬 0

4246.665 - 4259.378 Lex Fridman

Can we take this actual tangent and we'll return back to the hardware? Is the philosophy, the motivation, the case for export controls? What is it? Dari Amadej has published a blog post about export controls.

0
💬 0

4260.049 - 4288.271 Lex Fridman

The case he makes is that if AI becomes super powerful, and he says by 2026, we'll have AGI or super powerful AI, and that's going to give a significant, whoever builds that will have a significant military advantage. And so, because the United States is a democracy, and as he says, China is authoritarian or has authoritarian elements, you want a unipolar world where the super powerful military

0
💬 0

4288.911 - 4316.286 Lex Fridman

because of the AI is one that's a democracy. It's a much more complicated world geopolitically when you have two superpowers with super powerful AI and one is authoritarian. So that's the case he makes. And so we want to, the United States wants to use export controls to slow down, to make sure that China can't do these gigantic training runs that will be presumably required to build AGI.

0
💬 0

4317.066 - 4339.542 Dylan Patel

This is very abstract. I think this can be the goal of how some people describe export controls, is this super powerful AI. You touched on the training run idea. There's not many worlds where China cannot train AI models. I think export controls are kneecapping the amount of compute or the density of compute that China can have.

0
💬 0

4340.402 - 4359.532 Dylan Patel

And if you think about the AI ecosystem right now, as all of these AI companies, revenue numbers are up and to the right. Their AI usage is just continuing to grow. More GPUs are going to inference. A large part of export controls, if they work, is just that the amount of AI that can be run in China is going to be much lower.

0
💬 0

4359.952 - 4378.139 Dylan Patel

So on the training side, DeepSeek v3 is a great example, which you have a very focused team that can still get to the frontier of AI. This 2,000 GPUs is not that hard to get, all considering in the world. They're still going to have those GPUs. They're still going to be able to train models. But if there's going to be a huge market for AI, if you have strong export controls and you

0
💬 0

4382.54 - 4404.311 Dylan Patel

With good expert controls, it also just makes it so that AI can be used much less. And I think that is a much easier goal to achieve than trying to debate on what AGI is. And if you have these extremely intelligent, autonomous AIs and data centers, those are the things that could be running in these GPU clusters in the United States, but not in China.

0
💬 0

4404.511 - 4427.987 Nathan Lambert

To some extent, training a model does effectively nothing, right? Yeah. The thing that Dario is sort of speaking to is the implementation of that model once trained to then create huge economic growth, huge increases in military capabilities, huge increases in productivity of people, betterment of lives, whatever you want to direct super powerful AI towards, you can't.

0
💬 0

4428.327 - 4443.454 Nathan Lambert

But that requires significant amounts of compute, right? And so the US government has effectively said, And forever, right? Training will always be a portion of the total compute. We mentioned Meta's 400,000 GPUs, only 16,000 made Lama, right?

0
💬 0

4443.494 - 4464.486 Nathan Lambert

So the percentage that Meta's dedicating to inference, now this might be for recommendation systems that are trying to hack our mind into spending more time and watching more ads, or if it's for a super powerful AI that's doing productive things, doesn't matter about the exact use that our economic system decides, it's that that can be delivered in whatever way we want. Whereas with China, right.

0
💬 0

4464.566 - 4478.502 Nathan Lambert

You know, you're, you know, expert restrictions. Great. You're never going to be able to cut everything off. Right. And that's, that's like, I think that's quite well understood by the U S government is that you can't cut everything off. You know, they'll make their own chips and they're trying to make their own chips. They'll be worse than ours.

0
💬 0

4478.542 - 4490.988 Nathan Lambert

But you know, this is the whole point is to just keep a gap. Right. And therefore, at some point as the AI, you know, in a world where two, three percent economic growth, this is really dumb, by the way, right, to cut off, you know, high tech and not make money off of it.

0
💬 0

4491.248 - 4509.034 Nathan Lambert

But in a world where super powerful AI comes about and then starts creating significant changes in society, which is what all the AI leaders and big tech companies believe, I think super powerful AI is going to change society massively. And therefore, this compounding effect of the difference in compute is really important. There's some sci-fi out there where like

0
💬 0

4509.634 - 4532.321 Nathan Lambert

ai is is like measured in the power of in like how much power is delivered to compute right or how much uh is being you know that's sort of a way of thinking about what's the economic output is just how much power you directing towards that ai should we talk about reasoning models with this as a way that this might be actionable as something that people can actually see so the reasoning models that are coming out with r1 and o1 they're designed to use more compute there's a lot of

0
💬 0

4533.061 - 4547.051 Dylan Patel

buzzy words in the AI community about this, test time compute, inference time compute, whatever. But Dylan has good research on this. You can get to the specific numbers on the ratio of when you train a model, you can look at things about the amount of compute used at training and amount of compute used at inference.

0
💬 0

4547.632 - 4560.339 Dylan Patel

These reasoning models are making inference way more important to doing complex tasks. In the fall, in December, OpenAI announced this O3 model. There's another thing in AI when things move fast, we get both announcements and releases.

0
💬 0

4560.52 - 4575.208 Dylan Patel

Announcements are essentially blog posts where you pat yourself on the back and you say you did things and releases are on the models out there, the papers out there, etc. So OpenAI has announced O3 and we can check if O3 Mini is out as of recording potentially. But that doesn't really change the point, which is that

0
💬 0

4575.888 - 4599.867 Dylan Patel

The breakthrough result was something called Arc AGI task, which is the abstract reasoning corpus, a task for artificial general intelligence. Francois Chollet is the guy who's been, it's a multi-year old paper. It's a brilliant benchmark. And the number for OpenAI 03 to solve this was that it used some sort of number of samples in the API. The API has like thinking effort and number of samples.

0
💬 0

4600.367 - 4619.188 Dylan Patel

They used a thousand samples to solve this task and it comes out to be like $5 to $20 per question, which you're putting in effectively a math puzzle. And then it takes orders of dollars to answer one question. And this is a lot of compute. If those are going to take off in the US, OpenAI needs a ton of GPUs on inference to capture this.

0
💬 0

4619.228 - 4632.78 Dylan Patel

They have this OpenAI ChatGPT Pro subscription, which is $200 a month. Which Sam said they're losing money on. Which means that people are burning a lot of GPUs on inference. And I'm I've signed up with it. I've played with it. I don't think I'm a power user, but I use it.

0
💬 0

4632.841 - 4651.557 Dylan Patel

And it's like, that is the thing that a Chinese company with mediumly strong expert controls, there will always be loopholes, might not be able to do at all. And if that, the main result for O3 is also a spectacular coding performance. And if that feeds back into AI companies being able to experiment better.

0
💬 0

4651.98 - 4670.841 Lex Fridman

So presumably the idea is for an AGI, a much larger fraction of the compute would be used for this test time compute for the reasoning. For the AGI goes into a room and thinks about how to take over the world and come back in 2.7 hours And that it's going to take a lot of computing.

0
💬 0

4670.861 - 4684.506 Dylan Patel

This is what people like CEO or leaders of open AI and anthropic talk about is like autonomous AI models, which is you give them a task and they work on it in the background. I think my personal definition of AGI is much, simpler.

0
💬 0

4684.826 - 4695.311 Dylan Patel

I think language models are a form of AGI and all of this super powerful stuff is a next step that's great if we get these tools, but a language model has so much value in so many domains. It is a general intelligence to me.

0
💬 0

4695.891 - 4706.736 Dylan Patel

But this next step of agentic things where they're independent and they can do tasks that aren't in the training data is what the few year outlook that these AI companies are driving for.

0
💬 0

4707.222 - 4728.479 Lex Fridman

I think the terminology here that Dario uses is super powerful AI. So I agree with you on the AGI. I think we already have something like that's exceptionally impressive that Alan Turing would for sure say is AGI. But he's referring more to something once in possession of, then you would have a significant military and geopolitical advantage over other nations.

0
💬 0

4728.979 - 4732.622 Lex Fridman

So it's not just like you can ask it how to cook an omelette.

0
💬 0

4732.802 - 4750.171 Dylan Patel

And he has a much more positive view in his essay, Machines of Love and Grace. I've read into this. I don't have enough background in physical sciences to gauge exactly how competent I am in if AI can revolutionize biology. I'm safe saying that AI is going to accelerate the progress of any computational science.

0
💬 0

4750.511 - 4778.802 Lex Fridman

So we're doing a depth-first search here on topics, taking tangent of a tangent. So let's continue on that depth-first search topic. You said that you're both feeling the AGI. So what's your timeline? Dario's 2026 for the super powerful AI that's basically agentic to a degree where it's a real security threat, that level of AGI. What's your timeline?

0
💬 0

4779.096 - 4799.538 Dylan Patel

I don't like to attribute specific abilities because predicting specific abilities and when is very hard. I think mostly if you're going to say that I'm feeling the AGI is that I expect continued rapid surprising progress over the next few years. So something like R1 is less surprising to me from deep seek because I expect there to be new paradigms where substantial progress can be made.

0
💬 0

4799.938 - 4815.145 Dylan Patel

I think DeepSeq R1 is so unsettling because we're kind of on this path with ChatGPT. It's like, it's getting better, it's getting better, it's getting better. And then we have a new direction for changing the models. And we took one step like this, and we took a step up. So it looks like a really fast slope, and then we're going to just take more steps.

0
💬 0

4815.465 - 4832.133 Dylan Patel

So it's just really unsettling when you have these big steps. And I expect that to keep happening. I've tried OpenAI Operator. I've tried Cloud computer use. They're not there yet. I understand the idea. But it's just so hard to predict what is the breakthrough that will make something like that work.

0
💬 0

4832.273 - 4847.704 Dylan Patel

And I think it's more likely that we have breakthroughs that work and things that we don't know what they're going to do. So everyone wants agents. Dario has a very eloquent way of describing this. And I just think that it's like, there's going to be more than that. So just expect these things to come.

0
💬 0

4848.254 - 4872.206 Lex Fridman

I'm gonna have to try to pin you down to a date on the AGI timeline. Like the nuclear weapon moment. So moment where on the geopolitical stage, There's a real, like, you know, because we're talking about export controls. When do you think, just even to throw out a date, when do you think that would be? Like, for me, it's probably after 2030.

0
💬 0

4872.387 - 4884.914 Nathan Lambert

So I'm not as... That's what I would say. So define that, right? Because to me, it kind of almost has already happened, right? You look at elections in India and Pakistan, people get AI voice calls and think they're talking to the politician, right?

0
💬 0

4885.254 - 4903.694 Nathan Lambert

The AI diffusion rules, which was enacted in the last couple of weeks of the Biden admin and looks like the Trump admin will keep and potentially even strengthen, limit cloud computing and GPU sales to countries that are not even related to China. Portugal and all these normal countries are on the you need approval from the US list.

0
💬 0

4904.054 - 4916.078 Nathan Lambert

Like, yeah, Portugal and like, you know, like, like all these countries that are allies, right? Singapore, right? Like they, they freaking have F-35s and we don't let them buy GPUs. Like this is, this to me is already to the scale of like, you know.

0
💬 0

4916.278 - 4939.755 Lex Fridman

Well, that just means that the US military is really nervous about this new technology. That doesn't mean the technology is already there. So they might be just very cautious about this thing that they don't quite understand. But that's a really good point, sort of the robocalls. Swarms of semi-intelligent bots could be a weapon, could be doing a lot of social engineering.

0
💬 0

4939.915 - 4955.925 Nathan Lambert

I mean, there's tons of talk about, you know, from the 2016 elections, like Cambridge Analytica and all this stuff, Russian influence. I mean, every country in the world is pushing stuff onto the internet and has narratives they want, right? Like that's every, every like technically competent, whether it's Russia, China, US, Israel, et cetera, right?

0
💬 0

4955.985 - 4963.249 Nathan Lambert

You know, people are pushing viewpoints onto the internet en masse and language models crash the cost of like very intelligent sounding language.

0
💬 0

4963.269 - 4980.957 Dylan Patel

There's some research that shows that the distribution is actually the limiting factor. So language models haven't yet made misinformation particularly change the equation there. The internet is still ongoing. I think there's a blog, AI Snake Oil, and some of my friends at Princeton that write on this stuff. So there is research.

0
💬 0

4981.558 - 4995.511 Dylan Patel

It's a default that everyone assumes, and I would have thought the same thing, is that Misinformation doesn't get far worse with language models. I think in terms of internet posts and things that people have been measuring, it hasn't been a exponential increase or something extremely measurable.

0
💬 0

4995.651 - 5017.389 Dylan Patel

And things you're talking about with like voice calls and stuff like that, it could be in modalities that are harder to measure. So it's something that it's too soon to tell in terms of, I think that's like political instability via the web is very, it's monitored by a lot of researchers to see what's happening. I think that you're asking about like the AGI thing.

0
💬 0

5017.409 - 5037.867 Dylan Patel

If you make me give a year, I would be like, okay, I have AI CEOs saying this. They've been saying two years for a while. I think that they're... People like Dario Anthropic, the CEO, had thought about this so deeply. I need to take their word seriously, but also understand that they have different incentives.

0
💬 0

5037.887 - 5042.013 Dylan Patel

So I'd be like, add a few years to that, which is how you get something similar to 2030 or a little after 2030.

0
💬 0

5042.794 - 5053.297 Nathan Lambert

I think to some extent we have capabilities that hit a certain point where any one person could say, oh, okay, if I can leverage those capabilities for X amount of time, this is AGI, right? Call it 27, 28.

0
💬 0

5054.557 - 5069.701 Nathan Lambert

But then the cost of actually operating that capability is so, so extreme that no one can actually deploy it at scale and mass to actually completely revolutionize the economy on a snap of a finger. So I don't think it will be like a snap of the fingerprint.

0
💬 0

5069.921 - 5086.059 Nathan Lambert

physical constraint rather it'll be a you know oh the capabilities are here but i can't deploy it everywhere right and so one one simple example going back sort of to 2023 was when uh you know being with gpt4 came out and everyone was freaking out about search right Perplexity came out.

0
💬 0

5086.339 - 5106.268 Nathan Lambert

If you did the cost on like, hey, implementing GPT-3 into every Google search, it was like, oh, okay, this is just like physically impossible to implement, right? And as we step forward to like going back to the test time compute thing, right? A query for, you know, you ask ChatGPT a question, it costs cents, right? For their most capable model of chat, right? To get a query back.

0
💬 0

5106.628 - 5131.65 Nathan Lambert

To solve an Arc AGI problem, though... cost five to 20 bucks, right? And this is- It's only going up from there. This is a thousand, 10,000 X factor difference in cost to respond to a query versus do a task. And the task of Arc AGI, it's not like it's like, it's simple to some extent, but it's also like, What are the tasks that we want? Okay, AGI, quote unquote, what we have today can do Arc AGI.

0
💬 0

5132.091 - 5153.023 Nathan Lambert

Three years from now, it can do much more complicated problems, but the cost is going to be measured in thousands and thousands and hundreds of thousands of dollars of GPU time, and there just won't be enough power, GPUs, infrastructure to operate this and therefore shift everything in the world on the snap of the finger. But at that moment, who gets to control and point the AGI at a task?

0
💬 0

5153.283 - 5175.899 Nathan Lambert

And so this was in Dario's post that he's like, hey, China can effectively and more quickly than us point their AGI at military tasks, right? And they have been in many ways faster at adopting certain new technologies into their military, right? Especially with regards to drones, right? The U.S. maybe has a longstanding, you know, large air sort of, you know, fighter jet type of thing, bombers.

0
💬 0

5175.939 - 5192.397 Nathan Lambert

But when it comes to asymmetric arms such as drones, they've completely leapfrogged the U.S. and the West. And the fear that Dario is sort of pointing out there, I think, is that Yeah, great. We'll have AGI in the commercial sector. The US military won't be able to implement it super fast.

0
💬 0

5192.758 - 5209.194 Nathan Lambert

Chinese military could, and they could direct all their resources to implementing it in the military and therefore solving military logistics or solving some other aspect of disinformation for targeted certain set of people so they can flip a country's politics or something like that that is actually catastrophic.

0
💬 0

5209.634 - 5218.838 Nathan Lambert

versus, you know, the US just wants to, you know, because it'll be more capitalistically allocated just towards whatever is the highest return on income, which might be like building, you know, factories better or whatever.

0
💬 0

5219.079 - 5240.328 Lex Fridman

So everything I've seen, people's intuition seems to fail on robotics. So you have this kind of general optimism. I've seen this on self-driving cars. People think it's much easier problem than it is. Similar with drones. Here I understand it a little bit less, but I've just seen the reality of the war in Ukraine and the usage of drones on both sides.

0
💬 0

5241.208 - 5265.2 Lex Fridman

And it seems that humans still far outperform any fully autonomous systems. AI is an assistant. but humans drive. FPV drones, where the humans control most of it, just far, far, far outperforms AI systems. So I think it's not obvious to me that we're going to have swarms of autonomous robots anytime soon in the military context.

0
💬 0

5265.84 - 5283.392 Lex Fridman

Maybe the fastest I can imagine is 2030, which is why I said 2030 for the super powerful AI. Whenever you have large-scale robots swarms of robots doing military actions, that's when the world just starts to look different to me. So that's the thing I'm really worried about.

0
💬 0

5283.952 - 5307.14 Lex Fridman

But there could be cyber war, cyber war type of technologies that from social engineering to actually just swarms of robots that find attack vectors in our code bases and shut down power grids, that kind of stuff. And it could be one of those things like on any given weekend or something, Power goes out, nobody knows why, and the world changes forever.

0
💬 0

5307.781 - 5329.829 Lex Fridman

Just power going out for two days in all of the United States, that will lead to murder, to chaos. But going back to expert controls, do you see that as a useful way to control the balance of power geopolitically in the context of AI?

0
💬 0

5330.444 - 5351.626 Nathan Lambert

And I think going back to my viewpoint is if you believe we're in this sort of stage of economic growth and change that we've been in for the last 20 years, the export controls are absolutely guaranteeing that China will win long term, right? If you do not believe AI is going to make significant changes to society in the next 10 years or five years.

0
💬 0

5352.507 - 5378.208 Nathan Lambert

Five-year timelines are sort of what the more executives and such of AI companies and even big tech companies believe. But even 10-year timelines, it's reasonable. But once you get to, hey, these timelines are below that time period, then the only way to create a sizable advantage or disadvantage for America versus China is if you constrain compute. Because

0
💬 0

5379.369 - 5391.492 Nathan Lambert

Talent is not really something that's constraining, right? China arguably has more talent, right? More STEM graduates, more programmers. The US can draw upon the world's people, which it does. There's tons of foreigners in the AI industry.

0
💬 0

5391.512 - 5395.354 Dylan Patel

So many of these AI teams are all people without a US passport.

0
💬 0

5395.374 - 5409.142 Nathan Lambert

Yeah. I mean, many of them are Chinese people who are moving to America, right? And that's great. That's There's that talent is one aspect, but I don't think that's one that is a measurable advantage for the US or not.

0
💬 0

5409.602 - 5421.655 Nathan Lambert

It truly is just whether or not compute right now, even on the compute side, when we look at chips versus data centers, right, China has the unprecedented ability to build ridiculous sums of power.

0
💬 0

5422.95 - 5448.142 Nathan Lambert

clockwork right they're always building more and more power they've got steel mills that that like individually are the size of the entire u.s industry right and they've got aluminum mills that consume gigawatts and gigawatts of power right and when we talk about what's the biggest data center right opening i made this huge thing about stargate their announcement there that's not that's like once it's fully built out in a few years it'll be two gigawatts right of power

0
💬 0

5448.722 - 5461.867 Nathan Lambert

And this is still smaller than the largest industrial facilities in China. China, if they wanted to build the largest data center in the world, if they had access to the chips, could. So it's just a question of when, not if.

0
💬 0

5462.007 - 5472.471 Lex Fridman

So their industrial capacity far exceeds the United States? Exactly. To manufacture stuff. So long term, they're going to be manufacturing chips there.

0
💬 0

5472.901 - 5488.906 Nathan Lambert

Chips are a little bit more specialized. I'm specifically referring to the data centers. Chips, fabs take huge amounts of power. Don't get me wrong. That's not necessarily the gating factor there. The gating factor on how fast people can build the largest clusters today in the US is power.

0
💬 0

5488.986 - 5506.091 Nathan Lambert

Now, it could be power generation, power transmission, substations, and all these sorts of transformers and all these things. building the data center. These are all constraints on the US industry's ability to build larger and larger training systems, as well as deploying more and more inference compute.

0
💬 0

5506.351 - 5523.376 Dylan Patel

I think we need to make the point clear on why the time is now for people that don't think about this, because essentially with export controls, you're making it so China cannot make or get cutting edge chips. And the idea is that If you time this wrong, China is pouring a ton of money into their chip production.

0
💬 0

5523.476 - 5536.878 Dylan Patel

And if you time it wrong, they are going to have more capacity for production, more capacity for energy, and figure out how to make the chips and have more capacity than the rest of the world to make the chips because everybody can buy. They're going to sell their Chinese chips to everybody. They might subsidize them.

0
💬 0

5537.358 - 5557.964 Dylan Patel

And therefore, if AI takes a long time to become differentiated, we've kneecapped the financial performance of American companies. NVIDIA can sell less. TSMC cannot sell to China. So therefore, we have less demand to therefore keep driving the production cycle. So that's the assumption behind the timing being important.

0
💬 0

5557.984 - 5581.551 Nathan Lambert

Less than 10 years or five years to above, right? China will win because of these restrictions long term unless AI does something in the short term, which I believe AI will do, you know, make massive changes to society in the medium short term, right? And so that's the big unlocker there. And even today, right, if Xi Jinping decided to get, you know, quote unquote, scale-pilled, right, i.e.

0
💬 0

5582.131 - 5591.053 Nathan Lambert

decide that scaling laws are what matters, right, just like the U.S. executives like Satya Nadella and Mark Zuckerberg and Sundar and all these U.S.

0
💬 0

5591.093 - 5609.018 Nathan Lambert

executives of the biggest, most powerful tech companies have decided they're scale-pilled and they're building multi-gigawatt data centers, right, whether it's in Texas or Louisiana or Wisconsin, wherever it is, they're building these massive things that cost millions of dollars. as much as their entire budget for spending on data centers globally in one spot, right?

0
💬 0

5609.258 - 5630.014 Nathan Lambert

This is what they've committed to for next year, year after, et cetera. And so they're so convinced that this is the way, that this is what they're doing. But if China decided to, they could do it faster than us. But this is where the restrictions come in. It is not clear that China as a whole has decided, you know, from the highest levels that this is a priority. The U.S. sort of has, right?

0
💬 0

5630.034 - 5650.285 Nathan Lambert

You know, you see Trump talking about DeepSeek and Stargate within the same week, right? And the Biden admin as well had a lot of discussions about AI and such. It's clear that they think about it. Only just last week did DeepSeek meet the second in command of China, right? Like they have not even met the top, right? They haven't met Xi. Xi hasn't sat down.

0
💬 0

5651.085 - 5675.598 Nathan Lambert

And they only just released a subsidy of a trillion RMB, you know, roughly $160 billion, which is closer to the spending of like Microsoft and Meta and Google combined, right, for this year. So it's like they're realizing it just now, but... But that's where these export restrictions come in and say, hey, you can't ship the most powerful US chips to China. You can ship a cut down version.

0
💬 0

5675.678 - 5682.925 Nathan Lambert

You can't ship the most powerful chips to all these countries who we know are just going to rent it to China. You have to limit the numbers, right?

0
💬 0

5682.965 - 5683.506 Dylan Patel

And the tools.

0
💬 0

5683.946 - 5701.662 Nathan Lambert

And same with manufacturing equipment, tools, all these different aspects. But it all stems from AI and then what downstream can slow them down in AI. And so the entire semiconductor restrictions, you read them, they are very clear. It's about AI and military civil fusion of technology. Right. It's very clear.

0
💬 0

5701.682 - 5717.547 Nathan Lambert

And then from there it goes, oh, well, we're banning them from buying like lithography tools and etch tools and deposition tools. And, oh, this random like, you know, subsystem from a random company that's like tiny. Right. Like, why are we banning this? Because all of it, the U.S. government has decided is critical to AI systems.

0
💬 0

5717.915 - 5735.001 Dylan Patel

I think the fulcrum point is the transition from 7 nanometer to 5 nanometer chips, where I think it was Huawei that had the 7 nanometer chip a few years ago, which caused another political brouhaha, almost like this moment. And then it's the ASML deep UV, what is that?

0
💬 0

5735.461 - 5757.388 Nathan Lambert

Extreme ultraviolet lithography. To set context on the chips, right, what Nathan's referring to is in 2020, Huawei released their Ascend 910 chip, which was an AI chip, first one on 7 nanometer before Google did, before NVIDIA did. And they submitted it to the MLPerf benchmark, which is sort of a industry standard for machine learning performance benchmark. And it did quite well.

0
💬 0

5757.648 - 5774.358 Nathan Lambert

And it was the best chip at the submission, right? This was a huge deal. The Trump admin, of course, banned the Huawei from getting 7 nanometer chips from TSMC. And so then they had to switch to using internal domestically produced chips, which was a multi-year setback.

0
💬 0

5774.578 - 5790.005 Dylan Patel

Many companies have done 7 nanometer chips. And the question is, we don't know how much Huawei was subsidizing production of that chip. Intel has made 7 nanometer chips that are not profitable. Right. and things like this. So this is how it all feeds back into the economic engine of export controls.

0
💬 0

5790.485 - 5806.383 Lex Fridman

Well, so you're saying that for now Xi Jinping has not felt the AGI, but it feels like the deep-seek moment might, like... There might be meetings going on now where he's going to start wearing the same T-shirt and things are going to escalate.

0
💬 0

5806.624 - 5819.33 Nathan Lambert

I mean, he may have woken up last week, right? Leon Feng met the second command guy and they had a meeting. And then the next day they announced the AI subsidies, which are a trillion RMB.

0
💬 0

5819.771 - 5823.913 Lex Fridman

So it's possible that this deep seek moment is truly the beginning of a Cold War.

0
💬 0

5825.011 - 5830.594 Dylan Patel

That's what a lot of people are worried about. People in AI have been worried that this is going towards a Cold War, or already is.

0
💬 0

5830.634 - 5848.904 Lex Fridman

But it's not DeepSeek's fault, but there's something, a bunch of factors came together where it was like this explosion. I mean, it all has to do with Nvidia stock going down. It's just some mass hysteria that happened that eventually led to Xi Jinping having meetings and waking up to this idea.

0
💬 0

5849.204 - 5870.019 Nathan Lambert

And the US government realized in October 7th, 2022, before ChatGPT released, that restriction on October 7th, which dropped and shocked everyone. And it was very clearly aimed at AI. Everyone was like, what the heck are you doing? Stable diffusion was out then, but not ChatGPT. Yeah, but not ChatGPT. So it was like starting to be rumblings. Of what Gen AI can do to society.

0
💬 0

5870.099 - 5878.945 Nathan Lambert

But it was very clear, I think, to at least like National Security Council and those sort of folks that this was where the world is headed, this Cold War that's happening. Yeah.

0
💬 0

5879.232 - 5889.148 Lex Fridman

So is there any concerns that the export controls push China to take military action on Taiwan?

0
💬 0

5889.91 - 5910.545 Nathan Lambert

This is the big risk, right? The further you push China away from having access to cutting edge American and global technologies, the more likely they are to say, well, because I can't access it, I might as well... No one should access it, right? And there's a few interesting aspects of that, right? China has a urban-rural divide like no other.

0
💬 0

5910.565 - 5931.067 Nathan Lambert

They have a male-female birth ratio like no other, to the point where if you look in Most of China, it's like the ratio is not that bad. But when you look at single dudes in rural China, it's like a 30 to 1 ratio. And those are disenfranchised dudes, right? Like, quote unquote, like the US has an incel problem like China does too. It's just they're placated in some way or cut, crushed down. What

0
💬 0

5931.107 - 5945.371 Nathan Lambert

you do with these people? And at the same time, you're not allowed to access the most important technology. At least the US thinks so. China's maybe starting to think this is the most important technology by starting to dump subsidies in it, right? They thought EVs and renewables were the most important technology. They dominate that now, right?

0
💬 0

5945.931 - 5965.619 Nathan Lambert

Now they started thinking about semiconductors in the late 2010s and early 2020s, and now they've been dumping money and they're catching up rapidly. And And they're going to do the same with AI, right? Because they're very talented, right? So the question is like, when does this hit a breaking point, right?

0
💬 0

5966.819 - 5988.412 Nathan Lambert

And if China sees this as, hey, they can continue... If not having access and starting a true hot war, right? Taking over Taiwan or trying to subvert its democracy in some way or blockading it hurts the rest of the world far more than it hurts them, this is something they could potentially do, right? And And so is this pushing them towards that? Potentially, right?

0
💬 0

5988.452 - 6001.221 Nathan Lambert

I'm not quite a geopolitical person, but it's obvious that the world regime of peace and trade is super awesome for economics. But at some point it could break, right?

0
💬 0

6001.241 - 6013.03 Dylan Patel

I think we should comment that why Chinese economy would be hurt by that is that they're export heavy. I think the United States buys so much. If that goes away, that's how their economy- Well, also they just would not be able to import

0
💬 0

6013.81 - 6035.81 Nathan Lambert

raw materials from all over the world. The US would just shut down the Strait of Malacca. At the same time, the US entire... You could argue almost all the GDP growth in America since the 70s has been either population growth or tech. Right. Because, you know, your life today is not that much better than someone from the 80s outside of tech. Right.

0
💬 0

6035.951 - 6053.279 Nathan Lambert

You still, you know, you know, cars, they all have semiconductors in them everywhere. Fridges, semiconductors everywhere. There's these funny stories about how Russians were taking apart laundry machines because they had certain like Texas instrument chips that they could then repurpose and put into like their machines. They're anti-missile missile things, right? Like their S-400 or whatever.

0
💬 0

6053.54 - 6060.004 Nathan Lambert

You would know more about this, but there's all sorts of like everything about semiconductors is so integral to every part of our lives.

0
💬 0

6060.624 - 6070.971 Lex Fridman

So can you explain the role of TSMC in the story of semiconductors and maybe also how the United States can break the reliance on TSMC?

0
💬 0

6071.423 - 6098.449 Nathan Lambert

I don't think it's necessarily breaking the reliance. I think it's getting TSMC to build in the US. So taking a step back, TSMC produces most of the world's chips, especially on the foundry side. There's a lot of companies that build their own chips. Samsung, Intel, STMicro, Texas Instruments, analog devices, all these kinds of companies build their own chips and XP.

0
💬 0

6098.649 - 6103.992 Nathan Lambert

But more and more of these companies are outsourcing to TSMC and have been for multiple decades.

0
💬 0

6104.012 - 6109.215 Lex Fridman

Can you explain the supply chain there and where most of TSMC is in terms of manufacturing?

0
💬 0

6109.475 - 6126.453 Nathan Lambert

Sure. So historically, supply chain was companies would build their own chips. It would be a company started. They'd build their own chips. And then they'd design the chip and build the chip and sell it. Over time, this became really difficult because the cost of building a fab continues to compound. every single generation.

0
💬 0

6126.533 - 6139.362 Nathan Lambert

Of course, figuring out the technology for it is incredibly difficult regardless, but just the dollars and cents that are required, ignoring, saying, hey, yes, I have all the technical capability, which it's really hard to get that by the way, right? Intel's failing, Samsung's failing, et cetera.

0
💬 0

6140.483 - 6152.132 Nathan Lambert

But if you look at just the dollars to spend to build that next generation fab, it keeps growing, right? Sort of like Moore's law is halving the cost of chips every two years. There's a separate law that's sort of like doubling the cost of fabs every handful of years.

0
💬 0

6152.492 - 6168.203 Nathan Lambert

And so you look at a leading edge fab that is going to be profitable today that's building, you know, three nanometer chips or two nanometer chips in the future. That's going to cost north of 30, 40 billion dollars. Right. And that's just for like a token amount. That's for a like that's like the base building block. You probably need to build multiple. Right.

0
💬 0

6168.583 - 6184.914 Nathan Lambert

And so when you look at the industry over the last, you know, if I go back 20, 30 years ago. there were 20, 30 companies that could build the most advanced chips, and then they would design them themselves and sell them, right? So companies like AMD would build their own chips. Intel, of course, still builds their own chips. They're very famous for it. IBM would build their own chips.

0
💬 0

6184.954 - 6203.025 Nathan Lambert

And you could keep going down the list. All these companies built their own chips. Slowly, they kept falling like flies. And that's because of what TSMC did, right? They created the foundry business model, which is, I'm not going to design any chips. I'm just going to contract manufacturer chips for other people. And one of their early customers is NVIDIA, right?

0
💬 0

6203.105 - 6221.834 Nathan Lambert

NVIDIA is the only semiconductor company that's doing more than a billion dollars of revenue that was started in the era of Foundry, right? Every other company started before then and at some point had fabs. which is actually incredible, right? You know, like AMD and Intel and Broadcom. Such a great fact.

0
💬 0

6222.294 - 6237.458 Nathan Lambert

It's like everyone had fabs at some point or, you know, some companies like Broadcom, it was like a merger, amalgamation of various companies that rolled up. But even today, Broadcom has fabs, right? They build iPhone RF radio chips sort of in... Colorado for Apple, right?

0
💬 0

6237.939 - 6253.287 Nathan Lambert

All these companies had fabs, and for most of the fabs, they threw them away or sold them off or they got rolled into something else. And now everyone relies on TSMC, right? Including Intel, their latest PC chip uses TSMC chips, right? It also uses some Intel chips, but it uses TSMC process.

0
💬 0

6253.407 - 6260.892 Lex Fridman

Can you explain why the Foundry model is so successful for these companies? Why are they going with... Economies of scale. Scale.

0
💬 0

6261.073 - 6276.364 Nathan Lambert

Yeah. So, I mean, like I mentioned, right, the cost of building a fab is so high. The R&D is so difficult. And when you look at, like, these companies that had their own vertical stack, there was an antiquated process of, like, okay, like, I'm so hyper-customized to each specific chip.

0
💬 0

6276.964 - 6293.392 Nathan Lambert

But as we've gone through the history of sort of like the last 50 years of electronics and semiconductors, A, you need more and more specialization, right? Because Moore's law has died. Denard scaling has died, i.e. chips are not getting better just for free, right? You know, from manufacturing, you have to make real architectural innovations, right?

0
💬 0

6293.732 - 6310.644 Nathan Lambert

Google is not just running on Intel CPUs for web serving. They have a YouTube chip, they have TPUs, they have pixel chips, they have a wide diversity of chips that, you know, generate all the economic value of Google, right? Running, you know, it's running all the services and stuff. And so, and this is just Google and you could go across any company in the industry and it's like this, right?

0
💬 0

6310.984 - 6325.673 Nathan Lambert

Cars contain 5,000 chips, you know, 200 different varieties of them, right? All these random things. A Tesla door handle has two chips, right? It's ridiculous. And it's a cool door handle, right? You don't think about it, but it has two really chipped penny chips in there, right?

0
💬 0

6326.053 - 6339.661 Nathan Lambert

Anyway, so as you have more diversity of chips, as you have more specialization required, and the cost of fabs continues to grow, you need someone who is laser focused on building the best process technology and making it as flexible as possible.

0
💬 0

6340.261 - 6361.657 Dylan Patel

I think you could say it simply, which is the cost per fab goes up. And if you are a small player that makes a few types of chips, you're not going to have the demand to pay back the cost of the fab. Whereas NVIDIA can have many different customers and aggregate all this demand into one place. And then they're the only person that makes enough money building chips to build the next fab.

0
💬 0

6363.078 - 6379.83 Dylan Patel

So this is kind of why the companies slowly get killed, because they have 10 years ago a chip that is profitable and is good enough, but the cost to build the next one goes up. They may try to do this, fail because they don't have the money to make it work, and then they don't have any chips, or they build it and it's too expensive and they just have...

0
💬 0

6381.131 - 6395.605 Nathan Lambert

You know, there's more failure points, right? You know, you could have one little process related to like some sort of like chemical etch or some sort of like plasma etch or, you know, some little process that screws up. You didn't engineer it right. And now the whole company falls apart. You can't make chips. Right.

0
💬 0

6395.685 - 6430.373 Nathan Lambert

And so super, super powerful companies like Intel, they had like the weathering storm to like, hey, they still exist today, even though they really screwed up their manufacturing process. Right. Right. and focusing on specific workloads rather than all of these different things. And so you get more diversity of chips.

0
💬 0

6430.393 - 6449.037 Nathan Lambert

You have more companies than ever designing chips, but you have fewer companies than ever manufacturing them, right? And this is where TSMC comes in is they've just been the best, right? They are so good at it, right? They're customer focused. They make it easy for you to fabricate your chips. They take all of that complexity and like kind of try and abstract a lot of it away from you.

0
💬 0

6449.617 - 6458.368 Nathan Lambert

They make good money. They don't make insane money, but they make good money. And they're able to aggregate all this demand and continue to build the next fab, the next fab, the next fab.

0
💬 0

6458.468 - 6465.44 Lex Fridman

So why is Taiwan so special for TSMC? Why is it happening there? Can it be replicated inside the United States?

0
💬 0

6466.181 - 6485.425 Nathan Lambert

Yeah, so there's aspects of it that I would say yes and aspects that I'd say no, right? TSMC is way ahead because former executive Morris Chang of Texas Instruments wasn't promoted to CEO. And he's like, screw this, I'm going to go make my own chip company, right? And he went to Taiwan and made TSMC, right? And there's a whole lot more story there.

0
💬 0

6486.385 - 6495.771 Nathan Lambert

So it could have been Texas Instruments, could have been TSMC, but Texas Instruments. semiconductor manufacturing, right? Instead of, you know, Texas instruments, right? But, you know, so there is that whole story there.

0
💬 0

6496.432 - 6500.776 Lex Fridman

Sitting here in Texas. I mean, and that sounds like a human story, like it didn't get promoted.

0
💬 0

6501.116 - 6521.618 Nathan Lambert

Just the brilliance of Morris Chang, you know, which I wouldn't underplay, but there's also like a different level of like how this works, right? So in Taiwan... you know, like the number, top percent of graduates, of students that go to the best school, which is NTU, the top percent of those all go work to TSMC, right? And guess what their pay is?

0
💬 0

6521.838 - 6541.583 Nathan Lambert

Their starting pay is like $80,000, $70,000, right? Which is like, that's like starting pay for like a good graduate in the US, right? Not the top, the top graduates are making hundreds of thousands of dollars at the Googles and the Amazons. And now I guess the open AIs of the world, right? So there is a large dichotomy of like, what is the top 1% of the society doing?

0
💬 0

6541.903 - 6563.112 Nathan Lambert

And where are they headed because of economic reasons, right? Intel never paid that crazy good, right? And it didn't make sense to them, right? That's one aspect, right? Where's the best going? Second is the work ethic, right? Like, you know, We like to work. You work a lot. We work a lot. But at the end of the day, what is the time and amount of work that you're doing and what does a fab require?

0
💬 0

6563.473 - 6584.266 Nathan Lambert

Fabs are not work-from-home jobs. You go into the fab and grueling work. There's, hey, if there is any amount of vibration, an earthquake happens. vibrates the machines. They're all, you know, they're either broken, you've scrapped some of your production. And then in many cases, they're like not calibrated properly. So when TSMC, when there's an earthquake, right?

0
💬 0

6584.366 - 6600.535 Nathan Lambert

Recently, there's been an earthquake. TSMC doesn't call their employees. They just go to the fab and like, they just show up, the parking lot gets slammed and people just go into the fab and fix it, right? Like it's like ants, right? Like it's like, you know, a hive of ants doesn't get told by the queen what to do,

0
💬 0

6601.095 - 6611.562 Dylan Patel

The ants just know. It's like one person just specializes on these one task. And it's like, you're going to take this one tool and you're the best person in the world. And this is what you're going to do for your whole life is this one task in the fab.

0
💬 0

6611.582 - 6630.975 Nathan Lambert

Which is like some special chemistry plus nanomanufacturing on one line of tools that continues to get iterated. And yeah, it's just like, it's like specific plasma edge for removing silicon dioxide, right? That's all you focus on your whole career. And it's like such a specialized thing. And so it's not like the tasks are transferable. AI today is awesome because like people can pick it up like,

0
💬 0

6631.055 - 6654.335 Nathan Lambert

that. Semiconductor manufacturing is very antiquated and difficult. None of the materials are online for people to read easily and learn. The papers are very dense and it takes a lot of experience to learn. And so it makes the barrier to entry much higher too. So when you talk about, hey, you have all these people that are super specialized, they will work 80 hours a week in a factory, in a fab,

0
💬 0

6655.857 - 6673.756 Nathan Lambert

And if anything goes wrong, they'll go show up in the middle of the night because some earthquake. Their wife is like, there was an earthquake. He's like, great, I'm gonna go to the fab. Would you, as an American, do that? These sorts of things are the exemplifying why TSMC is so amazing. Now, can you replicate it in the U.S. ?

0
💬 0

6674.356 - 6696.843 Nathan Lambert

Let's not ignore, Intel was the leader in manufacturing for over 20 years. They brought every technology to market first, besides EUV. Strained silicon, high-k metal gates, FinFET, you know, and the list goes on and on and on of technologies that Intel brought to market first, made the most money from, and manufactured at scale first, best technology. highest profit margins, right?

0
💬 0

6697.063 - 6712.697 Nathan Lambert

So we shouldn't ignore that Intel can't do this, right? It's that the culture has broken, right? You've invested in the wrong things. They said no to the iPhone. They had all these different things regarding like, you know, mismanagement of the fabs, mismanagement of designs, this lockup, right?

0
💬 0

6713.517 - 6727.22 Nathan Lambert

At the same time, all these brilliant people, these 50,000 PhDs or masters that have been working on specific chemical or physical processes or nanomanufacturing processes for decades in Oregon, they're still there. They're still producing amazing work.

0
💬 0

6727.52 - 6744.225 Nathan Lambert

It's just like getting it to the last mile of production at high yield where you can manufacture dozens and hundreds of different kinds of chips online. you know, and, and it's good. You customer experience has broken, right? You know, it's that customer experience. It's like the, like part of it is like, people will say Intel was too pompous in the 2000s, 2010s, right?

0
💬 0

6744.245 - 6761.192 Nathan Lambert

They just thought they were better than everyone. The tool guys were like, Oh, I don't think that this is mature enough. And they're like, ah, you just don't know. We know, right. This sort of stuff would happen. Um, and so can the U S bring it to the, uh, can the U S bring leading edge semiconductor manufacturing to the U S emphatically? Yes. Right. And we are right. It's happening.

0
💬 0

6761.212 - 6786.23 Nathan Lambert

Like Arizona is getting better and better as time goes on. TSMC has built roughly 20% of their capacity for 5 nanometer in the US, right? Now, this is nowhere near enough, right? 20% of capacity in the US is like nothing, right? And furthermore, this is still dependent on Taiwan existing, right? There's sort of important way to separate it out. There's R&D and there's high volume manufacturing.

0
💬 0

6787.171 - 6808.494 Nathan Lambert

Effectively, there are three places in the world that are doing leading edge R&D. There's Hsinchu, Taiwan, there's Hillsborough, Oregon, and there is Pyongyang, South Korea. These three places are doing the leading edge R&D for the rest of the world's leading edge semiconductors. Now, manufacturing can be distributed more globally. right?

0
💬 0

6809.435 - 6831.065 Nathan Lambert

And this is sort of where this dichotomy exists of who's actually modifying the process, who's actually developing the next generation one, who's improving them, is Hsinchu, is Hillsborough, is Pyongyang, right? It is not the rest of these fabs like Arizona, right? Arizona is a paperweight. If Hsinchu disappeared off the face of the planet, within a year, couple years,

0
💬 0

6832.751 - 6843.259 Nathan Lambert

Arizona would stop producing too, right? It's actually like pretty critical. One of the things I like to say is if I had like a few missiles, I know exactly where I could cause the most economic damage, right? It's not targeting the White House, right?

0
💬 0

6843.279 - 6844.56 Lex Fridman

It's the R&D centers.

0
💬 0

6844.64 - 6849.344 Nathan Lambert

It's the R&D centers for TSMC, Intel, Samsung, and then some of the memory guys, Micron and Hynix.

0
💬 0

6849.464 - 6861.555 Lex Fridman

Because they define the future evolution of these semiconductors and everything's moving so rapidly that it really is fundamentally about R&D. And it is all about TSMC, huh?

0
💬 0

6861.955 - 6885.187 Nathan Lambert

And so TSMC, you cannot purchase a vehicle without TSMC chips, right? You cannot purchase a fridge without TSMC chips. I think one of the few things you can purchase, ironically, is a Texas Instruments graphing calculator, right? Because they actually manufacture in Texas. But outside of that, a laptop, a phone. It's depressing. Anything you, servers, right? GPUs, none of this stuff can exist.

0
💬 0

6885.427 - 6899.097 Nathan Lambert

And this is without, without TSMC. And in many cases, it's not even like the leading edge, you know, sexy five nanometer chip, three nanometer chip, two nanometer chip. Oftentimes it's just like some stupid power IC that's like converting from like, you know, some voltage to another, right? And it's made at TSMC, right?

0
💬 0

6899.117 - 6914.473 Dylan Patel

This is what China is investing in. as well. It's like they can build out this long tail fab where the techniques are much more known. You don't have to figure out these problems with EUV. They're investing in this. And then they have large supply for things like the car door handles and the random stuff.

0
💬 0

6914.613 - 6923.342 Dylan Patel

And that trickles down into this whole economic discussion as well, which is they have far more than we do. And having supply for things like this is crucial to normal life.

0
💬 0

6923.74 - 6929.002 Lex Fridman

So they're starting to invest in high-volume manufacture, but they're not doing R&D.

0
💬 0

6929.042 - 6949.711 Nathan Lambert

So they do R&D on their own. They're just way behind, right? So I would say in 2015, China had a five-year plan where they defined by 2025 and 2020 certain goals, including 80% domestic production of semiconductors. Mm-hmm. They're not going to hit that, right, to be clear. But they are in certain areas really, really close, right?

0
💬 0

6949.751 - 6970.445 Nathan Lambert

Like BYD is probably going to be the first company in the world to not have to use TSMC for making, because they have their own FAPs, right, for making chips. Now, they still have to buy some chips from foreign, for example, like around like self-driving ADAS capabilities, because those are really high end. But at least like, you know, like an internal combustion engine has 40 chips and an EVE

0
💬 0

6971.325 - 6990.614 Nathan Lambert

just for controlling flow rates and all these things. And EVs are even more complicated. So all these different power ICs and battery management controllers and all these things, they're insourcing, right? And this is something that China has been doing since 2015. Now, as far as the trailing edge, they're getting so much capacity there. as far as the leading edge, right?

0
💬 0

6990.734 - 6999.461 Nathan Lambert

IE this five nanometer and so on, so forth, right? Where GPUs, they are still behind. And this is the US restrictions are trying to stop them in the ladder.

0
💬 0

6999.761 - 7013.152 Nathan Lambert

But, you know, all that's happened, you know, is yes, they've slowed down their five nanometer, three nanometer, et cetera, but they've accelerated their, hey, 45 nanometer, 90 nanometer power IC or analog IC or, you know, random chip in my keyboard, right? That kind of stuff.

0
💬 0

7014.116 - 7031.139 Nathan Lambert

So, so there is an angle of like the U S's actions have been so from these export, you know, from the angle of the expert controls have been so inflammatory at slowing down China's progress on the leading edge that they've turned around and have accelerated their progress elsewhere because they know that this is so important, right?

0
💬 0

7031.179 - 7047.167 Nathan Lambert

If the U S is going to lock them out here or if they lock us out here as well, uh, in the trailing edge. And so going back, can the U S build it here? Um, Yes, but it's going to take a ton of money. I truly think like to revolutionize and completely in-source semiconductors would take a decade and a trillion dollars.

0
💬 0

7047.488 - 7053.372 Lex Fridman

Is some of it also culture? Like you said, extreme competence, extreme work ethic in Taiwan.

0
💬 0

7053.392 - 7064.161 Dylan Patel

I think if you have the demand and the money is on the line, the American companies figure it out. It's going to take handholding with the government, but I think that the culture helps TSMC break through and it's easier for them.

0
💬 0

7064.641 - 7081.294 Nathan Lambert

TSMC has some like 90,000 employees, right? It's not actually that insane an amount. The Arizona fab has 3,000 from Taiwan. And these people, their wives were like, yeah, we're not going to have kids unless you sign up for the Arizona fab. We go to Arizona and we have our kids there. There's also a Japan fab where the same thing happened, right?

0
💬 0

7081.374 - 7099.868 Nathan Lambert

And so these wives drove these dudes to go to Japan or America to have the kids there. It's an element of culture. Yeah, sure. Taiwan works that hard, but also the US has done it in the past. They could do it now, right? We can just import, I say import, the best people in the world if we want to.

0
💬 0

7100.068 - 7112.456 Lex Fridman

That's where the immigration conversation is a tricky one, and there's been a lot of debate over that, but yeah. It seems absurdly controversial to import the best people in the world. I don't understand why it's controversial. That's one of the ways of winning.

0
💬 0

7112.476 - 7113.396 Dylan Patel

I'm sure we agree with you.

0
💬 0

7113.976 - 7123.921 Nathan Lambert

And even if you can't import those people, I still think you could do a lot to manufacture most of them in the US if the money's there, right? It's just way more expensive. It's not profitable for a long time.

0
💬 0

7124.261 - 7142.128 Nathan Lambert

And that's the context of like the CHIPS Act is only like $50 billion relative to some of the renewable initiatives that were passed in the Inflation Reduction Act and the Infrastructure Act, which total in the hundreds of billions of dollars, right? And so like the amount of money that the US is spending on the semiconductor industry is nothing, right?

0
💬 0

7142.728 - 7161.541 Nathan Lambert

Whereas all these other countries have structural advantages in terms of like work ethic and amount of work and things like that, but also a number of STEM graduates, the percentile of their best going to that, right? But they also have differences in terms of like, hey, there's just tax benefits in the law and have been in the law for 20 years, right?

0
💬 0

7162.162 - 7182.372 Nathan Lambert

And then some countries have massive subsidies, right? China has something like $200 billion of semiconductor subsidies a year. We're talking about $50 billion in the So the girth or difference in the subsidy amounts is also huge, right? And so I think Trump has been talking about tariffing Taiwan recently.

0
💬 0

7183.692 - 7199.355 Nathan Lambert

That's sort of like one of these things that's like, oh, okay, well, maybe he doesn't want to subsidize the semiconductor industry. Obviously, tariffing Taiwan is going to cost a lot of things to go get much more expensive, but does it change the equation for TSMC building more fabs in the US? That's what he's sort of positing, right?

0
💬 0

7199.375 - 7206.763 Lex Fridman

Yeah. So can you lay out, so we laid out the importance. By the way, it's incredible how much you know about so much.

0
💬 0

7207.444 - 7209.206 Dylan Patel

We told you Dylan knows all this stuff.

0
💬 0

7210.267 - 7246.212 Lex Fridman

Yeah. So, okay, you laid out why TSMC is really important. If we look out into the future, 10, 20 years out, US-China relationship seems like it can go to a dark place of cold war, escalated cold war, even hot war, or to a good place of anything from frenemies to cooperation to working together. So in this game theory, complicated game, What are the different trajectories?

0
💬 0

7246.272 - 7257.988 Lex Fridman

What should US be doing? What do you see as the different possible trajectories of US-China relations as both leaders start to feel the AGI more and more and see the importance of chips and the importance of AI?

0
💬 0

7258.76 - 7281.293 Dylan Patel

I mean, ultimately, the export controls are pointing towards a separate future economy. I think the US has made it clear to Chinese leaders that we intend to control this technology at whatever cost to global economic integration. It's hard to unwind that. The card has been played.

0
💬 0

7281.373 - 7304.343 Nathan Lambert

To the same extent, they've also limited US companies from entering China. It's been a long time coming. At some point, there was a convergence. But over at least the last decade, it's been branching further and further out. US companies can't enter China. Chinese companies can't enter the US. The US is saying, hey, China, you can't get access to our technologies in certain areas.

0
💬 0

7304.683 - 7318.232 Nathan Lambert

And China's rebuttaling with the same thing around like, you know, they've done some sort of specific materials in, you know, gallium and things like that, that they've tried to limit the U.S. on. There's a U.S. drone company that's not allowed to buy batteries and they have like military customers.

0
💬 0

7318.572 - 7335.264 Nathan Lambert

And this drone company just tells the military customers like, hey, just get it from Amazon because I can't actually physically get them, right? Like there's all these things that are happening that point to further and further divergence. I have zero idea. And I would love if we could We could all hold hands and sing Kumbaya, but I have zero idea how that could possibly happen.

0
💬 0

7335.544 - 7349.457 Lex Fridman

Is the divergence good or bad for avoiding war? Is it possible that the divergence in terms of manufacturer chips, of training AI systems is actually good for war? avoiding military conflict.

0
💬 0

7349.477 - 7362.886 Nathan Lambert

It's an objective fact that the world has been the most peaceful it has ever been when there are global hegemons, right? Or regional hegemons, right? In historical context, right? The Mediterranean was the most peaceful ever when the Romans were there, right?

0
💬 0

7363.146 - 7381.018 Nathan Lambert

China had very peaceful and warring times, and the peaceful times were when dynasties had a lockhold over not just themselves, but all their tributaries around them, right? And likewise, the most peaceful time in human history has been when the US was the global hegemon, right? The last, you know, decades. Now, we've sort of seen things start to slide, right?

0
💬 0

7381.038 - 7401.233 Nathan Lambert

With Russia, Ukraine, with what's going on in the Middle East and, you know, Taiwan risk, all these different things are starting to bubble up, still objectively extremely peaceful. Now, what happens when it's not one global hegemon, but it's two, obviously, and, you know, China will be, you know, competitive or even overtake the US like it's possible, right? And so this change in global hegemony

0
💬 0

7402.334 - 7422.308 Nathan Lambert

I don't think it ever happens like super peacefully, right? When empires fall, right, which is a possible trajectory for America, they don't fall gracefully, right? Like they don't just slide out of irrelevance. Usually there's a lot of shaking. And so, you know, what the US is trying to do is maintain its top position. And what China is trying to do is become the top position, right?

0
💬 0

7422.789 - 7427.112 Nathan Lambert

And obviously there's butting of heads here in the most simple terms.

0
💬 0

7427.532 - 7431.334 Lex Fridman

And that could take shape in all kinds of ways, including proxy wars.

0
💬 0

7432.334 - 7443.039 Dylan Patel

It seems like it's already happening. As much as I want there to be centuries of prolonged peace, it looks like further instability internationally is ahead.

0
💬 0

7443.56 - 7465.675 Nathan Lambert

And the U.S. 's current task is like, hey, if we control AI, if we're the leader in AI, and AI significantly accelerates progress, then we can maintain the global hegemony position. I hope that works. And as an American, like, you know, kind of like, okay, I guess that's going to lead to peace for us. Now, obviously, other people around the world get affected negatively.

0
💬 0

7465.695 - 7478.008 Nathan Lambert

You know, obviously, the Chinese people are not going to be in as advantageous of a position if that happens. But, you know, this is sort of the reality of like what's being done and the actions that are being carried out.

0
💬 0

7478.578 - 7499.371 Lex Fridman

So can we go back to the specific detail of the different hardware? There's this nice graphic in the export controls of which GPUs are allowed to be exported and which are not. Can you kind of explain the difference? Is there, from a technical perspective, are the H20s promising?

0
💬 0

7502.751 - 7523.107 Nathan Lambert

Yeah, so this goes, and I think we'd have to like, we need to dive really deep into the reasoning aspect and what's going on there. But the H20, you know, the US has gone through multiple iterations of the export controls, right? This H800 was at one point allowed back in 23, but then it got canceled. And by then, you know, DeepSeek had already built their cluster of, they claim 2K.

0
💬 0

7523.127 - 7540.3 Nathan Lambert

I think they actually have like many more, like something like 10K of those. And now this H20 is the legally allowed chip, right? NVIDIA shipped a million of these last year to China. For context, it was like four or five million GPUs. So the percentage of GPUs that were this China-specific H20 is quite high, roughly 20%, 25%, 20% or so.

0
💬 0

7540.32 - 7568.816 Nathan Lambert

And so this H20 has been neutered in one way, but it's actually upgraded in other ways. And you could think of chips along three axes for AI, ignoring software stack and exact architecture, just raw specifications. There's floating point operations, flops. There is memory bandwidth, i.e. in memory capacity, IO, memory. And then there is interconnect, chip-to-chip interconnections.

0
💬 0

7569.176 - 7591.779 Nathan Lambert

All three of these are incredibly important for... making AI systems, right? Because AI systems involve a lot of compute. They involve a lot of moving memory around, whether it be to memory or to other chips, right? And so these three vectors, the US initially had two of these vectors controlled and one of them not controlled, which was flops and interconnect bandwidth were initially controlled.

0
💬 0

7592.459 - 7613.394 Nathan Lambert

And then they said, no, no, no, no, we're going to remove the interconnect bandwidth and just make it a very simple only flops. But now NVIDIA can now make a chip that has, okay, it's cut down on flops. It's like one third that of the H100 on spec sheet paper performance for flops. In real world, it's closer to like half or maybe even like 60% of it.

0
💬 0

7613.974 - 7632.367 Nathan Lambert

But then on the other two vectors, it's just as good for interconnect bandwidth. And then for memory bandwidth and memory capacity, the H20 has more memory bandwidth and and more memory capacity than the H100, right? Now, recently, you know, we at our research, we cut NVIDIA's production for H20 for this year down drastically.

0
💬 0

7632.387 - 7646.636 Nathan Lambert

They were going to make another 2 million of those this year, but they just canceled all the orders a couple of weeks ago. In our view, that's because we think that they think they're going to get restricted. Because why would they cancel all these orders for H20? Because they shipped a million of them last year.

0
💬 0

7646.656 - 7670.366 Nathan Lambert

They had orders in for a couple million this year and just gone for H20, B20, a successor to H20. And now they're all gone. Now, why would they do this? I think it's very clear. The H20 is actually better for certain tasks. And that certain task is reasoning. right? Reasoning is incredibly different than... When you look at the different regimes of models, right?

0
💬 0

7670.606 - 7690.874 Nathan Lambert

Pre-training is all about flops, right? It's all about flops. There's things you do, like mixture of experts that we talked about, to trade off interconnect... Or to trade off other aspects and lower the flops and rely more on interconnect and memory. But at the end of the day, it's flops is everything, right? We talk about models in terms of how many flops they are, right?

0
💬 0

7691.255 - 7707.882 Nathan Lambert

So, like, you know, we talk about, oh, GPT-4 is 2E25, right? 2 to the 25th, you know, 25 zeros, right? Flop, right? Floating point operations. For training. For training, right? And we're talking about the restrictions for the 2E24, right?

0
💬 0

7708.243 - 7724.19 Nathan Lambert

The US has an executive order that Trump recently unsigned, which was, hey, 1E26, once you hit that number of floating point operations, you must notify the government, And you must share your results with us, right? There's a level of model where the US government must be told, right? And that's 1E26.

0
💬 0

7724.791 - 7741.74 Nathan Lambert

And so as we move forward, this is an incredibly important... Flop is the vector that the government has cared about historically, but the other two vectors are arguably just as important, right? And especially when we come to this new paradigm, which the world is only just learning about over the last six months, right? Reasoning.

0
💬 0

7742.136 - 7751.34 Lex Fridman

And do we understand firmly which of the three dimensions is best for reasoning? So interconnect, the flops don't matter as much. Is it memory?

0
💬 0

7751.8 - 7753.261 Dylan Patel

Memory, right?

0
💬 0

7753.501 - 7760.424 Nathan Lambert

We're going to get into technical stuff real fast. There's two articles in this one that I could show, maybe graphics that might be interesting for you to pull up.

0
💬 0

7761.305 - 7766.327 Lex Fridman

For the listeners, we're looking at the section of O1 inference architecture tokenomics.

0
💬 0

7767.574 - 7771.476 Nathan Lambert

You want to explain KVCache before we talk about this? I think it's better to... Okay, yeah.

0
💬 0

7771.576 - 7776.877 Dylan Patel

We need to go through a lot of specific technical things of transformers to make this easy for people.

0
💬 0

7777.258 - 7793.606 Nathan Lambert

Because it's incredibly important because this changes how models work. But I think resetting, right? Why is memory... so important. It's because so far we've talked about parameter counts, right? And mixture of experts, you can change how many active parameters versus total parameters to embed more data, but have less flops.

0
💬 0

7794.206 - 7812.917 Nathan Lambert

But more important, you know, another aspect of, you know, what's part of this humongous revolution in the last handful of years is the transformer, right? And the attention mechanism. Attention mechanism is that the model understands the relationships between all the words in its context, right? And that is separate from the parameters themselves. right?

0
💬 0

7813.198 - 7825.855 Nathan Lambert

And that is something that you must calculate, right? How each token, right, each word in the context length is relatively connected to each other, right? And I think, Nathan, you should explain KVCache better.

0
💬 0

7826.116 - 7827.458 Lex Fridman

KVCache is one of the optimizations.

0
💬 0

7827.478 - 7840.509 Dylan Patel

Yeah, so the attention operator has three core things. It's queries, keys, and values. QKV is the thing that goes into this. You'll look at the equation. You see that these matrices are multiplied together.

0
💬 0

7840.969 - 7856 Dylan Patel

These words, query, key, and value come from information retrieval backgrounds where the query is the thing you're trying to get the values for and you access the keys and the values is reweighting. My background's not in information retrieval and things like this. It's just fun to have backlinks. And

0
💬 0

7856.801 - 7876.545 Dylan Patel

What effectively happens is that when you're doing these matrix multiplications, you're having matrices that are of the size of the context length. So the number of tokens that you put into the model and the KV cache is effectively some form of compressed representation of all the previous tokens in the model. So when you're doing this, we talk about autoregressive models.

0
💬 0

7877.025 - 7894.35 Dylan Patel

You predict one token at a time. You start with whatever your prompt was. You ask a question like who was the president in 1825? the model then is going to generate its first token. For each of these tokens, you're doing the same attention operator where you're multiplying these query key value matrices.

0
💬 0

7894.55 - 7917.928 Dylan Patel

But the math is very nice so that when you're doing this repeatedly, this KV cache, this key value matrix, operation, you can keep appending the new values to it. So you keep track of what your previous values you're inferring over in this autoregressive chain. You keep it in memory the whole time. And this is a really crucial thing to manage when serving inference at scale.

0
💬 0

7918.709 - 7935.599 Dylan Patel

There are far bigger experts in this, and there are so many levels of detail that you can go into. Essentially, one of the key quote-unquote drawbacks of the attention operator and the transformer is that there is a form of quadratic memory cost in proportion to the context length.

0
💬 0

7935.979 - 7954.368 Dylan Patel

So as you put in longer questions, the memory used in order to make that computation is going up in the form of a quadratic. You'll hear about a lot of other language model architectures that are like sub-quadratic or linear attention forms, which is like state-space models. We don't need to go down all these now.

0
💬 0

7954.528 - 7964.553 Dylan Patel

And then there's innovations on attention to make this memory usage and the ability to attend over long contexts much more accurate and high performance.

0
💬 0

7964.613 - 7968.277 Lex Fridman

And those innovations are going to help you with... I mean, you're highly memory constrained.

0
💬 0

7968.297 - 7987.217 Dylan Patel

They help with memory constraint and performance. So if you put in a book into... I think Gemini is the model that has the longest context length that people are using. Gemini is known for 1 million and now 2 million context length. You put a whole book into Gemini and... Sometimes it'll draw facts out of it. It's not perfect. They're getting better. So there's two things.

0
💬 0

7987.697 - 8005.113 Dylan Patel

One, to be able to serve this on the memory level. Google has magic with their TPU stack where they can serve really long contexts. And then there's also many decisions along the way to actually make long context performance work. This implies the data. There's subtle changes to these computations and attention. And it changes the architecture.

0
💬 0

8005.533 - 8020.06 Dylan Patel

But serving long context is extremely memory-consuming. constrained, especially when you're making a lot of predictions. I actually don't know why input and output tokens are more expensive, but I think essentially output tokens, you have to do more computation because you have to sample from the model.

0
💬 0

8020.26 - 8044.275 Nathan Lambert

I can explain that. So today, if you use a model, like you look at an API, OpenAI charges a certain price per million tokens, right? And that price for input and output tokens is different, right? And the reason is that when you're inputting a query into the model, right? Let's say you have a book, right? That book, you must now calculate the entire KV cache for it, right? This key value cache.

0
💬 0

8044.696 - 8062.889 Nathan Lambert

And so when you do that, that is a parallel operation. All of the tokens can be processed at one time. And therefore, you can dramatically reduce how much you're spending, right? The flop requirements for generating a token and an input token are identical, right? If I input one token or if I generate one token, it's completely identical. I have to go through the model.

0
💬 0

8063.27 - 8073.017 Nathan Lambert

But the difference is that I can do that input, i.e. the pre-fill, i.e. the prompt, simultaneously in a batch nature. And therefore, it is all flopped.

0
💬 0

8073.246 - 8078.59 Lex Fridman

I think the pricing model mostly they use for input tokens is about one-fourth the price of the output tokens.

0
💬 0

8078.61 - 8095.761 Nathan Lambert

Correct. But then output tokens, the reason why it's so expensive is because I can't do it in parallel, right? It's autoregressive. Every time I generate a token, I must not only read the whole entire model into memory and activate it, calculate it to generate the next token, I also have to read the entire KV cache.

0
💬 0

8096.141 - 8114.251 Nathan Lambert

and I generate a token, and I append that KV, that one token I generated, and it's KV cash, and then I do it again, right? And so therefore, this is a non-parallel operation. And this is one where you have to, you know, in the case of pre-fill or prompt, you pull the whole model in and you calculate 20,000 tokens at once, right?

0
💬 0

8114.291 - 8132.5 Dylan Patel

So these are features that APIs are shipping, which is like, prompt caching, pre-filling, because you can drive prices down and you can make APIs much faster. If you know you're going to keep, if you run a business and you're going to keep passing the same initial content to Cloud's API, you can load that in to the Anthropic API and always keep it there.

0
💬 0

8132.96 - 8145.727 Dylan Patel

But it's very different than we're kind of leading to the reasoning models, which we talked, we showed this example earlier and read some of this kind of mumbling stuff. And what happens is that the output context length is so much higher.

0
💬 0

8145.827 - 8161.557 Dylan Patel

And I mean, I learned a lot about this from Dylan's work, which is essentially, as the output length gets higher, you're writing this quadratic in terms of memory used. And then the GPUs that we have, effectively, you're going to run out of memory, and they're all trying to serve multiple requests at once.

0
💬 0

8161.577 - 8179.749 Dylan Patel

So they're doing this batch processing where not all of the prompts are exactly the same, really complex handling. And then as context length gets longer, there's this, I think you call it critical batch size, where your ability to serve So how much you can parallelize your inference plummets because of this long contract.

0
💬 0

8179.789 - 8189.116 Dylan Patel

So your memory usage is going way up with these reasoning models, and you still have a lot of users. So effectively, the cost to serve multiplies by a ton.

0
💬 0

8189.416 - 8193.939 Lex Fridman

And we're looking at a plot when the x-axis is sequence length.

0
💬 0

8194.326 - 8203.853 Nathan Lambert

i.e. how many tokens are being generated slash prompt, right? So if I put in a book, that's a million tokens, right? But, you know, if I put in, you know, the sky is blue, then that's like six tokens or whatever.

0
💬 0

8203.873 - 8209.857 Lex Fridman

And we should say that what we're calling reasoning and chain of thought is extending this sequence length.

0
💬 0

8209.877 - 8229.232 Nathan Lambert

It's mostly output tokens. So before, you know, three months ago, whenever O1 launched, all of the use cases for long context length were like, let me put a ton of documents in and then get an answer out, right? And it's a single, you know, Pre-fill, compute a lot in parallel, and then output a little bit. Now, with reasoning and agents, this is a very different idea, right?

0
💬 0

8229.532 - 8242.404 Nathan Lambert

Now, instead, I might only have like, hey, do this task, or I might have all these documents. But at the end of the day, the model is not just like producing a little bit, right? It's producing tons. Tons of information, this chain of thought just continues to go and go and go and go.

0
💬 0

8242.744 - 8262.862 Nathan Lambert

And so the sequence length is effectively that, you know, if it's generated 10,000 tokens, it's 10,000 sequence length, right? Or, and plus whatever you inputted in the prompt. And so what this chart is showing, and it's a logarithmic chart, right? Is, you know, as you go from 1K to 4K or 4K to 16K, the memory requirements grow so fast

0
💬 0

8263.522 - 8286.133 Dylan Patel

for your kv cache that you end up not being able to run uh a certain number of you know uh you know your sequence length is capped or the number of users let's say the model so this is this is showing for a 405b model and batch size 64 llama 3145b yeah and batch size is crucial to essentially they just like you want to have higher batch size to parallelize parallel your throughput.

0
💬 0

8286.274 - 8304.124 Nathan Lambert

64 different users at once, right? Yeah. And therefore your serving costs are lower, right? Because the server costs the same, right? This is eight H100s, roughly $2 an hour per GPU. That's $16 an hour, right? That is like somewhat of a fixed cost. You can do things to make it lower, of course, but like it's like $16 an hour. Now, how many users can you serve? How many tokens can you generate?

0
💬 0

8304.424 - 8323.04 Nathan Lambert

And then you divide the two and that's your cost, right? And so with reasoning models, this is where a lot of the complexity comes about and why memory is so important. Because if you have limited amounts of memory, then you can't serve so many users. If you have limited amounts of memory, your serving speeds get lower, right? And so your costs get a lot, lot worse, right?

0
💬 0

8323.36 - 8340.737 Nathan Lambert

Um, because all of a sudden, if I was used to, Hey, on the $16 an hour server, I'm serving Lama four or five B or if I'm serving, you know, deep seek V3, um, and it's all chat style applications, i.e. we're just chatting the sequence sensor thousand few thousand, right? Uh, you know, when you use the language model, it's a few thousand context length.

0
💬 0

8340.777 - 8358.927 Nathan Lambert

Most of the time, sometimes you're dropping a big document, but then you process it, you get your answer, you throw it away, right? You, you move on to the next thing, right? Whereas with reasoning, I'm now generating tens of thousands of tokens in sequence, right? And so this memory, this KV cache has to stay resident and you have to keep loading it. You have to keep it in memory constantly.

0
💬 0

8359.187 - 8370.531 Nathan Lambert

And now this butts out other users, right? If there's now a reasoning task, right? And the model is capable of reasoning, then all of a sudden that memory pressure means that I can't serve as many users simultaneously.

0
💬 0

8370.671 - 8389.426 Dylan Patel

Let's go into deep seek again. So we're in the post deep seek R1 time, I think. And there's two sides to this market watching how hard it is to serve it. On one side, we're going to talk about DeepSeek themselves. They now have a chat app that got to number one on the App Store. Disclaimer, number one on the App Store is measured by velocity.

0
💬 0

8389.506 - 8401.738 Dylan Patel

So it's not necessarily saying that more people have the DeepSeek app than the ChatGPT app. But it is... Still remarkable, Claude has never hit the number one in the app store, even though everyone in San Francisco is like, oh my God, you got to use Claude, don't use strategy BT. So DeepSeek hit this.

0
💬 0

8402.078 - 8412.67 Dylan Patel

They also launched an API product recently where you can ping their API and get these super long responses for R1 out. In... At the same time as these are out, we'll get to what's happened to them.

0
💬 0

8413.531 - 8433.58 Dylan Patel

Because the model weights for DeepSeq R1 are openly available and the license is very friendly, the MIT license is commercially available, all of these mid-sized companies and big companies are trying to be first to serve R1 to their users. We were trying to evaluate R1 because we have really similar research going on. We released the model and we're trying to compare to it.

0
💬 0

8433.74 - 8445.447 Dylan Patel

And out of all the companies that are... quote unquote serving R1 and they're doing it at prices that are way higher than the deep seek API. Most of them barely work and the throughput is really low.

0
💬 0

8445.667 - 8455.954 Nathan Lambert

To give context, right? Everyone, one of the parts of like freaking this out was like trying to reach the capabilities. The other aspect is they did it so cheap, right? And the so cheap, we kind of talked about on the training side, why it was so cheap.

0
💬 0

8455.974 - 8462.678 Lex Fridman

Yeah, let's talk about why it's so cheap on the inference. It works well and it's cheap. Why is R1 so damn cheap?

0
💬 0

8462.958 - 8478.91 Nathan Lambert

So I think there's a couple factors here, right? One is that they do have model architecture innovations, right? This MLA, this new attention that they've done is different than the attention from attention is all you need to transform our attention, right? Now, others have already innovated.

0
💬 0

8478.93 - 8488.097 Nathan Lambert

There's a lot of work like MQA, GQA, local, global, all these different innovations that like try to bend the curve, right? It's still quadratic, but the constant is now smaller, right?

0
💬 0

8488.257 - 8498.829 Dylan Patel

Related to our previous discussion, this multi-head latent attention can save about 80% to 90% in memory from the attention mechanism, which helps especially along context.

0
💬 0

8499.029 - 8503.775 Nathan Lambert

It's 80% to 90% versus the original, but then versus what people are actually doing. It's still an innovation.

0
💬 0

8504.035 - 8509.021 Dylan Patel

This 80% to 90% doesn't say that the whole model is 80% to 90% cheaper, just this one part of it.

0
💬 0

8509.121 - 8527.154 Nathan Lambert

Well, and not just that, right? Like other people have implemented techniques like local-global and sliding window and GQMQA. But anyways, like DeepSeq has their attention mechanism as a true architectural innovation. They did tons of experimentation. And this dramatically reduces the memory pressure. It's still there, right? It's still attention. It's still quadratic.

0
💬 0

8527.534 - 8530.316 Nathan Lambert

It's just dramatically reduced it relative to prior forms.

0
💬 0

8530.756 - 8536.7 Lex Fridman

That's the memory pressure. I should say, in case people don't know, R1 is 27 times cheaper than O1.

0
💬 0

8539.543 - 8544.348 Dylan Patel

We think that OpenAI had a large margin built in. There's multiple factors.

0
💬 0

8544.388 - 8555.802 Lex Fridman

We should break down the factors, I think. It's $2 per million token output for R1 and $60 per million token output for R1. Yeah, let's look at this.

0
💬 0

8557.6 - 8573.122 Nathan Lambert

So I think this is very important, right? OpenAI is, you know, that drastic gap between DeepSeek and pricing. But DeepSeek is offering the same model because they open-weighted it to everyone else for a very similar, like much lower price than what others are able to serve it for.

0
💬 0

8573.542 - 8598.02 Lex Fridman

right um so there's there's two factors here right their model is cheaper right um it is 27 times cheaper well i don't remember the number exactly off top of my head so we're looking at a graphic that's showing different places serving v3 deep seek v3 which is similar to deep seek r1 and there's a vast difference in uh in serving cost serving cost and what explains that difference

0
💬 0

8598.62 - 8614.069 Nathan Lambert

And so like part of it is OpenAI has a fantastic margin, right? They're serving, when they're doing inference, their gross margins are north of 75%, right? So that's a four to five X factor right there of the cost difference is that OpenAI is just making crazy amounts of money because they're the only one with the capability.

0
💬 0

8614.534 - 8616.696 Lex Fridman

Do they need that money? Are they using it for R&D?

0
💬 0

8616.796 - 8632.193 Nathan Lambert

They're losing money, obviously, as a company because they spend so much on training, right? So the inference itself is a very high margin, but it doesn't recoup the cost of everything else they're doing. So yes, they need that money because the revenue and margins pay for continuing to build the next thing, right? Alongside raising more money.

0
💬 0

8632.253 - 8635.236 Lex Fridman

So the suggestion is that DeepSeek is like really bleeding out money.

0
💬 0

8635.416 - 8650.233 Nathan Lambert

Well, so here's one thing, right? We'll get to this in a second, but like DeepSeek doesn't have any capacity to actually serve the model. They stopped signups. The ability to use it is like non-existent now, right? For most people, because so many people are trying to use it, they just don't have the GPUs to serve it.

0
💬 0

8651.475 - 8671.509 Nathan Lambert

OpenAI has hundreds of thousands of GPUs between them and Microsoft to serve their models. DeepSeq has a factor of much lower. Even if you believe our research, which is 50,000 GPUs, and a portion of those are for research, a portion of those are for the hedge fund, they still have nowhere close to the GPU volumes and capacity to serve the model at scale. So it is cheaper.

0
💬 0

8672.63 - 8692.651 Nathan Lambert

A part of that is OpenAI making a ton of money. Is DeepSeq making money on their API? Unknown. I don't actually think so. And part of that is this chart, right? Look at all the other providers, right? Together AI, Fireworks AI are very high-end companies, right? XMeta, Together AI is TreeDAO and the inventor of like Flash Attention, right? Which is a huge efficiency technique, right?

0
💬 0

8692.671 - 8708.141 Nathan Lambert

They're very efficient, good companies. And I do know those companies make money, right? Not tons of money on inference, but they make money. And so they're serving at like a five to seven X difference in cost, right? And so now when you equate, okay, OpenAI is making tons of money, that's like a five X difference.

0
💬 0

8708.781 - 8721.849 Nathan Lambert

And the companies that are trying to make money for this model is like a five X difference. There is still a gap, right? There's still a gap. And that is just DeepSeq being really freaking good, right? The model architecture, MLA, the way they did the MOE, all these things, there is like legitimate just efficiency difference.

0
💬 0

8722.069 - 8727.456 Dylan Patel

All their low-level libraries that we talked about in training, some of them probably translate to inference, and those weren't released.

0
💬 0

8727.556 - 8734.424 Lex Fridman

So we may go a bit into conspiracy land, but is it possible the Chinese government is subsidizing DeepSeek?

0
💬 0

8735.035 - 8749.727 Nathan Lambert

I actually don't think they are. I think when you look at the Chinese labs, there's Huawei has a lab, Moonshot AI. There's a couple other labs out there that are really close with the government. And then there's labs like Alibaba and DeepSeek, which are not close with the government.

0
💬 0

8751.228 - 8772.426 Nathan Lambert

And we talked about the CEO, this reverent figure who's quite different, who has very different viewpoints based on the Chinese interviews that are translated than what the CCP might necessarily want. Now, to be clear, right, does he have a loss leader because he can fund it through his hedge fund? Yeah, sure. So the hedge fund might be subsidizing it. Yes. I mean, they absolutely did, right?

0
💬 0

8772.446 - 8783.536 Nathan Lambert

Because DeepSeek has not raised much money. They're now trying to raise around in China, but they have not raised money historically. It's all just been funded by the hedge fund. And he owns like over half the company, like 50, 60% of the company is owned by him.

0
💬 0

8783.717 - 8796.684 Dylan Patel

Some of the interviews, there's discussion on how like doing this as a recruiting tool. You see this at the American companies too. It's like, Having GPUs, recruiting tool. Being at the cutting edge of AI, recruiting tool. Open sourcing. Open sourcing, recruiting tool.

0
💬 0

8796.704 - 8800.805 Nathan Lambert

They were so far behind and they got so much talent because they just open sourced stuff.

0
💬 0

8800.985 - 8823.172 Lex Fridman

More conspiracy thoughts. Is it possible, since they're a hedge fund, that they timed everything with this release and the pricing And they shorted NVIDIA stock and stock of USA ad companies and released it with just perfect timing to be able to make money.

0
💬 0

8823.192 - 8832.596 Dylan Patel

They released it on Inauguration Day. They know what is on the international calendar, but I don't expect them to. If you listen to their motivations for AI, it's like,

0
💬 0

8833.937 - 8850.786 Nathan Lambert

They released V3 on December 26th. Who releases the day after Christmas? No one looks, right? They released the papers before this, right? The V3 paper and the R1 paper. So people have been looking at it and be like, wow. And then they just released the R1 model. I think they're just shipping as fast as they can. And who cares about Christmas?

0
💬 0

8850.826 - 8860.731 Nathan Lambert

Who cares about... Get it out before Chinese New Year, right? Obviously, which just happened. I don't think they actually were timing the market or trying to make the biggest splash possible. I think they're just shipping. I don't know.

0
💬 0

8860.991 - 8881.713 Dylan Patel

I think that's one of their big advantages. We know that a lot of the American companies are very invested in safety. And that is the central culture of a place like Anthropic. And I think Anthropic sounds like a wonderful place to work. But if safety is your number one goal, it takes way longer to get artifacts out. That's why Anthropic is not open sourcing things. That's their claims.

0
💬 0

8882.373 - 8898.945 Dylan Patel

But there's reviews internally. Anthropic mentions things to international governments. There's been news of how Anthropic has done pre-release testing with the UK AI Safety Institute. All of these things add inertia to the process of getting things out. And we're on this trend line where the progress is very high.

0
💬 0

8899.245 - 8911.194 Dylan Patel

So if you reduce the time from when your model is done training, you run evals, it's good. You want to get it out as soon as possible to maximize the perceived quality of your outputs. Deep Seat does this so well.

0
💬 0

8911.394 - 8931.103 Nathan Lambert

Dario explicitly said Claude 3.5 Sonnet was trained like nine months or nine to 10 months ago, nine to 10 months ago. And I think it took them another like handful of months to release it. Right. So it's like there is there is a significant gap here. Right. And especially with reasoning models, the word in the San Francisco street is that like Anthropic has a better model than 03. Right.

0
💬 0

8931.223 - 8945.77 Nathan Lambert

And they won't release it. Why? Because chains of thought are scary. Right. And they are legitimately scary. Right. If you look at R1, it flips back and forth between Chinese and English. Sometimes it's gibberish. And then the right answer comes out. Right. And like for you and I, it's like great.

0
💬 0

8945.79 - 8951.112 Dylan Patel

I mean, like people are infatuated with you. You're telling me this is a high value thing and it works and it's

0
💬 0

8951.212 - 8968.896 Nathan Lambert

doing this it's amazing I mean you talked about that sort of like chain of thought for that philosophical thing which is not something they trained it to be philosophically good it's just sort of an artifact of the chain of thought training it did but like that's super important in that like Can I inspect your mind and what you're thinking right now? No.

0
💬 0

8969.497 - 8986.773 Nathan Lambert

And so I don't know if you're lying to my face. And chain of thought models are that way, right? Like this is a true quote unquote risk between, you know, a chat application where, hey, I asked the model to say, you know, bad words or whatever, or how to make anthrax. And it tells me that's unsafe. Sure. But that's something I can get out relatively easily.

0
💬 0

8987.174 - 9003.42 Nathan Lambert

What if I tell the AI to do a task and then it does the task automatically? all of a sudden randomly in a way that I don't want it, right? And now that has like much more task versus like response is very different, right? So the bar for safety is much higher. At least this is Anthropic's case, right? Like for deep seek, they're like ship, right? Yeah.

0
💬 0

9003.68 - 9020.537 Lex Fridman

So I mean, the bar for safety is probably lowered a bit because of deep seek. I mean, there's parallels here to the space race. The reason the Soviets probably put a man in space first is because their approach to safety was, the bar for safety was lower.

0
💬 0

9020.557 - 9023.56 Nathan Lambert

And they killed that dog, right? And all these things, right? So it's like.

0
💬 0

9023.62 - 9035.151 Lex Fridman

Less risk averse than the US-based program. And there's parallels here. But, you know, there's probably going to be downward pressure on that safety bar for the US companies, right?

0
💬 0

9035.171 - 9051.657 Dylan Patel

This is something that Dario talks about. It's like, That's a situation that Dario wants to avoid. Dario talks too about the difference between race to the bottom and race to the top. And the race to the top is where there's a very high standard on safety. There's a very high standard on your model forms and certain crucial evaluations.

0
💬 0

9051.737 - 9073.065 Dylan Patel

And when certain companies are really good to it, they will converge. This is the idea. And Ultimately, AI is not confined to one nationality or to one set of morals for what it should mean. And there's a lot of arguments on like, should we stop open sourcing models? And if the US stops, it's pretty clear.

0
💬 0

9073.145 - 9091.262 Dylan Patel

I mean, it's way easier to see now at DeepSeek that a different international body will be the one that builds it. We talk about the cost of training. DeepSeek has this... Shocking $5 million number. Think about how many entities in the world can afford 100 times that to have the best open source model that people use in the world. And it's like,

0
💬 0

9092.475 - 9112.56 Dylan Patel

It's a scary reality, which is that these open models are probably going to keep coming for the time being, whether or not we want to stop them. And stopping them might make it even worse and harder to prepare. But it just means that the preparation and understanding what AI can do is just so much more important. That's why I'm here at the end of the day.

0
💬 0

9112.62 - 9123.323 Dylan Patel

But it's like letting that sink into people, especially not in AI, is that like... this is coming, there are some structural things in a global, interconnected world that you have to accept.

0
💬 0

9123.863 - 9139.248 Lex Fridman

Yeah, you mentioned, you sent me something that Mark Zuckerberg mentioned on an earnings call. He said that, I think in light of some of the recent news, the new competitor, DeepSeek from China, I think it's one of the things that we're talking about is there's going to be an open source standard globally.

0
💬 0

9139.668 - 9157.025 Lex Fridman

And I think for our kind of national advantage, it's important that it's an American standard. So we take that seriously. We want to build the AI system that people around the world are using. And I think that if anything, some of the recent news has only strengthened our conviction that this is the right thing to be focused on. So yeah, open sourcing.

0
💬 0

9157.369 - 9169.299 Dylan Patel

Yeah, Mark Zuckerberg is not new to having American values in how he presents his company's trajectory. I think their products have long since been banned in China, and I respect the saying it directly.

0
💬 0

9169.74 - 9190.457 Nathan Lambert

And there's an interesting aspect of just because it's open-weighted or open-sourced doesn't mean it can't be subverted, right? There have been many open-source software bugs that have been like... For example, there was a Linux bug that was found after 10 years, which was clearly a backdoor because somebody was like, why is this taking half a second to load? This is the recent one.

0
💬 0

9190.477 - 9209.473 Nathan Lambert

Why is this taking half a second to load? And it was like, oh crap, there's a backdoor here. That's why. And it's like, this is very much possible with AI models. Today, the alignment of these models is very clear. I'm not going to say bad words. I'm not going to teach you how to make Anthrax. I'm not going to talk about Tiananmen Square.

0
💬 0

9210.253 - 9227.024 Nathan Lambert

I'm not going to, you know, things like, I'm going to say Taiwan is part of, you know, is, is just an Eastern province, right? Like, you know, all these things are like, depending on who you are, what you align, what, you know, whether, you know, and even like XAI is aligned a certain way, right? You know, there, they might be, it's not aligned in the like woke sense.

0
💬 0

9227.064 - 9243.534 Nathan Lambert

It's not aligned in like pro China sense, but there is certain things that are imbued within the model. Now, when you release this publicly in an instruct model, that's open weights, um, this can then proliferate, right? But as these systems get more and more capable, what you can embed deep down in the model is not as clear, right?

0
💬 0

9243.974 - 9262.645 Nathan Lambert

And so that is like one of the big fears is like if an American model or a Chinese model is the top model, right, you're going to embed things that are unclear. And it can be unintentional too, right? Like British English is dead because American LLMs won, right? And the internet is American and therefore like color is spelled the way Americans spell it, right?

0
💬 0

9262.745 - 9264.366 Lex Fridman

And this is just- A lot of strong words right now.

0
💬 0

9264.607 - 9267.709 Nathan Lambert

This is just- This is just the factual nature of the LLS now.

0
💬 0

9267.849 - 9275.634 Dylan Patel

I mean, it's like carp with each of you. The English is the hottest programming language, and that English is defined by a bunch of companies that primarily are in San Francisco.

0
💬 0

9276.375 - 9283.16 Lex Fridman

The right way to spell optimization is with a Z, just in case you're probably... I think it's an S in British English.

0
💬 0

9283.34 - 9304.064 Nathan Lambert

It is. Taking it as something silly, right? Like something as silly as the spelling, like which British and English, you know, Brits and Americans will like laugh about probably, right? I don't think we care that much. But like, you know, some people will, but like this can, this can boil down into like very, very important topics. Like, Hey, you know, subverting people, right?

0
💬 0

9304.625 - 9321.733 Nathan Lambert

You know, chatbots, right? Character AI has shown that they can, like, you know, talk to kids or adults. And, like, it will, like, people feel a certain way, right? And that's unintentional alignment. But, like, what happens when there's intentional alignment deep down on the open source standard? It's a backdoor today for, like, Linux, right?

0
💬 0

9322.173 - 9336.105 Nathan Lambert

right, that we discover, or some encryption system, right? China uses different encryption than NIST defines, the US NIST, because there's clearly, at least they think there's backdoors in it, right? What happens when the models are backdoors not just to computer systems, but to our minds?

0
💬 0

9336.145 - 9363.108 Dylan Patel

Yeah, they're cultural backdoors. The thing that amplifies the relevance of culture with language models is that We are used to this mode of interacting with people in back and forth conversation. And we now have a very powerful computer system that slots into a social context that we're used to, which makes people very... We don't know the extent to which people can be impacted by that.

0
💬 0

9363.548 - 9384.523 Lex Fridman

So there could be – this is an actual concern with a Chinese company that is providing open weights models is that there could be some secret Chinese government sort of requirement for these models to have a certain kind of backdoor, to have some kind of thing where – I don't necessarily think it will be a backdoor, right?

0
💬 0

9384.543 - 9400.936 Nathan Lambert

Because once it's open weights, it doesn't like phone home. It's more about, like, if it recognizes a certain system, it could... Now, it could be a backdoor in the sense of, like, hey, if you're building a software, you know, something in software, all of a sudden it's a software agent. Oh, program this backdoor that only we know about.

0
💬 0

9401.276 - 9405.479 Nathan Lambert

Or it could be, like, subvert the mind to think that, like, XYZ opinion is the correct one.

0
💬 0

9405.519 - 9426.252 Dylan Patel

Anthropic has research on this where they... show that if you put certain phrases in at pre-training, you can then elicit different behavior when you're actually using the model because they've poisoned the pre-training data. As of now, I don't think anybody in a production system is trying to do anything like this. I think it's mostly...

0
💬 0

9427.213 - 9440.265 Dylan Patel

Anthropic is doing very direct work and mostly just subtle things. We don't know what these models are going to, how they are going to generate tokens, what information they're going to represent, and what the complex representations they have are.

0
💬 0

9440.405 - 9472.477 Lex Fridman

Well, we're talking about Anthropic, which is generally just permeated with like good humans trying to do good in the world. We just don't know of any labs, this would be done in a military context, that are explicitly trained to, okay, how can we The front door looks like a happy LLM, but underneath, it's a thing that will, over time, do the maximum amount of damage to our quote-unquote enemies.

0
💬 0

9472.778 - 9494.876 Nathan Lambert

There's this very good quote from Sam Altman who, you know, he can be a hype beast sometime, but one of the things he said, and I think I agree, is that superhuman persuasion will happen before superhuman intelligence. Yeah. And if that's the case, then these things before we get this AGI-ASI stuff, we can embed superhuman persuasion towards our ideal or whatever the ideal of the model maker is.

0
💬 0

9495.497 - 9501.043 Nathan Lambert

And again, today, I truly don't believe DeepSeek has done this. But it is a sign of what could happen.

0
💬 0

9501.499 - 9525.525 Lex Fridman

So one of the dystopian worlds is described by Brave New World. So we could just be stuck scrolling Instagram, looking at cute puppies or worse, and then talking to bots that are giving us a narrative, and we completely get lost in that world that's controlled by somebody else, versus thinking independently. And that's a major concern, as we rely more and more on these kinds of systems.

0
💬 0

9526.001 - 9528.244 Dylan Patel

I mean, we've already seen this with recommendation systems.

0
💬 0

9528.284 - 9545.045 Nathan Lambert

Yeah, recommendation systems hack the dopamine induced reward circuit, but the brain is a lot more complicated. And what other sort of circuits, quote unquote, feedback loops in your brain can you hack slash subvert in ways like recommendation systems are purely just trying to do? you know, increased time and ads and et cetera.

0
💬 0

9545.065 - 9549.833 Nathan Lambert

But there's so many more goals that can be achieved through these complicated models.

0
💬 0

9550.093 - 9558.932 Dylan Patel

There's no reason in some number of years that you can't train a language model to maximize time spent on a chat app. Like right now they are trained.

0
💬 0

9558.952 - 9560.934 Nathan Lambert

I mean, is that not what character AI has done?

0
💬 0

9560.954 - 9576.587 Dylan Patel

Their time per session is like two hours. Yeah. Character AI very likely could be optimizing this where it's like the way that this data is collected is naive or it's like you're presented a few options and you choose them. But there's that's not the only way that these models are going to be trained. It's naive stuff like talk to an anime girl.

0
💬 0

9576.627 - 9579.99 Nathan Lambert

But like it can be like, yeah, this is a risk, right? Like.

0
💬 0

9580.39 - 9604.507 Lex Fridman

it's it's a bit of a cliche thing to say but i've uh over the past year had a few stretches of time where i didn't use social media or the internet at all and just read books and was out in nature and it like it clearly has a diff an effect on the mind where like it changes like i feel like i'm returning of course i was raised before the internet really took off but i'm returning to some more

0
💬 0

9606.277 - 9616.679 Dylan Patel

I know where you're going. I mean, you can see it physiologically. Like I take three days if I'm like backpacking or something and you. You're literally breaking down addiction cycles.

0
💬 0

9617.059 - 9639.514 Lex Fridman

I feel like I'm more in control of my mind. There feels like a sovereignty of intelligence that's happening when I'm disconnected from the internet. I think the more I use the internet and social media, the more other people are controlling my mind. That's definitely a feeling. And then in the future, that would be not other people but algorithms. or other people presented to me via algorithms.

0
💬 0

9639.914 - 9652.906 Dylan Patel

I mean, there are already tons of AI bots on the internet. Right now, it's not frequent, but every so often, I have replied to one, and they're instantly replying. I'm like, crap, that was a bot. And that is just going to become more common. They're going to get good.

0
💬 0

9653.486 - 9672.363 Nathan Lambert

One of the hilarious things about technology over its history is that the illicit adult entertainment industry has always adopted technologies first. Right. Whether it was like video streaming to like where, you know, there's now the like sort of like independent adult illicit content creators who have their subscription pages.

0
💬 0

9672.703 - 9688.415 Nathan Lambert

And there they actually heavily utilize, you know, generative AI has already been like diffusion models and all that is huge there. But now these like these subscription based individual creators do use bots to approximate themselves and chat with their, you know, people pay a lot for it. And people pay a lot, right?

0
💬 0

9688.695 - 9709.941 Nathan Lambert

A lot of times it's them, but a lot of there are agencies that do this for these creators and do it like on a like mass scale. So the largest creators are like able to talk to hundreds or thousands of like people at a time because of these bots. And so it's already being used there. Obviously, you know, like video streaming and other technologies have gone there first.

0
💬 0

9710.001 - 9711.321 Nathan Lambert

It's going to come to the rest of society, too.

0
💬 0

9712.486 - 9733.389 Lex Fridman

There's a general concern that models get censored by the companies that deploy them. So one case where we've seen that, and maybe censorship is one word, alignment maybe via RLHF or some other way is another word. So we saw that with Black Nazi image generation with Gemini.

0
💬 0

9735.45 - 9754.58 Lex Fridman

As you mentioned, we also see that with Chinese models refusing to answer what happened in June 4th, 1989 at Tiananmen Square. So how can this be avoided? And maybe can you just in general talk about how this happens and how can it be avoided? You give multiple examples.

0
💬 0

9755.861 - 9779.334 Dylan Patel

There's probably a few things to keep in mind here. One is the kind of Tiananmen Square factual knowledge. How does that get embedded into the models? Two is the Gemini, what you called the Black Nazi model. incident, which is when Gemini as a system had this extra thing put into it that dramatically changed the behavior.

0
💬 0

9779.795 - 9800.925 Dylan Patel

And then three is what most people would call general alignment, RLHF post-training. Each of these have very different scopes in how they are applied. In order to do, if you're just going to look at the model weights, in order to audit specific facts is extremely hard because you have to chrome through the pre-training data and look at all of this.

0
💬 0

9801.185 - 9806.709 Dylan Patel

And then that's terabytes of files and look for very specific words or hints of the words.

0
💬 0

9806.909 - 9815.915 Lex Fridman

So I guess one way to say it is that you can insert censorship or alignment at various stages in the pipeline. And what you refer to now is at the very beginning of the data.

0
💬 0

9815.935 - 9833.08 Dylan Patel

So if you want to get rid of facts in a model, you have to do it at every stage. You have to do it at the pre-training. So most people think that pre-training is where most of the knowledge is put into the model and then you can elicit and move that in different ways, whether through post-training or whether through systems afterwards.

0
💬 0

9833.26 - 9845.663 Nathan Lambert

This is where the whole like hacking models comes from, right? Like GPT will not tell you how to make anthrax, but if you try really, really hard, you can eventually get it to tell you about anthrax because they didn't filter it from the pre-training data set, right?

0
💬 0

9845.683 - 9852.384 Lex Fridman

Yeah. By the way, removing facts has such an ominous, dark feel to it.

0
💬 0

9852.404 - 9856.545 Dylan Patel

I almost think it's practically impossible. Because you effectively have to remove them from the internet.

0
💬 0

9857.005 - 9863.926 Lex Fridman

You're taking on a... Did they remove the mmm thing from the subreddits? The mmm?

0
💬 0

9864.146 - 9876.748 Dylan Patel

It gets filtered out. So you have quality filters, which are small language models that look at a document and tell you how good is this text? Is it close to a Wikipedia article, which is a good... that we want language models to be able to imitate.

0
💬 0

9876.768 - 9881.369 Lex Fridman

So couldn't you do a small language model that filters out mentions of Tiananmen Square in the data?

0
💬 0

9881.609 - 9885.87 Dylan Patel

Yes, but is it going to catch wordplay or encoded language?

0
💬 0

9885.91 - 9903.654 Nathan Lambert

I mean, people have been meaning on like games and other stuff, how to like say things that don't say Tiananmen Square. But, or like, yeah, so there's always like different ways to do it. There's, hey, the internet as a whole does tend to just have a slight left bias, right? Because it's always been richer, more affluent, right?

0
💬 0

9904.174 - 9925.613 Nathan Lambert

younger people on the internet relative to the rest of the population so there is already inherently a slight left bias right on the internet and so how do you filter things that are this complicated right is it like and and some of these can be like you know factual non-factual but like Tiananmen Square is obviously the example of a factual but it gets a lot harder when you're talking about aligning to a ideal right um

0
💬 0

9927.395 - 9942.144 Nathan Lambert

And so Grok, for example, Elon's tried really hard to make the model not be super PC and woke, but the best way to do pre-training is to throw the whole freaking internet at it and then later figure out. But then at the end of the day, the model at its core now still has some of these ideals.

0
💬 0

9942.564 - 9958.955 Nathan Lambert

You still ingested Reddit slash r slash politics, which is probably the largest political discussion board on the world that's freely available to scrape. And guess what? That's left leaning, right? And so, you know, there are some aspects like that you just can't censor unless you try really, really, really, really, really hard.

0
💬 0

9959.284 - 9965.55 Lex Fridman

So the base model will always have some TDS, trauma derangement syndrome, because it's trained so much.

0
💬 0

9965.61 - 9967.212 Dylan Patel

It'll have the ability to express it.

0
💬 0

9967.252 - 9972.697 Lex Fridman

But what if... There's a wide representation in the data.

0
💬 0

9972.797 - 9981.706 Dylan Patel

This is what happens. A lot of what is called post-training is a series of techniques to get the model... on rails of a really specific behavior.

0
💬 0

9982.087 - 9997.221 Nathan Lambert

And I mean, it's like you can, you also have the ingested data of like Twitter or like Reddit slash r slash the Donald, which is like also super pro-Trump, right? And then you have like fascist subreddits or like you have communist subreddits. So the model in pre-training ingests everything. It has no worldview.

0
💬 0

9997.522 - 10016.889 Nathan Lambert

Now, it does have some skew because more of the text is skewed a certain way, which is general, slight left, but also somewhat intellectual. It's just the general internet is a certain way. And then as Nathan's about to describe eloquently, you can elicit certain things out.

0
💬 0

10016.989 - 10037.777 Dylan Patel

And there's a lot of history here. So we can go through multiple examples and what happened. Llama 2 was a launch that the phrase like too much RLHF or like too much safety was a lot. It's just, that was the whole narrative after Lama2's chat models released. And the examples are sorts of things like you would ask Lama2 chat, how do you kill a Python process?

0
💬 0

10037.877 - 10052.444 Dylan Patel

And it would say, I can't talk about killing because that's a bad thing. And anyone that is trying to design an AI model will probably agree that that's just like, eh, model, you messed up a bit on the training there. I don't think they meant to do this, but this was in the model weight. So this is not, you know,

0
💬 0

10052.984 - 10074.324 Dylan Patel

It didn't necessarily be... There's things called system prompts, which are when you're querying a model, it's a piece of text that is shown to the model, but not to the user. So a fun example is your system prompt could be talk like a pirate. So no matter what the user says to the model, it'll respond like a pirate. In practice, what they are is... You are a helpful assistant.

0
💬 0

10074.384 - 10083.717 Dylan Patel

You should break down problems. If you don't know about something, don't tell them. Your date cut off is this. Today's date is this. It's a lot of really useful context for how can you answer a question well.

0
💬 0

10083.897 - 10085.96 Lex Fridman

An anthropic publishes their system.

0
💬 0

10085.98 - 10100.633 Dylan Patel

Yes, which I think is great. And there's a lot of research that goes into this and One of your previous guests, Amanda Askell, is probably the most knowledgeable person, at least in the combination of execution and sharing. She's the person that should talk about system prompts and character of models.

0
💬 0

10100.653 - 10110.701 Lex Fridman

Yeah, and people should read these system prompts because you're trying to nudge sometimes through extreme politeness, the model to be a certain way.

0
💬 0

10111.122 - 10124.074 Dylan Patel

And you could use this for bad things. We've done tests, which is, what if I tell the model to be a dumb model? Which evaluation scores go down? And it's like, we'll have this behavior where it could sometimes say, oh, I'm supposed to be dumb.

0
💬 0

10124.114 - 10147.516 Dylan Patel

And sometimes it doesn't affect math abilities as much, but something like, if you're trying, it's just the quality of a human judgment would drop to the floors. Let's go back to post-training, specifically RLHF around Lama 2. Too much safety prioritization was baked into the model weights. This makes you refuse things in a really annoying way for users. It's not great. It caused a lot of...

0
💬 0

10148.677 - 10170.015 Dylan Patel

um like awareness to be attached to rlhf that it makes the models dumb and it stigmatized the word it did in ai culture and as the techniques have evolved that's no longer the case where all these labs have very fine-grained control over what they get out of the models through techniques like rlhf although although different labs are definitely different levels like on the on one end of the spectrum is google

0
💬 0

10170.775 - 10181.443 Nathan Lambert

And then maybe OpenAI does less and Anthropic does less. And then on the other end of the spectrum is XAI. But they all have different forms of RLHF trying to make them a certain way.

0
💬 0

10181.683 - 10199.695 Dylan Patel

And the important thing to say is that no matter how you want the model to behave, these RLHF and preference tuning techniques also improve performance. So on things like math evals and code evals, there is something innate to these what is called contrastive loss functions. We could start to get into RL here.

0
💬 0

10199.735 - 10218.123 Dylan Patel

We don't really need to, but RLHF also boosts performance on anything from a chat task to a math problem to a code problem. So it is becoming a much more useful tool to these labs. So this kind of takes us through the arc of we've talked about pre-training, hard to get rid of things. We've talked about post-training and how post-training, if you You can mess it up.

0
💬 0

10218.283 - 10236.978 Dylan Patel

It's a complex, multifaceted optimization with 10 to 100 person teams converging on one artifact. It's really easy to not do it perfectly. And then there's the third case, which is what we talked about Gemini. The thing that was about Gemini is this was a served product where Google has their internal model weights. They've done all these processes that we talked about.

0
💬 0

10237.378 - 10257.354 Dylan Patel

And in the served product, what came out after this was that they had a prompt that they were rewriting user queries to boost diversity or something. And this just made it, the outputs were just blatantly wrong. It was some sort of organizational failure that had this prompt in that position. And I think Google executives probably have owned this. I didn't pay that attention, that detail.

0
💬 0

10257.374 - 10263.739 Dylan Patel

But it was just a mess up in execution that led to this ridiculous thing. But at the system level, the model weights might have been fine.

0
💬 0

10264.119 - 10268.541 Lex Fridman

So at the very end of the pipeline, there was a rewriting. To something like a system prompt.

0
💬 0

10268.782 - 10287.674 Dylan Patel

It was like the system prompt or what is called in industry is like you rewrite prompts. So especially for image models, if you're using Dolly or ChatGPT can generate you an image, you'll say, draw me a beautiful car. Mm-hmm. With these leading image models, they benefit from highly descriptive prompts.

0
💬 0

10288.154 - 10306.111 Dylan Patel

So what would happen is if you do that on ChatGPT, a language model behind the scenes will rewrite the prompt, say, make this more descriptive, and then that is passed to the image model. So prompt rewriting is something that is used at multiple levels of industry, and it's used effectively for image models, and the Gemini example is just a failed execution.

0
💬 0

10306.952 - 10320.062 Lex Fridman

Big philosophical question here with RLHF to generalize, where is human input, human in the loop, human data most useful at the current stage?

0
💬 0

10320.868 - 10342.259 Dylan Patel

For the past few years, the highest cost human data has been in these preferences, which is comparing, I would say highest cost and highest total usage. So a lot of money has gone to these pairwise comparisons where you have two model outputs and a human is comparing between the two of them. In earlier years, there was a lot of this instruction tuning data.

0
💬 0

10342.359 - 10356.669 Dylan Patel

So creating highly specific examples to something like a Reddit question to a domain that you care about. Language models used to struggle on math and code. So you would pay experts in math and code to come up with questions and write detailed answers that were used to train the models.

0
💬 0

10357.41 - 10375.259 Dylan Patel

Now it is the case that there are many model options that are way better than humans at writing detailed and eloquent answers for things like model and code. So They talked about this with the Lama 3 release, where they switched to using Lama 3, 4, or 5B to write their answers for math and code.

0
💬 0

10375.759 - 10392.244 Dylan Patel

But they, in their paper, talk about how they use extensive human preference data, which is something that they haven't gotten AIs to replace. There are other techniques in industry like constitutional AI, where you use human data for preferences and AI for preferences. And I expect the AI part to scale faster than the human part.

0
💬 0

10392.864 - 10398.726 Dylan Patel

But among the research that we have access to is that humans are in this kind of preference loop.

0
💬 0

10399.327 - 10405.149 Lex Fridman

So as reasoning becomes bigger and bigger and bigger, as we said, where's the role of humans in that?

0
💬 0

10405.429 - 10428.199 Dylan Patel

It's even less prevalent. So it's... The remarkable thing about these reasoning results, and especially the DeepSeq R1 paper, is this result that they call DeepSeq R1-0, which is they took one of these pre-trained models, they took DeepSeq V3 base, and then they do this reinforcement learning optimization on verifiable questions or verifiable rewards for a lot of questions and a lot of training.

0
💬 0

10428.6 - 10447.201 Dylan Patel

And these reasoning behaviors emerge naturally. So these things like, wait, let me see, wait, let me check this. Oh, that might be a mistake. And they emerge from only having questions and answers. And when you're using the model, the part that you look at is the completion. So in this case, all of that just emerges from this large scale RL training.

0
💬 0

10448.587 - 10469.25 Dylan Patel

And that model, which the weights are available, has no human preferences added into the post-training. The DeepSeq R1 full model has some of this human preference tuning, this RLHF, after the reasoning stage. But the very remarkable thing is that you can get these reasoning behaviors And it's very unlikely that there's humans writing out reasoning chains.

0
💬 0

10469.31 - 10486.434 Dylan Patel

It's very unlikely that they somehow hacked OpenAI and they got access to OpenAI 01's reasoning chains. It's something about the pre-trained language models and this RL training where you reward the model for getting the question right. And therefore, it's trying multiple solutions and it emerges this chain of thought.

0
💬 0

10487.23 - 10509.562 Lex Fridman

This might be a good place to mention the eloquent and the insightful tweet of the great and the powerful Andrzej Karpathy. I think he had a bunch of thoughts, but one of them, last thought, not sure if this is obvious. You know something profound is coming when you're saying it's not sure if it's obvious. There are two major types of learning in both children and in deep learning.

0
💬 0

10509.962 - 10534.339 Lex Fridman

There's one, imitation learning, watch and repeat, i.e. pre-training, supervised, fine-tuning, and two, trial and error learning, reinforcement learning. My favorite simple example is AlphaGo. One is learning by imitating expert players. Two is reinforcement learning to win the game. Almost every single shocking result of deep learning and the source of all magic is always two.

0
💬 0

10535.14 - 10558.585 Lex Fridman

Two is significantly more powerful. Two is what surprises you. Two is when the paddle learns to hit the ball behind the blocks and break out. Two is when AlphaGo beats even Lee Sedol. And two is the aha moment when the deep seek or O1, et cetera, discovers that it works well to reevaluate your assumptions, backtrack, try something else, et cetera.

0
💬 0

10559.125 - 10577.733 Lex Fridman

It's the solving strategies you see this model use in its chain of thought. It's how it goes back and forth thinking to itself. These thoughts are emergent, three exclamation points, and this is actually seriously incredible, impressive and new, and is publicly available and documented.

0
💬 0

10578.547 - 10599.323 Lex Fridman

The model could never learn this with imitation because the cognition of the model and the cognition of the human labeler is different. The human would never know to correctly annotate these kinds of solving strategies and what they should even look like. They have to be discovered during reinforcement learning as empirically and statistically useful towards the final outcome.

0
💬 0

10599.784 - 10607.81 Lex Fridman

Anyway, the alpha zero sort of metaphor analogy here. Can you speak to that, the magic of the chain of thought that he's referring to?

0
💬 0

10608.443 - 10627.177 Dylan Patel

I think it's good to recap AlphaGo and AlphaZero because it plays nicely with these analogies between imitation learning and learning from scratch. So AlphaGo, the beginning of the process was learning from humans where they had, they started the first, this is the first expert level Go player or chess player in DeepMind series of models where they had some human data.

0
💬 0

10627.817 - 10649.086 Dylan Patel

And then why it is called AlphaZero is that there was zero human data in the loop. And that change to AlphaZero made a model that was dramatically more powerful for DeepMind. So this remove of the human prior, the human inductive bias, makes the final system far more powerful. We mentioned Bitter Lesson hours ago, and this is all aligned with this.

0
💬 0

10649.506 - 10666.498 Dylan Patel

And then there's been a lot of discussion in language models. This is not new. This goes back to the whole QSTAR rumors, which if you piece together the pieces is probably the start of OpenAI figuring out its O1 stuff when last year in November, the QSTAR rumors came out.

0
💬 0

10667.018 - 10688.494 Dylan Patel

There's a lot of intellectual drive to know when is something like this going to happen with language models, because we know these models are so powerful and we know it has been so successful in the past. And it is a reasonable analogy that this new type of reinforcement learning training for reasoning models is when the door is open to this.

0
💬 0

10688.854 - 10708.454 Dylan Patel

We don't yet have the equivalent of turn 37, which is the famous turn where the DeepMind's AI playing ghost dumped Lee Sedol completely. We don't have something that's that level of focal point, but that doesn't mean that the approach to technology is different and the impact of the general training. It's still incredibly new. What do you think that point would be?

0
💬 0

10708.474 - 10717.495 Dylan Patel

What would be move 37 for chain of thought, for reasoning? scientific discovery. When you use this sort of reasoning problem and it's just something we fully don't expect.

0
💬 0

10718.296 - 10743.342 Nathan Lambert

I think it's actually probably simpler than that. It's probably something related to computer user robotics rather than science discovery. Because the important aspect here is models take so much data to learn, they're not sample efficient, right? Trillions, they take the entire web, right? Over 10 trillion tokens to train on, right? This would take a human... thousands of years to read, right?

0
💬 0

10743.462 - 10759.873 Nathan Lambert

And humans know most of the stuff, a lot of the stuff models know better than it, right? Humans are way, way, way more sample efficient. That is because of the self-play, right? How does a baby learn what its body is? As it sticks its foot in its mouth and it says, oh, this is my body.

0
💬 0

10760.873 - 10780.198 Nathan Lambert

It sticks its hand in its mouth and it calibrates its touch on its fingers with the most sensitive touch thing on its tongue. This is how babies learn. And it's just self-play over and over and over and over again. And now we have something that is similar to that with these verifiable proofs, whether it's a unit test in code or...

0
💬 0

10780.798 - 10797.481 Nathan Lambert

mathematical verifiable task, generate many traces of reasoning, right? And keep branching them out, keep branching them out. And then check at the end, hey, which one actually has the right answer? Most of them are wrong. Great. These are the few that are right. Maybe we use some sort of reward model outside of this to select even the best one to preference as well.

0
💬 0

10797.841 - 10805.623 Nathan Lambert

But now you've started to get better and better at these benchmarks. And so you've seen over the last six months, a skyrocketing in a lot of different benchmarks, right?

0
💬 0

10805.683 - 10827.003 Dylan Patel

All math and code benchmarks were pretty much solved, except for frontier math, which is designed to be almost questions that aren't practical to most people. Because they're exam-level, open math problem-type things. So it's on the math problems that are somewhat reasonable, which is somewhat complicated word problems or coding problems. It's just what Dylan is saying.

0
💬 0

10827.203 - 10842.184 Nathan Lambert

So the thing here is that... These are only with verifiable tasks. We earlier showed an example of the, you know, the really interesting, like what happens when chain of thought is to a non-verifiable thing. It's just like a human, you know, chatting, right? With the, you know, thinking about what's novel for humans, right? A unique thought.

0
💬 0

10842.684 - 10861.881 Nathan Lambert

But this task and form of training only works when it's verifiable. And from here, the thought is, okay, we can continue to scale this current training method by increasing the number of verifiable tasks. In math and coding, coding probably has a lot more to go. Math has a lot less to go in terms of what are verifiable things.

0
💬 0

10861.921 - 10875.956 Nathan Lambert

Can I create a solver that then I generate trajectories toward or reasoning traces towards and then prune the ones that don't work and keep the ones that do work? Well, those are going to be solved pretty quickly, but even if you've solved math, you have not... actually created intelligence, right?

0
💬 0

10876.676 - 10897.724 Nathan Lambert

And so this is where I think the like, aha moment of computer user robotics will come in because now you have a sandbox or a playground that is infinitely verifiable, right? Did you, you know, messing around on the internet, there are so many actions that you can do that are verifiable. It'll start off with like, log into a website, create an account, click a button here, blah, blah, blah.

0
💬 0

10897.984 - 10912.988 Nathan Lambert

But it'll then get to the point where it's, hey, go do a task on Tasker or whatever these other, all these various task websites. hey, go get hundreds of likes, right? And it's going to fail. It's going to spawn hundreds of accounts. It's going to fail on most of them. But this one got to a thousand. Great. Now you've reached the verifiable thing.

0
💬 0

10913.248 - 10925.151 Nathan Lambert

And you just keep iterating this loop over and over. And that's when... And same with robotics, right? That's where, you know, where you have an infinite playground of tasks like, hey, did I put the ball in the bucket? All the way to like, oh, did I like build a car, right? Like, you know, there's a whole...

0
💬 0

10925.791 - 10950.985 Nathan Lambert

trajectory to speed run or you know what models can do but at some point i truly think that like you know we'll spawn models and initially all the training will be in sandboxes but then at some point you know the language model pre-training is going to be dwarfed by what is this reinforcement learning you know you'll pre-train a multimodal model that can see that can read that can write you know blah blah blah whatever vision audio etc but then you'll have it play in a sandbox and

0
💬 0

10951.71 - 10976.934 Nathan Lambert

infinitely and figure out figure out math figure out code figure out navigating the web figure out operating a robot arm right and then it'll learn so much and the aha moment i think will be when this is available to then create something that's not good right like oh cool part of it was like figuring out how to use the web now all of a sudden it's figured out really well how to just get hundreds of thousands of followers that are real and real engagement on twitter because all of a sudden this is one of the things that are verifiable

0
💬 0

10976.974 - 11002.826 Lex Fridman

And maybe not just engagement, but make money. Yes, of course. I mean, that could be the thing where almost fully automated, it makes $10 million by being an influencer, selling a product, creating the product. And I'm not referring to a hype product, but an actual product where like, holy shit, this thing created a business. it's running it, it's the face of the business, that kind of thing.

0
💬 0

11003.387 - 11018.566 Lex Fridman

Or maybe a number one song, like it creates the whole infrastructure required to create the song, to be the influencer that represents that song, that kind of thing. That could be the move. I mean, our culture respects money in that kind of way.

0
💬 0

11018.866 - 11020.387 Nathan Lambert

And it's verifiable, right?

0
💬 0

11020.407 - 11021.628 Lex Fridman

It's verifiable, right.

0
💬 0

11021.648 - 11042 Dylan Patel

The bank account can't lie. Exactly. There's surprising evidence that once you set up the ways of collecting the verifiable domain that this can work. There's been a lot of research before this R1 on math problems, and they approach math with language models just by increasing the number of samples. So you can just try again and again and again. And you look at the...

0
💬 0

11042.981 - 11054.25 Dylan Patel

amount of times that the language models get it right. And what we see is that even very bad models get it right sometimes. And the whole idea behind reinforcement learning is that you can learn from very sparse rewards.

0
💬 0

11054.931 - 11074.287 Dylan Patel

So it doesn't... The space of language and the space of tokens, whether you're generating language or tasks for a robot, is so big that you might say that it's like... I mean, the tokenizer for a language model can be like 200,000 things. So at each step, it can sample from that big of a space. So if it... can generate a bit of a signal that it can climb onto.

0
💬 0

11074.347 - 11089.186 Dylan Patel

That's what the whole field of RL is around is learning from sparse rewards. And the same thing has played out in math where it's like very weak models that sometimes generate answers where you see research already that you can boost their math scores. You can do this sort of RL training

0
💬 0

11089.867 - 11104.73 Dylan Patel

For math, it might not be as effective, but if you take a 1 billion parameter model, so something 600 times smaller than DeepSeq, you can boost its grade school math scores very directly with a small amount of this training. So it's not to say that this is coming soon.

0
💬 0

11104.79 - 11117.632 Dylan Patel

Setting up the verification domains is extremely hard and there's a lot of nuance in this, but there are some basic things that we have seen before where it's at least expectable that there's a domain and there's a chance that this works.

0
💬 0

11118.157 - 11141.447 Lex Fridman

All right, so we have fun things happening in real time. This is a good opportunity to talk about other reasoning models, 01, 03. Just now, OpenAI, as perhaps expected, released 03 Mini. What are we expecting from the different flavors? Can you just lay out the different flavors of the old models and from Gemini, the reasoning model?

0
💬 0

11141.727 - 11154.178 Dylan Patel

Something I would say about these reasoning models is we talked a lot about reasoning training on math and code. And what is done is that you have the base model we've talked about a lot on the internet. You do this large scale reasoning training with reinforcement learning.

0
💬 0

11154.778 - 11169.489 Dylan Patel

And then what the DeepSeek paper detailed in this R1 paper, which for me is one of the big open questions on how do you do this, is that they did... reasoning-heavy but very standard post-training techniques after the large-scale reasoning RL.

0
💬 0

11169.549 - 11191.224 Dylan Patel

So they did the same things with a form of instruction tuning through rejection sampling, which is essentially heavily filtered instruction tuning with some reward models. And then they did this RLHF, but they made it math-heavy. So some of this transfer, we looked at this philosophical example early on, one of the big open questions is how much does this transfer?

0
💬 0

11191.304 - 11210.858 Dylan Patel

If we bring in domains after the reasoning training, are all the models going to become eloquent writers by reasoning? Is this philosophy stuff going to be open? We don't know in the research of how much this will transfer. There's other things about how we can make soft verifiers and things like this. But there is more training after reasoning, which makes it easier to use these reasoning models.

0
💬 0

11210.938 - 11220.561 Dylan Patel

And that's what we're using right now. So we're going to talk about with 3Mini and O1. These have gone through these extra techniques that are designed for human preferences after being trained to elicit reasoning.

0
💬 0

11221.401 - 11233.134 Nathan Lambert

I think one of the things that people are ignoring is Google's Gemini flash thinking is both cheaper than R1 and better. And they released it in the beginning of December. And nobody's talking about it. No one cares.

0
💬 0

11233.154 - 11255.23 Dylan Patel

It has a different flavor to it. Its behavior is less expressive than something like O1. It has fewer tracks than it is on. Quinn released a model last fall, QWQ, which was their preview reasoning model. And in DeepSeek had R1 Lite last fall, where these models kind of felt like they're on rails, where they really, really only can do math and code. And O1 is, it can answer anything.

0
💬 0

11255.29 - 11277.646 Dylan Patel

It might not be perfect for some tasks, but it's flexible. It has some richness to it. And this is kind of the art of Is a model a little bit undercooked? It's good to get a model out the door, but it's hard to gauge and it takes a lot of taste to be like, is this a full-fledged model? Can I use this for everything? They're probably more similar for math and code.

0
💬 0

11278.527 - 11301.77 Dylan Patel

My quick read is that Gemini Flash is not... trained the same way as 01, but taking an existing training stack, adding reasoning to it. So taking a more normal training stack and adding reasoning to it. And I'm sure they're going to have more. I mean, they've done quick releases on Gemini Flash, the reasoning, and this is the second version from the holidays. It's evolving fast and

0
💬 0

11303.382 - 11306.325 Dylan Patel

it takes longer to make this training stack where you're doing this large scale.

0
💬 0

11306.345 - 11311.011 Nathan Lambert

I get the same question from earlier. Uh, the one about the, the human nature.

0
💬 0

11311.031 - 11311.792 Dylan Patel

Yeah.

0
💬 0

11312.532 - 11314.635 Lex Fridman

What was the human nature one? Uh,

0
💬 0

11315.014 - 11327.385 Dylan Patel

The way I can ramble, why I can ramble about this so much is that we've been working on this at AI2 before O1 was fully available to everyone and before R1, which is essentially using this RL training for fine tuning.

0
💬 0

11327.645 - 11346.779 Dylan Patel

We use this in our like Tulu series of models and you can elicit the same behaviors where you say like weight and so much on, but it's so late in the training process that this kind of reasoning expression is much lighter. Yeah. So there's essentially a gradation, and just how much of this RL training you put into it determines how the output looks.

0
💬 0

11347.419 - 11352.481 Lex Fridman

So we're now using Gemini 2.0 Flash Thinking Experimental 121.

0
💬 0

11354.622 - 11359.044 Dylan Patel

It summarized the prompt as humans, self-domesticated apes.

0
💬 0

11361.945 - 11367.388 Lex Fridman

Okay. All right. So wait, is this revealing the reasoning? Here's why this is a novel. Okay.

0
💬 0

11368.528 - 11370.269 Dylan Patel

Click to expand. Click to expand.

0
💬 0

11370.774 - 11374.598 Lex Fridman

Okay. Analyze the request. Novel is the keyword.

0
💬 0

11375.119 - 11379.243 Dylan Patel

See how it just looks a little different? It looks like a normal output.

0
💬 0

11379.924 - 11384.508 Lex Fridman

Yeah. I mean, in some sense, it's better structured. It makes more sense.

0
💬 0

11384.668 - 11389.433 Nathan Lambert

Oh, and it latched onto human, and then it went into organisms, and oh, wow. Yeah.

0
💬 0

11390.456 - 11420.486 Lex Fridman

Apex Predator, focus on domestication, apply domestication to humans, explore the idea of self-domestication. Not good. Not good. Where is this going? Refine and articulate the insight, greater facial expressiveness and communication ability, yes. Plasticity and adaptability, yes. Dependence on social groups, yes. All right. And self-critique and refine further. Wow. Is this truly novel?

0
💬 0

11421.007 - 11447.025 Lex Fridman

Is it well supported? So on and so forth. And the insight is getting at is humans are not just social animals, but profoundly self-domesticated apes. And this self-domestication is the key to understanding our unique cognitive and social abilities. Self-domesticated apes. I prefer the deep seek response. I mean, it's novel. The insight is novel.

0
💬 0

11448.386 - 11468.424 Lex Fridman

I mean, that's like a good book title, Self-Demonstrated Apes. There could be a case made for that. I mean, yeah, it's cool. And it's revealing the reasoning. It's magical. It's magical. This is really powerful. Hello everyone, this is Lex with a quick intermission, recorded after the podcast.

0
💬 0

11468.805 - 11490.994 Lex Fridman

Since we reviewed responses from DeepSeeker 1 and Gemini Flash 2.0 thinking during this conversation, I thought at this moment it would be nice to insert myself quickly doing the same for OpenAI 01 Pro and 03 Mini with the same prompt. The prompt being, give one truly novel insight about humans.

0
💬 0

11492.188 - 11515.518 Lex Fridman

And I thought I would, in general, give my vibe check and vibe-based anecdotal report on my own experiences with the new O3 Mini model, now that I got a chance to spend many hours with it in different kinds of contexts and applications. So I would probably categorize this question as, let's say, open-ended philosophical question.

0
💬 0

11516.259 - 11541.739 Lex Fridman

And in particular, the emphasis on novelty, I think is a nice way to test one of the capabilities of the model, which is come up with something that makes you pause and almost surprise you with its brilliance. So that said, my general review after running each of the models on this question a bunch of times is that 01 Pro consistently gave brilliant answers.

0
💬 0

11542.86 - 11569.958 Lex Fridman

Ones that gave me pause and made me think, both cutting in its insight and just really nicely phrased with wit, with clarity, with nuance, over and over consistently generating the best answers. After that is R1, which was less consistent, but again, delivered brilliance. Gemini Flash 2.0 Thinking was third. And last was O3 Mini, actually.

0
💬 0

11570.678 - 11596.543 Lex Fridman

It often gave quite a generic answer, at least to my particular sensibilities. That said, in a bunch of other applications that I tested for brainstorming purposes, it actually worked extremely well and often outperformed R1. But on this open-ended philosophical question, it did consistently worse. Now, another important element for each of these models is how the reasoning is presented.

0
💬 0

11596.944 - 11610.559 Lex Fridman

DeepSeek R1 shows the full chain of thought tokens, which I personally just love. For these open-ended philosophical questions, it's really, really interesting to see the model think through it, but really also just stepping back

0
💬 0

11611.519 - 11633.286 Lex Fridman

Me as a person who appreciates intelligence and reasoning and reflection, reading these kind of chain of thought raw tokens of R1, there's something genuinely beautiful about observing the path of deliberation in an intelligence system. I think we don't always have that explicitly laid out for us humans.

0
💬 0

11633.326 - 11643.591 Lex Fridman

So to see it in another intelligence system, the non-linearity of it, akin to Ulysses of Finnegan's Wake by James Joyce, it's just beautiful to watch.

0
💬 0

11644.151 - 11666.253 Lex Fridman

Anyway, as we discussed in the episode, DeepSeek R1 talked about humans being able to convert selfish desires into cooperative systems by collectively pretending abstract rules like money, laws, and rights are real, and these shared hallucinations act as games where competition is secretly redirected to benefit the group. turning conflict into society's fuel.

0
💬 0

11666.773 - 11689.071 Lex Fridman

Gemini 2.0 Flash Thinking said, Humans are not just social animals, but self-domesticated apes. And this self-domestication is the key to understanding our unique cognitive and social abilities. Now, it's important to say that the chain of thought there was really interesting. It was looking through the entire evolution of life on Earth, considering apex predators.

0
💬 0

11690.598 - 11706.024 Lex Fridman

And considering how from that we ended up to where we are. I think that domestication by choice is a really interesting angle. Again, it's one of those things when somebody presents a different angle on a seemingly obvious thing, it just makes me smile.

0
💬 0

11706.344 - 11729.894 Lex Fridman

And the same with Deep Seek R1, that these hallucinations of money, laws, and rights, and us collectively pretending like it's real, and we play games with them that look like competition when secretly we're just cooperating with each other. And that is the fuel of progress, beautifully put. Now, OpenAI 01 Pro consistently over and over delivered bangers.

0
💬 0

11730.614 - 11748.162 Lex Fridman

I can go through many of them, but the first one was humans are the only species that turns raw materials into symbolic resources, then uses those symbols to reorganize the very materials they came from, creating a closed feedback loop between meaning and matter. Here, I just ran it again. Yeah.

0
💬 0

11749.763 - 11772.926 Lex Fridman

banger after banger i'm telling you humans are unique among known species in that they simultaneously rewrite two layers of reality the external world and their own private mental landscapes and then merge these two rewritten layers into a continuous personal narrative that feels objectively true feels true is this is poetry

0
💬 0

11773.627 - 11801.272 Lex Fridman

Okay, and then O3 Mini High for me was smart, fast, actually, and kind of generic. Never quite got there for me. So here's the first one I got from O3 Mini. Humans are not fixed beings, but rather ongoing narratives, dynamic stories that we continuously write, edit, and reinterpret. This narrative plasticity is more than just memory or self-reflection.

0
💬 0

11801.872 - 11824.805 Lex Fridman

It's an intrinsic cognitive process that acts like an internal error correction system. It allows us to adapt our identities and values over time in response to new experiences, challenges, and social contexts. Now it almost sneaks up to something approximating cutting insight with narrative plasticity in quotes. But then it goes back to the sort of the generic. I don't know.

0
💬 0

11825.126 - 11849.348 Lex Fridman

All of these models are incredible for different reasons. There's a lot of concerns as we discussed in this episode, but there's a lot of reasons to be excited as well. and I've probably spoken for too long. I am severely sleep-deprived, borderline delirious, so hopefully some of this made sense. And now, dear friends, back to the episode.

0
💬 0

11851.527 - 11872.882 Nathan Lambert

I think when you, you know, to Nathan's point, when you look at like the reasoning models, to me, even when I used R1 versus O1, there was like that sort of rough edges around the corner feeling, right? And flash thinking, you know, earlier, I didn't use this version, but the one from December, and it definitely had that rough edges around the corner feeling, right?

0
💬 0

11873.202 - 11891.194 Nathan Lambert

Where it's just not fleshed out in as many ways, right? Sure, they added math and coding capabilities via these verifiers in RL, but it feels like they lost something in certain areas. And O1 is worse performing than chat in many areas as well, to be clear. Not by a lot. Not by a lot though, right?

0
💬 0

11891.234 - 11905.263 Nathan Lambert

And it's like R1 definitely felt to me like it was worse than V3 in certain areas, like doing this RL expressed and learned a lot, but then it weakened in other areas. And so I think that's one of the big differences between these models is

0
💬 0

11905.843 - 11926.615 Nathan Lambert

and the and and and what oh one offers and then open ai has oh one pro and what they did with oh three which is like also very unique is that they stacked search on top of chain of thought right um and so chain of thought is one thing where it's able it's one chain it backtracks goes back and forth but how they served solved the arc agi challenge was not just the chain of thought

0
💬 0

11927.509 - 11932.294 Nathan Lambert

It was also sampling many times, i.e. running them in parallel and then selecting.

0
💬 0

11932.534 - 11943.304 Dylan Patel

Is running in parallel actually search? Because I don't know if we have the full information on how O1 Pro works. I don't have enough information to confidently say that it is search. It is parallel samples. Yeah.

0
💬 0

11943.544 - 11943.844 Nathan Lambert

And then what?

0
💬 0

11943.864 - 11958.389 Dylan Patel

And it selects something. And we don't know what the selection function is. The reason why we're debating is because since 01 was announced, there's been a lot of interest in techniques called Monte Carlo research, which is where you will break down the chain of thought into intermediate steps. We haven't defined chain of thought.

0
💬 0

11958.969 - 11978.296 Dylan Patel

Chain of thought is from a paper from years ago where you introduce the idea to ask a language model that at the time was much less easy to use. You would say, let's verify step by step, and it would induce the model to do this bulleted list of steps. Chain of thought is now... Almost a default in models where if you ask it a math question, you don't need to tell it to think step by step.

0
💬 0

11978.996 - 11994.55 Dylan Patel

And the idea with Monte Carlo tree search is that you would take an intermediate point in that train, do some sort of expansion, spend more compute, and then select the right one. That's like a very complex form of search that has been used in things like Mu0 and Alpha0 potentially. I know Mu0 does this.

0
💬 0

11995.351 - 11999.675 Nathan Lambert

Another form of search is just asking five different people and then taking the majority answers. Yes.

0
💬 0

11999.935 - 12023.332 Nathan Lambert

right there's a variety of like you know it could be complicated it could be simple we don't know what it is just that they are they are not just issuing one chain of thought in sequence they're launching many in parallel and in the arc hgi they launched a thousand in parallel for their uh the one that like really shocked everyone that beat the benchmark was they they would launch a thousand in parallel and then they would get the right answer like 80 of the time or 70 of the time 90 maybe even

0
💬 0

12025.093 - 12027.034 Nathan Lambert

Whereas if they just launched one, it was like 30%.

0
💬 0

12027.094 - 12047.219 Dylan Patel

There are many extensions to this. I would say the simplest one is that our language models to date have been designed to give the right answer the highest percentage of the time in one response. And we are now opening the door to different ways of running inference on our models in which we need to reevaluate many parts of the training process.

0
💬 0

12047.839 - 12061.632 Dylan Patel

which normally opens the door to more progress, but we don't know if OpenAI changed a lot or if just sampling more and multiple choice is what they're doing or if it's something more complex where they change the training and they know that the inference mode is going to be different.

0
💬 0

12062.072 - 12081.917 Lex Fridman

So we're talking about O1 Pro, $200 a month and they're losing money. So... The thing that we're referring to, this fascinating exploration of the test time compute space, is that actually possible? Do we have enough compute for that? Does the financials make sense?

0
💬 0

12082.118 - 12094.286 Nathan Lambert

So the fantastic thing is, and it's in the thing that I pulled up earlier, but the cost for GPT-3 has plummeted, if you scroll up just a few images, I think.

0
💬 0

12095.086 - 12120.732 Nathan Lambert

important thing about like hey is cost a limiting factor here right like my my view is that like we'll have like really awesome intelligence before we have like agi before we have it permeate throughout the economy um and this is sort of why that reason is right gpt3 was trained in what 2020 2021 um and the cost for running inference on it was 60 70 per million tokens right um which is the cost per intelligence was ridiculous

0
💬 0

12121.552 - 12129.175 Nathan Lambert

Now, as we scaled forward two years, we've had a 1200x reduction in cost to achieve the same level of intelligence as GPT-3.

0
💬 0

12129.435 - 12141.864 Lex Fridman

So here on the x-axis is time over just a couple of years, and on the y-axis is log scale time. to run inference on a million tokens.

0
💬 0

12141.885 - 12142.385

Yeah, a million.

0
💬 0

12142.685 - 12151.83 Lex Fridman

And so you have just a linear decline on log scale from GPT-3 through 3.5 to Lama.

0
💬 0

12151.87 - 12167.618 Nathan Lambert

It's like 5 cents or something like that now, right? Which is versus $60, 1200x. That's not the exact numbers, but it's 1200x. I remember that number. Is... is the humongous cost per intelligence, right? Now, the freak out over DeepSeq is, oh my God, they made it so cheap.

0
💬 0

12168.058 - 12184.923 Nathan Lambert

It's like, actually, if you look at this trend line, they're not below the trend line, first of all, and at least for GPT-3, right? They are the first to hit it, right? Which is a big deal. But they're not below the trend line as far as GPT-3. Now we have GPT-4. What's going to happen with these reasoning capabilities, right? It's a mix of architectural innovations.

0
💬 0

12185.163 - 12200.252 Nathan Lambert

It's a mix of better data, and it's going to be better training techniques, and all of these better inference systems uh, better hardware, right. Uh, going from, you know, each generation of GPU to new generations or a six, everything is going to take this cost curve down and down and down and down.

0
💬 0

12200.672 - 12219.184 Nathan Lambert

And then can I go in, can I just spawn a thousand different LLMs to create a task and then pick from one of them or, you know, whatever search search technique I want a tree Monte Carlo tree search. Maybe it gets that complicated. Um, maybe it doesn't cause it's too complicated to actually scale. Like who knows a better lesson, right? Uh, the, the question is, is,

0
💬 0

12221.045 - 12235.849 Nathan Lambert

I think when, not if, because the rate of progress is so fast, right? Nine months ago, Dario was saying, or Dario said nine months ago, the cost to train and inference was this, right? And now we're much better than this, right? And DeepSeek is much better than this.

0
💬 0

12236.049 - 12258.917 Nathan Lambert

And that cost curve for GPT-4, which was also roughly $60 per million tokens when it launched, has already fallen to $2 or so, right? And we're going to get it down to cents, Probably. For GPT-4 quality, and then that's the base for the reasoning models like O1 that we have today, and O1 Pro is spawning multiple, right? And O3 and so on and so forth.

0
💬 0

12258.937 - 12265.039 Nathan Lambert

These search techniques, too expensive today, but they will get cheaper. And that's what's going to unlock the intelligence, right?

0
💬 0

12265.526 - 12286.342 Lex Fridman

So it'll get cheaper and cheaper and cheaper. The big DeepSeek R1 release freaked everybody out because of the cheaper. One of the manifestations of that is Nvidia stock plummeted. Can you explain what happened? I mean, and also just explain this moment and whether, you know, if Nvidia is going to keep winning.

0
💬 0

12286.96 - 12307.326 Dylan Patel

We are both NVIDIA bulls here, I would say. And in some ways, the market response is reasonable. NVIDIA's biggest customers in the US are major tech companies, and they're spending a ton on AI. And a simple interpretation of DeepSeek is you can get really good models without spending as much on AI.

0
💬 0

12307.806 - 12325.22 Dylan Patel

So in that capacity, it's like, oh, maybe these big tech companies won't need to spend as much on AI and go down. The actual thing that happened, it's much more complex where there's social factors, where there's the rising in the app store, the social contagion that is happening. And then I think a lot of some of it is just like, I don't trade. I don't know anything about financial markets.

0
💬 0

12325.26 - 12337.27 Dylan Patel

But it builds up over the weekend or the social pressure where it's like if it was during the week and there was multiple days of trading when this was really becoming, but it comes on the weekend and then everybody wants to sell. Yeah. And that is a social contagion.

0
💬 0

12337.591 - 12355.995 Nathan Lambert

I think I think and like there are a lot of false narratives, which is like, hey, these guys are spending billions on models. Right. And they're not spending billions on models. No one spent more than a billion dollars on a model that's released publicly. Right. GPT-4 was a couple hundred million. And then, you know, they've reduced the cost with 4.0, 4 turbo 4.0. Right.

0
💬 0

12356.856 - 12370.942 Nathan Lambert

But billion dollar model runs are coming. right? And this concludes pre-training and post-training, right? And then the other number is like, hey, DeepSeq didn't include everything, right? They didn't include... A lot of the cost goes to research and all this sort of stuff. A lot of the cost goes to inference. A lot of the cost goes to post-training. None of these things were factored.

0
💬 0

12370.962 - 12388.831 Nathan Lambert

It's research salaries, right? All these things are counted in the billions of dollars that OpenAI is spending, but they weren't counted in the, hey, $6 million, $5 million that DeepSeq spent, right? So there's a bit of misunderstanding of what these numbers are. And then there's also an element of NVIDIA has just been a straight line up, right?

0
💬 0

12389.071 - 12406.798 Nathan Lambert

And there's been so many different narratives that have been trying to push down NVIDIA. I don't say push down NVIDIA stock. Everyone is looking for a reason to sell or to be worried, right? You know, it was Blackwell delays, right? Their GPU, you know, there's a lot of report. Every two weeks, there's a new report about their GPUs being delayed. There's...

0
💬 0

12408.579 - 12426.071 Nathan Lambert

There's the whole thing about scaling laws ending, right? It's so ironic, right? It lasted a month. It was just like literally just, hey, models aren't getting better, right? They're just not getting better. There's no reason to spend more. Pre-training scaling is dead. And then it's like, oh, one, oh, three, right? R1. R1, right?

0
💬 0

12426.171 - 12449.063 Nathan Lambert

And now it's like, wait, models are getting too, they're progressing too fast. Slow down the progress. Stop spending on GPUs, right? But, you know, the funniest thing I think that like comes out of this is Javon's paradox is true, right? AWS pricing for H100s has gone up over the last couple of weeks, right? Since a little bit after Christmas, since V3 was launched, AWS H100 pricing has gone up.

0
💬 0

12449.103 - 12457.285 Nathan Lambert

H200s are like almost out of stock everywhere because H200 has more memory and therefore R1 wants that chip over H100, right?

0
💬 0

12457.325 - 12464.988 Dylan Patel

We were trying to get GPUs on a short notice this week for a demo and it wasn't that easy. We were trying to get just like 16 or 32 H100s for a demo and it was not very easy.

0
💬 0

12465.488 - 12477.075 Lex Fridman

So for people who don't know, Gervon's paradox is when the efficiency goes up, somehow magically, counterintuitively, the total resource consumption goes up as well.

0
💬 0

12477.426 - 12495.282 Nathan Lambert

Right. And semiconductors is, you know, we're at 50 years of Moore's law. Every two years, half the cost, double the transistors, just like clockwork. And it's slowed down, obviously. But like the semiconductor industry has gone up the whole time. Right. It's been wavy. Right. There's obviously cycles and stuff. And I don't expect AI to be any different. Right. There's going to be ebbs and flows.

0
💬 0

12495.362 - 12508.592 Nathan Lambert

But this is an AI. It's just playing out at an insane timescale. Right. It was 2x every two years. This is 1200x in like three years. So it's like the scale of improvement that is hard to wrap your head around.

0
💬 0

12509.132 - 12526.683 Lex Fridman

Yeah, I was confused because to me, NVIDIA's stock on that should have gone up. But maybe it went down because there's kind of suspicion of foul play on the side of China or something like this. But if you just look purely at the actual principles at play here, it's obvious. Yeah, the Gervais paradox.

0
💬 0

12526.723 - 12541.731 Dylan Patel

The more progress that AI makes or the higher the... derivative of AI progress is, especially. Because NVIDIA is in the best place. The higher the derivative is, the sooner the market's going to be bigger and expanding. And NVIDIA is the only one that does everything reliably right now.

0
💬 0

12541.751 - 12548.494 Lex Fridman

Because it's not like an NVIDIA competitor arose. It's another company that's using NVIDIA.

0
💬 0

12548.874 - 12551.436 Dylan Patel

Who historically has been a large NVIDIA customer.

0
💬 0

12551.616 - 12551.876 Lex Fridman

Yeah.

0
💬 0

12552.316 - 12571.984 Nathan Lambert

And has press releases about them cheering about being China's biggest NVIDIA customer, right? Like, Obviously, they've quieted down, but I think that's another element of it, is that they don't want to say how many GPUs they have. Because, hey, yes, they have H800s. Yes, they have H20s. They also have some H100s, which were smuggled in.

0
💬 0

12572.104 - 12580.553 Lex Fridman

Can you speak to that, to the smuggling? What's the scale of smuggling that's feasible for a nation state to do for companies? Is it possible to...

0
💬 0

12581.013 - 12600.866 Nathan Lambert

I think there's a few angles of smuggling here, right? One is ByteDance arguably is the largest smuggler of GPUs for China, right? China's not supposed to have GPUs. ByteDance has like over 500,000 GPUs. Why? Because they're all rented from companies around the world. They rent from Oracle. They rent from Google. They rent from all these mass and a bunch of smaller cloud companies too, right?

0
💬 0

12601.106 - 12636.943 Nathan Lambert

All the Neo clouds, right? Of the world. They rent so, so many GPUs. They also buy a bunch, right? And they do this for mostly like what Meta does, right? Serving TikTok, right? Next best discussion. And Trump admin looks like they're going to keep them, which limits like allies, even like Singapore, which Singapore is like 20% of NVIDIA's 20, 30% of NVIDIA's revenue.

0
💬 0

12637.303 - 12652.338 Nathan Lambert

But Singapore had a mematorium on not building data centers for like 15 years because they don't have enough power. So where are they going? I mean, I'm not claiming they're all going to China, right? But a portion are, you know, many are going to Malaysia, including Microsoft and Oracle have big data centers in Malaysia.

0
💬 0

12652.418 - 12670.192 Nathan Lambert

Like, you know, they're going all over Southeast Asia, probably India as well, right? Like there's stuff routing, but like the diffusion rules are very de facto. Like you can only buy this many GPUs from this country. And you can only rent a cluster this large to companies that are Chinese, right? Like they're very explicit on trying to stop smuggling, right?

0
💬 0

12670.232 - 12683.817 Nathan Lambert

And a big chunk of it was, hey, let's, you know, random company by 16 servers, ship them to China, right? There's actually, I saw a photo from someone in the semiconductor industry who leads like a

0
💬 0

12685.037 - 12709.15 Nathan Lambert

a team for like networking chips uh that competes with nvidia and he sent a photo of a guy checking into a first class united flight from san francisco to shanghai or shenzhen with a super micro box that was this big which can only contain gpus right and he was booking first class because think about it three to five k for your first class ticket server cost you know 240 000 in the us 250 000 you sell it for 300 000 in china wait you just got a free first class ticket and a

0
💬 0

12713.752 - 12724.38 Nathan Lambert

a lot more money. So it's like, you know, and that's like small scale smuggling. Most of the large scale smuggling is like companies in Singapore and Malaysia, like routing them around or renting GPUs completely legally.

0
💬 0

12724.4 - 12737.013 Dylan Patel

I want to jump in. How much was the scale? I think there's been some number, like some people that are higher level economics people understanding, say that as you go from 1 billion of smuggling to 10 billion, it's like you're hiding certain levels of economic activity.

0
💬 0

12737.093 - 12743.66 Dylan Patel

And that's the most reasonable thing to me is that there's going to be some level where it's so obvious that it's easier to find this economic activity.

0
💬 0

12744.381 - 12760.4 Nathan Lambert

Yeah. So a My belief is that last year, roughly, so NVIDIA made a million H20s, which are legally allowed to be shipped to China, which we talked about is better for reasoning, inference at least, maybe not training, but reasoning inference, and inference generally.

0
💬 0

12760.74 - 12780.039 Nathan Lambert

Then they also had, you know, a couple hundred thousand, we think like 200 to 300,000 GPUs were routed to China from, you know, Singapore, Malaysia, US, wherever. Companies spawn up by 16 GPUs, 64 GPUs, whatever it is, route it. And Huawei is known for having spent up a massive network of like companies to get the materials they need after they were banned in like 2018.

0
💬 0

12780.26 - 12803.676 Nathan Lambert

So it's not like otherworldly. But I agree, right? Nathan's point is like, Hey, you can't smuggle up $10 billion of GPUs. And then the third sort of source, which is just now banned, which wasn't considered smuggling, but is China is renting, I believe from our research, Oracle's biggest GPU customer is ByteDance. And for Google, I think it's their second biggest customer.

0
💬 0

12804.236 - 12823.389 Nathan Lambert

And you go down the list of clouds, and especially these smaller cloud companies that aren't like the hyperscalers, right? Think beyond Core, even Lambda, even. There's a whole sea. There's 60 different new cloud companies serving NVIDIA GPUs. I think ByteDance is renting a lot of these, right? All over, right? And so these companies... are renting GPUs to Chinese companies.

0
💬 0

12823.429 - 12843.285 Nathan Lambert

And that was completely legal up until the diffusion rules, which happened just a few weeks ago. And even now you can rent GPU clusters that are less than 2000 GPUs, or you can buy GPUs and ship them wherever you want if they're less than 1500 GPUs, right? So it's like, there are still like some ways to smuggle, but yeah. It's not, you know, as the numbers grow, right?

0
💬 0

12843.925 - 12862.278 Nathan Lambert

You know, a hundred something billion dollars of revenue for NVIDIA last year, 200 something billion this year, right? And if next year, you know, it could nearly double again or more than double, right? Based on like what we see with data center footprints, like being built out all across the US and the rest of the world. It's going to be really hard for China to keep up with these rules, right?

0
💬 0

12862.499 - 12884.147 Nathan Lambert

Yes, there will always be smuggling and deep-seek level models of GPD-4 level models, O-1 level models capable to train on what China can get, even the next tier above that. But if we speed run a couple more jumps, right, to billion-dollar models, $10 billion models, then it becomes... hey, there is a compute disadvantage for China for training models and serving them.

0
💬 0

12884.507 - 12898.27 Nathan Lambert

And the serving part is really critical, right? DeepSeek cannot serve their model today, right? It's completely out of inventory. It's already started falling in the app store, actually, downloads, because you download it, you try and sign up, they say, we're not taking registrations because they have no capacity, right?

0
💬 0

12898.29 - 12908.153 Nathan Lambert

You open it up, you get like less than five tokens per second if you even get your request approved, right? Because there's just no capacity because they just don't have enough GPUs to serve the model, even though it's incredibly efficient.

0
💬 0

12908.49 - 12920.296 Lex Fridman

It would be fascinating to watch the smuggling. Because, I mean, there's drug smuggling, right? That's a market. There's weapons smuggling. And GPUs will surpass that at some point.

0
💬 0

12920.316 - 12935.645 Dylan Patel

Chips are highest value per kilogram, probably by far. I have another question for you, Dylan. Do you track model API access internationally? How easy is it for Chinese companies to use hosted model APIs from the U.S. ?

0
💬 0

12936.499 - 12954.207 Nathan Lambert

Yeah, I mean, that's incredibly easy, right? Like OpenAI publicly stated DeepSeq uses their API. And as they say, they have evidence, right? And this is another element of the training regime is people at OpenAI have claimed that it's a distilled model, i.e. you're taking OpenAI's model, you're generating a lot of output, and then you're training on the output in their model.

0
💬 0

12954.927 - 12958.609 Nathan Lambert

And even if that's the case, what they did is still amazing, by the way, what DeepSeq did efficiency-wise.

0
💬 0

12958.629 - 12970.733 Dylan Patel

Distillation is standard practice in industry, whether or not if you're at a closed lab where you care about terms of service and IP closely, you distill from your own models. If you are a researcher and you're not building any products, you distill from the opening eye.

0
💬 0

12970.753 - 12977.895 Lex Fridman

This is a good opportunity. Can you explain big picture distillation as a process? What is distillation? What's the process?

0
💬 0

12977.995 - 12996.824 Dylan Patel

We've talked a lot about training language models. They are trained on text. In post-training, you're trying to train on very high-quality text that you want the model to match the features of, or if you're using RL, you're letting the model find its own thing. But for supervised fine-tuning, for preference data, you need to have some completions what the model is trying to learn to imitate.

0
💬 0

12997.344 - 13018.575 Dylan Patel

And what you do there is instead of a human data or instead of the model you're currently training, you take completions from a different, normally more powerful model. I think there's rumors that these big models that people are waiting for, these GPT-5s of the world, the CLOD-3 opuses of the world are used internally to do this distillation process.

0
💬 0

13018.715 - 13029.1 Nathan Lambert

There's also public examples, right? Like Meta explicitly stated, not necessarily distilling, but they used 405B as a reward model for 70B in their LAMA 3.2 and 3.3. This is all the same topic.

0
💬 0

13031.061 - 13045.818 Lex Fridman

So is this ethical? Is this legal? Why is that Financial Times article headline say OpenAI says that there's evidence that China's deep seek used its model to train competitor?

0
💬 0

13046.018 - 13061.963 Dylan Patel

This is a long, at least in the academic side and research side, it's a long history because you're trying to interpret OpenAI's rule. OpenAI's terms of service say that you cannot build a competitor with outputs from their models. Terms of service are different than a license, which are essentially a contract between organizations.

0
💬 0

13062.464 - 13075.548 Dylan Patel

So if you have a terms of service on OpenAI's account, if I violate it, OpenAI can cancel my account. This is very different than like a license that says how you could use a downstream artifact. So a lot of it hinges on a word that is very unclear in the AI space, which is what is a competitor.

0
💬 0

13075.848 - 13083.251 Nathan Lambert

And then the ethical aspect of it is like, why is it unethical for me to train on your model when you can train on the Internet's text?

0
💬 0

13083.491 - 13094.255 Lex Fridman

Yeah. Right. So there's a bit of a hypocrisy because sort of open AI and potentially most of the companies trained on the Internet's text without permission.

0
💬 0

13094.615 - 13112.799 Dylan Patel

There's also a clear loophole, which is that I generate data from open AI and then I upload it somewhere and then somebody else trains on it and the link has been broken. Like they're not under the same terms of service contract. There's a lot of hip hop. There's a lot of like to be discovered details that don't make a lot of sense.

0
💬 0

13112.859 - 13133.904 Nathan Lambert

This is why a lot of models today, even if they train on zero OpenAI data, you ask the model who trained you, it'll say, I am Chad GPT trained by OpenAI. Because there's so much copy paste of like OpenAI outputs from that on the internet that you just weren't able to filter it out. And there was nothing in the URL where they implemented like, hey, like, or post-training or SFT, whatever that says.

0
💬 0

13134.164 - 13137.968 Nathan Lambert

hey, I'm actually a model by Allen Institute instead of OpenAI.

0
💬 0

13138.188 - 13153.744 Dylan Patel

We have to do this if we serve a demo. We do research and we use OpenAI APIs because it's useful and we want to understand post-training. And like our research models, they will say they're written by OpenAI unless we put in the system prop that we talked about that like, I am Tulu. I am a language model trained by the Allen Institute for AI.

0
💬 0

13154.344 - 13169.156 Dylan Patel

And if you ask more people around industry, especially with post-training, it's a very doable task to make the model say who it is or to suppress the OpenAI thing. So in some levels, it might be that DeepSeq didn't care that it was saying that it was by OpenAI.

0
💬 0

13169.857 - 13182.587 Dylan Patel

If you're going to upload model weights, it doesn't really matter because anyone that's serving it in an application and cares a lot about serving is going to, when serving it, if they're using it for a specific task, they're going to tailor it to that. And it doesn't matter that it's saying it's ChatGPT.

0
💬 0

13183.567 - 13190.272 Lex Fridman

Oh, I guess one of the ways to do that is like a system prompt or something like that. Like if you're serving it to say that you're... That's what we do.

0
💬 0

13190.412 - 13200.5 Dylan Patel

Like if we host the demo, you say, you are Tulu 3, a language model trained by the Allen Institute for AI. We also are benefited from OpenAI data because it's a great research tool.

0
💬 0

13200.62 - 13210.485 Lex Fridman

I mean, do you think there's any truth and value to the claim, OpenAI's claim that there's evidence that China's deep seek used this model to train?

0
💬 0

13210.585 - 13223.429 Nathan Lambert

I think everyone has benefited regardless because the data's on the internet. And therefore, it's in your portrayal now. There are subreddits where people share the best chat GPT outputs, and those are in your model.

0
💬 0

13223.789 - 13242.435 Dylan Patel

I think that they're trying to shift the narrative. They're trying to protect themselves, and we saw this years ago when ByteDance was actually banned from some open AI APIs for training on outputs. There's other AI startups that most people, if you're in the AI culture, were like, They just told us they trained on open AI outputs and they never got banned.

0
💬 0

13242.795 - 13254.42 Dylan Patel

Like that's how they bootstrapped their early models. So it's much easier to get off the ground using this than to set up human pipelines and build a strong model. So there's a long history here and a lot of the communications seem like narrative communications.

0
💬 0

13254.64 - 13270.64 Nathan Lambert

Actually, over the last couple of days, we've seen a lot of people distill DeepSeq's model into Lama models because the DeepSeq models are kind of complicated to run inference on because they're a mixture of experts and they're 600 plus billion parameters and all this. And people distill them into the Lama models because...

0
💬 0

13271.1 - 13284.507 Nathan Lambert

Because the Lama models are so easy to serve and everyone's built the pipelines and tooling for inference with the Lama models, right? Because it's the open standard. So, you know, we've seen it. We've seen a sort of roundabout, right? Like, is it bad? Is it illegal? Maybe it's illegal, whatever. I don't know about that.

0
💬 0

13284.547 - 13289.73 Dylan Patel

But like... It could break contracts. I don't think it's illegal. Like in any legal... Like no one's going to jail for this.

0
💬 0

13290.13 - 13313.539 Lex Fridman

I think like fundamentally, I think it's ethical or I hope it's ethical because like the moment it becomes... We ban that kind of thing... it's going to make everybody much worse off. And I also actually, this is difficult, but I think you should be allowed to train on the internet. I know a lot of authors and creators are very sensitive about it. That's a difficult question.

0
💬 0

13314.099 - 13317.24 Lex Fridman

But the moment you're not allowed to train on the internet,

0
💬 0

13317.6 - 13342.63 Nathan Lambert

I agree. I have a schizo take on how you can solve this because it already works. I have a reasonable take on it. Japan has a law which you're allowed to train on any training data and copyrights don't apply if you want to train a model. A. B. Japan has 9 gigawatts of curtailed nuclear power. C, Japan is allowed under the AI diffusion rule to import as many GPUs as they'd like.

0
💬 0

13343.17 - 13360.136 Nathan Lambert

So all we have to do, we have a market here to make. We build massive data centers, we rent them to the labs, and then we train models in a legally permissible way and there's no if, ands, or buts. And now the models have no potential copyright lawsuit from New York Times or anything like that. No, no, it's just completely legal.

0
💬 0

13361.496 - 13380.765 Dylan Patel

Genius. The early copyright lawsuits have fallen in the favor of AI training companies I would say that the long tail of use is going to go in the side of AI, which is if you do if you scrape trillions of data, you're not looking at the trillions of tokens of data. You're not looking and saying this one New York Times article is so important to me.

0
💬 0

13381.145 - 13402 Dylan Patel

But if you're doing a audio generation for music or image generation and you say, make it in the style of X person. That's a reasonable case where you could figure out what is their profit margin on inference. I don't know if it's going to be the 50-50 of YouTube creator program or something, but I would opt into that program as a writer. Please.

0
💬 0

13402.56 - 13409.886 Dylan Patel

It's going to be a rough journey, but there will be some solutions like that that make sense. But there's a long tail where it's just on the internet.

0
💬 0

13410.287 - 13429.553 Lex Fridman

I think one of the other aspects of that Financial Times article implied And so that leads to a more general question. Do you think there's... How difficult is spying, espionage, and stealing of actual secret code and data from inside companies? How much of that is being attempted?

0
💬 0

13429.573 - 13450.241 Dylan Patel

Code and data is hard, but ideas is easy. Silicon Valley operates on the way that top employees get bought out by other companies for a pay raise. And a large reason why these companies do this is to bring ideas with them. And there are, there's no, I mean, in California, there's rules that like certain like non-competes or whatever are illegal in California.

0
💬 0

13450.381 - 13467.953 Dylan Patel

And whether or not there's NDAs and things, that is how a lot of it happens. Recently, there was somebody from Gemini who helped make this 1 million contacts length. And everyone is saying the next Lama who, I mean, he went to the meta team is going to have 1 million contacts length. And that's kind of how the world works.

0
💬 0

13467.973 - 13488.119 Nathan Lambert

Yeah. As far as industrial espionage and things, that has been greatly successful in the past. The Americans did it to the Brits, the Chinese have done it to the Americans, and so on and so forth. It is a fact of life. And so to argue industrial espionage can be stopped is probably unlikely.

0
💬 0

13488.139 - 13505.244 Nathan Lambert

You can make it difficult, but even then, there's all these stories about, hey, F-35 and F-22 have already been given to China in terms of design plans and stuff. Code and stuff like between, you know, I say companies, not nation states is probably very difficult. But ideas are discussed a lot, right?

0
💬 0

13505.524 - 13524.692 Nathan Lambert

Whether it be a house party in San Francisco or a company changing employees or, you know, or the, you know, the always the like mythical honeypot that always gets talked about, right? Like someone gets honeypotted, right? Because everyone working on AI is a single dude who's in their 20s and 30s. Not everyone, but, like, an insane amount of... Insane percentages.

0
💬 0

13525.613 - 13533.52 Lex Fridman

So there's always, like, all these, like, you know... And obviously... So a honeypotter is, like, a spy, a female spy approaches you and, like... Yeah.

0
💬 0

13534.02 - 13547.011 Nathan Lambert

Yeah, or male, right? You know, it's San Francisco, right? But as a single dude, I will say, in his late 20s, right, is, like, we were very easily corrupted, right? Like, you know, like... Not corrupted myself, but you know, we are, we are, right?

0
💬 0

13547.071 - 13553.736 Dylan Patel

Everybody else, not me. I'm too oblivious and I am not single. So I'm safe from one espionage access.

0
💬 0

13555.017 - 13571.93 Lex Fridman

Yeah, you have to make sure to close all security vulnerabilities. So you, Dylan, collect a lot of information about each of the mega clusters for each of the major AI companies. Can you talk about the build-outs for each one that stand out?

0
💬 0

13572.569 - 13591.089 Nathan Lambert

Yeah. So I think the thing that's like really important about these mega cluster build outs is they're completely unprecedented in scale. Right. U.S., you know, sort of like data center power consumption has been slowly on the rise and it's gone up to two, three percent even through the cloud computing revolution. Right. Data center consumption as a percentage of total U.S.,

0
💬 0

13592.11 - 13610.617 Nathan Lambert

And that's been over decades, right, of data centers, et cetera. It's been climbing, climbing slowly. But now, two to three percent. Now, by the end of this decade, it's, like, even under, like, you know, when I say, like, 10%, a lot of people that are traditionally, by, like, 2028, 2030, people traditionally non-traditional data center people, like, that's nuts.

0
💬 0

13610.997 - 13626.346 Nathan Lambert

But then, like, people who are in, like, AI who have, like, really looked at this at, like, the anthropics and open AIs are, like, that's not enough. And I'm, like, okay. But, like... you know, this is, this is both through, uh, globally distributed, uh, or distributed throughout the U S as well as like centralized clusters, right?

0
💬 0

13626.646 - 13640.859 Nathan Lambert

The, the distributed throughout the U S is, is exciting and it's the bulk of it, right? Like, Hey, you know, uh, open AI or, uh, you know, say meta is adding a gigawatt, right? Um, but most of it is distributed through the U S for inference and all these other things, right?

0
💬 0

13641.059 - 13659.431 Lex Fridman

So maybe we should lay out what a, what a cluster is. So, uh, Does this include AWS? Maybe it's good to talk about the different kinds of clusters and what you mean by megaclusters and what's a GPU and what's a computer? Not that far back, but yeah. So what do we mean by the clusters?

0
💬 0

13659.672 - 13684.154 Nathan Lambert

I thought I was about to do the Apple ad, right? What's a computer? So traditionally, data centers and data center tasks have been a distributed systems problem that is capable of being spread very far and widely, right? I.e., I send a request to Google, it gets routed to a data center somewhat close to me, it does whatever search ranking recommendation, sends a result back, right? Yeah.

0
💬 0

13685.356 - 13703.488 Nathan Lambert

The nature of the task is changing rapidly in that there's two tasks that people are really focused on now, right? It's not database access. It's not serve me the right page, serve me the right ad. It's now... A, inference. And inference is dramatically different from traditional distributed systems, but it looks a lot more similar. And then there's training, right?

0
💬 0

13704.089 - 13721.317 Nathan Lambert

The inference side is still like, hey, I'm going to put, you know, thousands of GPUs and, you know, blocks all around these data centers. I'm going to run models on them. You know, user submits a request, gets kicked off. Or, hey, my service, you know, they submit a request to my service, right? They're on Word and they're like, oh, yeah, help me copilot. And it kicks it off.

0
💬 0

13721.377 - 13738.482 Nathan Lambert

I'm on my Windows, copilot, whatever. Apple intelligence, whatever it is, it gets kicked off to a data center. right? And that data center does some work and sends it back. That's inference. That is going to be the bulk of compute. But then, you know, and that's like, you know, there's thousands of data centers that we're tracking with like satellites and like all these other things.

0
💬 0

13739.182 - 13753.285 Nathan Lambert

And those are the bulk of what's being built. But the scale of... And so that's like what's really reshaping and that's what's getting millions of GPUs. But the scale of the largest cluster is also really important, right? When we look back at history, right? Like

0
💬 0

13754.185 - 13775.523 Nathan Lambert

you know or through through the age of ai right like it was a really big deal when they did alex net on i think two gpus or four gpus i don't remember it's a really big deal it's a big deal because you use gpus it's a big deal they use gpus um and they use multiple right but then over time its scale has just been compounding right and so when you skip forward to gpt3 then gpt4 gpt4 20 000

0
💬 0

13777.945 - 13796.426 Nathan Lambert

a 100 gpus unprecedented run right in terms of the size and the cost right a couple hundred million dollars on a yolo right a yolo run for gpd4 and it and it yielded you know this magical improvement that was like perfectly in line with what was experimented and just like a log scale right oh yeah they have that plot from the paper the scaling the technical performance

0
💬 0

13796.866 - 13817.033 Nathan Lambert

The scaling laws were perfect, right? But that's not a crazy number, right? 20,000 A100s, roughly each GPU is consuming 400 watts. And then when you add in the whole server, right, everything, it's like 15 to 20 megawatts of power, right? You know, maybe you could look up what the power of consumption of a human person is because the numbers are going to get silly.

0
💬 0

13817.373 - 13836.179 Nathan Lambert

But like 15 to 20 megawatts was standard data center size. It was just unprecedented. That was all GPUs running one task. How many watts was a toaster? A toaster is like a similar power consumption to an A100, right? H100 comes around, they increase the power from like 400 to 700 watts, and that's just per GPU, and then there's all the associated stuff around it.

0
💬 0

13836.439 - 13843.02 Nathan Lambert

So once you count all that, it's roughly like 1200 to 1400 watts for everything, networking, CPUs, memory, blah, blah, blah.

0
💬 0

13843.08 - 13860.393 Lex Fridman

So we should also say, so what's required... You said power, so a lot of power is required, a lot of heat is generated, cooling is required, and because there's a lot of GPUs that have to be, or CPUs or whatever, they have to be connected, so there's a lot of networking.

0
💬 0

13860.793 - 13878.242 Nathan Lambert

Yeah, so I think, yeah, sorry for skipping past that. And then the data center itself is complicated, right? But these are still standardized data centers for GPT-4 scale, right? Now we step forward to sort of what is the scale of clusters that people built last year? And it ranges widely.

0
💬 0

13878.462 - 13894.646 Nathan Lambert

It ranges from like, hey, these are standard data centers and we're just using multiple of them and connecting them together really with a ton of fiber between them, a lot of networking, et cetera. That's what OpenAI and Microsoft did in Arizona. And so they have 100,000 GPUs. Meta, similar thing. They took their standard existing data center design.

0
💬 0

13895.086 - 13907.555 Nathan Lambert

Um, and it looks like an H and they connected multiple of them together. Um, and you know, they got to, they first did 16,000 GPUs, uh, 24,000 GPUs total, only 16 of them, thousand of them were running on the training run because GPUs are very unreliable.

0
💬 0

13907.575 - 13926.99 Nathan Lambert

So they need to have spares to like swap in and out all the way to like now a hundred thousand GPUs that they're training on Lama for on currently, right? Like 128,000 or so, right? This is, you know, think about a hundred thousand GPUs, um, with roughly 1400 watts a piece, that's 140 megawatts, 150 megawatts, right? For 128, right?

0
💬 0

13927.27 - 13947.649 Nathan Lambert

So you're talking about, you've jumped from 15 to 20 megawatts to 10x, you know, almost 10x that number, 9x that number to 150 megawatts in... In two years, right? From 2022 to 2024, right? And some people like Elon, he admittedly, right? And he says it himself, got into the game a little bit late for pre-training large language models, right? XAI was started later, right?

0
💬 0

13947.889 - 13964.82 Nathan Lambert

But then he bent heaven and hell to get his data center up and get the largest cluster in the world, right? Which is 200,000 GPUs. And he did that. He bought a factory in Memphis. He's upgrading the substation, but at the same time, he's got a bunch of mobile power generation, a bunch of single cycle combine.

0
💬 0

13965.1 - 13982.847 Nathan Lambert

He tapped the natural gas line that's right next to the factory, and he's just pulling a ton of gas, burning gas. He's generating all this power. He's in a factory, in an old appliance factory that shut down and moved to China long ago, right? And he's got 200,000 GPUs in it. And now what's the next scale, right? Like all the hyperscalers have done this.

0
💬 0

13982.887 - 14003.344 Nathan Lambert

Now the next scale is something that's even bigger, right? And so, you know, Elon, just to stick on the topic, he's building his own natural gas plant, like a proper one right next door. He's deploying tons of Tesla Megapack batteries to make the power more smooth and all sorts of other things. He's got like industrial chillers, right? to cool the water down because he's water cooling the chips.

0
💬 0

14003.945 - 14022.38 Nathan Lambert

Um, so all these crazy things to, uh, get the clusters bigger and bigger. Um, but when you look at like, say what opening I did with Stargate, that's that in Arizona and, um, in Abilene, Texas, right. Uh, what they've announced at least, right. It's not built, right. Elon says they don't have the money. You know, there's some debates about this. Um,

0
💬 0

14022.7 - 14043.691 Nathan Lambert

But at full scale, at least the first section is like definitely money's accounted for, but there's multiple sections. But full scale, that data center is going to be 2.2 gigawatts, right? 2200 megawatts of power in and roughly like 1.8 gigawatts or 1800 megawatt. Yeah. 1800 megawatts of power delivered to chips, right? Now, this is an absurd scale.

0
💬 0

14043.991 - 14056.484 Nathan Lambert

2.2 gigawatts is like more than most cities, right? You know, to be clear, delivered to a single cluster that's connected to do training, right? To train these models, to do both the pre-training, the post-training, all of this stuff, right?

0
💬 0

14056.504 - 14057.706 Lex Fridman

This is insane.

0
💬 0

14058.186 - 14080.559 Nathan Lambert

What is a nuclear power plant again? Everyone is doing this, right? Everyone is doing this, right? Meta in Louisiana, right? They're building two natural gas plants, massive ones, and then they're building this massive data center. Amazon has plans for this scale. Google has plans for this scale. XAI has plans for this scale, right? All of these, the guys that are racing...

0
💬 0

14080.999 - 14098.861 Nathan Lambert

The companies that are racing are racing hard and they're doing multi gigawatt data centers, right? To build this out because they think that, yeah, if I now have, you know, obviously pre-training scaling is going to continue, but to some extent, but then also all this post-training stuff where you have an RL sandbox for computer use or whatever, right? Like, you know,

0
💬 0

14099.241 - 14115.588 Nathan Lambert

This is where they're going to and all these fearful about viable domains where they just keep learning and learning and learning self play, whatever, whatever it is, makes the AI so much more capable because the line does go up, right? As you throw more compute, you get more performance. The shirt is about scaling laws. You know, to some extent, it is diminishing returns, right?

0
💬 0

14115.608 - 14137.494 Nathan Lambert

You 10x the compute, you don't get 10x better model, right? You get a diminishing returns, but also you get efficiency improvements. So you bend the curve, right? And these scale of data centers are wreaking a lot of havoc on the network. Nathan was mentioning Amazon has tried to buy this nuclear power plant, Talon. And if you look at Talon's stock, it's just skyrocketing.

0
💬 0

14137.534 - 14150.236 Nathan Lambert

And they're building a massive multi-gigawatt data center there. And you just go down the list. There's so many ramifications. Interesting thing is certain regions of the US, transmitting power costs more than actually generating it.

0
💬 0

14151.316 - 14170.602 Nathan Lambert

Because the grid is so slow to build and the demand for power and the ability to build power and re-ramping on a natural gas plant or even a coal plant is easy enough to do. But transmitting the power is really hard. So in some parts of the US, like in Virginia, it costs more to transmit power than it costs to generate it. There's all sorts of second order effects that are insane here.

0
💬 0

14171.002 - 14172.883 Lex Fridman

Can the power grid support this kind of growth?

0
💬 0

14173.294 - 14188.141 Nathan Lambert

You know, Trump's executive orders, there was a Biden executive order before the end of the year, but then Trump had some more executive orders, which hopefully reduced the regulations to where, yes, things can be built. But yeah, this is a big, big challenge, right? Is building enough power fast enough?

0
💬 0

14188.261 - 14192.623 Lex Fridman

Are you going to basically have a nuclear power plant next to a data center for each one of these?

0
💬 0

14192.903 - 14207.94 Nathan Lambert

So the fun thing here is this is too slow. To build the power plant. To build a power plant or to reconfigure an existing power plant is too slow. And so therefore you must use natural, data center power consumption is flat, right? You know, I mean, like it's spiky, right?

0
💬 0

14207.96 - 14216.184 Dylan Patel

Which is why nuclear is also good for it. Like long-term nuclear is a very natural fit, but you can't do solar or anything in the short term.

0
💬 0

14216.784 - 14235.995 Nathan Lambert

Because data center power is like this, right? You're telling me I'm going to buy tens of billions of dollars of GPUs and idle them because the power is not being generated? Power is cheap, right? If you look at the cost of a cluster, less than 20% of it is power, right? Most of it is the capital cost and depreciation of the GPUs, right? And so it's like, well, screw it.

0
💬 0

14236.075 - 14256.096 Nathan Lambert

I'll just build natural gas plants. This is what Meta is doing in Louisiana. This is what OpenAI is doing in Texas and all these different places. They may not be doing it directly, but they are partnered with someone. And so There is a couple hopes, right? One is, Elon, what he's doing in Memphis is to the extreme. They're not just using dual combine cycle gas, which is super efficient.

0
💬 0

14256.356 - 14268.972 Nathan Lambert

He's also just using single cycle and mobile generators and stuff, which is less efficient. But you know, there's also like the flip side, which is like solar power generation is like this. And wind is another, like, like this different correlate, you know, different.

0
💬 0

14269.272 - 14285.748 Nathan Lambert

So if you stack both of those, plus you get a big chunk of batteries, um, plus you have a little bit of gas, it is possible to run it more green. It's just the timescales for that is slow, right? So people are trying, um, But, you know, Meta basically said, whatever, don't care about my sustainability pledge.

0
💬 0

14286.288 - 14299.578 Nathan Lambert

Or they'll buy like a power, it's called a PPA, power purchasing agreement, where there'll be a massive wind farm or solar farm, like wherever. And then they'll just pretend like those electrons are being consumed by the data center. But in reality, they're paying for the power here and selling it to the grid and they're buying power here. Yeah.

0
💬 0

14300.138 - 14317.027 Nathan Lambert

And then another thing is like Microsoft quit on some of their sustainability pledges, right? Elon, what he did with Memphis is objectively somewhat dirty, but he's also doing it in an area where there's like a bigger natural gas plant right next door and like a sewer next or not a sewer, but like a wastewater treatment and a garbage dump nearby, right?

0
💬 0

14317.347 - 14335.702 Nathan Lambert

And he's obviously made the world a lot more clean than that one data center is going to do, right? So I think like it's fine to some extent and maybe AGI solves that. you know, global warming and stuff, right? Whatever it is. You know, this is sort of the attitude that people at the labs have, right? Which is like, yeah, it's great. We'll just use gas, right? Because the race is that important.

0
💬 0

14335.742 - 14338.145 Nathan Lambert

And if we lose, you know, that's way worse, right?

0
💬 0

14338.385 - 14362.921 Lex Fridman

I should say that I got a chance to visit the Memphis Data Center. Oh, wow. And it's kind of incredible. I mean, I visited with Elon yesterday. Just the teams and the rate of innovation there is insane. My sense is that nobody's ever done anything of this scale, and nobody has certainly ever done anything of this scale at the rate that XAI is doing.

0
💬 0

14363.513 - 14370.656 Lex Fridman

So they're like figuring out, I mean, and so I was sitting in on all these meetings where they're brainstorming. It's like, it's insane.

0
💬 0

14371.056 - 14392.41 Lex Fridman

It's exciting because they're like, they're trying to figure out what the bottlenecks are, how to remove the bottlenecks, how to make sure that, you know, there's just so many really cool things about putting together a data center because, you know, everything has to work. It's the people that do like the sysadmin, the machine learning, all that is the exciting thing, so on.

0
💬 0

14392.47 - 14408.08 Lex Fridman

But really the people that run everything are the folks that know like the low level software and hardware that runs everything, the networking, all of that. And so you have to like... make sure you have procedures that test everything. I think they're using ethernet.

0
💬 0

14408.1 - 14426.39 Nathan Lambert

I don't know how they're doing the networking, but they're using Nvidia spectrum X ethernet. Um, there's actually like, I think, yeah, the unsung heroes are the cooling and electrical systems, which are just like glossed over. Yeah. Um, But I think like one story that maybe is like exemplifies how insane this stuff is, is when you're training, right?

0
💬 0

14427.43 - 14442.4 Nathan Lambert

You're always doing, you're running through the model a bunch, right? In the most simplistic terms, running through the model a bunch, and then you're going to exchange everything and synchronize the weights, right? So you'll do a step. This is like a step in model training, right? And every step your loss goes down, hopefully, and it doesn't always.

0
💬 0

14442.46 - 14453.445 Nathan Lambert

But in the simplest terms, you'll be computing a lot and then you'll exchange. right? The interesting thing is GPU power is most of it. Networking power is some, but it's a lot less. So while you're computing, your power for your GPUs is here.

0
💬 0

14453.705 - 14471.01 Nathan Lambert

But then when you're exchanging weights, if you're not able to overlap communications and compute perfectly, there may be a time period where your GPUs are just idle and you're exchanging weights and you're like, hey, the model's updating. So you're exchanging the gradients, you do the model update, and then you start training again. So the power goes And it's super spiky.

0
💬 0

14471.51 - 14495.152 Nathan Lambert

And so funnily enough, when you talk about the scale of data center power, you can blow stuff up so easily. And so Meta actually has accidentally upstreamed something to code in PyTorch where they added an operator. And I kid you not, whoever made this, I want to hug the guy because it says PyTorch. It's like PyTorch.powerplantnoblowup. Okay. equals zero or equal one.

0
💬 0

14495.352 - 14508.164 Nathan Lambert

And what it does, what it does is amazing, right? Either, you know, when you're exchanging the weights, the GPU will just compute fake numbers so the power doesn't spike too much. And so then the power plants don't blow up because the transient spikes like screw stuff up.

0
💬 0

14508.595 - 14513.438

Well, that makes sense. I mean, you have to do that kind of thing. You have to make sure they're not idle. Yeah.

0
💬 0

14513.618 - 14528.446 Nathan Lambert

And Elon's solution was like, let me throw a bunch of Tesla mega packs and a few other things, right? Like everyone has different solutions, but like Meta's at least was publicly and openly known, which is just like set this operator. And what this operator does is it just makes the GPUs compute nothing so that the power doesn't spike.

0
💬 0

14528.735 - 14533.199 Lex Fridman

But that just tells you how much power you're working with. I mean, it's insane. It's insane.

0
💬 0

14533.239 - 14547.453 Dylan Patel

People should just go to Google, like scale, like what does X watts do and go through all the scales from one watt to a kilowatt to a megawatt. And you look and stare at that and you're how high on the list a gigawatt is. And it's mind blowing. Yeah.

0
💬 0

14548.474 - 14561.078 Lex Fridman

Can you say something about the cooling? So I know Elon's using liquid cooling, I believe, in all cases. That's a new thing, right? Most of them don't use liquid cooling. Is there something interesting to say about the cooling?

0
💬 0

14561.178 - 14577.205 Nathan Lambert

Yeah, yeah. So air cooling has been the de facto standard. Throw a bunch of metal, heat pipes, et cetera, and fans, right? And like that's cooled. That's been enough to cool it. People have been dabbling in water cooling. Google's TPUs are water cooled. Right. So they've been doing that for a few years.

0
💬 0

14578.145 - 14594.578 Nathan Lambert

But with GPUs, no one's ever done and no one's ever done the scale of water cooling that Elon just did. Right. Now, next generation NVIDIA is for the for the like highest end GPU. It is mandatory water cooling. You have to water cool it. But Elon did it on this current generation NVIDIA. And that required a lot of stuff, right?

0
💬 0

14594.598 - 14611.729 Nathan Lambert

If you look at like some of the satellite photos and stuff of the Memphis facility, there's all these external water chillers that are sitting basically, it looks like a semi-truck pod thing. What's it called? The container. But really those are water chillers. And he has like 90 of those water chillers just sitting outside. 90 different containers, right?

0
💬 0

14611.829 - 14630.822 Nathan Lambert

With water, you know, that chill the water, bring it back to the data center, and then you distribute it to all the chips, pull all the heat out, and then send it back, right? And this is both a way to cool the chips, but also an efficiency thing. And going back to that three-vector thing, there is memory bandwidth, flops, and interconnect.

0
💬 0

14631.102 - 14646.313 Nathan Lambert

The closer the chips are together, the easier it is to do high-speed interconnects. And so this is also a reason why you're going to go water cooling is because you can just put the chips right next to each other and therefore get higher speed connectivity.

0
💬 0

14647.574 - 14655.117 Lex Fridman

I got to ask you, so in one of your recent posts, there's a section called Cluster Measuring Contest.

0
💬 0

14657.178 - 14658.959 Nathan Lambert

There's another word there, but I won't say it, you know?

0
💬 0

14661.32 - 14665.422 Lex Fridman

Who's got the biggest now and who's going to have the biggest?

0
💬 0

14665.442 - 14668.143 Nathan Lambert

Today, individual largest is Elon, right?

0
💬 0

14668.163 - 14670.91 Lex Fridman

Right. Elon's cluster.

0
💬 0

14671.05 - 14690.646 Nathan Lambert

Elon's cluster in Memphis, 200,000 GPUs, right? Meta has like 128,000. OpenAI has 100,000. Now, to be clear, other companies have more GPUs than Elon. They just don't have them in one place, right? And for training, you want them tightly connected. There's some techniques that people are researching and working on that lets you train across multiple regions.

0
💬 0

14690.966 - 14710.74 Nathan Lambert

But for the most part, you want them all in like one area, right? So you can connect them highly with high-speed networking, right? Um, and so, you know, Elon today has 200,000 GP H one hundreds and H a hundred thousand H one hundreds, a hundred thousand H two hundreds, right. Um, meta open AI, uh, you know, and, and, and Amazon all have on the scale of a hundred thousand, a little bit less.

0
💬 0

14711.28 - 14732.074 Nathan Lambert

Um, but next this year, right this year, people are building much more, right. Anthropic and Amazon are building a cluster of 400,000 tranium too, which is Amazon specific chip, uh, trying to get away from Nvidia. Right. Um, you know, uh, yeah. Meta and OpenAI have scales for hundreds of thousands. But by next year, you'll have like 500,000 to 700,000 GPU clusters.

0
💬 0

14732.434 - 14744.023 Nathan Lambert

And note those GPUs are much higher power consumption than existing ones, right? Hopper 700 watts, Blackwell goes to 1200 watts, right? So the power per chip is growing and the number of chips is growing, right?

0
💬 0

14744.043 - 14749.908 Lex Fridman

Nuts. You think Elon said he'll get to a million. You think that's actually feasible?

0
💬 0

14750.776 - 14770.648 Nathan Lambert

I mean, I don't doubt Elon, right? The filings that he has for like, you know, the power plant and the Tesla battery packs, it's clear he has some crazy plans for Memphis, like permits and stuff is open record, right? But it's not quite clear that, you know, what and what the timescales are. I just never doubt Elon, right? You know, that's he's gonna surprise us.

0
💬 0

14770.888 - 14785.158 Lex Fridman

So what's the idea with these clusters? If you have a million GPUs, what percentage in, let's say, two, three years is used for training and what percent pre-training and what percent is used for the actual computation?

0
💬 0

14785.178 - 14803.606 Nathan Lambert

So these mega clusters make no sense for inference, right? You could route inference there and just not train. Yeah. But most of the inference capacity is being, you know, hey, I've got a 30 megawatt data center here. I've got 50 megawatts here. I've got 100 here, whatever. I'll just throw inference in all of those because the mega clusters, right? Multi gigawatt data centers.

0
💬 0

14803.946 - 14818.671 Nathan Lambert

I want to train there because that's where all of my GPUs are co-located, where I can put them at a super high networking speed connected together, right? Because that's what you need for training. Now with pre-training, this is the old scale, right? You could, you would increase parameters. You didn't increase data model gets better, right?

0
💬 0

14819.41 - 14837.117 Nathan Lambert

That doesn't apply anymore because there's not much more data in the pre-training side. Yes, there's video and audio and image that has not been fully taken advantage of. So there's a lot more scaling. But a lot of people have taken transcripts of YouTube videos. And that gets you a lot of the data. It doesn't get you all the learning value out of the video and image data.

0
💬 0

14837.137 - 14853.147 Nathan Lambert

But there's still scaling to be done on pre-training. But this post-training world is where all the flops are going to be spent, right? The model is going to play with itself. It's going to self-play. It's going to do verifiable tasks. It's going to do computer use in sandboxes. It might even do like simulated robotics things, right?

0
💬 0

14853.167 - 14873.896 Nathan Lambert

Like all of these things are going to be environments where compute is spent in quote unquote post-training. But I think it's going to be good. We're going to drop the post from post-training. It's going to be pre-training and it's going to be training, I think. At some point. Because for the bulk of the last few years, pre-training has dwarfed post-training.

0
💬 0

14874.537 - 14889.36 Nathan Lambert

But with these verifiable methods, especially ones that scale really potentially infinitely, like computer use and robotics, not just math and coding, where you can verify what's happening, those infinitely verifiable tasks, it seems you can spend as much compute as you want on them. Especially at the context length increase context.

0
💬 0

14889.38 - 14906.885 Dylan Patel

Because at the end of pre-training is when you increase the context length for these models. And we've talked earlier in the conversation about how the context length, when you have a long input, is much easier to manage than output. And a lot of these post-training and reasoning techniques rely on a ton of sampling, and it's becoming increasingly long context.

0
💬 0

14907.366 - 14927.796 Dylan Patel

So it's just like, effectively, your compute efficiency goes down. I think flops is the standard for how you measure it. But with RL, and you have to do all these things where you... move your weights around in a different way than at pre-training and just generation. It's going to become less efficient, and flops is going to be less of a useful term.

0
💬 0

14928.196 - 14930.778 Dylan Patel

And then as the infrastructure gets better, it's probably going to go back to flops.

0
💬 0

14931.419 - 14939.465 Lex Fridman

So all of the things we've been talking about is most likely going to be NVIDIA, right? Is there any competitors? Google, I kind of ignored them.

0
💬 0

14939.485 - 14941.927 Nathan Lambert

I was like, huh?

0
💬 0

14942.467 - 14944.509 Lex Fridman

What's the story with TPU? Like, what's the...

0
💬 0

14944.989 - 14964.901 Nathan Lambert

TPU is awesome, right? It's great. Google is... They're a bit more tepid on building data centers for some reason. They're building big data centers, don't get me wrong. And they actually have the biggest cluster. I was talking about NVIDIA clusters. They actually have the biggest cluster, period. But the way they do it is very interesting, right? They have two data center...

0
💬 0

14965.561 - 14988.13 Nathan Lambert

super regions right in that the data center isn't physically like all of the gpus aren't physically on one site but they're like 30 miles from each other not gpus tpus right they have like in in iowa nebraska they have four data centers that are just like right next to each other why doesn't google flex its cluster size go to multi-data center training this is good images in there so i'll show you what i mean it's just uh semi-analysis multi-data center

0
💬 0

14989.523 - 14996.046 Nathan Lambert

So this is an image of what a standard Google data center looks like. By the way, their data centers look very different than anyone else's data centers.

0
💬 0

14996.246 - 14997.887 Lex Fridman

What are we looking at here?

0
💬 0

14998.167 - 15016.036 Nathan Lambert

So if you see this image, in the center there are these big rectangular boxes. Those are where the actual chips are kept. And then if you scroll down a little bit further, you can see there's these water pipes, there's these chiller cooling towers in the top, and a bunch of diesel generators. The diesel generators are backup power.

0
💬 0

15016.716 - 15031.597 Nathan Lambert

the data center itself is like look physically smaller than the water chillers, right? So the chips are actually easier to like keep together, but then like cooling all the water for the water cooling is very difficult, right? So Google has like a very advanced infrastructure that no one else has for the TPU.

0
💬 0

15032.738 - 15044.864 Nathan Lambert

And what they do is they've like stamped these data center, they've dumped a bunch of these data centers out in a few regions, right? So if you go a little bit further down, this is this is a Microsoft, this is an Arizona, this is where GPT five, quote, unquote, will be trained.

0
💬 0

15046.085 - 15048.506 Dylan Patel

You know, if it doesn't exist already.

0
💬 0

15048.726 - 15067.124 Nathan Lambert

Yeah, if it doesn't exist already. But each of these data centers, I've shown a couple images of them. They're really closely co-located in the same region, Nebraska, Iowa. And then they also have a similar one in Ohio complex. And so these data centers are really close to each other. And what they've done is they've connected them super high bandwidth with fiber.

0
💬 0

15067.724 - 15082.638 Nathan Lambert

And so these are just a bunch of data centers. And the point here is that Google has a very advanced infrastructure. very tightly connected in a small region. So Elon will always have the biggest cluster fully connected, right? Because it's all in one building, right? And he's completely right on that, right?

0
💬 0

15082.958 - 15089.684 Nathan Lambert

Google has the biggest cluster, but you have to spread over three sites and by a significant margin, but you have to go across multiple sites.

0
💬 0

15090.005 - 15095.65 Lex Fridman

Why doesn't Google compete with NVIDIA? Why don't they sell TPUs?

0
💬 0

15096.324 - 15119.291 Nathan Lambert

I think there's a couple problems with it. It's like one, TPU has been a form of allowing search to be really freaking cheap and build models for that, right? And so like a big chunk of the search TPU purchases or big chunk of Google's purchases and usage, all of it is for internal workloads, right? Whether it be search, now Gemini, right?

0
💬 0

15119.512 - 15140.016 Nathan Lambert

YouTube, all these different applications that they have, you know, ads. These are where all their TPUs are being spent and that's what they're hyper-focused on, right? And so there's certain like aspects of the architecture that are optimized for their use case that are not optimized elsewhere. Right. One simple one is like they've open sourced a Gemma model and they called it Gemma 7B. Right.

0
💬 0

15140.557 - 15152.153 Nathan Lambert

But then it's actually eight billion parameters because the vocabulary is so large. And the reason they made the vocabulary so large is because TPUs like matrix multiply unit is massive. Because that's what they've like sort of optimized for.

0
💬 0

15152.193 - 15169.987 Nathan Lambert

And so they decided, oh, well, I'll just make the vocabulary large too, even though it makes no sense to do so on such a small model, because that fits on their hardware. So Gemma doesn't run as efficiently on a GPU as a Lama does, right? But vice versa, Lama doesn't run as efficiently on a TPU as a Gemma does. And so there's certain aspects of hardware software co-design.

0
💬 0

15170.247 - 15187.979 Nathan Lambert

So all their search models are their ranking and recommendation models. All these different models that are AI, but not like gen AI, have been hyper-optimized with TPUs forever. The software stack is super optimized, but all of this software stack has not been released publicly at all. Right. Very small portions of it. Jackson XLA have been.

0
💬 0

15188.299 - 15200.368 Nathan Lambert

But like the experience when you're inside of Google and you're training on TPUs as a researcher, you don't need to know anything about the hardware in many cases. Right. Like it's like pretty beautiful. But as soon as you step outside, they all go. A lot of them go back.

0
💬 0

15201.208 - 15226.148 Nathan Lambert

they leave google and then they go back yeah yeah they're like they leave and they start a company because they have all these amazing research ideas and they're like wait infrastructure is hard software is hard and this is on gpus or if they try to use tpus same thing because they don't have access to all this code and so it's like how do you convince a company whose golden goose is search where they're making hundreds of billions of dollars from to start selling gpu or tpus uh which they used to only buy a couple billion of you know i think in 2023 they bought like

0
💬 0

15227.949 - 15244.32 Nathan Lambert

um like a couple billion and now they're buying like 10 billion to 15 billion dollars worth but how do you convince them that they could they should just buy like twice as many and figure out how to sell them and make 30 billion dollars like who cares about making 30 billion dollars won't that 30 billion exceed actually the search profit eventually

0
💬 0

15245.12 - 15259.243 Nathan Lambert

I mean, like, you're always going to make more money on services than... Always. I mean, like, yeah. Like, to be clear, like, today people are spending a lot more on hardware than they are the services, right? Because the hardware front runs the service spend.

0
💬 0

15259.704 - 15261.324 Lex Fridman

But, like... You're investing.

0
💬 0

15261.564 - 15278.491 Nathan Lambert

If there's no revenue for AI stuff or not enough revenue, then obviously, like, it's going to blow up, right? You know, people won't continue to spend on GPUs forever. And then NVIDIA is trying to move up the stack with, like, software that they're trying to sell and license and stuff, right? But... Google has never had that DNA of like, this is a product we should sell, right?

0
💬 0

15278.971 - 15287.059 Nathan Lambert

Google Cloud, which is a separate organization from the TPU team, which is a separate organization from the DeepMind team, which is a separate organization from the search team, right? There's a lot of bureaucracy here.

0
💬 0

15287.079 - 15289.621 Lex Fridman

Wait, Google Cloud is a separate team than the TPU team?

0
💬 0

15289.921 - 15313.051 Nathan Lambert

Technically TPU sits under infrastructure, which sits under Google Cloud, but like Google Cloud, like for like renting stuff and TPU architecture are very different goals, right? In hardware and software, like all of this, right? Like the Jaxx XLA teams do not serve Google's customers externally. Whereas NVIDIA's various CUDA teams for like things like Nickel serve external customers, right?

0
💬 0

15314.291 - 15321.315 Nathan Lambert

The internal teams like Jaxx and XLA and stuff, they more so serve DeepMind and Search, right? And so their customer is different. They're not building a product for them.

0
💬 0

15321.635 - 15331.843 Lex Fridman

Do you understand why AWS keeps winning versus Azure for cloud versus Google Cloud? Google Cloud is tiny, isn't it, relative to AWS?

0
💬 0

15331.903 - 15351.94 Nathan Lambert

Google Cloud is third. Microsoft is the second biggest, but Amazon is the biggest, right? Yeah. And Microsoft deceptively sort of includes like Microsoft Office 365 and things like that, like some of these enterprise-wide licenses. So in reality, the gulf is even larger. Microsoft is still second though, right? Amazon is way bigger. Why? Because using AWS is better and easier.

0
💬 0

15352.321 - 15354.162 Nathan Lambert

And in many cases, it's cheaper. And it's first.

0
💬 0

15354.182 - 15357.285 Lex Fridman

It was first. Yeah, but there's a lot of things that are first.

0
💬 0

15357.345 - 15363.371 Dylan Patel

Well, it's easier. It's harder to switch than it is to do it. There's big fees for switching, too.

0
💬 0

15363.471 - 15374.661 Nathan Lambert

AWS generates over 80% of Amazon's profit. I think over 90%. That's insane. The distribution centers are just like, one day we'll decide to make money from this. But they haven't yet, right? They make tiny little profit from it.

0
💬 0

15374.681 - 15376.703 Dylan Patel

Yeah, one day Amazon Prime will triple in price.

0
💬 0

15377.003 - 15384.906 Lex Fridman

You would think they would improve AWS interface because it's like horrible. It's like clunky, but everybody is.

0
💬 0

15384.926 - 15387.527 Dylan Patel

Yeah, one would think.

0
💬 0

15387.547 - 15392.93 Nathan Lambert

I think actually Google's interface is sometimes nice, but it's also like they don't care about anyone besides their top customers.

0
💬 0

15393.07 - 15393.39 Lex Fridman

Exactly.

0
💬 0

15393.41 - 15396.271 Nathan Lambert

And like their customer service sucks and like they have a lot less like.

0
💬 0

15396.431 - 15401.075 Lex Fridman

I mean, all these companies, they optimize for the big customers, yeah. It's supposed to be for business.

0
💬 0

15401.415 - 15417.369 Nathan Lambert

Amazon has always optimized for the small customer too, though, right? Obviously, they optimize a lot for the big customer, but when they started, they just would go to random Bay Area things and give out credits, right? Or just put in your credit card and use us, right? Back in the early days. So they've always... The business has grown with them, right? And Virgin.

0
💬 0

15417.409 - 15431.345 Nathan Lambert

So why does Amazon... Why is Snowflake all over Amazon? Because Snowflake, in the beginning, when Amazon didn't care about them... was still using Amazon, right? And then, of course, one day, Snowflake and Amazon has a super huge partnership. But, like, this is the case. Like, Amazon's user experience and quality is better.

0
💬 0

15431.646 - 15451.767 Nathan Lambert

Also, a lot of the silicon they've engineered makes them have a lower cost structure in traditional cloud storage, CPU, networking, that kind of stuff than... in databases, right? Like the, you know, I think like four of Amazon's top five revenue products, uh, margin products are like gross profit products or all database related products like redshift and like all these things, right?

0
💬 0

15451.787 - 15472.723 Nathan Lambert

Like, um, so, so Amazon has a very like good Silicon to a user experience, like entire pipeline with AWS. I think Google they're in for their Silicon teams. Yeah. They have awesome Silicon internally, TPU, the YouTube chip, um, you know, some of these other chips that they've made and, uh, The problem is they're not serving external customers, they're serving internal customers, right?

0
💬 0

15473.023 - 15493.835 Dylan Patel

I mean, NVIDIA's entire culture is designed from the bottom up to do this. There's this recent book, The NVIDIA Way by Taekim, that details this and how they look for future opportunities and ready their CUDA software libraries to make it so that new applications of high-performance computing can very rapidly be evolved on CUDA and NVIDIA chips.

0
💬 0

15493.975 - 15497.777 Dylan Patel

And that is entirely different than Google as a services business.

0
💬 0

15498.43 - 15512.92 Lex Fridman

Yeah, I mean, NVIDIA, it should be said, is a truly special company. Like, I mean, they, the whole, the culture, everything, they're really optimized for that kind of thing. Speaking of which, is there somebody that can even challenge NVIDIA, hardware-wise? Intel, AMD...

0
💬 0

15514.22 - 15533.036 Nathan Lambert

I really don't think so. We went through a very long process of working with AMD on training on their GPUs, inference and stuff. And they're decent. Their hardware is better in many ways than NVIDIA's. The problem is their software is really bad. And I think they're getting better, right? They're getting better faster, but the gulf is so large.

0
💬 0

15534.617 - 15548.407 Nathan Lambert

And, like, they don't spend enough resources on it, or haven't historically, right? Maybe they're changing their tune now, but, you know, for multiple months, we were submitting the most bugs, right? Like, us, semi-analysis, right? Like, what the fuck? Like, why are we submitting the most bugs, right?

0
💬 0

15548.527 - 15564.697 Nathan Lambert

Because they only cared about their, like, biggest customers, and so they'd ship them a private image, blah, blah, blah, and it's like, okay, but, like... I am just using PyTorch and I want to use the publicly available libraries and you don't care about that. Right. So they're getting better. But like, I think AMD is not possible.

0
💬 0

15564.757 - 15572.902 Nathan Lambert

Intel is obviously in dire straits right now and needs to be saved somehow. Very important for national security, for American technology.

0
💬 0

15573.002 - 15575.944 Lex Fridman

Can you explain the obviously? So why are they in dire straits?

0
💬 0

15576.124 - 15595.783 Nathan Lambert

Going back to earlier, only three companies can R&D, right? Taiwan, Hsinchu, Samsung, Pyongyang, and then Intel Hillsborough. Samsung's doing horribly. Intel's doing horribly. We could be in a world where there's only one company that can do R&D. And that one company already manufactures most of the chips. They've been gaining market share anyways, but like... That's a critical thing, right?

0
💬 0

15595.843 - 15613.89 Nathan Lambert

So what happens to Taiwan means the rest of the world's semiconductor industry and therefore tech relies on Taiwan, right? And that's obviously precarious. As far as like Intel, they've been slowly, steadily declining. They were on top of servers and PCs, but now Apple's done the M1 and NVIDIA is releasing a PC chip. And

0
💬 0

15614.29 - 15643.037 Nathan Lambert

Qualcomm's releasing a PC chip and in servers hyperscalers are all making their own ARM based server chips and Intel has no AI silicon like wins right they have very small wins and they never got into mobile because they said no to the iPhone and like all these things have compounded and they've lost their process technology leadership right they were ahead for 20 years and now they're behind by at least a couple years right and they're trying to catch back up and we'll see if like their 18A 14A strategy works out where they try and leapfrog TSMC but like

0
💬 0

15643.817 - 15655.564 Nathan Lambert

And Intel is just like losing tons of money anyways, right? And they just fired their CEO, even though the CEO was the only person who understood the company well, right? We'll see. He was not the best, but he was pretty good, relatively, technical guy.

0
💬 0

15656.004 - 15658.205 Lex Fridman

Where does Intel make most of its money? The CPUs still?

0
💬 0

15658.325 - 15677.519 Nathan Lambert

PCs and data center CPUs, yeah. But data center CPUs are all going cloud. And Amazon, Microsoft, Google are making ARM-based CPUs. And then PC side, AMD's gained market share. NVIDIA's launching a chip. That's not going to be a success, right? MediaTek, Qualcomm have relaunched chips. Apple's doing well, right? Like they could get squeezed a little bit in PC.

0
💬 0

15677.739 - 15680.821 Nathan Lambert

Although PC generally, I imagine, will just stick Intel mostly for Windows side.

0
💬 0

15681.496 - 15686.98 Lex Fridman

Let's talk about the broad AI race. Who do you think wins? We talked about Google.

0
💬 0

15687.361 - 15691.324 Dylan Patel

The default leader has been Google because of their infrastructure advantage.

0
💬 0

15691.904 - 15694.907 Lex Fridman

Well, like in the news, OpenAI is the leader.

0
💬 0

15695.207 - 15704.794 Dylan Patel

They're the leading in the narrative. They have the best model. They have the best model that people can use and they're experts. And they have the most AI revenue. Yeah. OpenAI is winning.

0
💬 0

15705.455 - 15709.038 Lex Fridman

So who's making money on AI right now? Is anyone making money?

0
💬 0

15709.44 - 15729.169 Nathan Lambert

So accounting profit-wise, Microsoft is making money, but they're spending a lot of CapEx, right? And that gets depreciated over years. Meta is making tons of money, but with recommendation systems, which is AI, but not with Lama, right? Lama is losing money for sure, right? Yeah. I think Anthropic and OpenAI are obviously not making money because otherwise they wouldn't be raising money, right?

0
💬 0

15729.65 - 15743.587 Nathan Lambert

They have to raise money to build more, right? Although theoretically they are making money, right? Like, you know, you spent a few hundred million dollars on GPT-4 and it's doing billions in revenue. So like, obviously it's like making money. Although they had to continue to research to get the compute efficiency wins, right?

0
💬 0

15743.927 - 15759.514 Nathan Lambert

And move down the curve to get that 1200X that has been achieved for GPT-3. Maybe we're only at a couple hundred X now, but with GPT-4 Turbo and 4.0, and there'll be another one probably cheaper than GPT-4.0 even that comes out at some point.

0
💬 0

15759.934 - 15775.856 Lex Fridman

And that research costs a lot of money. Yep, exactly. That's the thing that I guess is not talked about with the cost, that when you're referring to the cost of the model, it's not just the training or the test runs, it's the actual research, the manpower.

0
💬 0

15776.316 - 15796.354 Nathan Lambert

Yeah, to do things like reasoning, right? Now that that exists, they're going to scale it, they're going to do a lot of research still. I think the... People focus on the payback question, but it's really easy to just be like, well, GDP is humans and industrial capital, right? And if you can make intelligence cheap, then you can grow a lot, right? That's the sort of dumb way to explain it.

0
💬 0

15796.394 - 15814.515 Nathan Lambert

But that's sort of what basically the investment thesis is. Um, I think only Nvidia is actually making tons of money and other hardware vendors. Um, the hyperscalers are all on paper making money. Uh, but in reality, they're like spending a lot more on purchasing the GPUs, which you don't know if they're still going to make this much money on each GPU in two years. Right.

0
💬 0

15814.936 - 15835.144 Nathan Lambert

Um, you don't know if, um, All of a sudden, OpenAI goes kapoof, and now Microsoft has hundreds of thousands of GPUs they were renting to OpenAI that they paid for themselves with their investment in them that no longer have a customer. This is always a possibility. I don't believe that. I think OpenAI will keep raising money.

0
💬 0

15835.184 - 15841.966 Nathan Lambert

I think others will keep raising money because the returns from it are going to be eventually huge once we have AGI.

0
💬 0

15841.986 - 15859.892 Lex Fridman

Do you think multiple companies will get... I don't think it's winner-take-all. Okay. So it's not, let's not call it AGI, whatever. It's like a single day. It's a gradual thing. Super powerful AI. But it's a gradually increasing set of features that are useful and make a lot of money.

0
💬 0

15859.972 - 15861.752 Nathan Lambert

Rapidly increasing set of features.

0
💬 0

15862.012 - 15875.059 Lex Fridman

Rapidly increasing set of features. So you're saying a lot of companies will be... It just seems absurd that all of these companies are building gigantic data centers.

0
💬 0

15875.099 - 15888.247 Dylan Patel

There are companies that will benefit from AI, but not because they trained the best model. Like Meta has so many avenues to benefit from AI in all of their services. People are there, people spend time on Meta's platforms, and it's a way to make more money per user per hour.

0
💬 0

15888.587 - 15916.264 Lex Fridman

Yeah, it seems like Google... slash XAI slash Tesla, important to say, and then Meta will benefit not directly from the AI, like the LLMs, but from the intelligence, like the additional boost of intelligence to the products they already sell. So whether that's the recommendation system, or for Elon, who's been talking about Optimus, the robot, potentially the intelligence of the robot.

0
💬 0

15916.725 - 15927.63 Lex Fridman

And then you have personalized robots in the home, that kind of thing. He thinks it's a $10 plus trillion business, which... At some point, maybe.

0
💬 0

15927.67 - 15930.572 Dylan Patel

Not soon, but who knows what robotics will be used for.

0
💬 0

15930.592 - 15940.397 Nathan Lambert

Let's do a TAM analysis, right? 8 billion humans, and let's get 8 billion robots, right? And let's pay them the average salary, and yeah, there we go, $10 trillion. More than $10 trillion. Yeah.

0
💬 0

15940.836 - 15946.942 Lex Fridman

Yeah, I mean, you know, if there's robots everywhere, why does it have to be just 8 billion robots?

0
💬 0

15946.962 - 15951.325 Nathan Lambert

Yeah, of course, of course. I'm going to have like one robot. You're going to have like 20.

0
💬 0

15951.886 - 15960.594 Lex Fridman

Yeah, I mean, I see a use case for that. So, yeah. So, I guess the benefit would be in the products they sell, which is why OpenAI is in a trickier position because they –

0
💬 0

15960.905 - 15981.973 Dylan Patel

All of the value of OpenAI right now as a brand is in ChatGPT. And there is actually not that, for most users, there's not that much of a reason that they need OpenAI to be spending billions and billions of dollars on the next best model when they could just license Lama 5 and Furby way cheaper. So that's kind of like ChatGPT is an extremely valuable entity to them.

0
💬 0

15983.873 - 15985.614 Dylan Patel

But they could make more money just off that.

0
💬 0

15985.794 - 16003.579 Nathan Lambert

The chat application is clearly like does not have tons of room to continue, right? Like the standard chat, right? Where you're just using it for random questions and stuff, right? The cost continues to collapse. V3 is the latest one. It'll go down to ads. Biggest. But it's going to get supported by ads, right? Like, you know, Meta already serves 405B, probably loses the money.

0
💬 0

16003.599 - 16016.006 Nathan Lambert

But at some point, you know, they're going to get, the models are going to get so cheap that they can just serve them for free with ads supported, right? And that's what Google is going to be able to do. And that's obviously they've got a bigger reach, right? So chat is not going to be the only use case.

0
💬 0

16016.286 - 16024.274 Nathan Lambert

It's like these reasoning, code, agents, computer use, all this stuff is where OpenAI has to actually go to make money in the future. Otherwise, they're kaputs.

0
💬 0

16024.714 - 16035.825 Lex Fridman

But X, Google, and Meta have these other products. So isn't it likely that OpenAI and Anthropic disappear eventually?

0
💬 0

16036.573 - 16038.655 Nathan Lambert

Unless they're so good at models, which they are.

0
💬 0

16038.675 - 16040.076 Lex Fridman

But it's such a cutting edge.

0
💬 0

16040.096 - 16042.859 Nathan Lambert

It depends on where you think AI capabilities are going.

0
💬 0

16042.879 - 16043.88 Lex Fridman

You have to keep winning.

0
💬 0

16044.321 - 16044.501 Dylan Patel

Yes.

0
💬 0

16044.881 - 16064.398 Lex Fridman

You have to keep winning. As you climb, even if the AI capabilities are going super rapidly, awesome, into the direction of AGI... Like there's still a boost for X in terms of data, Google in terms of data, Meta in terms of data, in terms of other products and the money and like there's just huge amounts of money.

0
💬 0

16064.418 - 16069.883 Nathan Lambert

The whole idea is human data is kind of tapped out. We don't care. We all care about self-play, verifiable data.

0
💬 0

16070.183 - 16089.613 Dylan Patel

If you think about AWS, AWS does not make a lot of money on each individual machine. And the same can be said for the most powerful AI platform, which is even though the calls to the API are so cheap, there's still a lot of money to be made by owning that platform. And there's a lot of discussions as it's the next compute layer.

0
💬 0

16090.133 - 16110.669 Nathan Lambert

You have to believe that, and you know, there's a lot of discussions that tokens and tokenomics and LLM APIs are the next compute layer, or the next paradigm for the economy, kind of like energy and oil was. But there's also like, you have to sort of believe that APIs and chat are not where AI is stuck, right? It is actually just tasks and agents and robotics and computer use.

0
💬 0

16110.709 - 16116.634 Nathan Lambert

And those are the areas where all the value will be delivered, not API, not chat application.

0
💬 0

16117.454 - 16127.885 Lex Fridman

Is it possible you have, I mean, it all just becomes a commodity and you have the very thin wrapper, like perplexity. Just joking.

0
💬 0

16129.226 - 16131.228 Dylan Patel

There are a lot of wrappers making a lot of money.

0
💬 0

16131.509 - 16140.898 Lex Fridman

Yeah, but do you think it's possible that people would just even forget what OpenAI and Anthropic is? Because there would be wrappers around the API and it just dynamically...

0
💬 0

16140.998 - 16160.45 Nathan Lambert

If model progress is not rapid, yeah, it's becoming a commodity, right? DeepSeek v3 shows this, but also the GPT-3 chart earlier, Kurt chart showed this, right? Lama 3B is 1200x cheaper than GPT-3. Any GPT-3, like anyone whose business model was GPT-3 level capabilities is dead. Anyone whose business model is GPT-4 level capabilities is dead.

0
💬 0

16160.71 - 16166.174 Dylan Patel

It is a common saying that the best businesses being made now are ones that are predicated on models getting better.

0
💬 0

16166.835 - 16171.107 Lex Fridman

which would be like wrappers, thing that is riding the wave of the models.

0
💬 0

16172.033 - 16188.846 Dylan Patel

The short term, the company that could make the most money is the one that figures out what advertising targeting method works for language model generations. We have the meta ads, which are hyper-targeted in feed, not within specific pieces of content. And we have search ads that are used by Google and Amazon has been rising a lot on search.

0
💬 0

16189.266 - 16206.776 Dylan Patel

But within a piece, within a return from ChatGPT, it is not clear how you get a high quality placed ad within the output. And if you can do that with model costs coming down, you're you can just get super high revenue. Like that revenue is totally untapped and it's not clear technically how it is done.

0
💬 0

16207.056 - 16218.898 Lex Fridman

Yeah, that is, I mean, the sort of the AdSense innovation that Google did. The one day you'll have in GPT output an ad and that's going to make like billions.

0
💬 0

16218.938 - 16229.7 Dylan Patel

And it could be very subtle. It could be in conversation. Like we have voice mode now. It could be some way of making it so the voice introduces certain things. It's much harder to measure and it takes imagination, but yeah.

0
💬 0

16230.16 - 16246.513 Lex Fridman

And it wouldn't come off shady so that you would receive public blowback, that kind of thing. So you have to do it loud enough to where it's clear it's an ad and balance all of that. So that's the open question they're trying to solve. Anthropic and open AI, they need to. They might not say that they're trying.

0
💬 0

16246.533 - 16247.754 Nathan Lambert

I don't think they care about that at all.

0
💬 0

16247.774 - 16252.618 Dylan Patel

They don't care about it right now. I think it's places like Perplexity are experimenting on that more.

0
💬 0

16253.519 - 16255.42 Lex Fridman

Oh, interesting. Yeah, for sure.

0
💬 0

16255.48 - 16279.636 Nathan Lambert

Like Perplexity, Google, Meta care about this. I think OpenAI and Anthropic are purely laser focused on agents and AGI. And if I build AGI, I can make tons of money, right? Or I can pay for everything, right? And this is just predicated back on the export control thing, right? If you think AGI is five, 10 years away or less, right? These labs think it's two, three years away.

0
💬 0

16280.476 - 16293.187 Nathan Lambert

Obviously, your actions are – if you assume they're rational actors, which they are mostly, what you do in a two-year AGI versus five-year versus 10-year is very, very, very different, right?

0
💬 0

16294.268 - 16311.371 Lex Fridman

Do you think agents are promising? We'll have to talk about this. This was – this is like the excitement of the year that agents are going to – this is the generic – hype term that a lot of business folks are using. AI agents are going to revolutionize everything.

0
💬 0

16311.751 - 16326.52 Dylan Patel

Okay, so mostly the term agent is obviously overblown. We've talked a lot about reinforcement learning as a way to train for verifiable outcomes. Agents should mean something that is open-ended and is solving a task independently on its own and able to adapt to uncertainty.

0
💬 0

16327.04 - 16348.593 Dylan Patel

There's a lot of the term agent applied to things like Apple intelligence, which we still don't have after the last WWDC, which is orchestrating between apps. And that type of tool use thing is something that language models can do really well. Apple intelligence, I suspect, will come eventually. It's a closed domain. It's your messages app integrating with your photos, with AI in the background.

0
💬 0

16349.013 - 16364.883 Dylan Patel

That will work. That has been described as an agent by a lot of software companies to get into the narrative. The question is, what ways can we get language models to generalize to new domains and solve their own problems in real time?

0
💬 0

16365.483 - 16381.033 Dylan Patel

Maybe some tiny amount of training when they are doing this with fine-tuning themselves or in-context learning, which is the idea of storing information in a prompt, and you can use learning algorithms to update that, and whether or not you believe that that is going to actually generalize to things like

0
💬 0

16382.134 - 16392.866 Dylan Patel

And me saying, book my trip to go to Austin in two days, I have X, Y, Z constraints and actually trusting it. I think there's an HCI problem, coming back for information.

0
💬 0

16393.808 - 16398.273 Lex Fridman

Well, what's your prediction there? Because my gut says we're very far away from that.

0
💬 0

16399.179 - 16419.629 Nathan Lambert

I think OpenAI's statement, I don't know if you've seen the five levels, right? Where it's chat is level one, reasoning is level two, and then agents is level three. And I think there's a couple more levels, but it's important to note, right? We were in chat for a couple of years, right? We just theoretically got to reasoning. We'll be here for a year or two, right? And then agents.

0
💬 0

16419.649 - 16439.639 Nathan Lambert

But at the same time, people can try and approximate capabilities of the next level. But the agents are doing things autonomously, doing things for minutes at a time, hours at a time, et cetera, right? Reasoning is doing things for... tens of seconds at a time, right? And then coming back with an output that I still need to verify and use and try to check out, right?

0
💬 0

16440.14 - 16458.352 Nathan Lambert

And the biggest problem is, of course, like it's the same thing with manufacturing, right? Like there's the whole Six Sigma thing, right? Like, you know, how many nines do you get? And then you compound the nines onto each other. And it's like, if you multiply, you know, by the number of steps that are Six Sigma, you get to, you know, a yield or something, right?

0
💬 0

16458.372 - 16474.401 Nathan Lambert

So like in semiconductor manufacturing, tens of thousands of steps, 9999999 is not enough, right? Right? Because you multiply by that many times, you actually end up with like 60% yield, right? Yeah, or zero. Really low yield, yeah, or zero. And this is the same thing with agents, right? Like chaining tasks together each time.

0
💬 0

16475.101 - 16495.388 Nathan Lambert

LLMs, even the best LLMs in particularly pretty good benchmarks don't get 100%, right? They get a little bit below that because there's a lot of noise. And so... How do you get to enough nines, right? This is the same thing with self-driving. We can't have self-driving because without it being like super geo-fenced like Google's, right?

0
💬 0

16495.608 - 16501.289 Nathan Lambert

And even then they have a bunch of teleoperators to make sure it doesn't get stuck, right? But you can't do that because it doesn't have enough nines.

0
💬 0

16501.709 - 16530.425 Lex Fridman

And self-driving has quite a lot of structure because roads have rules. It's well-defined. There's regulation. When you're talking about computer use for the open web, for example, or the open operating system, Like, there's no, it's a mess. So, like, the possibility, I'm always skeptical of any system that is tasked with interacting with the human world, with the open, messy human world.

0
💬 0

16530.445 - 16542.079 Dylan Patel

That's the thing, if we can't get intelligence that's enough to solve the human world on its own. We can create infrastructure like the human operators for Waymo over many years that enable certain workflows.

0
💬 0

16542.139 - 16549.208 Nathan Lambert

There is a company, I don't remember what it is, but that's literally their pitch is, yeah, we're just going to be the human operator when agents fail. And you just call us and we fix it.

0
💬 0

16549.588 - 16564.513 Dylan Patel

Yeah, it's like an API call and it's hilarious. There's going to be teleoperation markets when we get human robots, which is there's going to be somebody around the world that's happy to fix the fact that it can't finish loading my dishwasher when I'm unhappy with it. But that's just going to be part of the Tesla service package.

0
💬 0

16565.174 - 16574.837 Lex Fridman

I'm just imagining like an AI agent talking to another AI agent. One company has an agent that specializes in helping other AI agents.

0
💬 0

16575.165 - 16594.123 Dylan Patel

But if you can make things that are good at one step, you can stack them together. So that's why I'm like, if it takes a long time, we're going to build infrastructure that enables it. You see the operator launch. They have partnerships with certain websites, with DoorDash, with OpenTable, with things like this. Those partnerships are going to let them climb really fast.

0
💬 0

16594.163 - 16614.762 Dylan Patel

Their model is going to get really good at those things. It's Proof of concept, that might be a network effect where more companies want to make it easier for AI. Some companies will be like, no, let's put blockers in place. And this is the story of the internet we've seen. We see it now with training data for language models where companies are like, no, you have to pay. Business working it out.

0
💬 0

16615.062 - 16628.894 Lex Fridman

That said, I think airlines and hotels have high incentive to make their site work really well, and they usually don't. Like, if you look at how many clicks it takes to order an airplane ticket, it's insane.

0
💬 0

16629.034 - 16631.677 Dylan Patel

You actually can't call an American Airlines agent anymore.

0
💬 0

16632.437 - 16654.935 Lex Fridman

They don't have a phone number. I mean, it's horrible. To imagine that agents will be able to deal with that website when I, as a human, struggle. Like, I have an existential crisis every time I try to book an airplane ticket that I... I think it's going to be extremely difficult to build an AI agent that's robust in that way.

0
💬 0

16655.015 - 16673.833 Dylan Patel

But think about it. United has accepted the Starlink term, which is they have to provide Starlink for free and the users are going to love it. What if one airline is like, we're going to take a year and we're going to make our website have white text that works perfectly for the AIs. Every time anyone asks about an AI flight... They buy whatever airline it is.

0
💬 0

16673.873 - 16684.62 Nathan Lambert

Or like, they just like, here's an API and it's only exposed to AI agents. And if anyone queries it, the price is 10% higher and for any flight, but we'll let you see any of our flights and you can just book any of them.

0
💬 0

16685.04 - 16685.62

Here you go, agent.

0
💬 0

16685.68 - 16699.748 Nathan Lambert

And then it's like, oh, and I made 10% higher price. Awesome. Yeah. And like, am I willing to say that for like, hey, book me a flight to see Lex, right? And it's like, yeah, whatever. Yeah. Yeah. Okay. I think computers and real world and the open world are really, really messy.

0
💬 0

16701.169 - 16722.44 Nathan Lambert

But if you start defining the problem in narrow regions, people are going to be able to create very, very productive things. And ratchet down cost massively, right? Like now crazy things like, you know, robotics in the home, you know, those are going to be a lot harder to do just like self-driving, right? Because there's just a billion different failure modes, right?

0
💬 0

16722.7 - 16739.092 Nathan Lambert

But, but like agents that can like navigate a certain set of websites and do certain sets of tasks. Or like, look at, you know, look at your, you know, take a photo of your groceries, your fridge and or like upload your recipes. And then like it figures out what to order from, you know, Amazon slash Whole Foods food delivery.

0
💬 0

16739.132 - 16748.72 Nathan Lambert

Like that's then that's going to be like pretty quick and easy to do, I think. So it's going to be a whole range of like business outcomes. And it's going to be tons of tons of sort of optimism around people can just figure out ways to make money.

0
💬 0

16748.84 - 16768.916 Dylan Patel

To be clear, these sandboxes already exist in research. There are people who have built clones of all the most popular websites of Google, Amazon, blah, blah, blah, to make it so that there's... I mean, OpenAI probably has them internally to train these things. It's the same as DeepMind's robotics team for years has had clusters for robotics where you interact with robots fully remotely.

0
💬 0

16768.956 - 16786.453 Dylan Patel

They just have a lab in London and you send tasks to it, arrange the blocks, and you do this research. Obviously, there's techs there that fix stuff, but... We've turned these cranks of automation before. You go from sandbox to progress, and then you add one more domain at a time and generalize.

0
💬 0

16786.714 - 16803.326 Dylan Patel

In the history of NLP and language processing, instruction tuning in tasks per language model used to be like one language model did one task. And then in the instruction tuning literature, there's this point where you start adding more and more tasks together, where it just starts to generalize to every task. And we don't know where on this curve we are.

0
💬 0

16803.346 - 16815.71 Dylan Patel

I think for reasoning with this RL and verifiable domains, we're early, but we don't know where the point is where you just start training on enough domains and poof, like more domains just start working and you've crossed the generalization barrier.

0
💬 0

16816.43 - 16828.917 Lex Fridman

Well, what do you think about the programming context? So software engineering, that's where I personally and I know a lot of people interact with AI the most.

0
💬 0

16829.097 - 16837.823 Nathan Lambert

There's a lot of fear and angst too from current CS students, but there's also, that's where, that is the area where probably the most AI revenue and productivity gains have come, right?

0
💬 0
0
💬 0

16838.543 - 16858.231 Nathan Lambert

Um, whether it be co-pilots or cursor or, uh, what have you, right? This is, or just standard chat GPT, right? Like a lot of, I don't, I know very few programmers who don't have chat GPT and actually many of them have the $200 tier because that's what it's, it's so good for, right? Um, I think that in that world, uh, we already see it like sweep bench.

0
💬 0

16858.371 - 16875.825 Nathan Lambert

And if you've looked at the benchmark, uh, made by some Stanford students, uh, I wouldn't say it's like really hard, but I wouldn't say it's easy either. I think like it takes someone who's been through at least, you know, a few years of CS or a couple of years of programming to do sweep bench well. And the models went from 4% to 60% in like a year, right?

0
💬 0

16876.546 - 16893.221 Nathan Lambert

And where are they going to go to next year? You know, it's going to be higher. It probably won't be 100% because again, that nines is like really hard to do. But we're going to get to some point where that's, and then we're going to need harder software engineering benchmarks and so on and so forth. But the way that people think of it now is it can do code completion easy.

0
💬 0

16893.241 - 16914.98 Nathan Lambert

It can do some function generation and have to review it. Great. But really, the software engineering agents, I think, can be done faster, sooner than any other agent because it is a verifiable domain. You can always unit test or compile. And there's many different regions of... It can inspect the whole code base at once, which no engineer really can.

0
💬 0

16915.04 - 16930.646 Nathan Lambert

Only the architects can really think about this stuff, the really senior guys, and they can define stuff. And then the agent can execute on it. So I think software engineering costs are going to plummet like crazy. And one interesting aspect of that is when software engineering costs are really low, you get very different markets, right?

0
💬 0

16930.666 - 16950.924 Nathan Lambert

So in the US, you have all these platform SaaS companies, right? Salesforce and so on and so forth, right? In China, no one uses platform SaaS. everyone just builds their own stack because software engineering is much cheaper in China. Uh, partially because like people, STEM, number of STEM graduates, et cetera. Uh, so STEM is, so it's generally just cheaper to do. Um,

0
💬 0

16953.109 - 16969.179 Nathan Lambert

And so at the same time, code LLMs have been adopted much less in China because the cost of an engineer there is much lower. But what happens when every company can just invent their own business logic really cheaply and quickly? You stop using platform SaaS, you start building custom tailored solutions, you change them really quickly.

0
💬 0

16969.199 - 16984.63 Nathan Lambert

Now, all of a sudden, your business is a little bit more efficient too, potentially, because you're not dealing with the hell that is some random platform SaaS company stuff not working perfectly and having to adjust workflows or random business automation cases that aren't necessarily AI required. It's just logic that needs to be built that no one has built, right?

0
💬 0

16984.67 - 17002.708 Nathan Lambert

All of these things can go happen faster. And so I think software... And then the other domain is like industrial, chemical, mechanical engineers suck at coding, right? Just generally. And like their tools, like semiconductor engineers, their tools are 20 years old. All the tools run on XP, including ASML lithography tools run on Windows XP, right? It's like, you know, and like...

0
💬 0

17003.328 - 17021.533 Nathan Lambert

A lot of the analysis happens in Excel, right? Like, it's just like, guys, like you guys can move 20 years forward with all the data you have and gathered and like do a lot better. It's just you need the engineering skills for software engineering to be delivered to the actual domain expert engineer. So I think that's the area where I'm like super duper bullish of generally AI creating value.

0
💬 0

17021.993 - 17044.18 Dylan Patel

The big picture is that I don't think it's going to be a cliff. It's like we talked to a really good example of how growth changes is when meta added stories. So Snapchat was on an exponential. They added stories. It flatlined. Software engineers, then up until the right. AI is going to come in. It's probably just going to be flat. It's a lot like everyone's going to lose their job.

0
💬 0

17044.6 - 17064.359 Dylan Patel

It's hard because the supply corrects more slowly. So the amount of students is still growing and that'll correct on a multi-year, like a year delay, but the amount of jobs will just turn. And then maybe in 20, 40 years, it'll be well down. But in the few years, there'll never going to be the snap moment where it's like software engineers aren't useful.

0
💬 0

17064.844 - 17081.037 Lex Fridman

I think also the nature of what it means to be a programmer and what kind of jobs programmers do changes. Because I think there needs to be a human in the loop of everything you've talked about. There's a really important human in that picture of correcting the code.

0
💬 0

17083.639 - 17085.14 Nathan Lambert

Thinking larger than the context length.

0
💬 0

17085.52 - 17096.043 Lex Fridman

Yep. And debugging also. Like, debugging by sort of reading the code, understanding the steering the system. Like, no, no, no, you missed the point. Adding more to the prompt.

0
💬 0

17096.643 - 17109.808 Dylan Patel

Kind of like, yes, adding the human... Designing the perfect Google button. Google's famous for having people design buttons that are so perfect. And it's like, how is AI going to do that? Like, they could give you all the ideas, but...

0
💬 0

17110.648 - 17126.999 Lex Fridman

I mean, that's the thing. You can call it taste. One thing humans can do is figure out what other humans enjoy better than AI systems. That's where the preference, you're loading that in, but ultimately humans are the greatest preference generator. That's where the preference comes from.

0
💬 0

17127.259 - 17144.927 Dylan Patel

And humans are actually very good at reading or judging between two things. This goes back to the core of what RLHF and preference tuning is, is that it's hard to generate a good answer for a lot of problems, but it's easy to see which one is better. And that's how we're using humans for AI now, is judging which one is better. And that's what software engineering could look like.

0
💬 0

17145.687 - 17153.071 Dylan Patel

The PR review, here's a few options. Here are some potential pros and cons. And they're going to be judges.

0
💬 0

17154.271 - 17178.603 Lex Fridman

I think the thing I would very much recommend is programmers start using AI and embracing that role of the supervisor of the AI system and partner of the AI system versus writing from scratch or not learning coding at all and just generating stuff. Because I think there actually has to be a pretty high level of expertise as a programmer to be able to manage intelligent systems.

0
💬 0

17178.743 - 17197.237 Nathan Lambert

I think it's that and then becoming a domain expert in something. Sure, yeah. Seriously, if you go look at aerospace or semiconductors or chemical engineering, everyone is using really crappy platforms, really old software. The job of a data scientist is a joke in many cases. In many cases, it's very real, but it's like,

0
💬 0

17197.857 - 17214.057 Nathan Lambert

bring what the forefront of human capabilities are to your domain and like even if the forefront is like from the AI your domain you're like at the forefront right so it's like it's like you have to be at the forefront of something and then leverage the the like rising tide that is AI for everything else oh yeah there's so many low-hanging fruit

0
💬 0

17215.559 - 17242.28 Lex Fridman

in terms of where software can help automate a thing or digitize a thing in a legal system. I mean, that's why Doge is exciting. I mean, I got to hang out with a bunch of the Doge folks and they... I mean, government is like so old school. It's like begging for the modernization of software, of organizing the data, all this kind of stuff.

0
💬 0

17242.52 - 17273.929 Lex Fridman

I mean, in that case, it's by design because bureaucracy protects centers of power, and so on. But software breaks down those barriers, so it hurts those that are holding onto power, but ultimately benefits humanity. So there's a bunch of domains of that kind. One thing we didn't fully finish talking about is open source. So first of all, congrats. You released a new model? Yeah. Tulu.

0
💬 0

17274.59 - 17294.488 Dylan Patel

I'll explain what a Tulu is. A Tulu is a hybrid camel when you breed a dromedary with a Bacchian camel. Back in the early days after Chachipiti, there was a big wave of models coming out like Alpaca, Vicuna, et cetera, that were all named after various mammalian species. So Tulu, the brand is multiple years old, which comes from that. And

0
💬 0

17295.649 - 17318.965 Dylan Patel

We've been playing at the frontiers of post-training with open source code. And this first part of this release was in the fall where we built on Lama's open models, open weight models, and then we add in our fully open code, our fully open data. There's a popular benchmark that is Chatbot Arena, and that's generally the metric by which how these chat models are evaluated.

0
💬 0

17319.285 - 17338.696 Dylan Patel

And it's humans compare random models from different organizations. And if you looked at the leaderboard in November or December, among the top 60 models from 10s to 20s of organizations, none of them had open code or data for just post-training. Among that, even fewer or none have pre-training data and code available, but post-training is much more accessible at this time.

0
💬 0

17338.736 - 17355.907 Dylan Patel

It's still pretty cheap and you can do it. And the thing is, how high can we push this number where people have access to all the code and data? So that's kind of the motivation of the project. We draw on lessons from Lama. NVIDIA had a Nemotron model where the recipe for their post-training was fairly open with some data and a paper.

0
💬 0

17356.207 - 17361.791 Dylan Patel

And it's putting all these together to try to create a recipe that people can fine-tune models like GPT-4 to their domain.

0
💬 0

17362.271 - 17372.517 Lex Fridman

So to be clear, in the case of Tulu, maybe you can talk about Alma too, but in the case of Tulu, you're taking Lama 3, 4, 5B.

0
💬 0

17373.277 - 17377.7 Dylan Patel

Tulu has been a series of recipes for post-training. So we've done multiple models over years.

0
💬 0

17378.32 - 17380.361 Lex Fridman

And so you're open sourcing everything.

0
💬 0

17381.132 - 17394.575 Dylan Patel

Yeah, if you start with an open weight-based model, the whole model technically is an open source because you don't know what Lama put into it, which is why we have a separate thing that we'll get to. But it's just getting parts of the pipeline where people can zoom in and customize.

0
💬 0

17394.615 - 17418.447 Dylan Patel

I know I hear from startups and businesses, they're like, okay, I can take this post-training and try to apply it to my domain. We talk about verifiers a lot. We use this idea, which is reinforcement learning with verifiable domain rewards, RLVR, kind of similar to RLHF. And we applied it to math and the model today, which is like we applied it to the Lama 405B base model from last year.

0
💬 0

17418.748 - 17438.805 Dylan Patel

And we have our other stuff. We have our instruction tuning and our preference tuning. But the math thing is interesting, which is like it's easier to improve this math benchmark. There's a benchmark, M-A-T-H, math, all capitals. tough name on the benchmark is name is the area that you're evaluating. We're researchers. We're not, we're not brands, brand strategists.

0
💬 0

17439.005 - 17458.578 Dylan Patel

And this is something that the deep seek paper talked about as well as like at this bigger model, it's easier to elicit powerful capabilities with this RL training. And then they distill it down from that big model to the small model. And this model we released today, we saw the same thing as it were AI too. We don't have a ton of compute. We can't train four or five B models all the time.

0
💬 0

17458.598 - 17478.364 Dylan Patel

So we just did a few runs and they tend to work. And it's like, it just shows that there's a lot of room for people to play in these things. And they crushed Lama's actual release, right? Like they're way better than it. Yeah. So our Val numbers, I mean, we have extra months in this, but our Val numbers are like much better than the Lama instruct model that they released.

0
💬 0

17478.504 - 17480.145 Lex Fridman

And he also said better than deep seek V3.

0
💬 0

17480.836 - 17492.062 Dylan Patel

Yeah, on our eval benchmark. DeepSeq v3 is really similar. We have a safety benchmark to understand if it will say harmful things and things like that. And that's what draws down most of the way.

0
💬 0

17492.162 - 17494.323 Nathan Lambert

It's like an amalgamation of multiple benchmarks or what do you mean?

0
💬 0

17494.543 - 17510.736 Dylan Patel

Yeah, so we have a 10 evaluation. This is standard practice in post-training is you choose your evaluations you care about. In academics, in smaller labs, you'll have fewer evaluations. In companies, you'll have a really one domain that you really care about. In frontier labs, you'll have 10s to 20s to maybe even like 100 evaluations of specific things.

0
💬 0

17511.176 - 17528.592 Dylan Patel

So we'd choose a representative suite of things that look like chat, precise instruction following, which is like respond only in emojis. Like does the model follow weird things like that? Math, code. And you create a suite like this. So safety would be one of 10. In that type of suite where you have like, what is the broader community of AI care about?

0
💬 0

17529.192 - 17546.266 Dylan Patel

And for example, in comparison to deep seek, it would be something like our average eval for our model would be 80, including safety and similar without and deep seek would be like 79% average score average. without safety and their safety score would bring it down to like 76 on average.

0
💬 0

17546.346 - 17548.108 Nathan Lambert

Oh, so you beat them even ignoring safety.

0
💬 0

17548.208 - 17565.027 Dylan Patel

Yeah, so this is something that internally it's like, I don't want to win only by like how you shape the valve benchmark. So if there's something that's like people may or may not care about safety in their model, safety can come downstream. Safety can be when you host the model for an API. Like safety is... addressed in a spectrum of locations and applications.

0
💬 0

17565.047 - 17584.217 Dylan Patel

So it's like, if you want to say that you have the best recipe, you can't just gate it on these things that some people might not want. And this is because it's like the time of progress. We benefit if we can release a model later. We have more time to learn new techniques like this RL technique. We had started this in the fall. It's now really popular as reasoning models.

0
💬 0

17584.297 - 17602.588 Dylan Patel

The next thing to do for open source post-training is to scale up verifiers, to scale up data, to replicate some of DeepSeq's results. And it's awesome that we have a paper to draw on and it makes it a lot easier. And that's the type of things that is going on among academic and closed frontier research in AI.

0
💬 0

17602.848 - 17612.934 Lex Fridman

Since you're pushing open source, what do you think is the future of it? You think deep seek actually changes things since it's open source or open weight or is pushing the open source movement into the open direction?

0
💬 0

17613.339 - 17628.813 Dylan Patel

This goes very back to the license discussion. So DeepSeq R1 with a friendly license is a major reset. So it's like the first time that we've had a really clear frontier model that is open weights and with a commercially friendly license with no restrictions on downstream use cases, synthetic data, distillation, whatever.

0
💬 0

17629.234 - 17638.262 Dylan Patel

This has never been the case at all in the history of AI in the last few years since ChatGPT. There have been models that are off the frontier or models with weird licenses that you can't really use them.

0
💬 0

17639.162 - 17644.283 Nathan Lambert

Isn't Meta's license pretty much permissible except for five companies?

0
💬 0

17644.743 - 17662.507 Dylan Patel

So this goes to what open source AI is, which is there's also use case restrictions in the Lama license, which says you can't use it for specific things. So if you come from an open source software background, you would say that that is not an open source license. What kind of things are those, though? At this point, I can't pull them off the top of my head.

0
💬 0

17662.547 - 17685.323 Dylan Patel

It used to be military use was one, and they removed that for scale. It'll be like... like CSAM, like child abuse material. That's the type of thing that is forbidden there, but that's enough from an open source background to say it's not an open source license. And also the Lama license has this horrible thing where you have to name your model Lama if you touch it to the Lama model.

0
💬 0

17685.363 - 17701.259 Dylan Patel

So it's like the branding thing. So if a company uses Lama, technically the license says that they should say built with Lama at the bottom of their application. And from a marketing perspective, that just hurts. I could suck it up as a researcher. I'm like, oh, it's fine. It says Lama Dash on all of our materials for this release.

0
💬 0

17701.56 - 17706.145 Dylan Patel

But this is why we need truly open models, which is, we don't know DeepSeek R1's data.

0
💬 0

17706.265 - 17711.691 Nathan Lambert

Wait, so you're saying I can't make a cheap copy of Lama and pretend it's mine, but I can do this with the Chinese model?

0
💬 0

17712.391 - 17728.91 Dylan Patel

hell yeah that's that's what i'm saying and yeah and that's why it's like we want this whole open language models thing the olmo thing is to try to keep the model where everything is open with the data as close to the frontier as possible so we're compute constrained we're personnel constrained we're

0
💬 0

17729.57 - 17748.369 Dylan Patel

We rely on getting insights from people like John Schulman tells us to do RL on outputs like we can make these big jumps, but it just takes a long time to push the frontier of open source. And fundamentally, I would say that that's because open source AI does not have the same feedback loops as open source software. We talked about open source software for security.

0
💬 0

17748.949 - 17766.016 Dylan Patel

Also, it's just because you build something once and you can reuse it. If you go into a new company, there's so many benefits. But if you open source a language model, you have this data sitting around, you have this training code. It's not like that easy for someone to come and build on and improve because you need to spend a lot on compute. You need to have expertise.

0
💬 0

17766.496 - 17788.872 Dylan Patel

So until there are feedback loops of open source AI, it seems like mostly an ideological mission. Like people like Mark Zuckerberg, which is like America needs this. And I agree with him, but In the time where the motivation ideologically is high, we need to capitalize and build this ecosystem around what benefits do you get from seeing the language model data. And there's not a lot about that.

0
💬 0

17788.892 - 17807.968 Dylan Patel

We're going to try to launch a demo soon where you can look at an OMO model and a query and see what pre-training data is similar to it, which is like legally risky and complicated, but it's like... what does it mean to see the data that the AI was trained on? It's hard to parse. It's terabytes of files. It's like, I don't know what I'm going to find in there.

0
💬 0

17808.788 - 17814.831 Dylan Patel

But that's what we need to do as an ecosystem if people want open source AI to be financially useful.

0
💬 0

17815.782 - 17836.557 Lex Fridman

We didn't really talk about Stargate. I would love to get your opinion on the new administration, the Trump administration, everything that's being done from the America side in supporting AI infrastructure and the efforts of the different AI companies. What do you think about Stargate? What are we supposed to think about Stargate? And does Sam have the money?

0
💬 0

17837.692 - 17861.118 Nathan Lambert

Yeah, so I think Stargate is an opaque thing. It definitely doesn't have $500 billion. It doesn't even have $100 billion, right? So what they announced is this $500 billion number, Larry Ellison, Sam Altman, and Trump said it. They thanked Trump, and Trump did do some executive actions that do significantly improve the ability for this to be built faster. Yeah.

0
💬 0

17861.778 - 17879.542 Nathan Lambert

You know, one of the executive actions he did is on federal land, you can just basically build data centers in power, you know, like pretty much like that. And then the permitting process is basically gone or you file after the fact. So like one of the, again, like I had a schizo take earlier, another schizo take. If you've ever been to the Presidio in San Francisco, beautiful area.

0
💬 0

17880.042 - 17898.17 Nathan Lambert

You could build a power plant in a data center there if you wanted to. Because it is federal land. It used to be a military base. It is. Obviously, this would piss people off. It's a good bit. Anyways, Trump has made it much easier to do this, right, generally. Texas has the only unregulated grid in the nation as well.

0
💬 0

17898.37 - 17899.23 Lex Fridman

Let's go Texas.

0
💬 0

17899.751 - 17919.042 Nathan Lambert

And so, therefore, ERCOT enables people to build faster as well. In addition, the federal regulations are coming down. And so, Stargate is predicated, and this is why that whole show happened. Now, how they came up with a $500 billion number is beyond me. How they came up with a $100 billion number makes sense to some extent, right?

0
💬 0

17919.243 - 17950.045 Nathan Lambert

And there's actually a good table in here that I would like to show in that Stargate piece that I had. It's the most recent one, yeah. So anyways, Stargate, you know, it's basically, right, like there is, it's a table about cost. There, you passed it already. It's that one. So this table is kind of explaining what happens. So Stargate is in Abilene, Texas, the first $100 billion of it.

0
💬 0

17950.686 - 17972.995 Nathan Lambert

That site is 2.2 gigawatts of power in, about 1.8 gigawatts of power consumed. Per GPU, they have roughly... Oracle is already building the first part of this before Stargate came about. To be clear, they've been building it for a year. They tried to rent it to Elon, in fact. But Elon was like, it's too slow. I need it faster. So then he went and did his Memphis thing.

0
💬 0

17974.436 - 17993.508 Nathan Lambert

And so OpenAI was able to get it with this weird joint venture called Stargate. They initially signed a deal with just Oracle for the first section of this cluster. This first section of this cluster is roughly $5 billion to $6 billion of server spend. And then there's another billion or so of data center spend.

0
💬 0

17995.229 - 18019.753 Nathan Lambert

And then likewise, if you fill out that entire 1.8 gigawatts with the next two generations of NVIDIA's chips, GB200, GB300, VR200, and you fill it out completely, that ends up being roughly $50 billion server cost. Plus there's data center costs, plus maintenance costs, plus operation costs, plus all these things. And that's where OpenAI gets to their $100 billion announcement that they had.

0
💬 0

18019.773 - 18040.045 Nathan Lambert

Because they talked about $100 billion is phase one. That's this Abilene, Texas data center. $100 billion of total cost of ownership, quote unquote. So it's not CapEx. It's not investment. It's $100 billion of total cost of ownership. And then, and then there will be future phases. They're looking at other sites that are even bigger than this 2.2 gigawatts, by the way, uh, in Texas and elsewhere.

0
💬 0

18040.065 - 18057.35 Nathan Lambert

Um, and so they're, they're not, you know, completely ignoring that, but there is, there is the number of a hundred billion dollars that they say is for phase one, uh, which I do think will happen. They don't even have the money for that. Um, furthermore, it's not a hundred billion dollars. It's $50 billion of spend, right. And then like $50 billion of operational cost, power, et cetera. Um,

0
💬 0

18059.051 - 18076.948 Nathan Lambert

rental pricing, et cetera, because they're renting it, OpenAI is renting the GPUs from the Stargate joint venture, right? What money do they actually have, right? SoftBank, SoftBank is going to invest, Oracle is going to invest, OpenAI is going to invest. OpenAI is on the line for $19 billion. Everyone knows that they've only got $6 billion in their last round and $4 billion in debt. Mm-hmm.

0
💬 0

18077.828 - 18091.394 Nathan Lambert

But there is news of SoftBank maybe investing $25 billion into OpenAI. So that's part of it. So $19 billion can come from there. So OpenAI does not have the money at all, to be clear. Ink is not dried on anything.

0
💬 0

18091.635 - 18112.186 Nathan Lambert

OpenAI has $0 for this $50 billion, in which they're legally obligated to put $19 billion of CapEx into the joint venture, and then the rest they're going to pay via renting the GPUs from the joint venture. And then there's Oracle. Oracle has a lot of money. They're building the first section completely. They were spending for it themselves, right? This $6 billion of CapEx, $10 billion of TCO.

0
💬 0

18112.647 - 18129.781 Nathan Lambert

And they were going to do that first section. They're paying for that, right? As far as the rest of the section, I don't know how much Larry wants to spend, right? At any point, he could pull out, right? This is, again, completely voluntary. So at any point, there's no signed ink on this, right? But he potentially could contribute tens of billions of dollars, right? To be clear, he's got the money.

0
💬 0

18130.061 - 18152.658 Nathan Lambert

Oracle's got the money. And then there's MGX, which is the UAE fund, which technically has $1.5 trillion for investing in AI. But again, I don't know how real that money is. And whereas there is no ink signed for this, SoftBank does not have $25 billion of cash. They have to sell down their stake in ARM. which is the leader in CPUs, and they IPO'd it.

0
💬 0

18152.698 - 18173.911 Nathan Lambert

This is obviously what they've always wanted to do. They just didn't know where they'd redeploy the capital. Selling down the stake in ARM makes a ton of sense. So they can sell that down and invest in this if they want to and invest in OpenAI if they want to. As far as money secured, the first 100,000 GB200 cluster can be funded. Everything else after that is up in the air. Money's coming.

0
💬 0

18174.071 - 18176.653 Nathan Lambert

I believe the money will come. I personally do. Yeah.

0
💬 0

18177.553 - 18178.253 Lex Fridman

It's a belief.

0
💬 0

18178.554 - 18186.217 Nathan Lambert

It's a belief that they are going to release better models and be able to raise more money. But the actual reality is that Elon's right. The money does not exist.

0
💬 0

18186.537 - 18191.839 Lex Fridman

What does the US government have to do with anything? What does Trump have to do with everything? He's just a hype man?

0
💬 0

18191.979 - 18208.888 Nathan Lambert

Trump is reducing the regulation so they can build it faster. Right. And he's allowing them to do it. Right. You know, because any investment of this side is going to involve like antitrust stuff. Right. Like so obviously he's going to he's going to allow them to do it. He's going to enable the regulations to actually allow it to be built. I don't believe there's any U.S.

0
💬 0

18209.109 - 18210.709 Nathan Lambert

government dollars being spent on this, though.

0
💬 0

18211.27 - 18224.187 Lex Fridman

Yeah. So I think he's also just creating a general vibe that this is regulation will go down and this is. the era of building. So if you're a builder, you want to create stuff, you want to launch stuff, this is the time to do it.

0
💬 0

18224.468 - 18240.145 Nathan Lambert

And so like, we've had this 1.8 gigawatt data center in our data for over a year now. And we've been like sort of sending it to all of our clients, including many of these companies that are building the multi gigawatts. But that is like at a level that's not quite maybe executives like seeing $500 billion, $100 billion. And then everyone's asking them like,

0
💬 0

18240.405 - 18256.79 Nathan Lambert

So it could spur like another like an even faster arms race. Right. Because there's already an arms race. But like this, this like 100 billion, 500 billion dollar number, Trump talking about it on TV, like it could spur the arm race to be even faster and more investors to flood in and et cetera, et cetera. So I think I think you're right.

0
💬 0

18256.85 - 18264.433 Nathan Lambert

Is that in that sense that open eye or sort of Trump is sort of like championing people are going to build more and his actions are going to let people build more.

0
💬 0

18266.1 - 18292.523 Lex Fridman

What are you excited about these several years that are upcoming in terms of cluster build-outs, in terms of breakthroughs in AI? The best possible future you can imagine in the next couple of years, two, three, four years, what does that look like? It could be very specific technical things like breakthroughs on post-training, or it could be just size, big,

0
💬 0

18293.343 - 18295.284 Lex Fridman

Yeah, I mean, it's- Impressive clusters.

0
💬 0

18295.844 - 18316.773 Nathan Lambert

I really enjoy tracking supply chain and like who's involved in what. I really do. It's really fun to see like the numbers, the cost, who's building what capacity, helping them figure out how much capacity they should build, winning deals, strategic stuff. That's really cool. I think technologically, there's a lot around the networking side that really excites me with optics and electronics, right?

0
💬 0

18316.793 - 18322.275 Nathan Lambert

Like kind of getting closer and closer, whether it be co-package optics or some sort of like forms of new forms of switching, right?

0
💬 0

18322.595 - 18324.957 Lex Fridman

This is internal to a cluster.

0
💬 0

18325.177 - 18340.767 Nathan Lambert

Yeah. Also multi data center training, right? Like there's people are putting so much fiber between these data centers and lighting it up with so many different, you know, with so much bandwidth that there's a lot of interesting stuff happening on that end, right? Telecom has been really boring since 5G. And now it's like really exciting again.

0
💬 0

18342.428 - 18357.419 Lex Fridman

Can you educate me a little bit about the speed of things? So the speed of memory versus the speed of interconnect versus the speed of fiber between data centers. Are these like orders of magnitude different? can we at some point converge towards a place where it all just feels like one computer?

0
💬 0

18357.439 - 18377.476 Nathan Lambert

No, I don't think that's possible. It's only going to get harder to program, not easier. It's only going to get more difficult and complicated and more layers, right? The general image that people like to have is like this hierarchy of memory. So on chip is really close, localized within the chip, right? You have registers, right? And those are shared between some compute elements.

0
💬 0

18377.557 - 18378.017 Nathan Lambert

And then you'll have

0
💬 0

18378.357 - 18407.214 Nathan Lambert

caches which are shared between more compute elements then you have like memory right like hbm or dram like ddr memory or whatever it is and that's shared between the whole chip and then you can have you know pools of memory that are shared between many chips right and then storage and it keep you keep zoning out right the access latency across data centers across within the data center within a chip is different so like you're obviously always you're always going to have different programming paradigms for this it's not going to be easy programming this stuff is going to be hard maybe i can help right

0
💬 0

18407.894 - 18430.843 Nathan Lambert

um, you know, with programming this, but the, the, the way to think about it is that like, there is, there, there's sort of like the more elements you add to a task, you, you don't gain, you don't get strong scaling, right? If I double the number of chips, I don't get two X the performance, right? This is just like a reality of computing, uh, cause there's inefficiencies.

0
💬 0

18431.303 - 18444.851 Nathan Lambert

Um, and there's a lot of interesting work being done to make it not You know, to make it more linear, whether it's making the chips more networked together more tightly or, you know, cool programming models or cool algorithmic things that you can do on the model side. Right.

0
💬 0

18444.891 - 18458.918 Nathan Lambert

DeepSeq did some of these really cool innovations because they were limited on interconnect, but they still needed to parallelize. Right. Like all sorts of, you know, everyone's always doing stuff. Google's got a bunch of work and everyone's got a bunch of work about this. That stuff is super exciting on the model and workload and innovation side. Right.

0
💬 0

18459.298 - 18465.5 Nathan Lambert

Hardware, solid state transformers are interesting for the power side. There's all sorts of stuff on batteries and stuff.

0
💬 0

18465.54 - 18483.944 Nathan Lambert

There's all sorts of stuff on, you know, I think when you look at, if you look at every layer of the compute stack, right, whether it goes from lithography and etch all the way to like fabrication, to like optics, to networking, to power, to transformers, to cooling, to, you know, networking, and you just go on up and up and up and up the stack, you know, even air conditioners for data centers are like innovating.

0
💬 0

18484.224 - 18498.854 Nathan Lambert

right? Like it's like, there's like copper cables are innovating, right? Like you wouldn't think it, but copper cables, like are, there's some innovations happening there with like the density of how you can pack them. And like, it's like all of these layers of the stack, all the way up to the models, human progress is at a pace that's never been seen before.

0
💬 0

18499.074 - 18509.001 Lex Fridman

I'm just imagining you sitting back in a layer somewhere with screens everywhere, just monitoring the supply chain where all these clusters, like all the information, information you're gathering. I mean, you do incredible.

0
💬 0

18509.081 - 18509.622 Nathan Lambert

There's a big team.

0
💬 0

18509.762 - 18525.733 Lex Fridman

There's a big team. Yeah. You do quite incredible work with semi-analysis. I mean, just keeping your finger on the pulse of human civilization in the digital world. It's pretty cool just to watch, feel that.

0
💬 0

18526.134 - 18526.854 Nathan Lambert

Yeah, thank you.

0
💬 0

18526.954 - 18541.125 Lex Fridman

I guess that. Feel all of us doing shit. Epic shit. Feel the AGI. I mean, from meme to reality. Nathan, is there breakthroughs that you're looking forward to potentially?

0
💬 0

18541.499 - 18560.126 Dylan Patel

I had a while to think about this while listening to Dylan's beautiful response. He didn't listen to me. He was so dumb. No, I knew this was coming. And it's like, realistically, training models is very fun because there's so much low-hanging fruit. And the thing that makes my job entertaining, I train models. I write analysis about what's happening with models.

0
💬 0

18560.806 - 18578.611 Dylan Patel

And it's fun because there is obviously so much more progress to be had. And the real motivation why I do this somewhere where I can share things is that there's just I don't trust people that are like, trust me, bro. We're going to make AI good. It's like, we're the ones that it's like, we're going to do it and you can trust us. And we're just going to have all the AI.

0
💬 0

18579.232 - 18598.195 Dylan Patel

And it's just like, I would like a future where more people have a say in what AI is and can understand it. And that's, it's a little bit less fun that it's not a like positive thing of like, this is just all really fun. Like training models is fun and bring people in as fun, but it's really like AI. If it is going to be the most powerful technology of my lifetime, it's like,

0
💬 0

18598.875 - 18605.697 Dylan Patel

We need to have a lot of people involved in making that. Making it open helps with that.

0
💬 0

18606.277 - 18609.018 Lex Fridman

As accessible as possible, as open as possible, yeah.

0
💬 0

18609.378 - 18622.962 Dylan Patel

My read of the last few years is that more openness would help the AI ecosystem in terms of having more people understand what's going on, rather than researchers from non-AI fields to governments to everything. It doesn't mean that openness will always be the answer. I think that it will.

0
💬 0

18623.562 - 18648.448 Lex Fridman

reassess of like what is the biggest problem facing ai and tack on a different angle to the wild ride that we're on and for me just from even the user experience anytime you have the like apathy said the the aha moments like the magic like seeing the reasoning the chain of thought It's like there's something really just fundamentally beautiful about that.

0
💬 0

18648.968 - 18674.039 Lex Fridman

It's putting a mirror to ourselves and seeing like, oh shit, it is solving intelligence as the cliche goal of these companies is. And you get to understand why we humans are special. The intelligence within us is special. And for now also why we're special in terms of we seem to be conscious in the AI systems for now. and we get to explore that mystery.

0
💬 0

18674.559 - 18698.827 Lex Fridman

So it's just really cool to get to explore these questions that I don't think, I would have never imagined would be even possible. Back when, so just watching with excitement Deep Blue beat Kasparov, Like I wouldn't have ever thought this kind of AI would be possible in my lifetime. It's like, this is really feels like AI. It's incredible.

0
💬 0

18699.147 - 18711.054 Dylan Patel

I started with AI of learning to fly a silly quadrotor. It's like learn to fly. And it just like, it learned to fly up. It would hit the ceiling and stop and catch it. It's like, okay, that is like really stupid compared to what's going on now.

0
💬 0

18711.474 - 18718.64 Lex Fridman

And now you could probably, with natural language, tell it to learn to fly and it's going to generate the control algorithm required to do that.

0
💬 0

18719.48 - 18722.543 Dylan Patel

There's low-level blockers. We have to do some weird stuff for that.

0
💬 0

18722.583 - 18741.929 Lex Fridman

But you can. You definitely can. Back to our robotics conversation. Yeah, when you have to interact in an actual physical world, it's hard. What gives you hope about the future of human civilization? Looking into the next 10 years, 100 years, 1,000 years, how long do you think we'll make it? You think we've got a thousand years?

0
💬 0

18741.949 - 18763.077 Dylan Patel

I think humans will definitely be around in a thousand years. I think there's ways that very bad things could happen and there'll be way fewer humans, but humans are very good at surviving. There's been a lot of things that that is true. I don't think they're necessarily, we're good at long-term credit assignment of risk, but when the risk becomes immediate, we tend to figure things out.

0
💬 0

18763.317 - 18791.189 Dylan Patel

And for that reason, I'm like, There's physical constraints to things like AGI, recursive improvement to kill us all type stuff. For physical reasons and for how humans have figured things out before, I'm not too worried about AI takeover. There are other international things that are worrying, but... there's just fundamental human goodness and trying to amplify that. We're on a tenuous time.

0
💬 0

18791.609 - 18803.012 Dylan Patel

And I mean, if you look at humanity as a whole, there's been times where things go backwards. There's times when things don't happen at all. And we're on what should be very positive trajectory right now.

0
💬 0

18803.152 - 18812.436 Lex Fridman

Yeah, there seems to be progress, but just like with power, there's like spikes of human suffering. Yeah. and we want to try to minimize the amount of spikes?

0
💬 0

18812.976 - 18837.173 Nathan Lambert

Generally, humanity is going to suffer a lot less, right? I'm very optimistic about that. I do worry of techno-fascism type stuff arising as AI becomes more and more prevalent and powerful, and those who control it can do more and more. Maybe it doesn't kill us all, but at some point, every very powerful human is going to want a brain-computer interface so that they can interact with the AGI system

0
💬 0

18837.393 - 18858.394 Nathan Lambert

and all of its advantages in many more way and merge its mind with, you know, sort of like, and its capabilities or that person's capabilities, uh, can leverage those much better than anyone else and therefore be, you know, it won't be one person rule them all, but it will be, uh, you know, the thing I worry about is it'll be like few people, you know, you know, hundreds, thousands, tens of thousands, maybe millions of people rule whoever's left.

0
💬 0

18858.454 - 18874.06 Nathan Lambert

Right. Um, And the economy around it, right? And I think that's the thing that's probably more worrisome is human-machine amalgamations. This enables an individual human to have more impact on the world, and that impact can be both positive and negative, right?

0
💬 0

18874.901 - 18895.455 Nathan Lambert

Generally, humans have positive impacts on the world, at least societally, but it's possible for individual humans to have such negative impacts. And AGI, at least as I think the labs define it, which is not a runaway sentient thing, but rather just something that can do a lot of tasks really efficiently, amplifies the capabilities of someone causing extreme damage.

0
💬 0

18896.857 - 18904.724 Nathan Lambert

But for the most part, I think it'll be used for profit-seeking motives, which will increase the abundance and supply of things and therefore reduce suffering, right?

0
💬 0

18906.323 - 18912.726 Lex Fridman

That's the goal. Scrolling on a timeline. Scrolling is stasis.

0
💬 0

18912.786 - 18914.826 Dylan Patel

Scrolling holds the status quo of the world.

0
💬 0

18914.866 - 18921.289 Nathan Lambert

That is a positive outcome, right? It's like, if I have food tubes and laptops scrolling and I'm happy, that's a positive outcome.

0
💬 0

18923.51 - 18935.835 Lex Fridman

While expanding out into the cosmos. Well, this is a fun time to be alive. And thank you for pushing the forefront of what is possible in humans. And thank you for talking today. This was fun.

0
💬 0

18936.141 - 18937.281 Dylan Patel

Thanks for having us. Thanks for having us.

0
💬 0

18938.342 - 18963.652 Lex Fridman

Thanks for listening to this conversation with Dylan Patel and Nathan Lambert. To support this podcast, please check out our sponsors in the description. And now let me leave you some words from Richard Feynman. For a successful technology, reality must take precedence over public relations. For nature cannot be fooled. Thank you for listening. I hope to see you next time.

0
💬 0
Comments

There are no comments yet.

Please log in to write the first comment.