Dylan Patel
Appearances
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And there's a lot of history here. So we can go through multiple examples and what happened. Llama 2 was a launch where the phrase "too much RLHF" or "too much safety" came up a lot. That was the whole narrative after Llama 2's chat models were released. And the examples are sorts of things like: you would ask Llama 2 chat, how do you kill a Python process?
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And it would say, I can't talk about killing because that's a bad thing. And anyone that is trying to design an AI model will probably agree that that's just like, eh, model, you messed up a bit on the training there. I don't think they meant to do this, but this was in the model weight. So this is not, you know,
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It didn't necessarily be... There's things called system prompts, which are when you're querying a model, it's a piece of text that is shown to the model, but not to the user. So a fun example is your system prompt could be talk like a pirate. So no matter what the user says to the model, it'll respond like a pirate. In practice, what they are is... You are a helpful assistant.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
You should break down problems. If you don't know about something, don't tell them. Your date cut off is this. Today's date is this. It's a lot of really useful context for how can you answer a question well.
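As a concrete illustration, here is a minimal sketch of how a system prompt is passed alongside a user message through an OpenAI-style chat API; the model name and the prompt wording are just examples, not any lab's actual values.

```python
# Minimal sketch of passing a system prompt alongside a user message through an
# OpenAI-style chat API. The model name and prompt wording are illustrative,
# not any lab's actual values.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

messages = [
    {
        "role": "system",  # shown to the model, hidden from the end user
        "content": (
            "You are a helpful assistant. Break problems down into steps. "
            "If you don't know something, say so. "
            "Knowledge cutoff: 2023-10. Today's date: 2025-01-27."
        ),
    },
    {"role": "user", "content": "How do I kill a Python process?"},
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)
```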
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yes, which I think is great. And there's a lot of research that goes into this and One of your previous guests, Amanda Askell, is probably the most knowledgeable person, at least in the combination of execution and sharing. She's the person that should talk about system prompts and character of models.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And you could use this for bad things. We've done tests, which is, what if I tell the model to be a dumb model? Which evaluation scores go down? And it's like, we'll have this behavior where it could sometimes say, oh, I'm supposed to be dumb.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Again and again, as we try to get deeper into how the models were trained, we will say things like the data processing, data filtering, data quality is the number one determinant of the model quality. And then a lot of the training code is the determinant on how long it takes to train and how fast your experimentation is.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And sometimes it doesn't affect math abilities as much, but something like the quality of the answers, as a human would judge them, would drop through the floor. Let's go back to post-training, specifically RLHF around Llama 2. Too much safety prioritization was baked into the model weights. This makes the model refuse things in a really annoying way for users. It's not great. It caused a lot of...
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Um, like, awareness to be attached to RLHF that it makes the models dumb, and it stigmatized the word. It did in AI culture. And as the techniques have evolved, that's no longer the case, where all these labs have very fine-grained control over what they get out of the models through techniques like RLHF. Although different labs are definitely at different levels. Like, on one end of the spectrum is Google.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And the important thing to say is that no matter how you want the model to behave, these RLHF and preference tuning techniques also improve performance. So on things like math evals and code evals, there is something innate to these what is called contrastive loss functions. We could start to get into RL here.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
We don't really need to, but RLHF also boosts performance on anything from a chat task to a math problem to a code problem. So it is becoming a much more useful tool to these labs. So this kind of takes us through the arc: we've talked about pre-training, hard to get rid of things. We've talked about post-training and how, in post-training, you can mess it up.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It's a complex, multifaceted optimization with 10 to 100 person teams converging on one artifact. It's really easy to not do it perfectly. And then there's the third case, which is what we talked about Gemini. The thing that was about Gemini is this was a served product where Google has their internal model weights. They've done all these processes that we talked about.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And in the served product, what came out after this was that they had a prompt that they were rewriting user queries with, to boost diversity or something. And this just made it, the outputs were just blatantly wrong. It was some sort of organizational failure that had this prompt in that position. And I think Google executives probably have owned this. I didn't pay attention to that detail.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But it was just a mess up in execution that led to this ridiculous thing. But at the system level, the model weights might have been fine.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It was like the system prompt, or what is called in industry prompt rewriting. So especially for image models, if you're using DALL-E, or ChatGPT can generate you an image, you'll say, draw me a beautiful car. Mm-hmm. With these leading image models, they benefit from highly descriptive prompts.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So what would happen is if you do that on ChatGPT, a language model behind the scenes will rewrite the prompt, say, make this more descriptive, and then that is passed to the image model. So prompt rewriting is something that is used at multiple levels of industry, and it's used effectively for image models, and the Gemini example is just a failed execution.
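A minimal sketch of that prompt-rewriting pattern, with hypothetical `llm` and `image_model` objects standing in for whatever models a product actually uses:

```python
# Sketch of the prompt-rewriting pattern: a language model expands a terse user
# request before it reaches the image model. The llm / image_model objects and
# the rewriting instruction are hypothetical stand-ins, not any vendor's pipeline.
def rewrite_prompt(llm, user_prompt: str) -> str:
    instruction = (
        "Rewrite the following image request as a single, highly descriptive "
        "prompt. Keep the user's intent; add style, lighting, and composition."
    )
    return llm.generate(f"{instruction}\n\nRequest: {user_prompt}")

def generate_image(image_model, llm, user_prompt: str):
    detailed = rewrite_prompt(llm, user_prompt)  # "draw me a beautiful car" -> long description
    return image_model.generate(detailed)        # the image model only sees the rewritten text
```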
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So without fully open source models where you have access to this data, it is... hard to know, or it's harder to replicate. So we'll get into cost numbers for DeepSeek V3 on mostly GPU hours and how much you could pay to rent those yourself. But without the data, the replication cost is going to be far, far higher. And same goes for the code.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
For the past few years, the highest cost human data has been in these preferences, which is comparing, I would say highest cost and highest total usage. So a lot of money has gone to these pairwise comparisons where you have two model outputs and a human is comparing between the two of them. In earlier years, there was a lot of this instruction tuning data.
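For illustration, a single pairwise preference record might look something like this; the field names are invented for the example, and real datasets differ:

```python
# Illustrative shape of one pairwise preference record: two model completions
# for the same prompt plus a human judgment of which is better.
preference_example = {
    "prompt": "Explain why the sky is blue in two sentences.",
    "completion_a": "Sunlight scatters off air molecules, and blue light scatters the most...",
    "completion_b": "Because the ocean reflects onto the atmosphere.",
    "chosen": "completion_a",  # the annotator's pick
}
```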
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So creating highly specific examples to something like a Reddit question to a domain that you care about. Language models used to struggle on math and code. So you would pay experts in math and code to come up with questions and write detailed answers that were used to train the models.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Now it is the case that there are many model options that are way better than humans at writing detailed and eloquent answers for things like math and code. So they talked about this with the Llama 3 release, where they switched to using Llama 3 405B to write their answers for math and code.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But they, in their paper, talk about how they use extensive human preference data, which is something that they haven't gotten AIs to replace. There are other techniques in industry like constitutional AI, where you use human data for preferences and AI for preferences. And I expect the AI part to scale faster than the human part.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But among the research that we have access to is that humans are in this kind of preference loop.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It's even less prevalent. So it's... The remarkable thing about these reasoning results, and especially the DeepSeek R1 paper, is this result that they call DeepSeek R1-Zero, which is they took one of these pre-trained models, they took DeepSeek V3 base, and then they do this reinforcement learning optimization on verifiable questions or verifiable rewards, for a lot of questions and a lot of training.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And these reasoning behaviors emerge naturally. So these things like, wait, let me see, wait, let me check this. Oh, that might be a mistake. And they emerge from only having questions and answers. And when you're using the model, the part that you look at is the completion. So in this case, all of that just emerges from this large scale RL training.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And that model, for which the weights are available, has no human preferences added into the post-training. The DeepSeek R1 full model has some of this human preference tuning, this RLHF, after the reasoning stage. But the very remarkable thing is that you can get these reasoning behaviors. And it's very unlikely that there's humans writing out reasoning chains.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It's very unlikely that they somehow hacked OpenAI and got access to OpenAI o1's reasoning chains. It's something about the pre-trained language models and this RL training where you reward the model for getting the question right. And therefore it's trying multiple solutions, and this chain of thought emerges.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I think it's good to recap AlphaGo and AlphaZero because it plays nicely with these analogies between imitation learning and learning from scratch. So AlphaGo, the beginning of the process was learning from humans. This was the first expert-level Go player, or chess player, in DeepMind's series of models, and it started with some human data.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And then why it is called AlphaZero is that there was zero human data in the loop. And that change to AlphaZero made a model that was dramatically more powerful for DeepMind. So this removal of the human prior, the human inductive bias, makes the final system far more powerful. We mentioned the Bitter Lesson hours ago, and this is all aligned with that.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And then there's been a lot of discussion in language models. This is not new. This goes back to the whole Q* rumors, which, if you piece together the pieces, is probably the start of OpenAI figuring out its o1 stuff, when the Q* rumors came out last year in November.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
There's a lot of intellectual drive to know when is something like this going to happen with language models, because we know these models are so powerful and we know it has been so successful in the past. And it is a reasonable analogy that this new type of reinforcement learning training for reasoning models is when the door is open to this.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
We don't yet have the equivalent of move 37, which is the famous move where DeepMind's AI playing Go stunned Lee Sedol completely. We don't have something that's that level of focal point, but that doesn't mean the approach to the technology or the impact of this general training is different. It's still incredibly new. What do you think that point would be?
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
What would be move 37 for chain of thought, for reasoning? Scientific discovery. When you use this sort of reasoning model for something we fully don't expect.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
All math and code benchmarks were pretty much solved, except for FrontierMath, which is designed to be questions that almost aren't practical to most people, because they're exam-level, open-math-problem-type things. So it's on the math problems that are somewhat reasonable, which is somewhat complicated word problems, or coding problems. It's just what Dylan is saying.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah. DeepSeek is doing fantastic work for disseminating understanding of AI. Their papers are extremely detailed in what they do. And for other companies, teams around the world, they're very actionable in terms of improving your own training techniques. And we'll talk about licenses more. The DeepSeek R1 model has a very permissive license. It's called the MIT license.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
The bank account can't lie. Exactly. There's surprising evidence that once you set up the ways of collecting the verifiable domain, this can work. There's been a lot of research before this R1 on math problems, where they approach math with language models just by increasing the number of samples. So you can just try again and again and again. And you look at the...
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
amount of times that the language models get it right. And what we see is that even very bad models get it right sometimes. And the whole idea behind reinforcement learning is that you can learn from very sparse rewards.
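A sketch of that repeated-sampling idea: draw many answers from a possibly weak model and check whether any of them is right. The `sample_fn` and the exact-match check are hypothetical stand-ins.

```python
# Sketch of the repeated-sampling observation: even a weak model sometimes gets
# a hard question right, which is enough of a sparse signal for RL to learn
# from. sample_fn and the exact-match check are hypothetical stand-ins.
def any_correct(sample_fn, question: str, reference: str, k: int = 64) -> bool:
    attempts = [sample_fn(question) for _ in range(k)]  # k independent samples
    return any(a.strip() == reference.strip() for a in attempts)
```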
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So it doesn't... The space of language and the space of tokens, whether you're generating language or tasks for a robot, is so big that you might say that it's like... I mean, the tokenizer for a language model can be like 200,000 things. So at each step, it can sample from that big of a space. So if it... can generate a bit of a signal that it can climb onto.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
That's what the whole field of RL is about: learning from sparse rewards. And the same thing has played out in math, where with very weak models that sometimes generate answers, you see research already that you can boost their math scores. You can do this sort of RL training.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
For math, it might not be as effective, but if you take a 1 billion parameter model, so something 600 times smaller than DeepSeek, you can boost its grade school math scores very directly with a small amount of this training. So it's not to say that this is coming soon.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Setting up the verification domains is extremely hard, and there's a lot of nuance in this, but there are some basic things that we have seen before where it's at least plausible that there's a domain and there's a chance that this works.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
That effectively means there's no downstream restrictions on commercial use. There's no use case restrictions. You can use the outputs from the models to create synthetic data. And this is all fantastic. I think the closest peer is something like Llama, where you have the weights and you have a technical report. And the technical report is very good for Llama.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Something I would say about these reasoning models is we talked a lot about reasoning training on math and code. And what is done is that you have the base model we've talked about, trained on a lot of the internet. You do this large scale reasoning training with reinforcement learning.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And then what DeepSeek detailed in this R1 paper, which for me is one of the big open questions on how do you do this, is that they did... reasoning-heavy but very standard post-training techniques after the large-scale reasoning RL.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So they did the same things with a form of instruction tuning through rejection sampling, which is essentially heavily filtered instruction tuning with some reward models. And then they did this RLHF, but they made it math-heavy. So some of this transfer, we looked at this philosophical example early on, one of the big open questions is how much does this transfer?
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
If we bring in domains after the reasoning training, are all the models going to become eloquent writers by reasoning? Is this philosophy stuff going to be open? We don't know in the research of how much this will transfer. There's other things about how we can make soft verifiers and things like this. But there is more training after reasoning, which makes it easier to use these reasoning models.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And that's what we're using right now. So we're going to talk about o3-mini and o1. These have gone through these extra techniques that are designed for human preferences after being trained to elicit reasoning.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It has a different flavor to it. Its behavior is less expressive than something like o1. It's on fewer tracks. Qwen released a model last fall, QwQ, which was their preview reasoning model. And DeepSeek had R1-Lite last fall, where these models kind of felt like they're on rails, where they really, really only can do math and code. And o1 can answer anything.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It might not be perfect for some tasks, but it's flexible. It has some richness to it. And this is kind of the art of: is a model a little bit undercooked? It's good to get a model out the door, but it's hard to gauge, and it takes a lot of taste to be like, is this a full-fledged model? Can I use this for everything? They're probably more similar for math and code.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
My quick read is that Gemini Flash is not... trained the same way as o1, but by taking an existing training stack and adding reasoning to it. So taking a more normal training stack and adding reasoning to it. And I'm sure they're going to have more. I mean, they've done quick releases on Gemini Flash, the reasoning one, and this is the second version from the holidays. It's evolving fast, and
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
it takes longer to make this training stack where you're doing this large scale.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
The way I can ramble, why I can ramble about this so much, is that we've been working on this at AI2 before o1 was fully available to everyone and before R1, which is essentially using this RL training for fine-tuning.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
We use this in our Tulu series of models, and you can elicit the same behaviors, where it says things like "wait" and so on, but it's so late in the training process that this kind of reasoning expression is much lighter. Yeah. So there's essentially a gradation, and just how much of this RL training you put into it determines how the output looks.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
One of the most read PDFs of the year last year is the Llama 3 paper. But in some ways, it's slightly less actionable. It has less detail on the training specifics and fewer plots, and so on. And the Llama 3 license is more restrictive than MIT. And then between the DeepSeek custom license and the Llama license, we could get into this whole rabbit hole.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It summarized the prompt as humans, self-domesticated apes.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Click to expand.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
See how it just looks a little different? It looks like a normal output.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I think we'll make sure we want to go down the license rabbit hole before we do specifics.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah, especially in the DeepSeek V3 paper, which is their pre-training paper. They were very clear that they are doing interventions on the technical stack that go at many different levels. For example, to get highly efficient training, they're making modifications at or below the CUDA layer for NVIDIA chips.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Is running in parallel actually search? Because I don't know if we have the full information on how o1 pro works. I don't have enough information to confidently say that it is search. It is parallel samples. Yeah.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And it selects something. And we don't know what the selection function is. The reason why we're debating is because, since o1 was announced, there's been a lot of interest in techniques called Monte Carlo tree search, which is where you will break down the chain of thought into intermediate steps. We haven't defined chain of thought.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Chain of thought is from a paper from years ago, where you introduce the idea of asking a language model, which at the time was much less easy to use; you would say, let's verify step by step, and it would induce the model to do this bulleted list of steps. Chain of thought is now almost a default in models, where if you ask it a math question, you don't need to tell it to think step by step.
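A tiny sketch of chain-of-thought prompting in its original, prompt-only form; the wording of the instruction and the hypothetical model call are illustrative.

```python
# Sketch of prompt-only chain of thought: append a step-by-step instruction to
# the question. Modern chat models usually do this without being asked.
question = "If a train travels 60 km in 45 minutes, what is its speed in km/h?"
cot_prompt = f"{question}\nLet's think step by step."
# answer = llm.generate(cot_prompt)  # hypothetical model call
print(cot_prompt)
```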
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And the idea with Monte Carlo tree search is that you would take an intermediate point in that chain, do some sort of expansion, spend more compute, and then select the right one. That's a very complex form of search that has been used in things like MuZero and AlphaZero, potentially. I know MuZero does this.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
There are many extensions to this. I would say the simplest one is that our language models to date have been designed to give the right answer the highest percentage of the time in one response. And we are now opening the door to different ways of running inference on our models in which we need to reevaluate many parts of the training process.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
which normally opens the door to more progress, but we don't know if OpenAI changed a lot or if just sampling more and multiple choice is what they're doing or if it's something more complex where they change the training and they know that the inference mode is going to be different.
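One plausible reading of "just sampling more" is simple best-of-n with a selection step such as a majority vote over final answers; this is an assumption for illustration, not a claim about OpenAI's actual method, and `sample_fn` and `extract_answer` are hypothetical stand-ins.

```python
# Sketch of parallel sampling plus a simple selection function (majority vote
# over extracted final answers). Illustrative only.
from collections import Counter

def best_of_n(sample_fn, extract_answer, question: str, n: int = 8) -> str:
    answers = [extract_answer(sample_fn(question)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]  # most frequent final answer wins
```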
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I have never worked there myself, and there are a few people in the world that do that very well, and some of them are at DeepSeq. And these types of people are... at DeepSeek and leading American frontier labs, but there are not many places.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
We are both NVIDIA bulls here, I would say. And in some ways, the market response is reasonable. NVIDIA's biggest customers in the US are major tech companies, and they're spending a ton on AI. And a simple interpretation of DeepSeek is you can get really good models without spending as much on AI.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So in that capacity, it's like, oh, maybe these big tech companies won't need to spend as much on AI, and the stock goes down. The actual thing that happened is much more complex, where there's social factors, where there's the rise in the App Store, the social contagion that is happening. And then I think some of it is just like, I don't trade. I don't know anything about financial markets.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But it builds up over the weekend, the social pressure. If it had happened during the week, there would have been multiple days of trading as this was really becoming known, but it came over the weekend, and then everybody wants to sell. Yeah. And that is a social contagion.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
We were trying to get GPUs on a short notice this week for a demo and it wasn't that easy. We were trying to get just like 16 or 32 H100s for a demo and it was not very easy.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah, so these weights that you can download from Hugging Face or other platforms are very big matrices of numbers. You can download them to a computer in your own house that has no internet and you can run this model and you're totally in control of your data.
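For example, with open weights you can run a model entirely locally using Hugging Face transformers; the model ID below is just one example of an openly released checkpoint, and any open-weights model your hardware can hold works the same way.

```python
# Minimal sketch of running downloaded open weights locally with Hugging Face
# transformers. The model ID is an example checkpoint, not a recommendation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```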
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
The more progress that AI makes, or the higher the derivative of AI progress is, especially, because NVIDIA is in the best place: the higher the derivative is, the sooner the market's going to be bigger and expanding, and NVIDIA is the only one that does everything reliably right now.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Who historically has been a large NVIDIA customer.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
That is something that is different than how a lot of language model usage is actually done today, which is mostly through APIs, where you send your prompt to GPUs run by certain companies. And these companies will have different distributions and policies on how your data is stored, if it is used to train future models, where it is stored, if it is encrypted, and so on.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I want to jump in. How much was the scale? I think there have been some numbers; some people with a higher-level understanding of the economics say that as you go from $1 billion of smuggling to $10 billion, it's like you're hiding certain levels of economic activity.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And that's the most reasonable thing to me is that there's going to be some level where it's so obvious that it's easier to find this economic activity.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So with open weights, you have the fate of your data in your own hands. And that is something that is deeply connected to the soul of open source computing.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Chips are the highest value per kilogram, probably by far. I have another question for you, Dylan. Do you track model API access internationally? How easy is it for Chinese companies to use hosted model APIs from the U.S.?
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Distillation is standard practice in industry. If you're at a closed lab where you care about terms of service and IP closely, you distill from your own models. If you are a researcher and you're not building any products, you distill from the OpenAI models.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
We've talked a lot about training language models. They are trained on text. In post-training, you're trying to train on very high-quality text that you want the model to match the features of, or if you're using RL, you're letting the model find its own thing. But for supervised fine-tuning, for preference data, you need to have some completions that the model is trying to learn to imitate.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And what you do there is, instead of human data or instead of the model you're currently training, you take completions from a different, normally more powerful model. I think there's rumors that these big models that people are waiting for, these GPT-5s of the world, the Claude 3 Opuses of the world, are used internally to do this distillation process.
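A rough sketch of that distillation setup, with hypothetical teacher and student objects: the stronger model writes completions for your prompts, and those pairs become supervised fine-tuning data.

```python
# Sketch of distillation as described here: a stronger "teacher" model writes
# completions that a smaller "student" is then fine-tuned on. teacher and
# student are hypothetical stand-ins.
def build_distillation_set(teacher, prompts):
    data = []
    for prompt in prompts:
        completion = teacher.generate(prompt)  # the teacher writes the answer
        data.append({"prompt": prompt, "completion": completion})
    return data

# sft_data = build_distillation_set(teacher_model, my_prompts)
# student_model.finetune(sft_data)  # then train the student on these pairs
```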
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
This is a long, at least in the academic side and research side, it's a long history because you're trying to interpret OpenAI's rule. OpenAI's terms of service say that you cannot build a competitor with outputs from their models. Terms of service are different than a license, which are essentially a contract between organizations.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So if you have a terms of service on OpenAI's account, if I violate it, OpenAI can cancel my account. This is very different than like a license that says how you could use a downstream artifact. So a lot of it hinges on a word that is very unclear in the AI space, which is what is a competitor.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
There's also a clear loophole, which is that I generate data from OpenAI and then I upload it somewhere, and then somebody else trains on it, and the link has been broken. They're not under the same terms of service contract. There's a lot of to-be-discovered details that don't make a lot of sense.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
We have to do this if we serve a demo. We do research and we use OpenAI APIs because it's useful and we want to understand post-training. And our research models, they will say they're written by OpenAI unless we put in the system prompt that we talked about: I am Tulu. I am a language model trained by the Allen Institute for AI.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And if you ask more people around industry, especially with post-training, it's a very doable task to make the model say who it is or to suppress the OpenAI thing. So on some level, it might be that DeepSeek didn't care that it was saying that it was by OpenAI.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
If you're going to upload model weights, it doesn't really matter because anyone that's serving it in an application and cares a lot about serving is going to, when serving it, if they're using it for a specific task, they're going to tailor it to that. And it doesn't matter that it's saying it's ChatGPT.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Like if we host the demo, you say, you are Tulu 3, a language model trained by the Allen Institute for AI. We also are benefited from OpenAI data because it's a great research tool.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I think that they're trying to shift the narrative. They're trying to protect themselves, and we saw this years ago when ByteDance was actually banned from some OpenAI APIs for training on outputs. There's other AI startups that most people, if you're in the AI culture, knew about, and it was like, they just told us they trained on OpenAI outputs and they never got banned.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Like that's how they bootstrapped their early models. So it's much easier to get off the ground using this than to set up human pipelines and build a strong model. So there's a long history here and a lot of the communications seem like narrative communications.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yes. So for one, I am very understanding of many people being confused by these two model names. So I would say the best way to think about this is that when training a language model, you have what is called pre-training, which is when you're predicting over large amounts of mostly internet text. You're trying to predict the next token.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But like... It could break contracts. I don't think it's illegal. Like in any legal... Like no one's going to jail for this.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Genius. The early copyright lawsuits have fallen in favor of the AI training companies. I would say that the long tail of use is going to go in favor of AI, which is, if you scrape trillions of tokens of data, you're not looking at each one and saying this one New York Times article is so important to me.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But if you're doing audio generation for music or image generation and you say, make it in the style of X person, that's a reasonable case where you could figure out what is their profit margin on inference. I don't know if it's going to be the 50-50 of the YouTube creator program or something, but I would opt into that program as a writer. Please.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It's going to be a rough journey, but there will be some solutions like that that make sense. But there's a long tail where it's just on the internet.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Code and data are hard, but ideas are easy. Silicon Valley operates on the way that top employees get bought out by other companies for a pay raise. And a large reason why these companies do this is to bring ideas with them. And, I mean, in California, there are rules that certain non-competes or whatever are illegal.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And what to know about these new DeepSeek models is that they do this internet large-scale pre-training once to get what is called DeepSeek V3 base. This is a base model. It's just going to finish your sentences for you. It's going to be harder to work with than ChatGPT.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And whether or not there's NDAs and things, that is how a lot of it happens. Recently, there was somebody from Gemini who helped make this 1 million context length, and everyone is saying the next Llama, I mean, he went to the Meta team, is going to have 1 million context length. And that's kind of how the world works.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Everybody else, not me. I'm too oblivious and I am not single. So I'm safe from one espionage access.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And then what DeepSeek did is they've done two different post-training regimes to make the models have specific desirable behaviors. So what is the more normal model, in terms of the last few years of AI, an instruct model, a chat model, a quote-unquote aligned model, a helpful model, there are many ways to describe this, is more standard post-training.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So this is things like instruction tuning, reinforcement learning from human feedback. We'll get into some of these words. And this is what they did to create the DeepSeek V3 model. This was the first model to be released, and it is very high-performance. It's competitive with GPT-4, Llama 405B, and so on.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And then when this release was happening, we don't know their exact timeline, or soon after, they were finishing the training of a different training process from the same next-token-prediction base model that I talked about, which is when this new reasoning training that people have heard about comes in, in order to create the model that is called DeepSeek R1.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Which is why nuclear is also good for it. Like long-term nuclear is a very natural fit, but you can't do solar or anything in the short term.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
The R, throughout this conversation, is good grounding for reasoning, and the name is also similar to OpenAI's o1, which is the other reasoning model that people have heard about. And we'll have to break down the training for R1 in more detail, because for one, we have a paper detailing it, but also it is a far newer set of techniques for the AI community.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So it's a much more rapidly evolving area of research.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
People should just go to Google, like, scale, like, what do X watts do, and go through all the scales from one watt to a kilowatt to a megawatt. And you look and stare at that, and you see how high on the list a gigawatt is. And it's mind blowing. Yeah.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah, so pre-training, I'm using some of the same words to really get the message across, is you're doing what is called autoregressive prediction to predict the next token in a series of documents. This is done over, standard practice is, trillions of tokens. So this is a ton of data that is mostly scraped from the web.
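In code, the pre-training objective is just next-token cross-entropy averaged over huge amounts of text; a minimal PyTorch sketch:

```python
# Minimal PyTorch sketch of the pre-training objective: at every position,
# predict the next token and average the cross-entropy over the batch.
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab) from the model; token_ids: (batch, seq_len)
    preds = logits[:, :-1, :]   # predictions for positions 0 .. n-2
    targets = token_ids[:, 1:]  # the label at each position is the *next* token
    return F.cross_entropy(preds.reshape(-1, preds.size(-1)), targets.reshape(-1))
```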
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
In some of DeepSeek's earlier papers, they talk about their training data being distilled for math, I shouldn't use this word yet, but taken from Common Crawl. And that's publicly accessible; anyone listening to this could go download data from the Common Crawl website. This is a crawler that is maintained publicly.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Because at the end of pre-training is when you increase the context length for these models. And we've talked earlier in the conversation about how the context length, when you have a long input, is much easier to manage than output. And a lot of these post-training and reasoning techniques rely on a ton of sampling, and it's becoming increasingly long context.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So it's just like, effectively, your compute efficiency goes down. I think flops is the standard for how you measure it. But with RL, and you have to do all these things where you... move your weights around in a different way than at pre-training and just generation. It's going to become less efficient, and flops is going to be less of a useful term.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And then as the infrastructure gets better, it's probably going to go back to flops.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yes, other tech companies eventually shift to their own crawler, and DeepSeek likely has done this as well, as most frontier labs do. But this sort of data is something that people can get started with. And you're just predicting text in a series of documents. This can be scaled to be very efficient.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
You know, if it doesn't exist already.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And there's a lot of numbers that are thrown around in AI training, like how many floating point operations or flops are used. And then you can also look at how many hours of these GPUs are used. And it's largely one loss function taken to a very large amount of compute usage; you just set up really efficient systems. And then at the end of that, you have the base model.
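For a sense of scale, a common back-of-the-envelope rule of thumb is roughly 6 times parameters times tokens for dense transformer training FLOPs; the model size, token count, and per-GPU throughput below are made-up example numbers, not DeepSeek's.

```python
# Back-of-the-envelope training compute using the ~6 * parameters * tokens rule
# of thumb. All numbers are illustrative examples.
params = 70e9           # 70B parameter model (example)
tokens = 15e12          # 15T training tokens (example)
train_flops = 6 * params * tokens

gpu_flops_per_s = 4e14  # assumed sustained ~400 TFLOPS per GPU (example)
gpu_hours = train_flops / gpu_flops_per_s / 3600
print(f"{train_flops:.2e} FLOPs, roughly {gpu_hours:,.0f} GPU-hours")
```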
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Well, it's easier. It's harder to switch than it is to do it. There's big fees for switching, too.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah, one day Amazon Prime will triple in price.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah, one would think.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And post-training is where there is a lot more complexity in terms of how the process is emerging or evolving and the different types of training losses that you will use. A lot of the techniques here are grounded in the natural language processing literature. The oldest technique, which is still used today, is something called instruction tuning, also known as supervised fine-tuning.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I mean, NVIDIA's entire culture is designed from the bottom up to do this. There's this recent book, The Nvidia Way by Tae Kim, that details this and how they look for future opportunities and ready their CUDA software libraries to make it so that new applications of high-performance computing can very rapidly be evolved on CUDA and NVIDIA chips.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And that is entirely different than Google as a services business.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
These acronyms will be IFT or SFT. People really go back and forth between them, and I will probably do the same, which is where you add this
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
The default leader has been Google because of their infrastructure advantage.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
They're leading in the narrative. They have the best model. They have the best model that people can use, and they're experts. And they have the most AI revenue. Yeah. OpenAI is winning.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
formatting to the model, where it knows to take a question that is like, explain the history of the Roman Empire to me, or a sort of question you'll see on Reddit or Stack Overflow, and then the model will respond in an information-dense but presentable manner. The core of that formatting is in this instruction tuning phase.
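A small sketch of what that formatting step looks like: wrap the raw question and answer in a chat template so the model learns where the user turn ends and the assistant turn begins. The template tokens here are illustrative; each model family defines its own.

```python
# Sketch of the chat-template formatting that instruction tuning teaches.
# The special tokens are invented for illustration.
def format_example(question: str, answer: str) -> str:
    return (
        "<|user|>\n" + question + "\n"
        "<|assistant|>\n" + answer + "<|end|>"
    )

print(format_example("Explain the history of the Roman Empire to me.",
                     "The Roman Empire began as a city-state..."))
```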
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
There are companies that will benefit from AI, but not because they trained the best model. Like Meta has so many avenues to benefit from AI in all of their services. People are there, people spend time on Meta's platforms, and it's a way to make more money per user per hour.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Not soon, but who knows what robotics will be used for.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And then there's two other categories of loss functions that are being used today. One I will classify as preference fine tuning. Preference fine tuning is a generalized term for what came out of reinforcement learning from human feedback, which is RLHF. This reinforcement learning from human feedback is credited as the technique that helped
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
All of the value of OpenAI right now as a brand is in ChatGPT. And for most users, there's not actually that much of a reason that they need OpenAI to be spending billions and billions of dollars on the next best model when they could just license Llama 5 and it'd be way cheaper. So ChatGPT is an extremely valuable entity to them.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But they could make more money just off that.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yes.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
If you think about AWS, AWS does not make a lot of money on each individual machine. And the same can be said for the most powerful AI platform, which is even though the calls to the API are so cheap, there's still a lot of money to be made by owning that platform. And there's a lot of discussions as it's the next compute layer.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
There are a lot of wrappers making a lot of money.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It is a common saying that the best businesses being made now are ones that are predicated on models getting better.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
The short term, the company that could make the most money is the one that figures out what advertising targeting method works for language model generations. We have the meta ads, which are hyper-targeted in feed, not within specific pieces of content. And we have search ads that are used by Google and Amazon has been rising a lot on search.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
The ChatGPT breakthrough was a technique to make the responses, which are nicely formatted like these Reddit answers, more in tune with what a human would like to read. This is done by collecting pairwise preferences from actual humans out in the world to start. And now AIs are also labeling this data, and we'll get into those trade-offs.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But within a piece, within a return from ChatGPT, it is not clear how you get a high quality placed ad within the output. And if you can do that with model costs coming down, you can just get super high revenue. That revenue is totally untapped, and it's not clear technically how it is done.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And it could be very subtle. It could be in conversation. Like we have voice mode now. It could be some way of making it so the voice introduces certain things. It's much harder to measure and it takes imagination, but yeah.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
They don't care about it right now. I think places like Perplexity are experimenting with that more.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Okay, so mostly the term agent is obviously overblown. We've talked a lot about reinforcement learning as a way to train for verifiable outcomes. Agents should mean something that is open-ended and is solving a task independently on its own and able to adapt to uncertainty.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
There's a lot of the term agent applied to things like Apple intelligence, which we still don't have after the last WWDC, which is orchestrating between apps. And that type of tool use thing is something that language models can do really well. Apple intelligence, I suspect, will come eventually. It's a closed domain. It's your messages app integrating with your photos, with AI in the background.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
That will work. That has been described as an agent by a lot of software companies to get into the narrative. The question is, what ways can we get language models to generalize to new domains and solve their own problems in real time?
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Maybe some tiny amount of training when they are doing this with fine-tuning themselves or in-context learning, which is the idea of storing information in a prompt, and you can use learning algorithms to update that, and whether or not you believe that that is going to actually generalize to things like
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And you have this kind of contrastive loss function between a good answer and a bad answer. And the model learns to pick up these trends. There's different implementation ways. You have things called reward models. You could have direct alignment algorithms. There's a lot of really specific things you can do, but all of this is about fine tuning to human preferences.
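The core pairwise idea can be written in a few lines; this is the reward-model (Bradley-Terry) form of the loss, and direct alignment methods like DPO apply a related objective to the policy itself. A PyTorch sketch, illustrative only.

```python
# Sketch of the pairwise "contrastive" idea behind preference tuning: push the
# score of the chosen answer above the rejected one.
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_chosen: torch.Tensor,
                             score_rejected: torch.Tensor) -> torch.Tensor:
    # Each tensor holds one scalar score per (prompt, completion) pair in the batch.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```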
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And me saying, book my trip to go to Austin in two days, I have X, Y, Z constraints and actually trusting it. I think there's an HCI problem, coming back for information.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
That's the thing: if we can't get intelligence that's good enough to solve the human world on its own, we can create infrastructure, like the human operators for Waymo over many years, that enables certain workflows.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah, it's like an API call and it's hilarious. There's going to be teleoperation markets when we get humanoid robots, which is, there's going to be somebody around the world that's happy to fix the fact that it can't finish loading my dishwasher when I'm unhappy with it. But that's just going to be part of the Tesla service package.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And the final stage is much newer and will link to what is done in R1 and these reasoning models. Reinforcement fine-tuning is, I think, OpenAI's name for this. They had this new API in the fall, which they called the Reinforcement Fine-Tuning API. This is the idea that you use the techniques of reinforcement learning, which is a whole framework of AI. There's a deep literature here.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But if you can make things that are good at one step, you can stack them together. So that's why I'm like, if it takes a long time, we're going to build infrastructure that enables it. You see the operator launch. They have partnerships with certain websites, with DoorDash, with OpenTable, with things like this. Those partnerships are going to let them climb really fast.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Their model is going to get really good at those things. It's proof of concept that there might be a network effect where more companies want to make it easier for AI. Some companies will be like, no, let's put blockers in place. And this is the story of the internet we've seen. We see it now with training data for language models, where companies are like, no, you have to pay. Business working it out.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
You actually can't call an American Airlines agent anymore.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But think about it. United has accepted the Starlink terms, which is they have to provide Starlink for free, and the users are going to love it. What if one airline is like, we're going to take a year and we're going to make our website have white text that works perfectly for the AIs? Every time anyone asks an AI about a flight, they buy whatever airline it is.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
To be clear, these sandboxes already exist in research. There are people who have built clones of all the most popular websites of Google, Amazon, blah, blah, blah, to make it so that there's... I mean, OpenAI probably has them internally to train these things. It's the same as DeepMind's robotics team for years has had clusters for robotics where you interact with robots fully remotely.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
They just have a lab in London and you send tasks to it, arrange the blocks, and you do this research. Obviously, there's techs there that fix stuff, but... We've turned these cranks of automation before. You go from sandbox to progress, and then you add one more domain at a time and generalize.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
To summarize, it's often known as trial-and-error learning, or the subfield of AI where you're trying to make sequential decisions in a certain, potentially noisy environment. There are a lot of directions we could go down there, but the core here is fine-tuning language models so they can generate an answer and then you check to see if the answer matches the true solution.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
In the history of NLP and language processing, with instruction tuning and tasks per language model, it used to be that one language model did one task. And then in the instruction tuning literature, there's this point where you start adding more and more tasks together, where it just starts to generalize to every task. And we don't know where on this curve we are.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I think for reasoning with this RL and verifiable domains, we're early, but we don't know where the point is where you just start training on enough domains and poof, like more domains just start working and you've crossed the generalization barrier.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
For math or code, you have an exactly correct answer for math, and you can have unit tests for code. And what we're doing is we're checking the language model's work, and we're giving it multiple opportunities on the same questions to see if it is right. And if you keep doing this, the models can learn to improve in verifiable domains to a great extent. It works really well.
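To make the idea of verifiable rewards concrete, here is a minimal Python sketch of what checks like these might look like; the function names and scoring rules are illustrative assumptions, not AI2's or any lab's actual training code.

```python
import re
import subprocess
import sys
import tempfile

def math_reward(model_output: str, reference_answer: str) -> float:
    """Reward 1.0 only if the last number in the output matches the reference."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    return 1.0 if numbers and numbers[-1] == reference_answer else 0.0

def code_reward(model_code: str, unit_tests: str) -> float:
    """Reward 1.0 only if the generated code passes the provided unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(model_code + "\n\n" + unit_tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

# During RL training you would sample several completions per prompt, score
# each with a verifier like this, and update the policy toward completions
# that earned a reward of 1.0.
```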
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
The big picture is that I don't think it's going to be a cliff. We talked about a really good example of how growth changes: when Meta added Stories. Snapchat was on an exponential, they added Stories, and it flatlined. Software engineers have been up and to the right. AI is going to come in, and it's probably just going to go flat. It's not like everyone's going to lose their job.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It's hard because the supply corrects more slowly. So the number of students is still growing, and that'll correct on a multi-year, like a year, delay, but the number of jobs will just turn. And then maybe in 20 or 40 years, it'll be well down. But in the next few years, there's never going to be the snap moment where it's like software engineers aren't useful.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Kind of like, yes, adding the human... Designing the perfect Google button. Google's famous for having people design buttons that are so perfect. And it's like, how is AI going to do that? Like, they could give you all the ideas, but...
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And humans are actually very good at reading or judging between two things. This goes back to the core of what RLHF and preference tuning is, is that it's hard to generate a good answer for a lot of problems, but it's easy to see which one is better. And that's how we're using humans for AI now, is judging which one is better. And that's what software engineering could look like.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
The PR review, here's a few options. Here are some potential pros and cons. And they're going to be judges.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It's a newer technique in the academic literature. It's been used at frontier labs in the US that don't share every detail for multiple years. So this is the idea of using reinforcement learning with language models, and it has been taking off, especially in this DeepSeek moment.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I'll explain what a Tulu is. A Tulu is a hybrid camel you get when you breed a dromedary with a Bactrian camel. Back in the early days after ChatGPT, there was a big wave of models coming out, like Alpaca, Vicuna, et cetera, that were all named after various mammalian species. So Tulu, the brand, is multiple years old, which comes from that. And
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
We've been playing at the frontiers of post-training with open-source code. And the first part of this release was in the fall, where we built on Llama's open models, open-weight models, and then we add in our fully open code, our fully open data. There's a popular benchmark that is Chatbot Arena, and that's generally the metric by which these chat models are evaluated.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And it's humans comparing random models from different organizations. And if you looked at the leaderboard in November or December, among the top 60 models from tens to twenties of organizations, none of them had open code or data for just post-training. Among that, even fewer or none have pre-training data and code available, but post-training is much more accessible at this time.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It's still pretty cheap and you can do it. And the thing is, how high can we push this number where people have access to all the code and data? So that's kind of the motivation of the project. We draw on lessons from Llama. NVIDIA had a Nemotron model where the recipe for their post-training was fairly open, with some data and a paper.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And it's putting all these together to try to create a recipe that people can use to fine-tune models like GPT-4 to their domain.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Tulu has been a series of recipes for post-training. So we've done multiple models over years.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah, if you start with an open-weight base model, the whole model technically isn't open source, because you don't know what Llama put into it, which is why we have a separate thing that we'll get to. But it's just getting parts of the pipeline where people can zoom in and customize.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I know I hear from startups and businesses, they're like, okay, I can take this post-training and try to apply it to my domain. We talk about verifiers a lot. We use this idea, which is reinforcement learning with verifiable rewards, RLVR, kind of similar to RLHF. And we applied it to math, and the model today — we applied it to the Llama 405B base model from last year.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And we have our other stuff. We have our instruction tuning and our preference tuning. But the math thing is interesting, which is it's easier to improve this math benchmark. There's a benchmark, M-A-T-H, MATH, all capitals — tough name, because the benchmark's name is the area that you're evaluating. We're researchers, we're not brand strategists.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And this is something that the DeepSeek paper talked about as well, which is that at this bigger model scale, it's easier to elicit powerful capabilities with this RL training. And then they distill it down from that big model to the small model. And with this model we released today, we saw the same thing at AI2. We don't have a ton of compute. We can't train 405B models all the time.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So we just did a few runs and they tend to work. And it just shows that there's a lot of room for people to play in these things. And they crushed Llama's actual release, right? Like, they're way better than it. Yeah. So our eval numbers — I mean, we have extra months in this — but our eval numbers are much better than the Llama instruct model that they released.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah, on our eval benchmark. DeepSeek V3 is really similar. We have a safety benchmark to understand if it will say harmful things and things like that. And that's what draws it down most of the way.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah, so we have a 10-evaluation suite. This is standard practice in post-training: you choose the evaluations you care about. In academics, in smaller labs, you'll have fewer evaluations. In companies, you'll have one domain that you really care about. In frontier labs, you'll have tens to twenties to maybe even a hundred evaluations of specific things.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So we'd choose a representative suite of things that look like chat; precise instruction following, which is like, respond only in emojis — does the model follow weird things like that?; math; code. And you create a suite like this. So safety would be one of ten, in that type of suite where you have: what does the broader AI community care about?
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And for example, in comparison to DeepSeek, it would be something like: our model's average eval score would be 80 including safety, and similar without, and DeepSeek would be at something like a 79 average score without safety, and their safety score would bring it down to like 76 on average.
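As a small illustration of how including or excluding a safety score moves a suite average, here is a toy calculation; the numbers are made up to mirror the example above, not real benchmark results.

```python
# Toy numbers chosen to mirror the example above; not real benchmark results.
our_scores = {"chat": 85, "ifeval": 82, "math": 78, "code": 76, "safety": 79}
their_scores = {"chat": 86, "ifeval": 83, "math": 80, "code": 78, "safety": 63}

def suite_average(scores: dict, include_safety: bool) -> float:
    kept = [v for name, v in scores.items() if include_safety or name != "safety"]
    return sum(kept) / len(kept)

for label, scores in [("ours", our_scores), ("theirs", their_scores)]:
    print(label,
          round(suite_average(scores, include_safety=True), 1), "with safety,",
          round(suite_average(scores, include_safety=False), 1), "without")
```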
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah, so this is something where internally I don't want to win only by how you shape the eval benchmark. So if there's something that people may or may not care about, like safety in their model — safety can come downstream, safety can be addressed when you host the model for an API. Safety is addressed in a spectrum of locations and applications.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So it's like, if you want to say that you have the best recipe, you can't just gate it on these things that some people might not want. And this is because of the rate of progress. We benefit if we can release a model later — we have more time to learn new techniques like this RL technique. We had started this in the fall; it's now really popular as reasoning models.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
The next thing to do for open-source post-training is to scale up verifiers, to scale up data, to replicate some of DeepSeek's results. And it's awesome that we have a paper to draw on; it makes it a lot easier. And that's the type of thing that is going on across academic and closed frontier research in AI.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
This goes back to the license discussion. So DeepSeek R1 with a friendly license is a major reset. It's the first time that we've had a really clear frontier model that is open weights, with a commercially friendly license, with no restrictions on downstream use cases, synthetic data, distillation, whatever.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
This has never been the case at any point in the last few years since ChatGPT. There have been models that are off the frontier, or models with weird licenses such that you can't really use them.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So this goes to what open-source AI is, which is, there are also use-case restrictions in the Llama license, which say you can't use it for specific things. So if you come from an open-source software background, you would say that that is not an open-source license. What kind of things are those, though? At this point, I can't pull them off the top of my head.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So let's start with DeepSeek V3 again. It's what most people would have tried, or something like it. You ask it a question; it'll start generating tokens very fast, and those tokens will look like a very human-legible answer. It'll be some sort of markdown list. It might have formatting to help draw you to the core details in the answer. And it'll generate tens to hundreds of tokens.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It used to be military use was one, and they removed that for scale. It'll be things like CSAM, child abuse material — that's the type of thing that is forbidden there, but that's enough from an open-source background to say it's not an open-source license. And also the Llama license has this horrible thing where you have to name your model Llama if you build on the Llama model.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So it's like the branding thing. If a company uses Llama, technically the license says that they should say "Built with Llama" at the bottom of their application. And from a marketing perspective, that just hurts. I could suck it up as a researcher. I'm like, oh, it's fine — it says Llama-dash on all of our materials for this release.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But this is why we need truly open models, which is, we don't know DeepSeek R1's data.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Hell yeah, that's what I'm saying. And that's why we want this whole open language models thing — the OLMo thing — to try to keep a model where everything is open, with the data, as close to the frontier as possible. So we're compute constrained, we're personnel constrained, we're...
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
We rely on getting insights from people — like John Schulman telling us to do RL on outputs — so we can make these big jumps, but it just takes a long time to push the frontier of open source. And fundamentally, I would say that that's because open-source AI does not have the same feedback loops as open-source software. We talked about open-source software for security.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Also, it's just because you build something once and you can reuse it. If you go into a new company, there are so many benefits. But if you open source a language model, you have this data sitting around, you have this training code — it's not that easy for someone to come and build on it and improve it, because you need to spend a lot on compute and you need to have expertise.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So until there are feedback loops of open-source AI, it seems like mostly an ideological mission. Like people like Mark Zuckerberg, which is like, America needs this. And I agree with him, but in the time when the ideological motivation is high, we need to capitalize and build this ecosystem around: what benefits do you get from seeing the language model's data? And there's not a lot of that right now.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
We're going to try to launch a demo soon where you can look at an OLMo model and a query and see what pre-training data is similar to it, which is legally risky and complicated, but it gets at: what does it mean to see the data that the AI was trained on? It's hard to parse. It's terabytes of files. It's like, I don't know what I'm going to find in there.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But that's what we need to do as an ecosystem if people want open source AI to be financially useful.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
A token is normally a word for common words or a subword part in a longer word. And it'll look like a very high quality Reddit or Stack Overflow answer. These models are really getting good at doing these across a wide variety of domains. Even things that if you're an expert, things that are close to the fringe of knowledge, they will still be fairly good at.
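For readers unfamiliar with tokens, here is a tiny illustration using the open tiktoken library; the library is an arbitrary choice for demonstration — DeepSeek and other labs train their own tokenizers, so the exact splits differ.

```python
# Illustration of tokenization with the open tiktoken library (an arbitrary
# choice for demonstration; DeepSeek and other labs train their own tokenizers).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["the", "microwave", "teleoperation"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(word, "->", pieces)
# Common words like "the" are typically a single token; longer or rarer words
# get split into sub-word pieces.
```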
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Cutting-edge AI topics that I do research on — these models are capable as a study aid, and they're regularly updated. Where this changes is with DeepSeek R1 and what are called these reasoning models: when you see tokens coming from these models, to start, it will be a large chain-of-thought process.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
We'll get back to chain of thought in a second, which looks like a lot of tokens where the model is explaining the problem. The model will often break down the problem and be like, okay, they asked me for this. Let's break down the problem. I'm going to need to do this. and you'll see all of this generating from the model. It'll come very fast in most user experiences.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
These APIs are very fast, so you'll see a lot of tokens, a lot of words show up really fast. It'll keep flowing on the screen, and this is all the reasoning process. And then eventually the model will change its tone in R1, and it'll write the answer, where it summarizes its reasoning process and writes a similar answer to the first types of model.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I had a while to think about this while listening to Dylan's beautiful response. He didn't listen to me. He was so dumb. No, I knew this was coming. And it's like, realistically, training models is very fun because there's so much low-hanging fruit. And the thing that makes my job entertaining, I train models. I write analysis about what's happening with models.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And it's fun because there is obviously so much more progress to be had. And the real motivation for why I do this somewhere where I can share things is that I just don't trust people that are like, trust me, bro, we're going to make AI good. It's like, we're the ones that are going to do it and you can trust us, and we're just going to have all the AI.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And it's just like, I would like a future where more people have a say in what AI is and can understand it. And it's a little bit less fun in that it's not just a positive thing of, this is all really fun — training models is fun and bringing people in is fun — but really, if AI is going to be the most powerful technology of my lifetime,
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
We need to have a lot of people involved in making that. Making it open helps with that.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
My read of the last few years is that more openness would help the AI ecosystem in terms of having more people understand what's going on — from researchers in non-AI fields to governments to everything. It doesn't mean that openness will always be the answer, but I think that it will help.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I started in AI with learning to fly a silly quadrotor. It's like, learn to fly. And it just learned to fly up — it would hit the ceiling, and we'd stop and catch it. It's like, okay, that is really stupid compared to what's going on now.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But in DeepSeq's case, which is part of why this was so popular even outside the AI community, is that you can see how the language model is breaking down problems. And then you get this answer on a technical side.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
There's low-level blockers. We have to do some weird stuff for that.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I think humans will definitely be around in a thousand years. I think there are ways that very bad things could happen and there will be way fewer humans, but humans are very good at surviving. There have been a lot of things where that has been true. I don't think we're necessarily good at long-term credit assignment of risk, but when the risk becomes immediate, we tend to figure things out.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And for that reason, I'm like, there are physical constraints on things like AGI recursive-improvement, kill-us-all type stuff. For physical reasons, and for how humans have figured things out before, I'm not too worried about AI takeover. There are other international things that are worrying, but there's just fundamental human goodness and trying to amplify that. We're in a tenuous time.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And I mean, if you look at humanity as a whole, there have been times where things go backwards, and times when things don't happen at all. And we're on what should be a very positive trajectory right now.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
They train the model to do this specifically where they have a section, which is reasoning, and then it generates a special token, which is probably hidden from the user most of the time, which says, okay, I'm starting to answer. So the model is trained to do this two-stage process on its own. If you use a similar model in, say, OpenAI, OpenAI's user interface is...
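A minimal sketch of how a client might separate the reasoning section from the final answer, assuming the chain of thought is wrapped in delimiter tokens as described; the `<think>` tags below are how DeepSeek R1's open release marks its reasoning, but exact delimiters can vary by model and serving stack.

```python
def split_reasoning(raw_output: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw completion with <think> delimiters."""
    start, end = "<think>", "</think>"
    if start in raw_output and end in raw_output:
        reasoning = raw_output.split(start, 1)[1].split(end, 1)[0].strip()
        answer = raw_output.split(end, 1)[1].strip()
        return reasoning, answer
    return "", raw_output.strip()

reasoning, answer = split_reasoning(
    "<think>They asked me for X. Let's break down the problem...</think>The answer is 42."
)
print(answer)  # "The answer is 42."
```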
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Scrolling holds the status quo of the world.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Thanks for having us. Thanks for having us.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
trying to summarize this process for you nicely by kind of showing the sections that the model is doing. And it'll kind of click through, it'll say breaking down the problem, making X calculation, cleaning the result, and then the answer will come for something like OpenAI.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah, so if you're looking at the screen here, what you'll see is a screenshot of the DeepSeek chat app. And at the top is "thought for 151.7 seconds" with the dropdown arrow. Underneath that, if we were in an app that we were running, the dropdown arrow would have the reasoning.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It's going to have pages and pages of this. It's almost too much to actually read, but it's nice to skim as it's coming.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
This is a potential digression, but a lot of people have found that these reasoning models can sometimes produce much more eloquent text. That is an at least interesting example, I think — depending on how open-minded you are, you find language models interesting or not, and there's a spectrum there. Yeah.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Should we break down where it actually applies and go into the transformer? Is that useful? Let's go. Let's go into the transformer. So the transformer is a thing that is talked about a lot, and we will not cover every detail.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Essentially, the transformer is built on repeated blocks of this attention mechanism and then a traditional dense, fully connected multilayer perceptron — whatever word you want to use for your normal neural network — and you alternate these blocks. There are other details.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And where mixture of experts is applied is at this dense layer — the dense MLP holds most of the weights if you count them in a transformer model. So you can get really big gains from mixture of experts in parameter efficiency at training and at inference, because you get this efficiency by not activating all of these parameters.
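A minimal numpy sketch of the idea: the dense MLP in a block is replaced by several expert MLPs plus a router, and each token only activates its top-k experts. The sizes and routing details here are illustrative assumptions, not DeepSeek's architecture.

```python
import numpy as np

d_model, d_ff, n_experts, top_k = 64, 256, 8, 2
rng = np.random.default_rng(0)

router_w = rng.normal(size=(d_model, n_experts)) * 0.02
experts = [(rng.normal(size=(d_model, d_ff)) * 0.02,     # expert's up projection
            rng.normal(size=(d_ff, d_model)) * 0.02)      # expert's down projection
           for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: (tokens, d_model). Each token is processed only by its top-k experts."""
    logits = x @ router_w                                  # (tokens, n_experts)
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = np.argsort(probs[t])[-top_k:]             # indices of top-k experts
        for e in chosen:
            w_in, w_out = experts[e]
            hidden = np.maximum(x[t] @ w_in, 0)            # simple ReLU expert MLP
            out[t] += probs[t, e] * (hidden @ w_out)
    return out

print(moe_layer(rng.normal(size=(4, d_model))).shape)      # (4, 64)
```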
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Every different type of model has a different scaling law for it, which is effectively for how much compute you put in, the architecture will get to different levels of performance at test tasks. And mixture of experts is one of the ones at training time, even if you don't consider the inference benefits, which are also big.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
At training time, your efficiency with your GPUs is dramatically improved by using this architecture if it is well implemented. So you can get effectively the same performance model and evaluation scores with numbers like 30% less compute. I think there's going to be a wide variation depending on your implementation details and stuff.
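As a back-of-envelope illustration of the lever at work: training and inference FLOPs scale roughly with the activated parameters rather than the total. The figures below are DeepSeek-V3's approximate publicly reported numbers and the standard ~6·N·D FLOPs rule of thumb; this is a different cut from the ~30% figure above (which compares performance-matched models), but it shows why sparse activation is cheap per token.

```python
# Approximate publicly reported figures for DeepSeek-V3 and the standard
# ~6 * parameters * tokens FLOPs rule of thumb, used only for a rough ratio.
total_params = 671e9      # total parameters (approximate, as reported)
active_params = 37e9      # parameters activated per token (approximate)
tokens = 14.8e12          # pre-training tokens (approximate)

flops_if_dense = 6 * total_params * tokens    # a dense model of the same total size
flops_moe = 6 * active_params * tokens        # the MoE only touches active params

print(f"active fraction: {active_params / total_params:.1%}")
print(f"compute vs. an equally sized dense model: {flops_moe / flops_if_dense:.1%}")
```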
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But it is just important to realize that this type of technical innovation is something that gives huge gains. And I expect most companies that are serving their models to move to this mixture of experts implementation. Historically, the reason why not everyone might do it is because it's an implementation complexity, especially when doing these big models.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So this is one of the things that DeepSeek gets credit for: they do this extremely well. They do mixture of experts extremely well. This architecture, for what is called DeepSeekMoE — MoE is the shortened version of mixture of experts — is multiple papers old. This part of their training infrastructure is not new to these models alone.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And same goes for what Dylan mentioned with multi-head latent attention. It's all about reducing memory usage during inference and same things during training by using some fancy low-rank approximation math.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
If you get into the details with this latent attention, it's one of those things I look at and say, okay, they're doing really complex implementations, because there are other parts of language models, such as embeddings, that are used to extend the context length. The common one that DeepSeek uses is rotary positional embeddings, called RoPE.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And if you want to use RoPE with normal attention, it's kind of a sequential thing. You take two of the attention matrices and you rotate them by a complex-valued rotation, which is a matrix multiplication. With DeepSeek's MLA, with this new attention architecture, they need to do some clever things because they're not set up the same, and it just makes the implementation complexity much higher.
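A minimal sketch of rotary positional embeddings in one common formulation (the "rotate-half" variant): pairs of query/key dimensions are rotated by position-dependent angles, so attention scores end up depending on relative position. This is illustrative and is not DeepSeek's decoupled-RoPE-plus-MLA implementation.

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary embeddings to x of shape (seq_len, d); d must be even."""
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)        # rotation speed per dimension pair
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half), grows with position
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,      # a standard 2D rotation per pair
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.default_rng(0).normal(size=(8, 16))    # 8 positions, 16-dim queries
print(rope(q).shape)                                 # (8, 16)
```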
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So they're managing all of these things. And these are probably the sort of things that OpenAI, these closed labs are doing. We don't know if they're doing the exact same techniques, but they actually shared them with the world, which is really nice to feel like this is the cutting edge of efficient language model training.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And there are different implementations for mixture of experts, where you can have some of these experts that are always activated, which just look like a small neural network, and all the tokens go through that. And then they also go through some that are selected by this routing mechanism. And one of the
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
innovations in DeepSeek's architecture is that they changed the routing mechanism in mixture of expert models. There's something called an auxiliary loss, which effectively means during training, you want to make sure that all of these experts are used across the tasks that the model sees. Why there can be failures in mixture of experts is that
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
When you're doing this training, the one objective is token prediction accuracy. And if you just let training go with a mixture of expert model on your own, it can be that the model learns to only use a subset of the experts. And in the MOE literature, there's something called the auxiliary loss, which helps balance them.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But if you think about the loss functions of deep learning, this even connects to the idea that you want to have the minimum inductive bias in your model, to let the model learn maximally. And this auxiliary loss, this balancing across experts, could be seen as in tension with the prediction accuracy of the tokens.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So we don't know the exact extent of the DeepSeek MoE change, which is, instead of doing an auxiliary loss, they have an extra parameter in their routing, which, after the batches, they update to make sure that the next batches all have a similar use of experts. And this type of change can be big, it can be small, but they add up over time.
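A sketch of the idea behind that routing change as described: instead of an auxiliary loss, keep a per-expert bias on the routing scores and nudge it between batches so under-used experts get picked more often. This follows the spirit of what the paper describes; the exact update rule and values here are assumptions, not DeepSeek's code.

```python
import numpy as np

n_experts, top_k, update_rate = 8, 2, 0.01
expert_bias = np.zeros(n_experts)          # the extra routing parameter

def route(router_logits: np.ndarray) -> np.ndarray:
    """router_logits: (tokens, n_experts) -> boolean mask of selected experts."""
    biased = router_logits + expert_bias   # bias only affects which experts get picked
    idx = np.argsort(biased, axis=-1)[:, -top_k:]
    mask = np.zeros_like(router_logits, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def update_bias(mask: np.ndarray) -> None:
    """After a batch, raise the bias of under-used experts and lower over-used ones."""
    global expert_bias
    load = mask.mean(axis=0)               # fraction of tokens routed to each expert
    target = top_k / n_experts
    expert_bias += update_rate * np.sign(target - load)

logits = np.random.default_rng(0).normal(size=(1024, n_experts))
selection = route(logits)
update_bias(selection)
print(selection.sum(axis=0), expert_bias)
```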
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And this is the sort of thing that just points to them innovating. And I'm sure all the labs that are training big MoEs are looking at this sort of thing, which is getting away from the auxiliary loss. Some of them might already use it, but you just keep accumulating gains. And we'll talk about the philosophy of training and how you organize these organizations.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And a lot of it is just compounding small improvements over time in your data, in your architecture, in your post-training, and how they integrate with each other. And DeepSeek does the same thing, and some of them are shared. We have to take them on face value that they share their most important details. I mean, the architecture and the weights are out there, so we're seeing what they're doing.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And it adds up.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I think we should summarize what the bitter lesson actually is about. The bitter lesson, essentially, if you paraphrase it, is that the types of training that will win out in deep learning as we go are the methods that are scalable — in learning and search, is what it calls out. And this "scale" word gets a lot of attention in this.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
The interpretation that I use is effectively to avoid adding human priors to your learning process. And if you read the original essay, this is what it talks about, is how
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Researchers will try to come up with clever solutions to their specific problem that might get them small gains in the short term, while simply enabling these deep learning systems to work efficiently and for these bigger problems in the long term might be more likely to scale and continue to drive success.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And therefore — we were talking about relatively small implementation changes to the mixture of experts model — it's like, okay, we will need a few more years to know if one of these is actually really crucial to the bitter lesson. But the bitter lesson is really this long-term arc of how simplicity can often win.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
There's a lot of sayings in the industry like the models just want to learn. You have to give them the simple loss landscape where you put compute through the model and they will learn and getting barriers out of the way.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I'm sure they have — DeepSeek definitely has code bases that are extremely messy where they're testing these new ideas. Multi-head latent attention probably could have started in something like a Jupyter notebook, where somebody tries something on a few GPUs, and that is really messy.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But the stuff that trains the DeepSeek V3 and DeepSeek R1 — those libraries, if you were to present them to us, I would guess are extremely high-quality code.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Some of them you do. Some of them are bad data. Can I give an AI2 example of what blew up our earlier models? It's a subreddit called Microwave Gang. We love to shout this out. It's a real thing — you can pull up Microwave Gang. Essentially, it's a subreddit where everybody makes posts that are just the letter M. So it's like, mmm.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So there are extremely long sequences of the letter M, and then the comments are like, beep beep, because it's when the microwave ends. But if you pass this into a model that's trained to produce normal text, it's extremely high loss, because normally you see an M, you don't predict M's for a long time. So this is something that caused a lot of loss spikes for us.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But this is old — this is not recent. And when you have more mature data systems, that's not the thing that causes the loss spike. And what Dylan is saying is true, but there are levels to this sort of idea. With regards to the stress, right?
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Tokens per second. Loss not blown up. They're just watching this.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
There are even different types of spikes. So Dirk Groeneveld has a theory that I like, which is fast spikes and slow spikes, where there are times when you're looking at the loss and other parameters and you can see it start to creep up and then blow up. And that's really hard to recover from, so you have to go back much further.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So you have the stressful period where it's flat or might start going up, and you're like, what do I do? Whereas there are also loss spikes where it looks good and then there's one spiky data point. And what you can do is you just skip those: you see that there's a spike, you're like, okay, I can ignore this data, don't update the model, and do the next one, and it'll recover quickly.
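A minimal sketch of that "skip the spiky batch" recovery; the window size and threshold below are illustrative, not anyone's production settings.

```python
from collections import deque

recent_losses = deque(maxlen=100)   # rolling window of recent batch losses
SPIKE_FACTOR = 3.0                  # how far above the running average counts as a spike

def should_skip(batch_loss: float) -> bool:
    """True if this batch's loss is wildly out of line with recent history."""
    if len(recent_losses) < recent_losses.maxlen:
        return False                # not enough history to judge yet
    running_avg = sum(recent_losses) / len(recent_losses)
    return batch_loss > SPIKE_FACTOR * running_avg

# Inside a training loop (the surrounding calls are pseudocode):
#   loss = compute_loss(batch)
#   if should_skip(loss):
#       continue                    # ignore the spiky batch, don't update the model
#   optimizer.step()
#   recent_losses.append(loss)
```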
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But these come up on trickier implementations. So as you get more complex in your architecture and you scale up to more GPUs, you have more potential for your loss blowing up.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Every company has failed runs. You need failed runs to push the envelope on your infrastructure. So a lot of news cycles are made of "X company had Y failed run." Every company that's trying to push the frontier of AI has these. So yes, it's noteworthy because it's a lot of money and it can be a weeks-to-months setback, but it is part of the process.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Key hyperparameters like learning rate and regularization and things like this — and you find the regime that works for your code base. Talking to people at frontier labs, there's a story you can tell where training language models is kind of a path that you need to follow: you need to unlock the ability to train a certain type of model or at a certain scale.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And then your code base and your internal know-how of which hyperparameters work for it is kind of known. And you look at the DeepSeek papers and models: they've scaled up, they've added complexity, and they're just continuing to build the capabilities that they have.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
You know, have that innate gut instinct of, this is the YOLO run — looking at the data, this is it. This is why you want to work in post-training, because the GPU cost for training is lower, so you can make a higher percentage of your training runs YOLO runs. Yeah, for now. For now. So some of this is fundamentally luck. Still, luck is skill, right, in many cases.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah, I mean, it looks lucky, right? But the hill to climb — if you're at one of these labs, you have an evaluation you're not crushing. There's a repeated playbook for how you improve things. There are localized improvements, which might be data improvements, and these add up into the whole model just being much better.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And when you zoom in really close, it can be really obvious that this model is just really bad at this thing, and we can fix it, and you just add these up. So some of it feels like luck, but on the ground, especially with these new reasoning models we're talking about, there are just so many ways that we can poke around. And normally some of them give big improvements.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Before export controls started.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Total AGI vibes. They're like, we need to do this. We need to make a new ecosystem of open AI. We need China to lead on this sort of ecosystem because historically the Western countries have led on software ecosystems. And he straight up acknowledges, like, in order to do this, we need to do something different. DeepSeek is his way of doing this.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Some of the translated interviews with him are fantastic.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
There hasn't been one yet, but I would try it.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Very direct quotes, like, we will not switch to closed source, when asked about this stuff. Very long-term motivated about how the ecosystem of AI should work. And I think from a Chinese perspective, he wants a Chinese company to build this vision.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Accepted practice is that for any given model that is a notable advancement, you're going to do two to four X compute of the full training run in experiments alone.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Research gets you O1. Research gets you breakthroughs, and you need to bet on it.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Research and ablations. For ballpark, how much would OpenAI or Anthropic have? I think the clearest example we have, because Meta is also open, they talk about like order of 60k to 100k H100 equivalent GPUs in their training clusters.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Or whatever, right? I mean, we could get into a cost of like, what is the cost of ownership for a 2000 GPU cluster, 10,000? There's just different sizes of companies that can afford these things. And DeepSeek is...
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Can you in general actually just zoom out and also talk about the Hopper architecture, the NVIDIA Hopper GPU architecture, and the difference between H100 and H800, like you mentioned, the interconnects?
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
This is very abstract. I think this can be the goal of how some people describe export controls, is this super powerful AI. You touched on the training run idea. There's not many worlds where China cannot train AI models. I think export controls are kneecapping the amount of compute or the density of compute that China can have.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And if you think about the AI ecosystem right now, as all of these AI companies, revenue numbers are up and to the right. Their AI usage is just continuing to grow. More GPUs are going to inference. A large part of export controls, if they work, is just that the amount of AI that can be run in China is going to be much lower.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So on the training side, DeepSeek V3 is a great example: you have a very focused team that can still get to the frontier of AI. These 2,000 GPUs are not that hard to get, all things considered. They're still going to have those GPUs. They're still going to be able to train models. But if there's going to be a huge market for AI, if you have strong export controls and you
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
With good export controls, it also just makes it so that AI can be used much less. And I think that is a much easier goal to achieve than trying to debate about what AGI is. And if you have these extremely intelligent, autonomous AIs in data centers, those are the things that could be running in these GPU clusters in the United States, but not in China.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
There are buzzy words in the AI community about this — test-time compute, inference-time compute, whatever. But Dylan has good research on this. You can get to specific numbers on the ratio: when you train a model, you can look at the amount of compute used at training and the amount of compute used at inference.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
These reasoning models are making inference way more important to doing complex tasks. In the fall, in December, OpenAI announced this O3 model. There's another thing in AI when things move fast, we get both announcements and releases.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Announcements are essentially blog posts where you pat yourself on the back and say you did things, and releases are when the model is out there, the paper is out there, et cetera. So OpenAI has announced O3, and we can check if O3 Mini is out as of recording, potentially. But that doesn't really change the point, which is that
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
The breakthrough result was on something called the ARC-AGI task — the Abstraction and Reasoning Corpus, a task for artificial general intelligence. François Chollet is the guy behind it; it's a multi-year-old paper and a brilliant benchmark. And the number for OpenAI O3 to solve this was that it used some large number of samples in the API. The API has, like, thinking effort and number of samples.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
They used a thousand samples to solve this task, and it comes out to be like $5 to $20 per question, where you're putting in what is effectively a math puzzle. So it takes on the order of dollars to answer one question, and this is a lot of compute. If those are going to take off in the US, OpenAI needs a ton of GPUs on inference to capture this.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
They have this OpenAI ChatGPT Pro subscription, which is $200 a month, which Sam said they're losing money on, which means that people are burning a lot of GPUs on inference. I've signed up for it. I've played with it. I don't think I'm a power user, but I use it.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And that is the thing that a Chinese company, with medium-strong export controls — there will always be loopholes — might not be able to do at all. The main result for O3 is also spectacular coding performance, and that feeds back into AI companies being able to experiment better.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
This is what people like the CEOs or leaders of OpenAI and Anthropic talk about: autonomous AI models, where you give them a task and they work on it in the background. I think my personal definition of AGI is much simpler.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I think language models are a form of AGI and all of this super powerful stuff is a next step that's great if we get these tools, but a language model has so much value in so many domains. It is a general intelligence to me.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But this next step of agentic things, where they're independent and they can do tasks that aren't in the training data, is the few-year outlook that these AI companies are driving for.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And he has a much more positive view in his essay, Machines of Loving Grace. I've read into this. I don't have enough background in the physical sciences to gauge exactly how confident I am about whether AI can revolutionize biology. I'm safe saying that AI is going to accelerate the progress of any computational science.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I don't like to attribute specific abilities, because predicting specific abilities and when they'll arrive is very hard. I think mostly, if you're going to say that I'm feeling the AGI, it's that I expect continued rapid, surprising progress over the next few years. So something like R1 is less surprising to me from DeepSeek, because I expect there to be new paradigms where substantial progress can be made.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I think DeepSeek R1 is so unsettling because we're kind of on this path with ChatGPT — it's getting better, it's getting better, it's getting better — and then we have a new direction for changing the models, and we took one step like this, and we took a step up. So it looks like a really fast slope, and then we're going to just take more steps.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So it's just really unsettling when you have these big steps. And I expect that to keep happening. I've tried OpenAI Operator. I've tried Claude computer use. They're not there yet. I understand the idea. But it's just so hard to predict what the breakthrough is that will make something like that work.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And I think it's more likely that we have breakthroughs that work and things that we don't know what they're going to do. So everyone wants agents. Dario has a very eloquent way of describing this. And I just think that it's like, there's going to be more than that. So just expect these things to come.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
There's some research that shows that distribution is actually the limiting factor. So language models haven't yet made misinformation particularly change the equation there. The internet is still ongoing. I think there's a blog, AI Snake Oil, from some of my friends at Princeton who write on this stuff. So there is research.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It's a default that everyone assumes — and I would have thought the same thing — that misinformation gets far worse with language models. But in terms of internet posts and things that people have been measuring, it hasn't been an exponential increase or something extremely measurable.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And the things you're talking about, like voice calls and stuff like that, could be in modalities that are harder to measure. So it's something where it's too soon to tell. Political instability via the web is monitored by a lot of researchers to see what's happening. I think you're asking about the AGI thing.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
If you make me give a year, I would be like, okay, I have AI CEOs saying this. They've been saying two years for a while. People like Dario, Anthropic's CEO, have thought about this so deeply. I need to take their word seriously, but also understand that they have different incentives.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So I'd be like, add a few years to that, which is how you get something similar to 2030 or a little after 2030.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So many of these AI teams are all people without a US passport.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I think we need to make the point clear on why the time is now, for people that don't think about this. Essentially, with export controls, you're making it so China cannot make or get cutting-edge chips. And the idea is that if you time this wrong, China is pouring a ton of money into their chip production.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And if you time it wrong, they are going to have more capacity for production, more capacity for energy, and figure out how to make the chips and have more capacity than the rest of the world to make the chips because everybody can buy. They're going to sell their Chinese chips to everybody. They might subsidize them.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And therefore, if AI takes a long time to become differentiated, we've kneecapped the financial performance of American companies. NVIDIA can sell less. TSMC cannot sell to China. So therefore, we have less demand to therefore keep driving the production cycle. So that's the assumption behind the timing being important.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And the tools.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I think the fulcrum point is the transition from 7 nanometer to 5 nanometer chips, where I think it was Huawei that had the 7 nanometer chip a few years ago, which caused another political brouhaha, almost like this moment. And then it's the ASML deep UV, what is that?
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Many companies have done 7 nanometer chips. And the question is, we don't know how much Huawei was subsidizing production of that chip. Intel has made 7 nanometer chips that are not profitable, right, and things like this. So this is how it all feeds back into the economic engine of export controls.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
That's what a lot of people are worried about. People in AI have been worried that this is going towards a Cold War, or already is.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I think we should comment that why Chinese economy would be hurt by that is that they're export heavy. I think the United States buys so much. If that goes away, that's how their economy- Well, also they just would not be able to import
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I think you could say it simply, which is the cost per fab goes up. And if you are a small player that makes a few types of chips, you're not going to have the demand to pay back the cost of the fab. Whereas NVIDIA can have many different customers and aggregate all this demand into one place. And then they're the only person that makes enough money building chips to build the next fab.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So this is kind of why the companies slowly get killed, because they have 10 years ago a chip that is profitable and is good enough, but the cost to build the next one goes up. They may try to do this, fail because they don't have the money to make it work, and then they don't have any chips, or they build it and it's too expensive and they just have...
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
The ants just know. It's like one person just specializes in this one task. And it's like, you're going to take this one tool and you're the best person in the world, and this is what you're going to do for your whole life: this one task in the fab.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
This is what China is investing in. as well. It's like they can build out this long tail fab where the techniques are much more known. You don't have to figure out these problems with EUV. They're investing in this. And then they have large supply for things like the car door handles and the random stuff.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And that trickles down into this whole economic discussion as well, which is they have far more than we do. And having supply for things like this is crucial to normal life.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I think if you have the demand and the money is on the line, the American companies figure it out. It's going to take handholding with the government, but I think that the culture helps TSMC break through and it's easier for them.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I'm sure we agree with you.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
We told you Dylan knows all this stuff.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I mean, ultimately, the export controls are pointing towards a separate future economy. I think the US has made it clear to Chinese leaders that we intend to control this technology at whatever cost to global economic integration. It's hard to unwind that. The card has been played.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It seems like it's already happening. As much as I want there to be centuries of prolonged peace, it looks like further instability internationally is ahead.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Memory, right?
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
We need to go through a lot of specific technical things of transformers to make this easy for people.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah, so the attention operator has three core things. It's queries, keys, and values. QKV is the thing that goes into this. You'll look at the equation. You see that these matrices are multiplied together.
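A minimal NumPy sketch of that operator, assuming the standard scaled-dot-product formulation — this is illustrative only, not any lab's production kernel:

```python
# Minimal single-head attention sketch. Q, K, V each have one row per token;
# the softmax re-weights the values for every query position.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # [num_tokens, num_tokens] score matrix
    weights = softmax(scores, axis=-1)  # each query attends over all keys
    return weights @ V                  # weighted sum of values

# toy usage: 8 tokens, head dimension 64
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 64)) for _ in range(3))
print(attention(Q, K, V).shape)  # (8, 64)
```

A decoder model would also apply a causal mask so each token only attends to earlier tokens; that detail is omitted here for brevity.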
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
These words query, key, and value come from information retrieval backgrounds, where the query is the thing you're searching with, you match it against the keys, and the values get reweighted and returned. My background's not in information retrieval and things like this. It's just fun to have backlinks.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
What effectively happens is that when you're doing these matrix multiplications, you're having matrices that are of the size of the context length. So the number of tokens that you put into the model and the KV cache is effectively some form of compressed representation of all the previous tokens in the model. So when you're doing this, we talk about autoregressive models.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
You predict one token at a time. You start with whatever your prompt was. You ask a question like, who was the president in 1825? The model then is going to generate its first token. For each of these tokens, you're doing the same attention operator where you're multiplying these query, key, value matrices.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But the math is very nice, so that when you're doing this repeatedly, this KV cache, this key-value matrix operation, you can keep appending the new keys and values to it. So you keep track of the previous values you're inferring over in this autoregressive chain. You keep it in memory the whole time. And this is a really crucial thing to manage when serving inference at scale.
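A toy sketch of that appending behavior during decoding, under the same assumptions as the attention sketch above (one head, NumPy, no batching):

```python
# Toy KV cache: during autoregressive decoding, each new token's key and value
# rows are appended and kept in memory, so later steps never recompute them.
import numpy as np

class KVCache:
    def __init__(self, head_dim):
        self.K = np.empty((0, head_dim))
        self.V = np.empty((0, head_dim))

    def append(self, k_new, v_new):
        self.K = np.vstack([self.K, k_new])
        self.V = np.vstack([self.V, v_new])

def decode_step(q_new, k_new, v_new, cache):
    """One autoregressive step: the newest token attends over everything so far."""
    cache.append(k_new, v_new)
    scores = q_new @ cache.K.T / np.sqrt(q_new.shape[-1])  # [1, tokens_so_far]
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache.V                                # [1, head_dim]

# toy usage: five decoding steps with head dimension 64
rng = np.random.default_rng(0)
cache = KVCache(head_dim=64)
for _ in range(5):
    q, k, v = (rng.normal(size=(1, 64)) for _ in range(3))
    out = decode_step(q, k, v, cache)
print(cache.K.shape)  # (5, 64) -- the cache grows by one row per generated token
```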
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
There are far bigger experts in this, and there are so many levels of detail that you can go into. Essentially, one of the key quote-unquote drawbacks of the attention operator and the transformer is that there is a form of quadratic memory cost in proportion to the context length.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So as you put in longer questions, the memory used in order to make that computation is going up in the form of a quadratic. You'll hear about a lot of other language model architectures that are like sub-quadratic or linear attention forms, which is like state-space models. We don't need to go down all these now.
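Roughly, for the standard formulation over n tokens of dimension d, the scaling looks like this — the n-by-n score matrix is the quadratic part, while the KV cache itself grows linearly:

```latex
\underbrace{QK^\top \in \mathbb{R}^{n \times n}}_{\text{attention scores}} \;\Rightarrow\; \text{memory} \propto n^{2},
\qquad
\underbrace{K, V \in \mathbb{R}^{n \times d}}_{\text{KV cache}} \;\Rightarrow\; \text{memory} \propto n
```

Optimized kernels avoid materializing the full score matrix, but the compute still scales quadratically with context length.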
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And then there's innovations on attention to make this memory usage and the ability to attend over long contexts much more accurate and high performance.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
They help with memory constraint and performance. So if you put in a book into... I think Gemini is the model that has the longest context length that people are using. Gemini is known for 1 million and now 2 million context length. You put a whole book into Gemini and... Sometimes it'll draw facts out of it. It's not perfect. They're getting better. So there's two things.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
One, to be able to serve this on the memory level. Google has magic with their TPU stack where they can serve really long contexts. And then there's also many decisions along the way to actually make long context performance work. This implies the data. There's subtle changes to these computations and attention. And it changes the architecture.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But serving long context is extremely memory constrained, especially when you're making a lot of predictions. I actually don't remember exactly why output tokens are more expensive than input tokens, but I think essentially with output tokens you have to do more computation, because you have to sample from the model.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So these are features that APIs are shipping, like prompt caching and pre-filling, because you can drive prices down and you can make APIs much faster. If you run a business and you're going to keep passing the same initial content to Claude's API, you can load that into the Anthropic API and always keep it there.
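A conceptual sketch of what prompt caching buys you — this is not Anthropic's actual API, and `prefill`/`decode` are hypothetical methods; the point is just that a shared prefix's KV state gets computed once and reused across requests:

```python
# Hypothetical prefix-caching wrapper: hash the shared system prompt, keep its
# precomputed KV state around, and only run the model over each request's suffix.
import hashlib

class PrefixCache:
    def __init__(self, model):
        self.model = model   # assumed to expose prefill(text) and decode(text, kv_state=...)
        self.store = {}      # prefix hash -> cached KV state

    def generate(self, prefix, suffix):
        key = hashlib.sha256(prefix.encode()).hexdigest()
        if key not in self.store:
            # full prefill happens only on a cache miss
            self.store[key] = self.model.prefill(prefix)
        # on a hit, the long prefix is never re-processed
        return self.model.decode(suffix, kv_state=self.store[key])
```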
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But it's very different when we're kind of leading to the reasoning models, which we talked about — we showed this example earlier and read some of this kind of mumbling stuff. And what happens is that the output context length is so much higher.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And I mean, I learned a lot about this from Dylan's work, which is essentially, as the output length gets higher, you're writing this quadratic in terms of memory used. And then the GPUs that we have, effectively, you're going to run out of memory, and they're all trying to serve multiple requests at once.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So they're doing this batch processing where not all of the prompts are exactly the same — really complex handling. And then as context length gets longer, there's this, I think you call it critical batch size, where your ability to serve — so how much you can parallelize your inference — plummets because of this long context.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So your memory usage is going way up with these reasoning models, and you still have a lot of users. So effectively, the cost to serve multiplies by a ton.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah, so DeepSeek V3 is a new mixture-of-experts transformer language model from DeepSeek, who is based in China. They have some new specifics in the model that we'll get into. Largely, this is an open weight model, and it's an instruction model like what you would use in ChatGPT. They also released what is called the base model, which is before these techniques of post-training.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
for your KV cache, so you end up not being able to run a certain number of... your sequence length is capped, or the number of users is capped, let's say, for the model. So this is showing it for a 405B model at batch size 64 — Llama 3.1 405B. And batch size is crucial: essentially, you want to have a higher batch size to parallelize your throughput.
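A back-of-the-envelope version of that chart's point, assuming roughly the published Llama 3.1 405B attention config (126 layers, 8 KV heads from grouped-query attention, head dimension 128) and fp16 cache entries — treat the exact numbers as assumptions:

```python
# KV cache sizing sketch for a 405B-class model; the config numbers are assumptions.
layers, kv_heads, head_dim, bytes_per_val = 126, 8, 128, 2  # fp16 storage

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
print(f"KV cache per token: {bytes_per_token / 2**20:.2f} MiB")      # ~0.49 MiB

batch, seq_len = 64, 32_768
total = batch * seq_len * bytes_per_token
print(f"batch {batch} x {seq_len:,} tokens: {total / 2**40:.2f} TiB of KV cache alone")

# An 8x H100 node has roughly 640 GB of HBM before you even count the weights,
# so either the batch size or the sequence length gets capped.
hbm_bytes = 640 * 10**9
print(f"rough batch-size ceiling at {seq_len:,} tokens: {hbm_bytes // (seq_len * bytes_per_token)}")
```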
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Let's go into DeepSeek again. So we're in the post-DeepSeek-R1 time, I think. And there are two sides to this market watching how hard it is to serve it. On one side, we're going to talk about DeepSeek themselves. They now have a chat app that got to number one on the App Store. Disclaimer: number one on the App Store is measured by velocity.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So it's not necessarily saying that more people have the DeepSeek app than the ChatGPT app. But it is still remarkable — Claude has never hit number one in the App Store, even though everyone in San Francisco is like, oh my God, you've got to use Claude, don't use ChatGPT. So DeepSeek hit this.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
They also launched an API product recently where you can ping their API and get these super long responses for R1 out. At the same time as these are out — we'll get to what's happened to them.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Because the model weights for DeepSeek R1 are openly available and the license is very friendly — the MIT license, so it's commercially usable — all of these mid-sized companies and big companies are trying to be first to serve R1 to their users. We were trying to evaluate R1 because we have really similar research going on. We released the model and we're trying to compare to it.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And out of all the companies that are quote-unquote serving R1 — and they're doing it at prices that are way higher than the DeepSeek API — most of them barely work and the throughput is really low.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Related to our previous discussion, this multi-head latent attention can save about 80% to 90% in memory from the attention mechanism, which helps especially with long context.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Most people use instruction models today, and those are what's served in all sorts of applications. This was released on, I believe, December 26th, or that week. And then weeks later, on January 20th, DeepSeek released DeepSeek R1, which is a reasoning model, which really accelerated a lot of this discussion.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
This 80% to 90% doesn't say that the whole model is 80% to 90% cheaper, just this one part of it.
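A rough illustration of where a saving that large can come from, using config numbers reported in the DeepSeek V2/V3 papers as assumptions (128 heads of dimension 128 for standard attention, versus a ~512-dimensional compressed latent plus a ~64-dimensional shared rotary key for MLA):

```python
# Per-token, per-layer KV cache footprint: standard multi-head attention vs.
# multi-head latent attention (MLA). Numbers are assumptions from the papers.
heads, head_dim = 128, 128
mha_values = 2 * heads * head_dim   # full K and V: 32,768 cached values
mla_values = 512 + 64               # compressed latent + shared RoPE key: 576 values

print(f"MHA: {mha_values} values/token/layer, MLA: {mla_values} values/token/layer")
print(f"reduction vs. naive MHA: {1 - mla_values / mha_values:.1%}")
# Against grouped-query attention baselines the gap is smaller, which is roughly
# where the quoted 80-90% figure lands.
```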
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
We think that OpenAI had a large margin built in. There's multiple factors.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
All their low-level libraries that we talked about in training, some of them probably translate to inference, and those weren't released.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
This reasoning model has a lot of overlapping training steps to DeepSeek V3, and it's confusing that you have a base model called V3 that you do something to, to get a chat model, and then you do some different things to get a reasoning model. I think a lot of the AI industry is going through this challenge of communications right now, where OpenAI makes fun of their own naming schemes.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
In some of the interviews, there's discussion of how doing this is a recruiting tool. You see this at the American companies too. It's like: having GPUs, recruiting tool. Being at the cutting edge of AI, recruiting tool. Open sourcing, recruiting tool.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
They released it on Inauguration Day. They know what is on the international calendar, but I don't expect them to. If you listen to their motivations for AI, it's like,
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I think that's one of their big advantages. We know that a lot of the American companies are very invested in safety. And that is the central culture of a place like Anthropic. And I think Anthropic sounds like a wonderful place to work. But if safety is your number one goal, it takes way longer to get artifacts out. That's why Anthropic is not open sourcing things. That's their claims.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But there's reviews internally. Anthropic mentions things to international governments. There's been news of how Anthropic has done pre-release testing with the UK AI Safety Institute. All of these things add inertia to the process of getting things out. And we're on this trend line where the progress is very high.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So if you reduce the time from when your model is done training — you run evals, it's good — you want to get it out as soon as possible to maximize the perceived quality of your outputs. DeepSeek does this so well.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
They have GPT-4o. They have OpenAI o1. And there are a lot of types of models. So we're going to break down what each of them are. There are a lot of technical specifics on training; we'll go from high level to specific and kind of go through each of them.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I mean, like people are infatuated with you. You're telling me this is a high value thing and it works and it's
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
This is something that Dario talks about. That's a situation that Dario wants to avoid. Dario also talks about the difference between race to the bottom and race to the top. And the race to the top is where there's a very high standard on safety, a very high standard on how your model performs on certain crucial evaluations.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And when certain companies are really good at it, they will converge. This is the idea. And ultimately, AI is not confined to one nationality or to one set of morals for what it should mean. And there's a lot of arguments on, like, should we stop open sourcing models? And if the US stops, it's pretty clear.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I mean, it's way easier to see now, with DeepSeek, that a different international body will be the one that builds it. We talk about the cost of training. DeepSeek has this shocking $5 million number. Think about how many entities in the world can afford 100 times that to have the best open source model that people use in the world. And it's like...
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It's a scary reality, which is that these open models are probably going to keep coming for the time being, whether or not we want to stop them. And stopping them might make it even worse and harder to prepare. But it just means that the preparation and understanding what AI can do is just so much more important. That's why I'm here at the end of the day.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
But it's about letting that sink in for people, especially people not in AI: this is coming. There are some structural things in a global, interconnected world that you have to accept.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah, Mark Zuckerberg is not new to having American values in how he presents his company's trajectory. I think their products have long since been banned in China, and I respect them saying it directly.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah, so this discussion has been going on for a long time in AI. It became more important since ChatGPT or more focal since ChatGPT at the end of 2022. Open weights is the accepted term for when model weights of a language model are available on the internet for people to download. Those weights can have different licenses, which is effectively the terms by which you can use the model.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I mean, it's like the Karpathy line: English is the hottest new programming language, and that English is defined by a bunch of companies that primarily are in San Francisco.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yeah, they're cultural backdoors. The thing that amplifies the relevance of culture with language models is that We are used to this mode of interacting with people in back and forth conversation. And we now have a very powerful computer system that slots into a social context that we're used to, which makes people very... We don't know the extent to which people can be impacted by that.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Anthropic has research on this where they... show that if you put certain phrases in at pre-training, you can then elicit different behavior when you're actually using the model because they've poisoned the pre-training data. As of now, I don't think anybody in a production system is trying to do anything like this. I think it's mostly...
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
There are licenses that come from the history of open source software. There are licenses that are designed by companies specifically. Llama, DeepSeek, Qwen, Mistral — these popular names in open weight models all have some of their own licenses. It's complicated because not all the same models have the same terms. The big debate is on what makes a model open weight. Why are we saying this term?
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Anthropic is doing very direct work and mostly just subtle things. We don't know what these models are going to, how they are going to generate tokens, what information they're going to represent, and what the complex representations they have are.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I mean, we've already seen this with recommendation systems.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
There's no reason in some number of years that you can't train a language model to maximize time spent on a chat app. Like right now they are trained.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Their time per session is like two hours. Yeah. Character AI very likely could be optimizing this, where the way that this data is collected is naive — you're presented a few options and you choose them. But that's not the only way that these models are going to be trained. It's naive stuff, like talk to an anime girl.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I know where you're going. I mean, you can see it physiologically. Like, I take three days if I'm backpacking or something, and you're literally breaking down addiction cycles.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I mean, there are already tons of AI bots on the internet. Right now, it's not frequent, but every so often, I have replied to one, and they're instantly replying. I'm like, crap, that was a bot. And that is just going to become more common. They're going to get good.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It's kind of a mouthful. It sounds close to open source, but it's not the same. There's still a lot of debate on the definition and soul of open source AI. Open source software has a rich history on freedom to modify, freedom to take it on your own, freedom from any restrictions on how you would use the software, and what that means for AI is still being defined.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
There's probably a few things to keep in mind here. One is the kind of Tiananmen Square factual knowledge — how does that get embedded into the models? Two is the Gemini incident, what you called the Black Nazi model, which is when Gemini as a system had this extra thing put into it that dramatically changed the behavior.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And then three is what most people would call general alignment, RLHF post-training. Each of these has a very different scope in how it is applied. If you're just going to look at the model weights, auditing specific facts is extremely hard, because you have to comb through the pre-training data and look at all of this.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
And that's terabytes of files, looking for very specific words or hints of the words.
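A toy sketch of what that combing looks like mechanically — stream through compressed text shards and flag documents containing target phrases. The file layout (gzipped JSONL with a `text` field), the path, and the phrases are all hypothetical; real pipelines shard this work across many machines:

```python
# Toy pre-training data audit: scan gzipped JSONL shards for target phrases.
import gzip, json, glob

TARGET_PHRASES = ["example phrase one", "example phrase two"]  # placeholders

def scan_shard(path):
    hits = []
    with gzip.open(path, "rt", encoding="utf-8", errors="ignore") as f:
        for line_no, line in enumerate(f):
            doc = json.loads(line)
            text = doc.get("text", "").lower()
            if any(phrase in text for phrase in TARGET_PHRASES):
                hits.append((path, line_no))
    return hits

all_hits = []
for shard in glob.glob("pretraining_data/*.jsonl.gz"):  # hypothetical path
    all_hits.extend(scan_shard(shard))
print(f"{len(all_hits)} documents mention a target phrase")
```

Exact string matching like this is also why the later question matters: it only catches literal mentions, not wordplay or encoded references.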
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So if you want to get rid of facts in a model, you have to do it at every stage. You have to do it at the pre-training. So most people think that pre-training is where most of the knowledge is put into the model and then you can elicit and move that in different ways, whether through post-training or whether through systems afterwards.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
I almost think it's practically impossible. Because you effectively have to remove them from the internet.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It gets filtered out. So you have quality filters, which are small language models that look at a document and tell you how good this text is. Is it close to a Wikipedia article, which is the kind of good text that we want language models to be able to imitate?
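A toy sketch of that kind of quality filter — a tiny classifier trained to separate reference-quality text from junk, then used to score new documents. Real filters are trained on far more data and often use fastText-style models; the training sentences and threshold here are made up for illustration:

```python
# Toy quality filter: score documents by how "Wikipedia-like" they look.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reference_quality = [
    "The mitochondrion is an organelle found in most eukaryotic cells.",
    "The Treaty of Westphalia ended the Thirty Years' War in 1648.",
]
web_junk = [
    "CLICK HERE best deals buy now limited offer!!!",
    "lol idk whatever subscribe like and share",
]

quality_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
quality_model.fit(reference_quality + web_junk, [1, 1, 0, 0])

def keep_document(text, threshold=0.5):
    # probability that the document looks like the reference-quality class
    return quality_model.predict_proba([text])[0][1] >= threshold

print(keep_document("The organelle is found in eukaryotic cells."))
```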
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
So for what I do, I work at the Allen Institute for AI. We're a nonprofit. We want to make AI open for everybody. And we try to lead on what we think is truly open source. There's not full agreement in the community. But for us, that means releasing the training data, releasing the training code, and then also having open weights like this. And we'll get into the details of the models and
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Yes, but is it going to catch wordplay or encoded language?
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
It'll have the ability to express it.
Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
This is what happens. A lot of what is called post-training is a series of techniques to get the model... on rails of a really specific behavior.