Lex Fridman Podcast

#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters

2784.927 - 2804.215 Dylan Patel

When you're doing this training, the one objective is token prediction accuracy. And if you just let training of a mixture-of-experts model run on its own, it can be that the model learns to only use a subset of the experts. And in the MoE literature, there's something called the auxiliary loss, which helps balance them.
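
Editor's sketch: the speaker does not say which auxiliary loss he has in mind, so the example below assumes one common formulation from the MoE literature (the Switch Transformer style load-balancing term, N · Σ f_i · P_i, with top-1 routing). The function name `load_balancing_loss`, the `alpha` coefficient, and the toy router logits are all hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's expert logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits, alpha=1.0):
    """Assumed Switch-Transformer-style auxiliary loss (illustrative only).

    router_logits: one list of expert logits per token.
    Returns alpha * N * sum_i f_i * P_i, where N is the number of experts.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f[i]: fraction of tokens whose top-1 expert is i.
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    f = [c / num_tokens for c in counts]

    # P[i]: mean router probability assigned to expert i over the batch.
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, P))


# Toy batches of 4 tokens and 4 experts (made-up logits).
balanced = [
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 0.0],
    [0.0, 0.0, 0.0, 5.0],
]
collapsed = [[5.0, 0.0, 0.0, 0.0] for _ in range(4)]  # every token picks expert 0

balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
print(balanced_loss, collapsed_loss)
```

With these toy inputs the evenly routed batch scores 1.0, while the batch that sends every token to one expert scores close to 4, the number of experts. Adding a term like this to the token prediction loss penalizes the collapse onto a subset of experts that the speaker describes.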
