Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Dylan Patel
When you're doing this training, the one objective is token prediction accuracy. And if you just let training run on its own with a mixture-of-experts model, it can be that the model learns to use only a subset of the experts. And in the MoE literature, there's something called the auxiliary loss, which helps balance the load across them.
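To make the auxiliary loss concrete, here is a minimal sketch of one common formulation, the top-1 load-balancing term used in Switch Transformer-style routers. The speaker does not say which variant is meant, so this choice is an assumption, and the function names and toy router logits are hypothetical. The term multiplies, for each expert, the fraction of tokens routed to it by the average router probability it receives, so it is smallest when tokens are spread evenly.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balancing_loss(router_logits):
    """Top-1 auxiliary loss: num_experts * sum_i f_i * P_i.

    router_logits: list of per-token lists, one logit per expert
    (a hypothetical input layout chosen for this sketch).
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f[i]: fraction of tokens whose highest-probability expert is i.
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    f = [c / num_tokens for c in counts]

    # P[i]: mean router probability assigned to expert i over all tokens.
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    return num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Toy data: two experts, four tokens.
balanced = [[5.0, 0.0], [0.0, 5.0], [5.0, 0.0], [0.0, 5.0]]
collapsed = [[5.0, 0.0]] * 4  # every token prefers expert 0

balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
print(balanced_loss, collapsed_loss)
```

With these toy inputs the evenly routed batch scores lower than the batch where every token picks the same expert. In a real training run a term like this would typically be added to the token prediction loss with a small weight, so that it nudges the router toward balance without dominating the main objective.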