Lex Fridman Podcast
#459 – DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters
Dylan Patel
And where mixture of experts is applied is to the dense part of the model — the feed-forward layers, which hold most of the weights if you count them in a transformer model. So you can get really big gains from mixture of experts in parameter efficiency at training and inference, because you get this efficiency by not activating all of those parameters for every token.
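The sparsity Dylan describes — only a few experts' weights touched per token — can be sketched in a few lines. This is a hypothetical toy (the sizes, the single-token routing, and the `moe_forward` helper are all illustrative, not any production MoE implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 8, 4, 2  # toy sizes, chosen for illustration

# Each "expert" is a small feed-forward weight matrix.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))  # routing weights

def moe_forward(x):
    """Route a single token vector x through only its top-k experts."""
    logits = x @ router                   # score every expert for this token
    top = np.argsort(logits)[-top_k:]     # indices of the k best-scoring experts
    # Softmax over just the selected experts' scores.
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()
    # Only top_k of n_experts weight matrices are multiplied here; the
    # other experts' parameters are never touched for this token.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

x = rng.standard_normal(d_model)
y = moe_forward(x)
print(y.shape)
```

With `top_k = 2` of 4 experts, each token's forward pass multiplies only half the expert parameters — that gap is the parameter-efficiency gain at train and inference time, and it grows as the expert count rises relative to `top_k`.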