Architecture · advanced
Mixture of Experts (MoE)
Mixture of Experts (MoE) is a transformer variant in which the dense feed-forward block in some or all layers is replaced by many parallel "expert" feed-forward networks, of which only a few are activated per token. Total parameter count grows while per-token compute stays roughly constant.
Explanation
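In a dense transformer, every parameter contributes to every token. In an MoE transformer, a small router network scores the experts in each MoE layer and sends each token to the top-K of them (commonly 2 of 8 or 16), then combines their outputs using the router's weights. The unselected experts add no compute for that token, though their weights must still be held in memory.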
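The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration for a single token, not any particular model's implementation: the function names, the toy experts, and the router scores are all hypothetical, and it assumes one common variant in which the softmax is taken over only the selected experts' scores.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights.

    router_logits: one score per expert for a single token.
    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(token, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(token)
    for idx, w in route_top_k(router_logits, k):
        y = experts[idx](token)  # the other experts are never called
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy experts (assumption): expert i just scales the token by (i + 1).
# Real experts are feed-forward networks.
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]
token = [1.0, -2.0]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]  # hypothetical router scores

print(route_top_k(logits))                # experts 1 and 4, weight 0.5 each
print(moe_layer(token, logits, experts))  # [3.5, -7.0]
```

Only two of the eight toy experts are evaluated, which is where the compute saving comes from. In a real model the router scores come from a learned linear layer applied to the token's hidden state.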
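This means a 400-billion-parameter MoE model can cost roughly the same compute per token as a 50-billion-parameter dense model. The trade-off is added complexity in training (keeping load balanced across experts), routing, and serving (every expert must stay loaded, so memory scales with total parameters, not active ones).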
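The gap between total and per-token parameters is simple arithmetic. The sketch below uses a simplified model in which every layer is an MoE layer; the function name and all configuration numbers are hypothetical, chosen only so the result lands near the 400B versus 50B figures mentioned above.

```python
def moe_param_counts(shared, per_expert, n_experts, top_k, n_layers):
    """Return (total, active) parameter counts for a simplified MoE model.

    shared:     parameters every token uses (attention, embeddings, norms)
    per_expert: parameters in one expert feed-forward network in one layer
    """
    total = shared + n_layers * n_experts * per_expert
    active = shared + n_layers * top_k * per_expert
    return total, active

# Hypothetical configuration, not taken from any real model.
total, active = moe_param_counts(
    shared=10_000_000_000,
    per_expert=400_000_000,
    n_experts=16,
    top_k=2,
    n_layers=60,
)
print(f"total:  {total / 1e9:.0f}B")   # total:  394B
print(f"active: {active / 1e9:.0f}B")  # active: 58B
```

Per-token compute tracks the active count, while memory requirements track the total count.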
Mixtral 8x7B and DeepSeek-V3 are MoE models, and GPT-4 is rumored to be one, though this has not been confirmed. The technique is a main reason "frontier" models can be so large while still running at reasonable speed.
Examples
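- Mixtral 8x7B: 8 experts per layer, 2 active per token. Attention weights are shared across experts, so the total is about 47B parameters (not 8 × 7B = 56B), with about 13B active per token.
- DeepSeek-V3: 256 routed experts plus 1 shared expert per MoE layer, with 8 routed experts active per token; 671B total parameters, about 37B active per token.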
When to use mixture of experts
When you want frontier-scale capability without paying frontier-scale per-token compute, and can afford the memory and serving complexity of keeping every expert loaded.
Frequently asked
How is Mixture of Experts related to Transformer?
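Mixture of Experts is a modification of the transformer, not a separate architecture. The transformer is the neural network architecture behind virtually every modern large language model; it uses self-attention to model relationships between all positions in a sequence in parallel. An MoE transformer keeps the self-attention blocks unchanged and replaces the dense feed-forward block in some or all layers with a routed set of expert networks.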
Is Mixture of Experts considered advanced?
Mixture of Experts is generally considered advanced-level material in the AI and LLM space.