ModelTerms

Architecture · advanced

Mixture of Experts (MoE)

Mixture of Experts is a transformer variant in which the feed-forward block of a layer is replaced by many parallel "expert" feed-forward networks, only a few of which are activated for each token. Total parameter count can grow without a matching growth in per-token compute.

Explanation

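In a dense transformer, every parameter contributes to every token. In an MoE transformer, a small router scores the experts for each token and sends the token to the top-K of them (commonly 2 of 8 or 16); the selected experts' outputs are combined, weighted by the router's scores. Experts that are not selected add no compute for that token, although their weights must still be held in memory.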
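
The routing step can be sketched in a few lines. This is a minimal illustration, not any particular model's implementation: the function names (`top_k_route`, `moe_layer`), the toy experts, and the hard-coded router logits are all assumptions made for the example. In a real model the logits come from a learned linear layer, and some models apply the softmax before rather than after the top-K selection.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Assumption: renormalise over the selected experts only, so gates sum to 1.
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))

def moe_layer(token, router_logits, experts, k=2):
    """Gate-weighted sum of the selected experts' outputs; other experts are never called."""
    out = [0.0] * len(token)
    for idx, gate in top_k_route(router_logits, k):
        y = experts[idx](token)
        out = [o + gate * v for o, v in zip(out, y)]
    return out

# Hypothetical toy setup: 8 "experts" that just scale the input by 0..7.
experts = [lambda x, s=s: [s * v for v in x] for s in range(8)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.3, 1.0]  # made-up router scores

print(top_k_route(logits, k=2))                # experts 4 and 1 are selected
print(moe_layer([1.0, 2.0], logits, experts))  # only those two experts run
```

Only the two selected expert functions are ever called for this token, which is where the compute saving comes from; all eight still have to exist in memory.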

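A model with 400 billion total parameters can therefore cost roughly the same compute per token as a 50-billion-parameter dense model, provided only about 50 billion of its parameters are active for any one token. The trade-off is added complexity in training (expert load has to stay balanced), routing, and serving (every expert must fit in memory even though few run per token).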
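
The gap between total and active parameters is simple arithmetic. The sketch below uses a hypothetical configuration (the layer count, expert size, and shared-parameter figure are invented for illustration and do not describe any real model) to show how a model in the hundreds of billions of parameters can activate only about 50 billion per token.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k, num_moe_layers):
    """Return (total, active_per_token) parameter counts for a simple MoE model.

    shared_params: weights every token uses (attention, embeddings, routers, ...).
    params_per_expert: weights in one expert feed-forward network in one layer.
    """
    total = shared_params + num_moe_layers * num_experts * params_per_expert
    active = shared_params + num_moe_layers * top_k * params_per_expert
    return total, active

# Hypothetical configuration, chosen only to make the numbers easy to follow.
total, active = moe_param_counts(
    shared_params=5_000_000_000,
    params_per_expert=350_000_000,
    num_experts=16,
    top_k=2,
    num_moe_layers=64,
)
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

With these made-up numbers the model stores about 363B parameters but touches about 50B per token. Memory requirements follow the first figure, per-token compute follows the second.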

Mixtral 8x7B and DeepSeek-V3 are MoE models, and GPT-4 is rumored to be one as well. The technique is a major reason "frontier" models can be so large while still running at reasonable speed.

Examples

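  • Mixtral 8x7B: 8 experts per layer, 2 active per token. Because attention weights are shared across experts, the total is about 47B parameters rather than 56B, with about 13B active per token.
  • DeepSeek-V3: 256 routed experts plus 1 shared expert per MoE layer, with 8 routed experts active per token (671B total parameters, about 37B active).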

When to use mixture of experts

When you want frontier-scale capability without paying frontier-scale per-token compute, and you can afford the memory to hold every expert.

Frequently asked

How is Mixture of Experts related to Transformer?

Mixture of Experts is a modification of the transformer, the neural network architecture behind virtually every modern large language model. An MoE model keeps the transformer's self-attention, which models relationships between all positions in a sequence in parallel, and replaces the dense feed-forward block in some or all layers with a routed set of experts.

Is Mixture of Experts considered advanced?

Mixture of Experts is generally considered advanced-level material in the AI and LLM space.

Transformer · Architecture

The transformer is the neural network architecture behind virtually every modern large language model. It uses self-attention to model relationships between all positions in a sequence in parallel.

Inference · Inference

Inference is what happens when you actually run a trained model on new input. For LLMs that means generating tokens one at a time, with sampling and a KV cache.

Quantization · Infrastructure

Quantization reduces model weights from 16- or 32-bit floats to lower-precision types (INT8, INT4) so the model needs less memory and runs faster, usually with minor quality loss.

Scaling Laws · Training

Scaling laws are the empirical power-law relationships between model size, training data, training compute, and resulting loss. They predict that larger models trained on more data keep improving in a smooth, forecastable way.

Parameter Count · Architecture

Parameter count is the total number of learnable weights in a model — "7B" means 7 billion parameters. It is the most cited model-size metric, though not always the most informative.
