Mixture of Experts
(MoE) An architecture where only a few specialized sub-networks ("experts") fire for any given input, chosen by a learned router.
In one line
A router picks a few experts per token instead of running the whole network — sparse compute, dense-looking capacity.
What it actually means
Inside a transformer FFN block, instead of one big MLP you have N smaller expert MLPs plus a tiny router. For each token, the router scores the experts and sends the token to the top-k (usually k = 1 or 2). Only those experts run; the rest sit idle. The result is a model with far more total parameters than a dense model of the same inference cost. Training tricks (auxiliary load-balancing losses, expert capacity limits) prevent the router from collapsing onto a single favorite expert.
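The routing step above can be sketched in a few lines. This is a minimal, single-token illustration (real routers operate on batches of token vectors and differ in whether softmax is applied before or after the top-k cut); `route` and `softmax` are hypothetical helper names, not any library's API.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of router scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(scores, k=2):
    # Keep the k highest-scoring experts and renormalize their
    # gate weights so the kept weights sum to 1.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:k]
    probs = softmax(scores)
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]

# One token's router scores over 4 experts: experts 1 and 3 win.
picks = route([0.1, 2.0, 0.3, 1.5], k=2)
```

Only the experts in `picks` would actually execute; their weighted outputs are then summed, as in the Example section.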
Why it matters
Mixtral 8x7B, DeepSeek-MoE, Qwen-MoE, and most rumored frontier models use MoE. It’s how you get a model that behaves like a 200B dense model at the inference cost of a ~30B dense model. The downside is memory — you still have to load all experts into GPU RAM even if you only run two per token — and serving complexity. MoE is the dominant scaling lever for open models in 2026.
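The capacity-vs-cost gap comes down to simple arithmetic: memory scales with total parameters, per-token compute with active parameters. A sketch with made-up numbers (not any real model's config):

```python
def moe_params(shared, per_expert, n_experts, top_k):
    # shared: attention + embeddings + router params (always active)
    # per_expert: parameters in one expert FFN (summed over layers)
    total = shared + n_experts * per_expert    # must fit in GPU RAM
    active = shared + top_k * per_expert       # runs per token
    return total, active

# Hypothetical 8-expert, top-2 model
total, active = moe_params(shared=12e9, per_expert=5e9, n_experts=8, top_k=2)
# total = 52e9 loaded in memory, active = 22e9 executed per token
```

The ratio (here 52B held vs 22B run) is exactly the "cheap to run, big to host" trade-off.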
Example
token x → router(x) → top-2 experts (E3, E7) with weights (0.7, 0.3)
output = 0.7 * E3(x) + 0.3 * E7(x)
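The two lines above translate directly to code. Here E3 and E7 are stand-in toy functions (real experts are MLPs), and the weights 0.7/0.3 are taken from the example:

```python
# Toy experts matching the names in the example; real experts are MLPs.
def E3(x):
    return [2 * v for v in x]       # stand-in: doubles each component

def E7(x):
    return [v + 1 for v in x]       # stand-in: adds 1 to each component

x = [1.0, 2.0]
w3, w7 = 0.7, 0.3                   # renormalized router gate weights
output = [w3 * a + w7 * b for a, b in zip(E3(x), E7(x))]
# output ≈ [2.0, 3.7]
```

Note the weighted sum only mixes the k selected experts; the other N - 2 experts never run for this token.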
You’ll hear it when
- Reading any recent open-weights model card.
- Discussing why Mixtral is cheap to run but big to host.
- Comparing dense vs sparse scaling laws.
- Debugging routing collapse or load-balance losses.