Mixture-of-Experts (MoE) replaces a single dense feed-forward block with many expert sub-networks and activates only a few of them per token. The result is a model that can hold billions of parameters while spending far less compute per token than a dense model of the same size.

How routing works

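A lightweight gating network scores each token against every expert and sends the token to the k highest-scoring experts (top-k). Only those experts compute, and their outputs are combined into a single result, so per-token FLOPs scale with k rather than with the total number of experts, while total capacity grows as experts are added.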

token -> gate -> [expert 2, expert 7] -> combine -> output
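
The flow above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names softmax, route_token and moe_forward are hypothetical, the experts are toy functions that just scale their input, and renormalizing the gate scores over only the selected experts is one common convention among several.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_scores, k):
    # Pick the k highest-scoring experts and return (index, weight) pairs.
    # Assumption: weights are a softmax over the selected scores only.
    top = sorted(range(len(gate_scores)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    weights = softmax([gate_scores[i] for i in top])
    return list(zip(top, weights))

def moe_forward(token, gate_scores, experts, k=2):
    # Run only the selected experts and combine their outputs by weight.
    output = [0.0] * len(token)
    for idx, weight in route_token(gate_scores, k):
        expert_out = experts[idx](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output

# Toy experts (hypothetical): expert i multiplies its input by i + 1.
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(4)]

# Gate scores would normally come from a learned gating network.
gate_scores = [0.1, 2.0, -1.0, 1.0]
print(route_token(gate_scores, k=2))
print(moe_forward([1.0, 2.0], gate_scores, experts, k=2))
```

With four experts and k=2, only two expert functions are ever called for this token, which is where the compute saving comes from.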

The trade-offs

MoE introduces load-balancing and memory challenges: routing can collapse onto a few popular experts, leaving the rest undertrained, and every expert's parameters must still fit in memory even though only a few are used per token. The paper proposes an auxiliary balancing loss that encourages even expert utilization.
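
To make the balancing idea concrete, here is a minimal sketch of one widely used formulation, which may differ from the exact loss the paper proposes. It assumes top-1 routing and computes N * sum_i(f_i * P_i), where N is the number of experts, f_i is the fraction of tokens dispatched to expert i, and P_i is the mean gate probability assigned to expert i. The function name and arguments are hypothetical.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    # router_probs: one list of per-expert probabilities for each token.
    # assignments: the expert index each token was dispatched to (top-1).
    num_tokens = len(assignments)

    # f[i]: fraction of tokens dispatched to expert i.
    f = [0.0] * num_experts
    for expert_idx in assignments:
        f[expert_idx] += 1.0 / num_tokens

    # p[i]: mean gate probability given to expert i across all tokens.
    p = [sum(token_probs[i] for token_probs in router_probs) / num_tokens
         for i in range(num_experts)]

    # Scaled so that perfectly uniform routing gives a loss of 1.
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced routing across two experts versus a collapsed router.
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], 2)
collapsed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], 2)
print(balanced, collapsed)
```

The collapsed case scores higher than the balanced one, so adding this term to the training loss (usually with a small coefficient) pushes the gate toward spreading tokens across experts.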