Mixture-of-Experts (MoE) replaces a single dense feed-forward block with many expert sub-networks, activating only a few per token. The result is a model that can hold billions of parameters while its per-token compute stays far below what that parameter count suggests.
How routing works
A lightweight gating network scores each token and sends it to the top-k experts. Only those experts compute, so per-token FLOPs stay low while total capacity grows.
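As a rough illustration of the routing step, here is a minimal pure-Python sketch for a single token. It assumes the gate produces one logit per expert, that a softmax turns those into probabilities, and that the weights of the selected experts are renormalized to sum to 1; the text does not specify these details, and real systems vary. The names `top_k_route` and `moe_forward` are hypothetical, and the experts are toy callables standing in for feed-forward blocks.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts for one token.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    Renormalizing over the selected experts is an assumption, not something
    the text prescribes.
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(token, gate_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(gate_logits, k):
        y = experts[idx](token)  # the other experts are never called
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy experts: each one just scales its input by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
mixed = moe_forward([1.0, 1.0], [0.1, 2.0, -1.0, 1.5], experts, k=2)
```

With four experts and k=2, only two expert calls happen per token, which is why compute tracks k rather than the total number of experts.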
The trade-offs
MoE introduces load-balancing and memory challenges: routing can collapse onto a few popular experts, leaving the rest undertrained, and every expert's parameters must still fit in memory even though only a few are used per token. The paper proposes an auxiliary balancing loss that keeps utilization even.
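The text does not give the form of the balancing loss, so the sketch below uses one widely used formulation as a stand-in (the Switch Transformer style term, alpha * N * sum over experts of f_i * P_i), not necessarily the one the paper proposes. Here f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. The function name `load_balance_loss` and the default `alpha` are illustrative assumptions.

```python
def load_balance_loss(router_probs, assignments, num_experts, alpha=0.01):
    """Auxiliary balancing term: alpha * N * sum_i(f_i * P_i).

    router_probs: one probability row per token (each row sums to 1).
    assignments:  the expert index each token was dispatched to (top-1 here).
    This particular form is an assumption; the source does not specify one.
    """
    num_tokens = len(router_probs)
    # f_i: fraction of tokens actually routed to expert i.
    f = [0.0] * num_experts
    for e in assignments:
        f[e] += 1.0 / num_tokens
    # P_i: average router probability assigned to expert i.
    p = [sum(row[i] for row in router_probs) / num_tokens
         for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Evenly spread routing versus routing collapsed onto expert 0.
balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)
collapsed = load_balance_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)
```

Under this formulation the term is smallest when both the dispatch fractions and the router probabilities are uniform, so adding it to the training loss penalizes collapsed routing more than evenly spread routing.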