LLMs · Paper Summary · 01/11/2026 · 6 min read

Sparse Mixture-of-Experts at Inference Scale

How routing tokens to a small subset of expert networks delivers larger effective capacity without proportional compute cost.

Fedus, Zoph et al. · Google Research · JMLR 2022

Jon Bell

Research writer

Mixture-of-Experts (MoE) replaces a single dense feed-forward block with many expert sub-networks and activates only a few of them per token. The result is a model whose total parameter count can reach into the billions while its per-token compute stays close to that of a much smaller dense model.
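To make the gap between total and active parameters concrete, here is a rough back-of-the-envelope sketch. The function name `moe_ffn_params` and the layer sizes are illustrative assumptions, not figures from the paper, and biases are ignored.

```python
def moe_ffn_params(d_model: int, d_ff: int, num_experts: int, top_k: int) -> dict:
    """Count feed-forward parameters in one MoE layer (biases ignored)."""
    per_expert = 2 * d_model * d_ff        # up-projection + down-projection
    router = d_model * num_experts         # linear gate: one score per expert
    total = num_experts * per_expert + router
    active = top_k * per_expert + router   # parameters actually used per token
    return {"total": total, "active": active, "ratio": total / active}


# Hypothetical layer sizes, chosen only for illustration.
stats = moe_ffn_params(d_model=1024, d_ff=4096, num_experts=64, top_k=2)
print(stats)
```

With these assumed sizes the layer stores roughly 32 times more feed-forward parameters than any single token touches, which is the sense in which capacity grows without a proportional compute cost.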

How routing works

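A lightweight gating network scores each token against every expert and sends it to the k highest-scoring experts (top-k). Only those experts compute, and their outputs are combined, typically weighted by the gate scores, so per-token FLOPs stay low while total capacity grows.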

token -> gate -> [expert 2, expert 7] -> combine -> output
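The flow above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's implementation: `route_token`, the toy experts, and the gate weights are all hypothetical, and renormalizing the kept gate probabilities so they sum to 1 is an assumption (implementations differ on this point).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, gate_weights, experts, k=2):
    """Route one token vector to its top-k experts and combine their outputs."""
    # Gate: one logit per expert (dot product of the token with that expert's gate row).
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    # Keep the k experts with the highest gate probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalize the kept probabilities so combine weights sum to 1.
    norm = sum(probs[i] for i in top)
    output = [0.0] * len(token)
    for i in top:
        expert_out = experts[i](token)  # only the selected experts run
        weight = probs[i] / norm
        output = [o + weight * y for o, y in zip(output, expert_out)]
    return top, output


# Toy setup: four "experts" that just scale the token by a fixed factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
token = [2.0, 1.0]

chosen, output = route_token(token, gate_weights, experts, k=2)
print(chosen, output)
```

In this toy run only experts 0 and 1 are evaluated; the other two are never called, which is where the compute saving comes from.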

The trade-offs

MoE introduces load-balancing and memory challenges: the router can collapse onto a few popular experts, leaving the rest undertrained, and every expert's parameters must still fit in memory even though only a few are active per token. The paper proposes an auxiliary balancing loss that keeps expert utilization even.
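As a rough sketch of how such a balancing term can be computed, the snippet below uses the form usually associated with Switch Transformers: alpha times the number of experts times the sum over experts of f_i * P_i, where f_i is the fraction of tokens whose top-1 choice is expert i and P_i is the mean router probability for expert i. The function name, the `alpha` value, and the toy inputs are assumptions for illustration.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """Auxiliary balancing loss over a batch of per-token router probabilities.

    router_probs: list of lists, one probability distribution over experts per token.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # f[i]: fraction of tokens whose highest-probability (top-1) expert is i.
    counts = [0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=lambda i: probs[i])] += 1
    f = [c / num_tokens for c in counts]
    # p[i]: mean router probability assigned to expert i across the batch.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Two tokens, two experts: one batch splits evenly, the other collapses onto expert 0.
balanced_loss = load_balancing_loss([[0.9, 0.1], [0.1, 0.9]])
collapsed_loss = load_balancing_loss([[0.9, 0.1], [0.9, 0.1]])
print(balanced_loss, collapsed_loss)
```

The evenly split batch yields the smaller value and the collapsed batch is penalized more, so adding this term to the training loss nudges the router toward even utilization.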

Citation

Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Journal of Machine Learning Research, 23. arXiv:2101.03961.

