Computer Vision · Paper Summary · 01/08/2026 · 7 min read

Diffusion Transformers for High-Fidelity Vision

Replacing the U-Net backbone in diffusion models with a transformer improves scaling and sample quality across image benchmarks.

Peebles, Xie · UC Berkeley · ICCV 2023

Priya Shah

Vision research writer

Diffusion models have historically used a convolutional U-Net as the denoising backbone. This work shows that a pure transformer operating on latent patches scales more predictably and produces sharper samples as compute grows.

Patches, not pixels

Images are first encoded into a latent grid, the grid is split into patches that form a token sequence, and a transformer conditioned on the diffusion timestep and class label denoises those tokens. Larger transformers consistently reach a lower training loss.
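To make the patching step concrete, here is a minimal sketch in plain Python. It is an illustration, not the authors' implementation: the function name `patchify`, the nested-list layout, and the 4-channel 32×32 latent with patch sizes 2, 4, and 8 are all assumptions chosen for the example. It shows how the patch size sets the length of the token sequence the transformer sees.

```python
def patchify(latent, patch_size):
    """Split a latent of shape (C, H, W), given as nested lists, into flat tokens.

    Returns (H // p) * (W // p) tokens, each a flat list of length C * p * p,
    ordered row-major over the patch grid. Hypothetical helper for illustration.
    """
    channels = len(latent)
    height = len(latent[0])
    width = len(latent[0][0])
    if height % patch_size or width % patch_size:
        raise ValueError("latent height and width must be divisible by patch_size")

    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one p x p window across all channels into a single token.
            token = [
                latent[c][top + dy][left + dx]
                for c in range(channels)
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            tokens.append(token)
    return tokens


# Assumed example latent: 4 channels, 32 x 32, filled with a running index.
C, H, W = 4, 32, 32
latent = [[[c * H * W + y * W + x for x in range(W)] for y in range(H)]
          for c in range(C)]

for p in (2, 4, 8):
    tokens = patchify(latent, p)
    # Prints: patch size, number of tokens, values per token.
    print(p, len(tokens), len(tokens[0]))
# 2 256 16
# 4 64 64
# 8 16 256
```

Halving the patch size quadruples the number of tokens, so smaller patches spend more compute on the same latent. In a real model each flat token would then pass through a learned linear projection and receive a positional embedding, both omitted here.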

Why it matters: a single architecture family (the transformer) now spans language, vision, and generation, simplifying tooling and infrastructure.

Benchmark results

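On class-conditional ImageNet generation, the largest diffusion transformer (DiT-XL/2) reported a state-of-the-art FID of 2.27 at 256×256 resolution at the time of publication. Sample quality improves steadily as model compute increases, mirroring the scaling trends seen in language models.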

Citation

Peebles, W., Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV 2023. arXiv:2212.09748.
