Diffusion models historically relied on convolutional U-Nets. This work shows that a pure transformer operating on latent patches scales more predictably and produces sharper samples as compute grows.

Patches, not pixels

Images are first encoded into a latent grid. The grid is split into non-overlapping patches, each patch becomes one token, and a transformer denoises the resulting token sequence, conditioned on the timestep and class. Scaling up the transformer consistently lowers the loss it reaches.
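
The patch step above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names patchify and unpatchify, the nested-list layout [channel][row][column], and the ordering of values inside each token are all assumptions chosen for readability. It only shows how a latent grid might turn into the token sequence a transformer consumes, and how tokens map back to the grid.

```python
def patchify(latent, patch):
    """Split a latent grid into a sequence of flattened patch tokens.

    latent: nested list indexed as [channel][row][column] (assumed layout).
    patch:  side length of each square, non-overlapping patch.
    Returns a list of tokens in row-major patch order; each token holds
    patch * patch * channels values.
    """
    channels = len(latent)
    height, width = len(latent[0]), len(latent[0][0])
    if height % patch or width % patch:
        raise ValueError("latent height and width must be divisible by patch")

    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch: rows, then columns, then channels.
            token = [
                latent[c][top + dy][left + dx]
                for dy in range(patch)
                for dx in range(patch)
                for c in range(channels)
            ]
            tokens.append(token)
    return tokens


def unpatchify(tokens, patch, channels, height, width):
    """Inverse of patchify: rebuild the [channel][row][column] grid."""
    latent = [[[None] * width for _ in range(height)] for _ in range(channels)]
    patches_per_row = width // patch
    for index, token in enumerate(tokens):
        top = (index // patches_per_row) * patch
        left = (index % patches_per_row) * patch
        values = iter(token)
        # Read values back in the same order patchify wrote them.
        for dy in range(patch):
            for dx in range(patch):
                for c in range(channels):
                    latent[c][top + dy][left + dx] = next(values)
    return latent


if __name__ == "__main__":
    # Toy 2-channel, 4x4 latent; real latents are much larger.
    toy = [[[c * 100 + y * 4 + x for x in range(4)] for y in range(4)]
           for c in range(2)]
    tokens = patchify(toy, patch=2)
    print(len(tokens), len(tokens[0]))  # 4 tokens, each of length 8
```

In this sketch, halving the patch size quadruples the number of tokens, so the patch size trades sequence length (and therefore compute) against the amount of spatial detail each token carries.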

Why it matters: a single architecture family (the transformer) now spans language, vision, and generation, simplifying tooling and infrastructure.

Benchmark results

On standard image-generation benchmarks, the largest diffusion transformer sets new best fidelity scores while following clean scaling curves, mirroring trends seen in language models.