Diffusion models historically relied on convolutional U-Nets. This work shows that a pure transformer operating on latent patches scales more predictably and produces sharper samples as compute grows.
Patches, not pixels
Each image is encoded into a latent grid, the grid is split into patches, and a transformer conditioned on the timestep and class label denoises the resulting patch sequence. Larger transformers consistently reach a lower loss.
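To make the patch step concrete, here is a minimal sketch of turning a latent grid into a token sequence. The function name `patchify`, the channel-first nested-list layout, and the toy sizes (4 channels, 8x8 grid, patch size 2) are assumptions chosen for illustration, not details taken from the work itself. The encoder, the timestep and class conditioning, and the transformer are left out.

```python
def patchify(latent, patch_size):
    """Split a latent of shape [C][H][W] (nested lists) into a token sequence.

    Returns (H // p) * (W // p) tokens in row-major patch order; each token
    is a flat list of p * p * C values. Illustrative helper, not a library API.
    """
    channels = len(latent)
    height = len(latent[0])
    width = len(latent[0][0])
    if height % patch_size or width % patch_size:
        raise ValueError("latent height and width must be divisible by patch_size")

    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one p x p patch across all channels into a single token.
            token = [
                latent[c][top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
                for c in range(channels)
            ]
            tokens.append(token)
    return tokens


# Toy latent: 4 channels on an 8x8 grid (sizes are assumptions for the demo).
latent = [[[c * 100 + y * 8 + x for x in range(8)] for y in range(8)] for c in range(4)]
tokens = patchify(latent, patch_size=2)
print(len(tokens), len(tokens[0]))  # prints "16 16"
```

With an H x W latent and patch size p, the sequence length is (H / p) * (W / p), so halving the patch size quadruples the number of tokens the transformer must process. In this sketch, that is how patch size trades compute for spatial granularity.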
Benchmark results
On standard image-generation benchmarks, the largest diffusion transformer sets new best fidelity scores while following clean scaling curves, mirroring the trends seen in language models.