
Efficient Diffusion Models without Attention

Oct 14, 2024

Authors: Jing Nathan Yan, Jiatao Gu, Alexander M. Rush

Transformers have demonstrated impressive performance on class-conditional ImageNet benchmarks, achieving state-of-the-art FID scores. However, their computational complexity grows with transformer depth/width and the number of input tokens, requiring patch-based approximations to operate even on latent input sequences. In this paper, we address these issues by presenting a novel approach to enhance the efficiency and scalability of image generation models, incorporating state space models (SSMs) as the core component and deviating from the widely adopted transformer-based and U-Net architectures. We introduce a class of SSM-based models that significantly reduces forward-pass complexity while maintaining comparable performance and operating on exact input sequences without patch-based approximations. Through extensive experiments and rigorous evaluation, we demonstrate that our proposed approach reduces the GFLOPs used by the model without sacrificing the quality of generated images. Our findings suggest that state space models can be an effective alternative to attention mechanisms in transformer-based architectures, offering a more efficient solution for large-scale image generation tasks.
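The key property the abstract relies on can be seen in a minimal sketch of a generic state space layer. The code below is an illustrative assumption, not the paper's exact architecture: the layer name, shapes, and initialization are invented for the example. It shows how a diagonal linear recurrence mixes a sequence in O(L) time, consuming the full latent sequence token by token with no patch-based approximation, in contrast to the O(L^2) pairwise interactions of self-attention.

```python
# Minimal sketch of a state space model (SSM) layer in PyTorch, assuming a
# diagonal recurrence h_t = A*h_{t-1} + B*x_t with readout y_t = C.h_t.
# Illustrative stand-in only, not the paper's architecture.
import torch
import torch.nn as nn

class SimpleSSMLayer(nn.Module):
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        # Parameterize A through a negative log so the decay stays in (0, 1).
        self.log_A = nn.Parameter(-torch.rand(d_model, d_state) - 0.5)
        self.B = nn.Parameter(torch.randn(d_model, d_state) * 0.1)
        self.C = nn.Parameter(torch.randn(d_model, d_state) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model); one d_state-sized state per channel.
        batch, length, d_model = x.shape
        A = torch.exp(self.log_A)                  # per-channel decay in (0, 1)
        h = x.new_zeros(batch, d_model, self.B.shape[-1])
        ys = []
        for t in range(length):                    # linear-time scan over tokens
            h = A * h + self.B * x[:, t, :, None]  # update hidden state
            ys.append((h * self.C).sum(-1))        # read out y_t
        return torch.stack(ys, dim=1)              # (batch, length, d_model)

# The layer ingests an exact latent sequence of 1024 tokens without patchifying.
x = torch.randn(2, 1024, 64)
print(SimpleSSMLayer(64)(x).shape)                 # torch.Size([2, 1024, 64])
```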

Motivated by the success of transformer architectures in natural language processing, machine learning researchers introduced the concept of the vision transformer (ViT) in 2021. This approach serves as an alternative to convolutional neural networks (CNNs) for computer vision applications, as detailed in the paper An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.
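For context on the patchifying step that the SSM approach avoids, here is a hedged sketch of the patch embedding described in that ViT paper: the image is split into non-overlapping 16x16 patches, each linearly projected into a token. The variable names and dimensions below are illustrative assumptions; implementing the split-and-project step as one strided convolution is a common idiom, not necessarily the paper's code.

```python
# Sketch of ViT-style patch embedding: a 224x224 image split into 16x16
# patches yields (224/16)^2 = 196 tokens, each a d_model-dimensional vector.
import torch
import torch.nn as nn

patch, d_model = 16, 768
# A strided convolution extracts and projects non-overlapping patches in one step.
to_tokens = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)

img = torch.randn(1, 3, 224, 224)                  # (batch, channels, H, W)
tokens = to_tokens(img).flatten(2).transpose(1, 2) # (batch, num_patches, d_model)
print(tokens.shape)                                # torch.Size([1, 196, 768])
```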
