Paper Title

Exploring Vision Transformers as Diffusion Learners

Paper Authors

He Cao, Jianan Wang, Tianhe Ren, Xianbiao Qi, Yihao Chen, Yuan Yao, Lei Zhang

Paper Abstract

Score-based diffusion models have captured widespread attention and fueled the fast progress of recent vision generative tasks. In this paper, we focus on the diffusion model backbone, which has been largely neglected before. We systematically explore vision Transformers as diffusion learners for various generative tasks. With our improvements, the performance of a vanilla ViT-based backbone (IU-ViT) is boosted to be on par with that of traditional U-Net-based methods. We further provide a hypothesis on the implication of disentangling the generative backbone into an encoder-decoder structure, and show proof-of-concept experiments verifying the effectiveness of a stronger encoder for generative tasks with an ASymmetriC ENcoder-Decoder (ASCEND). Our improvements achieve competitive results on CIFAR-10, CelebA, LSUN, CUB Bird, and large-resolution text-to-image tasks. To the best of our knowledge, we are the first to successfully train a single diffusion model on the text-to-image task beyond 64x64 resolution. We hope this will motivate people to rethink the modeling choices and training pipelines for diffusion-based generative models.
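
To make the abstract's core idea concrete, here is a minimal PyTorch sketch of a vanilla ViT used as a diffusion denoiser: the noisy image is patchified into tokens, a sinusoidal timestep embedding is prepended as an extra token, and the transformer output is un-patchified back to image space as the predicted noise. This is an illustrative assumption of the general ViT-as-diffusion-learner setup, not the paper's actual IU-ViT or ASCEND architecture; all module names and hyperparameters below are hypothetical.

```python
import math
import torch
import torch.nn as nn

class ViTDenoiser(nn.Module):
    """Hypothetical sketch of a vanilla-ViT diffusion backbone (not the paper's IU-ViT)."""

    def __init__(self, img_size=32, patch=4, dim=384, depth=8, heads=6):
        super().__init__()
        self.patch, self.img_size, self.dim = patch, img_size, dim
        num_patches = (img_size // patch) ** 2
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))            # learned positions
        self.time_mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.to_pixels = nn.Linear(dim, patch * patch * 3)                   # un-patchify head

    def timestep_embedding(self, t):
        # Standard DDPM-style sinusoidal embedding of the integer timestep.
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
        args = t.float()[:, None] * freqs[None]
        return torch.cat([args.sin(), args.cos()], dim=-1)

    def forward(self, x, t):
        # x: (B, 3, H, W) noisy images; t: (B,) integer diffusion timesteps.
        B = x.shape[0]
        tokens = self.to_tokens(x).flatten(2).transpose(1, 2) + self.pos     # (B, N, dim)
        t_tok = self.time_mlp(self.timestep_embedding(t))[:, None]           # (B, 1, dim)
        h = self.blocks(torch.cat([t_tok, tokens], dim=1))[:, 1:]            # drop time token
        out = self.to_pixels(h)                                              # (B, N, p*p*3)
        p, s = self.patch, self.img_size // self.patch
        out = out.view(B, s, s, p, p, 3).permute(0, 5, 1, 3, 2, 4)
        return out.reshape(B, 3, self.img_size, self.img_size)               # predicted noise

# Usage: predict noise for a batch of 32x32 images at random timesteps.
eps_hat = ViTDenoiser()(torch.randn(2, 3, 32, 32), torch.randint(0, 1000, (2,)))
```

Training this module with the usual denoising objective (MSE between `eps_hat` and the true noise) is what makes the plain ViT a "diffusion learner"; the paper's contribution is the set of improvements that close the gap between such a backbone and U-Net-based methods.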
