Paper Title


Taming Transformers for High-Resolution Image Synthesis

Authors

Esser, Patrick, Rombach, Robin, Ommer, Björn

Abstract


Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers and obtain the state of the art among autoregressive models on class-conditional ImageNet. Code and pretrained models can be found at https://github.com/CompVis/taming-transformers .
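The two-stage idea in the abstract — first map image patches to entries of a learned discrete vocabulary, then let a transformer model the resulting short token sequence — can be illustrated with a toy sketch of the quantization step. This is an illustrative assumption, not the paper's actual VQGAN code: the codebook and feature values below are made up, and in the real method both are learned by a CNN encoder.

```python
# Stage 1 of the approach (toy version): a codebook turns each spatial
# feature vector into the index of its nearest entry (vector quantization),
# so a grid of continuous features becomes a short sequence of discrete
# tokens that a transformer (stage 2) models autoregressively.

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
            for f in features]

# Hypothetical 3-entry codebook and a flattened 2x2 grid of 2-D features.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
grid = [(0.1, 0.1), (0.9, -0.1), (0.2, 1.1), (1.2, 0.1)]

tokens = quantize(grid, codebook)
print(tokens)  # [0, 1, 2, 1] -- the discrete sequence the transformer models
```

Because each token summarizes a whole patch of pixels, the transformer only has to attend over a sequence that is orders of magnitude shorter than the raw pixel grid, which is what makes high-resolution synthesis computationally feasible.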
