HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for Controllable Text-Driven Person Image Generation

Paper Title

HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for Controllable Text-Driven Person Image Generation

Authors

Kaiduo Zhang, Muyi Sun, Jianxin Sun, Binghao Zhao, Kunbo Zhang, Zhenan Sun, Tieniu Tan

Abstract

Text-driven person image generation is an emerging and challenging task in cross-modality image generation. Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on. However, previous methods mostly employ single-modality information as the prior condition (e.g. pose-guided person image generation), or utilize the preset words for text-driven human synthesis. Introducing a sentence composed of free words with an editable semantic pose map to describe person appearance is a more user-friendly way. In this paper, we propose HumanDiffusion, a coarse-to-fine alignment diffusion framework, for text-driven person image generation. Specifically, two collaborative modules are proposed, the Stylized Memory Retrieval (SMR) module for fine-grained feature distillation in data processing and the Multi-scale Cross-modality Alignment (MCA) module for coarse-to-fine feature alignment in diffusion. These two modules guarantee the alignment quality of the text and image, from image-level to feature-level, from low-resolution to high-resolution. As a result, HumanDiffusion realizes open-vocabulary person image generation with desired semantic poses. Extensive experiments conducted on DeepFashion demonstrate the superiority of our method compared with previous approaches. Moreover, better results could be obtained for complicated person images with various details and uncommon poses.
