Paper Title
Zero-Shot Style Transfer for Gesture Animation driven by Text and Speech using Adversarial Disentanglement of Multimodal Style Encoding
Paper Authors
Paper Abstract
Modeling virtual agents with behavior style is one factor in personalizing human-agent interaction. We propose an efficient yet effective machine learning approach to synthesize gestures driven by prosodic features and text, in the style of different speakers, including speakers unseen during training. Our model performs zero-shot multimodal style transfer driven by multimodal data from the PATS database, which contains videos of various speakers. We view style as pervasive while speaking: it colors the expressivity of communicative behaviors, while speech content is carried by multimodal signals and text. This disentanglement of content and style allows us to directly infer the style embedding even for speakers whose data were not part of the training phase, without any further training or fine-tuning. The first goal of our model is to generate the gestures of a source speaker based on the content of two modalities, audio and text. The second goal is to condition the source speaker's predicted gestures on the multimodal behavior style embedding of a target speaker. The third goal is to allow zero-shot style transfer for speakers unseen during training, without retraining the model. Our system consists of (1) a speaker style encoder network that learns to generate a fixed-dimensional speaker style embedding from a target speaker's multimodal data, and (2) a sequence-to-sequence synthesis network that synthesizes gestures based on the content of a source speaker's input modalities, conditioned on the speaker style embedding. We evaluate whether our model can synthesize the gestures of a source speaker and transfer knowledge of the target speaker's style variability to the gesture generation task in a zero-shot setup. We convert the 2D gestures to 3D poses and produce 3D animations. We conduct objective and subjective evaluations to validate our approach and compare it with a baseline.
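The abstract describes a two-network setup: a style encoder that maps a target speaker's multimodal data to a fixed-dimensional style embedding, and a sequence-to-sequence generator that synthesizes gestures from a source speaker's audio/text content conditioned on that embedding. The following is a minimal sketch of that structure, assuming PyTorch; all module names, layer choices, and dimensions are illustrative assumptions, not the authors' architecture.

```python
# Sketch (assumed PyTorch) of the two-network zero-shot style transfer setup:
# a style encoder producing one fixed-size embedding per target speaker, and a
# gesture generator conditioned on source-speaker content plus that embedding.
import torch
import torch.nn as nn


class StyleEncoder(nn.Module):
    """Maps a target speaker's multimodal sequence to a fixed-size style vector."""

    def __init__(self, input_dim: int = 128, style_dim: int = 64):
        super().__init__()
        self.rnn = nn.GRU(input_dim, style_dim, batch_first=True)

    def forward(self, multimodal_seq: torch.Tensor) -> torch.Tensor:
        # multimodal_seq: (batch, time, input_dim)
        _, h_n = self.rnn(multimodal_seq)
        return h_n[-1]  # (batch, style_dim), independent of sequence length


class GestureGenerator(nn.Module):
    """Synthesizes a pose sequence from content features and a style embedding."""

    def __init__(self, content_dim: int = 160, style_dim: int = 64,
                 hidden_dim: int = 256, pose_dim: int = 2 * 49):
        super().__init__()
        self.rnn = nn.GRU(content_dim + style_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, pose_dim)

    def forward(self, content_seq: torch.Tensor,
                style: torch.Tensor) -> torch.Tensor:
        # Broadcast the fixed-size style vector over every time step,
        # then decode gestures conditioned on content + style.
        style_seq = style.unsqueeze(1).expand(-1, content_seq.size(1), -1)
        h, _ = self.rnn(torch.cat([content_seq, style_seq], dim=-1))
        return self.out(h)  # (batch, time, pose_dim) 2D keypoints


# Zero-shot usage: the style embedding of an unseen speaker is obtained with a
# single forward pass through the trained encoder, with no retraining.
encoder, generator = StyleEncoder(), GestureGenerator()
target_clip = torch.randn(1, 200, 128)     # unseen target speaker's multimodal data
source_content = torch.randn(1, 200, 160)  # source speaker's audio + text features
with torch.no_grad():
    style = encoder(target_clip)
    gestures = generator(source_content, style)
print(gestures.shape)  # torch.Size([1, 200, 98])
```

The design point being illustrated is that the style embedding is a single fixed-size vector inferred at test time, which is what makes conditioning on an unseen speaker possible without any fine-tuning.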