Paper Title
Unified Multimodal Model with Unlikelihood Training for Visual Dialog
Paper Authors
Paper Abstract
The task of visual dialog requires a multimodal chatbot to answer sequential questions from humans about image content. Prior work performs standard likelihood training for answer generation on positive instances (involving correct answers). However, the likelihood objective often leads to frequent and dull outputs and fails to exploit useful knowledge from negative instances (involving incorrect answers). In this paper, we propose a Unified Multimodal Model with UnLikelihood Training, named UniMM-UL, to tackle this problem. First, to improve visual dialog understanding and generation through multi-task learning, our model extends ViLBERT from supporting only answer discrimination to seamlessly supporting both answer discrimination and answer generation via different attention masks. Specifically, to make the original discriminative model compatible with answer generation, we design novel generative attention masks to implement the autoregressive Masked Language Modeling (autoregressive MLM) task. To attenuate the adverse effects of the likelihood objective, we exploit unlikelihood training on negative instances to make the model less likely to generate incorrect answers. Then, to utilize dense annotations, we adopt different fine-tuning methods for both generating and discriminating answers, rather than only for discriminating answers as in prior work. Finally, on the VisDial dataset, our model achieves the best generative results (69.23 NDCG score), and it also yields discriminative results comparable to the state of the art in both single-model and ensemble settings (75.92 and 76.17 NDCG scores).
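To make the two techniques named in the abstract concrete, below is a minimal PyTorch sketch of (a) a seq2seq-style generative attention mask, where context tokens attend bidirectionally within the context while answer tokens attend to the context plus earlier answer tokens, and (b) the combined likelihood/unlikelihood objective over positive and negative answers. All function names, tensor shapes, and the mask layout here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def seq2seq_attention_mask(ctx_len: int, ans_len: int) -> torch.Tensor:
    """Boolean attention mask for autoregressive MLM over the answer.

    Rows are queries, columns are keys. True means "may attend".
    Context queries see only the context; answer queries see the
    full context plus earlier (and current) answer positions.
    """
    total = ctx_len + ans_len
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:, :ctx_len] = True  # every token attends to the full context
    causal = torch.tril(torch.ones(ans_len, ans_len, dtype=torch.bool))
    mask[ctx_len:, ctx_len:] = causal  # answer tokens: causal self-attention
    return mask

def likelihood_loss(logits: torch.Tensor, pos_targets: torch.Tensor) -> torch.Tensor:
    """Standard MLE on a positive instance: maximize log p(correct answer).

    logits: (seq_len, vocab_size); pos_targets: (seq_len,) token ids.
    """
    return F.cross_entropy(logits, pos_targets)

def unlikelihood_loss(logits: torch.Tensor, neg_targets: torch.Tensor) -> torch.Tensor:
    """Unlikelihood on a negative instance: minimize -log(1 - p(wrong token)),
    pushing probability mass away from the incorrect answer's tokens."""
    probs = F.softmax(logits, dim=-1)
    p_neg = probs.gather(-1, neg_targets.unsqueeze(-1)).squeeze(-1)
    return -torch.log(torch.clamp(1.0 - p_neg, min=1e-8)).mean()

# Toy usage: a 6-token context, a 4-token answer, vocabulary of 10.
logits = torch.randn(4, 10)               # per-position answer logits
pos = torch.randint(0, 10, (4,))          # tokens of a correct answer
neg = torch.randint(0, 10, (4,))          # tokens of an incorrect answer
loss = likelihood_loss(logits, pos) + unlikelihood_loss(logits, neg)
mask = seq2seq_attention_mask(ctx_len=6, ans_len=4)  # (10, 10) boolean mask
```

The unlikelihood term is the standard formulation from Welleck et al.'s unlikelihood-training work, applied here to whole negative answers; how the two losses are weighted against each other is a training detail not specified in the abstract.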