Paper Title
Visually Grounded Continual Learning of Compositional Phrases
Paper Authors
Paper Abstract
Humans acquire language continually, with much more limited access to data samples at a time than contemporary NLP systems. To study this human-like language acquisition ability, we present VisCOLL, a visually grounded language learning task that simulates the continual acquisition of compositional phrases from streaming visual scenes. In this task, models are trained on a paired image-caption stream with a shifting object distribution, while being continually evaluated on a visually grounded masked language prediction task over held-out test sets. VisCOLL compounds the challenges of continual learning (i.e., learning from a continuously shifting data distribution) and compositional generalization (i.e., generalizing to novel compositions). To facilitate research on VisCOLL, we construct two datasets, COCO-shift and Flickr-shift, and benchmark them using different continual learning methods. Results reveal that SoTA continual learning approaches provide little to no improvement on VisCOLL, since storing examples of all possible compositions is infeasible. We conduct further ablations and analysis to guide future work.
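As an illustration only (not the authors' released code or datasets), the following minimal Python sketch shows one way the setup described in the abstract can be organized: a paired image-caption stream ordered so that object categories shift over time, with evaluation posed as predicting a masked compositional phrase from the remaining caption (image features omitted here). All names (`StreamExample`, `make_shifting_stream`, `mask_phrase`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical stand-in records; the real COCO-shift / Flickr-shift datasets
# pair actual images with captions, which this sketch only simulates.

@dataclass
class StreamExample:
    image_id: str            # reference to a visual scene
    caption: str             # caption describing the scene
    phrase: Tuple[int, int]  # token span of the compositional phrase, e.g. "red car"
    category: str            # dominant object category, used to order the stream


def make_shifting_stream(examples: List[StreamExample]) -> List[StreamExample]:
    """Order examples by object category so the training distribution
    shifts over time instead of being i.i.d. (continual-learning setting)."""
    return sorted(examples, key=lambda ex: ex.category)


def mask_phrase(example: StreamExample, mask_token: str = "[MASK]") -> Tuple[str, str]:
    """Build one masked-language-prediction instance: the target phrase is
    masked out and must be reconstructed from the rest of the caption
    (plus, in the actual task, the paired image)."""
    tokens = example.caption.split()
    start, end = example.phrase
    target = " ".join(tokens[start:end])
    masked = tokens[:start] + [mask_token] * (end - start) + tokens[end:]
    return " ".join(masked), target


if __name__ == "__main__":
    data = [
        StreamExample("img1", "a red car parked on the street", (1, 3), "car"),
        StreamExample("img2", "a small dog chasing a ball", (1, 3), "dog"),
    ]
    for ex in make_shifting_stream(data):
        masked_caption, target = mask_phrase(ex)
        print(masked_caption, "->", target)
```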