Paper Title
BEVBert: Multimodal Map Pre-training for Language-guided Navigation
Paper Authors
Abstract
Large-scale pre-training has shown promising results on the vision-and-language navigation (VLN) task. However, most existing pre-training methods employ discrete panoramas to learn visual-textual associations, requiring the model to implicitly correlate incomplete, duplicate observations within the panoramas, which may impair an agent's spatial understanding. We therefore propose a new spatial-aware, map-based pre-training paradigm for VLN. Concretely, we build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependencies in a global topological map. This hybrid design balances VLN's demands for both short-term reasoning and long-term planning. Based on the hybrid map, we then devise a pre-training framework to learn multimodal map representations, which enhances spatial-aware cross-modal reasoning and thereby facilitates the language-guided navigation goal. Extensive experiments demonstrate the effectiveness of the map-based pre-training route for VLN, and the proposed method achieves state-of-the-art results on four VLN benchmarks.
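The hybrid map the abstract describes can be pictured as two coupled structures: a global topological graph of visited viewpoints for long-term planning, and a local metric (bird's-eye-view) grid that aggregates overlapping view features and removes duplicates. The sketch below illustrates this idea only; all class and method names are hypothetical and this is not the paper's implementation — duplicate observations of a cell are de-duplicated here by simple averaging, which is one plausible aggregation choice.

```python
import numpy as np

class HybridMap:
    """Illustrative hybrid topo-metric map (hypothetical, not BEVBert's code).

    - Topological part: a dict of viewpoint features plus an undirected
      edge set, supporting long-term planning over visited nodes.
    - Metric part: a small BEV grid around the current node; repeated
      observations of the same cell are averaged (duplicate removal).
    """

    def __init__(self, grid_size=5, feat_dim=4):
        self.nodes = {}                      # viewpoint id -> pooled feature
        self.edges = set()                   # undirected connectivity
        self.grid_sum = np.zeros((grid_size, grid_size, feat_dim))
        self.grid_cnt = np.zeros((grid_size, grid_size, 1))

    def add_node(self, vid, feature, neighbors=()):
        """Register a viewpoint in the global topological graph."""
        self.nodes[vid] = feature
        for n in neighbors:
            self.edges.add(frozenset((vid, n)))

    def aggregate_view(self, cell, feature):
        """Project one view feature onto a grid cell, accumulating sums
        and counts so overlapping observations can be averaged later."""
        r, c = cell
        self.grid_sum[r, c] += feature
        self.grid_cnt[r, c] += 1

    def local_metric_map(self):
        """Return the de-duplicated BEV grid (mean over observations)."""
        return self.grid_sum / np.maximum(self.grid_cnt, 1)


m = HybridMap()
m.add_node("v0", np.ones(4))
m.add_node("v1", np.zeros(4), neighbors=["v0"])
# Two overlapping observations of the same cell collapse to their mean.
m.aggregate_view((2, 2), np.array([2.0, 0.0, 0.0, 0.0]))
m.aggregate_view((2, 2), np.array([0.0, 2.0, 0.0, 0.0]))
bev = m.local_metric_map()
```

In this toy setting, short-term reasoning would read from `bev` while long-term planning would search over `m.nodes` and `m.edges`; the real method instead learns these representations jointly through the pre-training objectives.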