Paper Title
BEVBert: Multimodal Map Pre-training for Language-guided Navigation
Paper Authors
Abstract
Large-scale pre-training has shown promising results on the vision-and-language navigation (VLN) task. However, most existing pre-training methods employ discrete panoramas to learn visual-textual associations, requiring the model to implicitly correlate incomplete, duplicate observations within the panoramas, which may impair an agent's spatial understanding. We therefore propose a new spatial-aware, map-based pre-training paradigm for VLN. Concretely, we build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependencies in a global topological map. This hybrid design balances VLN's demands for both short-term reasoning and long-term planning. Based on the hybrid map, we then devise a pre-training framework to learn multimodal map representations, which enhances spatial-aware cross-modal reasoning and thereby facilitates the language-guided navigation goal. Extensive experiments demonstrate the effectiveness of the map-based pre-training route for VLN, and the proposed method achieves state-of-the-art results on four VLN benchmarks.
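The hybrid map the abstract describes can be pictured as two coupled structures: a global topological graph of visited viewpoints for long-term planning, and a local metric (bird's-eye-view) grid that aggregates overlapping view features and removes duplicates. The sketch below illustrates this idea only; all class and method names are hypothetical and this is not the paper's implementation — duplicate observations of a cell are de-duplicated here by simple averaging, which is one plausible aggregation choice.

```python
import numpy as np

class HybridMap:
    """Illustrative hybrid topo-metric map (hypothetical, not BEVBert's code).

    - Topological part: a dict of viewpoint features plus an undirected
      edge set, supporting long-term planning over visited nodes.
    - Metric part: a small BEV grid around the current node; repeated
      observations of the same cell are averaged (duplicate removal).
    """

    def __init__(self, grid_size=5, feat_dim=4):
        self.nodes = {}                      # viewpoint id -> pooled feature
        self.edges = set()                   # undirected connectivity
        self.grid_sum = np.zeros((grid_size, grid_size, feat_dim))
        self.grid_cnt = np.zeros((grid_size, grid_size, 1))

    def add_node(self, vid, feature, neighbors=()):
        """Register a viewpoint in the global topological graph."""
        self.nodes[vid] = feature
        for n in neighbors:
            self.edges.add(frozenset((vid, n)))

    def aggregate_view(self, cell, feature):
        """Project one view feature onto a grid cell, accumulating sums
        and counts so overlapping observations can be averaged later."""
        r, c = cell
        self.grid_sum[r, c] += feature
        self.grid_cnt[r, c] += 1

    def local_metric_map(self):
        """Return the de-duplicated BEV grid (mean over observations)."""
        return self.grid_sum / np.maximum(self.grid_cnt, 1)


m = HybridMap()
m.add_node("v0", np.ones(4))
m.add_node("v1", np.zeros(4), neighbors=["v0"])
# Two overlapping observations of the same cell collapse to their mean.
m.aggregate_view((2, 2), np.array([2.0, 0.0, 0.0, 0.0]))
m.aggregate_view((2, 2), np.array([0.0, 2.0, 0.0, 0.0]))
bev = m.local_metric_map()
```

In this toy setting, short-term reasoning would read from `bev` while long-term planning would search over `m.nodes` and `m.edges`; the real method instead learns these representations jointly through the pre-training objectives.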