Paper Title

Scalable Video Object Segmentation with Identification Mechanism

Paper Authors

Zongxin Yang, Jiaxu Miao, Yunchao Wei, Wenguan Wang, Xiaohan Wang, Yi Yang

Abstract

This paper delves into the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS). Previous VOS methods decode features with a single positive object, limiting the learning of multi-object representation as they must match and segment each target separately under multi-object scenarios. Additionally, earlier techniques catered to specific application objectives and lacked the flexibility to fulfill different speed-accuracy requirements. To address these problems, we present two innovative approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST). In pursuing effective multi-object modeling, AOT introduces the IDentification (ID) mechanism to allocate each object a unique identity. This approach enables the network to model the associations among all objects simultaneously, thus facilitating the tracking and segmentation of objects in a single network pass. To address the challenge of inflexible deployment, AOST further integrates scalable long short-term transformers that incorporate scalable supervision and layer-wise ID-based attention. This enables online architecture scalability in VOS for the first time and overcomes ID embeddings' representation limitations. Given the absence of a benchmark for VOS involving densely multi-object annotations, we propose a challenging Video Object Segmentation in the Wild (VOSW) benchmark to validate our approaches. We evaluated various AOT and AOST variants using extensive experiments across VOSW and five commonly used VOS benchmarks, including YouTube-VOS 2018 & 2019 Val, DAVIS-2017 Val & Test, and DAVIS-2016. Our approaches surpass the state-of-the-art competitors and display exceptional efficiency and scalability consistently across all six benchmarks. Project page: https://github.com/yoxu515/aot-benchmark.
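The abstract's central idea is the IDentification (ID) mechanism: each object is assigned a unique identity embedding, so one feature map can encode all objects jointly and a single network pass can associate and segment every target. The following is a minimal illustrative sketch of that idea, not the paper's actual code; the function name `encode_identities`, the array shapes, and the identity bank are all assumptions made for the example.

```python
import numpy as np

def encode_identities(masks, id_bank):
    """Sketch of ID-embedding assignment (hypothetical helper, not AOT's API).

    masks:   (N, H, W) binary masks, one per object (background excluded).
    id_bank: (M, C) bank of identity vectors, M >= N, learnable in practice.
    Returns an (H, W, C) map in which every pixel carries the identity
    embedding of the object occupying it, encoding all objects at once.
    """
    n, h, w = masks.shape
    assert n <= id_bank.shape[0], "more objects than available identities"
    id_map = np.zeros((h, w, id_bank.shape[1]), dtype=id_bank.dtype)
    for i in range(n):
        # Scatter object i's unique identity vector over its pixels.
        id_map += masks[i][..., None] * id_bank[i]
    return id_map

# Toy usage: two objects in a 4x4 frame, 8 available identities of dim 3.
rng = np.random.default_rng(0)
id_bank = rng.standard_normal((8, 3))
masks = np.zeros((2, 4, 4))
masks[0, :2, :2] = 1   # object 1 occupies the top-left block
masks[1, 2:, 2:] = 1   # object 2 occupies the bottom-right block
id_map = encode_identities(masks, id_bank)
print(id_map.shape)  # (4, 4, 3)
```

Because all identities live in one map, the downstream transformer attends over every object simultaneously instead of matching each target in a separate pass, which is the efficiency argument the abstract makes.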
