Paper Title
Dual Semantic Fusion Network for Video Object Detection
Paper Authors
Paper Abstract
Video object detection is a tough task due to the deteriorated quality of video sequences captured under complex environments. Currently, this area is dominated by a series of feature-enhancement-based methods, which distill beneficial semantic information from multiple frames and generate enhanced features by fusing the distilled information. However, the distillation and fusion operations are usually performed at either the frame level or the instance level, with external guidance from additional information such as optical flow and feature memory. In this work, we propose a dual semantic fusion network (abbreviated as DSFNet) to fully exploit both frame-level and instance-level semantics in a unified fusion framework without external guidance. Moreover, we introduce a geometric similarity measure into the fusion process to alleviate the influence of information distortion caused by noise. As a result, the proposed DSFNet can generate more robust features through multi-granularity fusion and avoid being affected by the instability of external guidance. To evaluate the proposed DSFNet, we conduct extensive experiments on the ImageNet VID dataset. Notably, to the best of our knowledge, the proposed dual semantic fusion network achieves the best performance among current state-of-the-art video object detectors: 84.1\% mAP with ResNet-101 and 85.4\% mAP with ResNeXt-101, without using any post-processing steps.
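To make the fusion idea in the abstract concrete, below is a minimal PyTorch sketch of similarity-weighted multi-frame feature fusion. It is an illustration, not the authors' implementation: the function name `fuse_features` and the tensor shapes are assumptions, cosine similarity stands in for the paper's geometric similarity measure, and the actual DSFNet additionally performs this kind of fusion at both frame and instance granularity.

```python
# Minimal sketch (assumed, not the authors' code): fuse a target frame's
# features with support-frame features using similarity-based weights, so
# that dissimilar (possibly noise-distorted) support features contribute less.
import torch
import torch.nn.functional as F

def fuse_features(target: torch.Tensor, supports: torch.Tensor) -> torch.Tensor:
    """Similarity-weighted fusion of per-frame features.

    target:   (N, C)    -- N feature vectors (e.g., locations or proposals)
                           from the target frame
    supports: (T, N, C) -- the corresponding features from T support frames
    Returns an enhanced (N, C) feature for the target frame.
    """
    # Cosine similarity between each target feature and each support feature.
    t = F.normalize(target, dim=-1)                # (N, C)
    s = F.normalize(supports, dim=-1)              # (T, N, C)
    sim = (s * t.unsqueeze(0)).sum(dim=-1)         # (T, N)

    # Softmax over the temporal axis turns similarities into fusion weights;
    # support features that disagree with the target get small weights.
    w = torch.softmax(sim, dim=0).unsqueeze(-1)    # (T, N, 1)
    return (w * supports).sum(dim=0)               # (N, C)

# Usage: enhance one frame with 4 support frames, 100 locations, 256 channels.
target = torch.randn(100, 256)
supports = torch.randn(4, 100, 256)
enhanced = fuse_features(target, supports)
print(enhanced.shape)  # torch.Size([100, 256])
```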