Paper Title
Robust Audio-Visual Instance Discrimination via Active Contrastive Set Mining
Paper Authors
Paper Abstract
The recent success of audio-visual representation learning can be largely attributed to the pervasive property of audio-visual synchronization, which can be used as self-annotated supervision. As a state-of-the-art solution, Audio-Visual Instance Discrimination (AVID) extends instance discrimination to the audio-visual realm. Existing AVID methods construct the contrastive set by random sampling, based on the assumption that the audio and visual clips from all other videos are not semantically related. We argue that this assumption is rough, since the resulting contrastive sets contain a large number of faulty negatives. In this paper, we overcome this limitation by proposing a novel Active Contrastive Set Mining (ACSM) method that mines contrastive sets with informative and diverse negatives for robust AVID. Moreover, we integrate a semantically-aware hard-sample mining strategy into our ACSM. The proposed ACSM is implemented in two recent state-of-the-art AVID methods and significantly improves their performance. Extensive experiments on both action and sound recognition across multiple datasets demonstrate the remarkably improved performance of our method.
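To make the core idea concrete, the sketch below illustrates one plausible form of a cross-modal contrastive loss with negative-set mining: candidate negatives whose similarity to the anchor is suspiciously high are discarded as likely faulty negatives, and the hardest of the remaining candidates are kept. This is an illustrative interpretation of the general technique, not the authors' algorithm; the function name `mined_contrastive_loss` and the parameters `num_negatives`, `faulty_sim_threshold`, and `temperature` are hypothetical placeholders.

```python
# Illustrative sketch (not the paper's code): cross-modal InfoNCE where the
# negative set is actively filtered rather than drawn purely at random.
import torch
import torch.nn.functional as F


def mined_contrastive_loss(video_emb, audio_emb, num_negatives=64,
                           faulty_sim_threshold=0.9, temperature=0.07):
    """video_emb, audio_emb: (N, D) embeddings of paired video/audio clips."""
    v = F.normalize(video_emb, dim=1)
    a = F.normalize(audio_emb, dim=1)
    sim = v @ a.t()                       # (N, N) cross-modal similarities
    n = sim.size(0)

    pos = sim.diag()                      # positives: each video's own audio
    mask_self = torch.eye(n, dtype=torch.bool, device=sim.device)

    losses = []
    for i in range(n):
        cand = sim[i][~mask_self[i]]      # similarities to all other audio clips
        # Drop candidates that are suspiciously similar to the anchor:
        # these are likely faulty (semantically related) negatives.
        keep = cand < faulty_sim_threshold
        cand = cand[keep] if keep.any() else cand
        # Keep the hardest (most similar) surviving candidates as negatives.
        k = min(num_negatives, cand.numel())
        hard_negs, _ = torch.topk(cand, k)
        logits = torch.cat([pos[i].unsqueeze(0), hard_negs]) / temperature
        # InfoNCE: the positive sits at index 0 of the logit vector.
        target = torch.zeros(1, dtype=torch.long, device=sim.device)
        losses.append(F.cross_entropy(logits.unsqueeze(0), target))
    return torch.stack(losses).mean()


if __name__ == "__main__":
    video = torch.randn(128, 256)
    audio = torch.randn(128, 256)
    print(mined_contrastive_loss(video, audio).item())
```

In practice, such filtering and hard-negative selection would typically be driven by learned semantic similarity (as the abstract's "semantically-aware" mining suggests) rather than a fixed threshold; the threshold here only stands in for that decision rule.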