Paper Title
Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data
Paper Authors
Paper Abstract
Recognizing sounds is a key aspect of computational audio scene analysis and machine perception. In this paper, we advocate that sound recognition is inherently a multi-modal audiovisual task, in that it is easier to differentiate sounds using both the audio and visual modalities as opposed to one or the other. We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings. The proposed fusion model utilizes an attention mechanism to dynamically combine the outputs of the individual audio and visual models. Experiments on the large-scale sound events dataset, AudioSet, demonstrate the efficacy of the proposed model, which outperforms single-modal models as well as state-of-the-art fusion and multi-modal models. We achieve a mean Average Precision (mAP) of 46.16 on AudioSet, outperforming the prior state of the art by approximately +4.35 mAP (relative: 10.4%).
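To make the attention-based fusion described in the abstract concrete, below is a minimal PyTorch sketch of one way such a mechanism can be realized: each modality branch contributes per-class probabilities, and a learned attention block predicts per-class mixing weights from the branch embeddings. All layer shapes and names here are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Sketch of attention-based late fusion of audio and visual predictions.

    Each branch supplies an embedding and per-class probabilities; the
    attention block predicts a per-class weight for each modality and
    returns the weighted sum of the two predictions. Dimensions are
    hypothetical, chosen only for illustration.
    """

    def __init__(self, audio_dim: int, visual_dim: int, num_classes: int):
        super().__init__()
        # One unnormalized attention score per class, per modality.
        self.audio_attn = nn.Linear(audio_dim, num_classes)
        self.visual_attn = nn.Linear(visual_dim, num_classes)

    def forward(self, audio_emb, audio_probs, visual_emb, visual_probs):
        # Stack scores along a modality axis: (batch, 2, num_classes).
        scores = torch.stack(
            [self.audio_attn(audio_emb), self.visual_attn(visual_emb)], dim=1
        )
        # Softmax over the modality axis yields per-class mixing weights
        # that sum to 1 across the two modalities.
        weights = torch.softmax(scores, dim=1)
        probs = torch.stack([audio_probs, visual_probs], dim=1)
        # Attention-weighted combination of the two modality predictions.
        return (weights * probs).sum(dim=1)


# Usage sketch with dummy tensors (batch of 4, hypothetical sizes):
fusion = AttentionFusion(audio_dim=128, visual_dim=512, num_classes=527)
fused = fusion(
    torch.randn(4, 128), torch.rand(4, 527),
    torch.randn(4, 512), torch.rand(4, 527),
)
print(fused.shape)  # torch.Size([4, 527])
```

The key design point this sketch captures is that the fusion weights are dynamic: they depend on the input embeddings, so the model can lean on audio for some clips and classes and on vision for others, rather than using a fixed average.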