Paper Title
Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment
Paper Authors
Paper Abstract
Learning from audio-visual data offers many possibilities to express the correspondence between audio and visual content, similar to human perception, which relates aural and visual information. In this work, we present a method for self-supervised representation learning based on audio-visual spatial alignment (AVSA), a more sophisticated alignment task than audio-visual correspondence (AVC). In addition to correspondence, AVSA also learns from the spatial location of acoustic and visual content. Based on 360$^\circ$ video and Ambisonics audio, we propose selecting visual objects using object detection and beamforming the audio signal towards the detected objects, aiming to learn the spatial alignment between objects and the sounds they produce. We investigate the use of spatial audio features to represent the audio input, as well as different audio formats: Ambisonics, mono, and stereo. Experimental results show a 10\% improvement on AVSA for the first-order Ambisonics intensity vector (FOA-IV) in comparison with log-mel spectrogram features; the addition of object-oriented crops also brings a significant performance increase on the human action recognition downstream task. A number of audio-only downstream tasks are devised to test the effectiveness of the learnt audio feature representation, obtaining performance comparable to state-of-the-art methods on acoustic scene classification from Ambisonic and binaural audio.
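The two spatial-audio operations named in the abstract, FOA intensity-vector features and beamforming a first-order Ambisonics signal towards a detected object, can be sketched as below. This is a minimal sketch using the standard formulations from the spatial-audio literature (active intensity $I = \mathrm{Re}\{W^{*}\,[X,Y,Z]\}$ per time-frequency bin, and a first-order plane-wave beam pattern), not necessarily the paper's exact implementation; the function names, the energy normalization, and the assumed ACN channel ordering (W, X, Y, Z) are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

def foa_intensity_vector(wxyz, fs=48000, nperseg=1024):
    """FOA intensity vector (FOA-IV) from B-format audio of shape (4, n_samples).

    Active acoustic intensity per time-frequency bin: I = Re{conj(W) * [X, Y, Z]},
    normalized by the bin energy so the feature is bounded and level-invariant.
    """
    # STFT of each channel: spec has shape (4, n_freq, n_frames)
    _, _, spec = stft(wxyz, fs=fs, nperseg=nperseg)
    W, XYZ = spec[0], spec[1:]
    intensity = np.real(np.conj(W)[None] * XYZ)          # (3, n_freq, n_frames)
    energy = np.abs(W) ** 2 + 0.5 * np.sum(np.abs(XYZ) ** 2, axis=0)
    return intensity / (energy[None] + 1e-8)

def foa_beamform(wxyz, azimuth, elevation):
    """Steer a first-order beam towards (azimuth, elevation) in radians,
    by weighting the directional channels with the plane-wave direction."""
    w, x, y, z = wxyz
    return (w
            + x * np.cos(azimuth) * np.cos(elevation)
            + y * np.sin(azimuth) * np.cos(elevation)
            + z * np.sin(elevation))
```

For a detected visual object, the pixel position in the 360$^\circ$ frame maps directly to an (azimuth, elevation) pair, which is what makes this beamforming step straightforward for equirectangular video.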