Paper Title

DeLoRes: Decorrelating Latent Spaces for Low-Resource Audio Representation Learning

Paper Authors

Sreyan Ghosh, Ashish Seth, Deepak Mittal, Maneesh Singh, S. Umesh

Paper Abstract

Inspired by the recent progress in self-supervised learning for computer vision, in this paper we introduce DeLoRes, a new general-purpose audio representation learning approach. Our main objective is to make our network learn representations in a resource-constrained setting (both data and compute), that can generalize well across a diverse set of downstream tasks. Inspired by the Barlow Twins objective function, we propose to learn embeddings that are invariant to distortions of an input audio sample, while making sure that they contain non-redundant information about the sample. To achieve this, we measure the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of an audio segment sampled from an audio file and make it as close to the identity matrix as possible. We use a combination of a small subset of the large-scale AudioSet dataset and FSD50K for self-supervised learning and are able to learn with less than half the parameters compared to state-of-the-art algorithms. For evaluation, we transfer these learned representations to 9 downstream classification tasks, including speech, music, and animal sounds, and show competitive results under different evaluation setups. In addition to being simple and intuitive, our pre-training algorithm is amenable to compute through its inherent nature of construction and does not require careful implementation details to avoid trivial or degenerate solutions. Furthermore, we conduct ablation studies on our results and make all our code and pre-trained models publicly available at https://github.com/Speech-Lab-IITM/DeLoRes.
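The objective summarized above follows the Barlow Twins formulation: the cross-correlation matrix between batch-normalized embeddings of two distorted views is pushed toward the identity matrix. Below is a minimal, hypothetical sketch of such a loss in PyTorch; the function name, the `lambda_offdiag` weight, and the normalization details are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a Barlow Twins-style decorrelation objective as
# described in the abstract; names and hyperparameters are illustrative.
import torch


def decorrelation_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                       lambda_offdiag: float = 5e-3) -> torch.Tensor:
    """z_a, z_b: (batch, dim) embeddings of two distorted views of the same audio segment."""
    batch_size, _ = z_a.shape

    # Standardize each embedding dimension across the batch.
    z_a = (z_a - z_a.mean(dim=0)) / (z_a.std(dim=0) + 1e-6)
    z_b = (z_b - z_b.mean(dim=0)) / (z_b.std(dim=0) + 1e-6)

    # Empirical cross-correlation matrix, shape (dim, dim).
    c = (z_a.T @ z_b) / batch_size

    # Invariance term: pull diagonal entries toward 1.
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    # Redundancy-reduction term: push off-diagonal entries toward 0.
    off_diag = c.pow(2).sum() - torch.diagonal(c).pow(2).sum()

    return on_diag + lambda_offdiag * off_diag


# Example: embeddings of two augmented views of a batch of 64 audio segments.
z1, z2 = torch.randn(64, 256), torch.randn(64, 256)
loss = decorrelation_loss(z1, z2)
```

Because the loss explicitly decorrelates embedding dimensions, a constant (collapsed) embedding is heavily penalized, which is consistent with the abstract's claim that no careful implementation tricks are needed to avoid trivial or degenerate solutions.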
