Paper Title
A Framework for Generative and Contrastive Learning of Audio Representations
Paper Authors
Paper Abstract
In this paper, we present a framework for contrastive learning of audio representations in a self-supervised setting, without access to any ground-truth labels. The core idea in self-supervised contrastive learning is to map an audio signal and its various augmented versions (representative of salient aspects of audio such as pitch, timbre, etc.) to a space where they lie close together and are separated from other, different signals. In addition, we explore generative models based on state-of-the-art transformer architectures for learning latent spaces for audio signals, again without access to any labels. Here, we map audio signals at a smaller scale to discrete dictionary elements and train transformers to predict the next dictionary element. We use only the data itself as supervision, bypassing the need for labels to train the deep neural networks. We then use a linear classifier head to evaluate the performance of our models, for both the self-supervised contrastive and the generative transformer-based representations that are learned. Our system achieves performance comparable to that of a fully supervised method that has access to ground-truth labels for training the neural network model. Given the availability of large-scale audio data, these representations show promise for a variety of audio understanding tasks.
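The contrastive objective described in the abstract, pulling an audio clip and its augmented version together while pushing other signals apart, resembles the NT-Xent loss used in SimCLR-style frameworks. The paper does not give its exact formulation, so the following is a minimal numpy sketch under that assumption; the batch size, embedding dimension, and temperature are illustrative, not taken from the paper:

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent contrastive loss: pull each embedding toward its augmented
    partner, push it away from every other sample in the batch."""
    # L2-normalize so dot products become cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    z = np.concatenate([z_a, z_b], axis=0)      # (2N, d)
    sim = z @ z.T / temperature                 # pairwise similarity matrix
    np.fill_diagonal(sim, -np.inf)              # exclude self-similarity
    n = z_a.shape[0]
    # the positive partner of row i is row i + n (and vice versa)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

# Toy batch: 4 clips embedded in 8 dimensions, two "views" per clip,
# where the second view stands in for an augmentation of the first.
rng = np.random.default_rng(0)
z_view1 = rng.normal(size=(4, 8))
z_view2 = z_view1 + 0.05 * rng.normal(size=(4, 8))
loss = nt_xent_loss(z_view1, z_view2)
```

Because the two views of each clip are nearly identical here, the loss is low; replacing `z_view2` with unrelated embeddings raises it, which is exactly the pressure that shapes the learned representation space.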