Paper Title

BEATs: Audio Pre-Training with Acoustic Tokenizers

Authors

Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Furu Wei

Abstract

The massive growth of self-supervised learning (SSL) has been witnessed in language, vision, speech, and audio domains over the past few years. While discrete label prediction is widely adopted for other modalities, the state-of-the-art audio SSL models still employ reconstruction loss for pre-training. Compared with reconstruction loss, semantic-rich discrete label prediction encourages the SSL model to abstract the high-level audio semantics and discard the redundant details as in human perception. However, a semantic-rich acoustic tokenizer for general audio pre-training is usually not straightforward to obtain, due to the continuous property of audio and the lack of phoneme sequences such as those available for speech. To tackle this challenge, we propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representation from Audio Transformers, where an acoustic tokenizer and an audio SSL model are optimized by iterations. In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner. Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model. The iteration is repeated with the hope of mutual promotion of the acoustic tokenizer and audio SSL model. The experimental results demonstrate our acoustic tokenizers can generate discrete labels with rich audio semantics and our audio SSL models achieve state-of-the-art results across various audio classification benchmarks, even significantly outperforming previous models that use more training data and model parameters. Specifically, we set a new state-of-the-art mAP of 50.6% on AudioSet-2M for audio-only models without using any external data, and 98.1% accuracy on ESC-50. The code and pre-trained models are available at https://aka.ms/beats.
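The first-iteration tokenizer described in the abstract (a random projection producing discrete labels for mask-and-predict training) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the feature dimension, projection dimension, vocabulary size, and all names are assumptions, and the quantization shown is a generic frozen-random-projection nearest-codebook lookup.

```python
# Sketch of a first-iteration acoustic tokenizer (assumed design, not the
# BEATs code): a frozen random projection maps each audio frame to a
# low-dimensional code, which is quantized to the nearest entry of a frozen
# random codebook. The resulting integer labels serve as prediction targets
# for masked frames during SSL pre-training.
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 128   # per-frame feature dimension, e.g. mel filterbanks (assumed)
CODE_DIM = 16    # projected code dimension (assumed)
VOCAB = 1024     # number of discrete labels (assumed)

# Both matrices are sampled once and never trained in the first iteration.
projection = rng.standard_normal((FEAT_DIM, CODE_DIM))
codebook = rng.standard_normal((VOCAB, CODE_DIM))

def tokenize(frames: np.ndarray) -> np.ndarray:
    """Map (T, FEAT_DIM) audio frames to (T,) discrete labels in [0, VOCAB)."""
    z = frames @ projection                                     # (T, CODE_DIM)
    # Squared Euclidean distance to every codebook entry, then argmin.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, VOCAB)
    return dists.argmin(axis=1)

frames = rng.standard_normal((50, FEAT_DIM))   # dummy 50-frame clip
labels = tokenize(frames)                      # targets for mask-and-predict training
```

In later iterations, the abstract replaces this frozen tokenizer with one distilled from the pre-trained or fine-tuned audio SSL model, so the labels become progressively more semantic.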
