Paper Title
Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation
Paper Authors
Paper Abstract
Audio Spectrogram Transformer models rule the field of Audio Tagging, outrunning previously dominating Convolutional Neural Networks (CNNs). Their superiority is based on the ability to scale up and exploit large-scale datasets such as AudioSet. However, Transformers are demanding in terms of model size and computational requirements compared to CNNs. We propose a training procedure for efficient CNNs based on offline Knowledge Distillation (KD) from high-performing yet complex Transformers. The proposed training scheme and the efficient CNN design based on MobileNetV3 result in models outperforming previous solutions in terms of parameter and computational efficiency and prediction performance. We provide models of different complexity levels, scaling from low-complexity models up to a new state-of-the-art performance of .483 mAP on AudioSet. Source code is available at: https://github.com/fschmid56/EfficientAT
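To make the offline KD setup concrete, below is a minimal sketch in PyTorch of a distillation loss for multi-label audio tagging: the Transformer teacher's predictions are pre-computed once, so only the efficient CNN student runs during training. The function name `kd_loss`, the weighting `lam`, and the use of binary cross-entropy for both terms are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of offline Knowledge Distillation for multi-label audio
# tagging. Teacher probabilities are pre-computed offline, so no Transformer
# forward pass is needed while the CNN student trains.
# NOTE: `lam` and the BCE-based distillation term are assumptions for
# illustration, not the authors' exact recipe.
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            labels: torch.Tensor,
            teacher_probs: torch.Tensor,
            lam: float = 0.1) -> torch.Tensor:
    """Weighted sum of the hard-label loss and the distillation loss.

    student_logits: raw student outputs, shape (batch, n_classes)
    labels:         multi-hot ground-truth labels, same shape
    teacher_probs:  pre-computed teacher sigmoid outputs, same shape
    """
    # Standard multi-label loss against the ground-truth labels.
    label_loss = F.binary_cross_entropy_with_logits(student_logits, labels)
    # Soft-target loss against the offline Transformer teacher's predictions.
    distill_loss = F.binary_cross_entropy_with_logits(student_logits, teacher_probs)
    return lam * label_loss + (1.0 - lam) * distill_loss
```

In this offline setup the teacher probabilities can simply be stored on disk alongside the dataset, which keeps per-step training cost at that of the CNN student alone.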