DCASE 2022: Comparative Analysis of CNNs for Acoustic Scene Classification Under Low-Complexity Considerations

Paper title


DCASE 2022: Comparative Analysis Of CNNs For Acoustic Scene Classification Under Low-Complexity Considerations

Authors

Zaragoza-Paredes, Josep, Naranjo-Alcazar, Javier, Naranjo, Valery, Zuccarello, Pedro

Abstract


Acoustic scene classification is an automatic listening problem that aims to assign an audio recording to one of a set of predefined scenes based on its audio content. Over the years (and in past editions of DCASE), this problem has often been solved with ensembles, i.e., several machine learning models whose predictions are combined at inference time. While these solutions can achieve high accuracy, they can be very expensive computationally, which makes it impossible to deploy them on IoT devices. Reflecting this shift in the field, the task imposes two constraints on model complexity. There is also the added difficulty of device mismatch: the provided audio clips were recorded with different devices. This technical report presents a comparative study of two network architectures: a conventional CNN and Conv-mixer. Although both networks exceed the baseline required by the competition, the conventional CNN performs better, exceeding the baseline by 8 percentage points. Solutions based on the Conv-mixer architecture perform worse, although they are much lighter.
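To make the complexity trade-off concrete, the sketch below counts parameters and multiply-accumulate operations (MACs) for a plain 3x3 convolution block and a Conv-mixer-style block (a depthwise k x k convolution followed by a pointwise 1x1 convolution), and shows why an ensemble, which runs every member at inference time, quickly exceeds a fixed budget. This is a minimal illustration, not the authors' architecture: the helper names (`Conv2d`, `block_cost`, `convmixer_block`, `ensemble_cost`, `fits_budget`), the 64-channel 16x16 feature map, and the depthwise kernel size of 9 are all hypothetical choices. The budget constants (128K parameters, 30 million MACs) are our assumption about the two DCASE 2022 Task 1 limits and should be checked against the official rules. BatchNorm layers and activations are ignored.

```python
from dataclasses import dataclass
from typing import List, Tuple

# ASSUMED complexity budget (our reading of DCASE 2022 Task 1; verify against the official rules).
MAX_PARAMS = 128_000
MAX_MACS = 30_000_000

Cost = Tuple[int, int]  # (parameters, MACs)


@dataclass
class Conv2d:
    """Hypothetical shape-only description of a stride-1, 'same'-padded 2-D convolution."""
    in_ch: int
    out_ch: int
    k: int           # square kernel size
    groups: int = 1  # groups == in_ch makes the convolution depthwise
    bias: bool = True

    def weights(self) -> int:
        return self.out_ch * (self.in_ch // self.groups) * self.k * self.k

    def params(self) -> int:
        return self.weights() + (self.out_ch if self.bias else 0)

    def macs(self, h: int, w: int) -> int:
        # One multiply-accumulate per weight per output position (bias additions ignored).
        return self.weights() * h * w


def block_cost(layers: List[Conv2d], h: int, w: int) -> Cost:
    """Total (params, MACs) of a block; BatchNorm and activations are ignored."""
    return (sum(layer.params() for layer in layers),
            sum(layer.macs(h, w) for layer in layers))


def conventional_block(c: int) -> List[Conv2d]:
    # A plain 3x3 convolution, the basic unit of a conventional CNN.
    return [Conv2d(c, c, 3)]


def convmixer_block(c: int, k: int = 9) -> List[Conv2d]:
    # Depthwise k x k convolution (spatial mixing) then pointwise 1x1 (channel mixing).
    return [Conv2d(c, c, k, groups=c), Conv2d(c, c, 1)]


def ensemble_cost(members: List[Cost]) -> Cost:
    # Every ensemble member runs at inference time, so the costs add up.
    return (sum(p for p, _ in members), sum(m for _, m in members))


def fits_budget(cost: Cost) -> bool:
    params, macs = cost
    return params <= MAX_PARAMS and macs <= MAX_MACS


if __name__ == "__main__":
    C, H, W = 64, 16, 16  # hypothetical feature-map size
    cnn = block_cost(conventional_block(C), H, W)
    mixer = block_cost(convmixer_block(C), H, W)
    ens = ensemble_cost([cnn] * 5)
    print("3x3 conv block:", cnn, fits_budget(cnn))
    print("Conv-mixer block:", mixer, fits_budget(mixer))
    print("5-member ensemble:", ens, fits_budget(ens))
```

With these illustrative sizes, the 3x3 block costs 36,928 parameters and about 9.4M MACs, the Conv-mixer block about 9.4K parameters and 2.4M MACs, and a five-member ensemble of 3x3 blocks already exceeds the assumed MAC limit. Because the depthwise/pointwise split is so much cheaper per block, a Conv-mixer can stack more blocks under the same budget, which is consistent with the abstract's observation that such solutions are much lighter.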
