Paper Title
Zero-Shot Audio Classification using Image Embeddings
Paper Authors
Paper Abstract
Supervised learning methods can solve a given problem when a large set of labeled data is available. However, acquiring a dataset that covers all target classes typically requires manual labeling, which is expensive and time-consuming. Zero-shot learning models can classify unseen concepts by utilizing their semantic information. The present study introduces image embeddings as side information for zero-shot audio classification using a nonlinear acoustic-semantic projection. We extract semantic image representations from the Open Images dataset and evaluate the performance of the models on an audio subset of AudioSet using semantic information from different domains: image, audio, and text. We demonstrate that image embeddings can be used as semantic information to perform zero-shot audio classification. The experimental results show that image and textual embeddings achieve similar performance both individually and together. We additionally calculate semantic acoustic embeddings from the test samples to provide an upper bound on the performance. The results show that the classification performance is highly sensitive to the semantic relation between test and training classes, and that textual and image embeddings can reach up to the performance of the semantic acoustic embeddings when the seen and unseen classes are semantically similar.
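The following is a minimal sketch of the zero-shot setup described in the abstract: audio features are mapped into a semantic embedding space (image, text, or both) by a nonlinear projection, and an unseen class is assigned by similarity to its semantic embedding. The abstract only states that the projection is nonlinear, so the small MLP, the cosine-similarity matching rule, and all names and dimensions below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticSemanticProjection(nn.Module):
    """Hypothetical nonlinear acoustic-to-semantic projection.

    A small MLP stands in for the paper's projection network; the
    exact architecture is not specified in the abstract.
    """
    def __init__(self, audio_dim=128, semantic_dim=512, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, semantic_dim),
        )

    def forward(self, audio_features):
        # Project audio features into the semantic embedding space.
        return self.net(audio_features)

def zero_shot_classify(model, audio_features, class_embeddings):
    """Assign each clip to the unseen class whose semantic embedding
    (e.g., an image or text embedding of the class label) is closest
    in cosine similarity to the projected audio features."""
    projected = F.normalize(model(audio_features), dim=-1)   # (B, D)
    prototypes = F.normalize(class_embeddings, dim=-1)       # (C, D)
    scores = projected @ prototypes.T                        # (B, C)
    return scores.argmax(dim=-1)

# Example with random stand-in tensors; real inputs would be audio
# embeddings of AudioSet clips and semantic embeddings of the unseen
# class labels (e.g., image embeddings derived from Open Images).
model = AcousticSemanticProjection()
audio = torch.randn(4, 128)             # 4 clips, 128-dim audio features
unseen_classes = torch.randn(10, 512)   # 10 unseen classes, 512-dim
predictions = zero_shot_classify(model, audio, unseen_classes)
print(predictions)  # tensor of 4 predicted class indices
```

In this sketch the projection would be trained only on seen classes (regressing projected audio features onto their class embeddings), so at test time unseen classes are recognized purely through their semantic embeddings, which is what makes the classification zero-shot.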