STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

Paper title

STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

Authors

Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen

Abstract

This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset for sound event localization and detection, comprising spatial recordings of real scenes collected in various interiors of two different sites. The dataset is captured with a high-resolution spherical microphone array and delivered in two 4-channel formats: first-order Ambisonics and tetrahedral microphone array. Sound events in the dataset belonging to 13 target sound classes are annotated both temporally and spatially through a combination of human annotation and optical tracking. The dataset serves as the development and evaluation dataset for Task 3 of the DCASE2022 Challenge on Sound Event Localization and Detection, and it introduces significant new challenges for the task compared to the previous iterations, which were based on synthetic spatialized sound scene recordings. Dataset specifications are detailed, including the recording and annotation process, the target classes and their presence, and the development and evaluation splits. Additionally, the report presents the baseline system that accompanies the dataset in the challenge, with emphasis on its differences from the baseline of the previous iterations: namely, the introduction of the multi-ACCDOA representation to handle multiple simultaneous occurrences of events of the same class, and support for additional improved input features for the microphone array format. Results of the baseline indicate that, with a suitable training strategy, reasonable detection and localization performance can be achieved on real sound scene recordings. The dataset is available at https://zenodo.org/record/6387880.
