Paper Title
Multiple Sound Sources Localization from Coarse to Fine
Paper Authors
Paper Abstract
How to visually localize multiple sound sources in unconstrained videos is a formidable problem, especially when pairwise sound-object annotations are unavailable. To solve this problem, we develop a two-stage audiovisual learning framework that disentangles audio and visual representations of different categories from complex scenes, then performs cross-modal feature alignment in a coarse-to-fine manner. Our model achieves state-of-the-art results on the public localization dataset, as well as considerable performance on multi-source sound localization in complex scenes. We then employ the localization results for sound separation and obtain performance comparable to existing methods. These outcomes demonstrate our model's ability to effectively align sounds with their specific visual sources. Code is available at https://github.com/shvdiwnkozbw/Multi-Source-Sound-Localization
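To make the coarse-to-fine alignment idea concrete, below is a minimal PyTorch sketch, not the authors' released implementation: it assumes per-category ("disentangled") audio and visual embeddings and a spatial visual feature map, and all function, tensor, and dimension names here are hypothetical illustrations of the two stages described in the abstract.

```python
# Hypothetical sketch of coarse-to-fine cross-modal alignment.
# Assumes class-disentangled embeddings; names are illustrative,
# not taken from the Multi-Source-Sound-Localization repository.
import torch
import torch.nn.functional as F


def coarse_alignment(audio_emb, visual_emb):
    # audio_emb, visual_emb: (B, C, D) per-category global embeddings
    # Coarse stage: category-level similarity between the two modalities.
    return F.cosine_similarity(audio_emb, visual_emb, dim=-1)  # (B, C)


def fine_localization(audio_emb, visual_map):
    # audio_emb: (B, C, D); visual_map: (B, D, H, W) spatial visual features
    # Fine stage: correlate each audio category embedding with every
    # spatial location to produce per-category localization maps.
    B, C, D = audio_emb.shape
    _, _, H, W = visual_map.shape
    v = F.normalize(visual_map.flatten(2), dim=1)   # (B, D, H*W)
    a = F.normalize(audio_emb, dim=-1)              # (B, C, D)
    return torch.bmm(a, v).view(B, C, H, W)         # (B, C, H, W)


if __name__ == "__main__":
    B, C, D, H, W = 2, 4, 128, 14, 14
    audio = torch.randn(B, C, D)
    visual = torch.randn(B, C, D)
    vmap = torch.randn(B, D, H, W)
    print(coarse_alignment(audio, visual).shape)  # torch.Size([2, 4])
    print(fine_localization(audio, vmap).shape)   # torch.Size([2, 4, 14, 14])
```

In this reading, the coarse stage matches whole categories across modalities, while the fine stage reuses the same audio embeddings against spatial features so that each sounding category receives its own localization heatmap, which can then drive downstream sound separation.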