Paper Title
Active Audio-Visual Separation of Dynamic Sound Sources
Paper Authors
Paper Abstract
We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream being emitted by an object of interest. The agent hears a mixed stream of multiple audio sources (e.g., multiple people conversing and a band playing music at a noisy party). Given a limited time budget, it needs to extract the target sound accurately at every step using egocentric audio-visual observations. We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone to recover the dynamic target audio, using self-attention to make high-quality estimates for current timesteps and also simultaneously improve its past estimates. Using highly realistic acoustic SoundSpaces simulations in real-world scanned Matterport3D environments, we show that our model is able to learn efficient behavior to carry out continuous separation of a dynamic audio target. Project: https://vision.cs.utexas.edu/projects/active-av-dynamic-separation/.
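The abstract describes a transformer memory that uses self-attention so the agent can both estimate the target audio at the current timestep and revise its past estimates as new observations arrive. The sketch below illustrates only that core idea (every stored timestep attending to every other); the dimensions, random weights, and function names are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attend(memory, d_k=16):
    """Single-head self-attention over the agent's stored per-step
    audio-visual embeddings. Each timestep attends to all timesteps,
    so past separation estimates can be updated with later context."""
    T, d = memory.shape
    rng = np.random.default_rng(0)
    # Hypothetical projection weights; a trained model would learn these.
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))
    Q, K, V = memory @ Wq, memory @ Wk, memory @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (T, T) attention weights
    return attn @ V                         # one refined estimate per step

# Toy rollout: 5 timesteps of 16-dim egocentric observation embeddings.
mem = np.random.default_rng(1).standard_normal((5, 16))
refined = self_attend(mem)
print(refined.shape)  # one updated estimate per past/current timestep
```

Because the attention matrix is full (not causal), an estimate for an early timestep is a weighted mixture that can draw on later observations, which is what lets the agent "simultaneously improve its past estimates."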