Paper Title
Active Audio-Visual Separation of Dynamic Sound Sources
Paper Authors
Paper Abstract
We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream being emitted by an object of interest. The agent hears a mixed stream of multiple audio sources (e.g., multiple people conversing and a band playing music at a noisy party). Given a limited time budget, it needs to extract the target sound accurately at every step using egocentric audio-visual observations. We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone to recover the dynamic target audio, using self-attention to make high-quality estimates for current timesteps and also simultaneously improve its past estimates. Using highly realistic acoustic SoundSpaces simulations in real-world scanned Matterport3D environments, we show that our model is able to learn efficient behavior to carry out continuous separation of a dynamic audio target. Project: https://vision.cs.utexas.edu/projects/active-av-dynamic-separation/.
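The abstract describes a transformer memory that uses self-attention so the agent can both estimate the target audio at the current timestep and revise its past estimates as new observations arrive. The sketch below illustrates only that core idea (every stored timestep attending to every other); the dimensions, random weights, and function names are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attend(memory, d_k=16):
    """Single-head self-attention over the agent's stored per-step
    audio-visual embeddings. Each timestep attends to all timesteps,
    so past separation estimates can be updated with later context."""
    T, d = memory.shape
    rng = np.random.default_rng(0)
    # Hypothetical projection weights; a trained model would learn these.
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))
    Q, K, V = memory @ Wq, memory @ Wk, memory @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (T, T) attention weights
    return attn @ V                         # one refined estimate per step

# Toy rollout: 5 timesteps of 16-dim egocentric observation embeddings.
mem = np.random.default_rng(1).standard_normal((5, 16))
refined = self_attend(mem)
print(refined.shape)  # one updated estimate per past/current timestep
```

Because the attention matrix is full (not causal), an estimate for an early timestep is a weighted mixture that can draw on later observations, which is what lets the agent "simultaneously improve its past estimates."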