Paper Title


End-to-End Active Speaker Detection

Paper Authors

Juan Leon Alcazar, Moritz Cordes, Chen Zhao, Bernard Ghanem

Abstract


Recent advances in the Active Speaker Detection (ASD) problem build upon a two-stage process: feature extraction and spatio-temporal context aggregation. In this paper, we propose an end-to-end ASD workflow where feature learning and contextual predictions are jointly learned. Our end-to-end trainable network simultaneously learns multi-modal embeddings and aggregates spatio-temporal context. This results in more suitable feature representations and improved performance in the ASD task. We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance. Finally, we design a weakly-supervised strategy, which demonstrates that the ASD problem can also be approached by utilizing audiovisual data but relying exclusively on audio annotations. We achieve this by modelling the direct relationship between the audio signal and the possible sound sources (speakers), as well as introducing a contrastive loss. All the resources of this project will be made available at: https://github.com/fuankarion/end-to-end-asd.
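The "split message passing" idea in the iGNN blocks can be illustrated with a minimal sketch. The abstract does not specify the exact update rule, so the code below is a hypothetical NumPy toy (not the paper's implementation): speaker-crop node features are updated first along temporal edges (the same speaker across adjacent frames) and then along spatial edges (different speakers within the same frame), instead of over a single fused graph. The node layout, adjacency matrices, and ReLU update are illustrative assumptions.

```python
import numpy as np

def normalize_adj(adj):
    """Row-normalize an adjacency matrix after adding self-loops."""
    adj = adj + np.eye(adj.shape[0])
    deg = adj.sum(axis=1, keepdims=True)
    return adj / deg

def ignn_block(x, adj_temporal, adj_spatial):
    """One interleaved block: temporal aggregation, then spatial aggregation.

    This separates the two main sources of ASD context (how a speaker
    evolves over time vs. who else is on screen) into distinct passes.
    """
    x = np.maximum(normalize_adj(adj_temporal) @ x, 0.0)  # temporal pass + ReLU
    x = np.maximum(normalize_adj(adj_spatial) @ x, 0.0)   # spatial pass + ReLU
    return x

# Toy graph: 2 speakers x 2 frames = 4 nodes with 3-dim embeddings.
# Node order: (spk0, f0), (spk0, f1), (spk1, f0), (spk1, f1).
rng = np.random.default_rng(0)
x = rng.random((4, 3))
adj_t = np.array([[0, 1, 0, 0],   # same speaker, adjacent frames
                  [1, 0, 0, 0],
                  [0, 0, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
adj_s = np.array([[0, 0, 1, 0],   # same frame, different speakers
                  [0, 0, 0, 1],
                  [1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
out = ignn_block(x, adj_t, adj_s)
print(out.shape)  # (4, 3)
```

Interleaving the two passes keeps each aggregation step interpretable: the temporal pass smooths a speaker's own trajectory, while the spatial pass lets co-occurring speakers compete for the "active" label.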
