Paper Title
Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention
Paper Authors
Paper Abstract
The major challenge in the audio-visual event localization task lies in how to fuse information from multiple modalities effectively. Recent works have shown that attention mechanisms are beneficial to the fusion process. In this paper, we propose a novel joint co-attention mechanism with a multimodal fusion method for audio-visual event localization. In particular, we present a concise yet effective architecture that learns representations from multiple modalities in a joint manner. First, visual features are combined with auditory features and turned into a joint representation. Next, we use the joint representation to attend to the visual features and the auditory features, respectively. With the help of this joint co-attention, new visual and auditory features are produced, so the two kinds of features can mutually improve each other. Notably, the joint co-attention unit is recursive, meaning that it can be applied multiple times to progressively obtain better joint representations. Extensive experiments on the public AVE dataset show that the proposed method achieves significantly better results than state-of-the-art methods.
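To make the fuse-then-attend loop concrete, below is a minimal PyTorch sketch of the recursive joint co-attention described in the abstract. It is an illustration under assumptions, not the authors' implementation: the module name JointCoAttention, the linear fusion layer, the feature dimension d_model, and the recursion count num_steps are all hypothetical choices made for the example.

```python
import torch
import torch.nn as nn

class JointCoAttention(nn.Module):
    """Hypothetical sketch: fuse two modalities into a joint representation,
    then let that joint representation attend to each modality in turn."""

    def __init__(self, d_model: int = 256, num_heads: int = 4):
        super().__init__()
        # Combine visual and auditory features into a joint representation.
        self.fuse = nn.Linear(2 * d_model, d_model)
        # The joint representation serves as the query over each modality.
        self.attn_v = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.attn_a = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, v: torch.Tensor, a: torch.Tensor, num_steps: int = 2):
        # v, a: (batch, time, d_model) visual and auditory features.
        for _ in range(num_steps):  # recursive application of the unit
            joint = self.fuse(torch.cat([v, a], dim=-1))
            # The joint representation attends to each modality; the outputs
            # become the refined visual and auditory features for the next step.
            v, _ = self.attn_v(joint, v, v)
            a, _ = self.attn_a(joint, a, a)
        return v, a

# Usage: refine paired features for 10 one-second video segments.
v = torch.randn(8, 10, 256)  # visual features
a = torch.randn(8, 10, 256)  # auditory features
v_new, a_new = JointCoAttention()(v, a, num_steps=2)
```

Because the refined features are re-fused at the start of each step, running the unit for more steps progressively sharpens the joint representation, which is the recursive behavior the abstract highlights.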