Paper Title
MSAF: Multimodal Split Attention Fusion
Paper Authors
Paper Abstract
Multimodal learning mimics the reasoning process of the human multi-sensory system, which is used to perceive the surrounding world. While making a prediction, the human brain tends to relate crucial cues from multiple sources of information. In this work, we propose a novel multimodal fusion module that learns to emphasize the features that contribute more across all modalities. Specifically, the proposed Multimodal Split Attention Fusion (MSAF) module splits each modality into channel-wise equal feature blocks and creates a joint representation that is used to generate soft attention for each channel across the feature blocks. Furthermore, the MSAF module is designed to be compatible with features of various spatial dimensions and sequence lengths, making it suitable for both CNNs and RNNs. Thus, MSAF can easily be added to fuse the features of any unimodal networks while reusing their existing pretrained weights. To demonstrate the effectiveness of our fusion module, we design three multimodal networks with MSAF for emotion recognition, sentiment analysis, and action recognition tasks. Our approach achieves competitive results on each task and outperforms other application-specific networks and multimodal fusion benchmarks.
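To make the fusion step concrete, below is a minimal PyTorch sketch of the split-attention idea as described in the abstract: each modality's features are globally pooled, split into equal channel blocks, summed into a joint representation, and used to produce per-channel soft attention over the blocks. The class name `MSAFSketch`, the `block_size` and `reduction` hyperparameters, and the choice of softmax normalization across blocks are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of split-attention fusion, assuming the behavior described
# in the abstract. Names and hyperparameters here are illustrative, not the
# authors' released code.
import torch
import torch.nn as nn


class MSAFSketch(nn.Module):
    def __init__(self, in_channels_list, block_size=16, reduction=4):
        super().__init__()
        # Assumes every modality's channel count is divisible by block_size.
        self.block_size = block_size
        self.num_blocks = [c // block_size for c in in_channels_list]
        hidden = max(block_size // reduction, 4)
        # Shared bottleneck mapping the joint representation to a hidden space.
        self.bottleneck = nn.Sequential(nn.Linear(block_size, hidden), nn.ReLU())
        # One attention head per feature block, projecting back to block_size.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, block_size) for _ in range(sum(self.num_blocks))]
        )

    def forward(self, feats):
        # feats: list of tensors, one per modality, shaped [B, C_m, *dims],
        # where *dims are optional spatial (CNN) or temporal (RNN) axes.
        blocks = []
        for x in feats:
            # Global-average-pool all non-batch, non-channel axes to [B, C_m].
            s = x if x.dim() == 2 else x.mean(dim=tuple(range(2, x.dim())))
            blocks.extend(s.split(self.block_size, dim=1))
        # Joint representation: element-wise sum over all channel blocks.
        joint = self.bottleneck(torch.stack(blocks).sum(dim=0))  # [B, hidden]
        # Per-channel soft attention across blocks (softmax over the block axis
        # is an assumption about the normalization scope).
        logits = torch.stack([head(joint) for head in self.heads])
        attn = torch.softmax(logits, dim=0)  # [total_blocks, B, block_size]
        out, idx = [], 0
        for x, n in zip(feats, self.num_blocks):
            a = torch.cat([attn[idx + i] for i in range(n)], dim=1)  # [B, C_m]
            idx += n
            a = a.view(*a.shape, *([1] * (x.dim() - 2)))  # broadcast over *dims
            out.append(x * a)  # recalibrate each modality's features
        return out
```

Because the attention only rescales each modality's features in place, the module can sit between pretrained unimodal backbones without changing their output shapes, which is consistent with the abstract's claim about reusing pretrained weights. A hypothetical usage with one RNN-style and one CNN-style branch:

```python
msaf = MSAFSketch(in_channels_list=[128, 256], block_size=16)
audio = torch.randn(8, 128, 50)     # [B, C, T] from a recurrent/1D branch
visual = torch.randn(8, 256, 7, 7)  # [B, C, H, W] from a 2D-CNN branch
audio_out, visual_out = msaf([audio, visual])  # same shapes, recalibrated
```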