Paper Title

Multimodal Channel-Mixing: Channel and Spatial Masked AutoEncoder on Facial Action Unit Detection

Paper Authors

Xiang Zhang, Huiyuan Yang, Taoyue Wang, Xiaotian Li, Lijun Yin

Paper Abstract

Recent studies have focused on utilizing multi-modal data to develop robust models for facial Action Unit (AU) detection. However, the heterogeneity of multi-modal data poses challenges in learning effective representations. One such challenge is extracting relevant features from multiple modalities with a single feature extractor. Moreover, previous studies have not fully explored the potential of multi-modal fusion strategies: in contrast to the extensive work on late fusion, investigations of early fusion that exploit channel information remain limited. This paper presents a novel multi-modal reconstruction network, named Multimodal Channel-Mixing (MCM), as a pre-trained model that learns robust representations to facilitate multi-modal fusion. The approach follows an early-fusion setup and integrates a Channel-Mixing module in which two out of five channels are randomly dropped; the dropped channels are then reconstructed from the remaining ones using a masked autoencoder. This module not only reduces channel redundancy but also promotes multi-modal learning and reconstruction, resulting in robust feature learning. The encoder is fine-tuned on the downstream task of automatic facial action unit detection. Pre-training experiments were conducted on BP4D+, followed by fine-tuning on BP4D and DISFA, to assess the effectiveness and robustness of the proposed framework. The results demonstrate that our method matches and surpasses the performance of state-of-the-art baseline methods.
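
Below is a minimal sketch of the channel-dropping step described in the abstract, assuming an early-fused five-channel input (for example RGB + depth + thermal; the exact channel composition, the zero-fill masking choice, and the function name `channel_mix` are illustrative assumptions, not the authors' implementation):

```python
import torch

def channel_mix(x: torch.Tensor, num_drop: int = 2):
    """Randomly drop `num_drop` of the fused input channels (Channel-Mixing sketch).

    x: (B, 5, H, W) early-fused multimodal tensor, e.g. RGB (3) + depth (1) + thermal (1)
       -- the channel composition is an assumption for illustration only.
    Returns the channel-masked input and the indices of the dropped channels,
    which a masked autoencoder would then be trained to reconstruct.
    """
    b, c, _, _ = x.shape
    masked = x.clone()
    dropped = []
    for i in range(b):
        idx = torch.randperm(c)[:num_drop]  # pick 2 of the 5 channels at random
        masked[i, idx] = 0.0                # zero-fill is one simple masking choice
        dropped.append(idx)
    return masked, torch.stack(dropped)     # shapes: (B, 5, H, W), (B, num_drop)

# Usage: the encoder sees `masked`; the reconstruction loss is computed on the
# dropped channels (and, per the paper title, on spatially masked patches as well),
# and the pre-trained encoder is later fine-tuned for AU detection.
x = torch.randn(8, 5, 224, 224)
masked, dropped_idx = channel_mix(x)
```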
