Paper Title
Temporal Stochastic Softmax for 3D CNNs: An Application in Facial Expression Recognition
Authors
Abstract
Training deep learning models for accurate spatiotemporal recognition of facial expressions in videos requires significant computational resources. For practical reasons, 3D Convolutional Neural Networks (3D CNNs) are usually trained on relatively short clips randomly extracted from videos. However, such uniform sampling is generally sub-optimal because it assigns equal importance to every temporal clip. In this paper, we present a strategy for efficient video-based training of 3D CNNs. It relies on softmax temporal pooling and a weighted sampling mechanism to select the most relevant training clips. The proposed softmax strategy provides several advantages: reduced computational complexity through efficient clip sampling, and improved accuracy because temporal weighting focuses on the more relevant clips during both training and inference. Experimental results obtained with the proposed method on several facial expression recognition benchmarks show the benefits of focusing on the more informative clips in training videos. In particular, our approach improves recognition performance while reducing computational cost, by mitigating the impact of inaccurate trimming, coarse annotation of videos, and the heterogeneous distribution of visual information across time.
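The abstract describes two mechanisms: softmax temporal pooling of clip-level predictions, and softmax-weighted sampling of training clips. The following is a minimal NumPy sketch of both ideas, assuming per-clip scores have already been produced by a 3D CNN; the function names, the temperature parameter, and the use of the max logit as a clip's relevance score are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, temperature=1.0):
    # Numerically stable softmax over a vector of clip-level scores.
    z = np.asarray(x, dtype=np.float64) / temperature
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_training_clip(clip_scores, rng=None):
    """Draw the index of one training clip, with probability
    proportional to the softmax of its current relevance score,
    instead of sampling all clips uniformly."""
    rng = rng or np.random.default_rng()
    weights = softmax(clip_scores)
    return rng.choice(len(clip_scores), p=weights)

def softmax_temporal_pooling(clip_logits):
    """Aggregate per-clip class logits into a single video-level
    prediction, weighting each clip by the softmax of its own
    relevance (here approximated by its max logit)."""
    clip_logits = np.asarray(clip_logits)   # shape: (num_clips, num_classes)
    relevance = clip_logits.max(axis=1)     # one relevance score per clip
    weights = softmax(relevance)            # temporal attention weights
    return (weights[:, None] * clip_logits).sum(axis=0)
```

In a full training loop, the sampling weights would be refreshed as the model's clip scores evolve; the temperature controls how sharply sampling concentrates on high-scoring clips (a high temperature approaches uniform sampling, a low one focuses on the few most relevant clips).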