Paper Title


Temporal Bilinear Encoding Network of Audio-Visual Features at Low Sampling Rates

Authors

Feiyan Hu, Eva Mohedano, Noel O'Connor, Kevin McGuinness

Abstract


Current deep learning based video classification architectures are typically trained end-to-end on large volumes of data and require extensive computational resources. This paper aims to exploit audio-visual information in video classification with a 1 frame per second sampling rate. We propose Temporal Bilinear Encoding Networks (TBEN) for encoding both audio and visual long range temporal information using bilinear pooling and demonstrate bilinear pooling is better than average pooling on the temporal dimension for videos with low sampling rate. We also embed the label hierarchy in TBEN to further improve the robustness of the classifier. Experiments on the FGA240 fine-grained classification dataset using TBEN achieve a new state-of-the-art (hit@1=47.95%). We also exploit the possibility of incorporating TBEN with multiple decoupled modalities like visual semantic and motion features: experiments on UCF101 sampled at 1 FPS achieve close to state-of-the-art accuracy (hit@1=91.03%) while requiring significantly less computational resources than competing approaches for both training and prediction.
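The abstract's central claim is that bilinear pooling beats average pooling when aggregating per-frame features over the temporal dimension. The following is a minimal NumPy sketch of both pooling strategies, not the authors' TBEN implementation: it forms the time-averaged outer product of frame features (with the signed square-root and L2 normalization conventionally applied to bilinear descriptors) alongside a plain temporal mean as the baseline. The function names and shapes are illustrative assumptions.

```python
import numpy as np

def temporal_bilinear_pool(features):
    """Bilinear-pool a sequence of frame features over time.

    features: (T, D) array, one D-dim feature per sampled frame
              (e.g. one frame per second, as in the paper).
    Returns a (D*D,) descriptor: the time-averaged outer product
    x_t x_t^T, flattened, then signed-sqrt and L2 normalized
    (standard post-processing for bilinear descriptors;
    an assumption here, not taken from the paper's text).
    """
    T, _ = features.shape
    # Average of outer products over the temporal dimension.
    bilinear = features.T @ features / T           # (D, D)
    vec = bilinear.reshape(-1)                     # flatten to D*D
    vec = np.sign(vec) * np.sqrt(np.abs(vec))      # signed square root
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def temporal_average_pool(features):
    """Baseline: plain mean over the temporal dimension."""
    return features.mean(axis=0)
```

Note the trade-off the sketch makes visible: bilinear pooling captures second-order (channel-correlation) statistics across the sparse frame samples at the cost of a quadratic D*D descriptor, whereas average pooling keeps only first-order statistics in D dimensions.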
