Paper Title
Cut-Based Graph Learning Networks to Discover Compositional Structure of Sequential Video Data
Paper Authors
Paper Abstract
Conventional sequential learning methods such as Recurrent Neural Networks (RNNs) focus on interactions between consecutive inputs, i.e., first-order Markovian dependency. However, most sequential data, as seen with videos, have complex dependency structures that imply variable-length semantic flows and their compositions, and these are hard to capture with conventional methods. Here, we propose Cut-Based Graph Learning Networks (CB-GLNs) for learning video data by discovering these complex structures of the video. The CB-GLNs represent video data as a graph, with nodes and edges corresponding to frames of the video and their dependencies, respectively. The CB-GLNs find compositional dependencies of the data in multilevel graph forms via a parameterized kernel with graph-cut and a message passing framework. We evaluate the proposed method on two different video understanding tasks: video theme classification (YouTube-8M dataset) and video question answering (TVQA dataset). The experimental results show that our model efficiently learns the semantic compositional structure of video data. Furthermore, our model achieves the highest performance in comparison to other baseline methods.
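To make the ideas named in the abstract concrete, below is a minimal PyTorch sketch of representing video frames as graph nodes, computing edge weights with a parameterized kernel, running one message-passing step, and partitioning the learned affinity with a spectral (normalized-cut-style) bipartition. The class and function names (`FrameGraphLayer`, `spectral_bipartition`) and all architectural details are illustrative assumptions; the abstract does not specify the authors' exact kernel, cut procedure, or multilevel construction.

```python
# Hypothetical sketch (not the authors' implementation): frames as graph nodes,
# a parameterized kernel for edge weights, one message-passing step, and a
# spectral bipartition standing in for the graph-cut.
import torch
import torch.nn as nn


class FrameGraphLayer(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        # Parameterized kernel: project frames, then take scaled dot products.
        self.query = nn.Linear(feat_dim, hidden_dim)
        self.key = nn.Linear(feat_dim, hidden_dim)
        # Update applied after aggregating neighbor messages.
        self.update = nn.Linear(feat_dim, feat_dim)

    def forward(self, frames: torch.Tensor):
        # frames: (num_frames, feat_dim), e.g. per-frame CNN features.
        q, k = self.query(frames), self.key(frames)
        affinity = torch.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)
        # One message-passing step: aggregate neighbors weighted by affinity.
        messages = affinity @ frames
        updated = torch.relu(self.update(messages)) + frames
        return updated, affinity


def spectral_bipartition(affinity: torch.Tensor) -> torch.Tensor:
    # Approximate a normalized cut via the sign of the Fiedler vector of the
    # Laplacian built from the symmetrized affinity matrix.
    w = (affinity + affinity.t()) / 2
    laplacian = torch.diag(w.sum(dim=-1)) - w
    _, eigvecs = torch.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1]  # eigenvector of the second-smallest eigenvalue
    return (fiedler > 0).long()  # 0/1 segment label per frame


if __name__ == "__main__":
    frames = torch.randn(16, 128)  # toy data: 16 frames, 128-d features
    layer = FrameGraphLayer(128, 64)
    updated, affinity = layer(frames)
    segments = spectral_bipartition(affinity.detach())
    print(updated.shape, segments.tolist())
```

In this sketch, the resulting 0/1 segment labels would group contiguous frames into candidate semantic units, which a multilevel model could then pool and process with further message passing at a coarser graph level.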