Paper Title
METEOR Guided Divergence for Video Captioning
Paper Authors
Paper Abstract
Automatic video captioning aims for a holistic visual scene understanding. It requires a mechanism for capturing temporal context in video frames and the ability to comprehend the actions and associations of objects in a given timeframe. Such a system should additionally learn to abstract video sequences into sensible representations as well as to generate natural written language. While the majority of captioning models focus solely on the visual inputs, little attention has been paid to the audiovisual modality. To tackle this issue, we propose a novel two-fold approach. First, we implement a reward-guided KL Divergence to train a video captioning model which is resilient towards token permutations. Second, we utilise a Bi-Modal Hierarchical Reinforcement Learning (BMHRL) Transformer architecture to capture long-term temporal dependencies of the input data as a foundation for our hierarchical captioning module. Using our BMHRL, we show the suitability of the HRL agent in the generation of content-complete and grammatically sound sentences by achieving $4.91$, $2.23$, and $10.80$ in BLEU3, BLEU4, and METEOR scores, respectively, on the ActivityNet Captions dataset. Finally, we make our BMHRL framework and trained models publicly available for users and developers at https://github.com/d-rothen/bmhrl.
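The abstract's first contribution, a reward-guided KL divergence, can be illustrated with a minimal sketch. The function name, shapes, and the idea of weighting the per-token KL term by a caption-quality reward (e.g. a METEOR-based score) are assumptions for illustration; the paper's actual formulation may differ.

```python
import numpy as np

def reward_guided_kl(pred_logits, target_probs, rewards):
    """Hypothetical sketch of a reward-weighted KL divergence loss.

    pred_logits:  (T, V) unnormalised model scores per token position
    target_probs: (T, V) target token distributions
    rewards:      (T,)   per-token reward weights (e.g. METEOR-derived),
                         assumed here; scaling the KL term by a reward is
                         one way to penalise meaning-preserving token
                         permutations less harshly.
    """
    # Softmax over the vocabulary axis to get predicted distributions.
    shifted = pred_logits - pred_logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    pred_probs = exp / exp.sum(axis=-1, keepdims=True)

    # Per-position KL(target || pred), with epsilon for numerical safety.
    eps = 1e-12
    kl = (target_probs
          * (np.log(target_probs + eps) - np.log(pred_probs + eps))
          ).sum(axis=-1)

    # Reward-weighted mean over token positions.
    return float((rewards * kl).mean())
```

When the predicted distribution matches the target, the loss is (numerically) zero regardless of the reward; otherwise the reward scales the penalty linearly.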