Paper Title

Video Caption Dataset for Describing Human Actions in Japanese

Paper Authors

Yutaro Shigeto, Yuya Yoshikawa, Jiaqing Lin, Akikazu Takeuchi

Abstract

In recent years, automatic video caption generation has attracted considerable attention. This paper focuses on the generation of Japanese captions for describing human actions. While most currently available video caption datasets have been constructed for English, there is no equivalent Japanese dataset. To address this, we constructed a large-scale Japanese video caption dataset consisting of 79,822 videos and 399,233 captions. Each caption in our dataset describes a video in the form of "who does what and where." To describe human actions, it is important to identify the details of a person, place, and action. Indeed, when we describe human actions, we usually mention the scene, person, and action. In our experiments, we evaluated two caption generation methods to obtain benchmark results. Further, we investigated whether those generation methods could specify "who does what and where."
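
The abstract does not include a data schema, so the following is a minimal Python sketch of how one record in a "who does what and where" caption dataset might be represented. The class names, field names, and example values are assumptions for illustration, not the dataset's actual format; the only figures taken from the abstract are the corpus size, which implies roughly five captions per video (399,233 / 79,822 ≈ 5.0).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Caption:
    # Each caption follows the "who does what and where" form described in
    # the abstract. Splitting it into three slots is our assumption, not the
    # paper's published schema.
    person: str   # "who"
    action: str   # "does what"
    place: str    # "where"
    text: str     # the full Japanese caption string

@dataclass
class VideoEntry:
    video_id: str            # hypothetical identifier
    captions: List[Caption]  # the corpus averages ~5 captions per video

# Hypothetical example record (illustrative values, not real data):
entry = VideoEntry(
    video_id="video_00001",
    captions=[
        Caption(
            person="男性が",        # "a man"
            action="走っている",    # "is running"
            place="公園で",         # "in a park"
            text="公園で男性が走っている",
        )
    ],
)
print(len(entry.captions))  # -> 1
```

Keeping the person, action, and place slots separate mirrors the evaluation described in the abstract, which checks whether generation methods can specify "who does what and where."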
