Paper Title

Impact of visual assistance for automated audio captioning

Authors

Wim Boes, Hugo Van hamme

Abstract

We study the impact of visual assistance for automated audio captioning. Utilizing multi-encoder transformer architectures, which have previously been employed to introduce vision-related information in the context of sound event detection, we analyze the usefulness of incorporating a variety of pretrained features. We perform experiments on a YouTube-based audiovisual data set and investigate the effect of applying the considered transfer learning technique in terms of a variety of captioning metrics. We find that only one of the considered kinds of pretrained features provides consistent improvements, while the others do not provide any noteworthy gains at all. Interestingly, the outcomes of prior research efforts indicate that the exact opposite is true in the case of sound event detection, leading us to conclude that the optimal choice of visual embeddings is strongly dependent on the task at hand. More specifically, visual features focusing on semantics appear appropriate in the context of automated audio captioning, while for sound event detection, time information seems to be more important.
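To make the architecture described in the abstract concrete, the following is a minimal PyTorch sketch of a multi-encoder transformer captioner: one encoder per modality (audio and pretrained visual embeddings), with a decoder that cross-attends to the concatenated encoder outputs. All dimensions, layer counts, and the fusion-by-concatenation choice are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MultiEncoderCaptioner(nn.Module):
    """Illustrative multi-encoder transformer for audio captioning with
    visual assistance. Hypothetical dimensions; not the paper's exact model."""

    def __init__(self, audio_dim=64, visual_dim=512, d_model=128,
                 vocab_size=1000, nhead=4, num_layers=2):
        super().__init__()
        # Separate projections map each modality into the shared model space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        # One transformer encoder per modality (the "multi-encoder" part);
        # TransformerEncoder deep-copies the layer, so weights are not shared.
        self.audio_enc = nn.TransformerEncoder(enc_layer, num_layers)
        self.visual_enc = nn.TransformerEncoder(enc_layer, num_layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, audio, visual, tokens):
        # Encode each stream, then let the decoder cross-attend to the
        # concatenation of both encoder outputs along the time axis.
        a = self.audio_enc(self.audio_proj(audio))
        v = self.visual_enc(self.visual_proj(visual))
        memory = torch.cat([a, v], dim=1)
        tgt = self.embed(tokens)
        return self.out(self.decoder(tgt, memory))

model = MultiEncoderCaptioner()
audio = torch.randn(2, 100, 64)    # batch of log-mel-like audio frames
visual = torch.randn(2, 10, 512)   # batch of pretrained visual embeddings
tokens = torch.randint(0, 1000, (2, 15))  # caption token ids so far
logits = model(audio, visual, tokens)
print(logits.shape)  # torch.Size([2, 15, 1000])
```

Swapping the kind of pretrained visual embedding fed into `visual` (semantics-oriented vs. temporally-oriented features) is the comparison the abstract reports on.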
