Paper Title

Investigating Modality Bias in Audio Visual Video Parsing

Paper Authors

Piyush Singh Pasi, Shubham Nemani, Preethi Jyothi, Ganesh Ramakrishnan

Paper Abstract

We focus on the audio-visual video parsing (AVVP) problem that involves detecting audio and visual event labels with temporal boundaries. The task is especially challenging since it is weakly supervised with only event labels available as a bag of labels for each video. An existing state-of-the-art model for AVVP uses a hybrid attention network (HAN) to generate cross-modal features for both audio and visual modalities, and an attentive pooling module that aggregates predicted audio and visual segment-level event probabilities to yield video-level event probabilities. We provide a detailed analysis of modality bias in the existing HAN architecture, where a modality is completely ignored during prediction. We also propose a variant of feature aggregation in HAN that leads to an absolute gain in F-scores of about 2% and 1.6% for visual and audio-visual events at both segment-level and event-level, in comparison to the existing HAN model.
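
To make the attentive pooling step concrete, below is a minimal PyTorch sketch of how segment-level event probabilities from the audio and visual streams can be aggregated into video-level event probabilities under weak (bag-of-labels) supervision. All module and tensor names here (e.g. `AttentiveMILPooling`, `fc_prob`, `fc_frame_att`) are illustrative assumptions, not the authors' code; the sketch only follows the general recipe in the abstract: per-segment sigmoid probabilities weighted by softmax attention over time and over modalities, then summed.

```python
# Minimal sketch (assumed names and shapes, not the authors' implementation)
# of attentive MIL pooling for AVVP: segment-level audio/visual event
# probabilities are aggregated into video-level event probabilities.
import torch
import torch.nn as nn


class AttentiveMILPooling(nn.Module):
    def __init__(self, feat_dim: int = 512, num_classes: int = 25):
        super().__init__()
        self.fc_prob = nn.Linear(feat_dim, num_classes)       # per-segment event probabilities
        self.fc_frame_att = nn.Linear(feat_dim, num_classes)  # attention over time
        self.fc_mod_att = nn.Linear(feat_dim, num_classes)    # attention over modalities

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio, visual: (batch, T, feat_dim) cross-modal features, e.g. HAN outputs.
        x = torch.stack([audio, visual], dim=2)                   # (B, T, 2, D)
        seg_prob = torch.sigmoid(self.fc_prob(x))                 # (B, T, 2, C)
        frame_att = torch.softmax(self.fc_frame_att(x), dim=1)    # normalize over segments
        mod_att = torch.softmax(self.fc_mod_att(x), dim=2)        # normalize over {audio, visual}
        # Weight each segment's probabilities by temporal and modality attention,
        # then sum over modalities and time to get video-level event probabilities.
        video_prob = (frame_att * mod_att * seg_prob).sum(dim=(1, 2))  # (B, C)
        # Clamping keeps the outputs valid probabilities for a BCE loss
        # against the weak video-level bag of labels.
        return seg_prob, video_prob.clamp(max=1.0)


if __name__ == "__main__":
    B, T, D, C = 2, 10, 512, 25
    pool = AttentiveMILPooling(D, C)
    a, v = torch.randn(B, T, D), torch.randn(B, T, D)
    seg_prob, video_prob = pool(a, v)
    print(seg_prob.shape, video_prob.shape)  # torch.Size([2, 10, 2, 25]) torch.Size([2, 25])
```

Because the modality attention is a learned softmax over the two streams, it can collapse onto a single modality during training; that failure mode is exactly the modality bias the paper analyzes, and its proposed variant changes how the features are aggregated to mitigate it.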
