Paper Title

Investigation of Ensemble features of Self-Supervised Pretrained Models for Automatic Speech Recognition

Paper Authors

A. Arunkumar, Vrunda N. Sukhadia, S. Umesh

Paper Abstract

Self-supervised learning (SSL) based models have been shown to generate powerful representations that can be used to improve the performance of downstream speech tasks. Several state-of-the-art SSL models are available, and each of these models optimizes a different loss, which gives rise to the possibility of their features being complementary. This paper proposes using an ensemble of such SSL representations and models, which exploits the complementary nature of the features extracted by the various pretrained models. We hypothesize that this results in a richer feature representation, and we show results for the ASR downstream task. To this end, we use three SSL models that have shown excellent results on ASR tasks, namely HuBERT, wav2vec 2.0, and WavLM. We explore both an ensemble of models fine-tuned for the ASR task and an ensemble of features using the embeddings obtained from the pretrained models for the downstream ASR task. We obtain improved performance over the individual models and pretrained features using the LibriSpeech (100h) and WSJ datasets for the downstream tasks.
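The feature-ensemble idea described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical example built on the Hugging Face transformers library; the base checkpoints named here are assumptions chosen for illustration, and the paper's actual fusion scheme may differ from this simple frame-wise concatenation.

import torch
from transformers import HubertModel, Wav2Vec2Model, WavLMModel

# Assumed base checkpoints; all three models use the same 20 ms frame rate,
# so their output sequences align frame by frame for the same input audio.
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()
wav2vec2 = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base").eval()

@torch.no_grad()
def ensemble_features(waveform: torch.Tensor) -> torch.Tensor:
    # waveform: (batch, samples) of raw 16 kHz audio.
    # Returns (batch, frames, 3 * hidden_size) concatenated SSL features,
    # which a downstream ASR head could consume in place of any single model's.
    feats = [m(waveform).last_hidden_state for m in (hubert, wav2vec2, wavlm)]
    return torch.cat(feats, dim=-1)

# One second of dummy audio -> roughly 49 frames of 2304-dim features
# (3 models x 768-dim hidden states each for the base variants).
x = torch.randn(1, 16000)
print(ensemble_features(x).shape)

Concatenation is only one plausible way to combine the embeddings; weighted sums or learned per-layer mixing are equally reasonable fusion choices for a downstream ASR head.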
