Paper Title

TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech

Authors

Andy T. Liu, Shang-Wen Li, Hung-yi Lee

Abstract

We introduce a self-supervised speech pre-training method called TERA, which stands for Transformer Encoder Representations from Alteration. Recent approaches often learn by using a single auxiliary task like contrastive prediction, autoregressive prediction, or masked reconstruction. Unlike previous methods, we use alteration along three orthogonal axes to pre-train Transformer Encoders on a large amount of unlabeled speech. The model learns through the reconstruction of acoustic frames from their altered counterparts, where we use a stochastic policy to alter along various dimensions: time, frequency, and magnitude. TERA can be used for speech representation extraction or fine-tuning with downstream models. We evaluate TERA on several downstream tasks, including phoneme classification, keyword spotting, speaker recognition, and speech recognition. We present a large-scale comparison of various self-supervised models. TERA achieves strong performance in the comparison by improving upon surface features and outperforming previous models. In our experiments, we study the effect of applying different alteration techniques, pre-training on more data, and pre-training on various features. We analyze different model sizes and find that smaller models are stronger representation learners than larger models, while larger models are more effective for downstream fine-tuning than smaller models. Furthermore, we show that the proposed method is transferable to downstream datasets not used in pre-training.
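To make the pre-training objective more concrete, the sketch below illustrates one way the three-axis alteration described in the abstract could be applied to a log-mel spectrogram before the model reconstructs the original frames. This is a minimal sketch under assumed settings: the function name `alter_spectrogram` and the parameters (`p_time`, `time_mask_width`, `freq_mask_width`, `noise_std`, etc.) are hypothetical and do not reproduce the authors' exact stochastic policy.

```python
import numpy as np

def alter_spectrogram(spec, time_mask_width=7, freq_mask_width=8,
                      noise_std=0.2, p_time=0.15, p_freq=0.2, p_noise=0.1,
                      rng=None):
    """Apply stochastic alterations along time, frequency, and magnitude
    to a (num_frames, num_mels) log-mel spectrogram.

    Illustrative only: parameter values and the exact masking scheme are
    assumptions, not the paper's implementation.
    """
    rng = rng if rng is not None else np.random.default_rng()
    altered = spec.copy()
    num_frames, num_mels = spec.shape

    # Time alteration: zero out contiguous blocks of frames so that roughly
    # p_time of the frames are altered.
    n_time_blocks = int(num_frames * p_time / time_mask_width)
    for _ in range(n_time_blocks):
        start = rng.integers(0, max(1, num_frames - time_mask_width))
        altered[start:start + time_mask_width, :] = 0.0

    # Frequency alteration: occasionally mask a contiguous band of mel channels.
    if rng.random() < p_freq:
        start = rng.integers(0, max(1, num_mels - freq_mask_width))
        altered[:, start:start + freq_mask_width] = 0.0

    # Magnitude alteration: occasionally add Gaussian noise to all frames.
    if rng.random() < p_noise:
        altered += rng.normal(0.0, noise_std, size=altered.shape)

    # The pre-training target is to reconstruct `spec` from `altered`.
    return altered
```

A pre-training step would feed `altered` through the Transformer encoder and compute a reconstruction loss (e.g., L1) against the unaltered `spec`.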
