Paper Title
Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for End-to-End Speech Recognition
Paper Authors
Paper Abstract
While transformers and their conformer variants show promising performance in speech recognition, their heavy parameterization incurs substantial memory cost during training and inference. Some works use cross-layer weight sharing to reduce model parameters, but the inevitable loss of capacity harms model performance. To address this issue, this paper proposes a parameter-efficient conformer that shares sparsely-gated experts. Specifically, we use a sparsely-gated mixture-of-experts (MoE) layer to extend the capacity of a conformer block without increasing computation. The parameters of grouped conformer blocks are then shared to reduce the total parameter count. Next, so that the shared blocks retain the flexibility to adapt representations at different levels, each block keeps its own MoE router and normalization layers. Moreover, we use knowledge distillation to further improve performance. Experimental results show that the proposed model achieves performance competitive with the full-parameter model while using only 1/3 of the encoder parameters.
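To make the parameter-sharing scheme concrete, the sketch below (a hypothetical NumPy illustration, not the authors' implementation; all function and parameter names are assumptions) shows a group of layers whose expert feed-forward weights are shared, while each layer in the group keeps its own router and normalization gain, as the abstract describes:

```python
import numpy as np

def make_params(d_model, num_experts, group_size, seed=0):
    """Build one parameter group: shared expert weights plus per-layer
    routers and layer-norm gains (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    return {
        # Expert FFN weights: ONE set, shared by all layers in the group.
        "experts": [(rng.standard_normal((d_model, 4 * d_model)) * 0.1,
                     rng.standard_normal((4 * d_model, d_model)) * 0.1)
                    for _ in range(num_experts)],
        # Routers and layer-norm gains: one per layer, NOT shared, so each
        # depth can still adapt its representations.
        "routers": [rng.standard_normal((d_model, num_experts)) * 0.1
                    for _ in range(group_size)],
        "ln_gain": [np.ones(d_model) for _ in range(group_size)],
    }

def moe_block(x, params, layer_idx):
    """Top-1 sparsely-gated MoE feed-forward with a residual connection.
    x: (time, d_model). Each frame is dispatched to exactly one expert,
    so compute stays roughly constant as the number of experts grows."""
    g = params["ln_gain"][layer_idx]
    h = g * (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)
    logits = h @ params["routers"][layer_idx]
    gates = np.exp(logits - logits.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)
    choice = gates.argmax(-1)                     # top-1 expert per frame
    out = np.zeros_like(h)
    for e, (w1, w2) in enumerate(params["experts"]):
        idx = np.where(choice == e)[0]
        if idx.size:                              # run only the routed frames
            out[idx] = gates[idx, e:e + 1] * (np.maximum(h[idx] @ w1, 0) @ w2)
    return x + out
```

Because `moe_block` is called with a `layer_idx`, the same expert weights serve every layer of the group, while the per-layer routers decide how each depth mixes them.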