Paper Title
schuBERT: Optimizing Elements of BERT
Paper Authors
Paper Abstract
Transformers \citep{vaswani2017attention} have gradually become a key component of many state-of-the-art natural language representation models. A recent Transformer-based model, BERT \citep{devlin2018bert}, achieved state-of-the-art results on various natural language processing tasks, including GLUE, SQuAD v1.1, and SQuAD v2.0. This model, however, is computationally prohibitive and has a huge number of parameters. In this work we revisit the architecture choices of BERT in an effort to obtain a lighter model. We focus on reducing the number of parameters, yet our methods can be applied towards other objectives such as FLOPs or latency. We show that much more efficient light BERT models can be obtained by reducing the right architecture design dimensions, chosen algorithmically, rather than the number of Transformer encoder layers. In particular, our schuBERT gives $6.6\%$ higher average accuracy on the GLUE and SQuAD datasets compared to BERT with three encoder layers while having the same number of parameters.
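For intuition about the parameter trade-off described in the abstract, the following Python sketch compares the two shrinking strategies using standard approximate Transformer-encoder parameter counting. It is a minimal illustration, not the authors' code: the helper functions and the slimmed-down design dimensions (hidden size 544, 8 heads, feed-forward size 1088) are assumptions chosen only so that both reduced models land near the same parameter budget; biases, positional and segment embeddings, and LayerNorm parameters are ignored.

    def encoder_layer_params(h, a, k, f):
        """Approximate parameter count of one Transformer encoder layer with
        hidden size h, a attention heads, per-head key/query/value size k,
        and feed-forward inner size f (biases and LayerNorm ignored)."""
        attention = 3 * h * (a * k) + (a * k) * h  # Q, K, V projections plus output projection
        feed_forward = 2 * h * f                   # the two dense layers of the feed-forward block
        return attention + feed_forward

    def bert_params(vocab, h, l, a, k, f):
        """Approximate total parameter count: word embeddings plus l encoder layers."""
        return vocab * h + l * encoder_layer_params(h, a, k, f)

    VOCAB = 30522  # BERT WordPiece vocabulary size

    # BERT-base design dimensions: 12 layers, hidden 768, 12 heads, key size 64, feed-forward 3072.
    base = bert_params(VOCAB, h=768, l=12, a=12, k=64, f=3072)

    # Strategy 1: keep the per-layer dimensions and keep only three encoder layers.
    three_layers = bert_params(VOCAB, h=768, l=3, a=12, k=64, f=3072)

    # Strategy 2 (schuBERT-style in spirit; these particular values are illustrative):
    # keep all twelve layers but shrink the per-layer design dimensions instead.
    slim_layers = bert_params(VOCAB, h=544, l=12, a=8, k=64, f=1088)

    print(f"BERT-base          ~= {base / 1e6:.1f}M parameters")
    print(f"3-layer BERT       ~= {three_layers / 1e6:.1f}M parameters")
    print(f"slim 12-layer BERT ~= {slim_layers / 1e6:.1f}M parameters")

Under this rough counting, both reduced models come out near 44M parameters, which is the kind of equal-budget comparison the abstract refers to; the paper's claim is that choosing which design dimensions to shrink algorithmically, rather than dropping encoder layers, yields better accuracy at that budget.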