Paper Title

Optimal Subarchitecture Extraction For BERT

Paper Authors

Adrian de Wynter, Daniel J. Perry

Paper Abstract

We extract an optimal subset of architectural parameters for the BERT architecture from Devlin et al. (2018) by applying recent breakthroughs in algorithms for neural architecture search. This optimal subset, which we refer to as "Bort", is demonstrably smaller, having an effective (that is, not counting the embedding layer) size of $5.5\%$ the original BERT-large architecture, and $16\%$ of the net size. Bort is also able to be pretrained in $288$ GPU hours, which is $1.2\%$ of the time required to pretrain the highest-performing BERT parametric architectural variant, RoBERTa-large (Liu et al., 2019), and about $33\%$ of that of the world-record, in GPU hours, required to train BERT-large on the same hardware. It is also $7.9$x faster on a CPU, as well as being better performing than other compressed variants of the architecture, and some of the non-compressed variants: it obtains performance improvements of between $0.3\%$ and $31\%$, absolute, with respect to BERT-large, on multiple public natural language understanding (NLU) benchmarks.
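The abstract distinguishes Bort's "effective size" (parameters excluding the embedding layer) from its net size. Below is a minimal sketch, not from the paper, of how such a comparison could be reproduced with the Hugging Face transformers library; the model name and the name-based embedding filter are assumptions for illustration only.

```python
# Sketch: count total vs. non-embedding ("effective") parameters of a
# BERT-style model. Assumes the `transformers` and `torch` packages are
# installed; the filter `"embeddings" in name` matches BERT's embedding
# submodules but is a heuristic, not the paper's measurement procedure.
from transformers import AutoModel


def effective_param_count(model_name: str) -> tuple[int, int]:
    """Return (net parameter count, parameter count excluding embeddings)."""
    model = AutoModel.from_pretrained(model_name)
    total = sum(p.numel() for p in model.parameters())
    embedding = sum(
        p.numel() for name, p in model.named_parameters() if "embeddings" in name
    )
    return total, total - embedding


if __name__ == "__main__":
    # The same function could be applied to a compressed variant for comparison.
    net, effective = effective_param_count("bert-large-uncased")
    print(f"net: {net / 1e6:.1f}M parameters, effective: {effective / 1e6:.1f}M")
```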
