Paper Title

AdapLeR: Speeding up Inference by Adaptive Length Reduction

Paper Authors

Ali Modarressi, Hosein Mohebbi, Mohammad Taher Pilehvar

Paper Abstract

Pre-trained language models have shown stellar performance in various downstream tasks. However, this usually comes at the cost of high latency and computation, hindering their usage in resource-limited settings. In this work, we propose a novel approach for reducing the computational cost of BERT with minimal loss in downstream performance. Our method dynamically eliminates less contributing tokens through layers, resulting in shorter lengths and consequently lower computational cost. To determine the importance of each token representation, we train a Contribution Predictor for each layer using a gradient-based saliency method. Our experiments on several diverse classification tasks show speedups of up to 22x during inference without much sacrifice in performance. We also validate the quality of the selected tokens in our method using human annotations in the ERASER benchmark. In comparison to other widely used strategies for selecting important tokens, such as saliency and attention, our proposed method has a significantly lower false positive rate in generating rationales. Our code is freely available at https://github.com/amodaresi/AdapLeR.
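To make the idea concrete, below is a minimal sketch of adaptive length reduction: a small per-layer contribution predictor scores each token's hidden state, and tokens scoring below a threshold are dropped before the next encoder layer, so the sequence shrinks as it moves through the model. This is not the authors' implementation (see the GitHub repository for the real AdapLeR code); the module names, predictor architecture, threshold, and shapes here are illustrative assumptions, and gradient-based saliency training of the predictors is omitted.

```python
# Illustrative sketch only; not the AdapLeR implementation.
import torch
import torch.nn as nn


class ContributionPredictor(nn.Module):
    """Tiny per-layer scorer mapping each token's hidden state to a
    contribution score in [0, 1] (assumed architecture)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden) -> scores: (batch, seq_len)
        return torch.sigmoid(self.scorer(hidden_states)).squeeze(-1)


def reduce_length(hidden_states: torch.Tensor,
                  scores: torch.Tensor,
                  threshold: float = 0.1) -> torch.Tensor:
    """Keep only tokens whose score exceeds `threshold`.
    Assumes batch size 1 for simplicity; [CLS] (position 0) is always kept."""
    keep = scores[0] > threshold
    keep[0] = True  # never drop the [CLS] token
    return hidden_states[:, keep, :]


# Usage sketch: interleave scoring and pruning between encoder layers.
# `encoder_layers` stands in for BERT's transformer stack.
hidden_size, seq_len, num_layers = 768, 128, 12
encoder_layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(hidden_size, nhead=12, batch_first=True)
     for _ in range(num_layers)]
)
predictors = nn.ModuleList(
    [ContributionPredictor(hidden_size) for _ in range(num_layers)]
)

x = torch.randn(1, seq_len, hidden_size)
for layer, predictor in zip(encoder_layers, predictors):
    x = layer(x)
    x = reduce_length(x, predictor(x))  # sequence shortens layer by layer
print(x.shape)  # (1, remaining_tokens, hidden_size)
```

Because pruned tokens never reach the deeper layers, the quadratic self-attention cost falls with the sequence length, which is where the reported inference speedups come from.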
