Paper Title

DeepCuts: Single-Shot Interpretability based Pruning for BERT

Paper Authors

Jasdeep Singh Grover, Bhavesh Gawri, Ruskin Raj Manku

Paper Abstract

As language models have grown in parameters and layers, it has become much harder to train and run inference with them on a single GPU. This severely restricts the availability of large language models such as GPT-3, BERT-Large, and many others. A common technique for addressing this problem is pruning the network architecture by removing transformer heads, fully-connected weights, and other modules. The main challenge is to discern the important parameters from the less important ones. Our goal is to find strong metrics for identifying such parameters. We thus propose two strategies for computing importance scores: Cam-Cut, based on GradCAM interpretations, and Smooth-Cut, based on SmoothGrad. Through this work, we show that our scoring functions assign more relevant, task-based scores to the network parameters, and thus both of our pruning approaches significantly outperform standard weight- and gradient-based strategies, especially at higher compression ratios in BERT-based models. We also analyze our pruning masks and find them to be significantly different from those obtained using standard metrics.
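
As a rough illustration of the Smooth-Cut idea, the sketch below averages parameter gradients over several noise-perturbed copies of the input, SmoothGrad-style, and then zeroes the lowest-scoring weights in a single shot. This is a minimal sketch, not the paper's implementation: the function names (smoothgrad_importance, single_shot_prune), the |gradient × weight| scoring rule, the hyperparameters, and the toy feed-forward model are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

def smoothgrad_importance(model, inputs, targets, loss_fn, n_samples=8, noise_std=0.1):
    # Accumulate |gradient * weight| over several noise-perturbed copies of
    # the input, SmoothGrad-style. This scoring rule is an assumption for
    # illustration; the paper's exact Smooth-Cut score may differ.
    scores = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    for _ in range(n_samples):
        noisy = inputs + noise_std * torch.randn_like(inputs)
        model.zero_grad()
        loss_fn(model(noisy), targets).backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                scores[name] += (p.grad * p).abs().detach() / n_samples
    return scores

def single_shot_prune(model, scores, sparsity=0.5):
    # Zero out the `sparsity` fraction of weights with the lowest importance
    # scores in one shot (no iterative prune-retrain cycles).
    flat = torch.cat([s.flatten() for s in scores.values()])
    threshold = torch.quantile(flat, sparsity)
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.mul_((scores[name] > threshold).float())

# Toy usage on a small feed-forward model (a hypothetical stand-in for a BERT sub-module).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
x, y = torch.randn(4, 16), torch.tensor([0, 1, 0, 1])
scores = smoothgrad_importance(model, x, y, nn.CrossEntropyLoss())
single_shot_prune(model, scores, sparsity=0.5)
```

A Cam-Cut variant would follow the same pruning step but derive the scores from GradCAM-style attributions instead of noise-averaged gradients.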
