Paper Title
Successfully Applying the Stabilized Lottery Ticket Hypothesis to the Transformer Architecture
Paper Authors
Paper Abstract
Sparse models require less memory for storage and enable faster inference by reducing the necessary number of FLOPs. This is relevant both for time-critical and on-device computations using neural networks. The stabilized lottery ticket hypothesis states that networks can be pruned after no or only a few training iterations, using a mask computed based on the unpruned converged model. On the transformer architecture and the WMT 2014 English-to-German and English-to-French tasks, we show that stabilized lottery ticket pruning performs similarly to magnitude pruning for sparsity levels of up to 85%, and propose a new combination of pruning techniques that outperforms all other techniques for even higher levels of sparsity. Furthermore, we confirm that a parameter's initial sign, and not its specific value, is the primary factor for successful training, and show that magnitude pruning can be used to find winning lottery tickets.
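The magnitude pruning referenced in the abstract can be sketched as follows: weights with the smallest absolute values are zeroed out, and the resulting binary mask can then be applied to a model rewound to its (near-)initial weights, as in lottery ticket pruning. This is a minimal illustrative sketch, not the paper's actual implementation; the function name and numpy-based setup are assumptions.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a binary mask keeping the largest-magnitude weights.

    sparsity: fraction of weights to remove (e.g. 0.85 for 85% sparsity).
    """
    k = int(weights.size * sparsity)  # number of weights to prune
    if k == 0:
        return np.ones_like(weights)
    # k-th smallest absolute value serves as the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

# Example: prune 85% of a small random weight matrix
w = np.random.randn(4, 5)
mask = magnitude_prune(w, 0.85)   # 1.0 where a weight is kept, 0.0 where pruned
sparse_w = w * mask
```

In lottery ticket pruning, the mask would be computed from the converged model's weights but applied to the weights from (near) initialization; in plain magnitude pruning, it is applied to the converged weights directly.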