Paper Title

Data-Informed Global Sparseness in Attention Mechanisms for Deep Neural Networks

Authors

Ileana Rugina, Rumen Dangovski, Li Jing, Preslav Nakov, Marin Soljačić

Abstract

Attention mechanisms play a crucial role in the neural revolution of Natural Language Processing (NLP). With the growth of attention-based models, several pruning techniques have been developed to identify and exploit sparseness, making these models more efficient. Most efforts focus on hard-coding attention patterns or pruning attention weights based on training data. We propose Attention Pruning (AP), a framework that observes attention patterns in a fixed dataset and generates a global sparseness mask. AP saves 90% of attention computation for language modeling and about 50% for machine translation and GLUE tasks, maintaining result quality. Our method reveals important distinctions between self- and cross-attention patterns, guiding future NLP research. Our framework can reduce both latency and memory requirements for any attention-based model, aiding in the development of improved models for existing or new NLP applications. We have demonstrated this with encoder and autoregressive transformer models using Triton GPU kernels and make our code publicly available at https://github.com/irugina/AP.
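
The abstract describes the AP pipeline only at a high level. Below is a minimal PyTorch sketch of that general idea: average attention maps over a fixed calibration dataset, threshold them into a single global binary mask, and reuse that mask at inference time. The function names, the `output_attentions=True` interface, and the fixed-sequence-length assumption are illustrative, not the authors' actual implementation (which, per the abstract, uses Triton GPU kernels).

```python
# Sketch of the Attention Pruning (AP) idea: (1) average per-head attention
# over a calibration set, (2) keep only the largest entries as a global mask,
# (3) apply that mask inside attention so pruned positions get zero weight.
import torch


@torch.no_grad()
def build_global_mask(model, calib_loader, keep_ratio=0.1):
    """Average attention over a calibration set and keep the top
    `keep_ratio` fraction of positions per head as a global mask."""
    running_sum, n_batches = None, 0
    for batch in calib_loader:
        # Hypothetical interface: assumes the model returns attention
        # weights of shape (batch, heads, seq, seq) for a fixed seq length.
        _, attn = model(batch, output_attentions=True)
        attn = attn.mean(dim=0)                      # average over the batch
        running_sum = attn if running_sum is None else running_sum + attn
        n_batches += 1
    avg_attn = running_sum / n_batches               # (heads, seq, seq)

    # Keep the k largest entries per head; everything else is pruned.
    k = int(keep_ratio * avg_attn[0].numel())
    mask = torch.zeros_like(avg_attn, dtype=torch.bool)
    for h in range(avg_attn.size(0)):
        idx = avg_attn[h].flatten().topk(k).indices
        mask[h].view(-1)[idx] = True
    return mask                                      # reused for all inputs


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with pruned positions set to -inf
    before the softmax (rows pruned entirely would need special handling,
    omitted here for brevity)."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

Because the mask is computed once from a fixed dataset rather than per input, it can be baked into the attention kernel, which is what makes the reported latency and memory savings possible.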
