Paper Title

Efficient Long Sequence Modeling via State Space Augmented Transformer

Paper Authors

Simiao Zuo, Xiaodong Liu, Jian Jiao, Denis Charles, Eren Manavoglu, Tuo Zhao, Jianfeng Gao

Paper Abstract

Transformer models have achieved superior performance on a variety of natural language processing tasks. However, the quadratic computational cost of the attention mechanism limits their practicality for long sequences. Existing attention variants improve computational efficiency, but their ability to compute global information effectively is limited. In parallel to Transformer models, state space models (SSMs) are tailored for long sequences, but they are not flexible enough to capture complicated local information. We propose SPADE, short for $\underline{\textbf{S}}$tate s$\underline{\textbf{P}}$ace $\underline{\textbf{A}}$ugmente$\underline{\textbf{D}}$ Transform$\underline{\textbf{E}}$r. Specifically, we augment the bottom layer of SPADE with an SSM and employ efficient local attention methods for the other layers. The SSM provides global information, which compensates for the lack of long-range dependencies in local attention methods. Experimental results on the Long Range Arena benchmark and on language modeling tasks demonstrate the effectiveness of the proposed method. To further demonstrate the scalability of SPADE, we pre-train large encoder-decoder models and present fine-tuning results on natural language understanding and natural language generation tasks.
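To make the layer layout described in the abstract concrete, below is a minimal PyTorch sketch, not the paper's implementation: the bottom block fuses a state space branch (global information) with windowed local attention, while the upper blocks use local attention only. The toy `SimpleSSM`, the window size of 64, and the concatenation-plus-linear fusion are illustrative assumptions rather than the paper's actual design choices.

```python
# Minimal sketch of the SPADE layout described in the abstract:
# - bottom layer: local (windowed) attention augmented with an SSM branch,
# - upper layers: local attention only.
# SimpleSSM is a toy diagonal state space stand-in, NOT the SSM used in the paper;
# the fusion by concatenation and the hyperparameters are illustrative assumptions.

import torch
import torch.nn as nn


class SimpleSSM(nn.Module):
    """Toy diagonal state space layer: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""

    def __init__(self, dim):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(dim))  # per-channel decay (pre-sigmoid)
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x):                            # x: (batch, seq, dim)
        a = torch.sigmoid(self.log_a)                # keep the recurrence stable in (0, 1)
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.size(1)):                   # linear-time scan over the sequence
            h = a * h + self.b * x[:, t]
            outs.append(self.c * h)
        return torch.stack(outs, dim=1)


class LocalAttention(nn.Module):
    """Windowed self-attention: each position attends only within its chunk."""

    def __init__(self, dim, num_heads=4, window=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.window = window

    def forward(self, x):                            # x: (batch, seq, dim)
        b, n, d = x.shape
        pad = (-n) % self.window
        if pad:                                      # pad so the length splits into windows
            x = torch.cat([x, x.new_zeros(b, pad, d)], dim=1)
        chunks = x.reshape(b * ((n + pad) // self.window), self.window, d)
        out, _ = self.attn(chunks, chunks, chunks)   # attention restricted to each window
        return out.reshape(b, n + pad, d)[:, :n]


class SPADEBlock(nn.Module):
    """Transformer block; the bottom block additionally carries an SSM branch."""

    def __init__(self, dim, use_ssm=False):
        super().__init__()
        self.use_ssm = use_ssm
        self.local = LocalAttention(dim)
        if use_ssm:
            self.ssm = SimpleSSM(dim)
            self.fuse = nn.Linear(2 * dim, dim)      # combine local and global branches
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        local = self.local(h)
        if self.use_ssm:
            mixed = self.fuse(torch.cat([local, self.ssm(h)], dim=-1))
        else:
            mixed = local
        x = x + mixed
        return x + self.ffn(self.norm2(x))


class SPADEEncoder(nn.Module):
    """Stack of blocks: SSM-augmented bottom layer, local-attention-only above."""

    def __init__(self, dim=128, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [SPADEBlock(dim, use_ssm=(i == 0)) for i in range(depth)]
        )

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x


if __name__ == "__main__":
    model = SPADEEncoder()
    out = model(torch.randn(2, 300, 128))            # batch of 2, sequence length 300
    print(out.shape)                                 # torch.Size([2, 300, 128])
```

The point of the split is cost: the SSM recurrence is linear in the sequence length, so the global branch scales to long inputs, while the attention cost stays bounded by the window size instead of growing quadratically with the full sequence.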
