Paper Title

Multi-Head Adapter Routing for Cross-Task Generalization

Paper Authors

Lucas Caccia, Edoardo Ponti, Zhan Su, Matheus Pereira, Nicolas Le Roux, Alessandro Sordoni

Paper Abstract

Parameter-efficient fine-tuning (PEFT) for cross-task generalization consists in pre-training adapters on a multi-task training set before few-shot adaptation to test tasks. Polytropon [Ponti et al., 2023] ($\texttt{Poly}$) jointly learns an inventory of adapters and a routing function that selects a (variable-size) subset of adapters for each task during both pre-training and few-shot adaptation. In this paper, we investigate the role that adapter routing plays in its success and design new variants based on our findings. First, we build on the intuition that finer-grained routing provides more expressivity. Hence, we propose $\texttt{MHR}$ (Multi-Head Routing), which combines subsets of adapter parameters and outperforms $\texttt{Poly}$ under a comparable parameter budget; by only fine-tuning the routing function and not the adapters ($\texttt{MHR}$-$z$) we achieve competitive performance with extreme parameter efficiency. Second, we find that $\texttt{Poly}$/$\texttt{MHR}$ performance is a result of better multi-task optimization, rather than modular inductive biases that facilitate adapter recombination and local adaptation, as previously hypothesized. In fact, we find that $\texttt{MHR}$ exhibits high gradient alignment between training tasks. We find that routing is most beneficial during multi-task pre-training rather than during few-shot adaptation and propose $\texttt{MHR}$-$\mu$, which discards routing and fine-tunes the average of the pre-trained adapters on each downstream task. This establishes $\texttt{MHR}$-$\mu$ as an effective method for single-adapter fine-tuning. We also show that $\texttt{MHR}$-$\mu$ can be used as an effective zero-shot transfer method by training the average of the pre-trained adapters for a few additional steps on the multi-task training set: this yields gains of up to 3% in absolute accuracy w.r.t. the baselines.
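To make the routing idea concrete, below is a minimal PyTorch sketch of a frozen linear layer augmented with an inventory of LoRA-style adapters and per-task, per-head routing in the spirit of $\texttt{Poly}$/$\texttt{MHR}$. The class name `MHRLoRALinear`, the tensor shapes, and the softmax parameterization of the routing logits are illustrative assumptions, not the authors' released implementation; the sketch only shows how finer-grained (per-head) mixing vectors recombine subsets of adapter parameters.

```python
import torch
import torch.nn as nn


class MHRLoRALinear(nn.Module):
    """Frozen linear layer augmented with an inventory of LoRA adapters.

    Poly learns one mixing vector per task over the inventory; MHR splits the
    input into `n_heads` blocks and learns a separate mixing vector per head,
    recombining subsets of adapter parameters for finer-grained routing.
    (Illustrative sketch; shapes and names are assumptions.)
    """

    def __init__(self, in_features, out_features, n_tasks,
                 n_skills=8, n_heads=4, rank=16):
        super().__init__()
        assert in_features % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = in_features // n_heads
        # Stand-in for a frozen pre-trained weight matrix.
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features) * 0.02, requires_grad=False)
        # Inventory of LoRA factors: one A block per (skill, head), one B per skill.
        self.lora_A = nn.Parameter(
            torch.randn(n_skills, n_heads, self.head_dim, rank) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(n_skills, rank, out_features))
        # Per-task, per-head routing logits (the only trainable tensor in MHR-z).
        self.routing = nn.Parameter(torch.zeros(n_tasks, n_heads, n_skills))

    def forward(self, x, task_ids):
        # x: (batch, in_features); task_ids: (batch,) integer task indices.
        base = x @ self.weight.T
        probs = torch.softmax(self.routing[task_ids], dim=-1)      # (b, h, s)
        # Each head mixes the inventory with its own distribution over skills.
        A = torch.einsum("bhs,shdr->bhdr", probs, self.lora_A)     # (b, h, d, r)
        B = torch.einsum("bs,sro->bro", probs.mean(dim=1), self.lora_B)
        x_heads = x.view(x.shape[0], self.n_heads, self.head_dim)  # (b, h, d)
        low_rank = torch.einsum("bhd,bhdr->br", x_heads, A)        # (b, r)
        return base + torch.einsum("br,bro->bo", low_rank, B)      # (b, out)
```

In this sketch, freezing `lora_A`/`lora_B` and training only `routing` corresponds to the $\texttt{MHR}$-$z$ setting described above, while discarding `routing` and fine-tuning the skill-averaged adapter on a downstream task corresponds to $\texttt{MHR}$-$\mu$.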
