Paper Title
Mixture-of-Experts with Expert Choice Routing
Paper Authors
Paper Abstract
Sparsely-activated Mixture-of-experts (MoE) models allow the number of parameters to greatly increase while keeping the amount of computation for a given token or a given sample unchanged. However, a poor expert routing strategy (e.g. one resulting in load imbalance) can cause certain experts to be under-trained, leading to an expert being under or over-specialized. Prior work allocates a fixed number of experts to each token using a top-k function regardless of the relative importance of different tokens. To address this, we propose a heterogeneous mixture-of-experts employing an expert choice method. Instead of letting tokens select the top-k experts, we have experts selecting the top-k tokens. As a result, each token can be routed to a variable number of experts and each expert can have a fixed bucket size. We systematically study pre-training speedups using the same computational resources of the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2x. For the same computational cost, our method demonstrates higher performance in fine-tuning 11 selected tasks in the GLUE and SuperGLUE benchmarks. For a smaller activation cost, our method outperforms the T5 dense model in 7 out of the 11 tasks.
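The following is a minimal sketch of the expert-choice idea described in the abstract, not the authors' implementation: each expert (rather than each token) selects its top-k tokens from a token-to-expert affinity matrix, so every expert receives a fixed bucket of tokens while a given token may be picked by a variable number of experts. The names `w_gate` and `capacity_factor`, and the use of NumPy, are illustrative assumptions.

```python
# Minimal sketch of expert-choice routing (illustrative, not the paper's code).
# Assumes n tokens of hidden size d, e experts, and a per-expert bucket size
# k derived from a capacity factor: k = n * capacity_factor / e.
import numpy as np

def expert_choice_routing(x, w_gate, capacity_factor=2.0):
    """x: [n, d] token representations; w_gate: [d, e] router weights."""
    n, _ = x.shape
    e = w_gate.shape[1]
    k = int(n * capacity_factor / e)  # fixed bucket size per expert

    # Token-to-expert affinity scores, softmax-normalized over experts.
    logits = x @ w_gate                                       # [n, e]
    scores = np.exp(logits - logits.max(axis=1, keepdims=True))
    scores = scores / scores.sum(axis=1, keepdims=True)

    # Each expert (column) picks its top-k tokens, instead of each token
    # picking its top-k experts. A token may therefore be routed to zero,
    # one, or several experts.
    gating = scores.T                                         # [e, n]
    topk_idx = np.argsort(-gating, axis=1)[:, :k]             # [e, k] token ids
    topk_gate = np.take_along_axis(gating, topk_idx, axis=1)  # [e, k] weights
    return topk_idx, topk_gate

# Example: 8 tokens, hidden size 4, 4 experts -> each expert takes 4 tokens.
rng = np.random.default_rng(0)
idx, gate = expert_choice_routing(rng.normal(size=(8, 4)),
                                  rng.normal(size=(4, 4)))
print(idx.shape, gate.shape)  # (4, 4) (4, 4)
```

Because the bucket size is fixed per expert, load balance holds by construction, which is the property the abstract contrasts with token-choice top-k gating.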