Paper Title

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

Paper Authors

Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He

Paper Abstract

As the training of giant dense models hits the boundary on the availability and capability of the hardware resources today, Mixture-of-Experts (MoE) models have become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model. Their training cost savings have been demonstrated from encoder-decoder models (prior works) to a 5x saving for auto-regressive language models (this work along with parallel explorations). However, due to the much larger model size and unique architecture, how to provide fast MoE model inference remains challenging and unsolved, limiting their practical usage. To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions. DeepSpeed-MoE offers unprecedented scale and efficiency to serve massive MoE models with up to 4.5x faster and 9x cheaper inference compared to quality-equivalent dense models. We hope our innovations and systems help open a promising path to new directions in the large model landscape, a shift from dense to sparse MoE models, where training and deploying higher-quality models with fewer resources becomes more widely possible.
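
To make the MoE architecture referenced in the abstract concrete, below is a minimal sketch of a top-1 gated Mixture-of-Experts feed-forward layer in PyTorch. It only illustrates the general routing idea, not the DeepSpeed-MoE implementation (which additionally includes the novel architecture designs, compression techniques, and optimized inference system described above); the class name SimpleMoE and all dimensions are assumptions made for this example.

```python
# Minimal top-1 gated Mixture-of-Experts feed-forward layer (illustrative sketch,
# NOT the DeepSpeed-MoE implementation). Names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMoE(nn.Module):
    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int):
        super().__init__()
        # Gating network: scores each token against every expert.
        self.gate = nn.Linear(hidden_dim, num_experts)
        # Independent feed-forward experts; only one is applied per token (top-1 routing).
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, hidden_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, hidden_dim) -- batch/sequence dims flattened for simplicity.
        gate_probs = F.softmax(self.gate(x), dim=-1)   # (tokens, num_experts)
        top_prob, top_idx = gate_probs.max(dim=-1)     # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Scale each expert output by its gate probability.
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    # Example: route 8 tokens of width 16 through 4 experts.
    layer = SimpleMoE(hidden_dim=16, ffn_dim=64, num_experts=4)
    tokens = torch.randn(8, 16)
    print(layer(tokens).shape)  # torch.Size([8, 16])
```

Because each token activates only one expert, the per-token compute stays close to that of a single dense feed-forward block even as the number of experts (and thus total parameters) grows, which is the source of the training cost savings the abstract describes.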
