Paper Title
Monarch: Expressive Structured Matrices for Efficient and Accurate Training
Paper Authors
Paper Abstract
Large neural networks excel in many domains, but they are expensive to train and fine-tune. A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones (e.g., sparse, low-rank, Fourier transform). These methods have not seen widespread adoption (1) in end-to-end training due to unfavorable efficiency-quality tradeoffs, and (2) in dense-to-sparse fine-tuning due to lack of tractable algorithms to approximate a given dense weight matrix. To address these issues, we propose a class of matrices (Monarch) that is hardware-efficient (they are parameterized as products of two block-diagonal matrices for better hardware utilization) and expressive (they can represent many commonly used transforms). Surprisingly, the problem of approximating a dense weight matrix with a Monarch matrix, though nonconvex, has an analytical optimal solution. These properties of Monarch matrices unlock new ways to train and fine-tune sparse and dense models. We empirically validate that Monarch can achieve favorable accuracy-efficiency tradeoffs in several end-to-end sparse training applications: speeding up ViT and GPT-2 training on ImageNet classification and Wikitext-103 language modeling by 2x with comparable model quality, and reducing the error on PDE solving and MRI reconstruction tasks by 40%. In sparse-to-dense training, with a simple technique called "reverse sparsification," Monarch matrices serve as a useful intermediate representation to speed up GPT-2 pretraining on OpenWebText by 2x without quality drop. The same technique brings 23% faster BERT pretraining than even the very optimized implementation from Nvidia that set the MLPerf 1.1 record. In dense-to-sparse fine-tuning, as a proof-of-concept, our Monarch approximation algorithm speeds up BERT fine-tuning on GLUE by 1.7x with comparable accuracy.
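The parameterization described in the abstract (a product of two block-diagonal matrices, up to permutation) can be illustrated with a short sketch. The NumPy snippet below is not the authors' implementation; it assumes n = b^2, square b x b blocks, and one common convention in which the fixed permutation is a reshape-and-transpose applied between the two block-diagonal factors. The function name monarch_matvec and the exact permutation placement are assumptions made for this example.

import numpy as np

def monarch_matvec(L_blocks, R_blocks, x):
    # L_blocks, R_blocks: arrays of shape (b, b, b) holding the b diagonal
    # blocks of the block-diagonal factors L and R; x: vector of length n = b*b.
    b = L_blocks.shape[0]
    x = x.reshape(b, b)                       # split x into b chunks of size b
    x = np.einsum("kij,kj->ki", R_blocks, x)  # block-diagonal multiply by R
    x = x.T                                   # fixed permutation (reshape-transpose)
    x = np.einsum("kij,kj->ki", L_blocks, x)  # block-diagonal multiply by L
    x = x.T                                   # apply the transpose permutation again (its own inverse here)
    return x.reshape(-1)

# Example: b = 32 gives a 1024 x 1024 structured matrix stored with
# 2 * 32 * 32^2 = 65,536 parameters instead of 1024^2 = 1,048,576 for dense.
b = 32
rng = np.random.default_rng(0)
L_blocks = rng.standard_normal((b, b, b))
R_blocks = rng.standard_normal((b, b, b))
y = monarch_matvec(L_blocks, R_blocks, rng.standard_normal(b * b))

Each step is either a permutation or a batch of small dense matrix multiplications, which is the hardware-utilization point the abstract makes; the precise placement and form of the permutations vary across presentations of Monarch matrices.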