Paper Title

Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

Paper Authors

Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, Jianfeng Gao

Paper Abstract

Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters. We show that, in the recently discovered Maximal Update Parametrization (muP), many optimal HPs remain stable even as model size changes. This leads to a new HP tuning paradigm we call muTransfer: parametrize the target model in muP, tune the HP indirectly on a smaller model, and zero-shot transfer them to the full-sized model, i.e., without directly tuning the latter at all. We verify muTransfer on Transformer and ResNet. For example, 1) by transferring pretraining HPs from a model of 13M parameters, we outperform published numbers of BERT-large (350M parameters), with a total tuning cost equivalent to pretraining BERT-large once; 2) by transferring from 40M parameters, we outperform published numbers of the 6.7B GPT-3 model, with tuning cost only 7% of total pretraining cost. A PyTorch implementation of our technique can be found at github.com/microsoft/mup and is installable via `pip install mup`.
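The workflow described in the abstract (parametrize the target model in muP, tune HPs on a small proxy model, then transfer them unchanged to the full-sized model) maps onto the `mup` package roughly as follows. This is a minimal sketch assuming the API documented at github.com/microsoft/mup (`MuReadout`, `set_base_shapes`, `MuAdam`); the toy MLP, the widths, and the learning rate are illustrative and not taken from the paper.

```python
# Minimal muTransfer sketch using the `mup` package (pip install mup).
# Assumption: MuReadout, set_base_shapes, and MuAdam as documented in the
# linked repository; the MLP, widths, and lr below are illustrative only.
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam

class MLP(nn.Module):
    def __init__(self, width=128, d_in=32, d_out=10):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, width), nn.ReLU())
        # muP requires the output (readout) layer to be a MuReadout so that
        # its initialization and learning rate scale correctly with width.
        self.head = MuReadout(width, d_out)

    def forward(self, x):
        return self.head(self.body(x))

# The base and delta models tell mup which dimensions are "widths" to scale;
# the target model is the full-sized network you actually train.
base_model = MLP(width=64)
delta_model = MLP(width=128)
model = MLP(width=4096)  # HPs tuned at small width transfer to this model
set_base_shapes(model, base_model, delta=delta_model)

# Use mup's optimizer wrappers so per-parameter learning rates follow muP.
optimizer = MuAdam(model.parameters(), lr=1e-3)  # lr found on the small proxy
```

Under this setup, the learning rate (and other HPs) found by sweeping on a narrow proxy model can be reused as-is for the wide target model, which is the zero-shot transfer the abstract refers to.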
