Paper Title
A Neural Scaling Law from the Dimension of the Data Manifold
Paper Authors
Paper Abstract
When data is plentiful, the loss achieved by well-trained neural networks scales as a power-law $L \propto N^{-\alpha}$ in the number of network parameters $N$. This empirical scaling law holds for a wide variety of data modalities, and may persist over many orders of magnitude. The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension $d$. This simple theory predicts the scaling exponent $\alpha \approx 4/d$ for cross-entropy and mean-squared error losses. We confirm the theory by independently measuring the intrinsic dimension and the scaling exponents in a teacher/student framework, where we can study a variety of $d$ and $\alpha$ by dialing the properties of random teacher networks. We also test the theory with CNN image classifiers on several datasets and with GPT-type language models.
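The abstract relates two independently measurable quantities: the scaling exponent $\alpha$ obtained from loss-versus-parameter-count measurements, and the intrinsic dimension $d$ of the data manifold. Below is a minimal illustrative sketch, not the authors' code, showing one way the two could be compared against the $\alpha \approx 4/d$ prediction: `fit_scaling_exponent` fits the power law in log-log space, and `two_nn_dimension` is a simple two-nearest-neighbor intrinsic-dimension estimator; the function names, constants, and all data are hypothetical and purely synthetic.

```python
# Illustrative sketch (not the paper's implementation): compare a fitted scaling
# exponent alpha with the 4/d prediction, using synthetic measurements throughout.
import numpy as np


def fit_scaling_exponent(params, losses):
    """Fit L ∝ N^{-alpha} as a line in log-log space; alpha is minus the slope."""
    slope, _intercept = np.polyfit(np.log(params), np.log(losses), 1)
    return -slope


def two_nn_dimension(X):
    """Two-nearest-neighbor intrinsic-dimension estimate.

    Uses the ratio mu = r2/r1 of each point's distances to its two nearest
    neighbors; the maximum-likelihood estimate of d is n / sum(log mu).
    """
    sq_norms = np.sum(X * X, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T  # squared distances
    np.fill_diagonal(d2, np.inf)                 # exclude self-distances
    d2 = np.clip(d2, 0.0, None)                  # guard against tiny negatives
    r = np.sqrt(np.sort(d2, axis=1)[:, :2])      # r1 <= r2 for every point
    mu = r[:, 1] / r[:, 0]
    return len(mu) / np.sum(np.log(mu))


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Hypothetical loss measurements following L = C * N^{-alpha} with alpha = 0.5.
    N = np.logspace(4, 7, 8)
    L = 3.0 * N ** -0.5 * np.exp(0.01 * rng.standard_normal(N.size))
    alpha = fit_scaling_exponent(N, L)

    # Hypothetical data manifold: an 8-dimensional subspace embedded in 64 dimensions.
    d_true = 8
    X = rng.standard_normal((1000, d_true)) @ rng.standard_normal((d_true, 64))
    d_est = two_nn_dimension(X)

    print(f"fitted alpha = {alpha:.3f}, 4/d prediction = {4.0 / d_est:.3f}")
```

In this synthetic setup the fitted exponent recovers the value used to generate the losses, and the two-nearest-neighbor estimate recovers the subspace dimension; whether $\alpha \approx 4/d$ holds for real networks is exactly what the paper tests empirically.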