Paper Title

Towards understanding how momentum improves generalization in deep learning

Paper Authors

Samy Jelassi, Yuanzhi Li

Paper Abstract

Stochastic gradient descent (SGD) with momentum is widely used for training modern deep learning architectures. While it is well understood that using momentum can lead to a faster convergence rate in various settings, it has also been observed that momentum yields higher generalization. Prior work argues that momentum stabilizes the SGD noise during training and that this leads to higher generalization. In this paper, we adopt another perspective and first show empirically that gradient descent with momentum (GD+M) significantly improves generalization compared to gradient descent (GD) in some deep learning problems. From this observation, we formally study how momentum improves generalization. We devise a binary classification setting in which a one-hidden-layer (over-parameterized) convolutional neural network trained with GD+M provably generalizes better than the same network trained with GD, when both algorithms are similarly initialized. The key insight in our analysis is that momentum is beneficial in datasets where the examples share some feature but differ in their margin. In contrast to GD, which memorizes the small-margin data, GD+M still learns the feature in these data thanks to its historical gradients. Lastly, we empirically validate our theoretical findings.
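
As a rough illustration of the two update rules the abstract compares, the following minimal Python sketch contrasts plain GD with the heavy-ball form of GD+M on a toy one-dimensional quadratic. The learning rate, momentum coefficient, and loss here are illustrative assumptions, not the paper's construction.

```python
# Minimal sketch (assumed toy setup, not the paper's setting): contrast the
# plain GD update with the heavy-ball momentum (GD+M) update.

def grad(w):
    # Gradient of the toy loss L(w) = w**2 (an assumption for illustration).
    return 2.0 * w

eta, beta = 0.1, 0.9      # assumed learning rate and momentum coefficient
w_gd = w_gdm = 1.0        # same initialization for both algorithms
m = 0.0                   # momentum buffer: exponentially weighted sum of past gradients

for step in range(50):
    # Plain GD: uses only the current gradient.
    w_gd = w_gd - eta * grad(w_gd)

    # GD+M: the buffer retains contributions from historical gradients,
    # the mechanism the abstract credits for still learning the shared
    # feature on small-margin examples instead of memorizing them.
    m = beta * m + grad(w_gdm)
    w_gdm = w_gdm - eta * m

print(f"w after 50 steps  GD: {w_gd:.6f}  GD+M: {w_gdm:.6f}")
```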
