Paper title
On generalization bounds for deep networks based on loss surface implicit regularization
Paper authors
Paper abstract
Classical statistical learning theory implies that fitting too many parameters leads to overfitting and poor performance. The fact that modern deep neural networks generalize well despite their large number of parameters contradicts this finding and constitutes a major open problem in explaining the success of deep learning. While previous work has focused on the implicit regularization induced by stochastic gradient descent (SGD), we study here how the local geometry of the energy landscape around local minima affects the statistical properties of SGD with Gaussian gradient noise. We argue that, under reasonable assumptions, the local geometry forces SGD to stay close to a low-dimensional subspace and that this induces another form of implicit regularization, resulting in tighter bounds on the generalization error of deep neural networks. To derive such generalization error bounds, we first introduce a notion of stagnation sets around the local minima and impose a local essential convexity property on the population risk. Under these conditions, we derive lower bounds on the probability that SGD remains in these stagnation sets. If stagnation occurs, we obtain a bound on the generalization error of deep neural networks that involves the spectral norms of the weight matrices but not the number of network parameters. Technically, our proofs are based on controlling the change of the parameter values across SGD iterates and on local uniform convergence of the empirical loss function, based on the entropy of suitable neighborhoods around local minima.
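As an informal illustration of the dynamics described in the abstract, the following minimal sketch runs SGD with Gaussian gradient noise near a local minimum of a simple quadratic surrogate loss and checks whether the iterates stay inside a ball around the minimizer, a crude stand-in for a stagnation set. The quadratic loss, the step size eta, the noise level sigma, the radius r, and the number of steps are illustrative assumptions, not the paper's construction.

```python
# Minimal sketch (not the paper's construction): SGD with Gaussian gradient
# noise near a local minimum, tracking whether the iterates remain inside a
# ball around the minimizer (a crude stand-in for a "stagnation set").
import numpy as np

rng = np.random.default_rng(0)

d = 20                                     # parameter dimension
H = np.diag(np.linspace(0.5, 5.0, d))      # local curvature (Hessian surrogate)
theta_star = np.zeros(d)                   # the local minimizer

def noisy_grad(theta, sigma):
    """Gradient of 0.5*(theta - theta_star)^T H (theta - theta_star) plus Gaussian noise."""
    return H @ (theta - theta_star) + sigma * rng.normal(size=d)

eta, sigma, r, n_steps = 0.05, 0.2, 1.0, 5000
theta = theta_star + 0.1 * rng.normal(size=d)   # initialize inside the ball

max_dist = 0.0
for _ in range(n_steps):
    theta = theta - eta * noisy_grad(theta, sigma)   # SGD step with Gaussian gradient noise
    max_dist = max(max_dist, np.linalg.norm(theta - theta_star))

print(f"max distance from the minimizer over {n_steps} steps: {max_dist:.3f}")
print("stayed inside the radius-r ball:", max_dist <= r)
```

In this toy setting the iterates typically remain in a small neighborhood of the minimizer; the paper's contribution is to quantify, under its stagnation and essential-convexity assumptions, the probability of such confinement for deep networks and to convert it into a generalization bound based on the spectral norms of the weight matrices rather than the parameter count.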