Paper Title


Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares

Paper Authors

Anant Raj, Melih Barsbey, Mert Gürbüzbalaban, Lingjiong Zhu, Umut Şimşekli

Paper Abstract


Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails has links to the generalization error. While these studies have shed light on interesting aspects of the generalization behavior in modern settings, they relied on strong topological and statistical regularity assumptions, which are hard to verify in practice. Furthermore, it has been empirically illustrated that the relation between heavy tails and generalization might not always be monotonic, contrary to the conclusions of existing theory. In this study, we establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD), through the lens of algorithmic stability. We consider a quadratic optimization problem and use a heavy-tailed stochastic differential equation (and its Euler discretization) as a proxy for modeling the heavy-tailed behavior emerging in SGD. We then prove uniform stability bounds, which reveal the following outcomes: (i) Without making any exotic assumptions, we show that SGD will not be stable if the stability is measured with the squared loss $x\mapsto x^2$, whereas it in turn becomes stable if the stability is instead measured with a surrogate loss $x\mapsto |x|^p$ for some $p<2$. (ii) Depending on the variance of the data, there exists a \emph{`threshold of heavy-tailedness'} such that the generalization error decreases as the tails become heavier, as long as the tails are lighter than this threshold. This suggests that the relation between heavy tails and generalization is not globally monotonic. (iii) We prove matching lower bounds on uniform stability, implying that our bounds are tight in terms of the heaviness of the tails. We support our theory with synthetic and real neural network experiments.
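The heavy-tailed proxy dynamics described in the abstract can be illustrated with a small simulation. The following Python sketch (not the authors' code; all parameter values, data sizes, and helper names are hypothetical choices for illustration) runs the Euler discretization of an $\alpha$-stable-driven SDE on a synthetic least-squares problem and reports the train/test gap under the squared loss and under a surrogate loss $|x|^p$ with $p<2$, for a few tail indices $\alpha$.

```python
# A minimal, illustrative sketch (not the paper's code): simulate the Euler
# discretization of an alpha-stable-driven SDE as a proxy for heavy-tailed SGD
# on a least-squares problem, and compare the train/test gap measured with the
# squared loss x -> x**2 versus a surrogate loss x -> |x|**p with p < 2.
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)

# Synthetic least-squares data: y = A theta* + noise (all choices hypothetical).
n, d = 200, 10
A_train = rng.normal(size=(n, d))
A_test = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y_train = A_train @ theta_star + 0.1 * rng.normal(size=n)
y_test = A_test @ theta_star + 0.1 * rng.normal(size=n)

def grad(theta, A, y):
    """Gradient of the quadratic objective F(theta) = ||A theta - y||^2 / (2n)."""
    return A.T @ (A @ theta - y) / len(y)

def run_heavy_tailed_iterates(alpha, eta=0.01, sigma=0.1, iters=2000):
    """Euler discretization of d(theta_t) = -grad F(theta_t) dt + sigma dL_t^alpha."""
    theta = np.zeros(d)
    for _ in range(iters):
        # Symmetric alpha-stable increment; eta**(1/alpha) is the Euler scaling.
        noise = levy_stable.rvs(alpha, 0.0, size=d, random_state=rng)
        theta = theta - eta * grad(theta, A_train, y_train) \
                + sigma * eta ** (1.0 / alpha) * noise
    return theta

def gap(theta, p):
    """Train/test gap under the loss |residual|^p (p = 2 is the squared loss)."""
    train = np.mean(np.abs(A_train @ theta - y_train) ** p)
    test = np.mean(np.abs(A_test @ theta - y_test) ** p)
    return test - train

for alpha in (1.9, 1.7, 1.5):  # smaller alpha means heavier tails
    theta = run_heavy_tailed_iterates(alpha)
    print(f"alpha={alpha}: gap(p=2)={gap(theta, 2.0):.3f}, gap(p=1.2)={gap(theta, 1.2):.3f}")
```

With the squared loss the gap is dominated by the rare, very large excursions of the heavy-tailed iterates, whereas the surrogate loss with $p<2$ damps those excursions; this is the qualitative behavior the stability bounds in the paper formalize.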
