Paper Title

Weak and Strong Gradient Directions: Explaining Memorization, Generalization, and Hardness of Examples at Scale

Paper Authors

Piotr Zielinski, Shankar Krishnan, Satrajit Chatterjee

Paper Abstract

Coherent Gradients (CGH) is a recently proposed hypothesis to explain why over-parameterized neural networks trained with gradient descent generalize well even though they have sufficient capacity to memorize the training set. The key insight of CGH is that, since the overall gradient for a single step of SGD is the sum of the per-example gradients, it is strongest in directions that reduce the loss on multiple examples if such directions exist. In this paper, we validate CGH on ResNet, Inception, and VGG models on ImageNet. Since the techniques presented in the original paper do not scale beyond toy models and datasets, we propose new methods. By posing the problem of suppressing weak gradient directions as a problem of robust mean estimation, we develop a coordinate-based median of means approach. We present two versions of this algorithm, M3, which partitions a mini-batch into 3 groups and computes the median, and a more efficient version RM3, which reuses gradients from the previous two time steps to compute the median. Since they suppress weak gradient directions without requiring per-example gradients, they can be used to train models at scale. Experimentally, we find that they indeed greatly reduce overfitting (and memorization) and thus provide the first convincing evidence that CGH holds at scale. We also propose a new test of CGH that does not depend on adding noise to training labels or on suppressing weak gradient directions. Using the intuition behind CGH, we posit that the examples learned early in the training process (i.e., "easy" examples) are precisely those that have more in common with other training examples. Therefore, as per CGH, the easy examples should generalize better amongst themselves than the hard examples amongst themselves. We validate this hypothesis with detailed experiments, and believe that it provides further orthogonal evidence for CGH.
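
The abstract describes M3 and RM3 only at a high level. Below is a minimal numpy sketch of the coordinate-wise median-of-means idea as stated there, not the authors' implementation: the helper `grad_fn`, the flat-parameter representation, and the plain SGD update are assumptions made for illustration.

```python
import numpy as np

def m3_update(params, minibatch, grad_fn, lr=0.1):
    """M3 sketch: split the mini-batch into 3 groups and take the
    coordinate-wise median of the 3 group-mean gradients.

    `grad_fn(params, batch)` is a hypothetical helper returning the
    average gradient over `batch` as a flat numpy array.
    """
    # Partition the mini-batch into 3 roughly equal groups.
    groups = np.array_split(minibatch, 3)
    # Mean gradient of each group.
    group_means = np.stack([grad_fn(params, g) for g in groups])
    # Coordinate-wise median across the 3 group means suppresses
    # weak directions supported by only a few examples.
    robust_grad = np.median(group_means, axis=0)
    return params - lr * robust_grad

def rm3_update(params, grad_t, grad_tm1, grad_tm2, lr=0.1):
    """RM3 sketch: reuse the gradients from the previous two steps and
    take the coordinate-wise median of the three, avoiding the extra
    per-step gradient computations of M3."""
    robust_grad = np.median(np.stack([grad_t, grad_tm1, grad_tm2]), axis=0)
    return params - lr * robust_grad
```

In both cases only full (group or mini-batch) gradients are needed, which is what allows the approach to scale without materializing per-example gradients.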
