Paper Title
Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization
Paper Authors
Paper Abstract
The early phase of training a deep neural network has a dramatic effect on the local curvature of the loss function. For instance, using a small learning rate does not guarantee stable optimization, because the optimization trajectory tends to steer towards regions of the loss surface with increasing local curvature. We ask whether this tendency is connected to the widely observed phenomenon that the choice of learning rate strongly influences generalization. We first show that stochastic gradient descent (SGD) implicitly penalizes the trace of the Fisher Information Matrix (FIM), a measure of the local curvature, from the start of training. We argue that this is an implicit regularizer in SGD by showing that explicitly penalizing the trace of the FIM can significantly improve generalization. We highlight that poor final generalization coincides with the trace of the FIM attaining a large value early in training, which we refer to as catastrophic Fisher explosion. Finally, to gain insight into the regularization effect of penalizing the trace of the FIM, we show that it limits memorization by reducing the learning speed of examples with noisy labels more than that of examples with clean labels.
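To make the explicit penalty concrete, below is a minimal PyTorch-style sketch of adding a Fisher trace penalty to a classification loss. It is an illustration under stated assumptions, not the authors' exact implementation: the one-sample Monte Carlo estimator, the helper `fisher_trace_penalty`, and the hyperparameter `fisher_coef` are all hypothetical names introduced here. The estimator uses the identity Tr(F) = E_x E_{y~p(y|x)} ||∇_θ log p(y|x)||², sampling a single pseudo-label per input from the model's own predictive distribution.

```python
# Minimal sketch (assumes PyTorch) of an explicit Fisher trace penalty.
# `fisher_trace_penalty` and `fisher_coef` are illustrative, not from the paper's code.
import torch
import torch.nn.functional as F

def fisher_trace_penalty(model, inputs):
    """One-sample Monte Carlo estimate of the trace of the FIM.

    Samples a pseudo-label for each input from the model's predictive
    distribution, then returns the squared norm of the gradient of the
    resulting log-likelihood with respect to the model parameters.
    """
    logits = model(inputs)
    # Pseudo-labels drawn from the model's own predictive distribution
    # (this is what distinguishes the Fisher from the empirical Fisher).
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=logits).sample()
    log_likelihood = -F.cross_entropy(logits, sampled)
    # create_graph=True keeps the penalty differentiable so it can be
    # minimized jointly with the task loss.
    grads = torch.autograd.grad(
        log_likelihood, model.parameters(), create_graph=True
    )
    return sum(g.pow(2).sum() for g in grads)

# Hypothetical usage inside a training step:
#   loss = F.cross_entropy(model(x), y) + fisher_coef * fisher_trace_penalty(model, x)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Note that this sketch penalizes the squared norm of the mini-batch gradient of the sampled log-likelihood, a cheap estimator; more faithful variants average per-example squared gradient norms instead.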