Paper Title

Second Order Optimization for Adversarial Robustness and Interpretability

Authors

Theodoros Tsiligkaridis, Jay Roberts

Abstract


Deep neural networks are easily fooled by small perturbations known as adversarial attacks. Adversarial Training (AT) is a technique aimed at learning features robust to such attacks and is widely regarded as a very effective defense. However, the computational cost of such training can be prohibitive as the network size and input dimensions grow. Inspired by the relationship between robustness and curvature, we propose a novel regularizer which incorporates first and second order information via a quadratic approximation to the adversarial loss. The worst case quadratic loss is approximated via an iterative scheme. It is shown that using only a single iteration in our regularizer achieves stronger robustness than prior gradient and curvature regularization schemes, avoids gradient obfuscation, and, with additional iterations, achieves strong robustness with significantly lower training time than AT. Further, it retains the interesting facet of AT that networks learn features which are well-aligned with human perception. We demonstrate experimentally that our method produces higher quality human-interpretable features than other geometric regularization techniques. These robust features are then used to provide human-friendly explanations to model predictions.
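The regularizer described above can be illustrated with a small sketch. The idea is to build a quadratic model of the adversarial loss around an input, q(δ) = gᵀδ + ½ δᵀHδ, and approximate its worst case over an ε-ball with a few fixed-point iterations, using finite-difference Hessian-vector products so the Hessian is never formed. The code below is a minimal illustration on a toy analytic loss, not the authors' implementation; the function names, the toy loss, and the specific iteration rule are assumptions for demonstration.

```python
import numpy as np

def toy_loss(x):
    # Stand-in for a network's training loss evaluated at input x (assumed toy example).
    return 0.5 * x @ x + 0.1 * np.sum(x ** 4)

def toy_grad(x):
    # Analytic gradient of the toy loss above.
    return x + 0.4 * x ** 3

def hvp(x, v, h=1e-5):
    # Finite-difference Hessian-vector product: H v ≈ (∇L(x + h v) − ∇L(x)) / h,
    # so second-order information is used without materializing the Hessian.
    return (toy_grad(x + h * v) - toy_grad(x)) / h

def quadratic_adv_regularizer(x, eps=0.1, iters=3):
    # Quadratic model of the adversarial loss: q(δ) = gᵀδ + ½ δᵀHδ.
    g = toy_grad(x)
    # First iteration: perturbation along the gradient (first-order information only).
    delta = eps * g / (np.linalg.norm(g) + 1e-12)
    for _ in range(iters - 1):
        # Fixed-point update toward the worst-case δ of the quadratic model:
        # follow ∇q(δ) = g + Hδ, then project back onto the ε-sphere.
        d = g + hvp(x, delta)
        delta = eps * d / (np.linalg.norm(d) + 1e-12)
    # The regularizer is the quadratic model evaluated at the approximate worst case.
    return g @ delta + 0.5 * delta @ hvp(x, delta)

x = np.array([1.0, -2.0, 0.5])
reg = quadratic_adv_regularizer(x)
```

In training, a term like this would be added to the clean loss; with `iters=1` it reduces to a gradient-plus-curvature penalty along the gradient direction, and extra iterations refine the worst-case direction, mirroring the single-iteration vs. multi-iteration trade-off the abstract describes.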
