Paper Title

Learning to Separate Clusters of Adversarial Representations for Robust Adversarial Detection

Paper Authors

Byunggill Joe, Jihun Hamm, Sung Ju Hwang, Sooel Son, Insik Shin

Paper Abstract

Although deep neural networks have shown promising performance on various tasks, they are susceptible to incorrect predictions induced by imperceptibly small perturbations in inputs. A large number of previous works have proposed to detect adversarial attacks. Yet, most of them cannot effectively detect adaptive white-box attacks, where an adversary has knowledge of the model and the defense method. In this paper, we propose a new probabilistic adversarial detector motivated by the recently introduced concept of non-robust features. We regard non-robust features as a common property of adversarial examples, and deduce that it is possible to find a cluster in representation space corresponding to this property. This idea leads us to estimate the probability distribution of adversarial representations in a separate cluster, and to leverage the distribution for a likelihood-based adversarial detector.
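The abstract only outlines the detection rule at a high level. As a rough illustrative sketch (not the authors' implementation), a likelihood-based detector over separated representation clusters could fit one distribution per cluster and flag inputs whose representation is more likely under the adversarial cluster; the Gaussian cluster model, class name, and decision rule below are all assumptions for illustration:

```python
# Minimal sketch of a likelihood-based adversarial detector over
# representation clusters. Assumes feature representations have already
# been extracted from some network layer; everything here is a
# hypothetical illustration of the idea, not the paper's method.
import numpy as np
from scipy.stats import multivariate_normal


class LikelihoodDetector:
    def __init__(self, clean_reps: np.ndarray, adv_reps: np.ndarray):
        # Fit one Gaussian per cluster (clean vs. adversarial), with a
        # small diagonal term to keep the covariance positive definite.
        dim = clean_reps.shape[1]
        self.clean_dist = multivariate_normal(
            mean=clean_reps.mean(axis=0),
            cov=np.cov(clean_reps, rowvar=False) + 1e-6 * np.eye(dim),
        )
        self.adv_dist = multivariate_normal(
            mean=adv_reps.mean(axis=0),
            cov=np.cov(adv_reps, rowvar=False) + 1e-6 * np.eye(dim),
        )

    def is_adversarial(self, rep: np.ndarray) -> bool:
        # Flag the input when its representation is more likely under
        # the adversarial cluster than under the clean one.
        return self.adv_dist.logpdf(rep) > self.clean_dist.logpdf(rep)


# Usage with random stand-in 64-dimensional representations.
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(500, 64))
adv = rng.normal(1.5, 1.0, size=(500, 64))
detector = LikelihoodDetector(clean, adv)
print(detector.is_adversarial(adv[0]))
```

Comparing log-likelihoods under two fitted clusters is the simplest instance of the detection rule the abstract describes; the paper's actual density estimator and training procedure may differ.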
