Paper Title

Repairing Adversarial Texts through Perturbation

Paper Authors

Guoliang Dong, Jingyi Wang, Jun Sun, Sudipta Chattopadhyay, Xinyu Wang, Ting Dai, Jie Shi, Jin Song Dong

Paper Abstract

It is known that neural networks are subject to attacks through adversarial perturbations, i.e., inputs which are maliciously crafted through perturbations to induce wrong predictions. Furthermore, such attacks are impossible to eliminate entirely, i.e., adversarial perturbation remains possible even after applying mitigation methods such as adversarial training. Multiple approaches have been developed to detect and reject such adversarial inputs, mostly in the image domain. Rejecting suspicious inputs, however, may not always be feasible or ideal. First, normal inputs may be rejected due to false alarms generated by the detection algorithm. Second, denial-of-service attacks may be conducted by feeding such systems with adversarial inputs. To address this gap, in this work, we propose an approach to automatically repair adversarial texts at runtime. Given a text suspected to be adversarial, we apply multiple adversarial perturbation methods in a novel, positive way to identify a repair, i.e., a slightly mutated but semantically equivalent text that the neural network classifies correctly. We have evaluated our approach on multiple models trained for natural language processing tasks, and the results show that it is effective, i.e., it successfully repairs about 80% of the adversarial texts. Furthermore, depending on the applied perturbation method, an adversarial text can be repaired in as little as one second on average.
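The core idea described in the abstract — applying perturbation methods "in a positive way" to find a slightly mutated but semantically equivalent text that the model classifies correctly — can be sketched as a perturb-and-vote loop. The sketch below is only one plausible reading of that idea, not the paper's actual algorithm: the classifier, the synonym/spelling table, and the majority-vote repair strategy are all toy stand-ins introduced here for illustration.

```python
import random
from collections import Counter

# Hypothetical perturbation table: maps a word to semantically
# equivalent replacements (here, a misspelling back to its correct
# form). The paper's actual perturbation methods are richer.
REPLACEMENTS = {
    "greeat": ["great"],
    "terrible": ["awful"],
}

def toy_classifier(text):
    # Toy sentiment model standing in for a trained NLP model:
    # predicts "pos" only when a known positive word appears.
    positive_words = ("great", "excellent", "wonderful")
    return "pos" if any(w in text.split() for w in positive_words) else "neg"

def perturb(text, rng):
    # Word-level perturbation: replace one random word that has a
    # known semantically equivalent alternative.
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w in REPLACEMENTS]
    if not candidates:
        return text
    i = rng.choice(candidates)
    words[i] = rng.choice(REPLACEMENTS[words[i]])
    return " ".join(words)

def repair_by_vote(text, classify, n_variants=20, seed=0):
    # Generate slightly mutated variants of the suspected adversarial
    # text, take a majority vote over their predicted labels, and
    # return a variant carrying the majority label as the "repair".
    rng = random.Random(seed)
    variants = [perturb(text, rng) for _ in range(n_variants)]
    label = Counter(classify(v) for v in variants).most_common(1)[0][0]
    for v in variants:
        if classify(v) == label:
            return v, label
    return None, label
```

For example, the misspelling "greeat" (a character-level attack) fools the toy classifier into predicting "neg"; perturbing it back to "great" yields a variant the classifier labels "pos", which the vote selects as the repair.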
