Paper Title
Learn2Weight: Parameter Adaptation against Similar-domain Adversarial Attacks
Paper Authors
Paper Abstract
Recent work on black-box adversarial attacks against NLP systems has attracted much attention. Prior black-box attacks assume that attackers can observe the output labels of the target model for selected inputs. In this work, inspired by adversarial transferability, we propose a new type of black-box NLP adversarial attack in which an attacker chooses a similar domain, crafts adversarial examples there, and transfers them to the target domain, degrading the target model's performance. Based on domain adaptation theory, we then propose a defensive strategy, called Learn2Weight, which learns to predict weight adjustments for the target model in order to defend against similar-domain adversarial examples. Using the Amazon multi-domain sentiment classification datasets, we empirically show that Learn2Weight is more effective against this attack than standard black-box defense methods such as adversarial training and defensive distillation. This work contributes to the growing literature on machine learning safety.
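The core mechanism described in the abstract can be illustrated with a minimal sketch. This is an assumed interface, not the paper's implementation: a meta-network takes features of an incoming input, predicts a weight adjustment (delta), and that delta is added to the target model's weights before inference. The linear classifier, the meta-map `M`, and the function names are all hypothetical placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target model: a linear classifier predicting 1 iff w . x > 0.
w_target = np.array([1.0, -0.5, 0.25])

def meta_network(x, M):
    """Sketch of Learn2Weight's predictor: map the input's features to a
    weight delta. Here the 'learned' predictor is just a linear map M."""
    return M @ x

def adapted_predict(x, w, M):
    """Adjust the target model's weights with the predicted delta,
    then classify the input with the adapted weights."""
    w_adapted = w + meta_network(x, M)
    return int(w_adapted @ x > 0)

# An assumed meta-map; in the paper this would be trained on similar domains.
M = 0.1 * rng.standard_normal((3, 3))

x = np.array([0.5, 1.0, -0.2])
label = adapted_predict(x, w_target, M)
print(label)
```

The design point this sketches: rather than retraining the target model on adversarial data (as in adversarial training), the defense leaves the base weights fixed and predicts a per-input correction, so adaptation happens at inference time.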