Paper Title
Robustness of Explanation Methods for NLP Models
Paper Authors
Abstract
Explanation methods have emerged as an important tool for highlighting the features responsible for the predictions of neural networks. There is mounting evidence that many explanation methods are rather unreliable and susceptible to malicious manipulation. In this paper, we aim to understand the robustness of explanation methods in the context of the text modality. We provide initial insights and results towards devising a successful adversarial attack against text explanations. To our knowledge, this is the first attempt to evaluate the adversarial robustness of an explanation method. Our experiments show that the explanation method can be substantially disturbed for up to 86% of the tested samples by small changes to the input sentence and its semantics.