Paper Title
To Transfer or Not to Transfer: Misclassification Attacks Against Transfer Learned Text Classifiers
Paper Authors
Paper Abstract
Transfer learning --- transferring learned knowledge --- has brought a paradigm shift in the way models are trained. The lucrative benefits of improved accuracy and reduced training time have shown promise in training models with constrained computational resources and fewer training samples. Specifically, publicly available text-based models such as GloVe and BERT that are trained on large corpora have seen ubiquitous adoption in practice. In this paper, we ask, "can transfer learning in text prediction models be exploited to perform misclassification attacks?" As our main contribution, we present novel attack techniques that utilize unintended features learned in the teacher (public) model to generate adversarial examples for student (downstream) models. To the best of our knowledge, ours is the first work to show that transfer learning from state-of-the-art word-based and sentence-based teacher models increases the susceptibility of student models to misclassification attacks. First, we propose a novel word-score based attack algorithm for generating adversarial examples against student models trained using a context-free word-level embedding model. On binary classification tasks trained using the GloVe teacher model, we achieve an average attack accuracy of 97% for the IMDB Movie Reviews task and 80% for the Fake News Detection task. For multi-class tasks, we divide the Newsgroup dataset into 6 and 20 classes and achieve average attack accuracies of 75% and 41%, respectively. Next, we present length-based and sentence-based misclassification attacks for the Fake News Detection task trained using a context-aware BERT model and achieve 78% and 39% attack accuracy, respectively. Thus, our results motivate the need for designing training techniques that are robust to unintended feature learning, specifically for transfer learned models.
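
Illustrative sketch (not from the paper). The abstract describes a word-score based attack on student models built atop context-free (GloVe-style) embeddings. The minimal Python sketch below shows one plausible shape of such an attack, assuming a leave-one-out word-importance score and greedy nearest-neighbour substitution in the teacher embedding space; predict_proba (the student classifier) and embeddings (a word-to-vector dict) are hypothetical stand-ins, not the authors' actual interface or algorithm.

import numpy as np

def word_scores(predict_proba, tokens, target_class):
    # Score each word by how much deleting it lowers the probability of the
    # currently predicted (target) class: a leave-one-out importance score.
    base = predict_proba(tokens)[target_class]
    return np.array([
        base - predict_proba(tokens[:i] + tokens[i + 1:])[target_class]
        for i in range(len(tokens))
    ])

def nearest_neighbour(word, embeddings):
    # Closest other word in the teacher embedding space by cosine similarity.
    if word not in embeddings:
        return word
    v = embeddings[word]
    best, best_sim = word, -1.0
    for cand, u in embeddings.items():
        if cand == word:
            continue
        sim = float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u) + 1e-12))
        if sim > best_sim:
            best, best_sim = cand, sim
    return best

def attack(predict_proba, tokens, target_class, embeddings, budget=3):
    # Greedily substitute the highest-scoring words with their embedding-space
    # neighbours until the student model's prediction flips or budget runs out.
    tokens = list(tokens)
    order = np.argsort(-word_scores(predict_proba, tokens, target_class))
    for i in order[:budget]:
        tokens[i] = nearest_neighbour(tokens[i], embeddings)
        if int(np.argmax(predict_proba(tokens))) != target_class:
            break  # misclassification achieved
    return tokens

Here the attack exploits the fact that the student inherits the teacher's embedding geometry: substitutions that are near-neighbours to the teacher can move the input across the student's decision boundary while remaining superficially similar text.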