Paper Title

Selective Amnesia: On Efficient, High-Fidelity and Blind Suppression of Backdoor Effects in Trojaned Machine Learning Models

Authors

Rui Zhu, Di Tang, Siyuan Tang, XiaoFeng Wang, Haixu Tang

Abstract

In this paper, we present a simple yet surprisingly effective technique to induce "selective amnesia" on a backdoored model. Our approach, called SEAM, is inspired by the problem of catastrophic forgetting (CF), a long-standing issue in continual learning. Our idea is to retrain a given DNN model on randomly labeled clean data to induce CF on the model, leading to sudden forgetting of both the primary and backdoor tasks; we then recover the primary task by retraining the randomized model on correctly labeled clean data. We analyze SEAM by modeling the unlearning process as continual learning and further approximating the DNN with a Neural Tangent Kernel to measure CF. Our analysis shows that our random-labeling approach actually maximizes CF on an unknown backdoor in the absence of triggered inputs, while preserving some feature extraction in the network to enable a fast revival of the primary task. We further evaluate SEAM on both image processing and natural language processing tasks, under both data-contamination and training-manipulation attacks, over thousands of models either trained on popular image datasets or provided by the TrojAI competition. Our experiments show that SEAM vastly outperforms state-of-the-art unlearning techniques, achieving high Fidelity (measuring the gap between the accuracy of the primary task and that of the backdoor) within a few minutes (about 30 times faster than training a model from scratch on the MNIST dataset), with only a small amount of clean data (0.1% of the training data for TrojAI models).
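To make the forget-then-recover procedure described above concrete, the following is a minimal PyTorch sketch of the two phases: retraining on randomly labeled clean data to induce catastrophic forgetting, then retraining on correctly labeled clean data to revive the primary task. The function name, hyperparameters (forget_epochs, recover_epochs, lr), and the assumption of a standard classifier with a small clean loader are illustrative assumptions, not the authors' reference implementation.

```python
# Hypothetical sketch of a SEAM-style forget-then-recover loop (not the official code).
import torch
import torch.nn.functional as F


def seam_unlearn(model, clean_loader, num_classes,
                 forget_epochs=1, recover_epochs=5, lr=1e-3, device="cpu"):
    model.to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()

    # Phase 1 (forgetting): train on the small clean set with labels drawn
    # uniformly at random, inducing catastrophic forgetting of both the
    # primary task and any hidden backdoor task.
    for _ in range(forget_epochs):
        for x, _ in clean_loader:
            x = x.to(device)
            random_y = torch.randint(0, num_classes, (x.size(0),), device=device)
            loss = F.cross_entropy(model(x), random_y)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Phase 2 (recovery): retrain on the same clean data with its correct
    # labels to revive the primary task; the backdoor is not revived because
    # no triggered inputs appear in the clean set.
    for _ in range(recover_epochs):
        for x, y in clean_loader:
            x, y = x.to(device), y.to(device)
            loss = F.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

    return model
```

In this sketch, Fidelity would be estimated afterwards as the gap between the recovered model's accuracy on a clean test set and its attack success rate on triggered inputs, matching the metric named in the abstract.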
