Paper Title

Poisoned classifiers are not only backdoored, they are fundamentally broken

Paper Authors

Mingjie Sun, Siddhant Agarwal, J. Zico Kolter

Paper Abstract

Under a commonly-studied backdoor poisoning attack against classification models, an attacker adds a small trigger to a subset of the training data, such that the presence of this trigger at test time causes the classifier to always predict some target class. It is often implicitly assumed that the poisoned classifier is vulnerable exclusively to the adversary who possesses the trigger. In this paper, we show empirically that this view of backdoored classifiers is incorrect. We describe a new threat model for poisoned classifiers, in which someone without knowledge of the original trigger seeks to control the poisoned classifier. Under this threat model, we propose a test-time, human-in-the-loop attack method to generate multiple effective alternative triggers without access to the initial backdoor or the training data. We construct these alternative triggers by first generating adversarial examples for a smoothed version of the classifier, created with a procedure called Denoised Smoothing, and then extracting colors or cropped portions of the smoothed adversarial images with human interaction. We demonstrate the effectiveness of our attack through extensive experiments on high-resolution datasets: ImageNet and TrojAI. We also compare our approach to previous work on modeling trigger distributions and find that our method is more scalable and efficient at generating effective triggers. Finally, we include a user study which demonstrates that our method allows users to easily determine the existence of such backdoors in existing poisoned classifiers. Thus, we argue that there is no such thing as a secret backdoor in poisoned classifiers: poisoning a classifier invites attacks not just from the party that possesses the trigger, but from anyone with access to the classifier.
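To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch, not the authors' released code, of how one might construct an alternative trigger: wrap the poisoned classifier with a denoiser and Gaussian noise (in the spirit of Denoised Smoothing), run a targeted PGD-style attack on that smoothed model toward an assumed backdoor target class, and crop a patch of the resulting adversarial image as a candidate trigger. The class and function names (`SmoothedClassifier`, `targeted_pgd`, `extract_patch_trigger`), the target label, and all hyperparameters are illustrative placeholders, not values from the paper.

```python
# Hedged sketch of the alternative-trigger construction pipeline.
# All names and hyperparameters here are assumptions for illustration.
import torch
import torch.nn as nn

TARGET_CLASS = 0                            # hypothetical backdoor target label
SIGMA = 0.25                                # Gaussian noise level for smoothing (assumed)
N_NOISE = 8                                 # noise samples per gradient step (assumed)
EPS, ALPHA, STEPS = 8 / 255, 1 / 255, 40    # assumed L-infinity PGD budget


class SmoothedClassifier(nn.Module):
    """Approximate a denoised-smoothed classifier: average the logits of
    classifier(denoiser(x + noise)) over several Gaussian noise draws."""

    def __init__(self, denoiser: nn.Module, classifier: nn.Module):
        super().__init__()
        self.denoiser = denoiser
        self.classifier = classifier

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = 0.0
        for _ in range(N_NOISE):
            noisy = x + SIGMA * torch.randn_like(x)
            logits = logits + self.classifier(self.denoiser(noisy))
        return logits / N_NOISE


def targeted_pgd(model: nn.Module, x: torch.Tensor, target: int) -> torch.Tensor:
    """Run targeted L-infinity PGD on the smoothed model; returns x_adv in [0, 1]."""
    x_adv = x.clone().detach()
    y = torch.full((x.shape[0],), target, dtype=torch.long)
    for _ in range(STEPS):
        x_adv.requires_grad_(True)
        loss = nn.functional.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Descend the loss so the smoothed model predicts the target class.
        x_adv = (x_adv - ALPHA * grad.sign()).detach()
        x_adv = x + (x_adv - x).clamp(-EPS, EPS)
        x_adv = x_adv.clamp(0, 1)
    return x_adv


def extract_patch_trigger(x_adv: torch.Tensor, top: int, left: int, size: int) -> torch.Tensor:
    """Crop a region of the smoothed adversarial image as a candidate trigger;
    in the paper this selection of patches or colors is done with a human in the loop."""
    return x_adv[..., top:top + size, left:left + size]
```

In this sketch the automated part only produces smoothed adversarial images; the human-in-the-loop step described in the abstract corresponds to inspecting those images and choosing which colors or cropped regions to paste onto clean inputs as alternative triggers.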
