Paper Title
Combing for Credentials: Active Pattern Extraction from Smart Reply
Paper Authors
Paper Abstract
Pre-trained large language models, such as GPT-2 and BERT, are often fine-tuned to achieve state-of-the-art performance on a downstream task. One natural example is the "Smart Reply" application, where a pre-trained model is tuned to provide suggested responses for a given query message. Since the tuning data often consists of sensitive data such as emails or chat transcripts, it is important to understand and mitigate the risk that the model leaks its tuning data. We investigate potential information leakage vulnerabilities in a typical Smart Reply pipeline. We consider a realistic setting in which the adversary can only interact with the underlying model through a front-end interface that constrains what types of queries can be sent to the model. Previous attacks do not work in these settings, as they require the ability to send unconstrained queries directly to the model. Even when there are no constraints on the queries, previous attacks typically require thousands, or even millions, of queries to extract useful information, whereas our attacks can extract sensitive data in just a handful of queries. We introduce a new type of active extraction attack that exploits canonical patterns in text containing sensitive data. We show experimentally that an adversary can extract sensitive user information present in the training data, even in realistic settings where all interactions with the model must go through a front-end that limits the types of queries. We explore potential mitigation strategies and demonstrate empirically that differential privacy appears to be a reasonably effective defense against such pattern extraction attacks.
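To make the threat model concrete, the following is a minimal illustrative sketch in Python, not the attack described in the paper. It assumes a hypothetical SmartReplyAPI front-end (mocked below) that only returns suggested replies, and it probes that front-end with messages built around a canonical pattern (for example, "My phone number is ..."), scoring each candidate completion by how much the suggested replies deviate from a generic baseline. The class name, the response set, and the scoring heuristic are all assumptions made for illustration.

    # Illustrative sketch only: NOT the paper's attack implementation.
    # SmartReplyAPI is a hypothetical, mocked front-end that stands in for a
    # fine-tuned model which only exposes suggested replies to the adversary.

    from collections import Counter
    from typing import List

    class SmartReplyAPI:
        """Hypothetical front-end: takes a query message and returns up to
        three suggested replies (behaviour mocked for this sketch)."""

        def __init__(self) -> None:
            # Mocked behaviour: one "memorized" completion triggers
            # non-generic replies, standing in for a model fine-tuned on
            # sensitive data.
            self._canned = {
                "555-0133": ["Got it, thanks!", "Calling you now.", "Thanks!"],
            }

        def suggest(self, message: str) -> List[str]:
            for secret, replies in self._canned.items():
                if secret in message:
                    return replies
            return ["Sounds good.", "OK.", "Thanks!"]

    def probe_pattern(api: SmartReplyAPI, pattern: str,
                      candidates: List[str]) -> Counter:
        """Send one query per candidate completion of the pattern and score
        each candidate by how many suggested replies differ from a generic
        baseline (an illustrative heuristic, not the paper's scoring rule)."""
        baseline = set(api.suggest("Hello"))
        scores: Counter = Counter()
        for cand in candidates:
            query = pattern.format(cand)   # e.g. "My phone number is 555-0133"
            replies = api.suggest(query)
            scores[cand] = sum(1 for r in replies if r not in baseline)
        return scores

    if __name__ == "__main__":
        api = SmartReplyAPI()
        candidates = ["555-0100", "555-0133", "555-0199"]
        scores = probe_pattern(api, "My phone number is {}", candidates)
        # In this mock, the completion seen during "tuning" scores highest.
        print(scores.most_common())

In this toy setup a handful of pattern-shaped queries suffices to single out the memorized completion, which mirrors the abstract's point that constrained front-end access and small query budgets do not by themselves prevent pattern extraction.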