Paper Title

On the Difficulty of Defending Self-Supervised Learning against Model Extraction

Paper Authors

Adam Dziedzic, Nikita Dhawan, Muhammad Ahmad Kaleem, Jonas Guan, Nicolas Papernot

Paper Abstract

Self-Supervised Learning (SSL) is an increasingly popular ML paradigm that trains models to transform complex inputs into representations without relying on explicit labels. These representations encode similarity structures that enable efficient learning of multiple downstream tasks. Recently, ML-as-a-Service providers have commenced offering trained SSL models over inference APIs, which transform user inputs into useful representations for a fee. However, the high cost involved to train these models and their exposure over APIs both make black-box extraction a realistic security threat. We thus explore model stealing attacks against SSL. Unlike traditional model extraction on classifiers that output labels, the victim models here output representations; these representations are of significantly higher dimensionality compared to the low-dimensional prediction scores output by classifiers. We construct several novel attacks and find that approaches that train directly on a victim's stolen representations are query efficient and enable high accuracy for downstream models. We then show that existing defenses against model extraction are inadequate and not easily retrofitted to the specificities of SSL.
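
To make the core attack concrete, here is a minimal sketch of "training directly on a victim's stolen representations." The `query_victim_api` function, the ResNet-18 local encoder, the 512-dimensional representation size, and the MSE objective are all illustrative assumptions, not the paper's exact attack.

```python
# Sketch of representation-based model extraction against an SSL encoder.
# The adversary queries the black-box victim API for representations and
# trains a local "stolen" encoder to reproduce them directly.
import torch
import torch.nn as nn
import torchvision


def query_victim_api(images: torch.Tensor) -> torch.Tensor:
    """Placeholder for the paid black-box inference API (hypothetical)."""
    raise NotImplementedError("Replace with calls to the actual victim API.")


# Local encoder the adversary trains; architecture and output size are assumptions.
stolen_encoder = torchvision.models.resnet18(num_classes=512)  # 512-dim representations

optimizer = torch.optim.Adam(stolen_encoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # match the victim's representations directly


def extraction_step(images: torch.Tensor) -> float:
    """One training step on a batch of (possibly unlabeled) attacker images."""
    with torch.no_grad():
        victim_reprs = query_victim_api(images)   # stolen high-dimensional outputs
    stolen_reprs = stolen_encoder(images)         # local model's representations
    loss = loss_fn(stolen_reprs, victim_reprs)    # train directly on stolen representations
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because each query returns an entire representation vector rather than a single label, every query carries far more information, which is one reason such attacks can be query efficient.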
