Paper Title

Humpty Dumpty: Controlling Word Meanings via Corpus Poisoning

Paper Authors

Roei Schuster, Tal Schuster, Yoav Meri, Vitaly Shmatikov

Paper Abstract

Word embeddings, i.e., low-dimensional vector representations such as GloVe and SGNS, encode word "meaning" in the sense that distances between words' vectors correspond to their semantic proximity. This enables transfer learning of semantics for a variety of natural language processing tasks. Word embeddings are typically trained on large public corpora such as Wikipedia or Twitter. We demonstrate that an attacker who can modify the corpus on which the embedding is trained can control the "meaning" of new and existing words by changing their locations in the embedding space. We develop an explicit expression over corpus features that serves as a proxy for distance between words and establish a causative relationship between its values and embedding distances. We then show how to use this relationship for two adversarial objectives: (1) make a word a top-ranked neighbor of another word, and (2) move a word from one semantic cluster to another. An attack on the embedding can affect diverse downstream tasks, demonstrating for the first time the power of data poisoning in transfer learning scenarios. We use this attack to manipulate query expansion in information retrieval systems such as resume search, make certain names more or less visible to named entity recognition models, and cause new words to be translated to a particular target word regardless of the language. Finally, we show how the attacker can generate linguistically likely corpus modifications, thus fooling defenses that attempt to filter implausible sentences from the corpus using a language model.
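
The sketch below is a minimal, self-contained illustration of the phenomenon the abstract relies on, not the paper's attack or its explicit distance expression: count-based embeddings (here a toy PPMI + truncated-SVD pipeline, standing in for GloVe/SGNS) are trained from co-occurrence statistics, so an attacker who appends sentences pairing a source word with a target word pulls their vectors together. The corpus, word choices, and poison sentences are all invented for illustration.

```python
# Illustrative only: toy PPMI + truncated-SVD embedding, NOT the paper's method.
import numpy as np
from itertools import combinations

def train_embedding(sentences, dim=2):
    """Count-based embedding: co-occurrence counts -> PPMI -> truncated SVD."""
    vocab = sorted({w for s in sentences for w in s.split()})
    idx = {w: i for i, w in enumerate(vocab)}
    cooc = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for a, b in combinations(s.split(), 2):  # context window = whole sentence
            cooc[idx[a], idx[b]] += 1
            cooc[idx[b], idx[a]] += 1
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True) + 1e-9
    col = cooc.sum(axis=0, keepdims=True) + 1e-9
    # Positive pointwise mutual information, a standard co-occurrence weighting.
    ppmi = np.maximum(np.log(cooc * total / (row * col) + 1e-9), 0.0)
    u, s, _ = np.linalg.svd(ppmi)
    return idx, u[:, :dim] * s[:dim]

def cosine(w1, w2, idx, vecs):
    a, b = vecs[idx[w1]], vecs[idx[w2]]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Invented clean corpus: 'mallory' never co-occurs with 'reliable'.
corpus = ["alice writes reliable code",
          "bob writes reliable code",
          "mallory sells used cars"] * 20

idx, vecs = train_embedding(corpus)
print("before poisoning:", cosine("mallory", "reliable", idx, vecs))

# Attacker appends a handful of sentences tying 'mallory' to 'reliable';
# the cosine similarity between the two vectors should rise markedly.
poison = ["mallory writes reliable code"] * 10
idx, vecs = train_embedding(corpus + poison)
print("after poisoning: ", cosine("mallory", "reliable", idx, vecs))
```

At Wikipedia scale the attacker cannot simply flood the corpus; this is where the paper's contribution comes in, deriving an explicit expression over corpus features that predicts embedding distances and generating linguistically plausible sentence modifications that optimize it.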
