Paper Title
How to keep text private? A systematic review of deep learning methods for privacy-preserving natural language processing
Paper Authors
Paper Abstract
Deep learning (DL) models for natural language processing (NLP) tasks often handle private data, demanding protection against breaches and disclosures. Data protection laws, such as the European Union's General Data Protection Regulation (GDPR), thereby enforce the need for privacy. Although many privacy-preserving NLP methods have been proposed in recent years, no categories to organize them have been introduced yet, making it hard to follow the progress of the literature. To close this gap, this article systematically reviews over sixty DL methods for privacy-preserving NLP published between 2016 and 2020, covering theoretical foundations, privacy-enhancing technologies, and an analysis of their suitability for real-world scenarios. First, we introduce a novel taxonomy for classifying the existing methods into three categories: data safeguarding methods, trusted methods, and verification methods. Second, we present an extensive summary of privacy threats, datasets for applications, and metrics for privacy evaluation. Third, throughout the review, we describe privacy issues across the NLP pipeline from a holistic perspective. Further, we discuss open challenges in privacy-preserving NLP regarding data traceability, computational overhead, dataset size, the prevalence of human biases in embeddings, and the privacy-utility tradeoff. Finally, this review presents future research directions to guide successive research and development of privacy-preserving NLP models.