Paper Title
Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset
Paper Authors
Paper Abstract
Automated analysis of privacy policies has proved a fruitful research direction, with developments such as automated policy summarization, question answering systems, and compliance detection. Prior research has been limited to analysis of privacy policies from a single point in time or from short spans of time, as researchers did not have access to a large-scale, longitudinal, curated dataset. To address this gap, we developed a crawler that discovers, downloads, and extracts archived privacy policies from the Internet Archive's Wayback Machine. Using the crawler and following a series of validation and quality control steps, we curated a dataset of 1,071,488 English language privacy policies, spanning over two decades and over 130,000 distinct websites. Our analyses of the data paint a troubling picture of the transparency and accessibility of privacy policies. By comparing the occurrence of tracking-related terminology in our dataset to prior web privacy measurements, we find that privacy policies have consistently failed to disclose the presence of common tracking technologies and third parties. We also find that over the last twenty years privacy policies have become even more difficult to read, doubling in length and increasing a full grade in the median reading level. Our data indicate that self-regulation for first-party websites has stagnated, while self-regulation for third parties has increased but is dominated by online advertising trade associations. Finally, we contribute to the literature on privacy regulation by demonstrating the historic impact of the GDPR on privacy policies.