Paper Title
APPCorp: A Corpus for Android Privacy Policy Document Structure Analysis
Paper Authors
Paper Abstract
With the increasing popularity of mobile devices and the wide adoption of mobile apps, concern about privacy issues is growing. Privacy policies are identified as a proper medium for conveying legal terms, such as those required by GDPR, and for binding the legal agreement between service providers and users. However, privacy policies are usually too long and vague for end users to read and understand. It is thus important to be able to automatically analyze the document structure of privacy policies to assist user understanding. In this work, we create a manually labelled corpus containing 167 privacy policies (more than 447K words and 5,276 annotated paragraphs). We report the annotation process and the details of the annotated corpus. We also benchmark the corpus with 4 document classification models, thoroughly analyze the results, and discuss the challenges and opportunities for the research community in using the corpus. We release the labelled corpus as well as the classification models for public access.