Paper title
Lessons from the AdKDD'21 Privacy-Preserving ML Challenge
Paper authors
Paper abstract
Designing data-sharing mechanisms that provide both performance and strong privacy guarantees is a hot topic in the online advertising industry. In particular, a prominent proposal discussed in the Improving Web Advertising Business Group at W3C allows advertising signals to be shared only through aggregated, differentially private reports of past displays. To study this proposal extensively, an open Privacy-Preserving Machine Learning Challenge took place at AdKDD'21, a premier workshop on advertising science, with data provided by the advertising company Criteo. In this paper, we describe the challenge tasks and the structure of the available datasets, report the challenge results, and enable their full reproducibility. A key finding is that learning models on large, aggregated data in the presence of a small set of unaggregated data points can be surprisingly efficient and cheap. We also run additional experiments to observe the sensitivity of the winning methods to different parameters, such as the privacy budget or the quantity of available privileged side information. We conclude that, to keep ad relevance at a reasonable level, the industry needs either alternative designs for private data sharing or a breakthrough in learning from aggregated data alone.