A Multifaceted Benchmarking of Synthetic Electronic Health Record Generation Models

Paper Title

A Multifaceted Benchmarking of Synthetic Electronic Health Record Generation Models

Authors

Yan, Chao; Yan, Yao; Wan, Zhiyu; Zhang, Ziqi; Omberg, Larsson; Guinney, Justin; Mooney, Sean D.; Malin, Bradley A.

Abstract

Synthetic health data have the potential to mitigate privacy concerns when sharing data to support biomedical research and the development of innovative healthcare applications. Modern approaches for data generation based on machine learning, generative adversarial networks (GAN) methods in particular, continue to evolve and demonstrate remarkable potential. Yet there is a lack of a systematic assessment framework to benchmark methods as they emerge and determine which methods are most appropriate for which use cases. In this work, we introduce a generalizable benchmarking framework to appraise key characteristics of synthetic health data with respect to utility and privacy metrics. We apply the framework to evaluate synthetic data generation methods for electronic health records (EHRs) data from two large academic medical centers with respect to several use cases. The results illustrate that there is a utility-privacy tradeoff for sharing synthetic EHR data. The results further indicate that no method is unequivocally the best on all criteria in each use case, which makes it evident why synthetic data generation methods need to be assessed in context.
