Generation of Differentially Private Heterogeneous Electronic Health Records

Paper Title

Generation of Differentially Private Heterogeneous Electronic Health Records

Authors

Kieran Chin-Cheong, Thomas Sutter, Julia E. Vogt

Abstract

Electronic Health Records (EHRs) are commonly used by the machine learning community for research on problems specifically related to health care and medicine. EHRs have the advantages that they can be easily distributed and contain many features useful for, e.g., classification problems. What makes EHR data sets different from typical machine learning data sets is that they are often very sparse, due to their high dimensionality, and often contain heterogeneous (mixed) data types. Furthermore, the data sets deal with sensitive information, which limits the distribution of any models learned using them, due to privacy concerns. For these reasons, using EHR data in practice presents a real challenge. In this work, we explore using Generative Adversarial Networks to generate synthetic, heterogeneous EHRs with the goal of using these synthetic records in place of existing data sets for downstream classification tasks. We further explore applying differential privacy (DP) preserving optimization in order to produce DP synthetic EHR data sets, which provide rigorous privacy guarantees and are therefore shareable and usable in the real world. The performance (measured by AUROC, AUPRC and accuracy) of our model's synthetic, heterogeneous data is very close to that of the original data set (within 3-5% of the baseline) for the non-DP model when tested in a binary classification task. Using strong $(1, 10^{-5})$ DP, our model still produces data useful for machine learning tasks, albeit incurring a roughly 17% performance penalty in our tested classification task. We additionally perform a sub-population analysis and find that our model does not introduce any bias into the synthetic EHR data compared to the baseline in either male/female populations, or the 0-18, 19-50 and 51+ age groups in terms of classification performance for either the non-DP or DP variant.
