Paper title
gcimpute: A Package for Missing Data Imputation
Paper authors
Paper abstract
This article introduces the Python package gcimpute for missing data imputation. gcimpute can impute missing data with many different variable types, including continuous, binary, ordinal, count, and truncated values, by modeling data as samples from a Gaussian copula model. This semiparametric model learns the marginal distribution of each variable to match the empirical distribution, yet describes the interactions between variables with a joint Gaussian that enables fast inference, imputation with confidence intervals, and multiple imputation. The package also provides specialized extensions to handle large datasets (with complexity linear in the number of observations) and streaming datasets (with online imputation). This article describes the underlying methodology and demonstrates how to use the software package.
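The idea summarized in the abstract can be illustrated with a minimal sketch, written with NumPy and SciPy only. It is not the gcimpute API and not the authors' implementation. It assumes continuous variables only, and it estimates the latent correlation from pairwise-complete normal scores as a shortcut instead of a proper model fit. The helper names `to_latent` and `copula_impute` are hypothetical.

```python
import numpy as np
from scipy.stats import norm


def to_latent(col):
    """Map the observed entries of one column to normal scores via its empirical CDF."""
    z = np.full(col.shape, np.nan)
    obs = ~np.isnan(col)
    n_obs = obs.sum()
    # rank / (n_obs + 1) keeps the empirical CDF values strictly inside (0, 1)
    ranks = np.argsort(np.argsort(col[obs])) + 1
    z[obs] = norm.ppf(ranks / (n_obs + 1))
    return z


def copula_impute(X):
    """Fill NaNs in a continuous matrix X using a simplified Gaussian copula."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Z = np.column_stack([to_latent(X[:, j]) for j in range(p)])

    # Assumption: latent correlation from pairwise-complete scores (a shortcut).
    R = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            both = ~np.isnan(Z[:, j]) & ~np.isnan(Z[:, k])
            R[j, k] = R[k, j] = np.corrcoef(Z[both, j], Z[both, k])[0, 1]

    X_imp = X.copy()
    for i in range(n):
        mis = np.isnan(X[i])
        obs = ~mis
        if not mis.any():
            continue
        if obs.any():
            # Conditional mean of the missing latent entries given the observed ones
            z_mis = R[np.ix_(mis, obs)] @ np.linalg.solve(R[np.ix_(obs, obs)], Z[i, obs])
        else:
            # Nothing observed in this row: fall back to the latent mean (the median)
            z_mis = np.zeros(mis.sum())
        # Map back to the data scale through each column's empirical quantiles
        for j, z in zip(np.where(mis)[0], z_mis):
            X_imp[i, j] = np.quantile(X[~np.isnan(X[:, j]), j], norm.cdf(z))
    return X_imp


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=200)
    data = np.column_stack([np.exp(latent[:, 0]), latent[:, 1]])  # skewed marginal
    data[rng.random(data.shape) < 0.2] = np.nan
    print(np.isnan(copula_impute(data)).sum())  # number of NaNs left after imputation
```

Because imputed values are read off each column's empirical quantiles, they always stay within that column's observed range, which is one way to see how the marginals are matched to the empirical distribution while the dependence is handled by the joint Gaussian.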