Data Integration by Combining Big Data and Survey Sample Data for Finite Population Inference

Title

Data Integration by combining big data and survey sample data for finite population inference

Authors

Kim, Jae-Kwang; Tam, Siu-Ming

Abstract

The statistical challenges in using big data to make valid statistical inference for a finite population have been well documented in the literature. These challenges are due primarily to statistical bias arising from under-coverage, where the big data source fails to represent the population of interest, and from measurement errors in the variables available in the data set. By stratifying the population into a big data stratum and a missing data stratum, we can estimate the missing data stratum by using a fully responding probability sample, and hence estimate the population as a whole by using a data integration estimator. By expressing the data integration estimator as a regression estimator, we can handle measurement errors in the variables in the big data and also in the probability sample. We also propose a fully nonparametric classification method for identifying the overlapping units and develop a bias-corrected data integration estimator under misclassification errors. Finally, we develop a two-step regression data integration estimator to deal with measurement errors in the probability sample. An advantage of the approach advocated in this paper is that we do not have to make unrealistic missing-at-random assumptions for the methods to work. The proposed method is applied to a real data example using 2015-16 Australian Agricultural Census data.
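
The stratification idea in the abstract can be illustrated with a minimal sketch. It assumes the simplest setting only: no measurement error, a fully responding probability sample, and a known indicator of whether each sampled unit also appears in the big data source. The function name, argument names, and toy numbers below are hypothetical and are not taken from the paper; the regression, bias-corrected, and two-step estimators it describes are not shown.

```python
def data_integration_total(big_data_y, sample_y, sample_weights, in_big_data):
    """Estimate a population total by combining two strata.

    big_data_y     : y-values of every unit in the big data stratum
                     (assumed to be observed without error)
    sample_y       : y-values of the units in the probability sample
    sample_weights : design weights of the sampled units
    in_big_data    : True if the sampled unit also appears in the big data
    """
    # Big data stratum: its total is known exactly, so just add it up.
    big_data_total = sum(big_data_y)

    # Missing data stratum: weight up only the sampled units that the
    # big data source failed to cover.
    missing_total = sum(
        w * y
        for y, w, covered in zip(sample_y, sample_weights, in_big_data)
        if not covered
    )
    return big_data_total + missing_total


# Toy illustration (made-up numbers): the big data source covers four
# units; a sample of three units with design weight 2.0 each contains
# one unit that is not covered by the big data.
estimate = data_integration_total(
    big_data_y=[1, 2, 3, 4],
    sample_y=[2, 4, 5],
    sample_weights=[2.0, 2.0, 2.0],
    in_big_data=[True, True, False],
)
print(estimate)
```

In this sketch the sampled units that overlap with the big data contribute nothing to the weighted sum, since their stratum total is already counted in full; only the uncovered units are weighted up to represent the missing data stratum.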
