Paper title

Improved assessment of the accuracy of record linkage via an extended MaCSim approach

Authors

Shovanur Haque, Kerrie Mengersen

Abstract

Record linkage is the process of bringing together records for the same entity from overlapping data sources while removing duplicates. Huge amounts of data are now being collected by public and private organizations as well as by researchers and individuals. Linking and analysing relevant information from this massive data reservoir can provide new insights into society. However, this increase in the amount of data may also increase the likelihood of incorrectly linked records among databases. It has become increasingly important to have effective and efficient methods for linking data from different sources. It is therefore necessary to assess the ability of a linking method to achieve high accuracy, or to compare methods with respect to accuracy. In this paper, we improve on a Markov chain based Monte Carlo simulation approach (MaCSim) for assessing a linking method. MaCSim uses two files that have previously been linked on similar types of data to create an agreement matrix, and then simulates the matrix with an algorithm developed to generate re-sampled versions of the agreement matrix. A defined linking method is used in each simulation to link the files, and the accuracy of the linking method is assessed. The improvement proposed here involves calculating a similarity weight for every linking variable value for each record pair, which allows partial agreement of the linking variable values. A threshold is calculated for every linking variable based on an adjustable parameter, the "tolerance", for that variable. To assess the accuracy of the linking method, correctly linked proportions are investigated for each record. The extended MaCSim approach is illustrated using a synthetic dataset provided by the Australian Bureau of Statistics (ABS) based on realistic data settings. Test results show higher accuracy of the assessment of linkages.
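The partial-agreement idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the similarity measure or the threshold formula, so the sketch assumes a string similarity from Python's standard `difflib` and assumes a threshold of `1 - tolerance`. The function names (`similarity`, `agrees`, `agreement_row`, `correctly_linked_proportion`) are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity weight in [0, 1]; 1.0 means the two values are identical.

    Assumption: difflib's ratio stands in for the paper's similarity weight.
    """
    return SequenceMatcher(None, a, b).ratio()


def agrees(a: str, b: str, tolerance: float) -> bool:
    """Treat two values as agreeing when similarity reaches the threshold.

    Assumption: threshold = 1 - tolerance, so tolerance 0 demands an exact match.
    """
    threshold = 1.0 - tolerance
    return similarity(a, b) >= threshold


def agreement_row(rec_a: dict, rec_b: dict, tolerances: dict) -> list:
    """Agreement indicators for one record pair, one entry per linking variable."""
    return [agrees(rec_a[var], rec_b[var], tol) for var, tol in tolerances.items()]


def correctly_linked_proportion(links_per_simulation: list, true_link: str) -> float:
    """Share of simulations in which a record was linked to its true match."""
    hits = sum(1 for link in links_per_simulation if link == true_link)
    return hits / len(links_per_simulation)


if __name__ == "__main__":
    a = {"surname": "Smith", "suburb": "Kelvin Grove"}
    b = {"surname": "Smyth", "suburb": "Kelvin Grove"}
    # A per-variable tolerance lets "Smith"/"Smyth" count as a partial agreement.
    print(agreement_row(a, b, {"surname": 0.25, "suburb": 0.0}))
    print(correctly_linked_proportion(["b1", "b1", "b2", "b1"], "b1"))
```

Setting the tolerance per linking variable lets a noisy field such as a surname accept near matches while a cleaner field still requires exact agreement.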
