Paper Title

Maximum Entropy classification for record linkage

Authors

Danhyang Lee, Li-Chun Zhang, Jae-Kwang Kim

Abstract

By record linkage one joins records residing in separate files which are believed to be related to the same entity. In this paper we approach record linkage as a classification problem, and adapt the maximum entropy classification method in text mining to record linkage, both in the supervised and unsupervised settings of machine learning. The set of links will be chosen according to the associated uncertainty. On the one hand, our framework overcomes some persistent theoretical flaws of the classical approach pioneered by Fellegi and Sunter (1969); on the other hand, the proposed algorithm is scalable and fully automatic, unlike the classical approach that generally requires clerical review to resolve the undecided cases.
