Paper Title

Scaling Systematic Literature Reviews with Machine Learning Pipelines

Paper Authors

Seraphina Goldfarb-Tarrant, Alexander Robertson, Jasmina Lazic, Theodora Tsouloufi, Louise Donnison, Karen Smyth

Paper Abstract

Systematic reviews, which entail the extraction of data from large numbers of scientific documents, are an ideal avenue for the application of machine learning. They are vital to many fields of science and philanthropy, but are very time-consuming and require experts. Yet the three main stages of a systematic review are easily done automatically: searching for documents can be done via APIs and scrapers, selection of relevant documents can be done via binary classification, and extraction of data can be done via sequence-labelling classification. Despite the promise of automation for this field, little research exists that examines the various ways to automate each of these tasks. We construct a pipeline that automates each of these aspects, and experiment with many human-time vs. system quality trade-offs. We test the ability of classifiers to work well on small amounts of data and to generalise to data from countries not represented in the training data. We test different types of data extraction with varying difficulty in annotation, and five different neural architectures to do the extraction. We find that we can get surprising accuracy and generalisability of the whole pipeline system with only 2 weeks of human-expert annotation, which is only 15% of the time it takes to do the whole review manually and can be repeated and extended to new data with no additional effort.
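To make the three-stage pipeline described in the abstract concrete, here is a minimal sketch of its shape: document search, binary relevance classification, and sequence-labelling extraction. The function names, the toy data, and the choice of scikit-learn models are illustrative assumptions, not the authors' implementation (the paper reports experiments with several neural architectures for the extraction stage).

```python
# Illustrative sketch of a systematic-review pipeline, assuming scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def search_documents(query):
    """Stage 1: gather candidate documents, e.g. via publisher APIs or scrapers.
    Here a hard-coded placeholder stands in for real search results."""
    return [
        "Randomised trial of cattle vaccination in smallholder herds ...",
        "Survey of goat farming practices and household income ...",
        "Benchmarking convolutional networks on image datasets ...",
    ]


def train_selector(texts, labels):
    """Stage 2: binary classifier that decides whether a document is relevant."""
    selector = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    selector.fit(texts, labels)
    return selector


def extract_fields(document):
    """Stage 3: data extraction framed as sequence labelling.
    A real system would tag each token (e.g. BiLSTM-CRF or a transformer);
    this stub only illustrates the expected (token, label) output shape."""
    return [(token, "O") for token in document.split()]


if __name__ == "__main__":
    docs = search_documents("livestock interventions")
    # Toy training data standing in for expert-annotated relevance labels.
    selector = train_selector(
        ["cattle vaccination trial", "goat farming survey", "image classification benchmark"],
        [1, 1, 0],
    )
    relevant = [d for d in docs if selector.predict([d])[0] == 1]
    for doc in relevant:
        print(extract_fields(doc)[:5])
```

In the paper's setting, the relevance classifier and the extraction model are the components trained on the roughly two weeks of expert annotation; the search stage needs no labelled data.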
