COMETA: A Corpus for Medical Entity Linking in the Social Media

Paper Title

COMETA: A Corpus for Medical Entity Linking in the Social Media

Authors

Marco Basaldella, Fangyu Liu, Ehsan Shareghi, Nigel Collier

Abstract

Whilst there has been growing progress in Entity Linking (EL) for general language, existing datasets fail to address the complex nature of health terminology in layman's language. Meanwhile, there is a growing need for applications that can understand the public's voice in the health domain. To address this we introduce a new corpus called COMETA, consisting of 20k English biomedical entity mentions from Reddit expert-annotated with links to SNOMED CT, a widely-used medical knowledge graph. Our corpus satisfies a combination of desirable properties, from scale and coverage to diversity and quality, that to the best of our knowledge has not been met by any of the existing resources in the field. Through benchmark experiments on 20 EL baselines from string- to neural-based models we shed light on the ability of these systems to perform complex inference on entities and concepts under 2 challenging evaluation scenarios. Our experimental results on COMETA illustrate that no golden bullet exists and even the best mainstream techniques still have a significant performance gap to fill, while the best solution relies on combining different views of data.
