Paper Title
What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation
Authors
Abstract
Acronyms are short forms of phrases that make it easier to convey lengthy sentences in documents and serve as one of the mainstays of writing. Due to their importance, identifying acronyms and their corresponding phrases (i.e., acronym identification (AI)) and finding the correct meaning of each acronym (i.e., acronym disambiguation (AD)) are crucial for text understanding. Despite recent progress on these tasks, the existing datasets have limitations that hinder further improvement. More specifically, the limited size of manually annotated AI datasets and the noise in automatically created ones obstruct the design of advanced, high-performing acronym identification models. Moreover, the existing datasets are mostly limited to the medical domain and ignore other domains. To address these two limitations, we first create a large, manually annotated AI dataset for the scientific domain. This dataset contains 17,506 sentences, which is substantially larger than previous scientific AI datasets. Next, we prepare an AD dataset for the scientific domain with 62,441 samples, which is significantly larger than the previous scientific AD dataset. Our experiments show that the existing state-of-the-art models fall far behind human-level performance on both datasets proposed in this work. In addition, we propose a new deep learning model that uses the syntactic structure of the sentence to expand an ambiguous acronym in a sentence. The proposed model outperforms the state-of-the-art models on the new AD dataset, providing a strong baseline for future research on this dataset.
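To make the two tasks concrete, the following is a minimal toy sketch of acronym identification (AI) and acronym disambiguation (AD). It is a rule-based illustration only, not the syntax-aware deep learning model proposed in the paper. The function names `find_acronyms` and `disambiguate`, and the cue-word dictionary, are hypothetical and chosen purely for illustration.

```python
import re


def find_acronyms(sentence):
    """Toy AI: find 'long form (ACR)' pairs whose word initials spell ACR."""
    pairs = []
    for match in re.finditer(r"\(([A-Z]{2,})\)", sentence):
        acronym = match.group(1)
        # Words that appear before the opening parenthesis.
        preceding = re.findall(r"[A-Za-z]+", sentence[:match.start()])
        # Take as many trailing words as the acronym has letters.
        candidate = preceding[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        if len(candidate) == len(acronym) and initials == acronym:
            pairs.append((acronym, " ".join(candidate)))
    return pairs


def disambiguate(sentence, cues):
    """Toy AD: pick the expansion whose cue words overlap most with the sentence.

    `cues` maps each candidate expansion to a set of cue words
    (a hypothetical, hand-written resource for this sketch).
    """
    context = set(re.findall(r"[a-z]+", sentence.lower()))
    return max(cues, key=lambda expansion: len(cues[expansion] & context))


text = ("Identifying acronyms and corresponding phrases "
        "(i.e., acronym identification (AI)) is crucial.")
pairs = find_acronyms(text)

ai_cues = {
    "acronym identification": {"acronym", "phrase", "text"},
    "artificial intelligence": {"learning", "agent", "machine"},
}
choice = disambiguate(
    "The AI model tags each acronym and its long phrase in the text.",
    ai_cues,
)
```

Here `pairs` holds the acronym and long form recovered from the sentence, and `choice` is the expansion selected for the ambiguous acronym. Simple heuristics of this kind fail when the long form is not adjacent to the acronym or when the context shares no cue words with any candidate, which is one reason the tasks call for annotated datasets and learned models.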