Paper Title
SsciBERT: A Pre-trained Language Model for Social Science Texts
Authors
Abstract
The academic literature of the social sciences records human civilization and studies the problems of human society. As this literature grows rapidly, researchers urgently need ways to quickly find existing research on relevant issues. Previous work, such as SciBERT, has shown that pre-training on domain-specific texts can improve performance on natural language processing tasks. However, no pre-trained language model for the social sciences has been available so far. In light of this, the present research proposes a pre-trained model based on abstracts published in Social Science Citation Index (SSCI) journals. The models, which are available on GitHub (https://github.com/S-T-Full-Text-Knowledge-Mining/SSCI-BERT), show excellent performance on discipline classification, abstract structure-function recognition, and named entity recognition tasks on social science literature.