Paper title
Assigning function to protein-protein interactions: a weakly supervised BioBERT based approach using PubMed abstracts
Paper authors
Paper abstract
Motivation: Protein-protein interactions (PPIs) are critical to the function of proteins in both normal and diseased cells, and many critical protein functions are mediated by interactions. Knowledge of the nature of these interactions is important for the construction of networks to analyse biological data. However, only a small percentage of the PPIs captured in protein interaction databases have functional annotations available, e.g. only 4% of PPIs are functionally annotated in the IntAct database. Here, we aim to label the function type of PPIs by extracting relationships described in PubMed abstracts.

Method: We create a weakly supervised dataset from the IntAct PPI database containing interacting protein pairs with annotated function and associated abstracts from the PubMed database. We apply BioBERT, a state-of-the-art deep learning technique for biomedical natural language processing tasks, to build a model (dubbed PPI-BioBERT) for identifying the function of PPIs. In order to extract high-quality PPI functions at large scale, we use an ensemble of PPI-BioBERT models to improve uncertainty estimation and apply an interaction type-specific threshold to counteract the effects of variations in the number of training samples per interaction type.

Results: We scan 18 million PubMed abstracts to automatically identify 3253 new typed PPIs, including phosphorylation and acetylation interactions, with an overall precision of 46% (87% for acetylation) based on a human-reviewed sample. This work demonstrates that analysis of biomedical abstracts for PPI function extraction is a feasible approach to substantially increasing the number of interactions annotated with function captured in online databases.
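The ensemble and type-specific threshold step described in the Method can be illustrated with a minimal sketch. This is one plausible reading and not the authors' implementation: the function name `ensemble_predict`, the averaging rule, the label set, and all probability and threshold values below are assumptions made for illustration only. Each ensemble member is assumed to return a probability per interaction type for every abstract; the probabilities are averaged across members, and a prediction is kept only if the averaged confidence reaches the threshold chosen for that interaction type.

```python
from statistics import mean, pstdev


def ensemble_predict(model_probs, labels, thresholds):
    """Combine per-model class probabilities and apply per-type thresholds.

    model_probs[m][i][c] is the probability that model m assigns to class c
    for sample i (hypothetical layout, assumed for this sketch).
    thresholds maps each label to the minimum averaged confidence required.
    """
    n_samples = len(model_probs[0])
    n_classes = len(labels)
    results = []
    for i in range(n_samples):
        # Average each class probability over all ensemble members.
        avg = [mean([m[i][c] for m in model_probs]) for c in range(n_classes)]
        # Pick the class with the highest averaged probability.
        best = max(range(n_classes), key=lambda c: avg[c])
        # Spread across members, as a rough indicator of ensemble disagreement.
        spread = pstdev([m[i][best] for m in model_probs])
        label = labels[best]
        # Keep the prediction only if it clears the threshold for its type.
        accepted = avg[best] >= thresholds[label]
        results.append({
            "label": label if accepted else None,
            "confidence": avg[best],
            "spread": spread,
        })
    return results


# Illustrative values only: two models, two abstracts, three interaction types.
labels = ["phosphorylation", "acetylation", "other"]
thresholds = {"phosphorylation": 0.9, "acetylation": 0.7, "other": 0.9}
model_probs = [
    [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]],
    [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]],
]
for row in ensemble_predict(model_probs, labels, thresholds):
    print(row)
```

In this toy input the first abstract is rejected because its averaged phosphorylation confidence falls below the stricter phosphorylation threshold, while the second is accepted as acetylation. Using a separate threshold per type lets types with few training samples be filtered differently from well-represented ones.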