Paper title
ChemGrapher: Optical Graph Recognition of Chemical Compounds by Deep Learning
Paper authors
Abstract
In drug discovery, knowledge of the graph structure of chemical compounds is essential. Many thousands of scientific articles in chemistry and the pharmaceutical sciences have investigated chemical compounds, but in some cases the details of the structure of these compounds are published only as images. A tool that analyzes these images automatically and converts them into chemical graph structures would be useful for many applications, such as drug discovery. A few such tools are available, and they are mostly derived from optical character recognition. However, our evaluation of the performance of those tools reveals that they often make mistakes in detecting the correct bond multiplicity and stereochemical information. In addition, errors sometimes even lead to missing atoms in the resulting graph. In our work, we address these issues by developing a compound recognition method based on machine learning. More specifically, we develop a deep neural network model for optical compound recognition. The deep learning solution presented here consists of a segmentation model, followed by three classification models that predict atom locations, bonds, and charges. Furthermore, this model not only predicts the graph structure of the molecule but also produces all the information necessary to relate each component of the resulting graph to the source image. This solution is scalable and can rapidly process thousands of images. Finally, we empirically compare the proposed method with a well-established tool and observe significant error reductions.
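To make the described output concrete, here is a minimal Python sketch of a molecular graph whose atoms and bonds each keep a reference back to the source image. It is not the authors' implementation: the class names, field names, pixel coordinates, and the midpoint convention for locating a bond are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Atom:
    element: str             # predicted atom type, e.g. "C" or "O"
    charge: int              # predicted formal charge
    pixel: Tuple[int, int]   # (row, col) of the atom in the source image


@dataclass
class Bond:
    a: int        # index of the first atom in RecognizedGraph.atoms
    b: int        # index of the second atom
    order: int    # bond multiplicity: 1, 2, or 3
    stereo: str = "none"  # hypothetical label, e.g. "wedge" or "dash"


@dataclass
class RecognizedGraph:
    atoms: List[Atom] = field(default_factory=list)
    bonds: List[Bond] = field(default_factory=list)

    def add_atom(self, element: str, charge: int, pixel: Tuple[int, int]) -> int:
        """Append an atom and return its index."""
        self.atoms.append(Atom(element, charge, pixel))
        return len(self.atoms) - 1

    def add_bond(self, a: int, b: int, order: int, stereo: str = "none") -> None:
        """Append a bond between two distinct, existing atoms."""
        n = len(self.atoms)
        if a == b or not (0 <= a < n and 0 <= b < n):
            raise ValueError("bond must join two distinct existing atoms")
        self.bonds.append(Bond(a, b, order, stereo))

    def neighbors(self, atom_index: int) -> List[int]:
        """Indices of the atoms bonded to the given atom."""
        out = []
        for bond in self.bonds:
            if bond.a == atom_index:
                out.append(bond.b)
            elif bond.b == atom_index:
                out.append(bond.a)
        return out

    def bond_location(self, bond_index: int) -> Tuple[float, float]:
        """Image location of a bond, taken here as the midpoint of its atoms."""
        bond = self.bonds[bond_index]
        (r1, c1), (r2, c2) = self.atoms[bond.a].pixel, self.atoms[bond.b].pixel
        return ((r1 + r2) / 2, (c1 + c2) / 2)


# Example: a carbonyl fragment (C=O) with made-up pixel positions.
g = RecognizedGraph()
c = g.add_atom("C", 0, (40, 30))
o = g.add_atom("O", 0, (40, 70))
g.add_bond(c, o, order=2)
```

Keeping a pixel position on every atom lets any predicted component, including a bond with a suspect multiplicity, be traced back to the image region it came from for manual checking.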