Paper title
Learning to Untangle Genome Assembly with Graph Convolutional Networks
Authors
Abstract
The quest to determine the complete sequence of human DNA from telomere to telomere started three decades ago and was finally completed in 2021. This accomplishment was the result of a tremendous effort by numerous experts who engineered various tools and performed laborious manual inspection to achieve the first gapless genome sequence. However, such a method can hardly be used as a general approach to assembling different genomes, especially when assembly speed is critical given the large amount of data. In this work, we explore a different approach to the central part of the genome assembly task, which consists of untangling a large assembly graph from which a genomic sequence needs to be reconstructed. Our main motivation is to reduce the reliance on human-engineered heuristics and to use deep learning to develop more generalizable reconstruction techniques. Specifically, we introduce a new learning framework that trains a graph convolutional network to resolve assembly graphs by finding a correct path through them. The training is supervised with a dataset generated from the resolved CHM13 human sequence, and the model is tested on assembly graphs built from real human PacBio HiFi reads. Experimental results show that a model trained on simulated graphs generated solely from a single chromosome resolves all other chromosomes remarkably well. Moreover, the model outperforms the hand-crafted heuristics of a state-of-the-art de novo assembler on the same graphs. Chromosomes reconstructed with graph networks are more accurate at the nucleotide level and yield fewer contigs, a higher reconstructed genome fraction, and higher NG50/NGA50 assessment metrics.
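The abstract reports NG50 among its assessment metrics without defining it. As an illustration only, the sketch below uses the commonly cited definition: NG50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the reference genome length. The function name `ng50` and its arguments are hypothetical and are not taken from the paper; NGA50, which is computed on aligned blocks after breaking contigs at misassemblies, is not covered here.

```python
def ng50(contig_lengths, genome_length):
    """Return NG50 for a list of contig lengths (hypothetical helper).

    Assumes the common definition: walk the contigs from longest to
    shortest and return the length of the contig at which the running
    total first reaches half of the reference genome length.
    """
    threshold = genome_length / 2
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= threshold:
            return length
    # The assembly covers less than half of the genome, so NG50 is undefined;
    # returning 0 is a convention chosen for this sketch.
    return 0
```

Unlike N50, which normalizes by the total assembly length, this version normalizes by the reference genome length, so an assembly that is missing large parts of the genome cannot score well simply because its few contigs are long.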