论文标题

计算非常大的字符串集合的最佳BWT

Computing the optimal BWT of very large string collections

论文作者

Cenzato, Davide, Guerrini, Veronica, Lipták, Zsuzsanna, Rosone, Giovanna

论文摘要

众所周知,字符串集合的burrows-wheeler-transform(BWT)的确切形式取决于集合中字符串的输入顺序。输入集合的重新排序字符串会影响同等字母运行$ r $的数量,这可以说是基于BWT的数据结构的最重要参数,例如FM-Index或$ r $ index。 Bentley,Gibney和Thankachan [ESA 2020]引入了一种线性时间算法,用于计算输入集合的排列,该算法产生了所得BWT的最小运行次数。 在本文中,我们介绍了第一个保证具有最少运行次数(OPTBWT)的Burrows-Wheeler-Transform,通过组合i)算法,该算法从字符串集合中构建BWT(SAIS基于SAIS的[Cenzato等,Spire等,Spire 2021]或BCR [Bauer等人,CPM,CPM 2011]); ii)[Cox等人,生物信息学,2012]中引入的SAP阵列数据结构; iii)Bentley等人的算法。 我们在现实生活和模拟数据上介绍了结果,表明,相对于输入顺序而言,根据$ r $的改进非常重要,并且通过计算最佳BWT可以忽略的计算而创建的间接费用,从而使我们的工具在运行时间和空间使用方面与BWT-Compant的其他工具竞争。特别是,在实际数据上,OptBWT的运行量最多只有31倍,而仅$ 1.39 \ times $ slowdown。 源代码可从https://github.com/davidecenzato/optimalbwt.git获得。

It is known that the exact form of the Burrows-Wheeler-Transform (BWT) of a string collection depends, in most implementations, on the input order of the strings in the collection. Reordering strings of an input collection affects the number of equal-letter runs $r$, arguably the most important parameter of BWT-based data structures, such as the FM-index or the $r$-index. Bentley, Gibney, and Thankachan [ESA 2020] introduced a linear-time algorithm for computing the permutation of the input collection which yields the minimum number of runs of the resulting BWT. In this paper, we present the first tool that guarantees a Burrows-Wheeler-Transform with minimum number of runs (optBWT), by combining i) an algorithm that builds the BWT from a string collection (either SAIS-based [Cenzato et al., SPIRE 2021] or BCR [Bauer et al., CPM 2011]); ii) the SAP array data structure introduced in [Cox et al., Bioinformatics, 2012]; and iii) the algorithm by Bentley et al. We present results both on real-life and simulated data, showing that the improvement achieved in terms of $r$ with respect to the input order is significant and the overhead created by the computation of the optimal BWT negligible, making our tool competitive with other tools for BWT-computation in terms of running time and space usage. In particular, on real data the optBWT obtains up to 31 times fewer runs with only a $1.39\times$ slowdown. Source code is available at https://github.com/davidecenzato/optimalBWT.git.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源