Paper title
A survey of BWT variants for string collections
Paper authors
Paper abstract
In recent years, the focus of bioinformatics research has moved from individual sequences to collections of sequences. Given the fundamental role of the Burrows-Wheeler Transform (BWT) in string processing, a number of dedicated tools have been developed for computing the BWT of string collections. While the focus has been on improving efficiency, both in space and time, the exact definition of the BWT employed has not been at the center of attention. As we show in this paper, the different tools in use often compute non-equivalent BWT variants: the resulting transforms can differ from each other significantly, including in the number $r$ of runs, a central parameter of the BWT. Moreover, with many tools, the transform depends on the input order of the collection. In other words, on the same dataset, the same tool may output different transforms if the dataset is given in a different order. We studied $18$ dedicated tools for computing the BWT of string collections and identified $6$ different BWT variants computed by these tools. We review the differences between these BWT variants, both from a theoretical and from a practical point of view, comparing them on $8$ real-life biological datasets with different characteristics. We find that the differences can be extensive, depending on the dataset, and are largest on collections of many similar short sequences. The parameter $r$, the number of runs of the BWT, also shows notable variation between the different BWT variants; on our datasets, it varied by a multiplicative factor of up to $4.2$. Source code and scripts to replicate the results and download the data used in the article are available at \url{https://github.com/davidecenzato/BWT-variants-for-string-collections}.
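To make the dependence on input order concrete, the following is a minimal sketch, not code from the paper or from any of the surveyed tools. It assumes one separator-based convention: each string receives its own end marker, the markers are ordered by input position and sort below every letter, and the BWT is taken of the concatenation. The function names `multidollar_bwt` and `count_runs` are hypothetical, and the suffix sorting is deliberately naive, so it is only suitable for toy inputs.

```python
def multidollar_bwt(strings):
    """BWT of s1 $1 s2 $2 ... sk $k, assuming $1 < $2 < ... < $k < every letter.

    All end markers are printed as '$' in the output.
    """
    k = len(strings)
    text = []
    for i, s in enumerate(strings):
        # Shift letters above all separator codes 0..k-1.
        text.extend(k + ord(c) for c in s)
        # The i-th string's end marker is encoded as the integer i.
        text.append(i)
    # Naive suffix sorting: fine for small examples, quadratic or worse in general.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character is the symbol preceding each sorted suffix (cyclically).
    return "".join(
        "$" if text[i - 1] < k else chr(text[i - 1] - k) for i in suffix_array
    )


def count_runs(s):
    """Number r of maximal runs of equal characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


if __name__ == "__main__":
    for collection in (["A", "C", "A"], ["A", "A", "C"]):
        bwt = multidollar_bwt(collection)
        print(collection, bwt, count_runs(bwt))
```

Under these assumptions, the same three strings given in two different orders yield `ACA$$$` with 4 runs and `AAC$$$` with 3 runs, a toy-sized instance of the order dependence and the variation in $r$ described in the abstract.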