Context binning, model clustering and adaptivity for data compression of genetic data

Paper title

Context binning, model clustering and adaptivity for data compression of genetic data

Authors

Duda, Jarek

Abstract

Rapid growth of genetic databases means huge savings from improvements in their data compression, which requires better inexpensive statistical models. This article proposes automated optimizations, e.g. of Markov-like models, especially context binning and model clustering. While it is popular to simply remove the low bits of the context, the proposed context binning automatically optimizes this reduction as a lookup table, state = bin[context], where the state determines the probability distribution. In this way nearly all useful information is extracted, even from very large contexts, into a relatively small number of states. The second proposed approach, model clustering, uses k-means clustering in the space of general statistical models, allowing a few models (the cluster centroids) to be optimized and then chosen, e.g. separately for each read. Some adaptivity techniques for handling data non-stationarity are also briefly discussed.
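
The tabled reduction state = bin[context] can be illustrated with a minimal Python sketch. It is not the paper's optimization procedure: it only shows the lookup structure and the popular baseline of dropping the low bits of the context, which an optimized table would replace. The function names (context_id, truncation_table, train, cost_bits), the 2-bits-per-nucleotide context encoding and the add-one prior are all assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"  # assumed 4-symbol alphabet, 2 bits per symbol


def context_id(seq, i, order):
    """Pack the `order` symbols before position i into one integer.

    Assumed layout: the most recent symbol occupies the highest bits,
    so dropping low bits discards the oldest symbols first.
    """
    cid = 0
    for j in range(1, order + 1):
        cid = cid * 4 + ALPHABET.index(seq[i - j])
    return cid


def truncation_table(order, kept):
    """Baseline bin table: drop the low bits, keeping `kept` recent symbols."""
    shift = 2 * (order - kept)
    return [c >> shift for c in range(4 ** order)]


def train(seq, order, bin_table, n_states):
    """Estimate one probability distribution per state (add-one prior)."""
    counts = [[1] * 4 for _ in range(n_states)]
    for i in range(order, len(seq)):
        state = bin_table[context_id(seq, i, order)]
        counts[state][ALPHABET.index(seq[i])] += 1
    return [[c / sum(row) for c in row] for row in counts]


def cost_bits(seq, order, bin_table, probs):
    """Ideal code length in bits of seq under the binned model."""
    total = 0.0
    for i in range(order, len(seq)):
        state = bin_table[context_id(seq, i, order)]
        total -= math.log2(probs[state][ALPHABET.index(seq[i])])
    return total
```

Any other list of length 4**order with entries below n_states can be passed as bin_table, so an optimized binning could be compared against the truncation baseline by calling cost_bits on the same sequence.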
