Paper title

Context binning, model clustering and adaptivity for data compression of genetic data

Author

Duda, Jarek

Abstract

The rapid growth of genetic databases means that improvements in their data compression can yield large savings, which requires better yet inexpensive statistical models. This article proposes automated optimization of, for example, Markov-like models, in particular context binning and model clustering. While it is popular to simply remove the low bits of the context, the proposed context binning automatically optimizes such a reduction in the form of a table, state = bin[context], where the state determines the probability distribution. In this way nearly all useful information can be extracted, even from very large contexts, into a relatively small number of states. The second proposed approach, model clustering, uses k-means clustering in the space of general statistical models, which allows a few models (the cluster centroids) to be optimized and then chosen, for example, separately for each read. Some adaptivity techniques for handling data non-stationarity are also briefly discussed.
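The table form state = bin[context] can be illustrated with a small sketch. The abstract does not spell out the optimization procedure, so the code below is only an assumption: it uses a Lloyd-style (k-means-like) alternation that pools the next-symbol counts of all contexts sharing a state and then reassigns each context to the state that would encode its symbols most cheaply. All names (`context_counts`, `build_bins`, `bits_per_symbol`, `toy_sequence`) and the smoothing constant are hypothetical and chosen for illustration; this is not the author's implementation.

```python
import math
import random

ALPHABET = "ACGT"

def context_counts(seq, order):
    """Count next-symbol occurrences for every length-`order` context."""
    counts = {}
    for i in range(order, len(seq)):
        row = counts.setdefault(seq[i - order:i], [0] * len(ALPHABET))
        row[ALPHABET.index(seq[i])] += 1
    return counts

def smoothed(row, alpha=0.5):
    """Counts -> probabilities with additive smoothing (alpha is an assumption)."""
    total = sum(row) + alpha * len(row)
    return [(c + alpha) / total for c in row]

def state_probs(counts, bin_table, n_states):
    """Pool the counts of all contexts mapped to the same state."""
    pooled = [[0] * len(ALPHABET) for _ in range(n_states)]
    for ctx, row in counts.items():
        for k, c in enumerate(row):
            pooled[bin_table[ctx]][k] += c
    return [smoothed(row) for row in pooled]

def build_bins(counts, n_states, iters=20, seed=0):
    """Lloyd-style alternation; returns (bin_table, per-state distributions)."""
    rng = random.Random(seed)
    # Start from a random assignment of contexts to states.
    bin_table = {ctx: rng.randrange(n_states) for ctx in sorted(counts)}
    for _ in range(iters):
        probs = state_probs(counts, bin_table, n_states)
        # Move each context to the state with the lowest coding cost in bits.
        for ctx, row in counts.items():
            costs = [-sum(c * math.log2(p[k]) for k, c in enumerate(row))
                     for p in probs]
            bin_table[ctx] = costs.index(min(costs))
    # Recompute so the returned distributions match the final table.
    return bin_table, state_probs(counts, bin_table, n_states)

def bits_per_symbol(seq, order, bin_table, probs):
    """Average ideal code length when predicting via state = bin[context]."""
    total = 0.0
    for i in range(order, len(seq)):
        # Unseen contexts fall back to state 0 (an arbitrary choice).
        state = bin_table.get(seq[i - order:i], 0)
        total -= math.log2(probs[state][ALPHABET.index(seq[i])])
    return total / (len(seq) - order)

def toy_sequence(n, seed=1):
    """Synthetic data: the next symbol usually follows the cycle A->C->G->T->A."""
    rng = random.Random(seed)
    seq = ["A"]
    for _ in range(n - 1):
        nxt = ALPHABET[(ALPHABET.index(seq[-1]) + 1) % 4]
        seq.append(nxt if rng.random() < 0.9 else rng.choice(ALPHABET))
    return "".join(seq)
```

On the synthetic sequence from `toy_sequence`, one could compare `bits_per_symbol` for a table built with four states against a table with a single state to see how much of the context information a handful of states retains. Dropping the low bits of the context corresponds to one fixed choice of `bin_table`; the point of the sketch is that the table can instead be fitted to the data.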
