Paper Title

FASTA/Q Data Compressors for MapReduce-Hadoop Genomics: Space and Time Savings Made Easy -- Version 1

Authors

Umberto Ferraro Petrillo, Francesco Palini, Giuseppe Cattaneo, Raffaele Giancarlo

Abstract

Motivation: Storage of genomic data is a major cost for the Life Sciences, and it is addressed effectively mostly via specialized data compression methods. Because genomic data are produced in such abundance, Big Data technologies are seen as the future of genomic data storage and processing, with MapReduce-Hadoop as the leader. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop, and deploying them there is not straightforward. Such a State of the Art is problematic.

Results: We provide major advances in two different directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the distributed Hadoop File System (HDFS), with very little knowledge of Hadoop. Practically, we provide evidence that deploying those specialized compressors within Hadoop, which has not been possible so far, results in major cost savings. Specifically, on large plant genomes, and relative to the generic compressors available in Hadoop, we observe 30% fewer HDFS data blocks (one block = 128 MB), a speed-up of at least 1.5x in I/O time, and comparable or reduced network communication time. Finally, we observe that these results also hold for the Apache Spark framework when it is used to process FASTA/Q files stored on the Hadoop File System.
