Title

SOLAR: A Highly Optimized Data Loading Framework for Distributed Training of CNN-based Scientific Surrogates

Authors

Baixi Sun, Xiaodong Yu, Chengming Zhang, Jiannan Tian, Sian Jin, Kamil Iskra, Tao Zhou, Tekin Bicer, Pete Beckman, Dingwen Tao

Abstract

CNN-based surrogates have become prevalent in scientific applications to replace conventional time-consuming physical approaches. Although these surrogates can yield satisfactory results with significantly lower computation costs over small training datasets, our benchmarking results show that data-loading overhead becomes the major performance bottleneck when training surrogates with large datasets. In practice, surrogates are usually trained with high-resolution scientific data, which can easily reach the terabyte scale. Several state-of-the-art data loaders have been proposed to improve loading throughput in general CNN training; however, they are sub-optimal when applied to surrogate training. In this work, we propose SOLAR, a surrogate data loader that can substantially increase loading throughput during training. It leverages three key observations from our benchmarking and contains three novel designs. Specifically, SOLAR first generates a pre-determined shuffled index list and accordingly optimizes the global access order and the buffer eviction scheme to maximize data reuse and the buffer hit rate. It then proposes a tradeoff between lightweight computational imbalance and heavyweight loading-workload imbalance to speed up the overall training. It finally optimizes its data access pattern with HDF5 to achieve better parallel I/O throughput. Our evaluation with three scientific surrogates and 32 GPUs illustrates that SOLAR can achieve up to 24.4X speedup over the PyTorch Data Loader and 3.52X speedup over state-of-the-art data loaders.
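
The abstract compresses three mechanisms into a few sentences; the toy Python sketch below illustrates only the first, pre-determined shuffling with buffer-aware reordering, under our own assumptions. The helper names and the buffer model here are hypothetical and not from the paper: the point is simply that when every epoch's shuffled order is fixed ahead of time, a loader can serve already-buffered samples before evicting them, raising the buffer hit rate while the model still consumes a shuffled stream.

```python
import numpy as np

# Minimal sketch (hypothetical helpers, not the paper's code) of the
# pre-determined shuffling idea described in the abstract.

def precompute_epoch_orders(num_samples, num_epochs, seed=0):
    """Generate every epoch's shuffled index list up front, so the
    full access order is known before training starts."""
    rng = np.random.default_rng(seed)
    return [rng.permutation(num_samples) for _ in range(num_epochs)]

def reorder_for_buffer(epoch_order, buffered):
    """Serve samples already in the buffer first, then the misses in
    their original shuffled order: buffered samples are reused before
    eviction, but the epoch still visits every sample exactly once."""
    hits = [i for i in epoch_order if i in buffered]
    misses = [i for i in epoch_order if i not in buffered]
    return hits + misses

# Toy usage: 8 samples, 2 epochs, a node-local buffer holding {1, 4, 6}.
orders = precompute_epoch_orders(num_samples=8, num_epochs=2)
print(reorder_for_buffer([int(i) for i in orders[0]], buffered={1, 4, 6}))
```

A real loader would additionally coordinate this reordering across distributed ranks and layer it on chunk-aligned HDF5 reads, which is what the abstract's second and third designs address.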
