Paper Title

Stochastic Gradient Descent without Full Data Shuffle

Authors

Lijie Xu, Shuang Qiu, Binhang Yuan, Jiawei Jiang, Cedric Renggli, Shaoduo Gan, Kaan Kara, Guoliang Li, Ji Liu, Wentao Wu, Jieping Ye, Ce Zhang

Abstract

Stochastic gradient descent (SGD) is the cornerstone of modern machine learning (ML) systems. Despite its computational efficiency, SGD requires random data access that is inherently inefficient when implemented in systems that rely on block-addressable secondary storage such as HDD and SSD, e.g., TensorFlow/PyTorch and in-DB ML systems over large files. To address this impedance mismatch, various data shuffling strategies have been proposed to balance the convergence rate of SGD (which favors randomness) and its I/O performance (which favors sequential access). In this paper, we first conduct a systematic empirical study of existing data shuffling strategies, which reveals that all of them have room for improvement -- each suffers in either I/O performance or convergence rate. With this in mind, we propose a simple but novel hierarchical data shuffling strategy, CorgiPile. Compared with existing strategies, CorgiPile avoids a full data shuffle while maintaining a convergence rate comparable to that of SGD with a full shuffle. We provide a non-trivial theoretical analysis of the convergence behavior of CorgiPile. We further integrate CorgiPile into PyTorch by designing new parallel/distributed shuffle operators inside a new CorgiPileDataSet API. We also integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. Our experimental results show that CorgiPile achieves a convergence rate comparable to full-shuffle-based SGD for both deep learning and generalized linear models. For deep learning models on the ImageNet dataset, CorgiPile is 1.5X faster than PyTorch with a full data shuffle. For in-DB ML with linear models, CorgiPile is 1.6X-12.8X faster than two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, on both HDD and SSD.
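To make the hierarchical shuffling idea concrete, below is a minimal sketch of a two-level (block-level plus in-buffer tuple-level) shuffle written as a PyTorch IterableDataset. This is an illustration of the general idea only, not the authors' CorgiPileDataSet implementation; the class name TwoLevelShuffleDataset and the parameters block_size and buffer_blocks are hypothetical.

```python
# Sketch of a hierarchical shuffle: shuffle the order of sequentially-read
# blocks, then shuffle tuples inside a bounded in-memory buffer.
# Illustrative only; not the CorgiPileDataSet API from the paper.
import random
from torch.utils.data import IterableDataset, DataLoader


class TwoLevelShuffleDataset(IterableDataset):
    def __init__(self, samples, block_size=4, buffer_blocks=2, seed=0):
        self.samples = samples              # stands in for tuples on disk
        self.block_size = block_size        # tuples per sequentially-read block
        self.buffer_blocks = buffer_blocks  # blocks held in memory at once
        self.seed = seed

    def __iter__(self):
        rng = random.Random(self.seed)
        n = len(self.samples)
        # Block-level shuffle: randomize which blocks are read, while each
        # block itself is read sequentially (I/O stays mostly sequential).
        blocks = [list(range(i, min(i + self.block_size, n)))
                  for i in range(0, n, self.block_size)]
        rng.shuffle(blocks)

        # Tuple-level shuffle: buffer a few blocks, shuffle the buffered
        # tuples, emit them, then refill the buffer with the next blocks.
        buffer = []
        for block in blocks:
            buffer.extend(block)
            if len(buffer) >= self.buffer_blocks * self.block_size:
                rng.shuffle(buffer)
                for idx in buffer:
                    yield self.samples[idx]
                buffer = []
        rng.shuffle(buffer)
        for idx in buffer:
            yield self.samples[idx]


if __name__ == "__main__":
    data = list(range(16))
    loader = DataLoader(TwoLevelShuffleDataset(data), batch_size=4)
    for batch in loader:
        print(batch)
```

The point of the two levels is the trade-off stated in the abstract: reading whole blocks keeps I/O sequential on HDD/SSD, while shuffling block order and buffered tuples restores enough randomness for SGD to converge at a rate close to that of a full shuffle.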
