Paper Title
Asynchronous Training Schemes in Distributed Learning with Time Delay
Paper Authors
Paper Abstract
In the context of distributed deep learning, the issue of stale weights or gradients could result in poor algorithmic performance. This issue is usually tackled by delay tolerant algorithms with some mild assumptions on the objective functions and step sizes. In this paper, we propose a different approach to develop a new algorithm, called $\textbf{P}$redicting $\textbf{C}$lipping $\textbf{A}$synchronous $\textbf{S}$tochastic $\textbf{G}$radient $\textbf{D}$escent (aka, PC-ASGD). Specifically, PC-ASGD has two steps - the $\textit{predicting step}$ leverages the gradient prediction using Taylor expansion to reduce the staleness of the outdated weights while the $\textit{clipping step}$ selectively drops the outdated weights to alleviate their negative effects. A tradeoff parameter is introduced to balance the effects between these two steps. Theoretically, we present the convergence rate considering the effects of delay of the proposed algorithm with constant step size when the smooth objective functions are weakly strongly-convex and nonconvex. One practical variant of PC-ASGD is also proposed by adopting a condition to help with the determination of the tradeoff parameter. For empirical validation, we demonstrate the performance of the algorithm with two deep neural network architectures on two benchmark datasets.
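Since the abstract only describes the two steps at a high level, below is a minimal, illustrative Python sketch of how a predicting step (Taylor-style gradient correction of stale weights) and a clipping step (dropping stale weights) might be combined through a tradeoff parameter. The function names (`taylor_predict`, `pc_asgd_update`), the single-agent averaging scheme, and the loop structure are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def taylor_predict(w_stale, grad_fn, lr, delay):
    """Predicting step (sketch): roll stale weights forward by `delay`
    first-order gradient steps to approximate their current value."""
    w = w_stale.copy()
    for _ in range(delay):
        w = w - lr * grad_fn(w)  # first-order Taylor-style correction per delayed step
    return w

def pc_asgd_update(w_local, w_stale_neighbors, delays, grad_fn, lr, theta):
    """One PC-ASGD-style update (sketch): mix the predicting step (weight theta)
    with the clipping step (weight 1 - theta), then take a local SGD step."""
    # Predicting step: include stale neighbors after gradient-prediction correction.
    predicted = [taylor_predict(w, grad_fn, lr, d)
                 for w, d in zip(w_stale_neighbors, delays)]
    consensus_pred = np.mean([w_local] + predicted, axis=0)

    # Clipping step: drop stale neighbors entirely and keep only local weights.
    consensus_clip = w_local

    # Tradeoff parameter theta in [0, 1] balances the two steps.
    w_mix = theta * consensus_pred + (1.0 - theta) * consensus_clip
    return w_mix - lr * grad_fn(w_mix)

# Toy usage on a quadratic objective (illustration only).
grad = lambda w: 2.0 * w
w_new = pc_asgd_update(np.array([1.0, -2.0]),
                       [np.array([1.5, -1.0])], delays=[3],
                       grad_fn=grad, lr=0.1, theta=0.5)
```

In this sketch, setting `theta = 1` keeps only the prediction-corrected stale weights, while `theta = 0` discards them, mirroring the tradeoff between the predicting and clipping steps described in the abstract.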