Paper Title
Local SGD With a Communication Overhead Depending Only on the Number of Workers
Paper Authors
Paper Abstract
We consider speeding up stochastic gradient descent (SGD) by parallelizing it across multiple workers. We assume the same data set is shared among $n$ workers, who can take SGD steps and coordinate with a central server. Unfortunately, this could require a lot of communication between the workers and the server, which can dramatically reduce the gains from parallelism. The Local SGD method, proposed and analyzed in the earlier literature, suggests machines should make many local steps between such communications. While the initial analysis of Local SGD showed it needs $\Omega(\sqrt{T})$ communications for $T$ local gradient steps in order for the error to scale proportionately to $1/(nT)$, this has been successively improved in a string of papers, with the state-of-the-art requiring $\Omega\left( n \left( \text{polynomial in } \log(T) \right) \right)$ communications. In this paper, we give a new analysis of Local SGD. A consequence of our analysis is that Local SGD can achieve an error that scales as $1/(nT)$ with only a fixed number of communications independent of $T$: specifically, only $\Omega(n)$ communications are required.
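To make the communication pattern concrete, the following is a minimal sketch of Local SGD with periodic averaging through a server: $n$ workers each take local SGD steps on a shared objective and synchronize only every $H$ steps, for a total number of communication rounds that can be set independently of $T$ (here $R = n$, matching the $\Omega(n)$ scaling in the abstract). The quadratic objective, noise model, step size, and fixed synchronization interval are illustrative assumptions, not the paper's exact setup or analysis conditions.

```python
import numpy as np

# Minimal Local SGD sketch: n workers share the same objective, take local
# SGD steps, and periodically average their iterates through a "server".
# The quadratic objective and noise model below are illustrative assumptions.

rng = np.random.default_rng(0)

n = 8          # number of workers
T = 1000       # local gradient steps per worker
R = n          # number of communication rounds (Omega(n), independent of T)
H = T // R     # local steps between consecutive averagings
d = 10         # problem dimension
eta = 0.05     # step size
x_star = rng.normal(size=d)  # minimizer of the shared objective

def stochastic_grad(x):
    """Gradient of f(x) = 0.5 * ||x - x_star||^2 plus zero-mean noise."""
    return (x - x_star) + 0.1 * rng.normal(size=d)

# All workers start from the same point.
workers = [np.zeros(d) for _ in range(n)]

for step in range(1, T + 1):
    # Each worker takes one local SGD step on its own copy of the iterate.
    workers = [x - eta * stochastic_grad(x) for x in workers]

    # Every H steps the server averages the iterates and broadcasts the result.
    if step % H == 0:
        avg = np.mean(workers, axis=0)
        workers = [avg.copy() for _ in range(n)]

final = np.mean(workers, axis=0)
print("distance to optimum:", np.linalg.norm(final - x_star))
```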