Paper Title
Variance Reduced ProxSkip: Algorithm, Theory and Application to Federated Learning
Paper Authors
Paper Abstract
We study distributed optimization methods based on the {\em local training (LT)} paradigm: achieving communication efficiency by performing richer local gradient-based training on the clients before parameter averaging. Looking back at the progress of the field, we {\em identify 5 generations of LT methods}: 1) heuristic, 2) homogeneous, 3) sublinear, 4) linear, and 5) accelerated. The 5${}^{\rm th}$ generation, initiated by the ProxSkip method of Mishchenko, Malinovsky, Stich and Richtárik (2022) and its analysis, is characterized by the first theoretical confirmation that LT is a communication acceleration mechanism. Inspired by this recent progress, we contribute to the 5${}^{\rm th}$ generation of LT methods by showing that it is possible to enhance them further using {\em variance reduction}. While all previous theoretical results for LT methods ignore the cost of local work altogether, and are framed purely in terms of the number of communication rounds, we show that our methods can be substantially faster in terms of the {\em total training cost} than the state-of-the-art method ProxSkip in theory and practice in the regime when local computation is sufficiently expensive. We characterize this threshold theoretically, and confirm our theoretical predictions with empirical results.
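To make the local training (LT) paradigm concrete, the following is a minimal sketch of one communication round: each client performs several local gradient steps before the server averages the parameters. This is an illustration of the generic LT pattern only, not the ProxSkip or variance-reduced methods of the paper; the quadratic client objectives, client count, step size, and number of local steps are assumptions chosen for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative setup: client i holds a quadratic objective
    #   f_i(x) = 0.5 * ||A_i x - b_i||^2, with gradient A_i^T (A_i x - b_i).
    num_clients, dim = 10, 5
    A = [rng.normal(size=(20, dim)) for _ in range(num_clients)]
    b = [rng.normal(size=20) for _ in range(num_clients)]

    def local_grad(i, x):
        return A[i].T @ (A[i] @ x - b[i])

    def local_training_round(x_server, local_steps=10, lr=1e-2):
        """One communication round of the LT paradigm: every client runs
        several local gradient steps, then the server averages the models."""
        client_models = []
        for i in range(num_clients):
            x = x_server.copy()
            for _ in range(local_steps):       # richer local work between rounds
                x -= lr * local_grad(i, x)
            client_models.append(x)
        return np.mean(client_models, axis=0)  # parameter averaging

    x = np.zeros(dim)
    for _ in range(50):                        # 50 communication rounds
        x = local_training_round(x)

The point of the paradigm, as the abstract states, is that the per-round local work is relatively cheap compared to communication, so doing more of it between averaging steps can reduce the number of communication rounds needed.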