Paper Title
FedPD: A Federated Learning Framework with Optimal Rates and Adaptivity to Non-IID Data
Paper Authors
Paper Abstract
Federated Learning (FL) has become a popular paradigm for learning from distributed data. To effectively utilize data at different devices without moving them to the cloud, algorithms such as Federated Averaging (FedAvg) adopt a "computation then aggregation" (CTA) model, in which multiple local updates are performed using local data before the local models are sent to the cloud for aggregation. However, these schemes typically require strong assumptions, such as that the local data are independent and identically distributed (i.i.d.) or that the local gradients are bounded. In this paper, we first explicitly characterize the behavior of the FedAvg algorithm and show that, without strong and unrealistic assumptions on the problem structure, the algorithm can behave erratically on non-convex problems (e.g., diverge to infinity). Aiming to design FL algorithms that are provably fast and require as few assumptions as possible, we propose a new algorithm design strategy from the primal-dual optimization perspective. Our strategy yields a family of algorithms that use the same CTA model as existing algorithms, but can handle non-convex objectives and achieve the best possible optimization and communication complexity, while supporting both full-batch and mini-batch local computation models. Most importantly, the proposed algorithms are communication efficient, in the sense that the communication pattern can adapt to the level of heterogeneity among the local data. To the best of our knowledge, this is the first algorithmic framework for FL that achieves all of the above properties.
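
To make the CTA pattern described in the abstract concrete, below is a minimal sketch of a FedAvg-style round: each device runs several local gradient steps on its own data, and the server then averages the resulting local models. All names (local_update, fedavg_round, num_local_steps, step_size) and the least-squares loss are illustrative assumptions, not the paper's implementation.

# Minimal sketch of the "computation then aggregation" (CTA) pattern in FedAvg style.
import numpy as np

def local_update(model, data, targets, num_local_steps=5, step_size=0.1):
    """Run several local gradient steps on one device's data (least-squares loss)."""
    w = model.copy()
    for _ in range(num_local_steps):
        grad = data.T @ (data @ w - targets) / len(targets)
        w -= step_size * grad
    return w

def fedavg_round(global_model, local_datasets):
    """One CTA round: local computation on every device, then server-side averaging."""
    local_models = [local_update(global_model, X, y) for X, y in local_datasets]
    return np.mean(local_models, axis=0)

# Usage: three devices with heterogeneous (non-i.i.d.) synthetic data.
rng = np.random.default_rng(0)
devices = [(rng.normal(i, 1.0, size=(20, 3)), rng.normal(i, 1.0, size=20)) for i in range(3)]
w = np.zeros(3)
for _ in range(10):
    w = fedavg_round(w, devices)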
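
The primal-dual design strategy mentioned in the abstract is easiest to picture on a consensus reformulation of the FL problem. The following is a generic sketch (an ADMM-style augmented Lagrangian is assumed here for illustration; the paper's exact formulation may differ): each device $i$ keeps a local copy $x_i$ of the global model $x_0$, and the constraint $x_i = x_0$ is handled through dual variables $\lambda_i$.

\min_{x_0, x_1, \dots, x_N} \; \sum_{i=1}^{N} f_i(x_i) \quad \text{s.t.} \quad x_i = x_0, \; i = 1, \dots, N,

\mathcal{L}\big(\{x_i\}, x_0, \{\lambda_i\}\big) = \sum_{i=1}^{N} \Big( f_i(x_i) + \langle \lambda_i, \, x_i - x_0 \rangle + \frac{1}{2\eta} \, \| x_i - x_0 \|^2 \Big),

where $\eta > 0$ is a penalty parameter. Local computation then amounts to (approximately) minimizing $\mathcal{L}$ over $x_i$ and updating $\lambda_i$ using only local data, while a communication round updates the global variable $x_0$; how often $x_0$ must be synchronized is what can be made adaptive to the heterogeneity of the local data.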