Paper Title

Rethinking Data Heterogeneity in Federated Learning: Introducing a New Notion and Standard Benchmarks

Authors

Mahdi Morafah, Saeed Vahidian, Chen Chen, Mubarak Shah, Bill Lin

Abstract

Though successful, federated learning presents new challenges for machine learning, especially when the issue of data heterogeneity, also known as Non-IID data, arises. To cope with statistical heterogeneity, previous works have incorporated a proximal term into local optimization, modified the model aggregation scheme on the server side, or advocated clustered federated learning approaches in which the central server groups the agent population into clusters with jointly trainable data distributions to take advantage of a certain level of personalization. While effective, these approaches lack a deep elaboration of what kind of data heterogeneity matters and how data heterogeneity impacts the accuracy of the participating clients. In contrast to many prior federated learning approaches, we demonstrate that the data heterogeneity issue in current setups is not necessarily a problem and, in fact, can be beneficial for the FL participants. Our observations are intuitive: (1) dissimilar labels across clients (label skew) do not necessarily constitute data heterogeneity, and (2) the principal angles between the agents' data subspaces, spanned by the principal vectors of their data, are a better estimate of data heterogeneity. Our code is available at https://github.com/MMorafah/FL-SC-NIID.
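
The notion of principal angles between clients' data subspaces can be made concrete with a short numerical sketch. The following is a minimal illustration, not the authors' implementation: it assumes SciPy's subspace_angles, uses synthetic client data with hypothetical shapes, extracts each client's top-k principal directions via SVD, and reports the principal angles between the resulting subspaces (larger angles suggesting more heterogeneous data).

```python
# Minimal sketch (not the paper's code): estimating data heterogeneity between two
# clients as the principal angles between their top-k data subspaces.
import numpy as np
from scipy.linalg import subspace_angles  # assumes SciPy is available

def client_subspace(data: np.ndarray, k: int = 5) -> np.ndarray:
    """Orthonormal basis for the top-k principal directions of a client's
    (n_samples x n_features) data matrix."""
    centered = data - data.mean(axis=0)
    # Right singular vectors of the centered data are its principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T  # (n_features, k) with orthonormal columns

# Hypothetical clients: 200 samples of 32-dimensional features each.
rng = np.random.default_rng(0)
client_a = rng.normal(size=(200, 32))
client_b = 0.1 * client_a @ rng.normal(size=(32, 32)) + rng.normal(size=(200, 32))

# Principal angles (radians, in descending order) between the two 5-dimensional
# subspaces; larger angles indicate more dissimilar client data distributions.
angles = subspace_angles(client_subspace(client_a), client_subspace(client_b))
print(np.degrees(angles))
```

In a clustered federated learning setting of the kind the abstract describes, such pairwise angles could serve as a similarity signal for grouping clients with jointly trainable data distributions.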
