Paper Title


LSQ: Load Balancing in Large-Scale Heterogeneous Systems with Multiple Dispatchers

Paper Authors

Shay Vargaftik, Isaac Keslassy, Ariel Orda

Paper Abstract


Nowadays, the efficiency and even the feasibility of traditional load-balancing policies are challenged by the rapid growth of cloud infrastructure and the increasing levels of server heterogeneity. In such heterogeneous systems with many load-balancers, traditional solutions, such as JSQ, incur a prohibitively large communication overhead and detrimental incast effects due to herd behavior. Alternative low-communication policies, such as JSQ(d) and the recently proposed JIQ, are either unstable or provide poor performance. We introduce the Local Shortest Queue (LSQ) family of load balancing algorithms. In these algorithms, each dispatcher maintains its own, local, and possibly outdated view of the server queue lengths, and keeps using JSQ on its local view. A small communication overhead is used infrequently to update this local view. We formally prove that as long as the error in these local estimates of the server queue lengths is bounded in expectation, the entire system is strongly stable. Finally, in simulations, we show how simple and stable LSQ policies exhibit appealing performance and significantly outperform existing low-communication policies, while using an equivalent communication budget. In particular, our simple policies often outperform even JSQ due to their reduction of herd behavior. We further show how, by relying on smart servers (i.e., advanced pull-based communication), we can further improve performance and lower communication overhead.
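
The abstract describes LSQ as each dispatcher running JSQ over its own, possibly stale, local view of the server queue lengths, with only infrequent communication to refresh that view. The following is a minimal sketch of that idea in a simple discrete-time model, not the paper's actual algorithm or evaluation setup; the class and parameter names (`LSQDispatcher`, `Server`, `sample_period`) are illustrative assumptions.

```python
import random


class Server:
    """Toy FIFO server: one job enqueued per dispatch, one served per time step."""

    def __init__(self):
        self._queue = 0

    def enqueue(self):
        self._queue += 1

    def serve(self):
        if self._queue:
            self._queue -= 1

    def queue_length(self):
        return self._queue


class LSQDispatcher:
    """Illustrative dispatcher keeping a local, possibly outdated view of queue lengths."""

    def __init__(self, num_servers, sample_period=10):
        # Local (possibly stale) estimate of each server's queue length.
        self.local_view = [0] * num_servers
        self.sample_period = sample_period  # how infrequently we refresh one estimate
        self.jobs_sent = 0

    def dispatch(self, servers):
        # JSQ on the local view: send the job to the server that looks shortest locally.
        target = min(range(len(self.local_view)), key=lambda i: self.local_view[i])
        servers[target].enqueue()
        # Grow the local estimate so consecutive jobs from this dispatcher spread out,
        # which is what reduces the herd behavior mentioned in the abstract.
        self.local_view[target] += 1
        self.jobs_sent += 1
        # Small, infrequent communication: refresh the estimate of one random server.
        if self.jobs_sent % self.sample_period == 0:
            i = random.randrange(len(servers))
            self.local_view[i] = servers[i].queue_length()
        return target


# Toy usage: several dispatchers share the same servers but never coordinate directly.
servers = [Server() for _ in range(8)]
dispatchers = [LSQDispatcher(len(servers)) for _ in range(4)]
for _ in range(1000):
    for d in dispatchers:
        d.dispatch(servers)
    for s in servers:
        s.serve()
```

In this sketch each dispatcher contacts at most one server per `sample_period` jobs, which is the low, tunable communication budget the abstract contrasts with full JSQ; the paper's stability result concerns the expected error of these local estimates, which the sketch does not attempt to reproduce.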
