Interference and Need Aware Workload Colocation in Hyperscale Datacenters

Paper Title

Interference and Need Aware Workload Colocation in Hyperscale Datacenters

Authors

Sayak Chakraborti, Brian Coutinho, Sandhya Dwarkadas, Parth Malani, Bikash Sharma

Abstract

Datacenters suffer from resource utilization inefficiencies due to the conflicting goals of service owners and platform providers. Service owners intending to maintain Service Level Objectives (SLO) for themselves typically request a conservative amount of resources. Platform providers want to increase operational efficiency to reduce capital and operating costs. Achieving both operational efficiency and SLO for individual services at the same time is challenging due to the diversity in service workload characteristics, resource usage patterns that are dependent on input load, heterogeneity in platform, memory, I/O, and network architecture, and resource bundling. This paper presents a tunable approach to resource allocation that accounts for both dynamic service resource needs and platform heterogeneity. In addition, an online K-Means-based service classification method is used in conjunction with an offline sensitivity component. Our tunable approach allows trading resource utilization efficiency for absolute SLO guarantees based on each service owner's sensitivity to its SLO. We evaluate our tunable resource allocator at scale in a private cloud environment with mostly latency-critical workloads. When tuning for operational efficiency, we demonstrate up to ~50% reduction in required machines; ~40% reduction in Total-Cost-of-Ownership (TCO); and ~60% reduction in CPU and memory fragmentation, but at the cost of increasing the number of tasks experiencing degradation of SLO by up to ~25% compared to the baseline. When tuning for SLO, by introducing interference-aware colocation, we can tune the solver to reduce tasks experiencing degradation of SLO by up to ~22% compared to the baseline, but at an additional cost of ~30% in terms of the number of hosts. We highlight this trade-off between TCO and SLO violations, and offer tuning based on the requirements of the platform owners.
