Paper Title

DeepTriage: Automated Transfer Assistance for Incidents in Cloud Services

Authors

Phuong Pham, Vivek Jain, Lukas Dauterman, Justin Ormont, Navendu Jain

Abstract

As cloud services are growing and generating high revenues, the cost of downtime in these services is becoming significantly expensive. To reduce loss and service downtime, a critical primary step is to execute incident triage, the process of assigning a service incident to the correct responsible team, in a timely manner. An incorrect assignment risks additional incident reroutings and increases its time to mitigate by 10x. However, automated incident triage in large cloud services faces many challenges: (1) a highly imbalanced incident distribution from a large number of teams, (2) wide variety in formats of input data or data sources, (3) scaling to meet production-grade requirements, and (4) gaining engineers' trust in using machine learning recommendations. To address these challenges, we introduce DeepTriage, an intelligent incident transfer service combining multiple machine learning techniques - gradient boosted classifiers, clustering methods, and deep neural networks - in an ensemble to recommend the responsible team to triage an incident. Experimental results on real incidents in Microsoft Azure show that our service achieves 82.9% F1 score. For highly impacted incidents, DeepTriage achieves F1 score from 76.3% - 91.3%. We have applied best practices and state-of-the-art frameworks to scale DeepTriage to handle incident routing for all cloud services. DeepTriage has been deployed in Azure since October 2017 and is used by thousands of teams daily.
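The abstract describes combining several models in an ensemble to recommend a responsible team. The following is a minimal sketch of one common way to do that (weighted soft voting over per-team probabilities), not the authors' implementation. The function name `ensemble_recommend`, the team names, the probabilities, the equal weights, and the `min_confidence` cutoff are all hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Optional

# Maps a team name to the probability a model assigns to that team.
Scores = Dict[str, float]


def ensemble_recommend(
    model_scores: List[Scores],
    weights: List[float],
    min_confidence: float = 0.5,
) -> Optional[str]:
    """Weighted soft voting: average per-team probabilities across models.

    Returns the top-scoring team, or None when the combined score is below
    min_confidence (a hypothetical cutoff for withholding weak suggestions).
    """
    if not model_scores or len(model_scores) != len(weights):
        raise ValueError("need one weight per model and at least one model")

    total_weight = sum(weights)
    combined: Scores = {}
    for scores, weight in zip(model_scores, weights):
        for team, prob in scores.items():
            combined[team] = combined.get(team, 0.0) + weight * prob / total_weight

    best_team = max(combined, key=lambda team: combined[team])
    return best_team if combined[best_team] >= min_confidence else None


# Made-up outputs standing in for three models (illustrative values only).
gbt_scores = {"Storage": 0.7, "Networking": 0.2, "Compute": 0.1}
cluster_scores = {"Storage": 0.5, "Networking": 0.4, "Compute": 0.1}
dnn_scores = {"Storage": 0.6, "Networking": 0.3, "Compute": 0.1}

recommended = ensemble_recommend(
    [gbt_scores, cluster_scores, dnn_scores], weights=[1.0, 1.0, 1.0]
)
```

With these made-up inputs, the averaged score for "Storage" is highest, so it is the recommended team. Raising `min_confidence` above that averaged score makes the function return `None` instead, which is one simple way a service could avoid surfacing low-confidence suggestions.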
