Paper Title
Implicit Actions and Non-blocking Failure Recovery with MPI
Paper Authors
Abstract
Scientific applications have long embraced MPI as the environment of choice for execution on large distributed systems. The User-Level Failure Mitigation (ULFM) specification extends the MPI standard to address resilience and to enable MPI applications to restore their communication capability after a failure. This work builds upon the wide body of experience gained in the field to close the gap between current practice and the ideal, more asynchronous recovery model, in which the fault tolerance activities of multiple components can be carried out simultaneously and overlap. This work proposes to: (1) provide the required consistency in fault reporting to applications (i.e., enable an application to assess the success of a computational phase without incurring an unacceptable performance hit); (2) bring forward the building blocks that permit the effective scoping of fault recovery in an application, so that independent components in an application can recover without interfering with each other, and separate groups of processes in the application can recover independently or in unison; and (3) overlap the recovery activities necessary to restore the consistency of the system (e.g., eviction of faulty processes from the communication group) with application recovery activities (e.g., dataset restoration from checkpoints).