Proactive Fault Tolerance for HPC with Xen Virtualization
Arun Babu Nagarajan
Abstract
Large-scale parallel computing is relying increasingly on clusters with
thousands of processors. At such large counts of compute nodes, faults are
becoming commonplace. Current techniques to tolerate faults focus on
reactive schemes to recover from faults and generally rely on a
checkpoint/restart mechanism. Yet, in today's systems, node
failures can often be anticipated by detecting a deteriorating health
status. Instead of a reactive scheme for fault tolerance (FT), we
promote a proactive one in which processes automatically migrate from
'unhealthy' nodes to healthy ones. Our approach relies on operating system
virtualization techniques exemplified by Xen. This paper contributes an
automatic and transparent mechanism for proactive FT for arbitrary
MPI applications. It leverages virtualization techniques combined with
health monitoring and load-based migration. We exploit Xen's live
migration mechanism for a guest operating system (OS) to migrate an MPI
task from a health-deteriorating node to a healthy one without stopping
the MPI task
during most of the migration. Our proactive FT daemon orchestrates the
tasks of health monitoring, load determination, and initiation of guest OS
migration.
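
The daemon's decision step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `NodeStatus` record, the `pick_target` and `plan_migrations` helpers, and the single scalar load metric are all hypothetical names and simplifications introduced here. The sketch only plans migrations; it builds an `xm migrate --live` command line for Xen's xm toolstack per guest and does not execute it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class NodeStatus:
    """Hypothetical per-node record produced by health and load monitoring."""
    healthy: bool  # assumed outcome of health monitoring for this node
    load: float    # assumed scalar load metric; lower means less loaded


def pick_target(nodes: Dict[str, NodeStatus], exclude: str) -> Optional[str]:
    """Return the least-loaded healthy node other than `exclude`, or None."""
    candidates = [
        (status.load, name)
        for name, status in nodes.items()
        if status.healthy and name != exclude
    ]
    return min(candidates)[1] if candidates else None


def plan_migrations(
    nodes: Dict[str, NodeStatus], guests: Dict[str, str]
) -> List[List[str]]:
    """Plan a live migration for every guest hosted on an unhealthy node.

    `guests` maps a guest domain name to the name of its current host node.
    Each plan is an argument list for Xen's xm tool; nothing is executed here.
    """
    plans = []
    for guest, host in sorted(guests.items()):
        if nodes[host].healthy:
            continue  # host is fine, leave the guest where it is
        target = pick_target(nodes, host)
        if target is not None:
            plans.append(["xm", "migrate", "--live", guest, target])
    return plans


if __name__ == "__main__":
    nodes = {
        "n1": NodeStatus(healthy=False, load=0.2),
        "n2": NodeStatus(healthy=True, load=0.9),
        "n3": NodeStatus(healthy=True, load=0.1),
    }
    guests = {"vm1": "n1", "vm2": "n2"}
    for command in plan_migrations(nodes, guests):
        print(" ".join(command))
```

The sketch does not update the target's load after planning a migration, so several guests on one failing node would all be sent to the same target; a real daemon would presumably re-evaluate load between migrations.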