Time: April 20, 11:00am
Location: 3211, EB II
Speaker: Arun Babu

Title: Proactive Fault Tolerance for HPC with Xen Virtualization

Abstract: Large-scale parallel computing increasingly relies on clusters with thousands of processors. At such large node counts, faults are becoming commonplace. Current fault tolerance techniques focus on reactive schemes that recover from faults after they occur. We have proposed and implemented a proactive fault-tolerant system, in which fault tolerance is provided by migrating an entire guest OS from 'unhealthy' nodes to spare nodes. Xen virtualization provides the migration facility. Experiments show that the associated overhead is quite low.