Project Title:
Enhanced Proactive Fault Tolerance System for HPC with Xen Virtualization
Background:
The earlier work “Proactive Fault Tolerance for HPC with Xen
Virtualization” aims to provide a fault tolerance solution for HPC
clusters by automatically migrating the OS image from an
‘unhealthy’ node to a ‘healthy’ one. The health of a node is
determined by parameters such as CPU temperature and fan speed.
A daemon, the Proactive Fault Tolerance daemon (PFTd), runs on each
node and continuously monitors its health by reading the hardware
sensors through OpenIPMI. On detecting deteriorating health, PFTd
migrates the whole OS image to a spare node, where execution
continues. Xen virtualization is used to migrate the images from one
node to another in a ‘live’ fashion with little overhead.
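The monitoring decision described above can be sketched roughly as follows. This is an illustration only, not PFTd's actual code: the sensor names, the threshold values, and the functions unhealthy_sensors, migration_command and decide are all hypothetical, and the sensor readings are passed in as a plain dictionary rather than read through OpenIPMI. The only real interface assumed is Xen's `xm migrate --live <domain> <host>` command, which the sketch builds but does not run.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical limits: the source names CPU temperature and fan speed as
# health parameters but gives no concrete thresholds.
THRESHOLDS: Dict[str, Tuple[str, float]] = {
    "cpu_temp_c": ("max", 70.0),   # unhealthy when above this value
    "fan_rpm": ("min", 2000.0),    # unhealthy when below this value
}


def unhealthy_sensors(readings: Dict[str, float],
                      thresholds: Dict[str, Tuple[str, float]] = THRESHOLDS
                      ) -> List[str]:
    """Return the names of sensors whose readings violate their limits."""
    bad = []
    for name, (kind, limit) in thresholds.items():
        value = readings.get(name)
        if value is None:
            continue  # sensor not present on this node; skip it
        if kind == "max" and value > limit:
            bad.append(name)
        elif kind == "min" and value < limit:
            bad.append(name)
    return bad


def migration_command(domain: str, spare_host: str) -> List[str]:
    """Build (but do not execute) a Xen live-migration command line."""
    return ["xm", "migrate", "--live", domain, spare_host]


def decide(readings: Dict[str, float], domain: str,
           spare_host: str) -> Optional[List[str]]:
    """Return a migration command if the node looks unhealthy, else None."""
    if unhealthy_sensors(readings):
        return migration_command(domain, spare_host)
    return None
```

A daemon built along these lines would call decide() periodically with fresh readings and hand any returned command to the operating system; the polling interval and the choice of spare node are left open here because the source does not specify them.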
Problem Statement:
- Improve the currently implemented FT system by adding the ability to
proactively pre-deploy parts of the OS image to the spare node,
thereby reducing the cost of VM migration.
- Instrument the Xen tools to gather more details, such as the number of
pages sent in each iteration and the page dirtying rate, and study how
sensitive the benchmarks are to the exact time at which migration is
initiated.
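To make the quantities in the second item concrete, the following is a simplified model of iterative pre-copy migration, showing how the pages sent in each iteration depend on the dirtying rate. It is a sketch under stated assumptions, not Xen's migration code: the function precopy_rounds, its parameters, and the stopping rule are hypothetical, the dirtying rate and bandwidth are taken to be constant, and every dirtied page is assumed to be distinct.

```python
from typing import List, Tuple


def precopy_rounds(total_pages: int, dirty_rate: int, bandwidth: int,
                   stop_threshold: int = 50,
                   max_rounds: int = 30) -> Tuple[List[int], int]:
    """Model iterative pre-copy migration.

    total_pages    -- pages in the guest's memory image
    dirty_rate     -- pages dirtied per second (assumed constant)
    bandwidth      -- pages transferred per second (assumed constant)
    stop_threshold -- switch to stop-and-copy once this few pages remain
    max_rounds     -- upper bound on pre-copy iterations

    Returns (pages sent in each pre-copy round,
             pages left for the final stop-and-copy phase).
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")

    sent_per_round: List[int] = []
    to_send = total_pages  # the first round sends the whole image
    for _ in range(max_rounds):
        sent_per_round.append(to_send)
        # While this round was in flight (to_send / bandwidth seconds),
        # the guest dirtied dirty_rate pages per second. Integer
        # arithmetic keeps the model exact for integer inputs.
        dirtied = (to_send * dirty_rate) // bandwidth
        to_send = min(total_pages, dirtied)
        if to_send <= stop_threshold:
            break  # few enough pages remain to pause the guest briefly
    return sent_per_round, to_send
```

For example, precopy_rounds(1000, 100, 1000, stop_threshold=5) gives rounds of 1000, 100 and 10 pages with 1 page left for the final phase, whereas a dirtying rate at or above the bandwidth never converges and runs until max_rounds. Measured per-iteration counts from an instrumented Xen could be compared against a model of this kind.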
Download:
Proposal: Report.pdf (added Oct 26 '06)
Update 1: Report2.pdf (added Nov 07 '06)
Final report: final.pdf (added Oct 30 '06)