Problem Description
The currently implemented FT system requires the health monitoring
system to predict a node failure 13 to 40 seconds in advance
(depending on the application) so that the VM can be safely migrated to
the destination. Migration proceeds iteratively: the initial iteration
sends the VM's pages to the target, the next iteration sends the pages
that have been dirtied since the previous send, and so on. With the NAS
Parallel Benchmarks, we observed that a large share of the pages (more
than 90%) are sent during the initial iteration, while the remaining
pages are sent repeatedly during the following iterations, depending on
the working set at the time the migration command was issued.
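The iterative scheme described above can be illustrated with a minimal
sketch. It is a toy model, not the FT system's implementation: it
assumes that the initial round sends every page, that every page written
during a round is resent in the next round, and that a final
stop-and-copy round flushes whatever is still dirty. The function name
simulate_precopy and its parameters are hypothetical.

```python
def simulate_precopy(total_pages, writes_per_round):
    """Toy model of iterative pre-copy migration (assumed, simplified).

    total_pages      -- number of pages in the VM image
    writes_per_round -- writes_per_round[i] is the set of page indices
                        (each in range(total_pages)) written by the guest
                        while round i is in flight

    Returns (sent_per_round, sent_once): the number of pages sent in each
    round, and the number of pages that were sent exactly once.
    """
    to_send = set(range(total_pages))   # round 0: every page is sent
    send_count = [0] * total_pages      # how often each page was sent
    sent_per_round = []

    for writes in writes_per_round:
        sent_per_round.append(len(to_send))
        for page in to_send:
            send_count[page] += 1
        # Pages dirtied during this round must be resent in the next one.
        to_send = set(writes)

    # Final stop-and-copy round: send whatever is still dirty.
    sent_per_round.append(len(to_send))
    for page in to_send:
        send_count[page] += 1

    sent_once = sum(1 for count in send_count if count == 1)
    return sent_per_round, sent_once


# Hypothetical workload: 1000 pages, a 50-page working set rewritten
# during each of three rounds.
rounds, once = simulate_precopy(1000, [set(range(50))] * 3)
```

In this hypothetical workload, the pages outside the working set cross
the network once, in the initial round, while the working-set pages are
resent in every later round. The page counts here are invented for
illustration and are not measurements from the NAS benchmarks.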
We would like to exploit this behavior by sending part of the VM image
to the spare node ahead of time, which could significantly reduce the
transfer cost once migration is triggered.