Changes in Data Structures

Timing information clearly had to be stored somewhere. It had to be carried across MPI calls for intranode compression, and across MPI task nodes for internode compression. We added an array to the replay_op data structure to hold the time fields.

The replay_op structure seemed the most appropriate place for the timing information, although rsd_node would also have been a reasonable choice. The following points were considered when making this decision:

We had to start with timing information for each individual call, and then allow the time fields to be aggregated during compression.
The compress_node function received only the replay_op when performing intranode compression.
compress_node does not necessarily compress two nodes immediately; it may compress them later, when a subsequent MPI call is trapped and compression is invoked.
Internode compression, which is called at the end, passes only a flattened version of the rsd_queue structure to its parent.
When output is written to the global_dump, only the rsd_queue is used.

Taken together, these points made replay_op the clear choice.

The changes to opstruct.h are shown below. They were marked in the source as explained earlier.

	enum
	{
	    MIN_TIME = 0,
	    AVG_TIME = 1,
	    MAX_TIME = 2,
	    TIME_FIELDS = 3
	};	
	...
	...
	typedef struct
	{
	    ...
	    long int time[TIME_FIELDS];
	    ...
	} replay_op;

During compression, the minimum and maximum times are trivial to compute; the average, however, poses a problem. The problem does not arise when both nodes being compressed are pure (not part of an rsd compression sequence). It does arise when either or both of them are already compressed nodes. When compressing nodes a and b, node a may be the result of compressing two MPI calls while b is a pure node. The combined average is then not simply the sum of their two averages divided by two. To address this, we needed to be able to determine, at compression time, how many MPI calls a single node represents. After exploring the code as far as was practical, we could find no direct or simple mechanism for retrieving this information. The data structures probably store it indirectly, or there may be an indirect way of retrieving it (for example, by traversing the stack).

Since this information need not appear in the final trace file, adding it as a field costs only some extra runtime space and a small communication overhead. The field was added to the rsd_node structure. This is the most appropriate place for it, because an rsd_node represents the aggregation of multiple MPI calls. The new field keeps track of the number of nodes the rsd_node has aggregated.

The following field was added to rsd_node in rsd_queue.h.

	typedef struct rsd_node_t
	{
	    ...
	    int numAggregated;
	    ...
	} rsd_node;

Blazing Demon 2006-11-30