SPE programs use DMA transfers to move data and instructions between main storage and the local store (LS) in the SPE.
This method wastes a great deal of time waiting for DMA transfers to complete. We can speed up the process significantly by allocating two buffers, B0 and B1 , and overlapping computation on one buffer with data transfer in the other. This technique is called double buffering. Figure 1 shows a flow diagram for this double buffering scheme.
Double buffering is a form of multibuffering, which is the method of using multiple buffers in a circular queue to overlap processing and data transfer.
/* Example C code demonstrating double buffering using
* buffers B[0] and B[1]. In this example, an array of data
* starting at the effective address eahi|ealow is DMAed
* into the SPU's local store in 4-KB chunks and processed
* by the use_data subroutine.
*/
#include <spu_intrinsics.h>
#include "spu_mfcio.h"
#define BUFFER_SIZE 4096
volatile unsigned char B[2][BUFFER_SIZE] __attribute__ ((aligned(128)));
void double_buffer_example(unsigned int eahi, unsigned int ealow, int buffers)
{
int next_idx, buf_idx = 0;
// Initiate DMA transfer
spu_mfcdma64(B[buf_idx], eahi, ealow, BUFFER_SIZE, buf_idx, MFC_GET_CMD);
ealow += BUFFER_SIZE;
while (--buffers) {
next_idx = buf_idx ^ 1;
// Initiate next DMA transfer
spu_mfcdma64(B[next_idx], eahi, ealow, BUFFER_SIZE, next_idx, MFC_GET_CMD);
ealow += BUFFER_SIZE;
// Wait for previous transfer to complete
spu_writech(MFC_WrTagMask, 1 << buf_idx);
(void)spu_mfcstat(MFC_TAG_UPDATE_ALL);
// Use the data from the previous transfer
use_data(B[buf_idx]);
buf_idx = next_idx;
}
// Wait for last transfer to complete
spu_writech(MFC_WrTagMask, 1 << buf_idx);
(void)spu_mfcstat(MFC_TAG_UPDATE_ALL);
// Use the data from the last transfer
use_data(B[buf_idx]);
}
To use double buffering effectively, follow these rules for DMA transfers on the SPE:
The purpose of double buffering is to maximize the time spent in the compute phase of a program and minimize the time spent waiting for DMA transfers to complete. Let τt represent the time required to transfer a buffer B, and let τc represent the time required to compute on data contained in that buffer. In general, the higher the ratio τt/τc, the more performance benefit an application will realize from a double-buffering scheme.