Matrix add - host data partitioning example

In this example, two large matrices are added together using ALF. The problem can be expressed simply as:
A[m,n] + B[m,n] = C[m,n] 
where m and n are the dimensions of the matrices.
This simple example demonstrates the basic structure of an ALF application. You can also use this sample as a template to build a more complicated application.
In this example, the host application partitions the data and manages the work blocks, and the accelerator application includes a simple computational kernel that computes the addition of the two matrices.
The scalar code to add two matrices for a uni-processor machine is provided below:
float mat_a[NUM_ROW][NUM_COL];
float mat_b[NUM_ROW][NUM_COL];
float mat_c[NUM_ROW][NUM_COL];
int main(void)
{
  int i,j;
  for (i=0; i<NUM_ROW; i++)
     for (j=0; j<NUM_COL; j++)
        mat_c[i][j] = mat_a[i][j] + mat_b[i][j];
  return 0;
}
An ALF host program can be logically divided into several sections: initialization, task setup, work block setup, and task wait and exit.

Source code

The following code listings show only the relevant sections of the code. For a complete listing, refer to the ALF samples directory:
matrix_add/STEP1a_partition_scheme_A/common/host_partition

Initialization

The following code segment shows how the ALF runtime is initialized and how accelerator instances are allocated.
alf_handle_t alf_handle;
unsigned int nodes;

/* initialize the runtime environment for ALF */
alf_init(&config_parms, &alf_handle);

/* get the number of SPE accelerators available to the host */
rc = alf_query_system_info(alf_handle, ALF_QUERY_NUM_ACCEL, ALF_ACCEL_TYPE_SPE, &nodes);

/* set the total number of accelerator instances (in this case, SPEs) */
/* that the ALF runtime will have during its lifetime */
rc = alf_num_instances_set(alf_handle, nodes);

Task setup

The next section of an ALF host program describes the task and creates it at run time. The alf_task_desc_create function creates a task descriptor. This descriptor can be reused to create several different executable tasks. The function alf_task_create creates a task to run an SPE program with the name spe_add_program.
/* variable declarations */
alf_task_desc_handle_t task_desc_handle;
alf_task_handle_t task_handle;
const char* spe_image_name;
const char* library_path_name;
const char* comp_kernel_name;

/* describing a task that is executable on the SPE */
alf_task_desc_create(alf_handle, ALF_ACCEL_TYPE_SPE, &task_desc_handle);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_TSK_CTX_SIZE, 0);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_PARM_CTX_BUF_SIZE, sizeof(add_parms_t));
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_IN_BUF_SIZE, H * V * 2 * sizeof(float));
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_OUT_BUF_SIZE, H * V * sizeof(float));
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_NUM_DTL_ENTRIES, 8);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_MAX_STACK_SIZE, 4096);

/* providing the SPE executable name */
alf_task_desc_set_int64(task_desc_handle, ALF_TASK_DESC_ACCEL_IMAGE_REF_L, (unsigned long long) spe_image_name);
alf_task_desc_set_int64(task_desc_handle, ALF_TASK_DESC_ACCEL_LIBRARY_REF_L, (unsigned long long) library_path_name);
alf_task_desc_set_int64(task_desc_handle, ALF_TASK_DESC_ACCEL_KERNEL_REF_L, (unsigned long long) comp_kernel_name);

Work block setup

This section shows how work blocks are created. After the program has created the work block, it describes the input and output associated with each work block. Each work block contains the input description for blocks in the input matrices of size H * V starting at location matrix[row][0] with H and V representing the horizontal and vertical dimensions of the block.

In this example, assume that the accelerator memory can contain the two input buffers of size H * V elements and the output buffer of size H * V. The program calls alf_wb_enqueue() to add the work block to the queue to be processed. ALF employs an immediate runtime mode. As soon as the first work block is added to the queue, the task starts processing the work block. The function alf_task_finalize closes the work block queue.
alf_wb_handle_t wb_handle;
add_parms_t parm __attribute__((aligned(128)));
parm.h = H; /* horizontal size of the block */
parm.v = V; /* vertical size of the block */

/* creating work blocks and adding param & io buffer */
for (i = 0; i < NUM_ROW; i += H) {
     alf_wb_create(task_handle, ALF_WB_SINGLE, 0, &wb_handle);
     
     /* begins a new Data Transfer List for INPUT */ 
     alf_wb_dtl_set_begin(wb_handle, ALF_BUF_IN, 0);
     
     /* Add H*V element of mat_a as Input */
     alf_wb_dtl_set_entry_add(wb_handle, &matrix_a[i][0], H * V, ALF_DATA_FLOAT);
     
     /* Add H*V element of mat_b as Input */
     alf_wb_dtl_set_entry_add(wb_handle, &matrix_b[i][0], H * V, ALF_DATA_FLOAT);
     alf_wb_dtl_set_end(wb_handle);
     
     /* begins a new Data Transfer List for OUTPUT */
     alf_wb_dtl_set_begin(wb_handle, ALF_BUF_OUT, 0);
     
     /* Add H*V element of mat_c as Output */
     alf_wb_dtl_set_entry_add(wb_handle, &matrix_c[i][0], H * V, ALF_DATA_FLOAT);
     alf_wb_dtl_set_end(wb_handle);
     
     /* pass parameters H and V to spu */
     alf_wb_parm_add(wb_handle, (void *) (&parm), sizeof(parm), ALF_DATA_BYTE, 0);
     
     /* enqueuing work block */
     alf_wb_enqueue(wb_handle);
}
alf_task_finalize(task_handle);

Task wait and exit

After all the work blocks are on the processing queue, the program waits for the accelerator to finish processing the work blocks. Then alf_exit() is called to cleanly exit the ALF runtime environment.
/* waiting for all work blocks to be done*/
alf_task_wait(task_handle, -1);
/* exit ALF runtime */
alf_exit(alf_handle, ALF_EXIT_WAIT, -1);

Accelerator side

On the accelerator side, you need to provide the actual computational kernel that computes the addition of the two blocks of matrices. The ALF runtime on the accelerator is responsible for getting the input buffer to the accelerator memory before it runs the user-provided alf_accel_comp_kernel function. After alf_accel_comp_kernel returns, the ALF runtime is responsible for getting the output data back to host memory space. Double buffering or triple buffering is employed as appropriate to ensure that the latency for the input buffer to get into accelerator memory and the output buffer to get to host memory space is well covered with computation.
int alf_accel_comp_kernel(void *p_task_context,
                          void *p_parm_context,
                          void *p_input_buffer,
                          void *p_output_buffer,
                          void *p_inout_buffer,
                          unsigned int current_count,
                          unsigned int total_count)
{
  unsigned int i, cnt;
  vector float *sa, *sb, *sc;
  add_parms_t *p_parm = (add_parms_t *) p_parm_context;

  /* number of vector floats per block (4 floats per vector) */
  cnt = p_parm->h * p_parm->v / 4;
  sa = (vector float *) p_input_buffer;   /* mat_a block */
  sb = sa + cnt;                          /* mat_b block follows mat_a */
  sc = (vector float *) p_output_buffer;  /* mat_c block */

  /* vector additions, loop unrolled by four */
  for (i = 0; i < cnt; i += 4) {
    sc[i] = spu_add(sa[i], sb[i]);
    sc[i + 1] = spu_add(sa[i + 1], sb[i + 1]);
    sc[i + 2] = spu_add(sa[i + 2], sb[i + 2]);
    sc[i + 3] = spu_add(sa[i + 3], sb[i + 3]);
  }
  return 0;
}