Step 3: Parallelize code for execution across multiple SPEs

The most common and practical method of parallelizing computation across multiple SPEs is to partition the data. This approach works well for applications with little or no dependency between the data partitions.

In our example, we can partition the Euler integration of the particles equally among the available SPEs. If there are four available SPEs, the first quarter of the particles is processed by the first SPE, the second quarter by the second SPE, and so forth.

The SPE code for this step is the same as that in Step 2, so only the PPE code is shown below.
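
The PPE code below fills in one parm_context structure per SPE. That structure is declared in particle.h (carried over from Step 2); based on the fields assigned to it in the code, it presumably looks something like the following sketch (the field order and any padding are assumptions, not the actual header):

typedef struct {
  int particles;          /* number of particles this SPE integrates  */
  vector float *pos_v;    /* pointer into the shared position array   */
  vector float *vel_v;    /* pointer into the shared velocity array   */
  vector float force_v;   /* constant force applied to each particle  */
  float *inv_mass;        /* pointer into the shared 1/mass array     */
  float dt;               /* integration time step                    */
} parm_context;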

PPE Code:
#include <stdio.h>
#include <stdlib.h>
#include <libspe2.h>
#include <pthread.h>
#include "particle.h"

#define MAX_SPE_THREADS  16

vec4D pos[PARTICLES] __attribute__ ((aligned (16)));
vec4D vel[PARTICLES] __attribute__ ((aligned (16)));
vec4D force __attribute__ ((aligned (16)));
float inv_mass[PARTICLES] __attribute__ ((aligned (16)));
float dt = 1.0f;

extern spe_program_handle_t particle;

typedef struct ppu_pthread_data {
  spe_context_ptr_t spe_ctx;
  pthread_t pthread;
  unsigned int entry;
  void *argp;
} ppu_pthread_data_t;

void *ppu_pthread_function(void *arg) {
  ppu_pthread_data_t *datap = (ppu_pthread_data_t *)arg;

  if (spe_context_run(datap->spe_ctx, &datap->entry, 0, datap->argp,
                      NULL, NULL) < 0) {
    perror ("Failed running context");
    exit (1);
  }
  pthread_exit(NULL);
}


int main()
{
  int i, offset, count, spe_threads;
  ppu_pthread_data_t datas[MAX_SPE_THREADS];
  parm_context ctxs[MAX_SPE_THREADS] __attribute__ ((aligned (16)));

/* Determine the number of SPE threads to create */
  spe_threads = spe_cpu_info_get(SPE_COUNT_USABLE_SPES, -1);
  if (spe_threads > MAX_SPE_THREADS) spe_threads = MAX_SPE_THREADS;

/* Create multiple SPE threads */
  for (i=0, offset=0; i<spe_threads; i++, offset+=count) {
    /* Construct a parameter context for each SPE. Make sure that each
     * SPE's particle count (excluding the last) is a multiple of 4 so
     * that the inv_mass context pointer is always quadword aligned.
     */
    count = (PARTICLES / spe_threads + 3) & ~3;
    ctxs[i].particles = (i == (spe_threads-1)) ? PARTICLES - offset : count;
    ctxs[i].pos_v = (vector float *)&pos[offset];
    ctxs[i].vel_v = (vector float *)&vel[offset];
    ctxs[i].force_v = *((vector float *)&force);
    ctxs[i].inv_mass = &inv_mass[offset];
    ctxs[i].dt = dt;
    
    /* Create SPE context */
    if ((datas[i].spe_ctx = spe_context_create (0, NULL)) == NULL) {
        perror ("Failed creating context");
        exit (1);
    }
    /* Load SPE program into the SPE context */
    if (spe_program_load (datas[i].spe_ctx, &particle)) {
      perror ("Failed loading program");
      exit (1);
    }
    /* Initialize context run data */
    datas[i].entry = SPE_DEFAULT_ENTRY;
    datas[i].argp = &ctxs[i];

    /* Create a pthread for each of the SPE contexts */
    if (pthread_create (&datas[i].pthread, NULL, &ppu_pthread_function,
      &datas[i])) {
      perror ("Failed creating thread");
      exit (1);
    }
  }

  /* Wait for all the SPE threads to complete.*/
  for (i=0; i<spe_threads; i++) {
    if (pthread_join (datas[i].pthread, NULL)) {
      perror ("Failed joining thread");
      exit (1);
    }
  }

  return (0);
}
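
To see how the partitioning arithmetic works out, suppose PARTICLES were 10,000 and six SPEs were usable (hypothetical numbers). PARTICLES / spe_threads is then 1666, which the expression (PARTICLES / spe_threads + 3) & ~3 rounds up to 1668 so that every SPE's slice of the inv_mass array starts on a 16-byte boundary. SPEs 0 through 4 each process 1668 particles starting at offsets 0, 1668, 3336, 5004, and 6672, and the last SPE processes the remaining 10000 - 8340 = 1660 particles.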

Now that the program has been migrated to the SPEs, you can analyze and tune its performance. This is discussed in Performance analysis.