NAS Parallel Benchmark: Integer Sort in NVIDIA CUDA

Ben Clay - rbclay [at] ncsu.edu

Final Report

The final report is too long to list here. Needless to say, we were unable to significantly speed up Integer Sort's execution, for a host of algorithm-related and environmental reasons. However, it was an excellent learning experience and gave us a lot of respect for the effort required to parallelize code. See the full report, including experimental results, below. Source can be found here.

Report Files

The initial report is available in PDF form here.

The second report is available in PDF form here.

The final report is available in PDF form here.

Intermediate Report

Overview

Since the last report was written, we have made a few important architectural changes to our proposed solution, and have also hit some unforeseen problems in our implementation. Our original timeline was overambitious, and has been revised based on these experiences. Parallelizing NAS PB Integer Sort with CUDA is definitely feasible, but these issues must be overcome before significant speedups are realized.

Updates to Proposed Solution

Previously, the core of our proposed speedup was running multiple iterations of the rank() function simultaneously. While we still intend to investigate this, our initial work splitting and porting rank() to CUDA has left us seriously concerned about data dependence between successive iterations of rank(), as well as the logistics of the required MPI-based communication.

Initially, we misinterpreted the functionality provided by the cudaMPI library. cudaMPI allows the host to perform MPI calls on device memory locations, transparently handling the DMA operations for the programmer. We had initially believed that cudaMPI allowed the devices themselves to place MPI calls, which was erroneous. As a result, the original host-run rank() function has been split into 3 separate CUDA kernels with MPI calls between them: cudaMPI is used between rank1 and rank2, and normal MPI calls with manual DMA are used between rank2 and rank3. This, coupled with the logistics of managing a different dataset for each iteration, significantly complicates the original proposal to run rank() iterations in parallel (cross-iteration). While we still intend to investigate this concept, it is no longer the core focus of our speedup efforts.

Instead, we intend to focus on parallelizing the individual rankX() kernels themselves, to complete the execution of the rank() iterations as fast as possible. rank1() is the best candidate, with multiple loops iterating over the largest arrays used in IS, but rank2() and rank3() could benefit from parallelization as well. This is a more reasonable target given the data dependence issue with cross-iteration changes, and even if cross-iteration speedup proves feasible, faster rankX() executions would benefit it as well. We feel this is a better target for initial speedup, with cross-iteration investigation performed later as time allows.

Implementation Issues

In attempting to parallelize the rankX() functions, we have hit two significant roadblocks. The first and most troubling is data corruption in global memory due to contention across blocks. The other is kernel crashes caused by our use of shared memory within blocks, which is probably due to implementation specifics.

When we encountered this data corruption issue, we were attempting to parallelize the larger loops inside rank1(). These loops iterate 2 million times for class A, and many more times for the larger classes, so they are the central part of what we want to speed up. The large size of these arrays means we must keep them in global memory, which is slower than shared memory. The core of these loops is sorting the array of random keys into buckets, which involves using the random keys as indices into other temporary arrays. The result is that for a given CUDA thread we can partition the input space (the key arrays) so that each thread works independently, but we don't know which values in the intermediate array will be altered, meaning cross-thread dependence can exist. This creates a contention problem: two random keys handled by two different threads may cause writes to the same intermediate array position. This is complicated by the fact that most of the writes are in fact increments, which means each operation is both a read and a write.

When we first parallelized these loops, we were able to run them with very small numbers of threads (2-8 total), but above this number we started seeing data corruption, which IntegerSort detects during its partial and full verification phases. To fix this, we decided to use the atomic operations that CUDA provides, which guarantee atomic transactions on both shared and global memory. Using these atomic operations, we are able to scale our loops to 1 block of 512 threads, a significant increase over 8 threads. However, even in the best cases, we are unable to move beyond 1 block, presumably because of our reliance on the __syncthreads() function. __syncthreads() acts as a barrier for all threads within a block, but is local to that block, so when multiple blocks operate in parallel, one can move on before another's global memory changes have taken effect. The __threadfence() function ensures memory visibility across blocks, but does not solve our problem.

To synchronize across blocks, we will have to split the rank1 kernel further into several kernels. Parallelizing rank1() from 1 to 512 CUDA threads reduced the overall IS class A execution time from 40 seconds to 18 seconds, but given that it executes in 4 seconds on the host, this is still not acceptable.

The second issue we encountered was during an attempt to move some of the smaller intermediate arrays used in rank1() into shared memory, which should speed up all operations, including atomic ones, by significantly reducing memory latency. Shared memory introduces a new cross-block synchronization problem, but since we are currently limited to 512 threads (one block), we chose to ignore that issue for now. Shared memory is limited to 16 KB per block, so the targeted arrays must total less than this. Most of them are around 1024 elements long and composed of integers, so 1024 * 4 bytes = 4096 bytes, well under the maximum. We should also be able to fit 2 or even 3 of these arrays per block, but we chose to start with a single shared array.

The problem that resulted from using shared arrays was that subsequent kernel invocations caused system crashes. We assume this is due to the shared array overwriting some valuable block context information, or other memory that should not be touched. This occurred both when we used an explicitly sized shared array and when we used an unsized one, as described in the CUDA programming guide (extern __shared__ int myarray[];). We found this very puzzling, because we got no errors during the kernel execution that uses the shared memory, but instead during the following kernel execution. We have never used shared arrays of this size with CUDA before, but as demonstrated above we should be well below the system's maximum. We will continue to investigate this issue, as shared memory offers good speedup, but since CUDA's atomic operations support both global and shared memory, this conversion is desirable but not required.

Timeline – Updated

Subject to change depending on the final project due date.
Items in green have been completed.

Day 4

Day 6

Day 10

Day 15

Day 25

Day 28

Day 30

Day 35

Day 40






Initial Report

Problem Description

The NAS Parallel Benchmark standard's Integer Sort (IS) benchmark [1] describes a bucket sort of random integers. The particular implementation we examine is written in C and uses MPI for communication.

Using the mpiP tool [2], we see that the Class B test takes an average of 9.53s to complete, and spends approximately 70% of program execution time performing MPI tasks. The Class C test takes an average of 37.13s to complete, and spends approximately 65% of time performing MPI tasks. The Class D test required too much memory to run on the test systems.

Using the gprof tool [3], we obtain the following partial results:

Class B
% time   Duration (seconds)   Calls     Function
39.58    0.67                 8388631   randlc()
38.10    0.64                 11        rank()
12.50    0.21                 1         full_verify()

Class C
% time   Duration (seconds)   Calls      Function
43.57    3.32                 11         rank()
33.92    2.59                 33554457   randlc()
10.50    0.80                 1          full_verify()

From these results, we see that the functions randlc() and rank() are the two most costly functions in terms of time. Furthermore, randlc() is high in cost because it is called so many times, while rank() takes a very long time to execute once.

Proposed Solution

Of the two most costly functions, the rank() function is best suited for hardware acceleration. It is the core compute kernel for the Integer Sort benchmark, which performs the bucket sort, while randlc() is part of the initialization routine. rank() is called serially 11 times on each MPI node, which could potentially be parallelized using NVIDIA CUDA [4] kernels. Each rank() call would be executed by a single CUDA thread, which would perform MPI synchronization with equivalent CUDA threads on other MPI hosts using the cudaMPI library [5].

One issue with this approach is that the rank() function contains two MPI calls that are not supported by cudaMPI, or by DCGN [6], another CUDA-MPI library: MPI_Alltoall() and MPI_Alltoallv(). These calls occur one after the other roughly in the middle of rank(), presenting the opportunity to split rank() into two CUDA kernels, rank1() and rank2(), and letting the x86 host perform the unsupported MPI calls between the two kernels. This will add DMA overhead, since values must be moved from CUDA memory to the host and back, but there doesn't appear to be any viable alternative. Even if the IS benchmark were accelerated using Cell, the Cell Messaging Layer [7] also does not support Alltoall() or Alltoallv(), making host execution of these calls a necessity.

Timeline

Day 2

Day 8

Day 9

Day 12

Day 13

Day 14



References

[1] NAS Parallel Benchmark.
[2] mpiP tool.
[3] gprof tool.
[4] NVIDIA CUDA.
[5] cudaMPI library.
[6] DCGN library.
[7] Cell Messaging Layer library.