Parallel computing can be easy if you don't care about DRAM bandwidth

Wen-mei Hwu, PCI Chief Scientist
04/02/2012 - 13:40

I like the quote “Parallel programming can be easy if you don’t care about performance.” What makes it hard is when we want performance in the face of limited memory bandwidth. This is quite true in dense matrix linear algebra. Most of the complexity in parallel programming has to do with managing the timing of data use so that each data element fetched from DRAM is used multiple times, thereby conserving DRAM bandwidth. Historically, expert programmers have used tiling/blocking techniques to achieve this re-use in linear algebra libraries. These techniques have become a necessity for good performance on virtually all modern processors: single-core CPUs, multicore CPUs, and manycore GPUs. Simple expressions of parallel algorithms for any of these processors would likely use up too much DRAM bandwidth. Today, the stakes are higher for manycore GPUs because they tend to have higher calculation rates and smaller amounts of on-chip memory, which makes them more susceptible to the effects of limited DRAM bandwidth.

Let’s use matrix-matrix multiplication to illustrate this point. Matrix-matrix multiplication is an intrinsically parallel problem. Say we would like to calculate the product P of two matrices M and N. Since we are talking about simple parallel programming, we will take a simple approach in which every thread calculates one element of P. For brevity, we will assume that the matrices are square, with Width elements in both the horizontal and vertical directions. The calculation of every P element is independent of that of every other P element. The following CUDA code, when called by a host program, launches a collection of Width² parallel threads. Since Width is often in the hundreds or thousands of elements, we have a massively parallel code.

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
1.  int Row = blockIdx.y*blockDim.y+threadIdx.y;
2.  int Col = blockIdx.x*blockDim.x+threadIdx.x;
3.  if ((Row < Width) && (Col < Width)) {
4.    float Pvalue = 0;
5.    for (int k = 0; k < Width; ++k)
6.      Pvalue += M[Row*Width+k] * N[k*Width+Col];
7.    P[Row*Width+Col] = Pvalue;
    }
}

(Figure 1. A simple, parallel matrix-matrix multiplication kernel function)

Figure 1 is based on CUDA; an equivalent OpenCL version is very similar and thus not shown. There are only 7 lines of code in the function body. The function receives pointers to the matrices M, N, and P in DRAM, plus the Width value. CUDA uses a Single-Program-Multiple-Data (SPMD) programming style in which the hardware gives every thread unique indices so that the threads can self-organize and use these indices to access different parts of the data set. The thread indices appear as pre-initialized built-in variables in CUDA kernel functions. The threads can be organized into 1D, 2D, or 3D arrays, and the value ranges of these indices reflect that organization. For matrix multiplication, it is natural to use a 2D thread array, which maps conveniently to the 2D matrix data. Lines 1 and 2 show how the programmer maps these indices to the Row and Column of the P element calculated by each thread. Readers who come from an MPI background will recognize that these indices and their usage are very similar to MPI ranks.

Line 3 tests whether a thread is designated to calculate a row and column position that is within the valid range of P. This test is needed because CUDA thread arrays are in general configured in multiples of 16 or more threads per dimension for better hardware efficiency. For example, if the matrices have 1,000 elements in each dimension, the number of threads in each dimension will be 1,008 so that each dimension is a multiple of 16. The test in Line 3 turns off the last 8 threads in each dimension, since they do not have any P elements to calculate. Similar tests are also needed on CPUs in order to use their vector capabilities.
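
To make the padding concrete, a host-side launch for this example might look like the sketch below, assuming 16x16 thread blocks and that d_M, d_N, and d_P are device pointers to the matrices (both assumptions are mine, not part of the original code). The ceiling division rounds 1,000 up to 63 blocks, i.e. 1,008 threads, per dimension:

int Width = 1000;
dim3 dimBlock(16, 16);                               // 16x16 = 256 threads per block
dim3 dimGrid((Width + dimBlock.x - 1) / dimBlock.x,  // ceil(1000/16) = 63 blocks
             (Width + dimBlock.y - 1) / dimBlock.y); // 63 * 16 = 1008 threads per dimension
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);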

Lines 4 through 6 perform the inner product of a row of M and a column of N in order to generate the P element at the corresponding row and column position. This is done by first initializing an accumulator, and then using a loop to iterate through all the elements. Each iteration of the loop fetches one element from the assigned row of M and one element from the assigned column of N, multiplies them together, and accumulates the product into the accumulator.

I should comment on the index calculations in Line 6. For M, each thread fetches M elements along the row specified by the Row variable. We assume that M, N, and P are all dynamically allocated arrays whose sizes are determined at run time. In C, dynamically allocated multi-dimensional arrays are accessed as 1D arrays in row-major order: all elements of row 0 first, followed by row 1, then row 2, and so on. Each row has Width elements, so there are Row*Width M elements before the beginning of the row in memory. Since all elements of a row occupy consecutive locations in C, the element to be accessed in iteration k is M[Row*Width + k]. As for N, the thread fetches N elements along the column specified by the Col variable. The first element of the column is the Colth element of row 0, but each subsequent element is Width elements away from the previous one. So, the element to be accessed in iteration k is N[Col + k*Width]. Note that these index calculations are not complicated by parallel programming; they are just part of accessing dynamically allocated 2D arrays in C.
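
For comparison, here is a plain sequential C sketch (my own illustration, not part of the original post) that multiplies the same dynamically allocated, row-major matrices using exactly the same index expressions:

// Sequential reference: P = M * N, all matrices Width x Width,
// stored as 1D arrays in row-major order.
void MatrixMulHost(const float* M, const float* N, float* P, int Width)
{
  for (int Row = 0; Row < Width; ++Row) {
    for (int Col = 0; Col < Width; ++Col) {
      float Pvalue = 0;
      for (int k = 0; k < Width; ++k)
        Pvalue += M[Row*Width + k]   // element k of row Row of M
                * N[k*Width + Col];  // element Col of row k of N
      P[Row*Width + Col] = Pvalue;
    }
  }
}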

We can estimate the effect of limited DRAM bandwidth by calculating the expected performance level of the kernel code in Figure 1. The most important part of the kernel, in terms of the execution activity of each thread, is the loop in Lines 5 and 6. In every iteration of this loop, two memory accesses are performed for one floating-point multiplication and one floating-point addition. One memory access fetches an M element and the other fetches an N element. One floating-point operation multiplies the M and N elements fetched, and the other accumulates the product into the auto variable Pvalue. In CUDA, C auto variables are assigned to registers whenever possible, so this accumulation does not cause any memory access. Thus, the ratio of floating-point calculations to global memory access operations is 1 to 1, or 1.0.
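
To make the counting concrete, here are Lines 5 and 6 again with the per-thread, per-iteration operations annotated as comments (the counts are exactly those described above):

5.    for (int k = 0; k < Width; ++k)
6.      Pvalue += M[Row*Width+k]   // one 4-byte global memory load (M element)
              *  N[k*Width+Col];   // one 4-byte global memory load (N element)
                                   // one multiply + one add = 2 floating-point operations
      // 2 FLOPs per 2 operands fetched from DRAM: a ratio of 1.0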

In a high-end manycore GPU today, the peak floating-point calculation rate is about 1.0 TFLOPS (single precision) and the memory bandwidth is about 150 GB/s, or 37.5 G single-precision operands per second. If we need to fetch one operand from DRAM for every floating-point operation, we can expect no more than 37.5 GFLOPS of achieved calculation rate for the kernel. This is a mere 3.75% of the peak calculation rate. With the small caches in today’s GPUs, we can probably expect up to 50% of the memory accesses to hit in cache. This could raise the achieved execution rate to 75 GFLOPS, still less than 10% of the peak.
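
The arithmetic above can be written out explicitly. The snippet below simply redoes the estimate; the 1.0 TFLOPS and 150 GB/s figures are the rough numbers assumed in this post, not measurements of any particular GPU:

#include <stdio.h>

int main(void)
{
  double peak_gflops       = 1000.0; // ~1.0 TFLOPS single-precision peak (assumed)
  double bandwidth_gb_s    = 150.0;  // ~150 GB/s DRAM bandwidth (assumed)
  double bytes_per_operand = 4.0;    // sizeof(float)
  double flops_per_operand = 1.0;    // ratio derived for the simple kernel

  // 150 GB/s / 4 bytes = 37.5 G operands per second
  double operands_per_sec_g = bandwidth_gb_s / bytes_per_operand;

  // Bandwidth-limited ceiling on the achieved calculation rate
  double achievable_gflops = operands_per_sec_g * flops_per_operand;

  printf("Achievable: %.1f GFLOPS (%.2f%% of peak)\n",
         achievable_gflops, 100.0 * achievable_gflops / peak_gflops);
  return 0;
}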

Based on the discussion, I would like to revise the quote to be “Parallel programming can be simple as long as you don’t care about DRAM bandwidth.” The parallel matrix-matrix multiplication kernel is very simple and fairly easy to understand. The only slight complexity is due to index calculation for accessing dynamically allocated 2D arrays in C. It is unfortunately hard to ignore the fact that we are using less than 10% of the calculation capabilities of the processor!

Traditionally, CPU designs have used brute force to address the DRAM bandwidth and latency problem. By incorporating multiple megabytes of on-chip cache memory, CPUs can hold the entire M and N arrays in cache when the matrices have hundreds of elements in each dimension. Having the entire working set in the on-chip cache allows CPU programmers to be largely unaffected by limited memory bandwidth. However, performance can drop dramatically when the CPU caches fail to capture the working set. This has made even sequential CPU code complex, with tiling techniques required for linear algebra functions that must handle large matrices. Furthermore, power constraints will likely reduce the amount of on-chip cache per core per thread in future CPU designs. I expect that programmers for all types of processors, sequential or parallel, will need to use tiling techniques for large array computation, even though these techniques add much complexity to the code. As you would expect, I will be discussing tiling techniques in the next blog.