Life is NOT fair in parallel computing

Wen-mei Hwu, PCI chief scientist
02/02/2012 - 14:11

The peach in Figure 1 depicts the level of difficulty in covering applications with multicore CPU architectures and manycore GPU architectures. The core of the cartoon peach represents the sequential portions of the applications. For various reasons, these portions cannot be expressed as multi-threaded code. These sequential portions have been the target of modern instruction-level parallelism techniques, which wring a limited amount of parallelism out of them. The latency reduction mechanisms in modern CPUs, including cache memories, branch predictors, and data forwarding, are important in preserving the instruction-level parallelism in these portions. As we all know, biting into the core of a peach is neither a pleasant nor a rewarding experience.
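To make the idea concrete, here is a minimal sketch (my own illustration, not from the figure) of the kind of loop that lives in the peach core: each iteration depends on the result of the previous one, so the loop cannot be split across threads, and only the hardware's instruction-level parallelism mechanisms can speed it up.

// A sketch of a sequential portion: a loop-carried dependence chain.
// x[i] needs x[i-1], so the iterations must execute in order; no
// multi-threading is possible, and only instruction-level parallelism
// within each iteration is available to the hardware.
#include <stdio.h>

int main(void) {
    float x[1000];
    x[0] = 0.5f;
    for (int i = 1; i < 1000; ++i) {
        x[i] = 0.5f * x[i - 1] + 1.0f;   // depends on the previous iteration
    }
    printf("x[999] = %f\n", x[999]);
    return 0;
}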

The orange flesh of the cartoon peach represents the data parallel portions of the applications. These portions typically process large data sets whose elements can be processed in parallel. Modern media and data analysis applications tend to have large data parallel portions. Traditional CPUs do not have sufficient execution resources to harvest this data parallelism and achieve a dramatic increase in performance. The bump coming out of the peach core depicts the SIMD or vector extensions added to CPU ISAs to exploit a higher level of data parallelism. The trend is to increase the width of SIMD instructions so that each instruction covers more data elements, as illustrated by the arrows coming out of the bump.
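As an illustration (my example, not from the article), the CUDA kernel below shows the kind of data parallel portion the flesh represents: every output element is independent of every other, so one thread can simply be assigned to each element.

// A sketch of a data parallel portion in CUDA C: a SAXPY-style
// element-wise operation.  No element depends on any other, so
// thousands of threads can each compute one element.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n) {
        y[i] = a * x[i] + y[i];                      // independent work
    }
}

// Host-side launch, assuming d_x and d_y are device pointers:
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);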

The meshed wedge in the peach flesh represents the types of data parallel applications that can be efficiently covered by manycore architectures today. These applications are typically based on parallel algorithms that have a high level of data re-use and regular, vector-style memory accesses, such as matrix-matrix multiplication, convolution, and particle-grid calculations. The arrows coming out of the wedge represent efforts to develop new algorithms and/or to add hardware resources, such as more on-chip memory, to broaden the types of applications that can be effectively covered by manycore architectures.
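For readers who want to see what data re-use and regular accesses look like in code, here is a minimal tiled matrix-matrix multiplication sketch in CUDA C (my illustration; the tile size and the assumption that the matrix dimension is a multiple of the tile size are mine). Each tile is loaded from global memory once into on-chip shared memory and then re-used TILE times, and adjacent threads load adjacent addresses, which is precisely the behavior that lets manycore chips run this algorithm efficiently.

// A sketch of a high-reuse, regular-access kernel: tiled matrix multiply.
// Assumes square n x n row-major matrices with n a multiple of TILE, and
// a grid that exactly covers C.
#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];   // on-chip staging for a tile of A
    __shared__ float Bs[TILE][TILE];   // on-chip staging for a tile of B

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Regular loads: adjacent threads read adjacent global addresses.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Each element loaded above is re-used TILE times from fast
        // on-chip memory instead of being re-fetched from DRAM.
        for (int k = 0; k < TILE; ++k) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();
    }
    C[row * n + col] = sum;
}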

The point is that there is a large population of parallel applications that are currently covered by neither CPUs nor manycore processors. Although these applications are rich in data parallelism, they lack the regular accesses and data re-use needed for efficient execution on manycore processors. The dividing line between the haves and the have-nots in parallel computing today is whether an application's algorithms exhibit data re-use and regular accesses. New efforts to design algorithms with more regular accesses and more data re-use are needed for these applications to run effectively on manycore processors. In fact, I argue that the same efforts are also needed for these applications to run effectively on multicore CPUs with wide SIMD extensions. This is where most of the satisfying and rewarding work in parallel computing will be done in the foreseeable future.
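By way of contrast, here is a sketch (again my own, not from the article) of the have-not pattern: a gather through an index array, typical of sparse matrix and graph codes. The parallelism is abundant, but the addresses are data-dependent and scattered, so there is little re-use and no regularity for the memory system to exploit.

// A sketch of the "have-not" pattern: an indexed gather.  The loads
// B[idx[i]] land at data-dependent, scattered addresses, so they cannot
// be coalesced and offer little re-use -- plenty of threads, but poor
// efficiency on today's manycore memory systems.
__global__ void gather(int n, const int *idx, const float *B, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = B[idx[i]];   // irregular, data-dependent access pattern
    }
}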