# Three challenges in parallel programming

As we seek to develop parallel applications, we must understand that such development presents at least three major challenges. I argue that all these challenges are equally present whether one programs a many-core GPU, an MIC, or a multi-core CPU. Unfortunately, there is little compiler technology that can help programmers to meet these challenges today. These challenges are the reasons why compiler-based solutions from vendors will have limited success in creating a scalable parallel code base for many applications. The focus of this blog is to help you, the reader, clearly understand these challenges before we discuss key techniques that have been proven effective in practice.

The most difficult challenge is that some important problems do not have massively parallel algorithms that exhibit desired behavior. To put this challenge into perspective, in order to achieve exa-FLOPS performance by the end of the decade, we will need to have billion-way parallelism assuming the current trends in clock frequency. At the chip level, developers who hope to enjoy continued performance growth in the next decade will need to use algorithms with at least ten-thousand-way parallelism. Many “scalable” algorithms today fall short when measured with this standard.

There are three levels of difficulties in the parallelism challenge. First, some problems do not have work-efficient parallel algorithms that exhibit massive parallelism. That is, we simply do not know how to solve these problems with a large number of parallel execution units without significantly increasing the computation complexity. Examples of such problems include finding shortest paths between two points in a graph and Delaunay triangulation of a mesh. In the shortest path problem, all known parallel algorithms with high levels of parallelism have significantly higher complexity, or equivalently much lower work efficiency, compared to the most efficient sequential algorithms. For large data sets, the increased computational complexity unfortunately results in so much work that the execution time of parallel algorithms can be slower than sequential algorithms. Unfortunately, large data sets are what make parallel algorithms compelling and thus make the work inefficient parallel algorithms impractical.

In Delaunay triangulation, we have algorithms that exhibit a small amount of parallelism due to the lack of known methods to divide the mesh into independent parts for large-scale parallel processing. There is simply not enough parallelism to achieve good performance scalability in the next several hardware generations. In many cases, finding work-efficient parallel algorithms or functionally correct problem partition methods has been the target of decades of investigations and would be a fundamental contribution to computer science. While we cannot expect to make broad, rapid progress in addressing this challenge, we must continue to support long-term, excellent work in this area.

Second, some problems have known parallel algorithms with ample parallelism but questionable numerical stability. Some of these parallel algorithms do not have the same level of numerical stability as well-known sequential algorithms. For example, the Parallel Cyclic Reduction (PCR) algorithm for solving tri-diagonal systems performs independent variable elimination operations on adjacent equations to recursively divide a linear system into independent sub-systems. While PCR by itself is not work efficient, with computational complexity of O(Nlog(N)), it can form a work efficient hybrid with the sequential Thomas algorithm. Such hybrid has been shown to exhibit excellent scalability of parallelism with large data size. However, for some numerical applications, PCR based algorithms are known to be numerically unstable. Such inferior numerical stability has hampered the production use of PCR hybrids. Similar problems exist in other linear algebra solvers that achieve better problem division by avoiding pivoting. Further work is needed to improve the numerical stability of these parallel algorithms.

Third, some applications that have parallel algorithms are plagued by catastrophic load imbalance due to highly non-uniform data distribution. For example, there are highly parallel, numerically stable breadth-first search algorithms for graph problems. These algorithms scale well for uniformly distributed graphs where the number of neighbors for nodes does not vary dramatically. However, their performance drops rapidly when the distribution of node connectivity becomes highly skewed. For scale-free graphs, a small number of nodes are connected to a very large number of neighbors. Such non-uniform connectivity is known to cause severe load imbalance and often make parallel algorithms run slower than sequential algorithms. Unfortunately, many real-world graph problems are based on graphs with highly skewed connectivity. Dealing with such load imbalance is difficult. There are limited data pre-processing techniques that can help alleviate the problems with moderate level of load imbalance. However, effective solutions are still yet to be developed for many common severe cases.

The second challenge arises in applications where parallel algorithms do not have sufficient data reuse to achieve good scalability in manycore processors. A good example is the sparse matrix-vector multiplication kernel of conjugate gradient solvers. While there is ample parallelism in the algorithm, there is no data reuse in accessing the sparse matrix. While there should be some reuse of the vector elements in theory, the working set is typically too large to be captured by the available on-chip storage. As a result, the achieved performance is typically limited by the available off-chip memory bandwidth, which will not be increasing as fast as the on-chip execution resources. There is active research in communication avoidance algorithms to improve data reuse in these applications. In the context of a conjugate gradient solver, a potential solution is to perform multiple steps of matrix-vector multiplication without performing a residual-gradient calculation in between. There are existing techniques that can organize such polynomial computation of the matrix and the vector to promote data reuse. However, such approach does not have close guidance from the residual-gradient evaluations and thus may end up converging more slowly. Furthermore, it can also affect the numerical stability of the solver. Active research projects are being conducted to address these difficulties. While lack of data locality is a major cause of scalability problems today, its impact will grow in the future. The growing cost of data movement with each generation of computing system is compelling developers to more aggressively organize and express data locality in their applications.

The third challenge, and perhaps the least daunting one of the three, is that for parallel algorithms with high levels of parallelism and significant data locality, engineering the implementation of parallel algorithms to actually achieve good scalability is still challenging. Today, a parallel programmer needs to determine layout arrangements of data, allocate memory and temporary storage, arrange pointers, perform index calculation, and orchestrate data movement in order to make use of the on-chip memory resources to support data re-use. The programmer also has to decompose work into tasks, organize threads to perform the tasks, perform thread index calculations to access data in different levels of the memory hierarchy, determine data sharing patterns, and check data bounds. Many parameters of these arrangements need to be determined for each hardware platform. All these efforts are tedious and error prone, whether one is programming GPUs, MIC, or multi-core CPUs. None of these implementation problems will go away by simply recompiling applications for any of these architectures.

The ability to meet these challenges often defines the haves and have-nots in parallel application development. Problems with good solutions to these challenges enjoy excellent scalability and efficiency. Others struggle. In the next several postings in this miniseries, I will discuss key techniques that have been proven effective in developing scalable parallel algorithms for challenging applications.