“As soon as an Analytical Engine exists, it will necessarily guide the future course of the science. Whenever any result is sought by its aid, the question will then arise — by what course of calculation can these results be arrived at by the machine in the shortest time?”
Charles Babbage (1864)
Points to Ponder
Would it not be wonderful if we could write all our simulations as serial programs, and have parallelized code (highly optimized for any given supercomputer) generated automatically by the compiler? Why is this not the case today? Why do supercomputing centers require teams of highly trained developers to write simulations?
Introduction
Scientists around the world develop mathematical models and write simulations to understand systems in nature. In many cases, simulation performance becomes an issue as datasets (problem sizes) grow larger or as higher accuracy is required. To resolve these performance issues, parallel processing resources can be utilized. Since a large number of these simulations are developed in high-level tools such as Matlab, Mathematica, or Octave, the obvious choice for the scientist is to use the parallel processing functions provided within the tool. A case in point is the parfor construct in Matlab, which executes the iterations of a for-loop in parallel. However, when such a tool fails to parallelize a for-loop, it can be hard to understand why parallelization failed and how one might change the code to help the tool along.
Books and courses on parallel programming customarily focus on teaching how to write parallel programs using programming models such as OpenMP, MPI, Intel Threading Building Blocks (TBB), NVIDIA CUDA, and OpenCL. Knowledge of these programming models and of parallelization techniques, though helpful, is not sufficient for working with loop auto-parallelization tools. To understand when and why these tools fail, it helps to know how they work. My hope is that knowing how these tools and functions work will help you make better use of them.
This is the first in a series of posts that will introduce the reader to concepts in parallel programming. As opposed to most courses and books on the subject, the approach taken in this blog is based on dependence analysis, the method used by parallelizing compilers. My intention is to make the material accessible to those writing simulations in high-level tools, as well as to those interested in the inner workings of parallelizing compilers.
Throughout this series of posts, I’ll try to answer the following questions:
- What are some fundamental concepts in parallel programming?
- What are some generic methods used in parallelizing compilers?
- Why don’t we have automation tools that take any given serial program and convert it into an efficient parallel program? In other words, what are the limitations of parallelizing compilers?
Auto-parallelization and For-Loops
I collaborate on developing parallel simulations with colleagues from the natural sciences. Their primary focus is on understanding some natural phenomenon, for which they develop mathematical models. High-level tools like Matlab are well suited to this task, since they let one express the model with relative ease. As performance requirements grow to the point where parallel processing becomes a necessity, the simulation developer typically has two choices: 1) utilize the parallel processing functions available within the high-level simulation environment, or 2) port the simulation code to a lower-level programming language such as C++ or Fortran.
Even though the first option seems simpler, it shares an underlying problem with the second: parallel programming diverts one’s efforts from the scientific problem and requires one to focus on parallelization, concurrency, and performance optimization issues, often for a considerable amount of time. But wait! It is understandable that writing code in a low-level language can be time consuming; why is using the parallel processing functions of a high-level tool complicated? Shouldn’t that be as straightforward as calling just another function?
Well, it should be simple enough, but there is a catch. When you use something like parfor in Matlab, you essentially delegate the tasks of extracting parallelism and optimizing the parallel code to the compiler. But compilers have limitations (which we will uncover in subsequent posts); they can only parallelize code that is written in a certain way. The algorithms compilers use to transform serial code into parallel code are nowhere near as clever as a trained parallel programmer. As a result, auto-parallelization tools are not as easy to work with as one would expect. Moreover, the errors these tools generate can be hard to interpret, i.e., it is often not clear how to resolve the issue. Therefore, even if one wishes only to use auto-parallelization tools, a basic understanding of parallel programming and of the principles and algorithms behind auto-parallelization can be very helpful.
Let’s take a look at a couple of loop auto-parallelization examples with Matlab’s parfor.
parfor i = 1:size(A,2)
    B(i) = alpha * A(i);            % each iteration touches only its own element
end

parfor i = 2:size(A,2)
    B(i) = B(i-1) + alpha * A(i);   % representative body: iteration i reads the result of iteration i-1
end
The first example (the first snippet above) consists of two equal-sized row-vectors A and B, and a scalar alpha. The loop iterates over the elements of A, multiplies each element by alpha, and assigns the result to the corresponding element of B. parfor parallelizes this loop successfully; no errors. The reason parallelizing this loop is as simple as writing parfor is that each iteration of the loop is completely independent of all other iterations. This means that if we have four iterations and four processing units, each iteration can be assigned to a different processor. The execution order of the iterations does not affect the final result.
However, in our second example (the second snippet above), parfor fails to parallelize the code, pointing out that there are dependencies between loop iterations. This is because each iteration depends on the result computed in the previous iteration: each element of B depends on the value of the preceding element. There is only one possible order in which the iterations can execute, so different iterations cannot be assigned to different processing units.
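To make the chain concrete, here is the second loop unrolled for its first few iterations, using the representative body shown in the snippet above and arbitrary values chosen just for this sketch; each statement reads a value that the previous statement has only just written:

A = rand(1, 4);  alpha = 2.5;  B = zeros(1, 4);   % arbitrary values for illustration
B(2) = B(1) + alpha * A(2);   % needs B(1)
B(3) = B(2) + alpha * A(3);   % needs the B(2) computed by the line above
B(4) = B(3) + alpha * A(4);   % needs the B(3) computed by the line above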
We will discuss the concept of dependence in much more depth in subsequent posts. The important point for now is that serial code always comes with a particular execution order. That order, however, does not necessarily constitute a real constraint on the program. It is often possible to relax the given ordering, and this is in fact how automation tools and compilers find parallelism in code. If the iterations of a loop can be reordered without affecting the final result, they can also be distributed across different processing units.
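As a minimal sketch of that freedom (plain Matlab, with vector sizes and values chosen arbitrarily here), the independent loop from the first example produces exactly the same result when its iterations run in reverse order, which is precisely the property a parallelizing compiler looks for:

A = rand(1, 8);  alpha = 2.5;

B_forward = zeros(size(A));
for i = 1:numel(A)
    B_forward(i) = alpha * A(i);
end

B_reversed = zeros(size(A));
for i = numel(A):-1:1               % same iterations, opposite order
    B_reversed(i) = alpha * A(i);
end

isequal(B_forward, B_reversed)      % true: the serial ordering was not a real constraint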
Why For-Loops?
You might be wondering, “why does he keep referring to parallel for-loops?” In most numerical algorithms used in scientific computing, for-loops comprise the most compute-intensive segments of the code. Larger datasets and more finely resolved meshes both translate into a larger number of loop iterations. For example, in Monte Carlo algorithms, more iterations yield better convergence; similarly, in Finite Element codes, the finer the mesh, the better the results, and the larger the number of iterations required. As the number of iterations increases with problem size, more parallelism becomes available, so more parallel compute resources can be put to use on larger problems.
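As a toy illustration (not taken from any particular simulation, and assuming the Parallel Computing Toolbox is available), consider a Monte Carlo estimate of pi: the loop trip count grows with the number of samples requested, every iteration is independent, and parfor can treat the accumulator as a reduction variable:

N = 1e6;        % more samples -> more iterations -> more parallel work
hits = 0;
parfor k = 1:N
    p = rand(1, 2);                           % random point in the unit square
    hits = hits + (p(1)^2 + p(2)^2 <= 1);     % reduction: the order of the additions does not matter
end
pi_estimate = 4 * hits / N;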
Nevertheless, a for-loop is not the only place in a serial program where one can find parallelism. Many simulations consist of several distinct algorithms that can execute independently of each other. In a simulation with two such independently executable segments, one might consider running them on two separate processing units. The problem with this strategy is that the amount of parallelism is constant: no matter how large the dataset or how fine the mesh, the number of parallel tasks will always be two. This is a severe limitation, and it is therefore not considered a good strategy on its own.
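For concreteness, here is a sketch of that strategy using parfeval from the Parallel Computing Toolbox; segment_one, segment_two, and data are hypothetical placeholders for two mutually independent parts of a simulation. No matter how large data becomes, this arrangement never exposes more than two concurrent tasks:

% segment_one and segment_two are hypothetical functions that do not depend on each other
f1 = parfeval(@segment_one, 1, data);   % launch the first segment asynchronously
f2 = parfeval(@segment_two, 1, data);   % launch the second segment asynchronously
result_one = fetchOutputs(f1);          % wait for and collect the first result
result_two = fetchOutputs(f2);          % wait for and collect the second result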
Up Next
We’ll stop here today, and continue next time with a more formal treatment of the concept of dependence. We will also look into memory hierarchies in modern architectures, and the role ‘locality’ plays in performance optimization. Finally, we’ll see how dependence theory is not only used to find parallelism, but is also used to optimize for locality.
Note: The mathematical concepts we will use in the subsequent posts are from the fields of Linear Algebra and Integer Linear Programming. We will also use the concepts of Partial Ordering and Total Ordering. My intention is to provide a simplified introduction to these concepts in the corresponding posts.