The course covers a wide range of topics in parallel computing, with an emphasis on developing efficient software for commonly available systems, such as the clusters here on campus. We will emphasize standard software, including MPI (Message Passing Interface) and OpenMP. Using standard software makes code more portable, which preserves the time invested in it. Because there are several parallel computing models, you will also need to learn something about various parallel architectures, since there is significant hardware/software interaction. This includes interconnection networks, shared vs. distributed memory, fine-grain vs. medium-grain parallelism, MIMD/SIMD, cache structure, etc. If we can get sufficient access to machines, we will also have an assignment involving GPUs, programmed with CUDA or OpenCL.
We examine many reasons for poor parallel performance, and a wide range of techniques for improving it. Concepts such as domain decomposition; deterministic, probabilistic, and adaptive load balancing; and synchronization will be covered. You'll learn why modest parallelization is relatively easy to achieve, and why efficient massive parallelization is quite difficult. In particular, we will cover various implications of Amdahl's Law. Examples and programs will be numeric, such as matrix multiplication, and nonnumeric, such as sorting, but they do not assume deep knowledge of either field, and you'll be given all the information you need to understand each problem.
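As a rough illustration of why Amdahl's Law makes massive parallelization hard, here is a minimal sketch in Python. The function name `amdahl_speedup` and the 95% parallel fraction are assumptions chosen for illustration, not values taken from the course. The sketch uses the usual idealized form of the law, which ignores communication and load-imbalance costs.

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Idealized Amdahl's Law speedup.

    parallel_fraction: share of the serial run time that can be parallelized (0..1).
    n_procs: number of processors (>= 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time no matter how many processors are
    # used; only the parallel part is divided among them.
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)


if __name__ == "__main__":
    # Hypothetical program in which 95% of the work parallelizes perfectly.
    for n in (1, 4, 16, 64, 1024):
        s = amdahl_speedup(0.95, n)
        print(f"{n:5d} processors: speedup {s:6.2f}, efficiency {s / n:6.1%}")
```

With a 95% parallel fraction the speedup can never exceed 1 / (1 - 0.95) = 20, however many processors are added, so efficiency falls steadily as the processor count grows.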
You'll also learn something about high performance computing in general. It makes no sense to spend a lot of money on parallel computers if you could have gotten just as much performance from a serial computer by tuning your code. For example, intelligent use of cache is critical, and can give an order of magnitude improvement in some cases.
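One common form of cache tuning is loop interchange, sketched below for matrix multiplication. The choice of this technique and the function names `matmul_ijk` and `matmul_ikj` are assumptions for illustration. Python is used only to show the loop structure: the large speedups occur in compiled languages such as C, where a row-major array is contiguous in memory, and the difference in interpreted Python will be much smaller.

```python
def matmul_ijk(a, b):
    """Textbook loop order: the inner loop walks down a column of b."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                # b[k][j] touches a different row of b on every iteration,
                # which is a large stride in a row-major layout.
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_ikj(a, b):
    """Interchanged loop order: the inner loop walks along rows of b and c."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        row_c = c[i]
        for k in range(m):
            a_ik = a[i][k]
            row_b = b[k]
            for j in range(p):
                # Consecutive j values are neighbors within a row, so each
                # cache line fetched is fully used before moving on.
                row_c[j] += a_ik * row_b[j]
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    # Both orderings do the same arithmetic; only the memory access pattern differs.
    print(matmul_ijk(a, b) == matmul_ikj(a, b))
```

Both functions accumulate each entry over k in the same order, so they return identical results; the interchange changes only the order in which memory is visited.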