Topics | BLOOM # | HOURS | Where Covered | Learning Outcome
Parallel Programming paradigms and Notations | | | |
By the target machine model | | 5 | |
SIMD | K | 0.5 | CS2; Systems | Understand common vector operations, including element-by-element operations and reductions.
● Processor vector extensions | K | | Systems | Know examples: SSE/AltiVec macros
● Array language extensions | N | | ParProg (A) | Know how to write parallel array code in some language (e.g., Fortran 95, Intel’s C/C++ Array Extension [CEAN])
Shared memory | A | 2.0 | CS2; DS/A; Lang | Be able to write correct thread-based programs (protecting shared data) and understand how to obtain speedup.
● Language extensions | K | | | Know about language extensions for parallel programming; illustrations from Cilk (spawn/join) and Java (Java threads)
● Compiler directives/pragmas | C | | | Understand what simple directives, such as those of OpenMP, mean (parallel for, concurrent section); show examples (an OpenMP sketch follows this group)
● Libraries | C | | | Know one in detail, and know of the existence of some other example libraries such as Pthreads, Pfunc, Intel's TBB (Threading Building Blocks), Microsoft's TPL (Task Parallel Library), etc.
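The compiler-directives row above asks to show examples. A minimal sketch of what such an example might look like in OpenMP (illustrative only, not part of the curriculum text; it assumes a C compiler with OpenMP support, e.g. `gcc -fopenmp`):

```c
/* Minimal OpenMP sketch: a parallel loop with a reduction, plus a critical
 * section protecting shared output. Compile with: gcc -fopenmp sum.c -o sum */
#include <omp.h>
#include <stdio.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    /* Each thread accumulates a private partial sum; OpenMP combines them. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += 1.0 / (i + 1);

    /* A critical region: only one thread at a time executes the printf. */
    #pragma omp parallel
    {
        #pragma omp critical
        printf("thread %d of %d done\n",
               omp_get_thread_num(), omp_get_num_threads());
    }

    printf("sum = %f\n", sum);
    return 0;
}
```

The `reduction` clause gives each thread a private partial sum and combines them after the loop; the `critical` pragma marks a critical region, the same protection-of-shared-data idea named in the shared memory row.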
Distributed memory | C | 1.0 | DS/A; Systems | Know basic notions of messaging among processes, different ways of message passing, collective operations
● Message passing | N | | ParProg (C) | Know about the overall organization of a message-passing program as well as point-to-point and collective communication primitives (e.g., MPI; a sketch follows this group)
● PGAS languages | N | | ParProg (C) | Know about partitioned address spaces and other parallel constructs (UPC, Co-Array Fortran, X10, Chapel)
● Client Server | C | 1.0 | DS/A; Systems | Know notions of invoking and providing services (e.g., RPC, RMI, web services); understand these as concurrent processes
Hybrid | K | 0.5 | Systems | Know the notion of programming over multiple classes of machines simultaneously (CPU, GPU, etc.)
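As an illustration of the message-passing row above, a minimal MPI sketch with one point-to-point transfer and one collective reduction (assuming an MPI implementation is installed; compile with `mpicc` and launch with `mpirun -np 4`):

```c
/* Minimal MPI sketch: one point-to-point message plus one collective reduction.
 * Build/run: mpicc msg.c -o msg && mpirun -np 4 ./msg */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Point-to-point: rank 0 sends one integer to rank 1 (if it exists). */
    if (rank == 0 && size > 1) {
        int msg = 42;
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int msg;
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", msg);
    }

    /* Collective: every rank contributes its rank number; rank 0 gets the sum. */
    int total = 0;
    MPI_Reduce(&rank, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks = %d\n", total);

    MPI_Finalize();
    return 0;
}
```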
By the control statement | | | |
Task/thread spawning | A | 1 | CS2; DS/A | Be able to write correct programs with threads, synchronize (fork-join, producer/consumer, etc.), and use dynamic thread creation (variable in number and possibly recursive) (e.g., Pthreads, Java threads, etc.); builds on the shared memory topic above (a Pthreads sketch follows this group)
SPMD | C | 1 | CS2; DS/A | Understand how an SPMD program is written and how it executes
● SPMD notations | C | | | Know of the existence of highly threaded data-parallel notations (e.g., CUDA, OpenCL), message passing (e.g., MPI), and some others (e.g., Global Arrays, BSP library)
Data parallel | A | 1 | CS2; DS/A; Lang | Be able to write a correct data-parallel program for shared-memory machines and get speedup; should do an exercise. Understand the relation between different notations for data parallelism: array notations, SPMD, and parallel loops. Builds on the shared memory topic above.
● Parallel loops for shared memory | A | | CS2; DS/A; Lang | Know, through an example, one way to implement parallel loops, and understand collisions/dependencies across iterations (e.g., OpenMP, Intel's TBB)
● Data parallel for distributed memory | N | | ParProg (K) | Know data-parallel notations for distributed memory (e.g., High Performance Fortran)
Functional/logic languages | N | | ParProg (K) | Understand advantages and disadvantages of very different programming styles (e.g., Parallel Haskell, Parlog, Erlang)
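To accompany the task/thread-spawning row above, a minimal Pthreads fork-join sketch (illustrative only; SPMD-style, each thread sums its own chunk of an array and the main thread joins and combines):

```c
/* Minimal Pthreads fork-join sketch: spawn NTHREADS workers, each processes a
 * disjoint chunk of an array, then the main thread joins them and combines
 * their partial results. Compile with: gcc -pthread workers.c -o workers */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000

static double data[N];
static double partial[NTHREADS];     /* one result slot per thread: no sharing */

static void *worker(void *arg) {
    long id = (long)arg;             /* thread index passed as the argument */
    long lo = id * N / NTHREADS;
    long hi = (id + 1) * N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)   /* each thread owns a disjoint chunk */
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1.0;

    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)          /* fork */
        pthread_create(&tid[t], NULL, worker, (void *)t);

    double sum = 0.0;
    for (long t = 0; t < NTHREADS; t++) {        /* join */
        pthread_join(tid[t], NULL);
        sum += partial[t];
    }
    printf("sum = %f\n", sum);
    return 0;
}
```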
Semantics and correctness issues | | | |
Tasks and threads | K | 0.5 | CS2; DS/A; Systems; Lang | Understand what it means to create and assign work to threads/processes in a parallel program, and know of at least one way to do that (e.g., OpenMP, Intel TBB, etc.)
Synchronization | A | 1.5 | CS2; DS/A; Systems | Be able to write shared memory programs with critical regions and producer-consumer communication, and get speedup; know the notions of mechanisms for concurrency (monitors, semaphores, etc. [from ACM 2008])
● Critical regions | A | | | Be able to write shared memory programs that use critical regions for synchronization
● Producer-consumer | A | | | Be able to write shared memory programs that use the producer-consumer pattern to share data and synchronize threads (a sketch follows this group)
● Monitors | K | | | Understand how to use monitors for synchronization
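The critical-region, producer-consumer, and monitor rows above can be illustrated together: a one-slot buffer guarded by a mutex and condition variables, which is essentially a monitor written by hand in Pthreads. A minimal sketch, assuming the `-pthread` flag:

```c
/* Producer-consumer sketch: a one-slot buffer guarded by a mutex (critical
 * region) and condition variables, i.e., a hand-rolled monitor in Pthreads.
 * Compile with: gcc -pthread pc.c -o pc */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static int slot, full = 0;                    /* shared state: one buffered item */

static void *producer(void *arg) {
    (void)arg;
    for (int i = 1; i <= 5; i++) {
        pthread_mutex_lock(&lock);            /* enter critical region */
        while (full)                          /* wait until there is space */
            pthread_cond_wait(&not_full, &lock);
        slot = i; full = 1;
        pthread_cond_signal(&not_empty);      /* wake a waiting consumer */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    for (int i = 0; i < 5; i++) {
        pthread_mutex_lock(&lock);
        while (!full)                         /* wait until an item arrives */
            pthread_cond_wait(&not_empty, &lock);
        printf("consumed %d\n", slot);
        full = 0;
        pthread_cond_signal(&not_full);       /* wake a waiting producer */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```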
Concurrency defects | C | 1.0 | DS/A; Systems | Understand the notions of deadlock (detection, prevention), race conditions (definition), and determinacy/non-determinacy in parallel programs (e.g., if there is a data race, the output may depend on the order of execution)
● Deadlocks | C | | | Understand what a deadlock is, and methods for detecting and preventing deadlocks
● Data Races | K | | | Know what a data race is, and how to use synchronization to prevent it (a sketch follows this group)
Memory models | N | | ParProg (C) | Know what a memory model is, and the implications of the difference between strict and relaxed models (performance vs. ease of use)
● Sequential consistency | N | | | Understand the semantics of sequential consistency for shared memory programs
● Relaxed consistency | N | | | Understand the semantics of one relaxed consistency model (e.g., release consistency) for shared memory programs
Tools to detect concurrency defects | K | 0.5 | DS/A; Systems | Know of the existence of tools to detect race conditions (e.g., Eraser)
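A data race and its fix fit in a few lines: the sketch below (illustrative only) increments a shared counter from two threads with and without a mutex; the unguarded result is non-deterministic, which is exactly the determinacy point in the concurrency-defects row above.

```c
/* Data race sketch: counter++ from two threads is not atomic, so the final
 * value depends on thread interleaving; guarding it with a mutex fixes that.
 * Compile with: gcc -pthread race.c -o race */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                         /* shared, racy          */
static long safe_counter = 0;                    /* shared, mutex-guarded */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *bump(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        counter++;                               /* DATA RACE: unsynchronized read-modify-write */
        pthread_mutex_lock(&lock);
        safe_counter++;                          /* critical region: no race */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, NULL);
    pthread_create(&t2, NULL, bump, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* counter typically ends below 2000000 and varies run to run;
     * safe_counter is always exactly 2000000. */
    printf("racy = %ld, guarded = %ld\n", counter, safe_counter);
    return 0;
}
```

A race detector such as ThreadSanitizer (`-fsanitize=thread` in gcc/clang) flags the unguarded increment, in the spirit of the tools row above (Eraser).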
Performance issues | | | |
Computation | C | 1.5 | CS2; DS/A | Understand the basic notions of static and dynamic scheduling, mapping, and the impact of load balancing on performance
Computation decomposition strategies | C | | | Understand different ways to assign computations to threads or processes
● Owner computes rule | C | | | Understand how to assign loop iterations to threads based on which thread/process owns the data element(s) written in an iteration
● Decomposition into atomic tasks | C | | | Understand how to decompose computations into tasks with communication only at the beginning and end of each task, and assign them to threads/processes
● Work stealing | N | | ParProg (C) | Understand one way to do dynamic assignment of computations
Program transformations | N | | Compilers (A) | Be able to perform simple loop transformations by hand, and understand how that impacts performance of the resulting code (e.g., loop fusion, fission, skewing)
Load balancing | C | 1.0 | DS/A; Systems | Understand the effects of load imbalances on performance, and ways to balance load across threads or processes
Scheduling and mapping | C | 1.0 | DS/A; Systems | Understand how a programmer or compiler maps and schedules computations to threads/processes, both statically and dynamically (an OpenMP scheduling sketch follows this group)
● Static | | | | Understand how to map and schedule computations before runtime
● Dynamic | | | | Understand how to map and schedule computations at runtime
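One concrete way to see the static vs. dynamic scheduling distinction above is OpenMP's `schedule` clause; a sketch under the same OpenMP assumptions as earlier (the chunk size 64 is an arbitrary illustrative choice):

```c
/* Scheduling sketch: the same irregular loop mapped to threads statically
 * (fixed blocks decided before the loop runs) vs. dynamically (idle threads
 * grab the next chunk at runtime, balancing uneven iteration costs). */
#include <omp.h>
#include <stdio.h>

static double work(int i) {            /* iteration cost grows with i -> load imbalance */
    double s = 0.0;
    for (int k = 0; k < i; k++) s += k * 0.5;
    return s;
}

int main(void) {
    const int n = 20000;
    double a = 0.0, b = 0.0;

    #pragma omp parallel for schedule(static) reduction(+:a)
    for (int i = 0; i < n; i++) a += work(i);      /* mapping fixed before runtime */

    #pragma omp parallel for schedule(dynamic, 64) reduction(+:b)
    for (int i = 0; i < n; i++) b += work(i);      /* chunks handed out at runtime */

    printf("a = %f, b = %f\n", a, b);
    return 0;
}
```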
Data | K | 1.0 | DS/A; Lang | Understand the impact of data distribution, layout, and locality on performance; know false sharing and its impact on performance (e.g., in a cyclic mapping in a parallel loop); notion that transfer of data has a fixed cost plus a bit rate (irrespective of transfer from memory or inter-processor)
Data distribution | K | | | Know what block, cyclic, and block-cyclic data distributions are, and what it means to distribute data across multiple threads/processes
Data layout | K | | | Know how to lay out data in memory to improve performance (memory hierarchy)
Data locality | K | | | Know what spatial and temporal locality are, and how to organize data to take advantage of them
False sharing | K | | | Know that for cache-coherent shared memory systems, data is kept coherent in blocks, not individual words, and how to avoid false sharing of a block's data across threads (a sketch follows this group)
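The false-sharing row above is easiest to see in code: per-thread counters packed next to each other share a cache line, while padding each counter to a full line removes the interference. A sketch assuming a 64-byte cache line (typical, but machine-dependent):

```c
/* False sharing sketch: adjacent per-thread counters live on one cache line,
 * so writes by different threads keep invalidating each other's cached copy;
 * padding each counter to a full (assumed 64-byte) line removes that traffic. */
#include <omp.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS 50000000L
#define LINE 64                          /* assumed cache-line size in bytes */

static long packed[NTHREADS];            /* adjacent counters: false sharing */
struct padded { long v; char pad[LINE - sizeof(long)]; };
static struct padded spaced[NTHREADS];   /* one counter per (assumed) line   */

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++) packed[id]++;    /* lines ping-pong */
    }
    double t1 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++) spaced[id].v++;  /* no interference */
    }
    double t2 = omp_get_wtime();

    printf("counts: %ld %ld\n", packed[0], spaced[0].v);
    printf("packed: %.2fs  padded: %.2fs\n", t1 - t0, t2 - t1);
    return 0;
}
```

On a typical multicore machine the padded version runs noticeably faster, even though both loops do the same arithmetic.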
Performance monitoring tools | K | 0.5 | DS/A; Systems | Know of tools for runtime monitoring (e.g., gprof, VTune)
Performance metrics | C | 1.0 | CS2; DS/A | Know the basic definitions of performance metrics (speedup, efficiency, work, cost) and Amdahl's law; know the notions of scalability (formulas follow this table)
● Speedup | C | | | Understand how to compute speedup, and what it means
● Efficiency | C | | | Understand how to compute efficiency, and why it matters
● Amdahl’s law | K | | | Know that speedup is limited by the sequential portion of a parallel program, if problem size is kept fixed
● Gustafson’s Law | K | | | Understand the idea of weak scaling, where problem size increases as the number of processes/threads increases
● Isoefficiency | N | | ParProg; Algo2 (C) | Understand the idea of how quickly to increase problem size with the number of processes/threads to keep efficiency the same
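The performance-metric rows above correspond to standard formulas; for reference (with p processors, inherently sequential fraction f, serial time T_1, and time T_p on p processors):

```latex
% p  = number of processors/threads
% T1 = execution time on one processor, Tp = execution time on p processors
% f  = inherently sequential fraction of the work
S(p) = \frac{T_1}{T_p}, \qquad
E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p}
\quad\text{(speedup, efficiency)}

S(p) \le \frac{1}{f + \frac{1-f}{p}} \;\longrightarrow\; \frac{1}{f}
\ \text{as } p \to \infty
\quad\text{(Amdahl's law, fixed problem size)}

S(p) = f + (1-f)\,p
\quad\text{(Gustafson's law, weak scaling; here } f \text{ is the serial fraction of the scaled parallel run)}

W = \frac{E}{1-E}\,T_o(W,p)
\quad\text{(isoefficiency: how fast problem size } W \text{ must grow with } p
\text{ to hold } E \text{ constant, where } T_o \text{ is the total overhead)}
```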