
NSF/IEEE-TCPP Curriculum Initiative on Parallel and Distributed Computing - Core Topics for Undergraduates

6. Programming Topics

6.1 Rationale

The material is organized into three subtopics: paradigms and notations, correctness, and performance/energy. We discuss these in separate sections below. A prerequisite for coverage of much of this material is some background in conventional programming. Even though we advocate introducing parallelism earlier in a student's programming experience, basic algorithmic problem-solving skills must still be developed, and we recognize that it may be easier to begin with sequential models. Covering parallel algorithms before this material would allow the focus here to be exclusively on the practical aspects of parallel and distributed programming, but the algorithms can also be covered at the same time, as necessary and appropriate. Parallel software development can be taught using many different languages and tools, including Java, C, C++, Python, OpenMP, CUDA, MPI, and many others.

Paradigms and Notations:  There are different approaches to parallel programming, and they can be classified in many different ways. Here we use two complementary classifications. First, we classify the paradigms by the target machine model. SIMD (single instruction, multiple data) is the paradigm in which the parallelism is confined to operations on (corresponding) elements of arrays; this linguistic paradigm is at the basis of the Intel Advanced Vector Extensions (AVX) and IBM Vector Scalar Extension (VSX) macros, some database operations, some operations in data structure libraries, and the language constructs used for vector machines. Shared memory is the paradigm of OpenMP and Intel's Threading Building Blocks, among other examples. Distributed memory is the paradigm underlying message passing and the MPI standard. A hybrid model is one in which any of the previous three paradigms co-exist in a single program. The logical target machine does not have to be identical to the physical machine: for example, a program written in the distributed-memory paradigm can be executed on a shared-memory machine, and programs written in the shared-memory paradigm can be executed on a distributed-memory machine with appropriate software support (e.g., Intel's Cluster OpenMP). More loosely coupled models are also addressed, including client/server, peer-to-peer, and big data models.

A second way to classify programming approaches is by the mechanisms that control parallelism. These are (mostly) orthogonal to the first classification. For example, programs in the SPMD (single program, multiple data) paradigm can follow the distributed-memory, shared-memory, or even the SIMD model from the first classification. The same is true of programs following the data-parallel model. The task-spawning model can work within a distributed- or shared-memory paradigm. The parallel loop form seems to be used mainly with the shared-memory paradigm, but some models/languages have merged the loop model with the distributed-memory paradigm (e.g., MapReduce). The recent trend toward accelerators (e.g., GPUs) is also included here. Students are expected to be familiar with several notations (not necessarily languages, since in many cases support comes from libraries such as MPI and BSPlib). Not all notations need to be covered, but at least one per main paradigm should be. An example collection that provides this coverage would be Java threads, AVX macros, OpenMP, and MPI.

Correctness and Semantics: This set of topics outlines the material needed to understand the behavior of parallel programs beyond the fact that there are activities that take place (or could take place) simultaneously. Material in this section covers three main topics, as well as the methods and tools needed to detect and troubleshoot defects. The main topics are tasking, synchronization, and memory models. In this context, tasking refers to the means of creating threads on multiple cores and assigning work to them, through either implicit or explicit assignment (e.g., OpenMP vs. POSIX threads). Synchronization introduces the concept of critical sections: code whose correct execution depends on ordering. Material in this part includes critical regions in code as well as producer-consumer models. The third topic covers memory models, explaining the relationship and tradeoffs between strict and relaxed memory models; since many programming languages have their own model, only the basic ideas are expected to be covered. Finally, the section deals with concurrency problems and the tools to detect them. Common defects such as deadlock, starvation, and race conditions are explained, while tools like Eraser or various tools from Intel's Parallel Toolkit can be covered as a means of detecting the errors.

Performance and Energy:  The final group of topics is about performance and energy issues: how to organize the computation and the data for the different classes of machines, and how to minimize energy usage. Topics in this section are divided into four categories: Computation, Data, Tools and Metrics, and Power/Energy Efficiency. The first two categories (Computation and Data) refer to parallel programming aspects that influence performance, either in the form of task decomposition or through accessing remote data. The third category introduces the notion of performance metrics and the ways in which they can be collected. In more detail, Computation includes task decomposition strategies and how tasks can be assigned to different threads/processes; it also covers the performance effects of load balancing and its contributing factors, such as scheduling, failures, and distribution delays. Data includes a variety of topics, from data representation and its effect on performance and energy, to data locality, and the performance tradeoffs of laying out data and accessing data remotely; this category introduces different storage and organization paradigms, as well as issues arising from distribution and replication, such as consistency and atomicity of operations. Tools and Metrics covers the ways in which performance is measured and evaluated, the laws governing performance in a parallel setting, and finally performance in relation to energy/power consumption.

6.2 Updates from version 1.0

We highlight the changes to the curriculum guidelines relative to version 1.0 of this document.  The main changes involve adding topics from new areas, including big data, distributed computing, and energy.  Many of these topics are suitable for upper-level classes in parallel programming, distributed systems, databases, and other subjects, but some have influenced the learning outcomes for core classes.  The energy topics can be addressed in the core Systems class.  One additional change is the addition of programming for accelerators, such as GPUs, which can be introduced in a CS2 or Systems class and addressed more deeply in an upper-level Parallel Programming class.

Table 3: Programming Topics

 

 Topics

Core Bloom Level

Learning Outcomes and Teaching Suggestions (core)

Advanced Bloom Level

Learning Outcome (advanced)

Where Covered

Parallel Programming Paradigms and Notations

 

 

 

 

 

 

By the target machine model

 

 

 

 

 

 

 

Concurrency and Parallelism

C

Understand that concurrency is an algorithmic property; it exposes the potential for parallelization. If concurrency is present in an algorithm, it can be parallelized; without concurrency there is no scope for parallelization. Concurrency can be present in a sequential program; parallelization takes advantage of concurrency to increase performance.

 

 

CS1; CS2; DS/A

 

 

SIMD

K

Understand common vector operations including element-by-element operations and reductions.

 

 

CS2; Systems

 

 

  • Processor vector extensions

K

Know examples - e.g., Intel AVX or Power VSX macros

C

Understand examples from Intel/Power vector instructions

Systems; Arch2
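
An illustrative sketch for the processor vector extensions entry above: a minimal C example using Intel AVX intrinsics, assuming an AVX-capable processor and a suitable compiler flag (e.g., -mavx with gcc); the function name and array size are arbitrary choices, not part of the guidelines.

    #include <immintrin.h>   /* Intel AVX intrinsics */
    #include <stdio.h>

    /* Element-by-element addition c[i] = a[i] + b[i], 8 floats per instruction. */
    void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i;
        for (i = 0; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(&a[i]);          /* load 8 unaligned floats */
            __m256 vb = _mm256_loadu_ps(&b[i]);
            _mm256_storeu_ps(&c[i], _mm256_add_ps(va, vb));
        }
        for (; i < n; i++)                               /* scalar cleanup for the tail */
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        float a[16], b[16], c[16];
        for (int i = 0; i < 16; i++) { a[i] = i; b[i] = 2.0f * i; }
        vec_add(a, b, c, 16);
        printf("c[15] = %.1f\n", c[15]);                 /* expect 45.0 */
        return 0;
    }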

 

 

  • Array language extensions

N

 

A

Know how to write parallel array code in some language (e.g., Fortran 95, Intel's C/C++ Array Extension [CEAN])

ParProg

 

 

Shared memory

A

Be able to write correct thread-based programs (protecting shared data) and understand how to obtain speed up.

 

 

CS2; DS/A
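
An illustrative sketch for the shared memory entry above, using POSIX threads: a shared sum protected by a mutex. The thread count, problem size, and function names are arbitrary; compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 1000000

    static long long total = 0;                          /* shared data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Each thread sums its own range, then adds the result to the shared
       total inside a mutex-protected critical section. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        long long local = 0;
        for (long i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
            local += i;
        pthread_mutex_lock(&lock);
        total += local;                                  /* protected update */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        printf("sum = %lld\n", total);                   /* 0 + 1 + ... + (N-1) */
        return 0;
    }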

 

 

  • Language parts or extensions

K

Know about language extensions for parallel programming. Illustrate with examples from Cilk (spawn/join), Java (Java threads), or other languages.

 

 

 

 

 

  • Compiler directives/pragmas

C

Understand what simple directives, such as those of OpenMP, mean (parallel for, concurrent sections); show examples

 

 

 

 

 

  • Libraries

C

Know one in detail, and know of the existence of some other example libraries such as Pthreads, Pfunc, Intel's TBB (Threading Building Blocks), Microsoft's TPL (Task Parallel Library), C++ threads, etc.

 

 

 

 

 

Distributed memory

K

Know basic notions of messaging among processes, different ways of message passing, collective operations

 

 

Systems; DS/A

 

 

  • Message passing

N

 

C

Know about the overall organization of a message passing program as well as point-to-point and collective communication primitives (e.g., MPI)

ParProg
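
An illustrative sketch for the message passing entry above, assuming an MPI installation (compile with mpicc, launch with mpirun): each process computes a partial sum and a collective reduction combines the results; a point-to-point version would use MPI_Send/MPI_Recv instead. The problem size is arbitrary.

    #include <mpi.h>
    #include <stdio.h>

    /* Each process computes a partial sum over a cyclic split of 0..999,
       then a collective reduction combines the partial sums on rank 0. */
    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long local = 0, global = 0;
        for (int i = rank; i < 1000; i += size)
            local += i;

        MPI_Reduce(&local, &global, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %ld\n", global);               /* expect 499500 */

        MPI_Finalize();
        return 0;
    }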

 

 

  • PGAS languages

N

 

C

Know about partitioned address spaces, other parallel constructs (e.g., UPC, CoArray Fortran, Chapel)

ParProg

 

 

Client Server and Peer-to-Peer models

C

Know notions of invoking and providing services (e.g., RPC, RMI, web services) - understand these as concurrent processes; know about network model of distributed computing (e.g., sockets); know that in distributed systems such handshaking interaction is crucial for the efficient communication between asynchronous processes

A

Be able to program a basic client/server and/or P2P interface

Systems; Networking

 

 

Big Data Technology Stack

N

 

A

Understand the Big Data technology stack and its layered architecture. Be able to write code (e.g., in Python, R) using some of the tools that facilitate data storage, organization, management, and/or analysis within a Big Data stack

ParProg; DistSystems

 

 

Hybrid

K

Know the notion of programming over multiple classes of machines simultaneously (CPU, GPU, TPU, etc.)

A

Be able to write correct programs using two programming paradigms, e.g., shared memory and GPU (OpenMP+CUDA), distributed and shared memory (MPI+OpenMP), or distributed memory and GPU (MPI+CUDA)

Systems; ParProg
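
An illustrative sketch for the hybrid entry above, combining MPI across processes with OpenMP threads within each process; it assumes an MPI library providing MPI_THREAD_FUNNELED and an OpenMP compiler, and for simplicity assumes the problem size divides evenly among the ranks.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    /* Hybrid sketch: MPI ranks split the iteration space, and each rank
       uses OpenMP threads within its share. */
    int main(int argc, char **argv)
    {
        int provided, rank, size;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int N = 1000000;
        int chunk = N / size;                    /* assumes size divides N */
        long long local = 0, global = 0;

        #pragma omp parallel for reduction(+:local)
        for (int i = rank * chunk; i < (rank + 1) * chunk; i++)
            local += i;

        MPI_Reduce(&local, &global, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %lld (ranks=%d, threads/rank up to %d)\n",
                   global, size, omp_get_max_threads());
        MPI_Finalize();
        return 0;
    }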

 

By the control statement

 

 

 

 

 

 

 

Task/thread spawning

A

Be able to write correct programs with threads; synchronize them (fork-join, producer/consumer, master/worker, etc.); use dynamic thread creation (varying the number of threads, possibly recursively) (e.g., Pthreads, Java threads, etc.). Builds on the shared memory topic above.

 

 

CS2; DS/A

 

 

Event-Driven Execution

K

Know about the need for event-driven execution and possible approaches to implementing it. Know about the notion of causality among events (e.g., remote file access, GUI). These effects may be easier to discuss in the context of distributed systems.

A

 

CS2; DistSystems

 

 

SPMD

C

Understand how an SPMD program is written and how it executes

 

Be able to write an SPMD program and understand how it executes

 

 

 

  • SPMD notations

C

Know the existence of highly threaded data parallel notations (e.g., CUDA, OpenCL), message passing (e.g., MPI), and some others (e.g., Global Arrays, BSP library)

A

 

CS2; DS/A; ParProg

 

 

Data parallel

A

Be able to write a correct data-parallel program for shared-memory machines and get speedup; students should do an exercise.

 

Understand the relation between different data-parallel notations: array notations, SPMD, and parallel loops. Builds on the shared memory topic above.

 

 

 

  • Parallel loops for shared memory

A

Know, through an example, one way to implement parallel loops, understand collision/dependencies across iterations (e.g., OpenMP, Intel's TBB)

A

 

CS2; DS/A; Lang
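
An illustrative sketch for the parallel loops entry above, using an OpenMP parallel for; the commented-out loop shows a loop-carried dependence that would make such parallelization incorrect. The array sizes are arbitrary; compile with -fopenmp.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Safe: each iteration writes only c[i], so there are no
           dependences (collisions) across iterations. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        /* NOT safe to parallelize this way: iteration i reads a value
           written by iteration i-1 (a loop-carried dependence).
           for (int i = 1; i < N; i++) a[i] = a[i-1] + 1.0;           */

        printf("c[N-1] = %.1f using up to %d threads\n",
               c[N - 1], omp_get_max_threads());
        return 0;
    }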

 

 

  • Data parallel for distributed memory

N

 

K

Know data parallel notations for distributed memory (e.g., UPC, Chapel, Co-Array Fortran)

ParProg

 

 

  • MapReduce

K

Understand how problems can be solved by MapReduce, and how algorithms can be written using map and reduce

A

Solve problems using MapReduce, and write algorithms using map and reduce

CS2; Lang; ParProg
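
An illustrative sketch for the MapReduce entry above, showing only the map/reduce structure in plain C (not the API of any particular framework such as Hadoop or Spark); the function names and data are arbitrary.

    #include <stdio.h>

    /* A tiny illustration of the map/reduce structure: "map" transforms
       each input element independently, and "reduce" combines the mapped
       values with an associative operation. Both phases parallelize
       naturally; real frameworks distribute them across many machines. */

    static int map_square(int x)          { return x * x; }
    static int reduce_sum(int acc, int v) { return acc + v; }

    int main(void)
    {
        int input[] = {1, 2, 3, 4, 5};
        int n = 5, mapped[5];

        for (int i = 0; i < n; i++)           /* map phase */
            mapped[i] = map_square(input[i]);

        int result = 0;
        for (int i = 0; i < n; i++)           /* reduce phase */
            result = reduce_sum(result, mapped[i]);

        printf("sum of squares = %d\n", result);   /* expect 55 */
        return 0;
    }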

 

 

Offloading to accelerators

K

Know about running parts of applications on accelerators (e.g., GPU, TPU, FPGA)

 

 

CS2; Systems

 

 

  • Accelerator notations

N

 

A

Be able to write an accelerator program that takes advantage of the hardware (e.g., CUDA, OpenACC, OpenMP 4.5 or above, OpenCL, TensorFlow)

ParProg
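
An illustrative sketch for the accelerator notations entry above, using the OpenMP 4.5+ target construct (one possible notation among those listed); it assumes a compiler built with offload support and simply runs on the host otherwise. A CUDA or OpenCL version would express the same offloaded loop as an explicit kernel.

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static float x[N], y[N];
        float a = 2.0f;
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* Offload the loop to an accelerator if one is available.
           The map() clauses describe data movement between host and
           device memory. */
        #pragma omp target teams distribute parallel for \
                map(to: x[0:N]) map(tofrom: y[0:N])
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];           /* SAXPY */

        printf("y[0] = %.1f\n", y[0]);        /* expect 4.0 */
        return 0;
    }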

 

 

Functional/logic languages

N

 

K

Understand the advantages and disadvantages of very different programming styles (e.g., Parallel Haskell, Parlog, Erlang)

ParProg

Semantics and correctness issues

 

 

 

 

 

 

Tasks and threads

A

Be able to write parallel programs that create and assign work to threads/processes, in at least one parallel environment (e.g., OpenMP, Intel TBB, Pthreads, etc.)

A

 

CS2; DS/A; Systems; Lang

 

Synchronization

A

Be able to write shared memory programs with critical regions and producer-consumer communication, and get speedup; know the mechanisms for concurrency (monitors, semaphores, etc.)

 

 

CS2; DS/A; Systems
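
An illustrative sketch for the synchronization entry above: a bounded-buffer producer-consumer using a POSIX mutex and condition variables. The buffer size and item count are arbitrary; compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>

    #define BUF_SIZE 8
    #define ITEMS    32

    static int buf[BUF_SIZE];
    static int count = 0, in = 0, out = 0;          /* shared buffer state */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ITEMS; i++) {
            pthread_mutex_lock(&m);
            while (count == BUF_SIZE)               /* wait for space */
                pthread_cond_wait(&not_full, &m);
            buf[in] = i; in = (in + 1) % BUF_SIZE; count++;
            pthread_cond_signal(&not_empty);
            pthread_mutex_unlock(&m);
        }
        return NULL;
    }

    static void *consumer(void *arg)
    {
        long sum = 0;
        (void)arg;
        for (int i = 0; i < ITEMS; i++) {
            pthread_mutex_lock(&m);
            while (count == 0)                      /* wait for data */
                pthread_cond_wait(&not_empty, &m);
            sum += buf[out]; out = (out + 1) % BUF_SIZE; count--;
            pthread_cond_signal(&not_full);
            pthread_mutex_unlock(&m);
        }
        printf("consumed sum = %ld\n", sum);        /* 0+1+...+31 = 496 */
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }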

 

 

Critical regions

A

Be able to write shared memory programs that use critical regions for synchronization

 

 

 

 

 

Producer-consumer

A

Be able to write shared memory programs that use the producer-consumer pattern to share data and synchronize threads

 

 

 

 

 

Handshaking

N

 

A

Analyze handshaking protocols and derive performance bounds (e.g., response time in sliding window, TCP connection management)

Networking; DistSystems

 

Concurrency issues

C

Understand the notions of deadlock (detection, prevention), race conditions (definition), and determinacy/non-determinacy in parallel programs (e.g., if there is a race condition, the correctness of the output may depend on the order of execution)

 

 

DS/A; Systems
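
An illustrative sketch for the concurrency issues entry above: two threads that each need two locks. Acquiring the locks in a single global order, as shown, prevents deadlock; the comment describes how reversing the order in one thread could deadlock. Names and counts are arbitrary; compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;
    static int shared = 0;

    /* Both threads acquire the locks in the same global order (A then B),
       which prevents deadlock. If one thread took A then B while the other
       took B then A, each could end up holding one lock and waiting forever
       for the other: a classic deadlock. */
    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&A);
            pthread_mutex_lock(&B);
            shared++;                       /* work that needs both locks */
            pthread_mutex_unlock(&B);
            pthread_mutex_unlock(&A);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared = %d\n", shared);    /* expect 200000 */
        return 0;
    }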

 

 

Deadlock/Livelock

C

Understand what deadlock and livelock are, and methods for detecting and preventing them; also cast in terms of distributed systems

 

 

 

 

 

Starvation

C

Understand how starvation (of a thread or process) can occur, in context of an example (e.g., dining philosophers)

 

 

 

 

 

Race condition

C

Know what a race condition is, and how to use synchronization to prevent it

 

 

 

 

 

Distributed Data Structures and Applications

N

 

C

Understand synchronization in the context of data structures; correctness in a concurrent context

ParProg; DistSystems

 

 

Tools to detect concurrency defects

N

 

K

Know the existence of tools to detect race conditions (e.g., Eraser) and debugging (e.g., Intel Parallel Toolkit)

ParProg

 

Memory models

N

 

C

Know what a memory model is, and the implications of the difference between strict and relaxed models (performance vs. ease of use)

ParProg

 

 

Sequential consistency

N

 

 

Understand semantics of sequential consistency for shared memory programs

 

 

 

Relaxed consistency

N

 

 

Understand semantics of one relaxed consistency model (e.g., release consistency) for shared memory programs

 

 

 

Consistency in distributed transactions

N

 

C

Recognize consistency problems. Know that consistency is an issue in transactions issued concurrently by multiple agents. Implement transaction commit protocols in databases.

DB; DistSystems

Performance and Energy issues

 

 

 

 

 

 

Computation

C

Understand the basic notions of static and dynamic scheduling, mapping, and the impact of load balancing on performance

 

 

 

 

 

Computation decomposition strategies

C

Understand different ways to assign computations to threads or processes

 

 

CS2; DS/A

 

 

  • Owner computes rule

C

Understand how to assign loop iterations to threads based on which thread/process owns the data element(s) written in an iteration

 

 

 

 

 

  • Decomposition into atomic tasks

C

Understand how to decompose computations into tasks with communication only at the beginning and end of each task, and assign them to threads/processes

 

 

 

 

 

  • Decomposition into Map and Reduce tasks

K

Know that divide and conquer can be expressed and programmed as a Map and Reduce decomposition

A

Understand how to decompose computation using the Map Reduce paradigm. Be able to reason about computation speed ups from this decomposition and the communication tradeoffs resulting from reduction(s). Be able to produce code expressing this decomposition

CS2; ParProg

 

 

  • Work stealing

N

 

C

Understand one way to do dynamic assignment of computations

ParProg

 

 

  • Offloading onto an accelerator

N

 

C

Understand when it is worthwhile to decompose onto an accelerator (e.g., GPU, TPU, FPGA)

ParProg

 

 

Program transformations

N

 

C

Be able to perform simple loop transformations by hand, and understand how that impacts performance of the resulting code (e.g., loop fusion, fission, skewing, blocking)

Compilers; ParProg
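
An illustrative sketch for the program transformations entry above: a blocked (tiled) matrix transpose in C. The matrix and tile sizes are arbitrary, and in practice the tile size would be tuned to the cache.

    #include <stdio.h>

    #define N 1024
    #define B 32          /* tile (block) size; assumes B divides N */

    static double a[N][N], t[N][N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = i * N + j;

        /* Blocked (tiled) transpose: working on B x B tiles keeps the rows
           of 'a' and of 't' touched within a tile resident in cache, unlike
           the naive two-loop transpose. */
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        t[j][i] = a[i][j];

        printf("t[3][7] = %.0f\n", t[3][7]);   /* a[7][3] = 7*N + 3 = 7171 */
        return 0;
    }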

 

 

Load balancing

C

Understand the effects of load imbalances on performance, and ways to balance load across threads or processes

 

 

DS/A; Systems

 

 

Scheduling and mapping

C

Understand the importance of a programmer, compiler and/or runtime system mapping and scheduling computations to threads/processes, both statically and dynamically

A

Be able to apply a static and a dynamic strategy for mapping the processes of large-scale parallel programs onto processors in a way that optimizes performance by reducing communication delays

DS/A; Systems; ParProg

 

 

Effect of timing failures/delay in distributed systems

K

Understand that a failure in one node can cause a global failure in a distributed system. For example, one could use waiting on a non-terminating program to illustrate a failure scenario in distributed systems (e.g., in the context of consensus).

 

 

CS2

 

 

  • Protocol timeout protections

N

 

A

Understand the use of timeouts in situations with a high probability of error

Networking

 

Data

C

Understand the impact of data distribution, layout, and locality on performance; the notion that transfer of data has a fixed cost plus a bit-rate (bandwidth) cost, irrespective of whether the transfer is from memory or between processors; know false sharing and its impact on performance (e.g., in a cyclic mapping in a parallel loop), in ParProg

 

 

DS/A; Lang; ParProg

 

 

Data distribution

N

 

C

Understand what block, cyclic, and block-cyclic data distributions are, and what it means to distribute data across multiple threads/processes

ParProg
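
An illustrative sketch for the data distribution entry above: small C helper functions (hypothetical names) that compute which process owns a global index under block and cyclic distributions; a block-cyclic distribution combines the two, e.g., owner = (i / block_size) % p.

    #include <stdio.h>

    /* Owner of global index i under a block distribution:
       indices are divided into p contiguous chunks of size ceil(n/p). */
    static int block_owner(int i, int n, int p)
    {
        int chunk = (n + p - 1) / p;
        return i / chunk;
    }

    /* Owner of global index i under a cyclic distribution:
       indices are dealt out round-robin, like cards. */
    static int cyclic_owner(int i, int p)
    {
        return i % p;
    }

    int main(void)
    {
        int n = 16, p = 4;
        printf(" i  block  cyclic\n");
        for (int i = 0; i < n; i++)
            printf("%2d  %5d  %6d\n", i, block_owner(i, n, p), cyclic_owner(i, p));
        return 0;
    }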

 

 

Data layout

C

Know how to lay out data in memory to improve performance and energy use (memory hierarchy in a shared memory parallel system)

 

 

DS/A; Systems

 

 

  • False sharing

K

Know that in cache-coherent shared memory systems data is kept coherent in blocks (cache lines), not individual words, and how to avoid false sharing of a block's data across threads

C

Be aware of false sharing, able to give examples where it occurs, and understand why it happens

Systems; ParProg
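
An illustrative sketch for the false sharing entry above, using POSIX threads: per-thread counters padded out to an assumed 64-byte cache line so that threads do not invalidate each other's lines; removing the padding (so all counters share one line) demonstrates the slowdown. Thread and iteration counts are arbitrary; compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define ITERS    10000000

    /* Without padding, the four counters would share one cache line, so
       every increment by one thread invalidates the line in the other
       cores' caches (false sharing). Padding gives each counter its own
       line. */
    struct padded_counter {
        long value;
        char pad[64 - sizeof(long)];      /* assume 64-byte cache lines */
    };

    static struct padded_counter counters[NTHREADS];

    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (long i = 0; i < ITERS; i++)
            counters[id].value++;         /* each thread touches its own line */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        printf("each counter = %ld\n", counters[0].value);
        return 0;
    }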

 

 

  • Energy impact

K

Know the energy cost of loading data into memory from secondary storage (and writing out modified data to secondary storage)

 

 

Systems

 

 

Data Representation

 

 

 

 

 

 

 

  • Floating point and integer precision (64-bit, 32-bit, and 16-bit or less)

K

Know the power savings from using smaller data representations (64-bit vs. 32-bit vs. 16-bit floating-point precision, 32-bit vs. 16-bit integers). For example, machine learning on GPUs is driving adoption of lower (16-bit) floating-point precision.

 

 

Systems

 

 

Data locality

K

Know what spatial and temporal locality are, and how to organize data to take advantage of them

 

 

DS/A; Systems
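
An illustrative sketch for the data locality entry above: the same 2-D initialization written with a cache-friendly row-major traversal and with a column-order traversal that has poor spatial locality. The matrix size is arbitrary and the timings will vary by machine.

    #include <stdio.h>
    #include <time.h>

    #define N 2048

    static double a[N][N];

    /* C stores 2-D arrays in row-major order, so iterating j in the inner
       loop walks memory sequentially (good spatial locality); swapping the
       loops strides by N*sizeof(double) per access and misses cache far
       more often. */
    int main(void)
    {
        clock_t t0 = clock();
        for (int i = 0; i < N; i++)        /* row-major order: cache friendly */
            for (int j = 0; j < N; j++)
                a[i][j] = i + j;
        double good = (double)(clock() - t0) / CLOCKS_PER_SEC;

        t0 = clock();
        for (int j = 0; j < N; j++)        /* column order: poor locality */
            for (int i = 0; i < N; i++)
                a[i][j] = i + j;
        double bad = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("row-major %.3fs, column-order %.3fs (a[N-1][N-1]=%.0f)\n",
               good, bad, a[N - 1][N - 1]);
        return 0;
    }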

 

 

  • Performance impact of data movement

K

Know the performance cost of moving data to secondary storage for big data analysis, distributed vs centralized analysis. Be aware of specific mechanisms that take advantage of locality (e.g., in-situ processing)

C

Understand performance costs of data locality with respect to various metrics, and be able to contrast mechanisms like in-situ vs. in-transit vs. offline processing

Systems; ParProg; DistSystems

 

 

  • Structured vs unstructured data

K

Know the differences and tradeoffs between these data representations

A

Be able to build solutions for the different types of data

DS/A; Databases

 

 

  • Graph representations and databases

K

Know of graph representations of data to facilitate graph analysis.

C

Understand performance gains due to graph-specific mechanisms versus other more general data representations

DS/A; Databases

 

 

Data handling and manipulation

N

 

C

Understand the differences and performance impact of data formatting and storage mechanisms. Be able to implement adequate solutions for the different types of storage

DistSystems; Databases

 

 

  • Distributed databases

N

 

C

Comprehend the principles behind distributed databases and the motivation of tradeoffs to support scalability

DistSystems; Databases

 

 

  • NoSQL databases

N

 

A

Comprehend how NoSQL databases enable scalable data manipulation; include exemplars so that students become familiar with some of them (e.g., MongoDB, Hive)

Databases

 

 

  • Eventual consistency vs ACID

N

 

C

Comprehend how replication enables scalability in databases, but at the cost of transactions losing their ACID properties

Databases

 

 

Distributed file systems

K

Be aware of existence of distributed file systems and common examples of where they are used and why

K

Comprehend the basic principles of how distributed file systems work, their scalability benefits, and performance and reliability problems

Systems; Operating Systems; DistSystems; ParProg

 

 

  • Replicated file systems

N

 

C

Know of distributed file systems such as HDFS and their replication and fault tolerance mechanisms. In ParProg, be able to use a DFS.

DistSystems; ParProg

 

 

  • Key-value storage systems

N

 

K

Understand the concept of key-value storage. In ParProg, be able to use an appropriate API.

DistSystems; ParProg

 

Tools and metrics

 

 

 

 

 

 

 

Performance monitoring tools

K

Know of tools for runtime monitoring (e.g., gprof, perf, Intel Performance Toolkit, TAU)

 

 

DS/A; Systems

 

 

Performance metrics

C

Know the basic definitions of performance metrics (speedup, efficiency, work, cost), Amdahl's law; know the notion of scalability

 

 

CS2; DS/A
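
An illustrative sketch for the performance metrics entry above: computing speedup, efficiency, and the Amdahl bound in C from sample numbers (the timings and sequential fraction are made up for the illustration).

    #include <stdio.h>

    /* Speedup S(p) = T(1) / T(p); efficiency E(p) = S(p) / p.
       Amdahl's law bounds speedup when a fraction f of the work is
       inherently sequential:  S(p) <= 1 / (f + (1 - f) / p). */
    int main(void)
    {
        double t1 = 100.0;                 /* measured sequential time (s)    */
        double tp = 32.5;                  /* measured time on p threads (s)  */
        int    p  = 4;
        double f  = 0.1;                   /* assumed sequential fraction     */

        double speedup    = t1 / tp;
        double efficiency = speedup / p;
        double amdahl_max = 1.0 / (f + (1.0 - f) / p);

        printf("speedup    = %.2f\n", speedup);      /* 3.08 */
        printf("efficiency = %.2f\n", efficiency);   /* 0.77 */
        printf("Amdahl bound for f=%.2f, p=%d: %.2f\n", f, p, amdahl_max);
        return 0;
    }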

 

 

  • Speedup

C

Understand how to compute speedup, and what it means

 

 

 

 

 

  • Efficiency

C

Understand how to compute efficiency, and why it matters

 

 

 

 

 

  • Parallel Scalability

C

Understand that speedup and efficiency are single-point measures for a particular problem size and number of processes/threads; these metrics change as the problem size and/or the number of processes/threads vary. Understand that scalability is a metric that measures how speedup varies as the problem size and/or the number of processes/threads vary.

 

 

 

 

 

  • Amdahl’s law

C

Understand that speedup is limited by the sequential portion of a parallel program, if problem size is kept fixed

 

 

 

 

 

  • Gustafson’s Law

K

Know the idea of weak scaling, where problem size increases as the number of processes/threads increases

 

 

 

 

Power/Energy efficiency

 

 

 

 

 

 

 

Power-latency tradeoff

K

Be familiar with the notion that problem decompositions (including their granularity) and active/idle states (e.g., modulation of CPU frequency) may be exploited to adjust the balance among throughput, latency, and energy consumption.

 

 

Systems

 

 

Energy efficiency vs. load balancing

K

Be aware that an unbalanced work decomposition and communication congestion can prolong computation and reduce energy efficiency.

 

 

Systems

 

 

Active power management methods

N

 

A

Be aware that systems expose various execution parameters (e.g., P-states on Intel processors), and have examined the effect of modulating at least one of them to optimize performance or energy consumption.

ParProg; Arch2

 

 

Idle power management methods

N

 

A

Be aware that architectures and operating systems provide various interfaces that enable computational units to be passivated (e.g., C-states), and have examined the tradeoffs (e.g., reduced subsystem power consumption vs. increased latency) involved in exploiting at least one of them.

ParProg; Arch2

 

 

Power consumption of parallel programs

N

 

K

Be aware that optimal energy efficiency may not be achieved through aggressive reduction of CPU clock frequencies and exploitation of sleep modes, due to increased execution time and the static components of system power consumption.

ParProg

Security

 

 

 

 

 

 

 

Security protocols

N

 

A

Understand IPsec basics

Networking