
NSF/IEEE-TCPP Curriculum Initiative on Parallel and Distributed Computing - Core Topics for Undergraduates

7. Algorithms Topics

7.1 Rationale

As in other areas, the Algorithms topics fall into two groups: topics covered at various levels in core courses, and topics that are important but may be covered in non-core courses. The topics designated for core courses may be introduced alongside their traditional sequential counterparts without adding a significant amount of instruction time.

Many of the topics build on sequential algorithm ideas that are already covered in a typical curriculum; the list of topics, in many cases, highlights the opportunity to use this sequential coverage to introduce parallel and distributed algorithmic concepts. These topics are organized under three broad sections. The Parallel/Distributed Models and Complexity section contains foundational topics aimed at equipping students with the basic computational considerations and skills needed for designing and analyzing parallel algorithms. The Algorithmic Techniques section covers recurrent themes or constructs that are generally useful as building blocks in designing a wide variety of parallel algorithms. In the Algorithmic Problems section, we include some basic problems for which we feel that learning parallel algorithms would be valuable for almost all CS/CE students.

Parallel/Distributed Models and Complexity:  The goal of this section is to introduce the foundational aspects of parallel and distributed algorithms, including computational and communication models, and performance metrics; we build on this foundation to develop other topics. We walk through these considerations, placing them in the context of parallel and distributed algorithms.

One of the most fundamental differences between sequential and parallel/distributed algorithms is along the time dimension. As a performance metric, time behaves very differently in PDC than in sequential algorithms because it can be shared by concurrent processes. In contrast, energy, area/hardware, and memory (to within reuse) behave similarly in sequential and parallel or distributed contexts. A basic manifestation of time in parallel and distributed algorithms is in the form of asynchrony, with associated issues such as races and hazards, data consistency, and event-driven execution. The efficient use of resources to harness the time benefits of parallelism brings up measures such as speedup and scalability, and approaches including scheduling and load balancing.

Another important difference between the sequential and parallel/distributed context is the communication structure between processing elements. Algorithmic strategies must account for the advantages and limitations of a given communication fabric. For this, we suggest the introduction of a relatively simple parallel and/or distributed computational model, a “model of choice,” on which key algorithmic concepts can be developed without getting into implementation details. A PRAM-like model is useful to bring out the raw parallelism in an approach, whereas a completely-connected network model may address aspects of communication without topological constraints. Another possible model is one of distributed Big Data, where the impracticality of moving data plays a prominent part in algorithm design.

The topics identified in this section emphasize foundational aspects of parallel and distributed algorithms and act as stepping stones to the topics in the Algorithmic Techniques and Algorithmic Problems sections. They have also been selected to not depend overly on technological particulars, but rather to emphasize core ideas with greater longevity. Our choices reflect a combination of “persistent truths” and “conceptual flexibility.” Our suggestions strive to impart both an understanding of how one reasons rigorously about the expenditure of computational resources (time, memory, energy), and an appreciation of fundamental computational limitations that transcend the details of particular platforms.

Algorithmic Techniques:  This section acknowledges the folk wisdom that contrasts giving a fish and teaching how to fish. Algorithmic techniques lie in the latter camp. We have attempted to enumerate a variety of techniques, patterns, or kernels whose algorithmic utility has been demonstrated over the course of decades. Many of these can be viewed as algorithmic analogues of high-level library functions or methods. The parallel kernels or patterns in our list can be viewed as control structures that would be frequently encountered while developing parallel algorithms for a diverse set of computational tasks or problems, and are readily available on a broad range of parallel and distributed computing platforms. 

Included are some commonly-used global computations such as broadcast and reduction, which may appear trivial in a serial environment, but must be recognized as special communication kernels in a parallel environment. Due to their global nature, their optimal implementation could be a critical determinant of the complexity/performance of the parallel algorithm employing them. Other examples include parallel prefix (scan), gather, and scatter. 

This section also includes task and data decomposition paradigms such as divide-and-conquer, recursion, blocking and striping, and load balancing. Modern approaches such as MapReduce are introduced as problem-solving techniques in distributed systems. Additionally, this section touches on the pervasive topics of synchronization, mutual exclusion, and conflict resolution.

We believe that students with the appropriate level of knowledge of the building blocks covered in this section would be reasonably equipped to design and analyze basic parallel algorithms, some of which are included in the next section.

Algorithmic Problems: This section includes problems for which we recommend designing, analyzing, and in some cases, implementing parallel algorithms. It highlights important problems that play such key roles in a variety of computational situations that we view them as essential to an education about parallel and distributed computing. These problems can serve as examples with which to illustrate the algorithmic techniques from the previous section. Some of the problems are implementations of techniques mentioned in the previous section. For example, parallel prefix (scan) appears in both the Algorithmic Techniques and Algorithmic Problems sections. The former recommendation is to teach scan as a primitive for use in solving other problems, and the latter is to teach how the scan itself can be implemented. In many cases, these may be taught together, but having separate recommendations highlights the two aspects and allows courses or curricula to treat them separately.

Also included in this section is a list of algorithmic problems from which we recommend choosing one or two for a relatively in-depth design, analysis, and/or implementation. The problems can be chosen based on students’ and instructor’s interests or fit with the existing algorithms curriculum. We believe that every modern CS/CE undergraduate student should have an in-depth experience with the design and analysis of at least one parallel or distributed algorithm. The problem or the algorithm chosen for deeper investigation may also be used to explore one or more of the earlier topics. For example, matrix multiplication could be used to study the impact of different decomposition schemes on speedup and efficiency, or alternatively, search could be used to explore dynamic load balancing.
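As one illustration of such an exploration, the following minimal Python sketch (the matrix size, the row-block decomposition, and the use of a multiprocessing Pool are assumptions chosen for simplicity, not prescriptions) times a row-blocked matrix product for several process counts so that students can compute speedup and efficiency directly.

```python
# A minimal sketch: row-block decomposition of C = A x B across p worker
# processes, to observe how speedup and efficiency vary with p.
from multiprocessing import Pool
import random, time

N = 200
random.seed(0)                      # same matrices in every process
A = [[random.random() for _ in range(N)] for _ in range(N)]
B = [[random.random() for _ in range(N)] for _ in range(N)]

def row_block(bounds):
    lo, hi = bounds                 # each worker owns a contiguous block of output rows
    return [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
            for i in range(lo, hi)]

def multiply(p):
    bounds = [(i * N // p, (i + 1) * N // p) for i in range(p)]
    with Pool(p) as pool:
        blocks = pool.map(row_block, bounds)
    return [row for block in blocks for row in block]

if __name__ == "__main__":
    baseline = None
    for p in (1, 2, 4):
        t0 = time.time()
        multiply(p)
        elapsed = time.time() - t0
        baseline = baseline if baseline is not None else elapsed
        speedup = baseline / elapsed
        print(f"p={p}  time={elapsed:.2f}s  speedup={speedup:.2f}  efficiency={speedup / p:.2f}")
```

Replacing the row-block bounds with a different decomposition (for example, 2-D blocks) lets students compare schemes along the lines suggested above.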

7.2 Updates from version 1.0

The number of core Algorithms topics recommended has been reduced and the recommended level of coverage has been lowered for many of the retained topics, thus reducing the total additional effort required to infuse algorithmic PDC concepts into the core CS/CE curriculum. However, there are two noteworthy additions. The first is the recommendation around a model of choice (MoC); the MoC admits development of key algorithmic concepts while abstracting away from implementation details. Algorithmic analysis requires agreement on the allowed operations and their costs, so it is important to define a concrete model. That said, we recognize that different models are appropriate for different kinds of algorithms and thus leave the choice of which to use up to the instructor. The second is the recommendation of a capstone deeper algorithmic experience, as described in Section 7.3 below.

Other changes to the recommendations in the Algorithms area relative to version 1.0 of this document include: (1) updating topics to reflect current issues and trends (e.g., energy as a cost metric), (2) streamlining the table and reducing the total number of topics, (3) reorganizing and renaming the Algorithmic Techniques section, (4) updating most learning outcomes, and (5) adding learning outcomes and courses.
 

Table 4: Algorithms Topics

 

Topics

Core Bloom Level

Learning Outcomes and Teaching Suggestions (core)

Advanced Bloom Level

Learning Outcomes (advanced)

Where Covered

Parallel and Distributed Models and Complexity

Be exposed to the foundational aspects of parallel and distributed algorithms, including computational and communication models, and performance metrics

 

 

 

 

Concurrency, Asynchrony, Dependencies, and Nondeterminism

C

Qualitatively understand the notions of concurrency, asynchrony, dependencies, and nondeterminism through one or more everyday examples illustrating simultaneous events with dependencies. The examples can be non-computational in CS1; for example, preheating the oven and preparing the batter for a cake can proceed concurrently and asynchronously to save time, but both of these events must finish before baking starts (dependency). Computational examples of the appropriate level can be used in CS2 (a minimal coded rendering of the cake example follows this row). The goal for the instructor is to develop a baseline of ideas upon which to build the PDC concepts.

 

 

CS1; CS2
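A minimal computational rendering of the cake example, assuming a Python ThreadPoolExecutor and sleep calls as stand-ins for the two tasks, might look as follows; the elapsed time reflects the longer task rather than the sum, while waiting on both results encodes the dependency.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def preheat_oven():
    time.sleep(2)              # stand-in for the preheating task
    return "oven hot"

def prepare_batter():
    time.sleep(3)              # stand-in for the batter task
    return "batter ready"

if __name__ == "__main__":
    start = time.time()
    with ThreadPoolExecutor(max_workers=2) as pool:
        oven = pool.submit(preheat_oven)       # runs concurrently ...
        batter = pool.submit(prepare_batter)   # ... and asynchronously
        # Dependency: baking may start only after BOTH tasks complete.
        print(oven.result(), "/", batter.result())
    print(f"ready to bake after ~{time.time() - start:.0f}s, not 5s")
```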

 

Costs of computation

 

Be exposed to the broad concepts of parallel time and space complexity

 

 

 

 

 

Asymptotics

C

Be able to describe upper (big-O) and lower (big-Omega) bounds in the context of PDC, understanding that the functions whose order is being analyzed may have an additional variable related to the number of processing elements in the PDC context. For example, the serial run time for simple matrix multiplication is Theta(n^3), and the parallel run time may be Theta(n^3/p), where the number of processing elements p is the additional variable.

 

 

DS/A; CS2

 

 

Time (number of operations) complexity

C

Recognize time as a fundamental computational resource that can be influenced by parallelism

 

 

DS/A

 

 

Work

C

Understand the definition of computational work and how it is different from time. Be able to observe its impact on complexity measures such as time, speedup, energy consumption, etc.

 

 

DS/A

 

 

Space/Memory

C

Recognize space/memory in the same manner as time.

 

 

DS/A

 

 

Memory and Communication complexity

C

Understand that data movement (such as memory accesses or communication) may take more wall-time and energy than computations. Understand also that in certain distributed "big data" applications, moving data is cost prohibitive.

 

 

DS/A

 

Performance Metrics

 

Be exposed to a variety of computational costs that are affected by PDC.

 

 

 

 

 

Speedup

C

Recognize the use of parallelism either to solve a given problem instance faster or to solve a larger instance in the same time. Understand and be able to explain why there is an upper limit on speedup for a given problem of a given size (Amdahl's law); a minimal numerical sketch follows this row.

 

 

DS/A
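A minimal numerical sketch of this upper limit, assuming a fixed serial fraction f, might look as follows; the 5% figure is purely illustrative.

```python
# A minimal sketch: Amdahl's law as a back-of-the-envelope calculation.
def amdahl_speedup(serial_fraction, p):
    """Speedup on p processors when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

if __name__ == "__main__":
    f = 0.05                                   # illustrative: 5% of the work is serial
    for p in (1, 2, 4, 8, 16, 1024):
        print(f"p = {p:4d}   speedup = {amdahl_speedup(f, p):6.2f}")
    print(f"upper limit as p grows: {1.0 / f:.1f}")
```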

 

 

Efficiency, Scalability, Throughput

K

Know about the notions of efficiency, strong and weak scaling, and throughput.

C

Comprehend via several examples that having access to more processors does not guarantee faster execution (for example Amdahl's Law)

DS/A; Algo2

 

Tradeoffs

 

 

 

Recognize the inter-influence of various cost measures

 

 

 

Time vs. space

N

 

C

Understand, through an illustration (perhaps in the context of Big Data), that not all information may be saved for future reference; this has an impact on the time needed to perform the computation. Observe several examples of this prime cost tradeoff; lazy vs. eager evaluation supplies many examples. Observe that recomputing a result may sometimes be more energy efficient than storing and retrieving the result.

Algo2; OS

 

 

Power vs. time

N

 

C

Understand through at least one example this prime cost tradeoff (the literature on “VLSI computation” --- e.g., the footnoted books[1] [2] --- yields many examples). For example, trade power-intensive communication for extra computation, even if the latter takes more time. Understand that an imbalanced load may be better from the power consumption point of view, as some processes could be hibernated. Understand that recomputing a result may be more power efficient than sending the computed result to multiple nodes (for example, computing the join in a database context).

Algo2; ParAlgo; DistSystems

 

 

Power vs. precision

N

 

C

Understand the power savings from using smaller data representations (64-bit vs. 32-bit vs. 16-bit floating-point precision, 32-bit vs. 16-bit integers). Machine learning is driving the adoption of lower (16-bit) floating-point precision.

Algo2; OS

 

 

Isoefficiency (Work, Speedup, Efficiency tradeoffs)

N

 

C

Understand the idea of how to increase problem size as a function of the number of processes/threads to keep efficiency the same

ParProg; Algo2

 

Model-based notions

 

Recognize that architectural features can influence amenability to parallel cost reduction and the amount of reduction achievable

 

 

 

 

 

Notions from complexity-theory

 

Understand (via examples) that some computational notions transcend the details of any specific model

 

 

 

 

 

● Model(s) to abstract from architectural details (for example PRAM (shared mem) and/or completely connected (network))

K

Understand concurrency basics without the trappings of real systems (routing, data alignment, etc.). Recognize the PRAM as embodying the simplest forms of parallel computation: embarrassingly parallel problems can be sped up easily just by employing many processors. Recognize how a completely connected network abstracts away from routing details. Recognize the difference between the model(s) and real systems. The instructor is assumed to adopt such a model of choice (MoC) as the vehicle on which PDC concepts are discussed.

C

Observe examples of “model-independent” algorithms that ignore the details of the platform on which they are executed. Recognize architecture independence as an important avenue toward understanding the core ideas of parallelism. Explore fast algorithms (O(1) and O(log n) time, such as search and prefix computation); explore read/write rules (concurrent/exclusive). Explore communication-centric algorithms (broadcast, multicast, gossip).

DS/A; Algo2

 

 

●  BSP

N

 

K

Be exposed to higher-level algorithmic abstractions that encapsulate synchronization and other aspects of real architectures. BSP is a good option for introducing such a higher-level model and higher-level notions; remark that this abstraction has also led to programming models.

Algo2

 

 

●  Simulation/ emulation

N

 

K

See simple examples of this abstract, formal analogue of the virtual machines that are discussed under programming topics.  It is important to stress that (different aspects of the same) central notions of PDC can be observed in all four of our main topic areas.

Algo2

 

 

Notions from scheduling

 

Understand how to decompose a problem into tasks

 

 

 

 

 

●  Dependencies

A

Understand how dependencies constrain the execution order of sub-computations (thereby lifting one from the limited domain of “embarrassing parallelism” to more complex computational structures) and how deleterious race conditions can occur when dependencies among concurrent tasks are not respected. Also understand that there are situations where not respecting the order of certain computations may result in nondeterministic, but acceptable results. See examples under Mutual Exclusion and Conflict Resolution. Instructors may use this discussion to emphasize that distributed systems generally offer little control of time, and that executions are usually event-driven. For example, if a decision is to be made on the basis of emails from several sources, then the order of receipt of the emails should not affect the decision (as the order cannot be controlled).

 

 

CS1; CS2; DS/A

 

 

●  Task graphs

C

Show multiple examples of this concrete algorithmic abstraction as a mechanism for exposing inter-task dependencies. These graphs, which are also used in compiler analyses, form the level at which parallelism is exposed and exploited. Recognize that series-parallel graphs are a special case of task graphs, which can emerge from barrier synchronizations or fork-join. Understand the possible penalties (in parallelism) that this synchronization incurs (Amdahl's law).

 

 

DS/A; SwEng

 

 

●  Makespan

K

Observe analyses in which makespan is identified with parallel time (basically, time to completion)

 

 

DS/A

 

 

● Energy aware scheduling

N

 

C

Understand how processes can be scheduled to minimize energy consumption, for example, by taking advantage of active and idle power management capabilities.

OS

Algorithmic Techniques

 

Learn a variety of techniques, patterns, or kernels whose algorithmic utility has been demonstrated over the course of decades.

 

 

 

 

Decomposition

 

Recognize that tasks and/or data associated with an algorithm need to be decomposed into parts to expose concurrency that can be exploited by computing elements running in parallel.

 

 

 

 

 

Recursion and Divide & Conquer (parallel aspects)

A

Recognize that the same structure that enables divide-and-conquer (sequential) algorithms exposes opportunities for parallel computation. Examples include mergesort, numerical integration (trapezoid rule, Simpson's rule), or (at a more advanced level) Strassen's matrix multiplication; a minimal sketch of parallel numerical integration follows this row.

 

 

CS2; DS/A; Algo2
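A minimal sketch of the numerical-integration example, assuming a multiprocessing Pool and the trapezoid rule on sub-intervals of [0, pi], might look as follows; the divide step creates independent sub-problems and the combine step is a simple sum.

```python
from multiprocessing import Pool
import math

def trapezoid(args):
    f, a, b, n = args                          # integrate f on [a, b] with n trapezoids
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

if __name__ == "__main__":
    a, b, pieces = 0.0, math.pi, 4
    edges = [a + i * (b - a) / pieces for i in range(pieces + 1)]
    tasks = [(math.sin, edges[i], edges[i + 1], 10_000) for i in range(pieces)]
    with Pool(pieces) as pool:                 # each sub-interval is independent
        print(sum(pool.map(trapezoid, tasks))) # combine step: ~2.0 for sin on [0, pi]
```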

 

 

Blocking and Striping

N

 

C

See examples of this algorithmic manifestation of memory hierarchies

Algo2

 

 

Architecture-Specific decomposition

N

 

C

Understand that performance and/or energy efficiency can be improved by decomposing an algorithm to exploit features of GPUs (as opposed to, for example, CPUs)

Algo2

 

Load Balancing

K

Understand that equitably sharing the computational/communication load among processors benefits the entire algorithm; that is, the most stretched processor determines the performance of the entire algorithm. Use a simple task graph (for example) with a small number of processors to illustrate the idea (a minimal sketch follows this row).

 

 

DS/A; CS2
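A minimal sketch illustrating the idea, assuming two processors and a small list of hypothetical task sizes, compares a naive assignment with a greedy one and reports the makespan (the load of the most stretched processor) for each.

```python
tasks = [8, 7, 6, 5, 4, 3, 2, 1]                # hypothetical task sizes

def makespan(assignment, p):
    loads = [0] * p
    for size, proc in assignment:
        loads[proc] += size
    return max(loads)                           # the most stretched processor

# Naive: first half of the tasks to processor 0, second half to processor 1.
naive = [(t, 0 if i < 4 else 1) for i, t in enumerate(tasks)]

# Greedy: give each task to the currently least-loaded processor.
loads, greedy = [0, 0], []
for t in tasks:
    proc = loads.index(min(loads))
    greedy.append((t, proc))
    loads[proc] += t

print("naive makespan :", makespan(naive, 2))   # 26 (loads are 26 and 10)
print("greedy makespan:", makespan(greedy, 2))  # 18 (perfectly balanced here)
```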

 

Multi-party communication

 

Recognize the semantics of some common multi-party communication operations and how to use them appropriately in parallel algorithms.

 

 

 

 

 

Reduction

C

Recognize and use the tree structure implicit in applications such as scalar product, mergesort, histogram, MapReduce, etc.

 

 

DS/A

 

 

Synchronization

A

Recognize that synchronization is necessary for certain algorithms to work correctly. Also recognize that synchronization should be used only when needed as it entails its own overheads. For example, a reduction tree implemented on multiple processing elements needs synchronization before the operation of an internal node is performed. However, a barrier synchronization across the entire tree at each level imposes an unnecessary logarithmic overhead; in this situation local synchronizations suffice.

 

 

CS1; CS2; DS/A

 

 

Parallel Prefix (Scan)

K

Observe, via examples, this "high-level" algorithmic tool. For instance, polynomial evaluation can be done as a prefix product (of powers of x), then a set of multiplications, followed by a reduction; a minimal sketch follows this row.

 

 

DS/A
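A minimal sketch of this use of scan as a primitive might look as follows; itertools.accumulate stands in for a parallel prefix product, and the coefficients are illustrative.

```python
from itertools import accumulate
from operator import mul

def poly_eval(coeffs, x):
    # powers = [1, x, x^2, ...] via a prefix product (the scan primitive)
    powers = [1] + list(accumulate([x] * (len(coeffs) - 1), mul))
    terms = [c * p for c, p in zip(coeffs, powers)]   # independent multiplications
    return sum(terms)                                 # final reduction

print(poly_eval([1, 2, 3], 5))   # 1 + 2*5 + 3*5^2 = 86
```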

 

 

Other multi-party communication patterns

K

Recognize common multi-party communication patterns such as broadcast, gather, scatter, all-to-all or many-to-many communications, and their use as building blocks for parallel and distributed algorithms.  Illustrate using block matrix transpose or shuffle from map-reduce, etc.

 

 

CS2; DS/A

 

 

MapReduce

N

 

 

Understand MapReduce as an approach to Big Data that involves moving data and then performing a reduction on like data. Understand a simple example such as word count (a minimal sketch follows this row). It is assumed that students know elements of a distributed model that emphasizes the infeasibility of moving large amounts of data.

Algo2
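A minimal word-count sketch in the MapReduce style, assuming an in-memory list of documents and a multiprocessing Pool standing in for a distributed framework, might look as follows.

```python
from multiprocessing import Pool
from collections import Counter
from functools import reduce

def map_count(document):
    return Counter(document.lower().split())   # map: per-document word counts

def merge_counts(a, b):
    a.update(b)                                # reduce: combine counts for like words
    return a

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox and the dog"]
    with Pool(2) as pool:
        partials = pool.map(map_count, docs)
    totals = reduce(merge_counts, partials, Counter())
    print(totals.most_common(3))               # e.g., [('the', 4), ('fox', 2), ('dog', 2)]
```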

 

Mutual Exclusion and Conflict Resolution

C

Understand the need to resolve conflicts among concurrent processes competing for a shared resource. Here the computation may have to grant exclusive access to one process to ensure correctness and/or progress. Be able to identify and mitigate problems due to races (a minimal sketch of a race and its repair follows this row). Instructors may provide examples such as (a) selecting an airline seat that may be simultaneously competed for by several passengers, (b) selecting which customer gets an item when more than one tries to buy it simultaneously, (c) mutual exclusion in the context of Java threads, (d) dining philosophers.

A

This can also be explored in the context of MAC protocols (Networking); leader election (Distributed Systems). Asynchrony and local knowledge are causes of conflict (Distributed systems).

CS2; DS/A; Networking; DistSystems
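A minimal sketch of a race and its mitigation, assuming Python threads incrementing a shared counter, might look as follows; the unprotected total may (or may not) come up short on a given run, which is exactly the nondeterminism to be discussed.

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_increment(n):
    global counter
    for _ in range(n):
        counter += 1                 # read-modify-write: not atomic, a race

def safe_increment(n):
    global counter
    for _ in range(n):
        with lock:                   # mutual exclusion restores correctness
            counter += 1

def run(worker, n=200_000, threads=4):
    global counter
    counter = 0
    ts = [threading.Thread(target=worker, args=(n,)) for _ in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return counter

if __name__ == "__main__":
    print("unsafe total:", run(unsafe_increment), "(expected 800000)")
    print("safe total:  ", run(safe_increment), "(expected 800000)")
```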

Algorithmic Problems

 

The Algorithmic Problems section contains parallel algorithms for certain problems. The important thing here is to emphasize the parallel/distributed aspects of each topic.

 

 

 

 

Algorithms for Communication and Synchronization

 

Understand  (at the pseudo-code level) how certain patterns of communication can be implemented in a parallel/distributed model; the model(s) of choice (MoCs) could serve as good vehicle(s) on which to explore these ideas consistently across the course. As a corollary, one could also appreciate the cost of communication in PDC.

 

 

 

 

 

Reduction and Broadcast for communication and synchronization

C

Understand, for example, how recursive doubling can be used for all-to-one reduction, and its dual, one-to-all broadcast, in log(p) steps; a minimal simulation of the reduction follows this row. The same applies to all-to-all broadcast and all-to-all reduction. Be aware of the synonyms for these operations in the jargon associated with different areas; for example, all-to-all broadcast may be referred to as "gossip" or "total exchange". Recognize that all-to-all broadcast/reduction are synchronizing operations in a distributed (event-driven) environment.

A

 

DS/A; Algo2
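A minimal sketch simulating all-to-one reduction by recursive doubling, assuming p is a power of two and one value per simulated process, might look as follows; it models only the communication pattern, not a real message-passing system.

```python
def all_to_one_reduce(values):
    """Simulate all-to-one sum reduction; p = len(values) must be a power of two."""
    vals, p, step, rounds = list(values), len(values), 1, 0
    while step < p:
        # In each round, process (dst + step) "sends" its value to process dst.
        for dst in range(0, p, 2 * step):
            vals[dst] += vals[dst + step]
        step *= 2
        rounds += 1
    return vals[0], rounds

if __name__ == "__main__":
    data = list(range(1, 9))                              # one value per simulated process
    total, rounds = all_to_one_reduce(data)
    print(total, "==", sum(data), "in", rounds, "rounds") # 36 == 36 in 3 rounds
```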

 

 

Parallel Prefix (Scan)

C

Understand the structure of at least one simple parallel prefix algorithm, for example, on a PRAM-type model. One could consider recursive or iterative approaches (such as those of Ladner-Fischer, Kogge-Stone, Brent-Kung); a minimal sketch in the Kogge-Stone style follows this row.

A

 

DS/A; Algo2
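A minimal sketch of an inclusive scan in the Kogge-Stone / Hillis-Steele style might look as follows; the log2(n) outer steps are each fully data-parallel, though they are simulated sequentially here.

```python
import operator

def inclusive_scan(xs, op=operator.add):
    out = list(xs)
    d = 1
    while d < len(out):
        # Every position i >= d updates simultaneously in a parallel implementation.
        out = [out[i] if i < d else op(out[i - d], out[i]) for i in range(len(out))]
        d *= 2
    return out

print(inclusive_scan([3, 1, 4, 1, 5, 9, 2, 6]))   # [3, 4, 8, 9, 14, 23, 25, 31]
```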

 

 

Multicast

N

 

C

Extend broadcast to multicast and explore avenues for communication-efficiency, relative to the MoC.

OS

 

 

Permutation

N

 

C

Understand important permutations (shuffle, transpose etc.) and their implementation complexity issues.

OS

 

 

Critical Regions and Mutual Exclusion

K

Be aware that a solution to the critical section problem must satisfy Mutual Exclusion, Progress, and Bounded Waiting.

C

 

OS; DistSystems; ParProg

 

 

Termination detection

K

Observe that, unlike the sequential case, processes in parallel and distributed algorithms may not know when the problem has been solved, or even if their part of the problem is solved; so termination has to be addressed explicitly. In some cases (such as reduction, tree algorithms, divide and conquer) it may be possible for a process to terminate on the basis of local information (for example, a node has passed its information to its parent in a reduction tree). In other cases, a global check may be necessary.

C

See examples that suggest the difficulty of proving that algorithms from various classes actually terminate.  For more advanced courses, observe proofs of termination, to understand the conceptual tools needed.

DS/A; OS

 

Sorting

K

Observe at least one parallel sorting algorithm together with analysis. Parallel merge sort is the simplest example (a minimal sketch follows this row), but other alternatives might be covered as well; more sophisticated algorithms might be covered in more advanced courses.

C

 

CS2; DS/A; Algo2
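A minimal sketch of a parallel merge sort, assuming a ProcessPoolExecutor with a fixed number of workers and a sequential final merge, might look as follows.

```python
from concurrent.futures import ProcessPoolExecutor
from heapq import merge
import random

def mergesort(xs):
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    return list(merge(mergesort(xs[:mid]), mergesort(xs[mid:])))

def parallel_mergesort(xs, workers=4):
    chunk = (len(xs) + workers - 1) // workers
    parts = [xs[i:i + chunk] for i in range(0, len(xs), chunk)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        sorted_parts = list(pool.map(mergesort, parts))   # sort blocks in parallel
    result = []
    for part in sorted_parts:                             # sequential final merge
        result = list(merge(result, part))
    return result

if __name__ == "__main__":
    data = [random.randint(0, 999) for _ in range(100_000)]
    print(parallel_mergesort(data) == sorted(data))       # True
```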

 

Search

K

With the help of BFS- or DFS-like parallel search in a tree, graph, or solution space, understand speedup anomalies and the fact that certain algorithms don't lend themselves to parallelization without modifying the semantics of the original problem. For example, strict DFS order of visiting nodes in a tree cannot be maintained in parallel. Also, the order in which solutions are found and the time taken to find them could vary unpredictably depending on the degree of parallelism. Detailed knowledge of parallel search algorithms is not expected.

C

 

DS/A; Algo2

 

Algorithms for streams

K

Comprehend the notion of efficient algorithms (e.g., Bloom filters, heavy hitters) and structures (e.g., distributed hash tables) for stream data, and the difficulty of dealing with limited space.

 

 

CS1; DS/A

 

Spatial Problems

N

 

K

Understand parallel and distributed decomposition, data structures, and solving strategies for problems rooted in a distribution of points in a multidimensional space such as n-body simulations (FMM, Barnes-Hut), data clustering (R* trees, DBSCAN), and classifiers (k-NN)

Algo2

 

Deeper Algorithmic Experience

A

Experience, through class instruction and an assignment/project, the design, analysis, and implementation aspects of at least one parallel or distributed algorithm of choice in detail. Master PDC algorithmic concepts through a detailed exploration, including recognizing how algorithm design reflects the structure of the computational problem(s) and the PDC model/environment. Possible computational problems to explore include matrix product, MapReduce, sorting, search, convolution, or a graph algorithm of choice.

 

 

CS2; DS/A