Topics

Core Bloom level

Learning Outcomes and Teaching Suggestions (core)

Advanced Bloom Level

Learning Outcomes (advanced)

Where Covered

Parallel and Distributed Models and Complexity

Be exposed to the foundational aspects of parallel and distributed algorithms, including computational and communication models, and performance metrics





Concurrency, Asynchrony, Dependencies, and Nondeterminism

C

Qualitatively understand the notion of concurrency, asynchrony, dependencies, and nondeterminism through one or more every day examples illustrating simultaneous events with dependencies. The examples can be noncomputational in CS1; for example, preheating the oven and preparing the batter for a cake can proceed concurrently and asynchronously to save time, but both these events must finish before baking starts (dependency). Computational examples of the appropriate level can be used in CS2. The goal for the instructor is to develop a baseline of ideas upon which to build the PDC concepts.



CS1; CS2


Costs of computation


Be exposed to the broad concepts of parallel time and space complexity






Asymptotics

C

Be able to describe upper (bigO) and lower bounds (big Omega,) in the context of PDC, understanding that the functions whose order is being analyzed may have an additional variable related to the number of processing elements in the PDC context. For example, serial run time for simple matrix multiplication is Theta(n^3), and the parallel run time may be Theta(n^3/p), where the number of processing elements p is the additional variable.



DS/A; CS2



Time (number of operations) complexity

C

Recognize time as a fundamental computational resource that can be influenced by parallelism



DS/A



Work

C

Understand the definition of computational work and how it is different from time. Be able to observe its impact on complexity measures such as time, speedup, energy consumption, etc.



DS/A



Space/Memory

C

Recognize space/memory in the same manner as time.



DS/A



Memory and Communication complexity

C

Understand that data movement (such as memory accesses or communication) may take more walltime and energy than computations. Understand also that in certain distributed "big data" applications, moving data is cost prohibitive.



DS/A


Performance Metrics


Be exposed to a variety of computational costs that are affected by PDC.






Speedup

C

Recognize the use of parallelism either to solve a given problem instance faster or to solve larger instance in the same time. Understand and be able to explain why there's an upper limit on speedup for a given problem of a given size (Amdah's law).



DS/A



Efficiency, Scalability, Throughput

K

Know about the notions of efficiency, strong and weak scaling, and throughput.

C

Comprehend via several examples that having access to more processors does not guarantee faster execution (for example Amdahl's Law)

DS/A; Algo2


Tradeoffs




Recognize the interinfluence of various cost measures




Time vs. space

N


C

Understand through an illustration in the context of, perhaps BigData, that not all information may be saved for future reference. This has an impact on the time needed to perform the computation. Observe several examples of this prime cost tradeoff; lazy vs. eager evaluation supplies many examples. Observe that recomputing a result may sometimes be more energy efficient than storing and retrieving the result.

Algo2; OS



Power vs. time

N


C

Understand through at least one example this prime cost tradeoff (the literature on “VLSI computation”  e.g., the footnoted books[1] [2]  yield many examples). For example, trade powerintensive communication for extra computation, even if the latter takes more time. Understand that an imbalanced load may be better from the power consumption point of view as some processes could be hibernated. Understand that recomputing a result may be more power efficient than sending the computed result to multiple nodes (for example computing the join in a database context).

Algo2; ParAlgo; DistSystems



Power vs. precision

N


C

Understand that power savings using smaller data representation 64bit vs. 32bit vs. 16bit floatingpoint precision, 32bit vs. 16bit integer). Machine learning is driving lower (16bit) floatingpoint precision.

Algo2; OS



Isoefficiency (Work, Speedup, Efficiency tradeoffs)

N


C

Understand the idea of how to increase problem size as a function of the number of processes/threads to keep efficiency the same

ParProg; Algo2


Modelbased notions


Recognize that architectural features can influence amenability to parallel cost reduction and the amount of reduction achievable






Notions from complexitytheory


Understand (via examples) that some computational notions transcend the details of any specific model






● Model(s) to abstract from architectural details (for example PRAM (shared mem) and/or completely connected (network))

K

Understand concurrency basics without the trappings of real systems (routing, data alignment etc.). Recognize the PRAM as embodying the simplest forms of parallel computation: Embarrassingly parallel problems can be sped up easily just by employing many processors. Recognize how a completely connected network abstracts away from routing details. Recognize the difference between the model(s) and real systems. Such a MODEL of CHOICE (MoC) is assumed to be adopted by the instructor on which PDC concepts would be discussed.

C

Observe examples of “modelindependent” algorithms that ignore the details of the platform on which they are executed. Recognize architecture independence as an important avenue toward understanding the core ideas of parallelism. Explore fast algorithms (O(1) and O(log n) time, such as search and prefix computation); explore read/write rules (concurrent/exclusive). Explore communication centric algorithms (broadcast, multicast, gossip)

DS/A; Algo2



● BSP

N


K

Be exposed to higherlevel algorithmic abstractions that encapsulate synchronization and other aspects of real architectures. BSP would be a good option to introduce a higher level programming model and higherlevel notions. Remark that this abstraction has led to programming models.

Algo2



● Simulation/ emulation

N


K

See simple examples of this abstract, formal analogue of the virtual machines that are discussed under programming topics. It is important to stress that (different aspects of the same) central notions of PDC can be observed in all four of our main topic areas.

Algo2



Notions from scheduling


Understand how to decompose a problem into tasks






● Dependencies

A

Understand how dependencies constrain the execution order of subcomputations (thereby lifting one from the limited domain of “embarrassing parallelism” to more complex computational structures) and how deleterious race conditions can occur when dependencies among concurrent tasks are not respected. Also understand that there are situations where not respecting the order of certain computations may result in nondeterministic, but acceptable results. See examples under Mutual Exclusion and Conflict Resolution. Instructors may use this discussion to emphasize that distributed systems generally offer little control of time, and that executions are usually eventdriven. For example, if a decision is to be made on the basis of emails from several sources, then the order of receipt of the emails should not affect the decision (as the order cannot be controlled).



CS1; CS2; DS/A



● Task graphs

C

Show multiple examples of this concrete algorithmic abstraction as a mechanism for exposing intertask dependencies. These graphs, which are used also in compiler analyses, form the level at which parallelism is exposed and exploited. Recognize that Seriesparallel graphs are a special case of task graphs, which can emerge from barrier synchronizations or forkjoin. Understand the possible penalties (in parallelism) that this synchronization incurs (Amdahl's law).



DS/A; SwEng



● Makespan

K

Observe analyses in which makespan is identified with parallel time (basically, time to completion)



DS/A



● Energy aware scheduling

N


C

Understand how processes can be scheduled to minimize energy consumption due, for example, by taking advantage of active and idle power management capabilities.

OS

Algorithmic Techniques


Learn a variety of techniques, patterns, or kernels whose algorithmic utility has been demonstrated over the course of decades.





Decomposition


Recognize that tasks and/or data associated with an algorithm need to be decomposed into parts to expose concurrency that can be exploited by computing elements running in parallel.






Recursion and Divide & Conquer (parallel aspects)

A

Recognize that the same structure that enables divide and conquer (sequential) algorithms exposes opportunities for parallel computation. Examples include mergesort or numerical integration (trapezoid rule, Simpson’s rule) or (at a more advanced level) Strassen's matrixmultiply.



CS2; DS/A; Algo2



Blocking and Striping

N


C

See examples of this algorithmic manifestation of memory hierarchies

Algo2



ArchitectureSpecific decomposition

N


C

Understand that performance and/or energy efficiency can be improved by decomposing an algorithm to exploit features of GPUs (as opposed to, for example, CPUs)

Algo2


Load Balancing

K

Understand that processors equitably sharing the computational/communication load among processors benefits the entire algorithm. That is, the most stretched processor determines the performance of the entire algorithm. Use a simple task graph (for example) with a small number of processors to illustrate the idea.



DS/A; CS2


Multiparty communication


Recognize the semantics of some common multiparty communication operations and how to use them appropriately in parallel algorithms.






Reduction

C

Recognize and use the tree structure implicit in applications such as scalar product, mergesort, histogram, mapreduce, etc.



DS/A



Synchronization

A

Recognize that synchronization is necessary for certain algorithms to work correctly. Also recognize that synchronization should be used only when needed as it entails its own overheads. For example, a reduction tree implemented on multiple processing elements needs synchronization before the operation of an internal node is performed. However, a barrier synchronization across the entire tree at each level imposes an unnecessary logarithmic overhead; in this situation local synchronizations suffice.



CS1; CS2; DS/A



Parallel Prefix (Scan)

K

Observe, via examples this "highlevel" algorithmic tool. For instance, polynomial evaluation can be done as a prefix product (of powers of x), then a set of multiplications, followed by a reduction.



DS/A



Other multiparty communication patterns

K

Recognize common multiparty communication patterns such as broadcast, gather, scatter, alltoall or manytomany communications, and their use as building blocks for parallel and distributed algorithms. Illustrate using block matrix transpose or shuffle from mapreduce, etc.



CS2; DS/A



MapReduce

N



Understand MapReduce as an approach to Big Data that involves data movement and then doing a reduction on like data. Understand a simple example such as word count. It is assumed that students know elements of a distributed model that emphasizes the infeasibility of moving large amounts of data.

Algo2


Mutual Exclusion and Conflict Resolution

C

Understand the need to resolve conflicts among concurrent processes competing for a shared resource. Here the computation may have to grant exclusive access to one process to ensure correctness and/or progress. Be able to identify and mitigate problems due to races. Instructors may provide examples such as (a) selecting an airline seat that may be simultaneously competed for by several passengers, (b) selecting which customer gets an item when more than one tries to buy it simultaneously, (c) mutual exclusion in the context of Java threads, (d) Dining philosophers.

A

This can also be explored in the context of MAC protocols (Networking); leader election (Distributed Systems). Asynchrony and local knowledge are causes of conflict (Distributed systems).

CS2; DS/A; Networking; DistSystems

Algorithmic Problems


The Algorithmic problems section contains parallel algorithms for certain problems. The important thing here is to emphasize the parallel/distributed aspects of the topic





Algorithms for Communication and Synchronization


Understand (at the pseudocode level) how certain patterns of communication can be implemented in a parallel/distributed model; the model(s) of choice (MoCs) could serve as good vehicle(s) on which to explore these ideas consistently across the course. As a corollary, one could also appreciate the cost of communication in PDC.






Reduction and Broadcast for communication and synchronization

C

Understand, for example, how recursive doubling can be used for alltoone reduction, and its dual, onetoall reduction, in log(p) steps. The same applies to alltoall broadcast and alltoall reduction. Be aware of the synonyms for these operations in the jargon associated with different areas; for example, alltoall broadcast may be referred to as "gossip" or "total exchange". Recognize that alltoall broadcast/reduction are synchronizing operations in a distributed (eventdriven) environment.

A


DS/A; Algo2



Parallel Prefix (Scan)

C

Understand the structure of at least one simple parallel prefix algorithm, for example, on a PRAMtype model. One could consider recursive or iterative approaches (such as those of LadnerFischer, KoggeStone, BrentKung)

A


DS/A; Algo2



Multicast

N


C

Extend broadcast to multicast and explore avenues for communicationefficiency, relative to the MoC.

OS



Permutation

N


C

Understand important permutations (shuffle, transpose etc.) and their implementation complexity issues.

OS



Critical Regions and Mutual Exclusion

K

Be aware that a solution to the critical section problems must satisfy Mutual Exclusion, Progress, and Bounded Wait Times.

C


OS; DistSystems; ParProg



Termination detection

K

Observe that, unlike the sequential case, processes in parallel and distributed algorithms may not know when the problem has been solved, or even if their part of the problem is solved; so termination has to be addressed explicitly. In some cases (such as reduction, tree algorithms, divide and conquer) it may be possible for a process to terminate on the basis of local information (for example, a node has passed its information to its parent in a reduction tree). In other cases, a global check may be necessary.

C

See examples that suggest the difficulty of proving that algorithms from various classes actually terminate. For more advanced courses, observe proofs of termination, to understand the conceptual tools needed.

DS/A; OS


Sorting

K

Observe at least one parallel sorting algorithm together with analysis. Parallel merge sort is the simplest example, but other alternatives might be covered as well; more sophisticated algorithms might be covered in more advanced courses.

C


CS2; DS/A; Algo2


Search

K

With the help of BFS or DFSlike parallel search in a tree, graph or solution space, understand speedup anomalies and the fact that certain algorithms don't lend themselves to parallelization without modifying the semantics of the original problem. For example, strict DFS order of visiting nodes in a tree cannot be maintained in parallel. Also, the order in which solutions are found and the time taken to find them could vary unpredictably depending on the degree of parallelism. Detailed knowledge of parallel search algorithms is not expected.

C


DS/A; Algo2


Algorithms for streams

K

Comprehend the notion of efficient algorithms (e.g., Bloom filters, heavy hitters) and structures (e.g., distributed hash tables) for stream data, and the difficulty of dealing with limited space.



CS1; DS/A


Spatial Problems

N


K

Understand parallel and distributed decomposition, data structures, and solving strategies for problems rooted in a distribution of points in a multidimensional space such as nbody simulations (FMM, BarnesHut), data clustering (R* trees, DBSCAN), and classifiers (kNN)

Algo2


Deeper Algorithmic Experience

A

Experience through class instruction and assignment/project the design, analysis, and implementation aspects of at least one parallel or distributed algorithm of choice in detail. Master PDC algorithmic concepts through a detailed exploration, including recognizing how algorithm design reflects the structure of the computational problem(s) and the PDC model/environment. Possible computational problems to explore include matrix product, map reduce, sorting, search, convolution, a graph algorithm of your choice.



CS2; DS/A
