Appendix II: Suggestions on how to teach topics

 

Architecture

 

Data versus control parallelism:

Superscalar (ILP): Multiple issue can be covered in detail in a compiler class, an intro to systems or assembly language, or in architecture. However, even in an early programming class, students are often curious about the factors that affect performance across different processors, and are receptive to hearing about how different models have different numbers of pipelines and greater or lesser ability to cope with unbalanced and dependent groups of instructions.

SIMD/Vector (e.g., SSE, Cray): This can be mentioned any place that vector/matrix arithmetic algorithms are covered, and even in a data structures class. In an architecture class, it may be part of introducing SSE-style short vector operations in a 64-bit ISA. If chip photos are shown, it can be noted that the control unit section of a processor could be shared among multiple identical ALUs to create an explicitly SIMD architecture. Or, in a survey of supercomputers, Cray vector architectures can be described as a historical example, and the evolution of SIMD to SPMD and streaming data parallel designs can be traced.

Pipelines: Pipelines appear in simple form as one means of implementing a vector multiplier, where stages are essentially identical. Otherwise, they are typically covered in an architecture course. It is possible, however, to introduce the concept earlier in a multithreaded application where a chain of consumer/producer threads feed forward through intermediate buffers or queues.

Single vs. multicycle: The difference between data paths in a single cycle and pipelined processor is most directly treated in an architecture course by walking through the creation of the simpler form and then converting it to a pipeline. However, it can also be shown in an advanced programming course, where the steps of a large, monolithic, task are broken into a chain of threads that each execute in a shorter amount of time, and provide opportunities for the OS to take advantage of multicore capabilities.

Streams (e.g., GPU): Graphics processors are one example of a stream architecture, and can be described in terms of how they marshal a block of identical threads to operate in a sweep over a large array of data, and can be covered in a survey of approaches to data parallelism in an architecture course, or as a performance topic in a graphics course. Streams can also be used as a design pattern for threaded code operating on a large data structure.

MIMD: In a survey of parallelism in an architecture course it is easy to take the step from uniprocessors to multiprocessors, since it is obvious that a CPU can be replicated and used in parallel. In an OS course, multitasking can be described both in a uniprocessor and a multiprocessor context. In an early programming course, it is likely that students will be curious about what the popular term multicore means, and how it could impact their programming.

Simultaneous Multithreading (e.g., Hyperthreading): In an architecture course, different granularities of multithreading should be addressed. Then the impact on the microarchitecture (multiple register sets, more ALU utilization and increased heat, additional logic to manage exceptions, etc.) can be covered. In an early course, where performance is discussed, it follows naturally after explaining superscalar issue and the low rate at which issue slots are filled. It can then be contrasted with multicore in the sense that it does not replicate the entire CPU, but just the parts that are essential to enabling a second thread to run in a way that fills in some underutilized resources. An analogy is using a truck to tow two trailers versus using two trucks to tow the same two trailers.

Multicore: In an architecture course, the rise in power and heat with increased clock rate will naturally follow the idea that pipelining enables faster clock rates. The limited ability of chips to consume power and dissipate heat then motivates the shift away from the trend of using more real estate to build faster processors toward building more cores that operate at a slower rate. Once a chip has multiple cores, the question of how they work together leads to coverage of communication mechanisms. Scaling up the idea of multicores then leads to the question of whether they must all be identical. Tying chip scaling back to fault and yield models provides an indication that cores are likely to be heterogeneous in performance and functionality as a result of manufacturing variations. At a high level of abstraction, some of these concepts can be explained in an early programming course, where factors that affect performance are discussed.

Cluster: Students may first encounter a cluster in a departmental compute server for use in an instructional lab. Or it may be mentioned in an Internet course when explaining how services such as search engines and on-line auctions are supported. In an architecture class, clusters are motivated by the practical limitations of motherboard size and the ease of assembly from off-the-shelf components. In this context, students should also be cautioned regarding practical issues of power conditioning, thermal management, and mean time to failure of nodes.

Grid/cloud: Students will typically have personal experience with cloud computing through internet services, such as document storage and sharing, that are available anywhere. Questions about how this works can be addressed in programming (where such services may be used by teams), networking, and database courses. More advanced courses can cover grid issues in terms of latency, availability, load distribution and balancing, allocation policies, etc.

Shared versus distributed memory:

SMP:

Buses: In the earliest courses, students often want to understand the differences among the many kinds of buses they hear about (front side, PCI, USB, etc.), and this presents an opportunity to explain how multiple components in a computer can share a common communication link. In an architecture course, it is normal to encounter various internal buses, including the memory bus. Once the idea of DMA and multiple masters is introduced, it is logically a small step to have two or more processors on the same memory bus. As with I/O devices on the bus, a protocol is needed to ensure that intended recipients receive current data.

Message passing (no shared memory):

Topologies: In an architecture course, once the idea of inter-processor communication over a network link is established, going beyond two processors opens up options for the arrangement of links. A few examples illustrate the potential explosion of topologies, so it is then worth mentioning that most can be simulated with constant order slowdown by a few key topologies. Thus, it boils down to more practical considerations, such as the ease of building a mesh on a circuit board, or wiring network cable and routers in a high-degree fat-tree pattern.

Diameter: Although this can also be covered as part of graph theory, it is useful to show the differences in diameter of a few topologies, mainly so students can see that there are some very poor choices possible, such as linear, and that most common topologies seek a much smaller diameter. It can be shown that in packet-switched networks, each hop incurs considerable delay due to routing overhead, which is a reason that students should care about the issue.

Latency: In architecture, latency comes in many forms. Extending the idea to message passing is fairly obvious. What is less obvious is how much of it is due to the software protocol stack. Thus, specialized interfaces and routers can be used to reduce latency as a system scales up. The concept can also be covered in an Internet course, observing round-trip times over different numbers of hops.

Bandwidth: It is fairly obvious that data can be transferred at different rates over different kinds of links. Most students will have experienced this effect via different wireless links, or comparing wireless to Ethernet, etc. It is really just a matter of formalizing the terminology. In an architecture class or as part of graph theory the idea of bisection bandwidth can also be introduced with respect to network topology.

Memory Hierarchy:

Cache organization: In an architecture class, once caching has been covered and bus-based multiprocessing is introduced, the issue of coherency of shared data arises. In an advanced architecture class, a basic snooping protocol, such as SI, can be shown to achieve coherency. After a few examples, it becomes clear that such a simple protocol results in excessive coherency traffic, and this motivates the value of a protocol with more states, such as MSI, MESI, or MOESI, which can be briefly described.

Floating-point representation:

Precision: It is easy to explain that higher precision floating point involves a lengthier calculation, moving more bits to and from memory, and that an array of double precision values occupies twice as much memory as single precision. As a result, there is a tradeoff between precision and performance.

Cycles per instruction (CPI): At one level, once the idea of clock cycle has been explained, and the fact that instructions can take different numbers of cycles to execute, it is easy to add the notion that processors can execute fewer or more than one instruction in each cycle. It can be noted that CPI is the inverse of IPC, and that these can be biased by the instruction mix resulting from compiler optimizations (CPI is affected by instruction scheduling). When pipelining is introduced, the CPI calculation can be demonstrated by tracing a few instructions through the pipeline by hand. In concept it is easy to imagine that superscalar issue greatly affects CPI, and students should be aware that the benefit is less than what they may expect for normal code sequences.

Benchmarks: Students can be shown that most ad-hoc metrics are poor indicators of performance. For example, a processor with a high clock rate may deliver less performance than one with a lower rate (because of the latter's other architectural advantages). Thus, benchmark programs are a better indicator of actual performance. But then it is a question of how to define a good benchmark. One kind of program doesn't predict the behavior of another kind. So a suite of benchmarks helps to broaden coverage. But because a suite is an artificial assemblage of programs, it is inherently biased, and so different suites are needed to represent different kinds of workloads.

SPEC mark: Explain differences between arithmetic, geometric, harmonic, and weighted means. Have students explore their values when applied to different data sets, including one with one or two outliers that are much greater than the rest. Notice the excessive impact that the outliers have on the arithmetic and geometric means. Look at the SPEC results for some machines and notice that most reports have one or two outliers. Recompute the mean using harmonic mean, and omitting the outliers. Careful selection of the reports can show two machines trading places in ranking when outliers are ignored.
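
To make the comparison concrete, a minimal C sketch along these lines can be run in class; the five "scores" below are made up purely for illustration (they are not real SPEC results), with the last value acting as the outlier.

    /* Illustrative only: made-up "scores", not real SPEC results.
     * Compares arithmetic, geometric, and harmonic means on a data
     * set containing one large outlier.  Compile: cc means.c -lm */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double scores[] = {10.0, 12.0, 11.0, 9.0, 95.0};   /* last value is the outlier */
        int n = sizeof(scores) / sizeof(scores[0]);
        double sum = 0.0, logsum = 0.0, invsum = 0.0;

        for (int i = 0; i < n; i++) {
            sum    += scores[i];
            logsum += log(scores[i]);
            invsum += 1.0 / scores[i];
        }
        printf("arithmetic mean: %.2f\n", sum / n);          /* pulled far up by the outlier */
        printf("geometric  mean: %.2f\n", exp(logsum / n));
        printf("harmonic   mean: %.2f\n", n / invsum);       /* least affected by the outlier */
        return 0;
    }

Running it and then removing the outlier from the array makes the sensitivity of each mean immediately visible.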

Peak performance: Use published parameters for an architecture to compute peak performance in MIPS or FLOPS, then see how this compares with benchmark execution reports.
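
As a worked illustration (the machine parameters here are hypothetical, not taken from any published data sheet), the calculation is simply a product of published rates:

    peak FLOPS = cores x clock rate x FLOPs per cycle per core
               = 8 x (3 x 10^9) x 8
               = 1.92 x 10^11, i.e., 192 GFLOPS

Comparing such a figure with measured benchmark results typically shows sustained performance falling well short of peak, which motivates the next two items.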

MIPS/FLOPS: Define these terms.

Sustained performance: Define sustained performance, and show some examples in comparison to peak performance.

Programming

Parallel Programming paradigms and Notations:

By the target machine model:

SIMD: Discuss operating on multiple data elements at a time with 1 operation/instruction, using a simple example (e.g., array add) with F90 syntax, as a loop, etc.

Microprocessor vector extensions: Introduce (or revisit) SIMD parallelism, its pros and cons, give examples in modern microprocessors (SSE, Altivec), and, if possible, have the students experiment with writing simple programs that use SSE.
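
A minimal C sketch of the kind of exercise suggested above, using SSE intrinsics to add two float arrays four elements at a time; it assumes an x86 machine with SSE support and an array length that is a multiple of 4 (file and function names are illustrative only).

    /* Minimal SSE sketch: add two float arrays four elements at a time.
     * Assumes x86 with SSE and a length that is a multiple of 4.
     * Compile (gcc/clang): cc -msse vec_add.c */
    #include <xmmintrin.h>   /* SSE intrinsics */
    #include <stdio.h>

    void vec_add(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* 4 adds in one instruction */
        }
    }

    int main(void) {
        float a[8] = {1,2,3,4,5,6,7,8}, b[8] = {8,7,6,5,4,3,2,1}, c[8];
        vec_add(a, b, c, 8);
        for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);
        printf("\n");
        return 0;
    }

Comparing this against the plain scalar loop (and against what the compiler auto-vectorizes) makes the SIMD pros and cons tangible.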

Shared memory: Examples of thread programs with both array and control parallelism, with locks and synchronization ops; explain that threads may run in parallel in the same address space unless explicitly prevented from doing so; assign programming projects with threads (Java, pthreads, etc.)

Shared memory notations: Introduce various ways of parallel programming: (1) parallel languages, which come in very diverse flavors, for example, UPC, Cilk, X10, Erlang; (2) extensions to existing languages via compiler directives or pragmas, such as OpenMP; (3) parallel libraries, such as MPI, Pthreads, Pfunc, TBB; (4) frameworks such as CUDA, OpenCL, etc., which may incorporate elements of all three. If possible, students should write simple parallel programs to implement the same algorithm using as many of the above four notations as time and resources permit.

compiler directives/pragmas: Introduce the basic directives for writing parallel loops, concurrent sections, and parallel tasks using OpenMP. Have the students write simple OpenMP programs.
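
A minimal OpenMP sketch in C of the directives mentioned above: a parallel loop (with a reduction) and two concurrent sections. It assumes an OpenMP-capable compiler (e.g., cc -fopenmp); names and sizes are illustrative.

    /* Minimal OpenMP sketch: a parallel loop and concurrent sections.
     * Compile: cc -fopenmp omp_demo.c */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[2];
        double sum = 0.0;

        #pragma omp parallel for reduction(+:sum)   /* parallel loop with a reduction */
        for (int i = 0; i < N; i++) {
            a[i] = 2.0 * i;
            sum += a[i];
        }

        #pragma omp parallel sections               /* two independent concurrent sections */
        {
            #pragma omp section
            { b[0] = sum; }
            #pragma omp section
            { b[1] = sum / N; }
        }
        printf("sum = %.0f, mean = %.1f\n", b[0], b[1]);
        return 0;
    }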

libraries: The students should be taught how to write parallel programs using a standard language such as C or C++ and a parallel programming library. Depending on the instructor's preference, any library such as Pthreads, Pfunc, TBB, or TPL can be used. An advantage of Pfunc in an educational setting is that it is open source with a fairly unrestricted BSD-type license, and advanced/adventurous students can look at or play with the source. It is designed to permit effortless experimentation with different scheduling policies and other queue attributes.
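
For the Pthreads option, a minimal C sketch of a block-wise array sum: each thread writes only its own partial result, so no lock is needed until the main thread combines the partials after joining. The thread count and array size are arbitrary illustrative choices.

    /* Minimal Pthreads sketch: each thread sums one block of an array.
     * Compile: cc pthread_sum.c -lpthread */
    #include <pthread.h>
    #include <stdio.h>

    #define N        1000000
    #define NTHREADS 4

    static double data[N];
    static double partial[NTHREADS];

    static void *sum_block(void *arg) {
        long t = (long)arg;                 /* thread index 0..NTHREADS-1 */
        long lo = t * (N / NTHREADS), hi = (t + 1) * (N / NTHREADS);
        double s = 0.0;
        for (long i = lo; i < hi; i++) s += data[i];
        partial[t] = s;                     /* each thread writes its own slot: no race */
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < N; i++) data[i] = 1.0;
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, sum_block, (void *)t);
        double total = 0.0;
        for (long t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);
            total += partial[t];
        }
        printf("total = %.0f\n", total);
        return 0;
    }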

Distributed memory: Examples of message passing programs in which each process has its own address space (with one or more threads) and data is shared only via messages

Client/Server: Java RMI or sockets or web services example; the notion of invoking a service in another (server) process, and that client and server may run concurrently

Hybrid: Idea of a single parallel program, with each process maybe running on different hardware (CPU, GPU, other co-processor), and that can be client/server, or MIMD program, or something else

By the control statements:

Task/thread spawning: Thread program examples (Java, pthreads, Cilk), with threads creating and joining with other threads, synchronization, locks, etc.

SPMD: Same code, different data, usually in different processes (so with message passing), but also a style of thread programming; trace an example with at least 2 threads/processes to see that each one can take a different path through the program
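
A minimal MPI sketch of the SPMD style (assuming an MPI installation): every rank executes the same program text, but the branch on the rank value sends rank 0 down a different path than the others.

    /* Minimal MPI SPMD sketch: one program, every rank runs the same code
     * but takes a different path based on its rank.
     * Compile/run: mpicc spmd.c && mpirun -np 4 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                          /* rank 0 takes the "coordinator" path */
            value = 42;
            for (int dest = 1; dest < size; dest++)
                MPI_Send(&value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
            printf("rank 0 sent %d to %d workers\n", value, size - 1);
        } else {                                  /* all other ranks take the "worker" path */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank %d received %d\n", rank, value);
        }
        MPI_Finalize();
        return 0;
    }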

SPMD notations: Introduce/revisit the SPMD model. The students should be taught about programming in a parallel environment where data access is highly nonuniform. Introduce/reintroduce the notion and importance of locality. Introduce BSP. Introduce/revisit data movement costs and the distinction between costs due to latency and bandwidth. Give examples of (and, if possible, a brief introduction to selected) languages/frameworks, such as MPI, CUDA, etc., which can be used to program in the SPMD model.

Data parallel: Example thread and/or message passing programs, SPMD, SIMD, or just shared memory with parallel loops, operating on elements of a large array or other simple data structure

Parallel loop: Examples of data dependences, noting that a parallel loop has no dependences across loop iterations; show that such loops are typically data parallel and that whole iterations can run concurrently; give an example in Fortran or C of a DO-ALL, and perhaps a DO-ACROSS
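
A small C sketch that can anchor this discussion: the first loop is a DO-ALL (no cross-iteration dependence, so iterations could run concurrently), while the second carries a dependence from iteration i-1 to iteration i and cannot be parallelized as written.

    /* Sketch contrasting a DO-ALL loop (no cross-iteration dependence)
     * with a loop that carries a dependence across iterations. */
    #include <stdio.h>

    #define N 1000

    int main(void) {
        double a[N], b[N];
        for (int i = 0; i < N; i++) b[i] = i;

        /* DO-ALL: iteration i touches only a[i] and b[i], so iterations
         * are independent and may execute concurrently (e.g., by placing
         * #pragma omp parallel for in front of the loop). */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];

        /* Loop-carried dependence: iteration i reads a[i-1], which is
         * written by iteration i-1, so iterations must be ordered
         * (a DO-ACROSS would need explicit synchronization). */
        for (int i = 1; i < N; i++)
            a[i] = a[i-1] + b[i];

        printf("%.1f\n", a[N-1]);
        return 0;
    }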

Tools to detect concurrency defects: e.g., Spin, Intel's Parallel Studio/Inspector

Performance issues:

Computation: Simple example tracing parallel loop execution, and how different iterations can take different amounts of time, to motivate scheduling (static or dynamic)
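
A small OpenMP sketch of the static versus dynamic choice, using an artificial loop whose iteration i does work proportional to i (the chunk size 64 is an arbitrary illustrative value).

    /* Sketch of static vs. dynamic scheduling for a loop whose iterations
     * do unequal amounts of work (iteration i loops i times).
     * Compile: cc -fopenmp sched.c */
    #include <omp.h>
    #include <stdio.h>

    #define N 10000

    int main(void) {
        double sum = 0.0;

        /* static: iterations are split into fixed contiguous blocks up front,
         * so the thread holding the high-numbered block does most of the work */
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < i; j++) sum += 1e-6;

        /* dynamic: threads grab small chunks as they finish, which
         * rebalances the uneven iterations at run time */
        #pragma omp parallel for schedule(dynamic, 64) reduction(+:sum)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < i; j++) sum += 1e-6;

        printf("sum = %.2f\n", sum);
        return 0;
    }

Timing the two loops (e.g., with omp_get_wtime) lets students see the load imbalance of the static schedule directly.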

Computation decomposition strategies: There are standard strategies for decomposing a computation and its data for parallel execution

Owner’s compute rule: An example of one decomposition method - assign loop iterations based on which process/thread owns the data for the iteration

Load balancing: What determines the performance of a parallel program? It finishes when all threads/processes finish, so performance is best when all finish at about the same time. Introduce the idea of balancing the load statically and/or dynamically, and when dynamic balancing might be needed (information missing at decomposition time)

Algorithms

Parallel and Distributed Models and Complexity:

Costs of computation:

Asymptotics: See learning outcome for this topic.

time: (1) Review the notion of O(f(n)) asymptotic time complexity of an algorithm, where n is (somehow) related to problem size (e.g., number of elements to be sorted, side-length of a matrix). (2) Adapt the notion to the parallel context by expressing parallel time complexity as O(g(n,p)), where g depends on problem size n and number of cores/processors p. (3) Emphasize that the run time must include the cost of operations, memory access (with possible contention in shared-memory parallel case), and communication (in the distributed-memory parallel case). (4) Introduce parallel speedup and cost-optimality: a parallel algorithm is asymptotically cost optimal if the product of p (the number of cores/processors) and the parallel run time = O(serial run time).

space: Review serial space bound, introduce the notion of parallel space complexity and space optimality, i.e., when the product of p (the number of cores/processors) and the parallel space is of the same order as serial space.

speedup: Introduce and formally define the notion of speedup. Give a simple example, say, by adding n numbers in O(log n) time in parallel with p = n/2. Relate to cost optimality. Present Brent’s Theorem to illustrate limits to parallelization: problems usually have inherently sequential portions. (Come back to this when dependencies are covered.)

Scalability in algorithms and architectures: Revisiting the (adding n numbers) example, show that speedups higher than O(n/log n) can be obtained when p << n. Use the example to show that speedup depends on both n and p; e.g., here, speedup = np/(n + p log p). Introduce the notion of efficiency = speedup/p or, conceptually, the amount of useful/effective work performed per core. Show that efficiency typically drops as p is increased for a fixed n, but can be regained by increasing n as well. Introduce Amdahl's law.
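
Collecting the quantities used in this example (taking the serial time for summing n numbers as T_1 = n and ignoring constant factors):

    T_1 = n                           (serial sum)
    T_p = n/p + log p                 (parallel sum with p processors)
    speedup    S = T_1/T_p = np / (n + p log p)
    efficiency E = S/p     = n  / (n + p log p)
    Amdahl's law (serial fraction f): S <= 1 / (f + (1 - f)/p)

Plugging a fixed n and increasing p into these formulas shows efficiency dropping, and increasing n along with p shows it being regained.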

Model-based notions:

Notions from complexity-theory:

PRAM: (i) Introduce the PRAM model, highlighting its unrealistic assumptions of O(1)-time shared memory access and arithmetic/logical operations, and a global clock synchronizing each step (SIMD architecture). Introduce EREW, CREW, and CRCW (Common, Arbitrary, and Priority) versions for dealing with read-write conflicts; (ii) Illustrate PRAMs' functioning and capability with simple Boolean operations over n bits (OR, AND, NAND, etc.): O(1) time with short circuit evaluation on a Common CRCW model vs. O(log n) using a reduction tree on an EREW; show pseudo-codes. Demonstrate that the simple PRAM model empowers one to explore how much concurrency is available in a problem for purely computational reasons --- when not burdened with memory access and synchronization costs. Illustrate by example how unrealistically large PRAMs can be emulated by real parallel computers as a vehicle for obtaining feasible parallel algorithms.

BSP/CILK: Introduce BSP, highlighting iterative computation wherein multiple processors compute independent subtasks, followed by periodic global synchronizations that allow processors to intercommunicate. The latency of the underlying network is therefore exposed during the communication/synchronization step (which is ignored in the PRAM model). Can illustrate with Boolean OR/AND over n bits or sum/max over n integers, resulting in Omega(n/p + log p) time using p processors. Illustrate by example the use of parallel slack (see the last sentence in the PRAM paragraph).

Notions from scheduling: Take a simple problem such as maximization or summing an array of n integers, and illustrate how the problem can be partitioned into smaller tasks (over subarrays), solved, and then combined (using a task graph structured as a reduction tree or as a centralized “hub-spoke” tree [a/k/a “star”], with all local sums updating a global sum). Use this to illustrate the task graph and the dependencies among parent and child tasks. Alternatively --- or additionally --- consider the floating point sum of two real values, and show its control parallel decomposition into a pipeline. Use this to illustrate task graphs and data dependencies between stages of the pipeline. In either example, calculate the total operation count over all the tasks (work), and identify the critical path determining the lower bound on the parallel time (span).

dependencies: Illustrate data dependencies as above; Mention that handshake synchronization is needed between the producer task and consumer task.

task graphs: Show how to draw task graphs that model dependencies. Demonstrate scheduling among processors when there are fewer processors than the available amount of parallelism at a given level of task graph; illustrate processor reuse from level to level.

work: Calculate work for given task graph using big-O notation.

(make)span: Demonstrate how to identify critical paths in a task graph and calculate a lower bound on parallel time (possibly using big-omega notation). Mention Brent’s Theorem, which is based on the critical-path notion. Give examples (e.g., solving a triangular linear system or performing Gaussian elimination).

Algorithmic Paradigms:

Divide & conquer (parallel aspects): Introduce simple serial algorithms, such as mergesort and/or numerical integration via Simpson's Rule or the Trapezoid Rule. Illustrate Strassen's matrix-multiply algorithm via the simple recursive formulation of matrix multiplication. Show how to obtain parallel algorithms using the divide-and-conquer technique. For Strassen, this should be done after teaching parallel versions of the usual algorithm (Cannon or ScaLAPACK outer product).
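
As one concrete way to show the divide-and-conquer step in code, here is a hedged OpenMP-task sketch of parallel mergesort in C: the two recursive calls become independent tasks, with a cutoff to keep task-creation overhead in check. The cutoff value and array size are arbitrary illustrative choices.

    /* Divide-and-conquer sketch: mergesort with OpenMP tasks.
     * Compile: cc -fopenmp msort.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define CUTOFF 1024

    static void merge(int *a, int *tmp, int lo, int mid, int hi) {
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        while (i < mid) tmp[k++] = a[i++];
        while (j < hi)  tmp[k++] = a[j++];
        memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));
    }

    static void msort(int *a, int *tmp, int lo, int hi) {
        if (hi - lo < 2) return;
        int mid = lo + (hi - lo) / 2;
        if (hi - lo > CUTOFF) {
            #pragma omp task shared(a, tmp)      /* left half as a separate task */
            msort(a, tmp, lo, mid);
            msort(a, tmp, mid, hi);              /* right half in this task */
            #pragma omp taskwait                 /* both halves done before merging */
        } else {                                 /* small subproblems stay serial */
            msort(a, tmp, lo, mid);
            msort(a, tmp, mid, hi);
        }
        merge(a, tmp, lo, mid, hi);
    }

    int main(void) {
        int n = 1 << 20;
        int *a = malloc(n * sizeof(int)), *tmp = malloc(n * sizeof(int));
        for (int i = 0; i < n; i++) a[i] = rand();
        #pragma omp parallel
        #pragma omp single                       /* one thread starts the recursion */
        msort(a, tmp, 0, n);
        printf("spot check: %d <= %d <= %d\n", a[0], a[n/2], a[n-1]);
        free(a); free(tmp);
        return 0;
    }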

Recursion (parallel aspects): Introduce simple recursive algorithm for DFS. Show how a parallel formulation can be obtained by changing recursive calls to spawning parallel tasks. Consider the drawback of this simple parallel formulation; i.e., increased need for stack space.

Series-parallel composition: Illustrate that this pattern is the natural way to solve many problems that need more than one phase/sub-algorithm due to data dependencies. Present one or more examples such as (i) time-series evolution of temperature (or your favorite time-stepped simulation) in a linear or 2D grid (each time step, each grid point is computed as the average of itself and its neighbors), (ii) O(n)-time odd-even transposition sort, or (iii) O(1)-time max-finding on a CRCW PRAM (a composition of phases comprising all-to-all comparisons, followed by row ANDs, followed by identification of the overall winner and output of the max value). It would be valuable to show the task graph and identify the critical path as the composition of the individual critical paths of the constituent phases. A connection with CILK would be valuable, both to illustrate a practical use and to establish nonobvious connections.

Algorithmic problems:

Communication:

Broadcast: Introduce simple recursive doubling for one-to-all and all-to-all among p processes in log p steps. More advanced efficient broadcast algorithms for large messages could also be taught after covering gather, scatter, etc. For example, one-to-all broadcast = scatter + allgather. Also pipelined broadcast for large messages (split into packets and route along same route or along disjoint paths).
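
A minimal MPI sketch of one-to-all broadcast by recursive doubling (assuming the root is rank 0): in step k, every rank that already holds the value forwards it a distance of 2^k, so the broadcast finishes in about log2 p steps. Students can compare this hand-rolled version against MPI_Bcast, which typically uses a similar tree internally.

    /* Sketch of one-to-all broadcast by recursive doubling.
     * Compile/run: mpicc bcast.c && mpirun -np 8 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, p, value = -1;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        if (rank == 0) value = 123;                    /* only the root has the data */

        for (int step = 1; step < p; step *= 2) {
            if (rank < step && rank + step < p)        /* already have it: pass it on */
                MPI_Send(&value, 1, MPI_INT, rank + step, 0, MPI_COMM_WORLD);
            else if (rank >= step && rank < 2 * step)  /* receive it in this step */
                MPI_Recv(&value, 1, MPI_INT, rank - step, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
        printf("rank %d has %d\n", rank, value);       /* MPI_Bcast does the same job */
        MPI_Finalize();
        return 0;
    }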

scatter/gather: See above.

Asynchrony: Define asynchronous events and give examples in shared- and distributed-memory contexts.

Synchronization: Define atomic operations, mutual exclusion, barrier synchronization, etc., examples of these and ways of implementing these. Define race conditions with at least one example and show how to rewrite the code to avoid the race condition in the example.
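
A small OpenMP sketch in C of a race condition and its repair: every thread increments a shared counter, first without synchronization (so updates are lost) and then under an atomic directive. The iteration count is an arbitrary illustrative value.

    /* Sketch of a race condition and its fix.
     * Compile: cc -fopenmp race.c */
    #include <omp.h>
    #include <stdio.h>

    #define ITERS 1000000

    int main(void) {
        long racy = 0, safe = 0;
        int nthreads = 0;

        #pragma omp parallel
        {
            #pragma omp single
            nthreads = omp_get_num_threads();

            for (long i = 0; i < ITERS; i++)
                racy++;                    /* RACE: unsynchronized read-modify-write */

            for (long i = 0; i < ITERS; i++) {
                #pragma omp atomic         /* fix: each increment is atomic */
                safe++;
            }
        }
        printf("expected %ld, racy = %ld (likely wrong), safe = %ld\n",
               (long)nthreads * ITERS, racy, safe);
        return 0;
    }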

Sorting: (i) Explain the parallelization of mergesort wherein each level, from bottom to top, can be merged in parallel using n/2 processors, thus requiring O(2 + 4 + ... + n/4 + n/2 + n) = O(n) time. Using p <= n/2 processors leads to O((n/p) log(n/p) + n) time; hence p = log n is a cost-optimal choice. (ii) Highlight that a barrier (or a lock/Boolean flag per internal node of the recursion tree) on a shared memory machine, or messages from children processors to parent processors on a local memory machine, would be needed to enforce the data dependency; (iii) Mention that faster merging of two size-n/2 subarrays is possible, e.g., in O(log n) time on a CREW PRAM using simultaneous binary search with n processors, thus yielding an O(log^2 n)-time algorithm.

Selection: (i) mention that min/max are special cases of selection problem and take logarithmic time using a reduction tree; (ii) for general case, sorting (e.g., parallel mergesort) is a solution.

Graph algorithms: Basic parallel algorithms for DFS and BFS. Preferably include deriving expressions for time, space, and speedup (in terms of n and p). Parallel formulations and analyses of Dijkstra's single-source and Floyd's all-pairs shortest path algorithms.

Specialized computations: Example problem - matrix multiplication (AxB = C, nxn square matrices): (i) Explain the n^3-processor, O(log n)-time CREW PRAM algorithm, highlighting the amount of parallelism; cost optimality is obtained by reducing the number of processors p to O(n^3/log n) while ensuring O(n^3/p) time (exercise?). (ii) Explain that a practical shared-memory, statically mapped (cyclic or block) algorithm can be derived for p <= n^2 by computing n^2/p entries of the product matrix C in a data independent manner; (iii) For p <= n, the scheduling simplifies to mapping rows or columns of C to processors; mention that memory contention can be reduced by starting the calculation of the ith row at the ith column (exercise?); (iv) For a local memory machine with n processors in a cyclic connection, the last approach yields a simple algorithm by distributing the ith row of A and the ith column of B to P_i and rotating B's columns (the row-column algorithm) - this yields O(n^2) computation and communication time; mention that for p < n, row and column bands of A and B can be employed - derive the O(n^3/p) time (exercise?); (v) For a 2-D mesh, explain Cannon's algorithm (it may be presented as a refinement of the n^2-processor shared-memory algorithm, wherein each element is a block matrix).
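
A minimal OpenMP sketch in C of the practical shared-memory formulation in (ii)/(iii) above: rows of C are statically mapped to threads, and each entry of C is computed independently. The matrix size is an arbitrary illustrative value, and the uniform initial values just make the result easy to check.

    /* Shared-memory matrix multiply sketch: static row mapping, C = A x B.
     * Compile: cc -fopenmp matmul.c */
    #include <stdio.h>

    #define N 512

    static double A[N][N], B[N][N], C[N][N];

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

        #pragma omp parallel for schedule(static)   /* a block of rows of C per thread */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];       /* entries of C are data independent */
                C[i][j] = sum;
            }

        printf("C[0][0] = %.0f (expect %d)\n", C[0][0], 2 * N);
        return 0;
    }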

Termination detection: Define the termination detection problem, and cover, for example:

  • simple message-based termination detection
  • single-pass ring termination detection algorithm
  • double-pass ring termination detection algorithm
  • the Dijkstra-Scholten algorithm
  • Huang's algorithm

Leader election/symmetry breaking: Define the leader election problem, and cover, for example:

  • Leader election in a ring:
    • the Chang and Roberts algorithm
  • General, ID-based leader election:
    • the Bully algorithm

Crosscutting

Why and what is parallel/distributed computing?: examples include multicores, grid, cloud, etc.

Crosscutting topics: can be covered briefly and then highlighted in various contexts

Concurrency: The notion of inherent parallelism can be illustrated by a high level specification of the process to achieve the desired goal. A simple example to consider is sorting – quick-sort or merge-sort. An important idea to illustrate with respect to inherent parallelism is how the level of abstraction in the specification affects the exposed parallelism -- that is, illustrating how some of the inherent parallelism may be obscured by the way the programmer approaches the problem solution and the constructs provided by the programming language. Another important idea to illustrate is that of nesting – a higher level step may itself allow exploitation of parallelism at a finer grain. Yet another important idea is the need to weigh the available parallelism against the overhead involved in exploiting it.

Non-determinism: Non-determinism is an inherent property when dealing with parallel and distributed computing. It can be easily illustrated by discussing real-life examples where, e.g., different runs of a parallel job give different answers due to non-determinism of floating-point addition. The dangers of this can be illustrated by talking about order of operations and the need for synchronization to avoid undesirable results.

Locality: The performance advantages of locality are easy to explain and can be illustrated by taking examples from a wide spectrum of data access scenarios. This includes cache data locality in the programming context, memory locality in the paging context, disk access locality, locality in the context of virtualization and cloud computing, etc. Both spatial and temporal aspects of locality must be clarified by illustrating situations where only one or both may be present. Simple eviction/prefetching policies to take advantage of locality should also be illustrated with examples. The relationship of temporal locality to the notion of working set should also be explained.

Power consumption: Power consumption of IT equipment is a topic of increasing importance. Some general principles of power savings, such as the use of sleep states and reduced voltage/frequency operation, can be introduced along with their impacts on power consumption, performance, and responsiveness. It is also important to make a distinction between reducing power vs. reducing energy consumption by using simple examples. Finally, this topic provides a perfect opportunity to discuss the role of user behavior and behavior change in truly reducing IT energy consumption.

Fault tolerance: Fault tolerance is a fundamental requirement to ensure robustness and becomes increasingly important as the size of the system increases. In a system composed of a large number of hardware elements (e.g., processing cores) or software elements (e.g., tasks), the failure of a few is almost a given, but this should not disrupt or silently corrupt the overall functioning. Some important aspects to cover include: an introduction to the increasing need for fault tolerance illustrated by simple equations, a brief classification of faults (transient, stuck-at, byzantine, ...), and an illustration of some basic techniques to deal with them (retry, coding, replication and voting, etc.).

Performance modeling: Performance is a fundamental issue at all levels of computing and communications, and thus needs to be addressed in most topics, including architecture, programming, and algorithms; indeed, performance considerations appear in all of these areas. In addition, it is important for students to learn basic techniques for analyzing the performance impact of contention for shared resources. The basic concepts include the idea of a queuing system, infinite server vs. single server queues, stability of queuing systems, the utilization law, Little's law, open and closed networks of resources and applying Little's and the utilization laws, memoryless behavior, and simple M/M/c queuing analysis. The ideas can be illustrated with examples from architecture, networks, and systems.

Current/Hot/Advanced Topics:

Cluster Computing: A cluster is characterized by a set of largely homogeneous nodes connected with a fast interconnect and managed as a single entity for the purposes of scheduling and running parallel and distributed programs. Both shared memory and message passing alternatives can be briefly discussed along with their pros and cons. Cluster computing can be illustrated by using the specific example of a Beowulf cluster, and briefly discussing the use of MPI and MapReduce paradigms for solving a simple distributed computing problem.

Cloud/Grid Computing: The notion of virtualization is crucial for understanding cloud computing and should be briefly covered, along with the structure of some popular virtualization software such as VMware and Xen. Cloud computing can then be introduced as computing assisted by machine, network, and storage virtualization. A brief discussion of VM scheduling and migration is essential to provide an overview of how cloud computing works. Cloud storage and cloud services concepts can be easily illustrated by using the example of Dropbox, a popular cloud storage service. Alternatively (or in addition), a hands-on demonstration of how resources can be requested and used on a commercial platform such as Amazon EC2 is very useful for introducing cloud computing to students. Grid computing can be briefly introduced along with a brief mention of the Globus toolkit.

Peer to Peer Computing: The students are likely to have already made a good deal of use of available P2P services such as BitTorrent, and those could be used as a starting point for a discussion of P2P. The important concepts to get across in P2P are: (a) the notion of give and take in cooperative P2P services, (b) structured vs. unstructured content organization and searches, and (c) pros and cons of P2P vs. client-server based implementations of services. The structure of BitTorrent, and the key concepts of file segmentation, seed, leecher, tit-for-tat, and choking should be explained briefly. Skype can be introduced as another form of P2P application.

Distributed Transactions: Consistency maintenance in the face of concurrent updates is a crucial learning outcome that can be illustrated with a simple database update example. The need for both strict consistency and looser forms of consistency should be illustrated by using appropriate examples (e.g., banking vs. web browsing). The notions of optimistic and pessimistic concurrency control can be illustrated by a simple example. Consistency in the presence of multiple copies is a somewhat more advanced topic that can be introduced briefly.

Security and privacy: Security and privacy concerns multiply both with an increase in the size of the systems (in terms of number of independent agents and information repositories), and with an increase in intelligence (which requires that more detailed information be learnt and shared). Examples to illustrate the risks can be drawn from social networks, customer data learnt/maintained by Google, Amazon, and other prominent companies, or even from emerging areas such as matching supply and demand in a smart grid. The tradeoff between security, privacy, and intelligence of operations can also be illustrated with these examples.

Web searching: Web searching requires a substantial amount of distributed processing and pre-computation in order to provide quick answers. A brief discussion of web-crawling to locate new web pages and detect deleted ones, building indexes for quick search, parallel search, dealing with multiple cached copies, and ranking of search results is important to convey the range of activities involved in web-search. The ideas are best illustrated via a concrete example assuming a popular search engine such as Google.

Social networking: Social networking is by now well entrenched and it is likely that most beginning CS/CE students have already used one or more such services. The purpose of this topic is to sensitize the students to ways in which social networking information can be exploited to provide enhanced services that account for social context and other information derived from social networking data. The tradeoff between usability and privacy can also be illustrated using these examples.

Collaborative computing: Collaborative computing refers to active involvement of multiple users (or devices) to accomplish some objective. Examples of collaborative computing include shared document editing (e.g., Google docs), multiplayer games, and collaboration between enterprises. Some of these applications can be discussed briefly along with some important distributed systems concepts (e.g., consistency and synchronization) applied to them.

Web services: Web services form the basis for browser based interaction between users and the enterprise servers that provide the relevant data and functionality. The course can illustrate web services by a simple programming exercise to fetch, say, the current stock price using the Java or .Net framework. The course should also introduce the basic web-services infrastructure, such as publication via UDDI, description of functionality via WSDL, and invocation via SOAP RPC using XML.

Pervasive/Mobile computing: Mobile computing, possibly assisted by cloud computing for offloading heavy-duty computation and intelligent decision-making is emerging as a way to support applications of importance to a community or society at large. Such applications include monitoring and understanding of evolving real-world events such as traffic congestion on highways, unfolding disasters, or social mood. Pervasive computing covers an even larger area that includes ad hoc or embedded devices such as surveillance cameras in malls, occupancy sensors in rooms, seismic sensors, etc. While illustrating these as emerging examples of distributed computing, the unique aspects of these environments can be briefly discussed, e.g., no coupling between devices, ad hoc & changing topologies, large scale, possibility of exploiting context, security, privacy and reliability issues, etc.