NSF/IEEE-TCPP Curriculum Initiative on Parallel and Distributed Computing - Core Topics for Undergraduate

5. Architecture Topics

5.1 Rationale

Existing computer science and engineering curricula generally include the topic of computer architecture. Coverage may range from a required course to a distributed set of concepts that are addressed across multiple courses. 

As an example of the latter, consider an early programming course where the instructor introduces parameter passing by explaining that computer memory is divided into cells, each with a unique address. The instructor could then go on to show how indirect addressing uses the value stored in one such cell as the address from which to fetch a value for use in a computation. There are many concepts from computer architecture bound up in this explanation of a programming language construct. 
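
The idea can be made concrete with a few lines of C++ (a sketch; names are illustrative):

```cpp
#include <iostream>

int main() {
    int value = 42;        // one memory cell, holding data
    int* address = &value; // another cell, holding the address of the first

    // Indirect addressing: use the address stored in 'address' to fetch
    // the value held in the cell it refers to.
    std::cout << *address << '\n'; // prints 42
    return 0;
}
```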

While the recommended topics of parallel architecture could be gathered into an upper-level course and given explicit treatment, they can similarly be interwoven into lower-level courses. Because multicore, multithreaded designs with vector extensions are now mainstream, more languages and algorithms are moving to support data and thread parallelism. Students will therefore naturally encounter parallel architecture concepts earlier in their core courses.

Similarly, with their experience of social networking, cloud computing, and ubiquitous access to the Internet, students are familiar users of distributed computation, and so it is natural for them to want to understand how architecture supports these applications. Opportunities arise at many points, even in discussing remote access to departmental servers for homework, to drop in remarks that open students’ eyes with respect to hardware support for distributed computing. 

Introducing parallel and distributed architecture into the undergraduate curriculum goes hand in hand with adding topics in programming and algorithms. Because practical languages and algorithms bear a relationship to what happens in hardware, explaining the reasoning behind a language construct, or why one algorithmic approach is chosen over another, will involve a connection with architecture.

A shift to “thinking in parallel” is often cited as a prerequisite for the transition to widespread use of parallelism. The architecture curriculum described here anticipates that this shift will be holistic in nature, and that many of the fundamental concepts of parallelism and distribution will be interwoven among traditional topics. 

There are many architecture topics that could be included, but the goal is to identify those that most directly impact and inform undergraduates, and which are well established and likely to remain significant over time. For example, GPUs are a current hot topic, but even if they lose favor, the underlying mechanisms of multithreading and vector parallelism have been with us for over four decades and will remain significant, because they arise from fundamental issues of hardware construction. 

The guideline divides architecture topics into six major areas: classes of parallelism, underlying mechanisms, floating-point representation, performance metrics, power, and scaling. Floating-point representation is assumed to be covered in the standard curriculum already; it is included here to underscore that in high-performance parallel computing, where issues of precision, error, and round-off are amplified by the scale of the problems being solved, students must appreciate the limitations of the representation.

Classes of Parallelism topics are meant to encourage coverage of the major ways in which computation can be carried out in parallel by hardware. Understanding the differences is key to appreciating why different algorithmic and programming paradigms are needed to effectively use different parallel architectures. The classes include data and control parallelism, pipelining, and communication via shared memory or message passing. 

Underlying Mechanisms has replaced the Memory Hierarchy section in this version of the curriculum because many of the associated concepts scale to distributed systems. Caching is covered in the traditional curriculum, but when parallelism and distribution come into play, the issues of atomicity, consistency, and coherence affect the programming paradigm, where they appear, for example, in the explanation of thread-safe libraries. The revised curriculum also addresses the problem of false sharing and the issues that arise as systems scale up to handle larger data sets. Additionally, interrupts and event handling are fundamental in many contexts involving concurrency and distribution, starting even from a basic understanding of graphical user interfaces. Handshaking, as an alternative to synchronous communication, is an aspect of communication between processors and devices at multiple levels of granularity. Lastly, we have added a topic on process and system ID as a basic necessity for identifying the source and destination in a communication, as well as for symmetry breaking in programming parallel and distributed systems.

Performance Metrics present unique challenges in the presence of PDC because asynchrony results in unrepeatable behavior. In particular, it is much harder to approach the peak performance of PDC systems than that of serial architectures.

Power has been added as a section in the revised curriculum because it has grown in significance since the original recommendations were issued. At the low end, mobile devices are very sensitive to power efficiency, and that has resulted in the use of multiple cores, sometimes with heterogeneous performance levels. At the high end, supercomputer architectures are constrained by the availability and cost of power. Most of this material is already being addressed in computer architecture courses. 
 
Scaling has been added in the revision to reflect the realities of HPC and big data applications, where previously minor factors grow in significance to have major effects. These are mostly topics for upper-level courses, with some preliminary coverage in the core courses so that students are aware that the simple approaches they are initially learning will not be suitable for large problems. These topics include reliability, fault tolerance, data-bound computation, memory-bound computation, scale-out implications, and data movement cost versus computation cost.
 
Many of the architecture learning goals are listed as belonging to an architecture course or a systems course (which may be a traditional computer organization course or a more general systems principles course). The teaching examples, however, describe ways that some of them can be introduced in lower-level courses. Some topics are indicated as belonging to other advanced courses; these are not included in the core curriculum, but are provided as guidance for topical coverage in electives, should a department offer such courses.

5.2 Updates from version 1.0

The updates to the Architecture table are described in detail above. To summarize, the Memory Hierarchy section has been generalized and renamed Underlying Mechanisms with several new topics. Some less relevant benchmarking topics (such as means) were deleted from Performance Metrics. New sections on Power and Scaling have been added. Some updates have been made to the section on Classes of Parallelism to reflect the growth in reliance on GPUs and heterogeneous cores. There is now more coverage of distributed systems in the learning outcomes throughout. 

Table 2: Architecture Topics

Each entry below gives the topic, the core Bloom level with its learning outcomes and teaching suggestions, the advanced Bloom level with its learning outcomes, and the courses where the topic is covered. Bloom levels are abbreviated by letter (K, C, A); N marks a level at which the topic has no expected coverage.

Classes of Parallelism

Taxonomy
Core (C): Flynn's taxonomy, data vs. control parallelism, shared/distributed memory.
Advanced (C): Arch2: Flynn's taxonomy, data vs. control parallelism, shared/distributed memory (additional depth).
Where covered: Systems; Arch2.

Data parallelism

Superscalar (ILP)
Core (K): Describe opportunities for multiple instruction issue and execution (different instructions on different data).
Advanced (A): Explain how superscalar execution works and how to schedule instructions for wider issue.
Where covered: Systems; Arch2.

SIMD/Vector (e.g., AVX, GPU, TPU)
Core (K): Describe uses of SIMD/vector units (the same operation on multiple data items), e.g., accelerating graphics for games.
Advanced (C): Know the relationship of vector extensions to separate accelerators, and how they operate.
Where covered: Systems; Arch2.
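
A hedged teaching sketch for the core outcome above (not from the guideline): with x86 AVX intrinsics, a single instruction adds eight pairs of floats. It assumes a compiler with AVX enabled (e.g., -mavx); a plain loop over the same arrays is often auto-vectorized into equivalent code.

```cpp
#include <immintrin.h> // AVX intrinsics
#include <cstdio>

int main() {
    alignas(32) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(32) float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    alignas(32) float c[8];

    __m256 va = _mm256_load_ps(a);     // load 8 floats into a 256-bit register
    __m256 vb = _mm256_load_ps(b);
    __m256 vc = _mm256_add_ps(va, vb); // one instruction, 8 additions
    _mm256_store_ps(c, vc);

    for (float x : c) std::printf("%g ", x); // prints 9 eight times
    std::printf("\n");
    return 0;
}
```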


SIMD/Vector energy effects
Core (K): Know that sharing one instruction over many instruction executions, whether in parallel (SIMD) or pipelined (vector), saves energy.
Advanced (K): Arch2: the same material in additional depth.
Where covered: Systems; Arch2.

Dataflow
Core: N (not covered).
Advanced (K): Be aware of this alternative execution paradigm.
Where covered: Arch2.

Pipelines

Basic structure and organization
Core (C): Describe the basic pipelining process (multiple instructions can execute at the same time) and the stages of instruction execution.
Advanced (C): Arch2: describe basic and more advanced pipelining, including superpipelining and superscalar issue and commit (e.g., scoreboard, reservation stations, reorder buffer), and the stages of instruction execution in additional depth, such as forwarding, multithread handling, and shelving.
Where covered: Systems; Arch2.

Data and control hazards
Core (K): Show examples of how one pipe stage can depend on a result from another, or how delayed branch resolution can start the wrong instructions in a pipe, reducing performance.
Advanced (C): Compilers: instruction scheduling for modern pipelined systems. Arch2: pipeline effects including stalling, shelving, and restarting; mechanisms for avoiding performance losses, such as branch prediction, predication, forwarding, and multithreading. DistSystems: understanding equivalent ideas as expressions of time and local versus global knowledge.
Where covered: Systems; Compilers; Arch2; DistSystems.

OoO execution, speculation
Core: N (not covered).
Advanced (C): Be able to work through examples of scoreboards, reservation stations, and reorder buffers; the difference between complete and commit; and speculation (including security issues).
Where covered: Arch2.

Streams (e.g., GPU)
Core (K): Know that stream-based architecture exists in GPUs.
Advanced (C): Be able to describe basic GPU architecture with streaming multiprocessors, threads, warps, and the memory hierarchy.
Where covered: Systems; Arch2.

Control parallelism

MIMD
Core (K): Identify MIMD instances in practice (e.g., multicore, cluster), and know the difference between execution of tasks and threads.
Where covered: Systems.

Multi-Threading
Core (K): Distinguish multithreading from multicore (based on which resources are shared).
Advanced (C): Be able to explain the difference between coarse-grain, fine-grain, and simultaneous multithreading.
Where covered: Systems; Arch2.
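
A brief sketch of the software side of this distinction, using standard C++ threads (illustrative only): the reported concurrency counts hardware threads, which conflates cores with per-core multithreading (e.g., SMT); whether two software threads share a core is decided by the hardware and OS, not by this code.

```cpp
#include <iostream>
#include <thread>
#include <vector>

int main() {
    // Logical processors = cores x hardware threads per core.
    unsigned n = std::thread::hardware_concurrency();

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([i] { (void)i; /* per-thread work goes here */ });
    for (auto& t : workers) t.join();

    std::cout << n << " hardware threads reported\n";
    return 0;
}
```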


Scaling of Multithreading (e.g., GPU, IBM Power series)
Core: N (not covered).
Advanced (K): Have an awareness of the potential and limitations of thread-level parallelism in different kinds of applications.
Where covered: Arch2.

Multicore
Core (C): Describe how cores share resources (cache, memory) and resolve conflicts.
Advanced (C): Be able to describe the relationship of cores to the memory hierarchy, and the reason for switching from increasing clock rate to increasing core count.
Where covered: Systems; Arch2.

Heterogeneous (e.g., GPU, security, AI, low-power cores)
Core (K): Recognize that the cores in a multicore chip may not all be the same kind (a mix of organizations and instruction sets), with a high-level explanation of the benefits and costs.
Advanced (C): Be able to describe the performance benefits of integrating heterogeneous cores on a chip in the context of applications with heterogeneous parallelism. Be able to cite reasons why some high-performance accelerators sit on separate chips, such as the need for greater silicon real estate.
Where covered: Systems; Arch2.

Heterogeneous architecture energy (small vs. big cores, CPU vs. GPU, accelerators)
Core (K): Know that heterogeneity saves energy by using power-efficient cores when there is sufficient parallelism.
Advanced (C): Enumerate energy benefits from heterogeneity in architecture; illustrate the power/performance tradeoff. Discuss some combination of high variability with small feature sizes, reliability issues with lower voltages, redundancy techniques to enhance reliability, and performance/reliability and power/reliability tradeoffs.
Where covered: Systems; Arch2.

Shared memory

Symmetric Multi-Processor (SMP)
Core (K): Explain the concept of a uniform-access shared memory architecture.
Where covered: Systems.

Buses for shared-memory implementation
Core (C): Be able to explain issues related to shared memory being a single resource: limited bandwidth and latency, snooping, and scalability.
Advanced (C): Arch2: describe different arbitration schemes and tradeoffs. DistSystems: explain the need for symmetry breaking.
Where covered: Systems; Arch2; DistSystems.

Broadcast (snooping)-based Cache-Coherent Non-Uniform Memory Access (CC-NUMA)
Core: N (not covered).
Advanced (K): Be aware that caches in the context of shared memory depend on coherence protocols, and understand the idea of snooping (protocols are addressed in a separate topic).
Where covered: Arch2.

Directory-based CC-NUMA
Core: N (not covered).
Advanced (C): Arch2: be aware that bus-based sharing doesn't scale, and that directories offer an alternative based on distributed memory hardware. DistSystems: see how shared memory can be extended to loosely coupled systems with an appropriate consistency model.
Where covered: Arch2; DistSystems.

Distributed memory

Message passing (no shared memory)
Core: N (not covered).
Advanced (K): Arch2: shared memory architecture breaks down when scaled, due to physical limitations (latency, bandwidth), which leads to message passing architectures. ParProg: be aware of the effect of network hardware support in the context of, e.g., MPI and PGAS implementations.
Where covered: Arch2; ParProg.

Topologies
Core: N (not covered).
Advanced (C): Algo2: various graph topologies (linear, ring, mesh/torus, tree, hypercube, clique, crossbar) and their properties (e.g., diameter, bisection width). DistSystems: understand the network as a means of communication among distributed nodes, and the effects of topology.
Where covered: Algo2; ParProg; DistSystems.

Latency
Core (K): Know the concept, its implications for scaling, and its impact on the work/communication ratio needed to achieve speedup. Have an awareness of the aspects of latency in parallel systems and networks that contribute to the idea of time accumulating over dependent steps or distance.
Advanced (C): Algo2: give examples of how communication latency contributes to the complexity analysis of an algorithm. DistSystems: show awareness of hardware aspects of latency in parallel systems and networks that contribute to the concepts of distributed time and of local versus global knowledge; differences between synchronous and asynchronous systems.
Where covered: Systems; Algo2; DistSystems.
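
One way to make the core outcome concrete is a back-of-envelope model (all numbers below are hypothetical, not from the guideline): parallel time is compute time divided by processor count plus a fixed per-message latency cost, so communication latency caps the achievable speedup.

```cpp
#include <cstdio>

int main() {
    double compute_s = 10.0;   // serial compute time, seconds (hypothetical)
    double latency_s = 1e-4;   // per-message latency, seconds (hypothetical)
    int    messages  = 50000;  // messages exchanged during the run

    int ps[] = {1, 4, 16, 64};
    for (int p : ps) {
        // Compute shrinks with p; the latency term does not.
        double parallel_s = compute_s / p + latency_s * messages;
        std::printf("p=%2d  time=%6.2fs  speedup=%5.2f\n",
                    p, parallel_s, compute_s / parallel_s);
    }
    return 0;
}
```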


Bandwidth
Core (K): Know the concept, how it limits sharing, and considerations of data movement cost.
Where covered: Systems.

Circuit and packet switching
Core: N (not covered).
Advanced (C): Arch2: know that interprocessor communication can be managed using switches in networks of wires to establish different point-to-point connections; that the topology of the network affects efficiency; and that some connections may block others. Know that interprocessor communications can be broken into packets that are redirected at switch nodes in a network, based on header information. Networking: be able to explain basic functionality and limitations within routers.
Where covered: Arch2; Networking.

Routing
Core: N (not covered).
Advanced (C): Arch2: know that messages in a network must follow an algorithm that ensures progress toward their destinations, and be familiar with common techniques such as store-and-forward or wormhole routing. Networking: be able to explain wide-area routing, packet latency, and fault tolerance. DistSystems: see the impact on distributed systems, and see adaptive routing as an instance of load balancing.
Where covered: Arch2; Networking; DistSystems.

Underlying Mechanisms

Caching
Core (C): Know the cache hierarchies, and that shared caches (vs. private caches) result in coherency and performance issues for software.
Advanced (C): Arch2: explain why coherence is complicated by synonym resolution within a virtual memory system; know that TLBs are affected by sharing in a multicore context; be aware of emerging non-volatile memory technology in the hierarchy. DistSystems: show how network latency can be hidden by caching, but that caching also introduces coherence issues.
Where covered: Systems; Arch2; DistSystems.

Atomicity
Core (K): CS2: show awareness of the significance of thread-safe data structures in libraries. Systems: explain the need for atomic operations and their use in synchronization, both in local and distributed contexts.
Advanced (C): Arch2: understand implementation mechanisms for indivisible memory access operations in a multicore system. OS: be able to use atomic operations in the management of critical sections locally, in a multiprocessor. DistSystems: be able to recognize or give examples of atomicity in a distributed system.
Where covered: CS2; Systems; Arch2; OS; DistSystems.
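
A minimal sketch of the core idea in standard C++: two threads increment a shared counter. The atomic version always prints 200000; replacing std::atomic<int> with a plain int is a data race and typically loses updates.

```cpp
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> counter{0};

void work() {
    for (int i = 0; i < 100000; ++i)
        counter.fetch_add(1, std::memory_order_relaxed); // indivisible read-modify-write
}

int main() {
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::cout << counter << '\n'; // 200000
    return 0;
}
```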


Consistency
Core: N (not covered).
Advanced (C): Arch2: be able to explain implementation mechanisms for consistent views of data in sharing. OS: be able to explain consistency from a process management perspective. ParProg: be able to explain consistency from the perspective of language or library models. DistSystems: be able to explain consistency from the perspective of coping with high latency.
Where covered: Arch2; OS; ParProg; DistSystems.

Coherence
Core: N (not covered).
Advanced (C): Arch2: examine protocols and their implementation and performance differences. ParProg, OS: describe how cores share cache and resolve conflicts. DistSystems: protocols for wide-area coherence.
Where covered: Arch2; OS; ParProg; DistSystems.

● False sharing
Core: N (not covered).
Advanced (C): Arch2: explain in the context of excess coherence traffic. ParProg: learn how to avoid it by structuring data and orchestrating access. DistSystems: explain how it can occur with remote data.
Where covered: Arch2; ParProg; DistSystems.
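
A sketch of the ParProg outcome, assuming 64-byte cache lines (typical on x86, but an assumption here): padding per-thread data to separate lines avoids the coherence ping-pong of false sharing.

```cpp
#include <iostream>
#include <thread>

struct alignas(64) PaddedCounter { // one counter per cache line
    long value = 0;
};

// Without alignas(64), both counters could share a line, and the two cores
// would invalidate each other's copy on every write.
PaddedCounter counters[2];

void work(int id) {
    for (long i = 0; i < 10000000L; ++i)
        ++counters[id].value; // each thread touches only its own line
}

int main() {
    std::thread t0(work, 0), t1(work, 1);
    t0.join();
    t1.join();
    std::cout << counters[0].value + counters[1].value << '\n'; // 20000000
    return 0;
}
```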


Interrupts and event handling
Core (K): CS2: know that event handling is an aspect of graphical user interfaces. Systems: know that I/O is mostly interrupt-driven.
Advanced (C): Arch2: know how interrupts are implemented, including the vector table, masking, and the timers used in multiprocessing; be able to explain handshaking protocols. OS: be able to work with interrupt handling, including interrupts from other processors and network sources, and the use of timer interrupts in multiprocessors. DistSystems: be able to use network interrupts in a distributed system and understand handshaking protocols for communications.
Where covered: CS2; Systems; Arch2; OS; DistSystems.

Handshaking
Core (K): Know the difference between synchronous and asynchronous communication protocols, including in the context of multiple processors.
Advanced (C): Arch2: be able to explain handshaking protocols for buses. Networking: be able to explain handshaking in network protocols. DistSystems: be able to use handshaking protocols for communication.
Where covered: Systems; Arch2; Networking; DistSystems.

Process ID, System/Node ID
Core (K): Explain the need for a process identifier to enable interprocess communication, and for a node identifier in the initialization of multiprocessor systems.
Advanced (C): Arch2: know how the process ID is implemented; see the need for a node ID for symmetry breaking. ParProg: be able to use the process ID for communication and the node ID for symmetry breaking in programs. DistSystems: know how the node ID is a key element of local knowledge.
Where covered: Systems; Arch2; ParProg; DistSystems.
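
A sketch of the ParProg outcome using MPI (compile with an MPI toolchain such as mpicxx and launch with mpirun): each process reads its rank, and rank 0 is singled out as coordinator, a simple symmetry-breaking convention.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); // this process's ID
    MPI_Comm_size(MPI_COMM_WORLD, &size); // how many processes exist

    if (rank == 0)                        // identical code, different IDs:
        std::printf("coordinator: %d processes total\n", size);
    else
        std::printf("worker %d reporting\n", rank);

    MPI_Finalize();
    return 0;
}
```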


Virtualization Support
Core: N (not covered).
Advanced (C): Arch2, OS, DistSystems: enumerate the pros and cons of virtualization for security, load balancing, and quality-of-service management, and/or the issues of supporting it with good performance and security.
Where covered: Arch2; OS; DistSystems.

Floating Point Representation

Note: these topics are supposed to be in the ACM/IEEE core curriculum already; they are included here to emphasize their importance, especially in the context of PDC for large problems.

Range
Core (K): Explain why range is limited, and the implications of infinities.
Where covered: CS1; CS2; Systems.

Precision
Core (K): Know how single- and double-precision floating point numbers impact software performance, and basic ways to avoid loss of precision in calculations.
Where covered: CS1; CS2; Systems.
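
A small demonstration of the core outcome: summing ten million copies of 0.1 in single precision drifts badly, while double precision and Kahan compensated summation (one standard remedy) stay close to the exact answer.

```cpp
#include <cstdio>

int main() {
    float  naive_f = 0.0f;
    double naive_d = 0.0;
    float  kahan = 0.0f, comp = 0.0f; // compensated sum and its running correction

    for (int i = 0; i < 10000000; ++i) {
        naive_f += 0.1f;
        naive_d += 0.1;

        float y = 0.1f - comp;        // Kahan: re-inject previously lost low-order bits
        float t = kahan + y;
        comp = (t - kahan) - y;       // what was lost in this addition
        kahan = t;
    }

    std::printf("float: %f  double: %f  kahan float: %f  (exact: 1000000)\n",
                naive_f, naive_d, kahan);
    return 0;
}
```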


Rounding issues
Core: N (not covered).
Advanced (C): Arch2: explain differences in rounding modes. Algo2: give examples of accumulation of error and loss of precision. ParProg: explain the need for numerical analysis to ensure that the accumulation of error in long chains of calculation over large data sets is controlled.
Where covered: Arch2; Algo2; ParProg.

Error propagation
Core (K): Be able to explain NaN and Infinity values and how they affect computations and exception handling.
Where covered: CS2.
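
A short demonstration of how the special values behave: overflow produces Infinity, Infinity minus Infinity produces NaN, and NaN compares unequal even to itself.

```cpp
#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    double big  = std::numeric_limits<double>::max();
    double inf  = big * 2.0;  // overflow -> +Infinity
    double nanv = inf - inf;  // Infinity - Infinity -> NaN

    std::printf("inf = %f, nan = %f\n", inf, nanv);
    std::printf("nanv == nanv is %s\n", nanv == nanv ? "true" : "false"); // false
    std::printf("std::isnan(nanv) = %d\n", std::isnan(nanv));            // 1
    return 0;
}
```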


IEEE 754 standard
Core (K): Representation, range, precision, rounding, NaN, infinities, subnormals, comparison, and the effects of casting to other types.
Where covered: CS1; CS2; Systems.

Performance Metrics

Instructions per cycle
Core (C): Explain the pros and cons of IPC as a performance metric across various pipelined implementations.
Advanced (C): Describe the impact of multithreading on the IPC calculation.
Where covered: Systems; Arch2.
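
A worked example with hypothetical counts (IPC = instructions retired / cycles): a 4-wide superscalar core has a peak IPC of 4, and stalls keep real programs well below it.

```cpp
#include <cstdio>

int main() {
    long long instructions = 8000000000LL; // retired instructions (hypothetical)
    long long cycles       = 5000000000LL; // elapsed core cycles (hypothetical)

    double ipc = static_cast<double>(instructions) / cycles;
    std::printf("IPC = %.2f (peak for a 4-wide core: 4.00)\n", ipc); // 1.60
    return 0;
}
```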


Benchmarks
Core (K): Awareness of various benchmarks (such as LinPack and the NAS parallel benchmarks) and how they test different aspects of performance in parallel systems.
Where covered: Systems.

● Bandwidth benchmarks
Core: N (not covered).
Advanced (K): Be aware that there are benchmarks focusing on data movement instead of computation, that different architectural aspects contribute to bandwidth, and that bandwidth is defined with respect to the source and destination points across which it is measured.
Where covered: Arch2.

Memory bandwidth
Core: N (not covered).
Advanced (C): Be able to explain the significance of memory bandwidth with respect to multicore access and different contending workloads, and the challenge of measuring it.
Where covered: Arch2.

Network bandwidth
Core: N (not covered).
Advanced (C): Know how network bandwidth is specified, and explain the limitations of the metric for predicting performance, given different workloads that communicate with different contending patterns.
Where covered: Arch2.

Peak performance
Core (C): Explain what peak performance is and why it is rarely valid for estimating real performance; illustrate the fallacies.
Where covered: Systems.

● MIPS/FLOPS
Core (K): Know the meaning of the terms.
Where covered: Systems.

Sustained performance
Core (C): Know the difference between peak and sustained performance, how to define and measure sustained performance, and the relevant benchmarks.
Where covered: Systems.

Power

Power, Energy
Core (C): Be able to explain the difference between power (the rate of energy usage) and energy.
Where covered: Systems.
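
A worked example of the distinction, with hypothetical numbers: power is a rate (watts, i.e., joules per second), while energy is the total (joules).

```cpp
#include <cstdio>

int main() {
    double power_w   = 150.0;               // chip draws 150 W while running (hypothetical)
    double runtime_s = 120.0;               // job takes 2 minutes (hypothetical)
    double energy_j  = power_w * runtime_s; // energy = power x time

    std::printf("%.0f W for %.0f s = %.0f J (%.3f kWh)\n",
                power_w, runtime_s, energy_j, energy_j / 3.6e6);
    return 0;
}
```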


Large scale systems, distributed embedded systems
Core: N (not covered).
Advanced (K): Know about active power states and power reduction of CPUs using dynamic voltage and frequency scaling (DVFS), and idle power management using sleep states. An advanced architecture course could also cover link power reduction in multilane serial links, and power management as it relates to large-scale systems and low-power IoT-type devices.
Where covered: Arch2.

Power density
Core: N (not covered).
Advanced (K): Be aware that power density creates challenges in the thermal management of chip designs, for cooling technologies, and for the supply of power to a compute unit (e.g., a rack), and that these challenges grow in modern architectures.
Where covered: Arch2.

Static vs. dynamic power
Core: N (not covered).
Advanced (C): Grasp the concept that some energy usage is relatively constant, while some varies with compute load or data content (1's and 0's). Explain the significance of leakage current as chip feature size shrinks, and how scaling up core count may require a reduced clock rate to compensate.
Where covered: Arch2.
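
A sketch of the classic dynamic-power relation, P proportional to C V^2 f, with activity and capacitance folded into one illustrative constant: because voltage enters squared, two slower cores at lower voltage can deliver the same nominal throughput for less power. All values below are hypothetical.

```cpp
#include <cstdio>

// Relative dynamic power; k stands in for switched capacitance and activity.
double dynamic_power(double volts, double freq_ghz, double k = 1.0) {
    return k * volts * volts * freq_ghz;
}

int main() {
    double one_fast = dynamic_power(1.2, 3.0);     // one core: 1.2 V, 3 GHz
    double two_slow = 2 * dynamic_power(0.9, 1.5); // two cores: 0.9 V, 1.5 GHz

    // Same nominal throughput (2 x 1.5 GHz vs. 1 x 3 GHz), less power.
    std::printf("1 core @3.0GHz: %.2f   2 cores @1.5GHz: %.2f (relative units)\n",
                one_fast, two_slow); // 4.32 vs. 2.43
    return 0;
}
```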


Clock scaling / clock gating / power gating / thermal controls
Core: N (not covered).
Advanced (K): Know that clock scaling/gating is a technique for reducing the power used by latches and other storage units that do not need to change. Know that modern systems have thermal sensors that guide clock scaling.
Where covered: Arch2.

Power efficient HW design: simpler cores, short pipelines
Core: N (not covered).
Advanced (C): Explain the power-efficiency advantages of simpler designs. To apply the concept, students could work an exercise comparing an out-of-order core with an in-order core, given hypothetical performance and energy values.
Where covered: Arch2.

Scaling (HPC, Big Data)

Reliability and fault tolerance issues (cross-cutting topics of the current curriculum)
Core: N (not covered).
Advanced (K): Large-scale parallel/distributed hardware/software systems are prone to components failing, but the system as a whole needs to keep working.
Where covered: Arch2.

Hardware support for data bound computation
Core: N (not covered).
Advanced (C): May include caching, prefetching, RAID storage, and high-performance networks such as InfiniBand. Know that modern architectures are starting to provide hardware support for operations that are crucial for data science; this may include GPU collective operations and memory access coalescing.
Where covered: Arch2.

Hardware limitations for data bound computation
Core (K): Comprehend the hardware limitations in the context of Big Data and its most relevant V's (volume, velocity).
Advanced (C): Arch2: considerations of memory system bandwidth, latency hiding, network bandwidth, and direct interprocessor links (e.g., QPI, GPUDirect). ParProg: be able to use tools to identify data transfer bottlenecks.
Where covered: Systems; Arch2; ParProg.

Pressure imposed by data volume
Core (K): Know of the limitations of storage, memories, and filesystems when dealing with very large data volumes.
Advanced (C): Arch2: compare and contrast the limitations of different storage technologies. ParProg: be aware of the architectural aspects of large systems that affect structuring computation to enable load balancing, and of optimizations for throughput.
Where covered: Systems; Arch2; ParProg.

Pressure imposed by data velocity
Core (K): Know of the limitations in bandwidth, memory, and processing when handling very fast data streams, including networking and reliability issues.
Advanced (C): Arch2: explain the need to balance system resources to avoid bottlenecks in processing fast data streams. ParProg: explain the architectural aspects of data velocity issues: load balancing, jitter, and skew in coordinating flows of data.
Where covered: Systems; Arch2; ParProg.

Cache growth with scaling out
Core: N (not covered).
Advanced (K): Know the difference between scaling up and scaling out. As the number of nodes scales, total cache space increases, which can allow partitionable problems to scale more effectively (giving the appearance of super-linear speedup), and may impact inter-node communication.
Where covered: Arch2.

Cost of data movement across memory hierarchies
Core (C): Comprehend the difference in speed vs. cost (latency, energy) of accessing the different levels of the memory hierarchy, and of changing their sizes, in the context of multiple processors and hierarchies.
Advanced (C): Arch2: enumerate the factors affecting bandwidth, latency, and power. ParProg: describe the impact on synchronization and load balancing, and the tools for measurement; describe how increased core counts may be seen as a means to increase total cache space in a large system.
Where covered: Systems; Arch2; ParProg.